# hopcast_calibration_of_autoregressive_dynamics_models__51e1e4ea.pdf

Published in Transactions on Machine Learning Research (12/2025)

Hop Cast: Calibration of Autoregressive Dynamics Models

Muhammad Bilal Shahid belal@iastate.edu Department of Mechanical Engineering Iowa State University Cody Fleming flemingc@iastate.edu Department of Mechanical Engineering Iowa State University

Reviewed on Open Review: https: // openreview. net/ forum? id= ws O6nxv Gof

Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. Many of these models are optimized to predict one-step ahead; such approaches produce calibrated one-step predictions if the predictive model can quantify uncertainty, such as Deep Ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method to produce calibrated multi-step predictions. This work introduces an alternative Predictor Corrector approach named Hop Cast that uses Modern Hopfield Networks (MHN) to learn the errors of a deterministic Predictor that approximates the dynamical system. The Corrector predicts a set of errors for the Predictor s output based on a context state at any timestep during autoregression. The set of errors creates sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. The calibration and prediction performances are evaluated across a set of dynamical systems. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors. We also evaluate Hop Cast as a substitute for Deep Ensembles within a model-based reinforcement learning planner, demonstrating improved performance across multiple control tasks.

1 Introduction

Approximating dynamical systems that can be modeled using differential equations has many applications in systems biology (Wang et al., 2021; Brauer & Kribs, 2016), control (Moerland et al., 2023; Lu et al., 2021), economics (Tu, 2012), and cyber-physical systems (Bilal Shahid & Fleming, 2025; Shahid et al., 2024; Robison et al., 2024). Many deep learning approaches to approximate dynamical systems optimize the model to predict one-step ahead during training (Khansari-Zadeh & Billard, 2011; Coates et al., 2008; Ko et al., 2007). The multi-step predictions are generated via autoregression at inference time (Lu et al., 2021). In doing so, the model error accumulates over time, resulting in multi-step predictions diverging from the ground truth (Venkatraman et al., 2015; Janner et al., 2021). Hence, it is imperative to develop methods to generate accurate and calibrated multi-step predictions.

State-of-the-art methods for generating accurate and calibrated predictions include Probabilistic Ensembles (Deep Ensembles) (Lakshminarayanan et al., 2017). To approximate the dynamics typically associated with differential equations, these models are optimized to predict one-step ahead during training and then autoregressively generate multi-step predictions at inference time. To generate calibrated multi-step predictions, Probabilistic Ensembles need sound uncertainty propagation methods such as Trajectory Sampling (Chua et al., 2018a).

In contrast, we introduce a Predictor-Corrector mechanism to produce accurate and calibrated multi-step predictions for dynamical systems without the need for uncertainty propagation. The Predictor is optimized

Published in Transactions on Machine Learning Research (12/2025)

to predict a one-step ahead point estimate of the next state of the system during training and produce multistep predictions via autoregression at inference time. Our Corrector is based on the idea that similar states of a system lead to similar errors. This idea was leveraged by Auer et al. (2023) to build calibrated prediction intervals for one-step prediction utilizing Modern Hopfield Networks (MHN). However, the same state may not lead to the same error for multi-step predictions for dynamical systems. In such situations, the error also depends on how the system arrives at that state, i.e., where the system started and when it reached the state. To that end, we introduce the concept of context state, assuming that similar context states lead to similar errors. The Corrector, utilizing MHN, learns similarity between context states in terms of errors. At any timestep during autoregression, the Predictor predicts, and the Corrector generates prediction intervals for the prediction. To control the width of prediction intervals for calibration, we introduce the concept of Attention Span. We name our approach Hop Cast.

Our contributions are:

the introduction of the Predictor-Corrector mechanism, where the Predictor autoregressively generates the next state for a dynamical system, and the Corrector predicts a set of errors for the full forecasting horizon. The expected error corrects the prediction, while the set of errors generates the calibrated prediction intervals.

a method Hop Cast that has no assumptions about the Predictor except that it predicts a point estimate of the next state of the system. It is therefore a general framework that can provide multi-step calibrated predictions irrespective of the form or capability of the underlying Predictor.

lower calibration and prediction error across several benchmarks, without the use of complex uncertainty propagation techniques. Unlike uncertainty propagation methods, Hop Cast creates sharper prediction intervals based on context state instead of accumulating uncertainty over time during autoregression. We also test Hop Cast as a replacement for Probabilistic Ensembles in a model-based reinforcement learning planner, achieving strong performance on standard control tasks.

2 Related Work

Since we propose a mechanism to quantify the predictive uncertainty and generate calibrated prediction intervals, the related works include those from the uncertainty quantification (UQ) and model calibration. In UQ, there are two lines of work: Bayesian and Ensemble methods.

Bayesian Methods. In Bayesian models, a distribution over the parameters of the underlying model is learned, called posterior distribution. At inference time, multiple samples of parameters can be taken from the posterior distribution using an accurate sampling method, such as Hamiltonian Monte Carlo (HMC) (Betancourt, 2017). These samples serve as multiple predictive models and give us the desired diversity in the predictive variable of interest. A notable method from this domain is Bayes by Backprop (Blundell et al., 2015), which learns an approximate posterior distribution over the weights of a feedforward model via Variational Inference (VI) by optimizing the evidence lower bound (ELBO) (Kingma et al., 2015).

Ensemble Methods. In classical ensemble methods, an ensemble of models is trained with variations in the data for each model in the ensemble to train robust predictors (Davison & Hinkley, 1997). Deep Ensembles, however, are constructed using modern deep learning models with random initializations in the parameter space. These over-parametrized models, when initialized randomly, converge to different local minima, giving us the desired diversity in the predictive variable (Fort et al., 2019) These models were first presented by Lakshminarayanan et al. (2017) in the context of predictive uncertainty estimation. A notable extension of that is Neural Ensemble Search (Zaidi et al., 2021), which uses different architectures for models in an ensemble and improves upon deep ensembles. The Deep Ensemble is a scalable way of quantifying uncertainty compared to Bayesian methods (Wilson & Izmailov, 2020). Hence, these are used as baseline models with different uncertainty propagation methods in the current work.

Model Calibration. Calibration has been studied widely for classification models. In binary classification, Platt Scaling (Platt et al., 1999) and isotonic regression (Niculescu-Mizil & Caruana, 2005) have been used

Published in Transactions on Machine Learning Research (12/2025)

successfully for recalibration. There are extensions of such works to multi-class classification problems (Zadrozny & Elkan, 2002). In the context of regression, Gneiting & Raftery (2007) proposed several proper scoring rules to evaluate the calibration of a probabilistic model for continuous variables. Those scoring rules have been used in the literature as loss functions, for example, continuous ranked probability score (CRPS) (Gasthaus et al., 2019). Calibration has also been discussed in the literature on probabilistic forecasting, mainly in the context of meteorology (Gneiting & Raftery, 2005), resulting in specialized calibration systems (Raftery et al., 2005). An approach called calibrated regression was proposed by Kuleshov et al. (2018), which used isotonic regression to recalibrate Bayesian models. In contrast, our work turns a deterministic Predictor into a calibrated Predictor via a Predictor-Corrector mechanism. Also, we focus on Predictors that autoregressively generate the next state of a dynamical system, a setting rarely addressed in the model calibration literature to the best of our knowledge.

3 Problem Description

Consider a dynamical system described by a multivariate ordinary differential equation (ODE).

x(t) = dx(t)

dt = f(x(t)) (1)

where x(t) RD denotes the state of a D-dimensional system at time t and x(t) RD is its first order time derivative. The f(x(t)) specifies the vector-valued time derivative function. The state of a dynamical system at time t can be obtained by integrating the ODE as:

x(t) = x0 + Z t

0 f(x(τ))dτ (2)

The ODE is integrated forward in time starting from the initial condition x(0) = x0. We assume that the vector field f is unknown, but it can be learned based on observed data. The goal of a predictive model is to predict x(t) as accurately as possible, for as long as possible, that is, to minimize the distance between the prediction ˆx(t), and ground-truth trajectory, x(t). Additionally, in this work, for D predictions of each variable in x(t), we wish to create a set of models that correct such predictions, Mi(t) for i {1, . . . , D}.

The following exposition describes the data generation process and nomenclature for Predictors, Correctors, trajectory types, errors, and so forth. The method is general and can be used for any D-dimensional problem as in equation (1). For ease of exposition and without loss of generality, we select a two-dimensional ODE to describe the inner workings and preliminaries. The Lotka-Volterra (LV) system is used here and in section 4 for illustrative purposes; subsequent sections show experimental results for a variety of systems. LV dynamics can be written as:

dt = αx βxy (3)

dt = δxy γy (4)

The state of this system can be defined as: s = (x, y). We assume a set of ground-truth N trajectories of T timesteps IN = {τn}N n=1 where τn = {(sn t , sn t+1)t}T t=1 and sn t = (xn t , yn t ). This dataset can be generated by solving the equations with a numerical ODE solver such as solve_ivp from scipy with random initial conditions. To simulate measurement noise, we perturb each state post-integration with additive Gaussian noise, i.e., s = s + ϵ, where ϵ N(0, σ Σ) and Σ = diag(σx, σy). The σ is the scaling factor. We name IN an Integrated dataset. We assume that there is a Predictor B (a learned representation of vector-valued function f), trained on IN, that predicts st+1 1 given st, autoregressively generating a predicted set of N trajectories starting from the same initial conditions. Precisely, AN = { τn}N n=1 where τn = {( sn t , sn t+1)t}T t=1 and sn t = ( xn t , yn t ). We call AN an Autoregressive dataset. The error (en t ) is the difference between tth

1The overhead bar over a variable shows that it belongs to Autoregressive dataset.

Published in Transactions on Machine Learning Research (12/2025)

timestep of nth trajectory from Integrated and Autoregressive, that is, en t = (exn t , eyn t ) where en t = sn t sn t . Therefore, the set of errors of N trajectories is FN = {{en t }T t=1}N n=1.

For the LV example, we assume two correction models, Mx and My, corresponding to each output of the Predictor B. Mx and My will predict a set of errors E x = {εi xt+1}s i=1 and E y = {εi yt+1}s i=1 associated with tth timestep of a trajectory at inference time, respectively. The z% prediction interval, i.e., PIz xt+1 and PIz yt+1 for xt+1 and yt+1 at tth timestep can be generated as follows:

PIz xt+1 = xt+1 + [Qα(E x), Q1 α(E x)] (5)

PIz yt+1 = yt+1 + [Qα(E y), Q1 α(E y)] (6)

where α = 1 z

2 . For example, a 90% prediction interval corresponds to α = 0.05. The Qα denotes the αquantile. We generate prediction intervals for the outputs of Predictor B at tth timestep during autoregression solely based on the set of errors predicted by correction models (Mx and My), thereby avoiding uncertainty propagation.

The proposed methodology, Hop Cast, consists of a Predictor-Corrector mechanism shown in Fig. 1. At any timestep (t) during autoregression, the Predictor B produces a point forecast of the next state of the system ( xt+1, yt+1) given previous state ( xt, yt). The Corrector retrieves context-dependent errors (E x and E y) to correct the forecast and quantify uncertainty. In particular, the Corrector consists of correction models (Mx and My), one for each output of the Predictor B. Each correction model (M) consists of an Encoder (m) and an MHN (Ramsauer et al., 2020; Auer et al., 2023). The Encoder (m) is a fully connected feedforward model. Architectural details are included in Appendix E.1. We frame correction as a pattern retrieval task (Ramsauer et al., 2020). Precisely, Mx and My take in the context state ( x1, y1, xt, yt, t) as a pattern and outputs E x and E y, respectively. E x and E y contain errors associated with similar context states from the past. The context state is the input of Predictor B ( xt, yt) augmented with the initial condition ( x1, y1) and time step (t). The context state captures both local and trajectory-level structure, enabling the Corrector to retrieve errors specific to the current dynamical regime. The E x & E y are used to derive prediction intervals.

Figure 1: Predictor-Corrector mechanism

The remainder of this section consists of three subsections. Section 4.1 discusses the shuffling of IN, AN, & FN datasets to train correction models for pattern retrieval tasks. Section 4.2 then discusses the training of the correction model (Mx), and section 4.3 describes the inference with Mx where we generate the set of errors (E x) given the context state.

Published in Transactions on Machine Learning Research (12/2025)

4.1 Shuffled Data

The MHN in Hop Cast is trained as a pattern retrieval model rather than a sequence model. To this end, we break every trajectory into independent context error pairs and shuffle them, so the model learns to retrieve errors based on the similarity of context, rather than the temporal ordering of the data. We explain that below.

Shuffled Queries (SQ). Before training the correction model (Mx), the states of the system from Autoregressive (AN) dataset are augmented with context information, that is, initial condition and time step. Appendix E.2 emphasizes the importance of adding initial condition as a context. Specifically, the tth

timestep of nth trajectory τn from AN, i.e., ( xn t , yn t ), will be cn t = ( xn 1, yn 1 , xn t , yn t , t) after adding context information. The tuple ( xn 1, yn 1 ) is the initial condition of trajectory τn. The cn t denotes the context state associated with the tth timestep of nth trajectory. The set of all context states constructed from dataset AN is {{ cn t }T t=1}N n=1. This nested set of context states is flattened and shuffled randomly to form a new set

SQ = Shuffle

n=1 { cn t }T t=1

. Appendix E.7 highlights the importance of shuffling the data.

Shuffled Keys (SK). Likewise, we construct the context states {{cn t }T t=1}N n=1 for Integrated (IN) dataset as well, where cn t = (xn 1, yn 1 , xn t , yn t , t). Similar to SQ, we flatten the nested set {{cn t }T t=1}N n=1 and randomly

shuffle it to form SK = Shuffle

n=1 {cn t }T t=1

Shuffled Values (SVx & SVy). Given errors FN = {{en t }T t=1}N n=1 where en t = (exn t , eyn t ), FN can be split into two sets of errors F x N = {{exn t }T t=1}N n=1 and F y N = {{eyn t }T t=1}N n=1. Recall that we train separate correction models (Mx and My) for each output of Predictor B. F x N and F y N are flattened and shuffled to

form SVx = Shuffle

n=1 {exn t }T t=1

and SVy = Shuffle

n=1 {eyn t }T t=1

Once we have SQ, SK, SVx, SVy, they are split into train/test with an 80/20 split. To train Mx, we need dataset Dx = {(Q, K, Vx)i} comprising triplets (Q, K, Vx). Each triplet of Q, K, and Vx are constructed by sampling SL (Sequence Length) number of elements from sets STrain Q , STrain K , and STrain Vx , respectively. For instance, one triplet with SL = 5 is: Q = ( c3 4, c1 5, c5 7, c3 2, c7 1) , K = (c3 4, c1 5, c5 7, c3 2, c7 1) , and Vx = (ex3 4, ex1 5, ex5 7, ex3 2, ex7 1) . The error ex3 4 = x3 4 x3 4 denotes the error of Predictor B at 4th timestep of 3rd

trajectory. Likewise, we construct Dy = {(Q, K, Vy)i} using STrain Q , STrain K and STrain Vy . The 20% split with inference data is used for final evaluation.

4.2 Training

Figure 2 illustrates how the correction model (Mx) retrieves context-dependent errors using an Encoder (mx) and a Modern Hopfield Network (MHNx): (i) the mx maps each context state into a d-dimensional embedding, and (ii) the MHNx forms associations between embedded queries and keys and retrieves the corresponding error values. Specifically, the mx takes in as input the context state, such as c3 4, and outputs a d-dimensional embedding. We choose d = 4 in our experiments, though other values of d are expected to perform comparably [Appendix E.6]. We use the same mx to construct embedded queries Qψ and keys Kψ from Q and K, respectively. These appear as row/column labels in the Training block of Fig. 2. The MHNx uses an attention mechanism (Vaswani, 2017) to construct an association matrix (ASL SL) based on the similarity of elements in Qψ with elements in Kψ. Mathematically,

ASL SL = softmax

ˆVx = ASL SL.Vx (8)

SL Vx ˆVx 2 2 (9)

Published in Transactions on Machine Learning Research (12/2025)

The association matrix (ASL SL) is shown in the Training block of Fig. 2 with SL = 5. We mask the association from Autoregressive context state to its Integrated context state. This masking is reflected in the zeroed diagonal entries of the association matrix. The (i, j)th entry of ASL SL shows the similarity of ith element of Qψ with jth element of Kψ. We use softmax over ASL SL[i, :] to get association weights for ith element of Qψ. Therefore, each row of ASL SL sums up to one. The loss function L is used to learn the parameters of the mx via backpropagation, where the ˆVx denotes the predicted errors.

Figure 2: Training and Inference with Correction Model (Encoderx + MHNx).

4.3 Inference

At inference time, the correction model (Mx) retrieves past errors based on the similarity between the encoded query and the encoded keys stored in the MHNx Association memory. To build the Association memory, a set of K keys (Kmem) with the corresponding values (Vx mem) is sampled from STrain K and STrain Vx , respectively. For instance, a randomly sampled set of keys and values would be Kmem = (c3 4, c1 5, c5 7, , c2 5) and Vx mem = (ex3 4, ex1 5, ex5 7, , ex2 5), respectively. The trained Encoder mx is used to encode the Kmem as Kψ mem. The Kψ mem and V x mem constitute the Association memory of MHNx. It is shown in the Inference block in Fig. 2. We loaded 2000 keys in memory for evaluation [Appendix E.5]. Once the memory is set up, the correction model (Mx) takes in the context state ( x1, y1, xt, yt, t) as a query Qt at tth time step during autoregression and encodes it as Qψ t . The association weights (at x) at tth timestep are,

at x = softmax

Qψ t Kψ mem

Figure 3: An example set of errors (Ex) sampled from Association Memory at Timestep 106 during autoregression for the x-output of the Predictor for the Lorenz system, where the red dotted line denotes the ground truth error.

Published in Transactions on Machine Learning Research (12/2025)

where at x = (p1, , p K), PK i=1 pi = 1, and K denotes the number of keys in memory. The weights at x show how strongly the query Qψ t matches each entry in Kψ mem. A set of s categories is sampled as: {bi}s i=1 Categorical(at x). We select s = 1000 in our experiments [Appendix E.3]. A set of errors (E x) is selected from Vx mem based on sampled categories {bi}s i=1, i.e., E x = {V x mem[bi]}s i=1. A sampled set of errors E x is shown in Fig. 3 for the x-output of the Predictor B for the Lorenz system. An alternative approach to sampling from at x is to retrieve top-k probabilities. We provide an ablation study on this in Appendix E.4.

We have discussed the training of Mx and how to utilize it to obtain a set of errors (E x) for the x-output of the Predictor B at the inference time. To obtain E y by training My, we repeat the same procedure stated in section 4.2 and section 4.3 with dataset Dy, and this can be done for an arbitrary number of dimensions depending on the system dynamics.

5 Experiments

5.1 Evaluation Metrics

We propose three metrics to evaluate the quality of Hop Cast against the benchmarks.

Calibration Error (Kuleshov et al., 2018) (CE). A calibrated model has observed fractions (ˆp) of its predicted random variable match with the expected fractions (p). The difference between these two is calibration error. To evaluate CE, w(= 9) equally spaced prediction intervals are chosen from 10% to 90%. For each prediction interval, we count the number of times the observed variable falls between intervals. Mathematically, CE = Pw i=1(ˆpi pi)2.

Prediction Interval Width (Auer et al., 2023) (PI-Width). A calibrated model should also be sharp. Sometimes, it is trivial to reduce the CE by always predicting the expected value of a random variable (Kuleshov et al., 2018). A good predictor should be calibrated and sharp. Hence, we propose to use PIWidth as a measure of sharpness. Mathematically,

PI-Width = 1

j=1 |U i j Li j|, (11)

where w = 9 is for equally spaced PI, and v denotes the number of validation data points; i.e., v = |SInference Q |.

Mean Squared Error (MSE). MSE is used to evaluate the predictive accuracy of the proposed approach against the baselines.

5.2 Datasets

We have discussed one dynamical system, i.e., LV, in section 3. Other dynamical systems include Lorenz, Fitz Hugh-Nagumo (FHN), Lorenz95, and the Glycolytic Oscillator. To generate datasets, we randomly sample N initial conditions from within a specified range for the state variables of each system. The solve_ivp method from scipy is used to integrate the dynamics with the adaptive-step RK45 solver. To produce uniformly sampled trajectories, system states are extracted at fixed time intervals t as mentioned in Table 4. For Lorenz and LV, the dynamics of both system states and their derivatives are modeled, whereas the dynamics of states are modeled for the rest of the systems. The mathematical forms of each dynamical system, ranges of initial conditions, and parameter values are given in Appendix B.

5.3 Baselines

The baseline methods include three uncertainty propagation approaches, i.e., Expectation, Moment Matching, and Trajectory Sampling. These approaches are used to propagate uncertainty with Probabilistic Ensembles (Chua et al., 2018a). The details of Probabilistic Ensembles training and uncertainty propagation methods are presented in Appendix A.

Published in Transactions on Machine Learning Research (12/2025)

Overall Comparison As shown in Table 1, Hop Cast outperforms the baseline approaches in terms of CE and MSE in the vast majority of cases. It achieves the lowest average CE of 0.046 among all, with the secondbest approach being the Trajectory Sampling with CE of 0.075. In terms of MSE, Hop Cast outperforms in 11 out of 15 cases, showing the effectiveness of our Predictor-Corrector mechanism. The Expectation and Moment Matching do not perform well in terms of CE and MSE. These two approaches generate conservative intervals with average PI-Widths of 7.56 and 3.63, respectively, resulting in overconfident models. The overconfidence is evident in the calibration curves of Expectation and Moment Matching in Fig. 6. The average CE of these two approaches is 0.21 and 0.68, respectively, showing high miscalibration compared to Trajectory Sampling and Hop Cast. PI-Width is only useful when discussed in conjunction with CE as it is trivial to reduce the PI-Width while being overconfident and miscalibrated.

Table 1: Prediction Interval Widths (PI-Widths) and Calibration Error (CE) metrics of Hop Cast and baseline approaches with different scaling factor (σ) across various dynamical systems. The results are averaged across 3 runs of the same experiment. All of the CEs within 5% of the best one in a row are highlighted. The PI-Width is only used as a tie-breaker if there are different models with CEs within 5% of the best. In that case, the best-performing model based on CE is denoted with an asterisk and PI-Width is highlighted. For MSE, we simply highlight anything within 5% of the best one in a row.

Model Hop Cast

Probabilistic Ensemble (PE)

Expectation Moment Matching Trajectory Sampling

Metrics MSE PI-Width CE MSE PI-Width CE MSE PI-Width CE MSE PI-Width CE

Lotka Volterra

σ = 0.05 6.87 0.32 2.93 0.17 0.022 0.004 20.58 1.25 4.01 0.40 0.017 0.009 28.9 1.04 0.65 0.06 0.56 0.12 19.45 0.36 5.43 0.10 0.22 0.11

σ = 0.1 6.10 0.20 1.88 0.05 0.0070 0.002 24.1 0.15 0.83 0.02 0.47 0.17 23.9 0.04 0.96 0.04 0.41 0.14 21.01 0.27 5.87 0.59 0.18 0.10

σ = 0.3 9.69 0.23 3.19 0.12 0.0051 0.004 25.6 0.09 2.13 0.02 0.21 0.05 25.4 0.08 2.93 0.02 0.059 0.02 23.3 0.05 6.19 0.22 0.13 0.07

σ = 0.05 1126 7.97 28.76 0.74 0.011 0.006 1101.9 28.3 30.37 0.80 0.002 0.001 1991.3 4.03 4.45 0.009 1.23 0.07 1080.10 5.59 38.79 0.49 0.16 0.04

σ = 0.1 1248.17 15 38.66 0.38 0.019 0.03 1413.1 38.83 35.49 1.11 0.011 0.015 2579.7 48.7 8.78 0.01 1.21 0.08 1314.60 3.80 45.28 0.69 0.078 0.03

σ = 0.3 1681 46.36 40.44 0.91 0.019 0.004 1848.5 15.8 25.9 1.43 0.38 0.10 2098.5 18.08 23.13 0.06 0.58 0.30 1788.6 10.9 48.52 0.35 0.018 0.016

σ = 0.05 0.076 0.002 0.20 0.005 0.091 0.004 0.68 0.13 1.13 0.061 0.044 0.034 0.95 0.45 0.28 0.007 0.91 0.42 0.66 0.04 1.25 0.007 0.054 0.026

σ = 0.1 0.17 0.03 1.03 0.05 0.32 0.05 1.39 0.17 0.76 0.064 0.38 0.19 2.21 0.13 0.38 0.003 1.42 0.086 0.83 0.005 1.35 0.003 0.051 0.02

σ = 0.3 0.57 0.04 1.27 0.03 0.11 0.02 1.53 0.086 1.009 0.046 0.39 0.13 1.25 0.003 1.52 0.010 0.108 0.019 1.12 0.008 1.55 0.004 0.061 0.011

σ = 0.05 10.44 0.08 4.41 0.13 0.007 0.003 10.11 0.21 3.25 0.015 0.13 0.017 14.11 0.48 1.84 0.009 0.62 0.04 9.30 0.031 4.90 0.023 0.022 0.007

σ = 0.1 13.98 0.11 5.78 0.018 0.0028 0.007 13.31 0.20 3.29 0.05 0.27 0.025 17.38 1.14 3.24 0.03 0.35 0.06 11.63 0.11 5.44 0.04 0.004 0.003

σ = 0.3 16.44 0.66 6.08 0.06 0.0015 0.001 17.46 0.11 4.61 0.04 0.13 0.029 16.37 0.23 6.01 0.03 0.0036 0.003 15.13 0.075 6.49 0.047 0.009 0.006

Glycolytic Oscillator

σ = 0.05 0.03 0.001 0.20 0.007 0.043 0.018 0.10 0.006 0.24 0.02 0.07 0.04 0.13 0.019 0.056 0.002 1.35 0.32 0.09 0.004 0.36 0.006 0.06 0.026

σ = 0.1 0.072 0.003 0.25 0.007 0.018 0.006 0.12 0.005 0.21 0.01 0.10 0.03 0.15 0.005 0.08 0.004 0.99 0.18 0.10 0.001 0.41 0.01 0.07 0.03

σ = 0.3 0.11 0.001 0.37 0.008 0.016 0.003 0.18 0.001 0.23 0.01 0.62 0.28 0.25 0.002 0.21 0.003 0.54 0.07 0.17 0.005 0.46 0.003 0.015 0.01

Average - - 9.03 0.046 - 7.56 0.21 - 3.63 0.68 - 11.48 0.075

Attention Span Hop Cast lets us control the confidence of the model with precise control over the width of PI based on a concept we introduce called Attention Span. The Attention Span can be increased/decreased by increasing/decreasing the SL of Q and K in ASL SL. To demonstrate that, we collect a dataset H = {(xi, yi)}6000 i=1 using the sine function with heteroscedastic noise (Chua et al., 2018a) shown in equation 12 and Fig. 4(a). The xi s are uniformly sampled from the following domain [ 5π

2 , π] [π, 5π

(x, y) x, y + N 0, 0.2 sin 3

We use the same correction model (Mx) consisting of Encoder (mx) and MHNx with the same training and inference procedures described in section 4.2 and 4.3, respectively. The only exception is that the set of triplets {(Q, K, V)i} are constructed from dataset H. For instance, a triplet with SL = 4 is: Q =

Published in Transactions on Machine Learning Research (12/2025)

(x10, x13, x14, x1), K = (x10, x13, x14, x1), and V = (y10, y13, y14, y1). We trained two Mx models with SL of 3 & 8. At the inference time, the Mx is given a data point x = 3.28 as a query to be associated with the keys in memory based on similarity in terms of y. The keys from memory are sampled based on their association weights. The frequency plots of sampled keys for SL of 3 & 8 are in Fig. 4(b) & (c), respectively. For SL = 3, the sampled keys are clustered around 3.28 in Fig. 4(b), which are the most similar keys to the query x = 3.28. As we increase SL to 8, the Mx expands its Attention Span by picking other keys from memory around x = 3.2 & 6.5 that have nearly similar y values as shown in Fig. 4(c).

Figure 4: (a) Upper center: Sine function with heteroscedastic noise (b) Lower left: Attention Span for x = 3.28 as a query with SL = 3 (c) Lower right: Attention Span for x = 3.28 as a query with SL = 8

Calibration Performance The reason Hop Cast performs well in terms of CE compared to baselines is owing to the SL that can be increased/decreased to increase/decrease the diversity of sampled errors E x from MHN memory. As we increase the SL, the Mx starts to expand its Attention Span by associating other keys from memory (Kψ men) to the query (Qψ t ) based on their similarity in terms of errors, effectively controlling the width of prediction intervals. We train separate correction models (e.g., Mx and My) for each output of the Predictor and tune their SL separately. The SL hyperparameter and number of models in a Probabilistic Ensemble for each setting are shown in Appendix C. For the baseline approaches, we tune the number of models in an ensemble. In general, we observed that the Probabilistic Ensembles tend to increase the uncertainty (or widen their prediction intervals) as we increase the number of models, and the uncertainty plateaus at some point. Therefore, if the uncertainty plateaus and the model is overconfident, adding more models to the ensemble will not make it a calibrated model. This was observed in the calibration curves of Expectation and Moment Matching in Fig. 6. Trajectory Sampling, in contrast, provided calibrated uncertainty with few models in the ensemble (i.e., 3 or 4) in most cases. It is owing to its nature of taking predicted uncertainty into account during propagation.

Predictive Performance Hop Cast outperforms in most cases in terms of MSE because we proposed a Corrector that predicts a set of errors, and prediction intervals are a side effect of the approach. Also, our model generates a smaller PI-width in most cases because we generate prediction intervals based on the context state (x0, y0, xt, yt, t) at any timestep during autoregression. On the other hand, uncertainty propagation approaches generate intervals based on uncertainty accumulated due to propagation up to a certain time step (t = t ) starting from t = 1. A manifestation of that is shown in Fig. 5 (b) & (d). For Fig. 5(a), the output of Predictor B is shown in the dark, and the Mean indicates the prediction after adding the expected error (Ps i=1 at x[bi]Ex[i]) from the Corrector. For Fig. 5(b), (c), & (d), the Mean indicates the expected prediction from all models in a Probabilistic ensemble. Fig. 5(a) shows that Hop Cast reduces its PI-Width around time step 270 due to smaller error, resulting in sharper intervals. On the other hand, Expectation and Trajectory Sampling have wider PI due to accumulated uncertainty up to that time step, even though the Mean is close to the ground truth.

Published in Transactions on Machine Learning Research (12/2025)

Figure 5: (a). Upper left: Hop Cast (b). Upper right: Expectation (c). Lower left: Moment Matching (d). Lower right: Trajectory Sampling. A comparison of prediction intervals generated by Hop Cast and three uncertainty propagation approaches. The x-axis denotes the time steps, and the y-axis shows the x output of the Lorenz system. The Upper PI and Lower PI show the m = 9 equally spaced prediction intervals from 10% to 90%.

Figure 6: The calibration curves of Hop Cast and three uncertainty propagation methods for Glycolytic Oscillator at σ = 0.3. The dark dotted line corresponds to the perfect calibration. The calibration curves are shown separately for seven states of the system.

Trajectory Sampling vs. Counterparts On average, Trajectory Sampling performs better than its counterpart baselines Expectation and Moment Matching in terms of CE, owing to how it propagates uncertainty. It takes into account the predicted uncertainty of the model while propagating uncertainty from time step t to t+1. It samples the state for the next step from the predicted Multivariate Normal distribution rather than propagating just the predicted mean like the Expectation. Moment Matching also samples the next state from the predicted Multivariate Normal distribution, just like Trajectory Sampling. However, it recasts prediction from all models in an ensemble at every timestep (t) to a Multivariate Normal distribution and then resamples next states before propagation. This results in overly conservative PI, as shown in Fig. 5(c), and evidenced by the PI-Width of 3.63 in Table 1.

Published in Transactions on Machine Learning Research (12/2025)

7 Hyperparameter Tuning

SL (Sequence Length) is a hyperparameter that must be tuned for calibrated uncertainty. We propose Algorithm 1 to tune it based on the intuition that uncertainty typically increases with larger SL (Section 6; Fig. 7). A small value (e.g., 10) produces overconfident estimates, while a large value (e.g., 1000) leads to underconfidence. Intermediate values such as SL = 500 yield near-perfect calibration, whereas SL = 400 and SL = 600 show overconfidence and underconfidence, respectively. We iteratively refine SL with the calibration curve and typically converge to an optimal setting within a few trials. Appendix E.9 provides additional experiments. The number of models in an ensemble is a hyperparameter. As discussed in the section 6, the ensembles tend to build up the uncertainty until it plateaus [Appendix E.8]. We keep adding models to the ensemble until it provides calibrated uncertainty. In some cases, the uncertainty saturates while the model is overconfident. In those cases, the number of optimal models is very high, e.g., 8 or 7. Appendix F contains additional details about hyperparameters.

Figure 7: Calibration curves of the youtput of Lorenz for different SL.

Algorithm 1 SL Tuning for Calibration

1: Initialize Optimal SL True 2: Pick a large SL & a small SL 3: Pick an SL between large and small SL 4: while Optimal SL do 5: Three possibilities on Calibration Curve 6: if Calibrated Uncertainty then 7: Optimal SL False 8: else if Underconfidence then 9: Decrease SL 10: else if Overconfidence then 11: Increase SL 12: end if 13: end while

8 Model-Based Reinforcement Learning

We compare the performance of Hop Cast and Probabilistic Ensembles as the uncertainty-aware dynamics models within a model-based reinforcement learning (RL) algorithm. PETS (Probabilistic Ensembles with Trajectory Sampling) is a widely used model-based RL algorithm that relies on Probabilistic Ensembles to model the transition dynamics and propagate uncertainty for planning via Model Predictive Control (MPC) (Chua et al., 2018b). In PETS, the planning quality is heavily influenced by the accuracy and calibration of the dynamics model, model bias compounds rapidly during long-horizon rollouts, and miscalibrated dynamics models lead the planner to choose overly optimistic or overly conservative action sequences.

Our earlier experiments in Table 1 demonstrate that Hop Cast produces more accurate and better-calibrated multi-step predictions than Probabilistic Ensembles, which utilize different uncertainty propagation methods, i.e., Expectation (E) A.1, Moment Matching (MM) A.3, and Trajectory Sampling (TS) A.2. In Table 2, we show the performance of Hop Cast on four standard control tasks when it is used as a drop-in replacement for the Probabilistic Ensembles in the PETS algorithm. The results for baselines, i.e., Probabilistic Ensembles & Deterministic Ensembles, with three uncertainty propagation methods, i.e., Expectation, Moment Matching, & Trajectory Sampling, are taken from Chua et al. (2018b) 2. The Hop Cast yields comparable performance

2Code available at: https://github.com/kchua/handful-of-trials

Published in Transactions on Machine Learning Research (12/2025)

to Probabilistic Ensembles on Cartpole and 7-DOF Reacher, while it outperforms on 7-DOF Pusher and Half Cheetah tasks. The Probabilistic Ensembles, which utilize uncertainty propagation methods, outperform on Cartpole and 7-DOF Reacher, showing their competitiveness against Hop Cast. However, the Deterministic Ensembles exhibit competitive performance in a relatively low-dimensional setting, such as Cartpole, but show performance degradation on high-dimensional tasks. These experiments demonstrate the efficacy of our proposed methodology Hop Cast within a model-based RL algorithm on various control tasks.

Table 2: The performance of different tasks in terms of rewards with different dynamics models, i.e., Probabilistic Ensembles (PE) & Deterministic Ensembles (DE), and various uncertainty propagation methods, i.e., Expectation (E), Moment Matching (MM), & Trajectory Sampling (TS). The Hop Cast shows the results with our proposed methodology as a drop-in replacement of PE within the PETS planning algorithm. The best results are shown in bold.

Hop Cast PE-E PE-MM PE-TS DE-E DE-MM DE-TS

Cartpole 181 2.3 180 181 183 179 177 181

7-DOF Pusher 45 4.3 48 46 46 95 97 93

7-DOF Reacher 44 3.4 44 45 43 93 94 96

Half-cheetah 7170 40 5700 200 7100 3800 190 3950

9 Conclusions

We proposed a Predictor-Corrector mechanism for autoregressive dynamics models to correct the Predictors predictions and generate calibrated prediction intervals for it at any timestep during autoregression. Hop Cast performs competitively well against the alternative approach (i.e., uncertainty propagation), giving accurate and calibrated autoregressive dynamics models. Out of three uncertainty propagation approaches, i.e., Trajectory Sampling, Moment Matching, and Expectation, Trajectory Sampling performs competitively well against ours. This is due to its nature of taking predicted uncertainty into account during propagation. Hop Cast offers a lower calibration and prediction error with sharper prediction intervals. The sharper prediction intervals result directly from modeling context-specific errors, and the lower prediction error is the consequence of modeling errors in the form of a Corrector. In addition, we introduce a concept called Attention Span that gives precise control over the width of prediction intervals with a hyperparameter we introduce called Sequence Length (SL). We also deploy Hop Cast within a model-based reinforcement learning planner as an alternative to Probabilistic Ensembles, demonstrating its competitive performance on diverse control tasks.

Limitations and Future Directions One of the limitations of our work is due to the use of explicit timestep IDs (i.e., t =1,2,3, etc.) in the context state (cn t ). At inference time, the Corrector will only generate calibrated prediction intervals for a trajectory length equal to or less than what it was trained on. Moreover, the Corrector was trained on trajectories sampled at a fixed time interval t and does not generalize to other sampling intervals. A modified version of the Corrector could be made discretization-invariant. We hand-tuned the SL hyperparameter, training multiple models to get a lower calibration error. An alternative approach to obtaining calibrated uncertainty with Hop Cast without tuning the SL hyperparameter is an interesting area for future work.

10 Acknowledgements

This work was partially supported by the National Aeronautics and Space Administration, USA, under grant 80NSSC24CA037.

Published in Transactions on Machine Learning Research (12/2025)

Andreas Auer, Martin Gauch, Daniel Klotz, and Sepp Hochreiter. Conformal prediction for time series with modern hopfield networks. Advances in Neural Information Processing Systems, 36:56027 56074, 2023.

Michael Betancourt. A conceptual introduction to hamiltonian monte carlo. ar Xiv preprint ar Xiv:1701.02434, 2017.

Muhammad Bilal Shahid and Cody Fleming. Toward robust car-following dynamics modeling via blackbox models: Methodology, analysis, and recommendations. Transportation Research Record, 2679(4):114 128, 2025.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pp. 1613 1622. PMLR, 2015.

Fred Brauer and Christopher Kribs. Dynamical systems for biological modeling. CRC press, 2016.

Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the national academy of sciences, 113 (15):3932 3937, 2016.

Kurtland Chua, Roberto Calandra, Rowan Mc Allister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018a.

Kurtland Chua, Roberto Calandra, Rowan Mc Allister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018b. URL https://arxiv.org/abs/1805.12114.

Adam Coates, Pieter Abbeel, and Andrew Y Ng. Learning for control from multiple demonstrations. In Proceedings of the 25th international conference on Machine learning, pp. 144 151, 2008.

Bryan C Daniels and Ilya Nemenman. Efficient inference of parsimonious phenomenological models of cellular dynamics using s-systems and alternating regression. Plo S one, 10(3):e0119821, 2015.

Anthony Christopher Davison and David Victor Hinkley. Bootstrap methods and their application. Number 1. Cambridge university press, 1997.

Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. ar Xiv preprint ar Xiv:1912.02757, 2019.

Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic forecasting with spline quantile function rnns. In The 22nd international conference on artificial intelligence and statistics, pp. 1901 1910. PMLR, 2019.

Tilmann Gneiting and Adrian E Raftery. Weather forecasting with ensemble methods. Science, 310(5746): 248 249, 2005.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007.

E. M. Izhikevich and R. Fitz Hugh. Fitz Hugh-Nagumo model. Scholarpedia, 1(9):1349, 2006. doi: 10.4249/ scholarpedia.1349. revision #123664.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273 1286, 2021.

S Mohammad Khansari-Zadeh and Aude Billard. Learning stable nonlinear dynamical systems with gaussian mixture models. IEEE Transactions on Robotics, 27(5):943 957, 2011.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https: //arxiv.org/abs/1412.6980.

Published in Transactions on Machine Learning Research (12/2025)

Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28, 2015.

Jonathan Ko, Daniel J Klein, Dieter Fox, and Dirk Haehnel. Gp-ukf: Unscented kalman filters with gaussian process prediction and observation models. In 2007 IEEE/RSJ international conference on intelligent robots and systems, pp. 1901 1907. IEEE, 2007.

Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In International conference on machine learning, pp. 2796 2804. PMLR, 2018.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.

Edward N Lorenz. Predictability: A problem partly solved. In Proc. Seminar on predictability, volume 1. Reading, 1996.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/ abs/1711.05101.

Cong Lu, Philip J Ball, Jack Parker-Holder, Michael A Osborne, and Stephen J Roberts. Revisiting design choices in offline model-based reinforcement learning. ar Xiv preprint ar Xiv:2110.04135, 2021.

Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1):1 118, 2023.

Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pp. 625 632, 2005.

John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61 74, 1999.

Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155 1174, 2005. doi: 10.1175/ MWR2906.1. URL https://journals.ametsoc.org/view/journals/mwre/133/5/mwr2906.1.xml.

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. ar Xiv preprint ar Xiv:2008.02217, 2020.

Robert Robison, Tom Shafer, Victor Diloreto, Natalia M Alexandrov, Muhammad Bilal Shahid, and Cody H Fleming. Differential equation approximation using gradient-boosted quantile regression. In AIAA SCITECH 2024 Forum, pp. 0107, 2024.

Muhammad Bilal Shahid, Robert Robison, Tom Shafer, Victor Diloreto, Natalia M Alexandrov, and Cody H Fleming. Uncertainty quantification using deep ensembles for decision making in cyber-physical systems. In AIAA SCITECH 2024 Forum, pp. 0108, 2024.

Pierre NV Tu. Dynamical systems: an introduction with applications in economics and biology. Springer Science & Business Media, 2012.

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Arun Venkatraman, Martial Hebert, and J Bagnell. Improving multi-step prediction of learned time series models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Ying Wang, Bo Hu, Yuxue Zhao, Guofang Kuang, Yaling Zhao, Qingwei Liu, and Xiuli Zhu. Applications of system dynamics models in chronic disease prevention: a systematic review. Preventing Chronic Disease, 18:E103, 2021.

Peter J Wangersky. Lotka-volterra population models. Annual Review of Ecology and Systematics, 9:189 218, 1978.

Published in Transactions on Machine Learning Research (12/2025)

Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697 4708, 2020.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive approximation, 26(2):289 315, 2007.

Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694 699, 2002.

Sheheryar Zaidi, Arber Zela, Thomas Elsken, Chris C Holmes, Frank Hutter, and Yee Teh. Neural ensemble search for uncertainty estimation and dataset shift. Advances in Neural Information Processing Systems, 34:7898 7911, 2021.

Published in Transactions on Machine Learning Research (12/2025)

A Baseline Methods

This section will discuss how we propagate uncertainty with Probabilistic Ensembles using three uncertainty propagation methods, i.e., Expectation. Moment Matching and Trajectory Sampling. These will be discussed with reference to the Lotka-Volterra (LV) system, whose state at time step (t + 1) is st+1 = (xt+1, yt+1). We will see how to generate z% prediction intervals associated with st+1 (PIz st+1) at any timestep during autoregression with these approaches.

Probabilistic Ensembles To construct Probabilistic Ensembles, we train a population of fully connected feedforward models, i.e., Ω= {fi}i, where fi represents the ith model within the population. Each model is trained with Gaussian negative log-likelihood (NLL) loss as demonstrated in Lakshminarayanan et al. (2017). Each model predicts a Multivariate Normal (MVN) over the next state st+1 given the previous state of the system st. Formally, st+1 MVN(µt+1, Σt+1). For the LV system, µt+1 = (µx t+1, µy t+1) and Σt+1 = diag(σx t+1, σy t+1).

Uncertainty Propagation At inference time, a sample of M models {fm}M m=1 can be taken from the population (Ω) to form a Probabilistic Ensemble. All three propagation approaches use a particle-based approach, where the core idea is to generate multiple trajectories via autoregression starting from the same initial condition (Chua et al., 2018a). Let sp,m t denote the state of the system associated with pth particle from mth model at tth time step. Each model fm receives P copies of the same initial condition {sp,m t=0}P p=1, where P denotes the number of particles assigned to model fm. The propagation of uncertainty starts with P M copies of the same initial conditions across {fm}M m=1 models and results in P M number of trajectories. The diversity in these trajectories will come from two sources, i.e., M different models and their probabilistic predictions. These diverse trajectories will be used to derive prediction intervals. In contrast, Hop Cast used a set of errors (e.g., E x & E y) to derive prediction intervals.

The rest of the section discusses the algorithmic details of three propagation approaches and the construction of prediction intervals using the outputs of three algorithms.

A.1 Expectation

The Expectation method always uses P = 1 particle for each model fm. This is because the predicted uncertainty (Σp,m t+1) is ignored, and the predicted mean (µp,m t+1) is propagated to the next time step. Therefore, assigning more than one particle to a model fm will not generate diverse trajectories from that model. Even though this approach ignores predicted uncertainty (Σp,m t+1) during propagation, it will be used to generate prediction intervals as explained in section A.3. The overall procedure of this approach is in Algorithm 2.

Algorithm 2 Expectation

1: Input: Initial states {s1,m t=0}M m=1, models f1, . . . , f M, timesteps T 2: for t = 0 to T 1 do 3: for m = 1 to M do 4: Predict µ1,m t+1, Σ1,m t+1 fm(s1,m t )

5: s1,m t+1 = µ1,m t+1 Propagate mean 6: end for 7: {s1,m t+1}M m=1 inputs to the next timestep 8: end for 9: return {s1,m 1:T }M m=1 the 1 M number of trajectories of length T

A.2 Trajectory Sampling

This method uses P particles for each model fm since it considers predicted uncertainty (Σp,m t+1) while propagating uncertainty to the next step. The overall procedure of this algorithm is summarized in Algorithm 3.

Published in Transactions on Machine Learning Research (12/2025)

Algorithm 3 Trajectory Sampling

1: Input: Initial states {{sp,m t=0}P p=1}M m=1, models f1, . . . , f M, timesteps T, particles P 2: for t = 0 to T 1 do 3: for m = 1 to M do 4: for p = 1 to P do 5: Predict µp,m t+1, Σp,m t+1 fm(sp,m t )

6: Sample sp,m t+1 MVN(µp,m t+1, Σp,m t+1) considers predicted uncertainty 7: end for 8: end for 9: {{sp,m t+1}P p=1}M m=1 inputs to the next timestep 10: end for 11: return {{sp,m 1:T }P p=1}M m=1 the P M number of trajectories of length T

A.3 Moment Matching

This method uses P particles for each model m since it considers predicted uncertainty (Σp,m t ) while propagating uncertainty to the next step. This is different from Trajectory Sampling in that it fits a Gaussian distribution to P M predictions {{sp,m t+1}P p=1}M m=1 at tth timestep, that is, MVN(µt+1, Σt+1). We assume independence between all P M predictions. A set of P M samples {{sp,m t+1}P p=1}M m=1 is taken from that distribution which becomes input at the next time step. The overall procedure is summarized in Algorithm 4.

Algorithm 4 Moment Matching

1: Input: Initial states {{sp,m 0 }P p=1}M m=1, models f1, . . . , f M, timesteps T, particles P 2: for t = 0 to T 1 do 3: for m = 1 to M do 4: for p = 1 to P do

5: Predict µp,m t+1, Σp,m t+1 fm(sp,m t ) 6: Sample sp,m t+1 MVN(µp,m t+1, Σp,m t+1) 7: end for 8: end for 9: {{sp,m t+1}P p=1}M m=1 fit a Gaussian distribution

10: µt+1 1 P M PM m=1 PP p=1 sp,m t+1 Mean

11: Σt+1 1 P M PM m=1 PP p=1(sp,m t+1 µt+1)(sp,m t+1 µt+1)T Variance

12: {{sp,m t+1}P p=1}M m=1 MVN(µt+1, Σt+1) sample inputs for the next timestep

13: end for 14: return {{sp,m 1:T }P p=1}M m=1 the P M number of trajectories of length T

Prediction intervals using uncertainty propagation All three approaches output P M number of trajectories of length T {{sp,m 1:T }P p=1}M m=1. At any time step t during propagation, the (PIz st+1) are generated using

{{sp,m t }P p=1}M m=1. Specifically, we construct a MVNnet whose parameters are: µnet = 1 P M PM m=1 PP p=1 sp,m t and Σnet = 1 P M PM m=1 PP p=1 Σp,m t + (sp,m t µnet)(sp,m t µnet)T . For LV system, the µnet = (µx net, µy net) and Σnet = diag(σx net, σy net). The prediction intervals for x and y are derived as follows:

PIz xt+1 = µx net Φ 1(1 α)σx net (13)

PIz yt+1 = µy net Φ 1(1 α)σy net (14)

where α = 1 z

2 for z% prediction intervals. The Φ 1 denotes the inverse cumulative distribution function of the Standard Normal distribution. In contrast, equations 5 & 6 were used to generate prediction intervals with Hop Cast.

Published in Transactions on Machine Learning Research (12/2025)

B Dynamical Systems

There are five dynamical systems studied in this paper. One of those is Lotka-Volterra (LV) equations, which are already discussed in section 3 along with its closed form, parameters, and ranges of initial conditions. Here, we repeat the description of the LV system and then describe the rest of the systems. We add zeromean additive Gaussian noise to each differential equation post-integration as discussed in section 3, with the variance equal to the standard deviation of the variable modelled by the differential equation scaled by a factor σ.

B.1 Lotka-Volterra (Wangersky, 1978)

dt = αx βxy (15)

dt = δxy γy (16)

Parameters: α = 1.1; β = 0.4; γ = 0.4; δ = 0.1 Initial Condition Ranges: x [5, 20]; y [5, 10]

B.2 Lorenz (Brunton et al., 2016)

dt = σ(y x) (17)

dt = x(ρ z) y (18)

dt = xy βz (19)

Parameters: σ = 10; ρ = 28; β = 8

3 Initial Condition Ranges: x [ 20, 20]; y [ 20, 20]; z [0, 50]

B.3 Fitz Hugh-Nagumo (FHN) (Izhikevich & Fitz Hugh, 2006)

3 w + I (20)

dt = ϵ(v + a bw) (21)

Parameters: a = 0.7;b = 0.8;ϵ = 0.08;I = 0.5 Initial Condition Ranges: v [ 1.5, 1.5];w [ 1.5, 1.5]

B.4 Lorenz95 (Lorenz, 1996)

dt = (X2 X5)X4 X1 + F (22)

dt = (X3 X1)X5 X2 + F (23)

dt = (X4 X2)X1 X3 + F (24)

dt = (X5 X3)X2 X4 + F (25)

dt = (X1 X4)X3 X5 + F (26)

Parameters: F = 8 Initial Condition Ranges: Xi [ 10.5, 10.5] where i {1, 2, 3, 4, 5}

Published in Transactions on Machine Learning Research (12/2025)

B.5 Glycolytic Oscillator (Daniels & Nemenman, 2015)

dt = J0 k1S1S6 1 + (S6/K1)q (27)

dt = 2 k1S1S6 1 + (S6/K1)q k2S2(N S5) k6S2S5 (28)

dt = k2S2(N S5) k3S3(A S6) (29)

dt = k3S3(A S6) k4S4S5 κ(S4 S7) (30)

dt = k2S2(N S5) k4S4S5 k6S2S5 (31)

dt = 2 k1S1S6 1 + (S6/K1)q + 2k3S3(A S6) k5S6 (32)

dt = ψκ(S4 S7) k S7 (33)

Parameters: J0 = 2.5; k1 = 100; k2 = 6; k3 = 16; k4 = 100; k5 = 1.28; k6 = 12; k = 1.8; κ = 13; q = 4; K1 = 0.52; ψ = 0.1; N = 1; A = 4 Initial Condition Ranges: S1 [0.15, 1.60]; S2 [0.19, 2.16]; S3 [0.04, 0.20]; S4 [0.10, 0.35]; S5 [0.08, 0.30]; S6 [0.14, 2.67]; S7 [0.05, 0.10]

C Hyperparameters

In Table 3, we provide the Sequence Length (SL) for each output of the Predictor and a number of models in Probabilistic Ensembles for each experimental setting in Table 1.

D Data Generation

Table 4: Details about data generation

Model t Timesteps Trajectories

Lotka Volterra 0.1 300 500

Lorenz 0.01 300 1000

FHN 0.5 400 350

Lorenz95 0.01 300 666

Glycolytic 0.01 400 750

E Ablation Studies

E.1 Encoder Architecture

As discussed in section 4, each correction model utilizes an Encoder with MHN. The architecture of the Encoder (e.g., mx) is a fully connected feedforward network with one layer and 100 neurons for all experiments. Here, we show the impact of adding more layers of fully connected neurons to the Encoder on the proposed metrics. Table 5 shows that adding more layers doesn t significantly impact the MSE and CE. However, the MSE and CE increase considerably when the Encoder is removed from the correction models.

Published in Transactions on Machine Learning Research (12/2025)

Table 3: Sequence lengths (SL) for all outputs of Predictor and number of models in Probabilistic ensemble across various dynamical systems and noise levels

Model Hop Cast Probabilistic Ensemble

Expectation Moment Matching

Trajectory Sampling

Hyperparameter Sequence lengths Models Models Models

Lotka Volterra

σ = 0.05 30,30,50,60 4 5 3

σ = 0.1 70,30,70,40 5 5 3

σ = 0.3 70,20,60,70 6 5 4

σ = 0.05 30,35,35,35,50,50,100 6 8 3

σ = 0.1 500,500,500,300,300,300 7 7 4

σ = 0.3 25,25,25,500,500,500 8 7 3

σ = 0.05 35,55 5 5 3

σ = 0.1 15,15 4 5 3

σ = 0.3 15,15 5 4 3

σ = 0.05 2200 for all 6 7 3

σ = 0.1 2000 for all 6 5 4

σ = 0.3 2000 for all 7 6 4

Glycolytic Oscillator

σ = 0.05 35,50,35,100,35,50,30 8 6 3

σ = 0.1 50,50,35,35,800,35,100 6 7 3

σ = 0.3 50 for all 7 8 3

Table 5: Effect of different Encoding network architectures on the proposed metrics. The results without the Encoder are also included. The FC(100)1 means the fully connected feedforward model with one layer of 100 neurons. The results are reported over three runs of the same experiment at random seeds.

Dynamical System Lorenz Lotka Volterra

Noise Level σ = 0.3 σ = 0.3

Metrics MSE PI-Width CE MSE PI-Width CE

Without Encoder 1907.07 9.33 38.69 0.42 0.06 0.005 16.12 0.36 3.37 0.23 0.025 0.012

FC(100)1 1681 46.36 40.44 0.91 0.019 0.004 9.69 0.23 3.19 0.12 0.0051 0.004

FC(100)2 1717.362 30.3 41.84 0.68 0.014 0.002 9.43 0.23 3.08 0.09 0.006 0.002

FC(100)3 1710.18 46.19 42.45 0.55 0.018 0.002 9.37 0.007 2.98 0.07 0.0035 0.001

E.2 Initial condition and context state (cn t )

As discussed in section 1, our context state (cn t ) comprises the initial condition, the current state of the system, and the timestep ID. Here, we show the results on two systems without the initial condition a part of the context state in Table 6. The MSE increases significantly for both cases, while the CE exhibits minor changes. This shows the significance of making the initial condition a part of the context state.

Published in Transactions on Machine Learning Research (12/2025)

Table 6: Comparison of proposed metrics with and without initial condition included in the context state. The results are averaged over three runs.

Dynamical System

Glycolytic Oscillator

Lotka Volterra

Noise Level σ = 0.1 σ = 0.1

Metrics MSE PI-Width CE MSE PI-Width CE

With IC 0.072 0.003 0.25 0.007 0.018 0.006 6.10 0.20 1.88 0.05 0.0070 0.002

Without IC 0.176 0.003 0.416 0.003 0.027 0.008 43.90 1.09 5.17 0.14 0.011 0.002

E.3 Number of Retrieved Samples (s)

To generate calibrated prediction intervals, a set of s samples is drawn from at x as described in section 4.3. Here, we analyze the sensitivity of proposed metrics with varying values of s = {100, 500, 800, 1000} for two settings from Table 1. The results everywhere else in the paper are reported with s = 1000. In Table 7, it can be seen that the results are largely unchanged as the number of samples varies from 1000 to 500. However, at s = 100, we see a drop in performance in terms of MSE and CE, suggesting that too few samples limit the diversity of error samples, negatively impacting the performance.

Table 7: Comparison of proposed metrics with varying number of retrieved residual samples (s) from MHN memory. The results are averaged over three runs.

Dynamical System

Glycolytic Oscillator

Lotka Volterra

Noise Level σ = 0.1 σ = 0.1

Metrics MSE PI-Width CE MSE PI-Width CE

1000 0.072 0.003 0.25 0.007 0.018 0.006 6.10 0.20 1.88 0.05 0.0070 0.002

800 0.070 0.005 0.24 0.009 0.018 0.008 6.10 0.2 1.89 0.03 0.007 0.003

500 0.071 0.005 0.26 0.003 0.015 0.008 6.13 0.20 1.87 0.04 0.006 0.005

100 0.096 0.005 0.18 0.006 0.025 0.008 7.9 0.17 0.69 0.04 0.019 0.004

E.4 Top-k retrieval

As discussed in section 4.3, the s samples are drawn from at x to generate calibrated prediction intervals. Here, we analyze an alternative approach to sampling, i.e., using top-k probabilities from at x. Table 8 shows the results on proposed metrics with varying values of k = {100, 500, 800, 1000}. The results remain largely unchanged for k 500. At s = 100, the CE shows an increase for both settings, indicating that too few samples lead to decreased diversity in sampled errors.

E.5 Size of Association Memory

In section 4.3, we discussed the inference with the correction model (Mx). A set of K keys (Kmem) with the corresponding values (Vx mem) is randomly sampled from STrain K and STrain Vx , respectively and loaded into the Association memory. Here, the impact of the size of Association memory on the proposed metrics will be discussed. Table 9 shows the variation in proposed metrics as the number of keys in memory varies from 10 to 2000. For all systems, the MSE and CE improve significantly as the keys increase from 10 to 50. The MSE and CE that showed a considerable change as the number of keys increases are shown in bold. As the

Published in Transactions on Machine Learning Research (12/2025)

Table 8: Comparison of proposed metrics based on top-k probabilities from ax t instead of sampling. The results are reported for k = {100, 500, 800, 1000}. The results are averaged over three runs.

Dynamical System

Glycolytic Oscillator

Lotka Volterra

Noise Level σ = 0.1 σ = 0.1

Metrics MSE PI-Width CE MSE PI-Width CE

1000 0.67 0.002 0.24 0.007 0.011 0.005 7.31 0.99 2.38 0.33 0.016 0.003

800 0.067 0.002 0.24 0.01 0.011 0.005 7.52 1.18 2.41 0.38 0.014 0.004

500 0.069 0.003 0.24 0.009 0.017 0.005 8.39 2.50 2.53 0.64 0.017 0.008

100 0.07 0.004 0.20 0.005 0.06 0.03 6.36 0.22 1.24 0.11 0.25 0.036

keys increase beyond 50, not all systems MSE and CE showed a considerable increase. The MSE and CE show insignificant changes as the number of keys increases beyond 100 until 2000. Thus, we fix the number of keys to 2000 for our final evaluation of results.

Table 9: The results on proposed metrics with varying numbers of keys in memory. The MSE and CE that showed a considerable improvement with the increase in the number of keys in memory are shown in bold. The results are averaged over three runs.

Keys in Memory Metric Lorenz Glycolytic Oscillator

Lotka Volterra FHN Lorenz95

σ = 0.1 σ = 0.3 σ = 0.05 σ = 0.05 σ = 0.3

MSE 1765.48 170.73 0.15 0.009 26.86 4.80 0.15 0.03 24.81 3.11

PI-Width 34.59 5.69 0.37 0.006 3.61 1.06 0.24 0.03 6.59 0.69

CE 0.11 0.03 0.080 0.009 0.10 0.04 0.24 0.19 0.039 0.05

MSE 1419.29 103.78 0.13 0.001 9.66 1.51 0.072 0.001 18.10 1.75

PI-Width 37.64 0.55 0.36 0.008 2.90 0.03 0.21 0.012 5.98 0.11

CE 0.025 0.009 0.012 0.002 0.038 0.003 0.12 0.02 0.016 0.003

MSE 1285.90 30.9 0.12 0.002 7.76 0.67 0.076 0.001 18.57 0.51

PI-Width 36.83 0.39 0.36 0.015 2.99 0.06 0.21 0.01 6.09 0.15

CE 0.015 0.007 0.018 0.007 0.026 0.006 0.111 0.003 0.006 0.005

MSE 1249 11.03 0.12 0.001 6.89 0.30 0.076 0.002 17.06 0.86

PI-Width 38.19 1.39 0.36 0.001 2.85 0.08 0.21 0.006 6.02 0.10

CE 0.020 0.004 0.013 0.003 0.022 0.003 0.093 0.03 0.0021 0.003

MSE 1241.72 7.55 0.12 0.003 7.08 0.01 0.077 0.001 16.91 0.74

PI-Width 38.37 1.91 0.37 0.005 2.91 0.093 0.21 0.001 6.13 0.09

CE 0.018 0.001 0.017 0.004 0.021 0.004 0.092 0.01 0.0020 0.002

MSE 1248.17 15 0.11 0.001 6.87 0.32 0.076 0.002 16.44 0.66

PI-Width 38.66 0.38 0.37 0.008 2.93 0.17 0.20 0.005 6.08 0.06

CE 0.019 0.03 0.016 0.003 0.022 0.004 0.091 0.004 0.0015 0.001

Published in Transactions on Machine Learning Research (12/2025)

E.6 Embedding dimension

As discussed in section 4.2, the Encoder outputs a d-dimensional embedding given context state (cn t ), where d = 4. Here, we want to show that the performance of correction models is insensitive to the embedding dimension provided that the SL is retuned accordingly. In Table 10, the results on metrics and SL for LV at σ = 0.1 in the first row are copied from Tables 1 & 3, respectively. Table 10 shows that the CE shows a considerable increase from 0.0070 to 0.065 as we change the d to 5 from 4, while the MSE shows minor changes. The last row shows the results at d = 5 when the SL is retuned, where the CE (i.e., 0.0098) is close to what we had earlier at d = 4 (i.e., 0.0070).

Table 10: The impact of change in embedding dimension on the proposed metrics. The last row shows the results when SL is retuned for the new embedding dimension (d = 5). The results are reported over three runs.

Embedding Dimension (d)

SL of all outputs

Lotka Volterra

MSE PI-Width CE

d = 4 [70,30,70,40] 6.10 0.20 1.88 0.05 0.0070 0.002

d = 5 [70,30,70,40] 6.75 0.065 1.94 0.02 0.065 0.005

d = 5 [5,5,70,5] 6.98 0.115 1.69 0.01 0.0098 0.003

E.7 Impact of shuffled data

As discussed in section 4.1, the dataset is shuffled before forming queries, keys, and values for training. Here, we show that this shuffling brings two major benefits. One, the MSE and CE show significant improvement. Second, CE and MSE show changes with respect to the number of keys in memory that are consistent across different systems.

Table 11 shows the results on the proposed metrics for FHN and LV with and without shuffling the data. The MSE and CE increase considerably for both systems when the data is not shuffled, irrespective of the number of keys in memory. For LV, the MSE starts to decrease and CE starts to increase as we reduce the number of keys in memory from 2000 to 10. For FHN, in contrast, MSE and CE begin to grow as the number of keys goes down to 10 from 2000. On the other hand, the MSE and CE remain largely unchanged for both systems as we reduce the number of keys from 2000 to 500 when the data is shuffled. For keys less than 500, the MSE and CE show considerable change for both systems.

When the data is shuffled, the changes in MSE and CE as we reduce the number of keys show a generalizable pattern for both systems. Hence, the guidelines regarding the number of keys in the memory at the inference time are generalizable.

E.8 Number of models in Probabilistic Ensembles

As discussed in section 7, the Probabilistic Ensembles tend to saturate uncertainty after a certain number of models in the ensemble. Here, we demonstrate that for all propagation approaches via calibration curves. As shown in Fig. 8(a) for Expectation, the uncertainty goes up from overconfidence to underconfidence as we increase the number of models in the Probabilistic Ensemble. The last three increments in the number of models (13 to 15) don t impact uncertainty much. For Moment Matching (Fig. 8(b)), the uncertainty increases with the number of models and saturates later while the model is still overconfident. For Trajectory Sampling (Fig. 8(c)), increasing the number of models leads to a marginal change in uncertainty and saturates after four models.

Published in Transactions on Machine Learning Research (12/2025)

Table 11: Performance on proposed metrics at different numbers of keys in memory, with and without data shuffling. The results are averaged over three runs.

Keys in Memory Metric

Lotka-Volterra FHN

σ = 0.05 σ = 0.05

Shuffled Data Non-Shuffled Data Shuffled Data Non-Shuffled Data

MSE 26.86 4.80 33.09 0.009 0.15 0.03 0.16 0.01

PI-Width 3.61 1.06 0.54 0.001 0.24 0.03 0.17 0.005

CE 0.10 0.04 0.98 0.007 0.24 0.19 1.15 0.05

MSE 9.66 1.51 32.91 0.03 0.072 0.001 0.12 0.02

PI-Width 2.90 0.03 0.49 5.42 0.21 0.012 0.18 0.01

CE 0.038 0.03 0.74 0.005 0.12 0.02 0.38 0.06

MSE 7.76 0.67 32.63 0.09 0.076 0.001 0.10 0.02

PI-Width 2.99 0.06 0.51 0.001 0.21 0.01 0.16 0.01

CE 0.026 0.006 0.72 0.001 0.111 0.003 0.46 0.08

MSE 6.89 0.30 42.14 22.03 0.076 0.002 0.12 0.03

PI-Width 2.85 0.08 3.32 0.09 0.21 0.006 0.18 0.009

CE 0.022 0.003 0.12 0.06 0.093 0.03 0.41 0.09

MSE 7.08 0.01 34.80 16.27 0.077 0.001 0.092 0.01

PI-Width 2.91 0.093 3.36 0.07 0.21 0.001 0.27 0.006

CE 0.021 0.004 0.083 0.053 0.092 0.01 0.12 0.12

MSE 6.87 0.32 60.84 11.17 0.076 0.002 0.091 0.013

PI-Width 2.93 0.17 4.77 0.10 0.20 0.005 0.37 0.005

CE 0.022 0.004 0.063 0.015 0.091 0.004 0.16 0.042

Figure 8: The calibration curves of three propagation approaches for x-output of Lorenz (σ = 0.05) to delineate the saturation of uncertainty as the number of models in an ensemble goes up from 2 to 15.

E.9 Sequence length (SL) & Calibration

The section 7 discusses the impact of varying SL on the model s confidence for the y-output of Lorenz. Here, we show the same for all outputs of the two systems, i.e., Lorenz and Glycolytic Oscillator. The calibrated confidence of both systems with their optimal SL is shown in the middle plot of Fig. 9a & 9b. To show underconfidence and overconfidence, we pick a small SL = 10 and a large SL = 1000. At small SL = 10, we expect all outputs of both systems to be overconfident as shown in the leftmost plots of Fig. 9a & 9b. At

Published in Transactions on Machine Learning Research (12/2025)

large SL = 1000, we expect all outputs of both systems to be underconfident as shown in the rightmost plots of Fig. 9a & 9b. It shows that the concept of the increase in uncertainty with the increase in SL generalizes well across different outputs of different systems.

(a) Lorenz (σ = 0.1). The leftmost plot shows the overconfidence of all six outputs of Lorenz at small SL. The middle plot shows the calibrated confidence for all outputs at optimal SL. The rightmost plot shows the underconfidence of all outputs at large SL.

(b) Glycolytic Oscillator (σ = 0.3). The leftmost plot shows the overconfidence of all seven outputs at small SL. The middle plot shows the calibrated confidence for all outputs at optimal SL. The rightmost plot shows the underconfidence of all outputs at large SL.

Figure 9: Calibration curves of Hop Cast for two dynamical systems to demonstrate the impact of SL on calibration. At the bottom of each plot, SL of all outputs are shown in ascending order such that the first value corresponds to the first output, second to the second output, and so on.

F Implementation Details

This section contains details regarding the implementation of baselines and Hop Cast. All experiments were run on NVIDIA A100-SXM4-80GB. Py Torch is used to implement baselines and Hop Cast.

F.1 Baselines Implementation

We train a population of 15 models for the Probabilistic Ensembles. Regarding architecture, two layers of a fully connected feedforward model with 400 neurons each were used for LV and FHN, and three layers with 400 neurons for Lorenz, Lorenz95, and Glycolytic Oscillator. The train/test datasets were prepared following an 80/20 split. The batch size, learning rate, optimizer, and epochs were kept the same for all experiments, and are 128, 0.001, Adam (Kingma & Ba, 2017), and 1000, respectively. The early stopping was used as a criterion to train the Probabilistic Ensembles (Yao et al., 2007). The number of particles P for each model in Probabilistic Ensembles was 1 for Expectation, and 20 for Moment Matching & Trajectory Sampling (Chua et al., 2018a).

Published in Transactions on Machine Learning Research (12/2025)

F.2 Hop Cast Implementation

The Encoder was a fully connected feedforward model with one layer and 100 neurons. The section 7 contains details about tuning an important hyperparameter, i.e., SL, of Hop Cast. Table 3 has SL for each output of all systems at various noise scaling factors σ. The learning rate of 0.001 and the optimizer Adam W (Loshchilov & Hutter, 2019) were kept the same for all experiments. The batch size and epochs were different for each experiment, and are provided in the form of yml files as a supplementary material along with other hyperparameters for each system and noise scaling factor (σ).

The deterministic Predictor was a fully connected feedforward model of two layers with 400 neurons each for LV and FHN, and three layers with 400 neurons for Lorenz, Lorenz95, and Glycolytic Oscillator.