# PreDiff: Precipitation Nowcasting with Latent Diffusion Models

Zhihan Gao (Hong Kong University of Science and Technology, zhihan.gao@connect.ust.hk), Xingjian Shi (Boson AI, xshiab@connect.ust.hk), Boran Han (AWS, boranhan@amazon.com), Hao Wang (AWS AI Labs, howngz@amazon.com), Xiaoyong Jin (Amazon, jxiaoyon@amazon.com), Danielle Maddix (AWS AI Labs, dmmaddix@amazon.com), Yi Zhu (Boson AI, yi@boson.ai), Mu Li (Boson AI, mu@boson.ai), Yuyang Wang (AWS AI Labs, yuyawang@amazon.com)

Work conducted during an internship at Amazon. Work conducted while at Amazon. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## Abstract

Earth system forecasting has traditionally relied on complex physical models that are computationally expensive and require significant domain expertise. In the past decade, the unprecedented increase in spatiotemporal Earth observation data has enabled data-driven forecasting models built with deep learning techniques. These models have shown promise for diverse Earth system forecasting tasks. However, they either struggle with handling uncertainty or neglect domain-specific prior knowledge; as a result, they tend to average possible futures into blurred forecasts or generate physically implausible predictions. To address these limitations, we propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1) We develop PreDiff, a conditional latent diffusion model capable of probabilistic forecasts. 2) We incorporate an explicit knowledge alignment mechanism to align forecasts with domain-specific physical constraints. This is achieved by estimating the deviation from the imposed constraints at each denoising step and adjusting the transition distribution accordingly. We conduct empirical studies on two datasets: N-body MNIST, a synthetic dataset with chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset. Specifically, we impose the law of conservation of energy in N-body MNIST and anticipated precipitation intensity in SEVIR. Experiments demonstrate the effectiveness of PreDiff in handling uncertainty, incorporating domain-specific prior knowledge, and generating forecasts that exhibit high operational utility.

## 1 Introduction

Earth's intricate climate system significantly influences daily life. Precipitation nowcasting, tasked with delivering accurate rainfall forecasts for the near future (e.g., 0-6 hours), is vital for decision-making across numerous industries and services. Recent advancements in data-driven deep learning (DL) techniques have demonstrated promising potential in this field, rivaling conventional numerical methods [8, 5] with the advantages of being more skillful [5], efficient [37], and scalable [3]. However, accurately predicting future rainfall remains challenging for data-driven algorithms.

State-of-the-art Earth system forecasting algorithms [47, 61, 41, 37, 8, 69, 2, 29, 3] typically generate blurry predictions. This is caused by the high variability and complexity inherent to Earth's climate system: even minor differences in initial conditions can lead to vastly divergent outcomes that are difficult to predict. Most methods adopt a point estimate of the future rainfall and are trained by minimizing pixel-wise loss functions (e.g., mean-squared error). These methods lack the capability of capturing multiple plausible futures and generate blurry forecasts that lose important operational details.
Therefore, probabilistic models that can represent the uncertainty inherent in stochastic systems are needed instead. Probabilistic models can capture multiple plausible futures, generating diverse high-quality predictions that better align with real-world data. The emergence of diffusion models (DMs) [22] has enabled powerful probabilistic frameworks for generative modeling. DMs have shown remarkable capabilities in generating high-quality images [40, 45, 43] and videos [15, 23]. As likelihood-based models, DMs do not exhibit mode collapse or training instabilities as GANs [10] do. Compared to autoregressive (AR) models [53, 46, 63, 39, 65] that generate images pixel by pixel, DMs can produce higher-resolution images faster and with higher quality. They are also better at handling uncertainty [62, 34, 57-59] without drawbacks such as the exposure bias [13] of AR models. Latent diffusion models (LDMs) [42, 52] further improve on DMs by separating the model into two phases and applying the costly diffusion only in a compressed latent space. This alleviates the computational cost of DMs without significantly impairing performance.

Despite DMs' success in image and video generation [42, 15, 66, 36, 32, 56], their application to precipitation nowcasting and Earth system forecasting is still in its early stages [16]. One major concern is that this purely data-centric approach lacks constraints and controls from prior knowledge about the dynamical system. Some spatiotemporal forecasting approaches have incorporated domain knowledge by modifying the model architecture or adding extra training losses [11, 1, 37]. This makes them aware of prior knowledge and able to generate physically plausible forecasts. However, these approaches still face challenges, such as requiring new model architectures or retraining the entire model from scratch whenever the constraints change. More detailed discussions of related works are provided in Appendix A.

Inspired by recent successes in controllable generative models [68, 24, 4, 33, 6], we propose a general two-stage pipeline for training data-driven Earth system forecasting models. 1) In the first stage, we focus on capturing the intrinsic semantics in the data by training an LDM. To capture Earth's long-term and complex changes, we instantiate the LDM's core neural network as a UNet-style architecture based on Earthformer [8]. 2) In the second stage, we inject prior knowledge of the Earth system by training a knowledge alignment network that guides the sampling process of the LDM. Specifically, the alignment network parameterizes an energy function that adjusts the transition probabilities during each denoising step. This encourages the generation of physically plausible intermediate latent states while suppressing those likely to violate the given domain knowledge.

We summarize our main contributions as follows:

- We introduce PreDiff, a novel LDM-based model for precipitation nowcasting.
- We propose a general two-stage pipeline for training data-driven Earth system forecasting models. Specifically, we develop a knowledge alignment mechanism to guide the sampling process of PreDiff. This mechanism ensures that the generated predictions better align with domain-specific prior knowledge, thereby enhancing the reliability of the forecasts, without requiring any modifications to the trained PreDiff model.
- Our method achieves state-of-the-art performance on the N-body MNIST [8] dataset and attains state-of-the-art perceptual quality on the SEVIR [55] dataset.
## 2 Method

We follow [47, 48, 55, 1, 8] to formulate precipitation nowcasting as a spatiotemporal forecasting problem. The $L_{in}$-step observation is represented as a spatiotemporal sequence $y = [y_j]_{j=1}^{L_{in}} \in \mathbb{R}^{L_{in} \times H \times W \times C}$, where $H$ and $W$ denote the spatial resolution, and $C$ denotes the number of measurements at each space-time coordinate. Probabilistic forecasting aims to model the conditional probability distribution $p(x|y)$ of the $L_{out}$-step-ahead future $x = [x_j]_{j=1}^{L_{out}} \in \mathbb{R}^{L_{out} \times H \times W \times C}$, given the observation $y$. In what follows, we present the parameterization of $p(x|y)$ by a controllable LDM.

Figure 1: Overview of PreDiff inference with knowledge alignment. An observation sequence $y$ is encoded into a latent context $z_{cond}$ by the frame-wise encoder $\mathcal{E}$. The latent diffusion model $p_\theta(z_t|z_{t+1}, z_{cond})$, which is parameterized by an Earthformer-UNet, then generates the latent future $z_0$ by autoregressively denoising Gaussian noise $z_T$ conditioned on $z_{cond}$. It takes the concatenation of the latent context $z_{cond}$ (in the blue border) and the previous-step noisy latent future $z_{t+1}$ (in the cyan border) as input, and outputs $z_t$. The transition distribution of each step from $z_{t+1}$ to $z_t$ can be further refined as $p_{\theta,\phi}(z_t|z_{t+1}, y, \mathcal{F}_0)$ via knowledge alignment, according to auxiliary prior knowledge. This denoising process iterates from $t = T$ to $t = 0$, resulting in a denoised latent future $z_0$. Finally, $z_0$ is decoded back to pixel space by the frame-wise decoder $\mathcal{D}$ to produce the final prediction $\hat{x}$. (Best viewed in color.)

### 2.1 Preliminary: Diffusion Models

Diffusion models (DMs) learn the data distribution $p(x)$ by training a model to reverse a predefined noising process that progressively corrupts the data. Specifically, the noising process is defined as $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1 - \alpha_t)I)$, $1 \le t \le T$, where $x_0 \sim p(x)$ is the true data, and $x_T \sim \mathcal{N}(0, I)$ is random noise. The coefficients $\alpha_t$ follow a fixed schedule over the timesteps $t$. DMs factorize and parameterize the joint distribution over the data $x_0$ and the noisy latents $x_t$ as $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$, where each step of the reverse denoising process is a Gaussian distribution $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$, trained to recover $x_{t-1}$ from $x_t$. To apply DMs to spatiotemporal forecasting, $p(x|y)$ is factorized and parameterized as $p_\theta(x|y) = \int p_\theta(x_{0:T}|y)\, dx_{1:T} = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t, y)\, dx_{1:T}$, where $p_\theta(x_{t-1}|x_t, y)$ represents the conditional denoising transition with the condition $y$.
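The forward process above admits a closed form that is used throughout training: with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, we have $q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)I)$. The PyTorch sketch below illustrates this closed-form corruption; the linear beta schedule, the value of $T$, and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

# Illustrative linear beta schedule; the paper's exact schedule and T are
# assumptions here, not taken from the text.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # abar_t = prod_{s <= t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over the batch
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

# Usage: corrupt a batch of latent sequences at random timesteps.
x0 = torch.randn(8, 10, 16, 16, 3)          # (B, L, H_z, W_z, C_z); shapes illustrative
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```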
### 2.2 Conditional Diffusion in Latent Space

To improve the computational efficiency of DM training and inference, PreDiff follows LDM [42] and adopts a two-phase training procedure that leverages the benefits of lower-dimensional latent representations. The two sequential phases of PreDiff training are: 1) training a frame-wise variational autoencoder (VAE) [28] that encodes the pixel space into a lower-dimensional latent space, and 2) training a conditional DM that generates predictions in this acquired latent space.

Frame-wise autoencoder. We follow [7] to train a frame autoencoder using a combination of a pixel-wise loss (e.g., L2 loss) and an adversarial loss. Different from [7], we exclude the perceptual loss since there are no standard pretrained models for perception on Earth observation data. Specifically, the encoder $\mathcal{E}$ is trained to encode a data frame $x_j \in \mathbb{R}^{H \times W \times C}$ to a latent representation $z_j = \mathcal{E}(x_j) \in \mathbb{R}^{H_z \times W_z \times C_z}$. The decoder $\mathcal{D}$ learns to reconstruct the data frame $\hat{x}_j = \mathcal{D}(z_j)$ from the encoded latent. We denote $z \sim p_\mathcal{E}(z|x) \in \mathbb{R}^{L \times H_z \times W_z \times C_z}$ as equivalent to $z = [z_j] = [\mathcal{E}(x_j)]$, representing the encoding of a sequence of frames in pixel space into a latent spatiotemporal sequence, and $x \sim p_\mathcal{D}(x|z)$ denotes decoding a latent spatiotemporal sequence.

Latent diffusion. The context $y$ is encoded by the frame-wise encoder $\mathcal{E}$ into the learned latent space as $z_{cond} \in \mathbb{R}^{L_{in} \times H_z \times W_z \times C_z}$, as in (1). The conditional distribution $p_\theta(z_{0:T}|z_{cond})$ of the latent future $z \in \mathbb{R}^{L_{out} \times H_z \times W_z \times C_z}$ given $z_{cond}$ is factorized and parameterized as in (2):

$$z_{cond} \sim p_\mathcal{E}(z_{cond}|y), \tag{1}$$

$$p_\theta(z_{0:T}|z_{cond}) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1}|z_t, z_{cond}), \tag{2}$$

where $z_T \sim p(z_T) = \mathcal{N}(0, I)$. As proposed by [22, 45], an equivalent parameterization is to have the DM learn to match the transition noise $\epsilon_\theta(z_t, t)$ of step $t$ instead of directly predicting $z_{t-1}$. The training objective of PreDiff is thus simplified as shown in (3):

$$\mathcal{L}_{CLDM} = \mathbb{E}_{(x,y),\, t,\, \epsilon \sim \mathcal{N}(0,I)} \left\| \epsilon - \epsilon_\theta(z_t, t, z_{cond}) \right\|_2^2, \tag{3}$$

where $(x, y)$ is a sampled pair of context and target sequences, and, given that, $z_t \sim q(z_t|z_0) p_\mathcal{E}(z_0|x)$ and $z_{cond} \sim p_\mathcal{E}(z_{cond}|y)$.

Instantiating $p_\theta(z_{t-1}|z_t, z_{cond})$. Compared to images, modeling the spatiotemporal observation data in precipitation nowcasting poses greater challenges due to its higher dimensionality. We propose replacing the UNet backbone in LDM [42] with Earthformer-UNet, derived from Earthformer's encoder [8], which is known for its ability to model intricate and extensive spatiotemporal dependencies in the Earth system. Earthformer-UNet adopts a hierarchical UNet architecture with self cuboid attention [8] as the building block, excluding the bridging cross-attention in the encoder-decoder architecture of Earthformer. More details of the architecture design of Earthformer-UNet are provided in Appendix B.1. We find Earthformer-UNet to be more stable and effective at modeling the transition distribution $p_\theta(z_{t-1}|z_t, z_{cond})$. It takes the concatenation of the encoded latent context $z_{cond}$ and the noisy latent future $z_t$ along the temporal dimension as input, and predicts the one-step-ahead noisy latent future $z_{t-1}$ (in practice, the transition noise $\epsilon$ from $z_t$ to $z_{t-1}$ is predicted, as shown in (3)).
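To make the two-phase training concrete, the following is a minimal sketch of one optimization step for the objective in (3). It assumes a frozen frame-wise encoder `encoder` and a denoiser `eps_theta` with the conditioning interface described above (hypothetical stand-ins for the trained VAE and the Earthformer-UNet), and reuses the `q_sample` helper from the previous sketch.

```python
import torch
import torch.nn.functional as F

def cldm_training_step(eps_theta, encoder, q_sample, x, y, T, optimizer):
    """One optimization step of the simplified objective in (3):
    L_CLDM = E ||eps - eps_theta(z_t, t, z_cond)||^2 (up to a constant factor,
    since mse_loss averages over elements)."""
    with torch.no_grad():                   # the VAE is trained in phase 1 and frozen here
        z0 = encoder(x)                     # latent future, (B, L_out, H_z, W_z, C_z)
        z_cond = encoder(y)                 # latent context, (B, L_in, H_z, W_z, C_z)
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)              # transition noise to be matched
    zt = q_sample(z0, t, eps)               # forward noising q(z_t | z_0)
    loss = F.mse_loss(eps_theta(zt, t, z_cond), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```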
### 2.3 Incorporating Knowledge Alignment

Though DMs hold great promise for diverse and realistic generation, the generated predictions may violate physical constraints or disregard domain-specific prior knowledge, and thereby fail to give plausible and non-trivial results [14, 44]. One possible reason is that DMs are not necessarily trained on data fully compliant with the domain knowledge. Even when trained on such data, there is no guarantee that generations sampled from the learned distribution will remain physically realizable. The causes may also stem from the stochastic nature of chaotic systems, the approximation error of the denoising steps, etc. To address this issue, we propose knowledge alignment to incorporate auxiliary prior knowledge,

$$\mathcal{F}(\hat{x}, y) = \mathcal{F}_0(y) \in \mathbb{R}^d, \tag{4}$$

into the diffusion generation process. Knowledge alignment imposes a constraint $\mathcal{F}$ on the forecast $\hat{x}$, optionally with the observation $y$, based on domain expertise. E.g., for an isolated physical system, the knowledge $E(\hat{x}, \cdot) = E_0(y_{L_{in}}) \in \mathbb{R}$ imposes the conservation of energy by enforcing the generation $\hat{x}$ to keep the total energy $E(\hat{x}, \cdot)$ the same as at the last observation, $E_0(y_{L_{in}})$. The violation $\|\mathcal{F}(\hat{x}, y) - \mathcal{F}_0(y)\|$ quantifies the deviation of a prediction $\hat{x}$ from the prior knowledge; a larger violation indicates that $\hat{x}$ diverges further from the constraints. Knowledge alignment hence aims to suppress the probability of generating predictions with large violation. Notice that even the target futures $x$ from the training data may violate the knowledge, i.e., $\mathcal{F}(x, y) \ne \mathcal{F}_0(y)$, due to noise in data collection or simulation.

Inspired by classifier guidance [4], we achieve knowledge alignment by training a knowledge alignment network $U_\phi(z_t, t, y)$ to estimate $\mathcal{F}(\hat{x}, y)$ from the intermediate latent $z_t$ at noising step $t$. The key idea is to adjust the transition probability distribution $p_\theta(z_{t-1}|z_t, z_{cond})$ in (2) during each latent denoising step to reduce the likelihood of sampling $z_t$ values that are expected to violate the constraints:

$$p_{\theta,\phi}(z_t|z_{t+1}, y, \mathcal{F}_0) \propto p_\theta(z_t|z_{t+1}, z_{cond}) \cdot e^{-\lambda_{\mathcal{F}} \left\| U_\phi(z_t, t, y) - \mathcal{F}_0(y) \right\|}, \tag{5}$$

where $\lambda_{\mathcal{F}}$ is a guidance scale factor. The knowledge alignment network is trained by optimizing the objective $\mathcal{L}_U$ in Alg. 1. According to [4], (5) can be approximated by shifting the predicted mean of the denoising transition $\mu_\theta(z_{t+1}, t, z_{cond})$ by $-\lambda_{\mathcal{F}} \Sigma_\theta \nabla_{z_t} \|U_\phi(z_t, t, y) - \mathcal{F}_0(y)\|$, where $\Sigma_\theta$ is the variance of the original transition distribution $p_\theta(z_t|z_{t+1}, z_{cond}) = \mathcal{N}(\mu_\theta(z_{t+1}, t, z_{cond}), \Sigma_\theta(z_{t+1}, t, z_{cond}))$. A detailed derivation is provided in Appendix C.

Algorithm 1: One training step of the knowledge alignment network $U_\phi$
1. $(x, y) \sim$ training data
2. $t \sim \text{Uniform}(0, T)$
3. $z_t \sim q(z_t|z_0)\, p_\mathcal{E}(z_0|x)$
4. $\mathcal{L}_U \leftarrow \|U_\phi(z_t, t, y) - \mathcal{F}(x, y)\|$

The training procedure of knowledge alignment is outlined in Alg. 1. The noisy latent $z_t$ for training the knowledge alignment network $U_\phi$ is sampled by encoding the target $x$ with the frame-wise encoder $\mathcal{E}$ and applying the forward noising process $q(z_t|z_0)$, eliminating the need for an inference-time sampling process. This makes the training of $U_\phi$ independent of the LDM training. At inference time, the knowledge alignment mechanism is applied as a plug-in, without impacting the trained VAE and LDM. This modular approach allows training lightweight knowledge alignment networks $U_\phi$ to flexibly explore various constraints and domain knowledge, without the need for retraining the entire model. This stands as a key advantage over incorporating constraints into model architectures or training losses.
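The mean-shift approximation of (5) can be sketched as follows. Here `mu_theta`, `sigma_theta`, and `U_phi` are hypothetical callables standing in for the trained transition model and knowledge alignment network; evaluating the violation gradient at the current noisy latent is a common simplification of classifier-style guidance, not necessarily the paper's exact implementation.

```python
import torch

def aligned_denoise_step(mu_theta, sigma_theta, U_phi, zt1, t, z_cond, y, F0, lam):
    """One knowledge-aligned transition z_{t+1} -> z_t.

    Approximates (5) by shifting the predicted mean of
    p_theta(z_t | z_{t+1}, z_cond) = N(mu, Sigma) by
    -lam * Sigma * grad ||U_phi(z, t, y) - F0(y)||."""
    mu = mu_theta(zt1, t, z_cond)            # mean of the unguided transition
    var = sigma_theta(zt1, t, z_cond)        # (diagonal) variance of the transition
    # Evaluate the violation gradient at the current noisy latent
    # (a common simplification of classifier-style guidance).
    z = zt1.detach().requires_grad_(True)
    violation = (U_phi(z, t, y) - F0).norm()
    grad = torch.autograd.grad(violation, z)[0]
    mu_aligned = mu - lam * var * grad       # shifted mean approximating (5)
    # Sample z_t; in practice the added noise is omitted at the final step t = 0.
    return mu_aligned + var.sqrt() * torch.randn_like(mu)
```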
## 3 Experiments

We conduct empirical studies and compare PreDiff with other state-of-the-art spatiotemporal forecasting models on a synthetic dataset, N-body MNIST [8], and a real-world precipitation nowcasting benchmark, SEVIR² [55], to verify the effectiveness of PreDiff in handling the dynamics and uncertainty of complex spatiotemporal systems and generating high-quality, accurate forecasts. We impose data-specific knowledge alignment: energy conservation on N-body MNIST and anticipated precipitation intensity on SEVIR. Experiments demonstrate that PreDiff under the guidance of knowledge alignment (PreDiff-KA) is able to generate predictions that comply with domain expertise much better, without severely sacrificing fidelity.

### 3.1 N-body MNIST Digits Motion Forecasting

Dataset. The Earth is a chaotic system with complex dynamics. Real-world Earth observation data, such as radar echo maps and satellite imagery, are usually not physically complete, so we are unable to directly verify whether certain domain knowledge, like the conservation laws of energy and momentum, is satisfied. This makes it difficult to verify whether a method is truly capable of modeling certain dynamics and adhering to the corresponding constraints. To address this, we follow [8] to generate a synthetic dataset named N-body MNIST³, which is an extension of Moving MNIST [50]. The dataset contains sequences of digits moving subject to the gravitational force from the other digits. The governing equation of the motion is

$$\frac{d^2 x_i}{dt^2} = \sum_{j \ne i} -\frac{G m_j (x_i - x_j)}{(|x_i - x_j| + d_{soft})^r},$$

where $x_i$ is the spatial coordinate of the $i$-th digit, $G$ is the gravitational constant, $m_j$ is the mass of the $j$-th digit, $r$ is a constant representing the power scale in the gravitational law, and $d_{soft}$ is a small softening distance that ensures numerical stability. The motion occurs within a $64 \times 64$ frame. When a digit hits a boundary of the frame, it bounces back by elastic collision. We use $N = 3$ for chaotic 3-body motion [35]. The forecasting task is to predict the 10-step-ahead future frames $x \in \mathbb{R}^{10 \times 64 \times 64 \times 1}$ given the length-10 context $y \in \mathbb{R}^{10 \times 64 \times 64 \times 1}$. We generate 20,000 sequences for training and 1,000 sequences for testing. Empirical studies on such a synthetic dataset with known dynamics help provide useful insights for model development and evaluation. A minimal simulation sketch is given below.

²The dataset is available at https://sevir.mit.edu/
³Code is available at https://github.com/amazon-science/earth-forecasting-transformer/tree/main/src/earthformer/datasets/nbody
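For intuition, the following is a minimal explicit-Euler sketch of the governing equation with elastic boundary reflection. The constants (`G`, `masses`, `r`, `d_soft`, `dt`) and the integration scheme are illustrative assumptions; the released dataset code linked in the footnote defines the actual simulation.

```python
import numpy as np

def nbody_step(pos, vel, masses, dt=0.1, G=1.0, r=2.0, d_soft=1e-2, size=64):
    """One explicit-Euler step of
    d^2 x_i / dt^2 = sum_{j != i} -G m_j (x_i - x_j) / (|x_i - x_j| + d_soft)^r,
    with elastic reflection at the frame boundaries.

    pos, vel: (N, 2) arrays of digit positions and velocities."""
    n = pos.shape[0]
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = pos[i] - pos[j]                        # x_i - x_j
            dist = np.linalg.norm(diff)
            acc[i] += -G * masses[j] * diff / (dist + d_soft) ** r
    vel = vel + dt * acc
    pos = pos + dt * vel
    # Elastic collision with the 64 x 64 frame boundaries.
    hit_low, hit_high = pos < 0, pos > size - 1
    pos = np.where(hit_low, -pos, pos)
    pos = np.where(hit_high, 2 * (size - 1) - pos, pos)
    vel = np.where(hit_low | hit_high, -vel, vel)
    return pos, vel
```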
Figure 2: A set of example predictions on the N-body MNIST test set. From top to bottom: context sequence $y$, target sequence $x$ (E.MSE = 0.0261), and predictions by ConvLSTM [47] (E.MSE = 0.0329), Earthformer [8] (E.MSE = 0.0296), VideoGPT [65] (E.MSE = 0.0253), LDM [42] (E.MSE = 0.0354), PreDiff (E.MSE = 0.0277), and PreDiff with knowledge alignment (PreDiff-KA, E.MSE = 0.0086). E.MSE denotes the average error between the total energy (kinetic + potential) of the predictions $E(\hat{x}_j)$ and the total energy of the last context frame $E(y_{L_{in}})$. The red dashed line helps the reader judge the position of the digit "2" in the last frame.

Evaluation. In addition to the standard metrics MSE, MAE, and SSIM, we also report the Fréchet Video Distance (FVD) [51], a metric for evaluating the visual quality of generated videos. Similar to the Fréchet Inception Distance (FID) [20] for evaluating image generation, FVD estimates the distance between the learned distribution and the true data distribution by comparing the statistics of feature vectors extracted from the generations and the real data. The inception network used in FVD for feature extraction is pretrained on video classification and is not specifically adapted for processing "unnatural videos" such as spatiotemporal observation data in Earth systems. Consequently, the FVD scores on the N-body MNIST and SEVIR datasets cannot be directly compared with those on natural video datasets. Nevertheless, the relative ranking of FVD scores remains a meaningful indicator of a model's ability to achieve high visual quality, as FVD has shown consistency with expert evaluations across various domains beyond natural images [38, 26]. Scores for all involved metrics are calculated using an ensemble of eight samples from each model.

#### 3.1.1 Comparison with the State of the Art

We evaluate seven deterministic spatiotemporal forecasting models: UNet [55], ConvLSTM [47], PredRNN [61], PhyDNet [11], E3D-LSTM [60], Rainformer [1], and Earthformer [8], as well as two probabilistic spatiotemporal forecasting models: VideoGPT [65] and LDM [42]. All baselines are trained following the default configurations in their officially released code. More implementation details of the baselines are provided in Appendix B.2.

The results in Table 1 show that PreDiff outperforms these baselines by a large margin both in conventional video prediction metrics (i.e., MSE, MAE, SSIM) and in a perceptual quality metric, FVD. The example predictions in Fig. 2 demonstrate that PreDiff generates predictions with sharp and clear digits in accurate positions. In contrast, the deterministic baselines resort to generating blurry predictions to accommodate uncertainty. The probabilistic baselines, though producing sharp strokes, either predict incorrect positions or fail to reconstruct the digits. The performance gap between LDM [42] and PreDiff serves as an ablation study that highlights the importance of the latent backbone's spatiotemporal modeling capacity: the Earthformer-UNet utilized in PreDiff demonstrates superior performance compared to the UNet in LDM [42].

Table 1: Performance comparison on N-body MNIST. We report conventional frame quality metrics (MSE, MAE, SSIM), along with the Fréchet Video Distance (FVD) [51] for assessing visual quality. Energy conservation is evaluated via E.MSE and E.MAE between the detected energy of the predictions $E_{det}(\hat{x})$ and the initial energy $E(y_{L_{in}})$. Lower values on the energy metrics indicate better compliance with the conservation of energy.

| Model | #Param. (M) | MSE | MAE | SSIM | FVD | E.MSE | E.MAE |
|---|---|---|---|---|---|---|---|
| Target | - | 0.000 | 0.000 | 1.0000 | 0.000 | 0.0132 | 0.0697 |
| Persistence | - | 104.9 | 139.0 | 0.7270 | 168.3 | - | - |
| UNet [55] | 16.6 | 38.90 | 94.29 | 0.8260 | 142.3 | - | - |
| ConvLSTM [47] | 14.0 | 32.15 | 72.64 | 0.8886 | 86.31 | - | - |
| PredRNN [61] | 23.8 | 21.76 | 54.32 | 0.9288 | 20.65 | - | - |
| PhyDNet [11] | 3.1 | 28.97 | 78.66 | 0.8206 | 178.0 | - | - |
| E3D-LSTM [60] | 12.9 | 22.98 | 62.52 | 0.9131 | 22.28 | - | - |
| Rainformer [1] | 19.2 | 38.89 | 96.47 | 0.8036 | 163.5 | - | - |
| Earthformer [8] | 7.6 | 14.82 | 39.93 | 0.9538 | 6.798 | - | - |
| VideoGPT [65] | 92.2 | 53.68 | 77.42 | 0.8468 | 39.28 | 0.0228 | 0.1092 |
| LDM [42] | 410.3 | 46.29 | 72.19 | 0.8773 | 3.432 | 0.0243 | 0.1172 |
| PreDiff | 120.7 | 9.492 | 25.01 | 0.9716 | 0.987 | 0.0226 | 0.1083 |
| PreDiff-KA | 129.4 | 21.90 | 43.57 | 0.9303 | 4.063 | 0.0039 | 0.0443 |

#### 3.1.2 Knowledge Alignment: Energy Conservation

In the N-body MNIST simulation, the digits move according to Newton's law of gravity and interact with the boundaries through elastic collisions. Consequently, this system obeys the law of conservation of energy: the total energy of the whole system $E(x_j)$ at any future time step $j$ should equal the total energy at the last observation time step, $E(y_{L_{in}})$. We impose the law of conservation of energy for knowledge alignment on N-body MNIST in the form of (4):

$$\mathcal{F}(\hat{x}, y) \triangleq [E(\hat{x}_1), \dots, E(\hat{x}_{L_{out}})]^T, \tag{6}$$

$$\mathcal{F}_0(y) \triangleq [E(y_{L_{in}}), \dots, E(y_{L_{in}})]^T. \tag{7}$$

The ground-truth values of the total energies $E(y_{L_{in}})$ and $E(x_j)$ are directly accessible since N-body MNIST is a synthetic dataset from simulation. The total energy can be derived from the velocities (kinetic energy) and positions (potential energy) of the moving digits. A knowledge alignment network $U_\phi$ is trained following Alg. 1 to guide PreDiff to generate forecasts $\hat{x}$ that conserve the same energy as the initial step, $E(y_{L_{in}})$.

To verify the effectiveness of knowledge alignment in guiding the generations to comply with the law of conservation of energy, we train an energy detector $E_{det}(\hat{x})$⁴ that detects the total energy of the forecasts $\hat{x}$. We evaluate the energy error between the forecasts and the initial energy using $\text{E.MSE}(\hat{x}, y) \triangleq \text{MSE}(E_{det}(\hat{x}), E(y_{L_{in}}))$ and $\text{E.MAE}(\hat{x}, y) \triangleq \text{MAE}(E_{det}(\hat{x}), E(y_{L_{in}}))$. In this evaluation, we exclude the methods that generate blurry predictions with ambiguous digit positions and only focus on the methods that are capable of producing clear digits in precise positions.

⁴The test MSE of the energy detector is $5.56 \times 10^{-5}$, which is much smaller than the E.MSE scores shown in Table 1. This indicates that the energy detector is sufficiently precise and reliable for verifying energy conservation in the model forecasts.
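Given such a detector, the energy metrics reduce to a few lines. The sketch below assumes a hypothetical `E_det` interface mapping a forecast sequence to per-frame total energies.

```python
import torch

def energy_errors(E_det, x_hat, E0):
    """E.MSE / E.MAE between detected per-frame total energies of a forecast
    and the energy at the last observed step, E(y_{L_in}).

    x_hat: (B, L_out, H, W, C) forecast; E0: (B,) initial energies;
    E_det: hypothetical detector returning (B, L_out) energies."""
    E_hat = E_det(x_hat)
    diff = E_hat - E0.unsqueeze(1)           # compare every future step to E(y_{L_in})
    return (diff ** 2).mean().item(), diff.abs().mean().item()
```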
As illustrated in Table 1, PreDiff-KA substantially outperforms all baseline methods and PreDiff without knowledge alignment in E.MSE and E.MAE. This demonstrates that the forecasts of PreDiff-KA comply much better with the law of conservation of energy, while still maintaining high visual quality, with an FVD score of 4.063.

Furthermore, we detect the energy errors of the target data sequences themselves. The first row of Table 1 indicates that even the targets from the training data may not strictly adhere to the prior knowledge, possibly due to discretization errors in the simulation. Table 1 shows that all baseline methods and PreDiff have larger energy errors than the target, meaning that purely data-oriented approaches cannot eliminate the impact of noise in the training data. In contrast, PreDiff-KA, guided by the law of conservation of energy, overcomes the intrinsic defects in the training data, achieving even lower energy errors than the target.

A typical example shown in Fig. 2 demonstrates that while PreDiff precisely reproduces the ground-truth position of the digit "2" in the last frame (aligned to the red dashed line), resulting in nearly the same energy error (E.MSE = 0.0277) as the ground truth's (E.MSE = 0.0261), PreDiff-KA successfully corrects the motion of the digit "2", providing it with a physically plausible velocity and position (slightly off the red dashed line). The knowledge alignment ensures that the generation complies better with the law of conservation of energy, resulting in a much lower E.MSE of 0.0086. On the contrary, none of the evaluated baselines can overcome the intrinsic noise in the data, resulting in energy errors comparable to or larger than that of the ground truth.

Notice that the pixel-wise scores MSE, MAE, and SSIM are less meaningful for evaluating PreDiff-KA, since correcting the noise in the energy changes the velocities and positions of the digits. A minor change in the position of a digit can cause a large pixel-wise error, even though the digit is still generated sharply and in high quality, as shown in Fig. 2.

### 3.2 SEVIR Precipitation Nowcasting

Dataset. The Storm EVent ImageRy (SEVIR) dataset [55] is a spatiotemporal Earth observation dataset consisting of 384 km × 384 km image sequences spanning over 4 hours. Images in SEVIR are sampled and aligned across five different data types: three channels (C02, C09, C13) from the GOES-16 advanced baseline imager, NEXRAD Vertically Integrated Liquid (VIL) mosaics, and GOES-16 Geostationary Lightning Mapper (GLM) flashes.
The SEVIR benchmark supports scientific research on multiple meteorological applications, including precipitation nowcasting, synthetic radar generation, front detection, etc. Due to computational resource limitations, we adopt a downsampled version of SEVIR for benchmarking precipitation nowcasting. The task is to predict the future VIL up to 60 minutes ahead (6 frames) given 70 minutes of context VIL (7 frames) at a spatial resolution of 128×128, i.e., $x \in \mathbb{R}^{6 \times 128 \times 128 \times 1}$, $y \in \mathbb{R}^{7 \times 128 \times 128 \times 1}$.

Evaluation. Following [55, 8], we adopt the Critical Success Index (CSI) for evaluation, which is commonly used in precipitation nowcasting and is defined as $\text{CSI} = \frac{\#\text{Hits}}{\#\text{Hits} + \#\text{Misses} + \#\text{F.Alarms}}$. To count the #Hits (truth = 1, pred = 1), #Misses (truth = 1, pred = 0), and #F.Alarms (truth = 0, pred = 1), the prediction and the ground truth are rescaled to the range 0-255 and binarized at the thresholds [16, 74, 133, 160, 181, 219]. We also follow [41] to report the CSI at pooling scales 4×4 and 16×16, which evaluates performance on neighborhood aggregations at multiple spatial scales. These pooled CSI metrics assess the models' ability to capture local pattern distributions. Additionally, we incorporate FVD [51] and the continuous ranked probability score (CRPS) [9] for assessing the visual quality and uncertainty modeling capabilities of the investigated methods. CRPS measures the discrepancy between the predicted distribution and the true distribution. When the predicted distribution collapses into a single value, as in deterministic models, CRPS reduces to the mean absolute error (MAE). A lower CRPS value indicates higher forecast accuracy. Scores for all involved metrics are calculated using an ensemble of eight samples from each model.
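As a reference for the metric definitions above, here is a sketch of CSI with optional spatial pooling. The thresholds follow the paper; the aggregation used for pooled CSI (max pooling here) is an assumption based on the pooled-metric protocol of [41], not a detail confirmed by the text.

```python
import torch
import torch.nn.functional as F

THRESHOLDS = [16, 74, 133, 160, 181, 219]  # on the 0-255 rescaled VIL range

def csi(pred, truth, thresh, pool=1):
    """CSI = #Hits / (#Hits + #Misses + #F.Alarms) at a single threshold.

    pred, truth: tensors of shape (B, L, H, W) with values in [0, 255].
    pool > 1 evaluates on spatially pooled binary fields (max pooling
    assumed here)."""
    p = (pred >= thresh).float()
    t = (truth >= thresh).float()
    if pool > 1:
        p = F.max_pool2d(p.flatten(0, 1), pool)  # (B*L, H/pool, W/pool)
        t = F.max_pool2d(t.flatten(0, 1), pool)
    hits = (p * t).sum()
    misses = ((1 - p) * t).sum()
    false_alarms = (p * (1 - t)).sum()
    return (hits / (hits + misses + false_alarms + 1e-8)).item()

# Mean CSI over all thresholds, as reported in Table 2:
# mean_csi = sum(csi(pred, truth, th) for th in THRESHOLDS) / len(THRESHOLDS)
```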
#### 3.2.1 Comparison to the State of the Art

We adjust the configurations of the involved baselines accordingly and tune some of the hyperparameters for adaptation to the SEVIR dataset. More implementation details of the baselines are provided in Appendix B.2. The experimental results listed in Table 2 show that probabilistic spatiotemporal forecasting methods are not good at achieving high CSI scores. However, they are more powerful at capturing the patterns and the true distribution of the data, hence achieving much better FVD scores and CSI-pool16.

Table 2: Performance comparison on SEVIR. The Critical Success Index, also known as the intersection over union (IoU), is calculated at different precipitation thresholds and denoted as CSI-thresh. CSI reports the mean over CSI-[16, 74, 133, 160, 181, 219]. CSI-pool_s with s = 4 and s = 16 reports the CSI at pooling scales of 4×4 and 16×16. Besides, we include the continuous ranked probability score (CRPS) for probabilistic forecast assessment, and Fréchet Video Distance (FVD) scores for evaluating visual quality.

| Model | #Param. (M) | FVD | CRPS | CSI | CSI-pool4 | CSI-pool16 |
|---|---|---|---|---|---|---|
| Persistence | - | 525.2 | 0.0526 | 0.2613 | 0.3702 | 0.4690 |
| UNet [55] | 16.6 | 753.6 | 0.0353 | 0.3593 | 0.4098 | 0.4805 |
| ConvLSTM [47] | 14.0 | 659.7 | 0.0332 | 0.4185 | 0.4452 | 0.5135 |
| PredRNN [61] | 46.6 | 663.5 | 0.0306 | 0.4080 | 0.4497 | 0.5005 |
| PhyDNet [11] | 13.7 | 723.2 | 0.0319 | 0.3940 | 0.4379 | 0.4854 |
| E3D-LSTM [60] | 35.6 | 600.1 | 0.0297 | 0.4038 | 0.4492 | 0.4961 |
| Rainformer [1] | 184.0 | 760.5 | 0.0357 | 0.3661 | 0.4232 | 0.4738 |
| Earthformer [8] | 15.1 | 690.7 | 0.0304 | 0.4419 | 0.4567 | 0.5005 |
| DGMR [41] | 71.5 | 485.2 | 0.0435 | 0.2675 | 0.3431 | 0.4832 |
| VideoGPT [65] | 99.6 | 261.6 | 0.0381 | 0.3653 | 0.4349 | 0.5798 |
| LDM [42] | 438.6 | 133.0 | 0.0280 | 0.3580 | 0.4022 | 0.5522 |
| PreDiff | 220.5 | 33.05 | 0.0246 | 0.4100 | 0.4624 | 0.6244 |
| PreDiff-KA ($[-2\sigma_\tau, 2\sigma_\tau]$) | 229.4 | 34.18 | - | - | - | - |

Figure 3: A set of example forecasts from baselines and PreDiff on the SEVIR test set. From top to bottom: context sequence $y$, target sequence $x$, forecasts from ConvLSTM [47] (CSI = 0.339, CSI-pool16 = 0.471), Earthformer [8] (CSI = 0.339, CSI-pool16 = 0.454), VideoGPT [65] (CSI = 0.309, CSI-pool16 = 0.566), LDM [42] (CSI = 0.266, CSI-pool16 = 0.568), and PreDiff (CSI = 0.329, CSI-pool16 = 0.619).

Qualitative results shown in Fig. 3 demonstrate that CSI is not aligned with human perceptual judgment. For such a complex system, deterministic methods give up capturing the real patterns and resort to averaging the possible futures, i.e., blurry predictions, to keep the scores from appearing too inaccurate. Probabilistic approaches, of which PreDiff is the best, though not favored by per-pixel metrics, perform better at capturing the data distribution within a local area, resulting in higher CSI-pool16 and lower CRPS, and succeed in keeping the correct local patterns, which can be crucial for recognizing weather events. More detailed quantitative results on SEVIR are provided in Appendix D.

#### 3.2.2 Knowledge Alignment: Anticipated Average Intensity

Earth system observation data, such as the Vertically Integrated Liquid (VIL) data in SEVIR, are usually not physically complete, posing challenges for directly incorporating physical laws for guidance. However, with the highly flexible knowledge alignment mechanism, we can still utilize auxiliary prior knowledge to guide the forecasting effectively. Specifically, for precipitation nowcasting on SEVIR, we use the anticipated precipitation intensity to align the generations, so as to simulate possible extreme weather events. We denote the average intensity of a data sequence as $I(x) \in \mathbb{R}^+$. In order to estimate the conditional quantiles of the future intensity, we train a simple probabilistic time series forecasting model with a parametric (Gaussian) distribution $p_\tau(I(x) \mid [I(y_j)]) = \mathcal{N}(\mu_\tau([I(y_j)]), \sigma_\tau([I(y_j)]))$ that predicts the distribution of the average future intensity $I(x)$ given the average intensity of each context frame $[I(y_j)]_{j=1}^{L_{in}}$ (abbreviated as $[I(y_j)]$). By incorporating $\mathcal{F}(\hat{x}, y) \triangleq I(\hat{x})$ and $\mathcal{F}_0(y) \triangleq \mu_\tau + n\sigma_\tau$ for knowledge alignment, PreDiff-KA gains the capability of generating forecasts for potential extreme cases, e.g., where $I(\hat{x})$ falls outside the typical range $\mu_\tau \pm \sigma_\tau$.

Figure 4: A set of example forecasts from PreDiff-KA, i.e., PreDiff under the guidance of anticipated average intensity. From top to bottom: context sequence $y$, target sequence $x$ (average intensity 0.137), a forecast from PreDiff (average intensity 0.126), and forecasts from PreDiff-KA showcasing different levels of anticipated future intensity $\mu_\tau + n\sigma_\tau$, where $n$ takes the values 4, 2, -2, -4 (average intensities 0.201, 0.192, 0.104, 0.067, respectively).

Fig. 4 shows a set of generations from PreDiff and PreDiff-KA with anticipated future intensity $\mu_\tau + n\sigma_\tau$, $n \in \{4, 2, -2, -4\}$. This qualitative example demonstrates that PreDiff is not only capable of capturing the distribution of the future, but is also flexible at highlighting possible extreme cases like rainstorms and droughts via the knowledge alignment mechanism, which is crucial for decision-making and precaution. According to Table 2, the FVD score of PreDiff-KA (34.18) is only slightly worse than that of PreDiff (33.05). This indicates that knowledge alignment effectively aligns the generations with the prior knowledge while maintaining fidelity and adherence to the true data distribution.
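The following sketch shows how the anticipated-intensity target $\mathcal{F}_0(y) = \mu_\tau + n\sigma_\tau$ can be assembled and plugged into the aligned sampler sketched in Sec. 2.3; `mu_tau` and `sigma_tau` are hypothetical heads of the Gaussian time series model described above.

```python
import torch

def intensity_target(mu_tau, sigma_tau, y, n):
    """Builds the alignment target F0(y) = mu_tau + n * sigma_tau for
    anticipated average intensity.

    y: (B, L_in, H, W, C) context; mu_tau / sigma_tau: hypothetical heads of
    the Gaussian time series model over per-frame average intensities."""
    ctx = y.mean(dim=(2, 3, 4))              # [I(y_j)]: per-frame average intensity, (B, L_in)
    return mu_tau(ctx) + n * sigma_tau(ctx)  # (B,) anticipated future intensity

# With F(x_hat, y) := I(x_hat), passing F0 = intensity_target(mu_tau, sigma_tau, y, n=4)
# to the aligned sampler from Sec. 2.3 steers generations toward a +4 sigma
# (rainstorm-like) scenario; n = -4 steers toward a drought-like scenario.
```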
## 4 Conclusions and Broader Impacts

In this paper, we propose PreDiff, a novel latent diffusion model for precipitation nowcasting. We also introduce a general two-stage pipeline for training DL models for Earth system forecasting. Specifically, we develop a knowledge alignment mechanism that is capable of guiding PreDiff to generate forecasts in compliance with domain-specific prior knowledge. Experiments demonstrate that our method achieves state-of-the-art performance on the N-body MNIST and SEVIR datasets.

Our work has certain limitations: 1) Benchmark datasets and evaluation metrics for precipitation nowcasting and Earth system forecasting are still maturing compared to those in the computer vision domain. While we utilize conventional precipitation forecasting metrics and visual quality evaluation, aligning these assessments with expert judgment remains an open challenge. 2) The effective integration of physical principles and domain knowledge into DL models for precipitation nowcasting remains an active research area. Close collaboration between DL researchers and domain experts in meteorology and climatology will be key to developing hybrid models that effectively leverage both data-driven learning and scientific theory. 3) While Earth system observation data have grown substantially in recent years, high-quality data remain scarce in many domains. This scarcity can limit PreDiff's ability to accurately capture the true distribution, occasionally resulting in unrealistic forecast hallucinations under the guidance of prior knowledge as it attempts to circumvent the knowledge alignment mechanism. Further research on enhancing the sample efficiency of PreDiff and the knowledge alignment mechanism is needed.

In conclusion, PreDiff represents a promising advance in knowledge-aligned DL for Earth system forecasting, but work remains to improve benchmarking, incorporate scientific knowledge, and boost model robustness through collaborative research between AI and domain experts.

## References

[1] Cong Bai, Feng Sun, Jinglin Zhang, Yi Song, and Shengyong Chen. Rainformer: Features extraction balanced network for radar-based precipitation nowcasting. IEEE Geoscience and Remote Sensing Letters, 19:1-5, 2022.
[2] Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, pages 1-6, 2023.
[3] Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al. FengWu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. arXiv preprint arXiv:2304.02948, 2023.
[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.
[5] Lasse Espeholt, Shreya Agrawal, Casper Sønderby, Manoj Kumar, Jonathan Heek, Carla Bromberg, Cenk Gazen, Jason Hickey, Aaron Bell, and Nal Kalchbrenner. Skillful twelve hour precipitation forecasts using large context neural networks. arXiv preprint arXiv:2111.07470, 2021.
[6] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
[7] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873-12883, 2021.
[8] Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Wang, Mu Li, and Dit-Yan Yeung. Earthformer: Exploring space-time transformers for earth system forecasting. In NeurIPS, 2022.
[9] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[11] Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11474-11484, 2020.
[12] John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. Adaptive Fourier neural operators: Efficient token mixers for transformers. arXiv preprint arXiv:2111.13587, 2021.
[13] Shantanu Gupta, Hao Wang, Zachary Lipton, and Yuyang Wang. Correcting exposure bias for link recommendation. In ICML, 2021.
[14] Derek Hansen, Danielle C. Maddix, Shima Alizadeh, Gaurav Gupta, and Michael W. Mahoney. Learning physical models that can respect conservation laws. In Proceedings of the 40th International Conference on Machine Learning, volume 202, 2023.
[15] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022.
[16] Yusuke Hatanaka, Yannik Glaser, Geoff Galgon, Giuseppe Torri, and Peter Sadowski. Diffusion models for high-resolution solar forecasts. arXiv preprint arXiv:2302.00170, 2023.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[18] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[19] Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999-2049, 2020.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[21] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[23] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[24] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456. PMLR, 2015.
[26] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[29] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Alexander Pritzel, Suman Ravuri, Timo Ewalds, Ferran Alet, Zach Eaton-Rosen, et al. GraphCast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794, 2022.
[30] Jussi Leinonen, Ulrich Hamann, Daniele Nerini, Urs Germann, and Gabriele Franch. Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification. arXiv preprint arXiv:2304.12891, 2023.
[31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[32] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. VideoFusion: Decomposed diffusion models for high-quality video generation. arXiv e-prints, pages arXiv-2303, 2023.
[33] François Mazé and Faez Ahmed. TopoDiff: A performance and constraint-guided diffusion model for topology optimization. arXiv preprint arXiv:2208.09591, 2022.
[34] Lu Mi, Hao Wang, Yonglong Tian, and Nir Shavit. Training-free uncertainty estimation for neural networks. In AAAI, 2022.
[35] Mauri Valtonen and Hannu Karttunen. The Three-Body Problem. Cambridge University Press, 2006.
[36] Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models, 2023.
[37] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022.
[38] Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Günter Klambauer. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736-1741, 2018.
[39] Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020.
[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[41] Suman Ravuri, Karel Lenc, Matthew Willson, Dmitry Kangin, Remi Lam, Piotr Mirowski, Megan Fitzsimons, Maria Athanassiadou, Sheleem Kashem, Sam Madge, et al. Skilful precipitation nowcasting using deep generative models of radar. Nature, 597(7878):672-677, 2021.
[42] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[43] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500-22510, 2023.
[44] Nadim Saad, Gaurav Gupta, Shima Alizadeh, and Danielle C. Maddix. Guiding continuous operator learning through physics-based boundary constraints. In Proceedings of the 11th International Conference on Learning Representations, 2023.
[45] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[46] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[47] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NeurIPS, volume 28, 2015.
[48] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In NeurIPS, volume 30, 2017.
[49] Casper Kaae Sønderby, Lasse Espeholt, Jonathan Heek, Mostafa Dehghani, Avital Oliver, Tim Salimans, Shreya Agrawal, Jason Hickey, and Nal Kalchbrenner. MetNet: A neural weather model for precipitation forecasting. arXiv preprint arXiv:2003.12140, 2020.
[50] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, pages 843-852. PMLR, 2015.
[51] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In DGS@ICLR, 2019.
[52] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Neural Information Processing Systems (NeurIPS), 2021.
[53] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 29, 2016.
[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, volume 30, 2017.
[55] Mark Veillette, Siddharth Samsi, and Chris Mattioli. SEVIR: A storm event imagery dataset for deep learning applications in radar and satellite meteorology. Advances in Neural Information Processing Systems, 33:22009-22019, 2020.
[56] Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853, 2022.
[57] Hao Wang, Xingjian Shi, and Dit-Yan Yeung. Natural-parameter networks: A class of probabilistic neural networks. In NIPS, pages 118-126, 2016.
[58] Hao Wang and Dit-Yan Yeung. Towards Bayesian deep learning: A framework and some existing methods. TKDE, 28(12):3395-3408, 2016.
[59] Hao Wang and Dit-Yan Yeung. A survey on Bayesian deep learning. CSUR, 53(5):1-37, 2020.
[60] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3D LSTM: A model for video prediction and beyond. In International Conference on Learning Representations, 2018.
[61] Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, Philip Yu, and Mingsheng Long. PredRNN: A recurrent neural network for spatiotemporal predictive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[62] Ziyan Wang and Hao Wang. Variational imbalanced regression: Fair uncertainty quantification via probabilistic smoothing. In NeurIPS, 2023.
[63] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2019.
[64] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3-19, 2018.
[65] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.
[66] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[67] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. PhysDiff: Physics-guided human motion diffusion model. arXiv preprint arXiv:2212.02500, 2022.
[68] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[69] Lu Zhou and Rong-Hua Zhang. A self-attention-based neural network for three-dimensional multivariate modeling and its skillful ENSO predictions. Science Advances, 9(10):eadf2827, 2023.

## A Related Work

Deep learning for precipitation nowcasting. In recent years, the field of DL has experienced remarkable advancements, revolutionizing various domains of study, including Earth science. One area where DL has made particularly significant strides is Earth system forecasting, especially precipitation nowcasting. Precipitation nowcasting benefits from the success of DL architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, which have demonstrated their effectiveness in handling spatiotemporal tensors, the typical formulation of Earth system observation data. ConvLSTM [47], a pioneering approach in DL for precipitation nowcasting, combines the strengths of CNNs and LSTMs for processing spatial and temporal data. PredRNN [61] builds upon ConvLSTM by incorporating a spatiotemporal memory flow structure. E3D-LSTM [60] integrates 3D CNNs into LSTMs to enhance long-term high-level relation modeling. PhyDNet [11] incorporates partial differential equation (PDE) constraints in the latent space. MetNet [49] and its successor MetNet-2 [5] propose architectures based on ConvLSTM and dilated CNNs, enabling skillful precipitation forecasts up to twelve hours ahead. DGMR [41] takes an adversarial training approach to generate sharp and accurate nowcasts, addressing the issue of blurry predictions. In addition to precipitation nowcasting, there has been a surge in the modeling of global weather and medium-range weather forecasting due to the availability of extensive Earth observation data, such as the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 [19] dataset. Several DL-based models have emerged in this area. FourCastNet [37] proposes an architecture with Adaptive Fourier Neural Operators (AFNO) [12] as building blocks for autoregressive weather forecasting. FengWu [3] introduces a multi-modal, Transformer-based global medium-range weather forecast model that achieves skillful forecasts up to ten days ahead. GraphCast [29] represents weather phenomena as spatiotemporal graphs and applies graph neural networks for skillful medium-range global weather forecasting.
Pangu-Weather [2] proposes a 3D Transformer model with Earth-specific priors and a hierarchical temporal aggregation strategy for medium-range global weather forecasting. While recent years have seen remarkable progress in DL for precipitation nowcasting, existing methods still face limitations. Some methods are deterministic, failing to capture uncertainty and resulting in blurry generations. Others lack the capability to incorporate prior knowledge, which is crucial in machine learning for science. In contrast, PreDiff captures the uncertainty of the underlying data distribution via diffusion models, avoiding simply averaging all possibilities into blurry forecasts. Our knowledge alignment mechanism facilitates post-training alignment with physical principles and domain-specific prior knowledge.

Diffusion models. Diffusion models (DMs) [22] are a class of generative models that have become increasingly popular in recent years. DMs learn the data distribution by constructing a forward process that adds noise to the data and then approximating the reverse process to remove the noise. Latent diffusion models (LDMs) [42] are a variant of DMs trained on the latent vectors produced by a variational autoencoder. LDMs have been shown to be more efficient in both training and inference than the original DMs. Building on the success of DMs in image generation, DMs have also been adopted for video generation. MCVD [56] trains a DM by randomly masking past and/or future frames in blocks and conditioning on the remaining frames; it generates long videos by autoregressively sampling blocks of frames in a sliding-window manner. PVDM [66] projects videos into a low-dimensional latent space as 2D vectors and presents a joint training of unconditional and frame-conditional video generation. LFDM [36] employs a flow predictor to estimate latent flows between video frames and learns a DM for temporal latent flow generation. VideoFusion [32] decomposes the transition noise in DMs into per-frame noise and noise along the time axis, and trains two networks jointly to match the noise decomposition. While DMs have demonstrated impressive performance in video synthesis, their applications to precipitation nowcasting and other Earth science tasks have not been well explored. Hatanaka et al. [16] use DMs to super-resolve coarse numerical predictions for solar forecasting. Concurrently to our work, LDCast [30] applies LDMs to precipitation nowcasting. However, LDCast has not studied how to integrate prior knowledge into the DM, which is a unique advantage and novelty of PreDiff.

Conditional controls on diffusion models. Another key advantage of DMs is the ability to condition generation on text, class labels, and other modalities for controllable and diverse output. For instance, ControlNet [68] enables fine-tuning a pretrained DM by freezing the base model and training a copy end-to-end with conditional inputs. Composer [24] decomposes images into representative factors used as conditions to guide the generation. Beyond text and class labels, conditions in other modalities, including physical constraints, can also be leveraged to provide valuable guidance. TopoDiff [33] constrains topology optimization using loads, boundary conditions, and the volume fraction. PhysDiff [67] trains a physics-based motion projection module with reinforcement learning to project denoised motions in diffusion steps into physically plausible ones.
Nonetheless, while conditional control has proven to be a powerful technique in various domains, its application in DL for precipitation nowcasting remains an unexplored area.

## B Implementation Details

All experiments are conducted on machines with NVIDIA A10G GPUs (24GB memory). All models, including PreDiff, the knowledge alignment networks, and the baselines, fit in a single GPU without the need for gradient checkpointing or model parallelism.

### B.1 PreDiff

Frame-wise autoencoder. We follow [7, 42] to build frame-wise VAEs (not VQ-VAEs) and train them adversarially from scratch on N-body MNIST and SEVIR frames. As described in Sec. 2.2, on the N-body MNIST dataset the spatial downsampling ratio is 4×4: a frame $x_j \in \mathbb{R}^{64 \times 64 \times 1}$ is encoded to $z_j \in \mathbb{R}^{16 \times 16 \times 3}$ by parameterizing $p_\mathcal{E}(z_j|x_j) = \mathcal{N}(\mu_\mathcal{E}(x_j), \sigma_\mathcal{E}(x_j))$. On the SEVIR dataset the spatial downsampling ratio is 8×8: a frame $x_j \in \mathbb{R}^{128 \times 128 \times 1}$ is encoded to $z_j \in \mathbb{R}^{16 \times 16 \times 4}$ similarly. The detailed configurations of the encoder and decoder of the VAE on N-body MNIST are shown in Table 3 and Table 4, and those on SEVIR in Table 5 and Table 6. The discriminators for adversarial training on the N-body MNIST and SEVIR datasets share the same configuration, shown in Table 7.

Figure 5: Earthformer-UNet architecture. PreDiff employs an Earthformer-UNet as the backbone for parameterizing the latent diffusion model $p_\theta(z_t|z_{t+1}, z_{cond})$. It takes the concatenation of the latent context $z_{cond}$ (in the blue border) and the previous-step noisy latent future $z_{t+1}$ (in the cyan border) along the temporal dimension (the sequence-length axis) as input, and outputs $z_t$. (Best viewed in color.)

Latent diffusion model instantiating $p_\theta(z_{t-1}|z_t, z_{cond})$. Stemming from Earthformer [8], we build Earthformer-UNet, a hierarchical UNet with self cuboid attention [8] layers as basic building blocks, as shown in Fig. 5. On N-body MNIST, it takes the concatenation along the temporal dimension (the sequence-length axis) of $z_{cond} \in \mathbb{R}^{10 \times 16 \times 16 \times 3}$ and $z_t \in \mathbb{R}^{10 \times 16 \times 16 \times 3}$ as input, and outputs $z_{t-1} \in \mathbb{R}^{10 \times 16 \times 16 \times 3}$. On SEVIR, it takes the concatenation along the temporal dimension of $z_{cond} \in \mathbb{R}^{7 \times 16 \times 16 \times 4}$ and $z_t \in \mathbb{R}^{6 \times 16 \times 16 \times 4}$ as input, and outputs $z_{t-1} \in \mathbb{R}^{6 \times 16 \times 16 \times 4}$. Besides, we add the embedding of the denoising step $t$ to the state in front of each cuboid attention block via an embedding layer TEmbed, following [22]. The detailed configuration of the Earthformer-UNet is described in Table 8.

Knowledge alignment networks. A knowledge alignment network parameterizes $U_\phi(z_t, t, y)$ to predict $\mathcal{F}(\hat{x}, y)$ from the noisy latent $z_t$. In practice, we build an Earthformer encoder [8] with a final pooling block as the knowledge alignment network, parameterizing $U_\phi(z_t, t, z_{cond})$: it takes $t$ and the concatenation of $z_{cond}$ and $z_t$ as inputs, instead of $t$, $y$, and $z_t$. We find this implementation accurate enough when $t$ is small. The detailed configuration of the knowledge alignment network is described in Table 9.

Optimization. We train the frame-wise VAEs using the Adam optimizer [27], following [7]. We train the latent Earthformer-UNet and the knowledge alignment network using the AdamW optimizer [31], following [8]. Detailed configurations are shown in Table 10, Table 11, and Table 12 for the frame-wise VAE, the latent Earthformer-UNet, and the knowledge alignment network, respectively. We adopt data parallelism and gradient accumulation to reach a larger total batch size when the GPU can only afford a smaller micro-batch size.
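Gradient accumulation as used above can be sketched as follows; `accum_steps` and the loop structure are illustrative, not the paper's exact training configuration.

```python
import torch

def train_with_grad_accum(model, loss_fn, loader, optimizer, accum_steps=4):
    """Accumulates gradients over `accum_steps` micro-batches so the effective
    batch size is accum_steps x micro-batch size, while only one micro-batch
    resides on the GPU at a time. accum_steps=4 is illustrative, not the
    paper's setting."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model, x, y) / accum_steps  # scale so accumulated grads average
        loss.backward()                            # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```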
We adopt data parallelism and gradient accumulation to reach a larger total batch size when a single GPU can only afford a smaller micro batch size.

Table 3: The details of the encoder of the frame-wise VAE on N-body MNIST frames. It encodes an input frame x^j ∈ R^{64×64×1} into a latent z^j ∈ R^{16×16×3}. Conv3×3 is the 2D convolutional layer with a 3×3 kernel. GroupNorm32 is the Group Normalization (GN) layer [64] with 32 groups. SiLU is the Sigmoid Linear Unit activation layer [18] with SiLU(x) = x · sigmoid(x). Attention is the self attention layer [54] that first maps the input to queries Q, keys K and values V with three Linear layers, and then applies Attention(x) = Softmax(QKᵀ/√C)V.

Block | Layers | Resolution | Channels
Input x^j | - | 64×64 | 1
2D CNN | Conv3×3 | 64×64 | 1 → 128
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 64×64 | 128
Downsampler | Conv3×3 | 64×64 → 32×32 | 128
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 32×32 | 128 → 256, 256
Downsampler | Conv3×3 | 32×32 → 16×16 | 256
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 256 → 512, 512
Self Attention Block | GroupNorm32, Attention, Linear | 16×16 | 512
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 512
Output Block | GroupNorm32, SiLU, Conv3×3, Conv3×3 | 16×16 | 512 → 6

Table 4: The details of the decoder of the frame-wise VAE on N-body MNIST frames. It decodes a latent z^j ∈ R^{16×16×3} back to a frame in pixel space x^j ∈ R^{64×64×1}. Conv3×3, GroupNorm32, SiLU and Attention are as defined in Table 3.

Block | Layers | Resolution | Channels
Input z^j | - | 16×16 | 3
2D CNN | Conv3×3, Conv3×3 | 16×16 | 3 → 512
Self Attention Block | GroupNorm32, Attention, Linear | 16×16 | 512
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 512
Upsampler | Conv3×3 | 16×16 → 32×32 | 512
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 32×32 | 512 → 256, 256, 256
Upsampler | Conv3×3 | 32×32 → 64×64 | 256
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 64×64 | 256 → 128, 128, 128
Output Block | GroupNorm32, SiLU, Conv3×3 | 64×64 | 128 → 1
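As a concrete reading of the Attention rows in Tables 3 and 4, the single-head self attention operation can be sketched as below; the module and parameter names are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    """Single-head self attention over spatial positions, as in the VAE tables:
    Attention(x) = Softmax(Q K^T / sqrt(C)) V, followed by a Linear projection."""

    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q, k, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        out = self.proj(attn @ v)              # (B, H*W, C)
        return out.transpose(1, 2).reshape(B, C, H, W)
```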
Table 5: The details of the encoder of the frame-wise VAE on SEVIR frames. It encodes an input frame x^j ∈ R^{128×128×1} into a latent z^j ∈ R^{16×16×4}. Conv3×3, GroupNorm32, SiLU and Attention are as defined in Table 3.

Block | Layers | Resolution | Channels
Input x^j | - | 128×128 | 1
2D CNN | Conv3×3 | 128×128 | 1 → 128
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 128×128 | 128
Downsampler | Conv3×3 | 128×128 → 64×64 | 128
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 64×64 | 128 → 256, 256
Downsampler | Conv3×3 | 64×64 → 32×32 | 256
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 32×32 | 256 → 512, 512
Downsampler | Conv3×3 | 32×32 → 16×16 | 512
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 512
Self Attention Block | GroupNorm32, Attention, Linear | 16×16 | 512
ResNet Block ×2 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 512
Output Block | GroupNorm32, SiLU, Conv3×3, Conv3×3 | 16×16 | 512 → 8

Table 6: The details of the decoder of the frame-wise VAE on SEVIR frames. It decodes a latent z^j ∈ R^{16×16×4} back to a frame in pixel space x^j ∈ R^{128×128×1}. Conv3×3, GroupNorm32, SiLU and Attention are as defined in Table 3.

Block | Layers | Resolution | Channels
Input z^j | - | 16×16 | 4
2D CNN | Conv3×3, Conv3×3 | 16×16 | 4 → 512
Self Attention Block | GroupNorm32, Attention, Linear | 16×16 | 512
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 16×16 | 512
Upsampler | Conv3×3 | 16×16 → 32×32 | 512
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 32×32 | 512
Upsampler | Conv3×3 | 32×32 → 64×64 | 512
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 64×64 | 512 → 256, 256, 256
Upsampler | Conv3×3 | 64×64 → 128×128 | 256
ResNet Block ×3 | GroupNorm32, Conv3×3, GroupNorm32, Conv3×3, SiLU | 128×128 | 256 → 128, 128, 128
Output Block | GroupNorm32, SiLU, Conv3×3 | 128×128 | 128 → 1

Table 7: The details of the discriminator for the adversarial loss on N-body MNIST and SEVIR frames. Conv4×4 is the 2D convolutional layer with a 4×4 kernel, 2×2 or 1×1 stride, and 1×1 padding. BatchNorm is the Batch Normalization (BN) layer [25]. The negative slope of LeakyReLU is 0.2.

Block | Layer | Resolution (N-body MNIST / SEVIR) | Channels
Input x^j | - | 64×64 / 128×128 | 1
2D CNN | Conv4×4 | 64×64 → 32×32 / 128×128 → 64×64 | 1 → 64
Downsampler | LeakyReLU | 32×32 / 64×64 | 64
Downsampler | Conv4×4 | 32×32 → 16×16 / 64×64 → 32×32 | 64 → 128
Downsampler | BatchNorm | 16×16 / 32×32 | 128
Downsampler | LeakyReLU | 16×16 / 32×32 | 128
Downsampler | Conv4×4 | 16×16 → 8×8 / 32×32 → 16×16 | 128 → 256
Downsampler | BatchNorm | 8×8 / 16×16 | 256
Downsampler | LeakyReLU | 8×8 / 16×16 | 256
Downsampler | Conv4×4 | 8×8 → 7×7 / 16×16 → 15×15 | 256 → 512
Downsampler | BatchNorm | 7×7 / 15×15 | 512
Output Block | LeakyReLU | 7×7 / 15×15 | 512
Output Block | Conv4×4 | 7×7 → 6×6 / 15×15 → 14×14 | 512 → 1
Output Block | AvgPool | 6×6 → 1×1 / 15×15 → 1×1 | 1
Table 8: The details of the Earthformer-UNet as the latent diffusion backbone on N-body MNIST and SEVIR datasets. The ConcatMask layer in the Observation Mask block concatenates one extra channel to the input to indicate whether the input is the encoded observation z_cond (marked 1) or the noisy latent z_t (marked 0). Conv3×3 is the 2D convolutional layer with a 3×3 kernel. GroupNorm32 is the Group Normalization (GN) layer [64] with 32 groups; if the number of input channels is smaller than 32, the number of groups is set to the number of channels. SiLU is the Sigmoid Linear Unit activation layer [18] with SiLU(x) = x · sigmoid(x). The negative slope of LeakyReLU is 0.1. Dropout is the dropout layer [21] that zeroes an element with probability 0.1. The FFN consists of two Linear layers separated by a GELU activation layer [18]. PosEmbed is the positional embedding layer [54] that adds learned positional embeddings to the input. TEmbed is the embedding layer [22] that embeds the denoising step t. PatchMerge splits a 2D input tensor with C channels into N non-overlapping p×p patches, merges the spatial dimensions into channels to obtain N 1×1 patches with p²·C channels, and concatenates them back along the spatial dimensions. Residual connections [17] are added from blocks in the downsampling phase to the corresponding blocks in the upsampling phase.

Block | Layers | Spatial Resolution | Channels (N-body MNIST / SEVIR)
Input [z_cond, z_t] | - | 16×16 | 3 / 4
Observation Mask | ConcatMask | 16×16 | 3 → 4 / 4 → 5
Observation Mask | GroupNorm32, SiLU | 16×16 | 4 / 5
Observation Mask | Conv3×3 | 16×16 | 4 → 256 / 5 → 256
Observation Mask | GroupNorm32, SiLU, Dropout, Conv3×3 | 16×16 | 256
Positional Embedding | PosEmbed | 16×16 | 256
Cuboid Attention Block ×4 | TEmbed, LayerNorm, Cuboid(T, 1, 1), FFN, LayerNorm, Cuboid(1, H, 1), FFN, LayerNorm, Cuboid(1, 1, W), FFN | 16×16 | 256
Downsampler | PatchMerge, LayerNorm, Linear | 16×16 → 8×8 | 256 → 1024
Cuboid Attention Block ×8 | TEmbed, LayerNorm, Cuboid(T, 1, 1), FFN, LayerNorm, Cuboid(1, H, 1), FFN, LayerNorm, Cuboid(1, 1, W), FFN | 8×8 | 1024
Upsampler | Nearest-Neighbor Interp, Conv3×3 | 8×8 → 16×16 | 1024 → 256
Cuboid Attention Block ×4 | TEmbed, LayerNorm, Cuboid(T, 1, 1), FFN, LayerNorm, Cuboid(1, H, 1), FFN, LayerNorm, Cuboid(1, 1, W), FFN | 16×16 | 256
Output Block | Linear | 16×16 | 256 → 3 / 256 → 4
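To make the Table 8 input assembly concrete, the following minimal sketch (assumed PyTorch with a (batch, time, height, width, channel) layout; all tensor names are hypothetical) shows the temporal concatenation and ConcatMask step on SEVIR-shaped latents, followed by the PatchMerge space-to-depth operation used by the Downsampler.

```python
import torch

B = 2
z_cond = torch.randn(B, 7, 16, 16, 4)   # encoded 7-step SEVIR context
z_t = torch.randn(B, 6, 16, 16, 4)      # noisy 6-step latent future

# Concatenate along the temporal axis, then ConcatMask appends one indicator
# channel: 1 where the slice is z_cond, 0 where it is z_t (Table 8).
x = torch.cat([z_cond, z_t], dim=1)                      # (B, 13, 16, 16, 4)
mask = torch.cat([torch.ones(B, 7, 16, 16, 1),
                  torch.zeros(B, 6, 16, 16, 1)], dim=1)
x = torch.cat([x, mask], dim=-1)                         # (B, 13, 16, 16, 5)

# PatchMerge (Downsampler): split H x W into non-overlapping p x p patches and
# fold the patch pixels into channels, halving the spatial resolution for p = 2.
p = 2
B_, T, H, W, C = x.shape
x_merged = (x.reshape(B_, T, H // p, p, W // p, p, C)
              .permute(0, 1, 2, 4, 3, 5, 6)
              .reshape(B_, T, H // p, W // p, p * p * C))  # (B, 13, 8, 8, 20)
```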
Table 9: The details of the Earthformer encoders parameterizing the knowledge alignment networks U_ϕ(z_t, t, z_cond) on N-body MNIST and SEVIR datasets. ConcatMask, Conv3×3, GroupNorm32, SiLU, Dropout, FFN, PosEmbed, TEmbed and PatchMerge are as defined in Table 8. The Attention here is the self attention layer [54] with an extra cls token for information aggregation: it first flattens the input and concatenates it with the cls token, maps the result to queries Q, keys K and values V with three Linear layers, and applies Attention(x) = Softmax(QKᵀ/√C)V. Finally, the value of the cls token after the self attention operation serves as the layer's output.

Block | Layers | Spatial Resolution | Channels (N-body MNIST / SEVIR)
Input [z_cond, z_t] | - | 16×16 | 3 / 4
Observation Mask | ConcatMask | 16×16 | 3 → 4 / 4 → 5
Observation Mask | GroupNorm32, SiLU | 16×16 | 4 / 5
Observation Mask | Conv3×3 | 16×16 | 4 → 64 / 5 → 64
Observation Mask | GroupNorm32, SiLU, Dropout, Conv3×3 | 16×16 | 64
Positional Embedding | PosEmbed | 16×16 | 64
Cuboid Attention Block | TEmbed, LayerNorm, Cuboid(T, 1, 1), FFN, LayerNorm, Cuboid(1, H, 1), FFN, LayerNorm, Cuboid(1, 1, W), FFN | 16×16 | 64
Downsampler | PatchMerge, LayerNorm, Linear | 16×16 → 8×8 | 64 → 256
Cuboid Attention Block | TEmbed, LayerNorm, Cuboid(T, 1, 1), FFN, LayerNorm, Cuboid(1, H, 1), FFN, LayerNorm, Cuboid(1, 1, W), FFN | 8×8 | 256
Output Pooling Block | GroupNorm32 | 8×8 | 256
Output Pooling Block | Attention | 8×8 → 1 | 256
Output Pooling Block | Linear | 1 | 256 → 1

Table 10: Hyperparameters of the Adam optimizer for training the frame-wise VAEs and discriminators on N-body MNIST and SEVIR datasets.

Hyper-parameter | VAE | Discriminator
Learning rate | 4.5e-6 | 4.5e-6
β1 | 0.5 | 0.5
β2 | 0.9 | 0.9
Weight decay | 1e-2 | 1e-2
Batch size | 512 | 512
Training epochs | 200 | 200
Training start step | - | 50000

Table 11: Hyperparameters of the AdamW optimizer for training the LDMs on N-body MNIST and SEVIR datasets.

Hyper-parameter | Value
Learning rate | 1.0e-3
β1 | 0.9
β2 | 0.999
Weight decay | 1e-5
Batch size | 64
Training epochs | 1000
Warm-up percentage | 10%
Learning rate decay | Cosine

Table 12: Hyperparameters of the AdamW optimizer for training the knowledge alignment networks on N-body MNIST and SEVIR datasets.

Hyper-parameter | Value
Learning rate | 1.0e-3
β1 | 0.9
β2 | 0.999
Weight decay | 1e-5
Batch size | 64
Training epochs | 200
Warm-up percentage | 10%
Learning rate decay | Cosine
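Read as code, the optimizer settings in Table 11 amount to the following sketch. The cosine schedule with linear warm-up is our interpretation of the "Warm-up percentage" and "Learning rate decay" rows, and `make_ldm_optimizer` is a hypothetical helper, not part of the released implementation.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_ldm_optimizer(model, total_steps, warmup_pct=0.10):
    # Table 11: lr 1e-3, betas (0.9, 0.999), weight decay 1e-5.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.0e-3,
                                  betas=(0.9, 0.999), weight_decay=1.0e-5)
    warmup_steps = max(1, int(warmup_pct * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                      # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    return optimizer, LambdaLR(optimizer, lr_lambda=lr_lambda)
```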
B.2 Baselines

We train the baseline algorithms following their officially released configurations and tune the learning rate, learning rate scheduler, working resolution, etc., to optimize their performance on each dataset. The modifications applied to each baseline are listed in Table 13.

Table 13: Implementation details of baseline algorithms. Modifications based on the officially released implementations are listed per dataset. "-" means no modification is applied. "reverse enc-dec" means adopting the reversed encoder-decoder architecture proposed in [48]. The other terms listed are hyperparameters in the officially released implementations.

Model | N-body MNIST | SEVIR
UNet [55] | - | -
ConvLSTM [47] | reverse enc-dec [48]; conv_kernels = [(7,7),(5,5),(3,3)]; deconv_kernels = [(6,6),(4,4),(4,4)]; channels = [96, 128, 256] | reverse enc-dec [48]; conv_kernels = [(7,7),(5,5),(3,3)]; deconv_kernels = [(6,6),(4,4),(4,4)]; channels = [96, 128, 256]
PredRNN [61] | - | -
PhyDNet [11] | - | convcell_hidden = [256, 256, 256, 64]
E3D-LSTM [60] | - | -
Rainformer [1] | downscaling_factors = [2, 2, 2, 2]; hidden_dim = 32; heads = [4, 4, 8, 16]; head_dim = 8 | downscaling_factors = [4, 2, 2, 2]
Earthformer [8] | - | -
DGMR [41] | - | context_steps = 7; forecast_steps = 6
VideoGPT [65] | vqvae_n_codes = 512; vqvae_downsample = [1, 4, 4]; vae: 64×64×1 → 16×16×3; conv_dim = 3; model_channels = 256 | vqvae_n_codes = 512; vqvae_downsample = [1, 8, 8]; vae: 128×128×1 → 16×16×4; conv_dim = 3; model_channels = 256

C Derivation of the Approximation to Knowledge Alignment Guidance

We derive the approximation to the knowledge-alignment-guided denoising transition (5) following [4]. We rewrite (5) as (8) using a normalization constant Z chosen so that Z ∫ e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖} dz_t = 1:

p_{θ,ϕ}(z_t | z_{t+1}, y, F_0) = p_θ(z_t | z_{t+1}, z_cond) · Z e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖}.  (8)

In what follows, we abbreviate µ_θ(z_{t+1}, t, z_cond) as µ_θ and Σ_θ(z_{t+1}, t, z_cond) as Σ_θ for brevity, and use C_i, i ∈ {1, ..., 7}, to denote constants:

p_θ(z_t | z_{t+1}, z_cond) = N(µ_θ, Σ_θ),
log p_θ(z_t | z_{t+1}, z_cond) = −(1/2)(z_t − µ_θ)ᵀ Σ_θ^{−1} (z_t − µ_θ) + C_1,   (9)
log Z e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖} = −λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖ + C_2.

By assuming that log Z e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖} has low curvature compared to Σ_θ^{−1}, which is reasonable in the limit of infinitely many diffusion steps (‖Σ_θ‖ → 0), we can approximate it by a first-order Taylor expansion at z_t = µ_θ:

log Z e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖} ≈ −λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖ |_{z_t=µ_θ} + (z_t − µ_θ) g = (z_t − µ_θ) g + C_3,   (10)

where g = −λ_F ∇_{z_t} ‖U_ϕ(z_t, t, y) − F_0(y)‖ |_{z_t=µ_θ}. By taking the log of (8) and applying the results from (9) and (10), we get

log p_{θ,ϕ}(z_t | z_{t+1}, y, F_0) = log p_θ(z_t | z_{t+1}, z_cond) + log Z e^{−λ_F ‖U_ϕ(z_t, t, y) − F_0(y)‖}
≈ −(1/2)(z_t − µ_θ)ᵀ Σ_θ^{−1} (z_t − µ_θ) + (z_t − µ_θ) g + C_4
= −(1/2)(z_t − µ_θ − Σ_θ g)ᵀ Σ_θ^{−1} (z_t − µ_θ − Σ_θ g) + (1/2) gᵀ Σ_θ g + C_5
= −(1/2)(z_t − µ_θ − Σ_θ g)ᵀ Σ_θ^{−1} (z_t − µ_θ − Σ_θ g) + C_6
= log p(z) + C_7,  z ∼ N(µ_θ + Σ_θ g, Σ_θ).

Therefore, the transition distribution under knowledge alignment guidance in (5) can be approximated by a Gaussian similar to the transition without knowledge guidance, but with its mean shifted by Σ_θ g.
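The resulting guided transition z ∼ N(µ_θ + Σ_θ g, Σ_θ) can be sketched as a single sampling step. This is a minimal sketch assuming a diagonal covariance and hypothetical `unet` and `U_phi` callables; it illustrates the approximation above, not the released implementation.

```python
import torch

def guided_step(unet, U_phi, z_t, t, z_cond, F0, lam_F):
    # Mean and (diagonal) variance of the unguided transition p_theta(z_{t-1} | z_t, z_cond).
    mu, var = unet(z_t, t, z_cond)

    # g = -lambda_F * grad_z ||U_phi(z, t, z_cond) - F_0|| evaluated at z = mu_theta.
    mu_req = mu.detach().requires_grad_(True)
    deviation = (U_phi(mu_req, t, z_cond) - F0).norm()
    g = -lam_F * torch.autograd.grad(deviation, mu_req)[0]

    # Shift the mean by Sigma_theta * g, then sample from the shifted Gaussian.
    guided_mean = mu + var * g
    return guided_mean + var.sqrt() * torch.randn_like(mu)
```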
D More Quantitative Results on SEVIR

D.1 Quantitative Analysis of BIAS on SEVIR

Similar to the Critical Success Index (CSI) introduced in Sec. 3.2, BIAS = (#Hits + #F.Alarms) / (#Hits + #Misses) is calculated by counting the #Hits (truth = 1, pred = 1), #Misses (truth = 1, pred = 0) and #F.Alarms (truth = 0, pred = 1) of the predictions binarized at the thresholds [16, 74, 133, 160, 181, 219]. This measurement assesses the model's inclination toward either F.Alarms or Misses. The results in Table 14 demonstrate that deterministic spatiotemporal forecasting models, such as UNet [55], ConvLSTM [47], PredRNN [61], PhyDNet [11], E3D-LSTM [60] and Earthformer [8], tend to produce predictions with lower intensity: with their limited ability to handle uncertainty, they prioritize avoiding high-intensity predictions, which have a higher chance of being incorrect. The probabilistic spatiotemporal forecasting baselines, including DGMR [41], VideoGPT [65] and LDM [42], take a more daring approach, predicting possible high-intensity signals even at the cost of lower CSI scores, as shown in Table 2. Among all these methods, Pre Diff achieves the best performance in BIAS, consistently scoring closest to 1 irrespective of the chosen threshold. These results demonstrate that Pre Diff has effectively learned to capture the distribution of intensity without bias.

Table 14: Quantitative analysis of BIAS on SEVIR. BIAS is calculated at different precipitation thresholds and denoted as BIAS-thresh. BIAS-m reports the mean of BIAS-[16, 74, 133, 160, 181, 219]. A BIAS score closer to 1 indicates that the model is less biased toward either F.Alarms or Misses. The best BIAS score is in boldface; the second best is underscored.

Model | BIAS-m | BIAS-219 | BIAS-181 | BIAS-160 | BIAS-133 | BIAS-74 | BIAS-16
Persistence | 1.0177 | 1.0391 | 1.0323 | 1.0258 | 1.0099 | 1.0016 | 0.9983
UNet [55] | 0.6658 | 0.2503 | 0.4013 | 0.5428 | 0.7665 | 0.9551 | 1.0781
ConvLSTM [47] | 0.8341 | 0.5344 | 0.6811 | 0.7626 | 0.9643 | 0.9957 | 1.0663
PredRNN [61] | 0.6605 | 0.2565 | 0.4377 | 0.4909 | 0.6806 | 0.9419 | 1.1554
PhyDNet [11] | 0.6798 | 0.3970 | 0.6593 | 0.7312 | 1.0543 | 1.0553 | 1.2238
E3D-LSTM [60] | 0.6925 | 0.2696 | 0.4861 | 0.5686 | 0.8352 | 0.9887 | 1.0070
Earthformer [8] | 0.7043 | 0.2423 | 0.4605 | 0.5734 | 0.8623 | 0.9733 | 1.1140
DGMR [41] | 0.7302 | 0.3704 | 0.5254 | 0.6495 | 0.8312 | 0.9594 | 1.0456
VideoGPT [65] | 0.8594 | 0.6106 | 0.7738 | 0.8629 | 0.9606 | 0.9681 | 0.9805
LDM [42] | 1.2951 | 1.4534 | 1.3525 | 1.3827 | 1.3154 | 1.1817 | 1.0847
Pre Diff | 0.9769 | 0.9647 | 0.9268 | 0.9617 | 0.9978 | 1.0047 | 1.0058

D.2 CSI at Varying Thresholds on SEVIR

We include the representative deterministic methods ConvLSTM and Earthformer, and all studied probabilistic methods, to compare CSI, CSI-pool4 and CSI-pool16 at varying thresholds. It is important to note that CSI tends to favor conservative predictions, especially in situations with high levels of uncertainty. To ensure a fair comparison, the CSI scores are computed on the per-model average over samples, while the scores in the other metrics are averaged over the scores of individual samples. The results presented in Tables 15, 16 and 17 demonstrate that Pre Diff achieves competitive CSI scores and outperforms the baselines in CSI at pooling scales 4×4 and 16×16, particularly at higher thresholds.
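For reference, the CSI and BIAS scores above can be computed from binarized forecasts as in the following sketch (NumPy; the array handling and the absence of zero-division guards are simplifications on our part).

```python
import numpy as np

def csi_bias(pred, truth, thresh):
    """CSI = Hits / (Hits + Misses + F.Alarms); BIAS = (Hits + F.Alarms) / (Hits + Misses).

    pred, truth: arrays of forecast and observed intensities on the same grid;
    both are binarized at `thresh` before counting.
    """
    p, t = pred >= thresh, truth >= thresh
    hits = np.sum(p & t)          # truth = 1, pred = 1
    misses = np.sum(~p & t)       # truth = 1, pred = 0
    f_alarms = np.sum(p & ~t)     # truth = 0, pred = 1
    csi = hits / (hits + misses + f_alarms)
    bias = (hits + f_alarms) / (hits + misses)
    return csi, bias
```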
Table 15: CSI at thresholds [16, 74, 133, 160, 181, 219] on SEVIR.

Model | CSI-m | CSI-219 | CSI-181 | CSI-160 | CSI-133 | CSI-74 | CSI-16
ConvLSTM [47] | 0.4185 | 0.1220 | 0.2381 | 0.2905 | 0.4135 | 0.6846 | 0.7510
Earthformer [8] | 0.4419 | 0.1791 | 0.2848 | 0.3232 | 0.4271 | 0.6860 | 0.7513
DGMR [41] | 0.2675 | 0.0151 | 0.0537 | 0.0970 | 0.2184 | 0.5500 | 0.6710
VideoGPT [65] | 0.3653 | 0.1029 | 0.1997 | 0.2352 | 0.3432 | 0.6062 | 0.7045
LDM [42] | 0.3580 | 0.1019 | 0.1894 | 0.2340 | 0.3537 | 0.5848 | 0.6841
Pre Diff | 0.4100 | 0.1154 | 0.2357 | 0.2848 | 0.4119 | 0.6740 | 0.7386

Table 16: CSI-pool4 at thresholds [16, 74, 133, 160, 181, 219] on SEVIR.

Model | CSI-pool4-m | CSI-pool4-219 | CSI-pool4-181 | CSI-pool4-160 | CSI-pool4-133 | CSI-pool4-74 | CSI-pool4-16
ConvLSTM [47] | 0.4452 | 0.1850 | 0.2864 | 0.3245 | 0.4502 | 0.6694 | 0.7556
Earthformer [8] | 0.4567 | 0.1484 | 0.2772 | 0.3341 | 0.4911 | 0.7006 | 0.7892
DGMR [41] | 0.3431 | 0.0414 | 0.1194 | 0.1950 | 0.3452 | 0.6302 | 0.7273
VideoGPT [65] | 0.4349 | 0.1691 | 0.2825 | 0.3268 | 0.4482 | 0.6529 | 0.7300
LDM [42] | 0.4022 | 0.1439 | 0.2420 | 0.2964 | 0.4171 | 0.6139 | 0.6998
Pre Diff | 0.4624 | 0.2065 | 0.3130 | 0.3613 | 0.4807 | 0.6691 | 0.7438

Table 17: CSI-pool16 at thresholds [16, 74, 133, 160, 181, 219] on SEVIR.

Model | CSI-pool16-m | CSI-pool16-219 | CSI-pool16-181 | CSI-pool16-160 | CSI-pool16-133 | CSI-pool16-74 | CSI-pool16-16
ConvLSTM [47] | 0.5135 | 0.2651 | 0.3679 | 0.4153 | 0.5408 | 0.7039 | 0.7883
Earthformer [8] | 0.5005 | 0.1798 | 0.3207 | 0.3918 | 0.5448 | 0.7304 | 0.8353
DGMR [41] | 0.4832 | 0.1218 | 0.2804 | 0.3924 | 0.5364 | 0.7465 | 0.8216
VideoGPT [65] | 0.5798 | 0.3101 | 0.4543 | 0.5211 | 0.6285 | 0.7583 | 0.8065
LDM [42] | 0.5522 | 0.2896 | 0.4247 | 0.4987 | 0.5895 | 0.7229 | 0.7876
Pre Diff | 0.6244 | 0.3865 | 0.5127 | 0.5757 | 0.6638 | 0.7789 | 0.8289

E More Qualitative Results on N-body MNIST

Fig. 6 to Fig. 13 show several sets of example predictions on the N-body MNIST test set. In each figure, the visualizations from top to bottom are: the context sequence y, the target sequence x, and the predictions by ConvLSTM [47], Earthformer [8], VideoGPT [65], LDM [42], Pre Diff, and Pre Diff-KA. E.MSE denotes the average error between the total energy (the sum of kinetic and potential energy) of the predictions E(x̂^j) and the total energy of the last context step E(y^{L_in}).

Figure 6: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "1" in the last frame.
Figure 7: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "0" in the last frame.
Figure 8: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "0" in the last frame.
Figure 9: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "8" in the last frame.
Figure 10: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "4" in the last frame.
Figure 11: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "1" in the last frame.
Figure 12: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "7" in the last frame.
Figure 13: A set of example predictions on the N-body MNIST test set. The red dashed line helps the reader judge the position of the digit "7" in the last frame.

F More Qualitative Results on SEVIR

Fig. 14 to Fig. 19 show several sets of example predictions on the SEVIR test set. In subfigure (a) of each figure, the visualizations from top to bottom are: the context sequence y, the target sequence x, and the predictions by ConvLSTM [47], Earthformer [8], VideoGPT [65], LDM [42], Pre Diff, and Pre Diff-KA. In subfigure (b) of each figure, the visualizations from top to bottom are: the context sequence y, the target sequence x, and the predictions by Pre Diff-KA with anticipated average future intensity µ_τ + nσ_τ, n = −4, −2, 0, 2, 4.

Figure 14: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines. (b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity.
Figure 15: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines. (b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity.
Figure 16: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines. (b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity.
Figure 17: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines.
(b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity. Figure 18: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines. (b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity. Figure 19: A set of example predictions on the SEVIR test set. (a) Comparison of Pre Diff with baselines. (b) Predictions by Pre Diff-KA under the guidance of anticipated average intensity.