SADA: Stability-guided Adaptive Diffusion Acceleration

Ting Jiang*¹ Yixiao Wang*² Hancheng Ye*¹ Zishan Shao¹² Jingwei Sun¹ Jingyang Zhang¹ Zekai Chen¹ Jianyi Zhang¹ Yiran Chen¹ Hai Li¹

Abstract

Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies that reduce per-step computation cost, while effectively reducing sampling time, demonstrate low faithfulness compared to the original baseline. We hypothesize that this fidelity gap arises because (a) different prompts correspond to varying denoising trajectories, and (b) such methods do not consider the underlying ODE formulation and its numerical solution. In this paper, we propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that unifies step-wise and token-wise sparsity decisions via a single stability criterion to accelerate sampling of ODE-based generative models (Diffusion and Flow-matching). For (a), SADA adaptively allocates sparsity based on the sampling trajectory. For (b), SADA introduces principled approximation schemes that leverage the precise gradient information from the numerical ODE solver. Comprehensive evaluations on SD-2, SDXL, and Flux using both EDM and DPM++ solvers reveal consistent ≥ 1.8× speedups with minimal fidelity degradation (LPIPS ≤ 0.10 and FID ≤ 4.5) compared to unmodified baselines, significantly outperforming prior methods. Moreover, SADA adapts seamlessly to other pipelines and modalities: it accelerates ControlNet without any modifications and speeds up MusicLDM by 1.8× with ≤ 0.01 spectrogram LPIPS. Our code is available at: https://github.com/Ting-Justin-Jiang/sadaicml.

*Equal contribution. ¹Department of Electrical and Computer Engineering, Duke University, Durham, U.S.A. ²Department of Statistical Science, Duke University, Durham, U.S.A. Correspondence to: Ting Jiang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Recent advancements in diffusion models have set new benchmarks across various tasks, including image, video, text, and audio generation (Sohl-Dickstein et al., 2015; Song et al., a), significantly enhancing productivity and creativity. However, the deployment of these models at high resolutions faces two fundamental efficiency bottlenecks: (i) the iterative nature of the denoising process and (ii) the quadratic complexity of attention mechanisms. To address these challenges, existing training-free acceleration methods have primarily focused on two corresponding fronts: (i) reducing the number of inference steps (Song et al., a; Lu et al., 2022b; Karras et al., 2022), and (ii) reducing the computational cost per step (Bolya & Hoffman, 2023; Ma et al., 2024b; Zhao et al., 2024; Wang et al., 2024; Zou et al., 2024; Ye et al., 2024). Sampling with generative models can be cast as transporting between two distributions through a reverse ordinary differential equation (ODE) (Song & Ermon, 2019; Song et al., b; Lipman et al., 2023; Liu et al., 2022). Numerical ODE solvers (Karras et al., 2022; Lu et al., 2022a;b) that significantly reduce the number of inference steps while preserving sample fidelity have become a fundamental component of practical workflows.
Orthogonal to these advanced schedulers, recent works that reduce per-step computational cost in category (ii) exploit the sparsity empirically observed in pretrained architectures (Rombach et al., 2022; Chen et al., 2024b; Podell et al.; Esser et al., 2024) during the sampling procedure, either at a token-wise (Bolya & Hoffman, 2023; Kim et al., 2024; Zou et al., 2024) or step-wise (Ma et al., 2024b; Zhao et al., 2024; Ye et al., 2024) granularity. While effectively reducing sampling latency, these architectural strategies in category (ii) demonstrate low faithfulness compared to original samples, as measured in LPIPS (Zhang et al., 2018) and FID (Heusel et al., 2017). We hypothesize that this fidelity gap arises because: (a) fixed (Ma et al., 2024b) or pre-searched (Yuan et al., 2024) sparsity patterns cannot adapt to the variability of each prompt's denoising trajectory, and (b) such methods do not explicitly leverage the underlying ODE formulation of the denoising process, nor interplay with the specific ODE solver used. Motivated by these observations, we introduce Stability-guided Adaptive Diffusion Acceleration (SADA), a training-free framework that dynamically exploits both step-wise and token-wise sparsity via a unified stability criterion.

Figure 1. Accelerating {Flux, SDXL, SD-2} by {2.02×, 1.86×, 1.80×} with Stability-guided Adaptive Diffusion Acceleration with 50 {Diffusion, Flow-matching} inference steps.

SADA addresses (a) by adaptively allocating computation along the denoising trajectory (see Section 3.3), and (b) by employing approximation schemes that leverage precise trajectory gradient information from the chosen ODE solver (see Section 3.4). Specifically, leveraging the second-order difference of the precise gradient $y_t = \frac{dx_t}{dt}$ according to its definition (Song et al., a; Lipman et al., 2023), SADA makes better decisions on sparsity allocation and principled approximation. For (a), we propose the stability criterion (Criterion 3.4) as the second-order difference of $y_t$, which measures the local dynamics of the denoising trajectory. Implemented in a plug-and-play fashion, SADA dynamically identifies the sparsity mode {token-wise, step-wise, multistep-wise} at each timestep. For (b), we incorporate the gradient calculated by a specific ODE solver into both the stability criterion and the approximation correction. In particular, we derive two approximation schemes that unify the $x_0^t$ and $x_t$ trajectories, yielding a principled estimate of the per-step clean sample $x_0^t$ compatible with advanced diffusion schedulers. Both theoretical and empirical analyses indicate that SADA is compatible with different backbone architectures (Ronneberger et al., 2015; Peebles & Xie, 2023) and solvers, and is able to accelerate generative modeling in various modalities (Chen et al., 2024c) and downstream tasks (Zhang et al., 2023) without additional training. In all tested scenarios, SADA achieves substantially better faithfulness (≤ 0.100 LPIPS) than existing strategies (Ye et al., 2024; Ma et al., 2024b; Liu et al., 2025a), while consistently delivering ≥ 1.8× speedup. These results establish SADA as a practical plug-in for high-throughput, high-fidelity generative sampling.

In summary, the core contributions of our work are: (i) We introduce SADA, a training-free framework that leverages a stability criterion to adaptively sample the denoising process with principled approximation.
(ii) To the best of our knowledge, SADA is the first paradigm that directly bridges numerical solvers of the sampling trajectory with sparsity-aware architecture optimizations. (iii) Extensive experiments conducted on various baselines, solvers, and prediction frameworks demonstrate the effectiveness of SADA over existing acceleration strategies. Moreover, SADA can be easily adapted to new generation pipelines and modalities.

2. Related Work

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., b; Nichol & Dhariwal, 2021) have emerged as a transformative framework for generative modeling. They generate high-fidelity samples by iteratively reversing a stochastic process that gradually converts data into pure noise, delivering state-of-the-art results across a wide range of tasks. In the realm of image synthesis, Stable Diffusion (Rombach et al., 2022) enhances computational efficiency by mapping high-dimensional pixel inputs to a compact latent space via a Variational Autoencoder (Kingma & Welling, 2014). Moreover, it incorporates text conditioning through a pre-trained text encoder (Radford et al., 2021; Oquab et al., 2024), driven by classifier-free guidance (Ho & Salimans, 2022). Building upon this foundation, later members of the latent diffusion family (Podell et al.; Chen et al., 2024b; Black-Forest-Labs, 2024) incorporate Transformers (Vaswani et al., 2017) as the primary architectural backbone, enabling improved scalability (Peebles & Xie, 2023) and higher-resolution generation (Chen et al., 2024a; Gao et al., 2024). However, these advancements come at the cost of efficiency: the iterative nature of the denoising process and the quadratic complexity of self-attention significantly slow down inference.

Accelerating Diffusion Models. To mitigate the above two bottlenecks, existing acceleration strategies typically focus on two fronts: (1) reducing the number of inference steps, and (2) reducing the computational cost per step. The first paradigm involves interpreting the underlying stochastic process, developing numerical methods, and introducing novel objectives and frameworks. DDIM (Song et al., a) reformulates the diffusion process as a non-Markovian procedure. Subsequent work identifies the diffusion process as a stochastic differential equation (Song & Ermon, 2019) and converts it into solving a probability-flow ordinary differential equation (PF-ODE) that can be parameterized by various model output objectives (Song et al., b; Chen et al., 2023; Kim et al.; Kingma & Gao, 2023). Building on this perspective, advanced numerical solvers significantly decrease the required number of function evaluations. Among these, the Euler Discrete Multistep (EDM) solver (Karras et al., 2022; Liu et al., 2023) has been applied effectively, while the DPM-Solver series (Lu et al., 2022a;b; Zheng et al., 2023) further enhances efficiency and stability by computing the linear component analytically with respect to the semi-linearity of the PF-ODE. More recent works further explore the sampling trajectory of generative modeling. Consistency models (Song et al., 2023; Lu & Song, 2024) establish a theoretical one-step mapping from any point on the PF-ODE to the data distribution, while flow matching (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) casts generative modeling as directly learning the ODE that transports samples from the noise distribution to the data distribution.
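To ground the solver perspective, the sketch below integrates a learned flow-matching ODE (introduced formally as Eq. (4) in Section 3.1) with plain first-order Euler steps. It is an illustrative minimal sampler, not any of the cited solvers: `velocity_model` is a hypothetical stand-in for a pretrained flow-matching network, and production pipelines would use the higher-order schedulers discussed above.

```python
import torch

@torch.no_grad()
def euler_ode_sample(velocity_model, x_T, num_steps=50):
    """Integrate dx/dt = u_theta(x_t, t) from t = 1 (noise) to t = 0 (data)
    with first-order Euler steps, the simplest member of the solver family
    discussed above. `velocity_model` is a hypothetical stand-in for a
    pretrained flow-matching network."""
    x = x_T
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        u = velocity_model(x, ts[i])           # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * u        # Euler update (negative dt)
    return x
```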
The second paradigm mainly leverages either step-wise or token-wise sparsity in diffusion models to reduce computation overhead. On the step-wise front, DeepCache (Ma et al., 2024b) accelerates the U-Net-based (Ronneberger et al., 2015) denoising model by caching high-level features in deeper layers. Feature caching methods (Zhao et al., 2024; Ma et al., 2024a; Wimbauer et al., 2024; Zhen et al., 2025; Saghatchian et al., 2025; Liu et al., 2024; Chen et al., 2024d; Shen et al.; Liu et al., 2025b) shorten sampling latency by reusing attention outputs. PFDiff (Wang et al., 2024) leverages previous model outputs to predict a look-ahead term in the ODE solver, analogous to the use of Nesterov momentum. These step-wise approaches typically employ a fixed schedule determined via a hyperparameter to guide the sparsity pattern during sampling. In parallel, token reduction strategies (Bolya & Hoffman, 2023; Kim et al., 2024; Zhen et al., 2025; Saghatchian et al., 2025) exploit the redundancy in image pixels to eliminate unnecessary tokens, thereby reducing the attention module's computational load. The ToCa series (Zou et al., 2024; Zhang et al., 2024) combines caching and token pruning. DiTFastAttn (Yuan et al., 2024; Zhang et al., 2025) compresses the attention module by leveraging redundancies identified after a brief search. Despite these advancements, none of these approaches dynamically allocates step-wise and token-wise sparsity to accelerate sampling at multiple granularity levels. Moreover, their end-to-end acceleration configurations typically hinge on specific hyperparameters or pre-trained models. AdaptiveDiffusion (Ye et al., 2024) offers a promising perspective by adjusting its acceleration mode based on the prompt. However, it still requires hyperparameter tuning and does not correct for approximation error. In contrast, we frame diffusion acceleration as a stability-prediction problem rather than merely controlling error accumulation. SADA requires minimal hyperparameter tuning and incorporates principled approximation schemes that directly match our stability criterion.

3. Proposed Method

3.1. Preliminary

Diffusion Models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., a; Nichol & Dhariwal, 2021) formulate sample generation as an iterative denoising process. Starting from a Gaussian sample $x_T \sim \mathcal{N}(0, I)$, these models progressively recover a clean signal over $T$ timesteps. The forward process is defined by a variance schedule $\{\beta_t\}$ with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. The reverse (sampling) process is then modeled as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \beta_t I\right), \quad (1)$$

where the mean $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$ is parameterized by a noise prediction model $\epsilon_\theta(x_t, t)$. Meanwhile, the data reconstruction $x_0^t$ can be calculated as:

$$x_0^t = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right). \quad (2)$$

Figure 2. Overview paradigm of SADA. The sparsity mode (middle: step-wise, bottom: token-wise) at timestep t−1 is adaptively identified by the stability Criterion 3.4 after fresh computation at timestep t. Note that DP in the pipeline stands for Data Prediction. Right: visualization of SADA and baseline methods' performance in terms of faithfulness and efficiency. Our method significantly outperforms existing baselines {DeepCache, AdaptiveDiffusion} on both metrics, using {SD-2 (top), SDXL (bottom)} with DPM-Solver++ 50 steps.
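Read concretely, Eq. (2) is a one-line data-prediction step. Below is a minimal PyTorch-style sketch; names such as `alpha_bar_t` for $\bar\alpha_t$ are ours, not from the released code.

```python
import torch

def predict_clean_sample(x_t: torch.Tensor, eps_pred: torch.Tensor,
                         alpha_bar_t: float) -> torch.Tensor:
    """Data prediction of Eq. (2): estimate x_0^t from the latent x_t and
    the predicted noise eps_theta(x_t, t). alpha_bar_t is the cumulative
    product of alpha_i = 1 - beta_i up to step t."""
    return (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
```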
For text-to-image synthesis, generation typically occurs in a learned latent space (Rombach et al., 2022). In this setting, the denoising model is usually implemented using a deep neural network composed of $L$ transformer-based layers. At the $l$-th transformer layer at timestep $t$, the input latent representation is denoted by $x_t^{(l)} \in \mathbb{R}^{B \times H \times W \times C}$, where $B$ is the batch size, and $H$, $W$, $C$ are the spatial and embedding dimensions, respectively.

Diffusion ODEs. Sampling with diffusion models can be reformulated as solving a reverse-time ordinary differential equation (ODE). In particular, previous works (Song & Ermon, 2019; Song et al., b) defined the probability-flow ODE that characterizes the continuous-time evolution of samples:

$$\frac{dx}{dt}\bigg|_{x=x_t} = f(t)\,x_t + \frac{g^2(t)}{2\sigma_t}\,\epsilon_\theta(x_t, t), \quad (3)$$

where $f(t) = \frac{d}{dt}\log\alpha_t$, $g^2(t) = \frac{d\sigma_t^2}{dt} - 2\frac{d}{dt}(\log\alpha_t)\,\sigma_t^2$, and $\sigma_t = \sqrt{1-\bar{\alpha}_t}$. Here, time is assumed to be continuous with $t \sim U([0, 1])$. Recent works explored rectified flows (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), where the reverse transformation is achieved with a learned vector field $u_\theta(x_t, t)$ predicted by the denoising model:

$$\frac{dx}{dt}\bigg|_{x=x_t} = u_\theta(x_t, t). \quad (4)$$

Without loss of generality, we write $\frac{dx}{dt}\big|_{x=x_t}$ as $y(x_t, t)$. For brevity, we further denote $y(x_t, t)$ as $y_t$, $\epsilon_\theta(x_t, t)$ as $\epsilon_t$, and $u_\theta(x_t, t)$ as $u_t$.

Step-wise Sparsity accelerates the diffusion model by adaptively reducing the noise prediction steps. The recent work (Ye et al., 2024) leverages a third-order difference of the latent feature map $x_t$ to identify temporal redundancy in the sampling process. Let $\Delta^{(1)}x_t = x_t - x_{t+1}$ denote the backward finite difference. At step $t$, given the noise $\epsilon_t$ and a threshold $\tau$, if

$$\frac{\left\| \left(\Delta^{(1)}x_{t+2} + \Delta^{(1)}x_t\right)/2 - \Delta^{(1)}x_{t+1} \right\|}{\left\| \Delta^{(1)}x_{t+1} \right\|} \le \tau, \quad (5)$$

then the denoising model is bypassed for the subsequent timestep $t-1$, and $\epsilon_{t-1}$ is reused from $\epsilon_t$.

Token-wise Sparsity accelerates diffusion models at a higher representation granularity. The token indices of the input sequence in a transformer layer are partitioned into $I_{\text{reduce}}$ and $I_{\text{fix}}$, aiming to maximize $|I_{\text{reduce}}|$ while maintaining acceptable image quality. An index mapping $I : \{1, \dots, N\} \to \{1, \dots, N'\}$ is defined, where $N' = |I_{\text{fix}}|$. A standard token pruning (Bolya & Hoffman, 2023; Kim et al., 2024) process then discards $I_{\text{reduce}}$ and retains only $I_{\text{fix}}$:

$$\tilde{x}[j] = x[i^*], \quad j = 1, \dots, N', \quad (6)$$

where $I(i^*) = j$ and $i^* \in I_{\text{fix}}$. After self-attention $A(\cdot)$, the feature map is reconstructed by:

$$\hat{x}[i] = A(\tilde{x})\left[I(i)\right], \quad i = 1, \dots, N. \quad (7)$$

Removing more tokens often degrades generation quality. Our analysis in Appendix C shows that token merging behaves as a low-pass filter, motivating our choice to build upon token pruning.

3.2. Overall Paradigm

Figure 2 illustrates the overall paradigm of SADA. We leverage the precise gradient from the sampling trajectory to measure the denoising stability, as described in Section 3.3. This Boolean measure serves as an overall criterion for our acceleration process. Specifically, when Criterion 3.4 returns True, we apply step-wise cache-assisted pruning (refer to Section 3.4) with a dual approximation scheme {step-wise, multistep-wise}. Conversely, when Criterion 3.4 returns False, we apply token-wise cache-assisted pruning (refer to Section 3.5).
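The control flow of this paradigm fits in a few lines. The skeleton below is an illustrative sketch: the three callables stand in for Criterion 3.4 and the pruning routines of the next two subsections, and the names are ours, not the released implementation.

```python
def sada_step(criterion, stepwise_prune, tokenwise_prune, state):
    """One SADA denoising step (Sec. 3.2) as a control-flow skeleton.
    The three callables are illustrative placeholders for Criterion 3.4
    and the pruning routines of Secs. 3.4 and 3.5."""
    if criterion(state):
        # Stable regime: bypass the denoising model at t-1 and extrapolate
        # along the ODE trajectory (step-wise / multistep-wise, Sec. 3.4).
        return stepwise_prune(state)
    # Unstable regime: run the model, pruning only tokens judged stable and
    # reconstructing them from per-layer caches (Sec. 3.5).
    return tokenwise_prune(state)
```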
3.3. Modeling the Stability of the Denoising Process

We formulate the acceleration of the denoising process as a stability prediction problem. Given a pre-trained diffusion architecture with noise prediction $\epsilon_t$ at timestep $t$, we seek a dynamic criterion for step-wise and token-wise cache-assisted pruning (i.e., optimal sparsity allocation) that reduces computation at timestep $t-1$ without degrading sample fidelity. The criterion for acceleration should match the specific approximation method that mitigates the loss resulting from step-wise pruning. AdaptiveDiffusion (Ye et al., 2024) directly caches and reuses the noise across steps. However, approximating $\hat{\epsilon}_{t-1} = \epsilon_t$ introduces a mismatch between the sample state $x_{t-1}$ and its corresponding noise, causing error to accumulate. A more principled approach should incorporate historical information to correct this approximation.

Concretely, we propose a third-order extrapolation of the sample state at $t-1$, which implicitly carries historical noise along the ODE trajectory. A straightforward third-order backward finite-difference baseline can be written as $\hat{x}_{t-1} = 3x_t - 3x_{t+1} + x_{t+2}$. This yields a correction term $x_{t-1} - \hat{x}_{t-1} = \Delta^{(3)}x_{t-1}$. Moreover, $\hat{x}_{t-1}$ is theoretically bounded by Theorem 3.1, which guarantees its stability under small step sizes.

Theorem 3.1. Let $f \in C^k[a, b]$ be a smooth function and let $x_0 := x$, with equally spaced grid points $x_i := x + ih$ for $i = 0, 1, \dots, k-1$. Define $P_{k-1}(t)$ as the degree-$(k-1)$ Lagrange interpolant of $f$ at $\{x_i\}_{i=0}^{k-1}$. Then, the extrapolated value at $x - h$ satisfies:

$$f(x-h) = \sum_{i=0}^{k-1} \alpha_i f(x_i) + R_k(h), \quad (8)$$

where the weights are given by:

$$\alpha_i = (-1)^i \binom{k}{i+1}, \quad i = 0, 1, \dots, k-1. \quad (9)$$

The error bound of the remainder term is:

$$R_k(h) = O(h^k). \quad (10)$$

To derive our stability criterion, we now present the following supplemental theorems:

Theorem 3.2. The expected value of $x_t$ over the joint distribution of $x_0$, $\epsilon$, and timestep $t$ satisfies: $\mathbb{E}_{x_0,\epsilon,t}[x_t] = \sqrt{\bar{\alpha}_t}\, \mathbb{E}_{x_0}[x_0]$.

(i) From Theorem 3.2, we know that the trajectory $\{x_t\}$ is continuous in expectation. Consequently, the empirical average $\bar{x}_t$ also exhibits continuity by the law of large numbers (Feller et al., 1971).

Theorem 3.3. Let the network $\epsilon_\theta(x, t)$ be trained with the standard mean-squared error (MSE) objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]. \quad (11)$$

Suppose $\theta^*$ minimizes $\mathcal{L}$, and the training is sufficiently converged. Then, following Assumption 1 that $\epsilon_\theta(x_t, t)$ is Lipschitz in $x_t$ and $t$, we have the following consistency property for sampling-time inputs $\hat{x}_t$:

$$\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(\hat{x}_t, t)] \to 0 \quad \text{as} \quad \|\hat{x}_t - x_t\| \to 0. \quad (12)$$

(ii) From Theorem 3.3, a well-trained denoiser permits us to treat the sampling trajectory as a continuous process, preserving structural consistency. We therefore assume that, in a stable regime, the sign of consecutive third-order differences remains consistent: $\operatorname{sign}\big(\Delta^{(3)}x_t\big) = \operatorname{sign}\big(\Delta^{(3)}x_{t-1}\big)$. Consequently, if the extrapolation error $x_{t-1} - \hat{x}_{t-1}$ is aligned with the true curvature $\Delta^{(3)}x_{t-1}$, then $\hat{x}_{t-1}$ serves as a directionally accurate approximation. To incorporate precise gradient information, we leverage the identity $\Delta^{(2)}y_t \cdot \Delta^{(3)}x_t < 0$, which always holds by construction. Combining the sign measure with the identity, we substitute $\Delta^{(3)}x_t$ with $\Delta^{(3)}x_{t-1} = x_{t-1} - \hat{x}_{t-1}$ and obtain the following criterion:

Criterion 3.4. A timestep $t$ is considered stable and eligible for acceleration if the extrapolation error is anti-aligned with the local curvature of the velocity:

$$(x_{t-1} - \hat{x}_{t-1}) \cdot \Delta^{(2)}y_t < 0. \quad (13)$$

This stability measure ensures that the extrapolated state $\hat{x}_{t-1}$ lies in the correct direction with respect to curvature correction.
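As a sketch of how Criterion 3.4 can be evaluated in practice, under our reading of Eq. (13) (the inner product is realized below as a global sum; the released implementation may aggregate differently):

```python
import torch

def is_stable(x_tm1, x_t, x_tp1, x_tp2, y_t, y_tp1, y_tp2):
    """Boolean evaluation of Criterion 3.4 (a sketch). Inputs are recent
    latents x and solver gradients y = dx/dt, indexed so that step t-1 is
    the one about to be (possibly) skipped."""
    x_hat = 3 * x_t - 3 * x_tp1 + x_tp2          # third-order extrapolation
    d2y = y_t - 2 * y_tp1 + y_tp2                # second-order difference of y
    # Stable iff the extrapolation error is anti-aligned with the curvature.
    return (torch.sum((x_tm1 - x_hat) * d2y) < 0).item()
```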
3.4. Step-wise Cache-Assisted Pruning

This section proposes two complementary approximation schemes for step-wise cache-assisted pruning: a step-wise scheme and a multistep-wise scheme. Either approximation scheme produces a clean-sample estimate $\hat{x}_0^t$, which is then fed into advanced samplers (e.g., DPM-Solver++ or EDM (Karras et al., 2022; Lu et al., 2022a)). This unified framework aligns the $x_0^t$ and $x_t$ trajectories (Fig. 4). Below, we detail the design choices for each approximation.

Figure 3. Comparison of $x_t$-approximation strategies. Left: step-wise pruning result of the third-order finite difference method and the third-order Adams-Moulton method. Right: mean squared error comparison between the two strategies, averaged over 50 randomly selected prompts. Shaded regions indicate the standard deviation at each step.

Figure 4. Illustration of the proposed dual approximation scheme on the $x_0^t$ and $x_t$ trajectories, spanning the semantic-planning and fidelity-improving stages (step-wise approximation via Eq. (14), multistep-wise approximation via Eq. (16), each combined with previous states).

Step-wise Approximation. As noted in Section 3.3, a simple baseline for approximating $x_{t-1}$ is a third-order backward finite difference. Empirical experiments demonstrate low reconstruction errors, as shown in Figure 3. Note that we reuse the noise prediction at step $t$, formulated as $\hat{\epsilon}_{t-1} \approx \epsilon_t$, under this scheme. However, since we have exact derivative information $y_t$ (see Eqs. (3) and (4)), we adopt a third-order Adams-Moulton method (Iserles, 2009) along the ODE trajectory:

Theorem 3.5. Using the second- and third-order Adams-Moulton methods, we define the estimator:

$$\hat{x}_{t-1} := x_t - \frac{5\Delta t}{6}\, y_t - \frac{5\Delta t}{6}\, y_{t+1} + \frac{2\Delta t}{3}\, y_{t+2}, \quad (14)$$

whose local truncation error satisfies $\|\hat{x}_{t-1} - x_{t-1}\| = O(\Delta t^2)$.

The full derivation and the corresponding error bound are provided in Proposition B.1 and Theorems 3.1, B.2, and 3.5 in Appendix B.2. To quantify the benefit of these precise updates, we measure per-step reconstruction error on 50 randomly sampled MS-COCO prompts (Lin et al., 2014). Our Adams-Moulton scheme results in lower mean error and smaller standard deviation compared to the third-order finite difference, as illustrated in Figure 3.

Given that $x_0^t$ captures structural information and serves as the initial input to all schedulers, accurately reconstructing $x_0^t$ is critical. Theorem 3.6 establishes an upper bound on the reconstruction error at timestep $t$, based on the third-order estimator in Theorem 3.5 and incorporating the effect of noise reuse.

Theorem 3.6. Let $\hat{x}_0^t$ denote the reconstruction of $x_0$ at time $t$. Then, the final reconstruction $\hat{x}_0^t$ satisfies the following error bound:

$$\|\hat{x}_0^t - x_0^t\| = O(\Delta t) + O(\Delta x_t). \quad (15)$$

This theorem characterizes the reconstruction error as a first-order term in both the scheduler resolution $\Delta t$ and the variation $\Delta x_t$. The detailed proof can be found in Appendix B.2.
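The estimator of Theorem 3.5 is a cheap linear combination of cached states and gradients. Below is a one-function sketch; the coefficients follow our reconstruction of Eq. (14) from the Appendix B.2 derivation (their sum is $-\Delta t$, consistent with a first-order step in the backward time direction).

```python
def adams_moulton_extrapolate(x_t, y_t, y_tp1, y_tp2, dt):
    """Step-wise approximation of x_{t-1} (Theorem 3.5, Eq. (14)) from the
    current state and the three most recent trajectory gradients y = dx/dt.
    Coefficients are our reconstruction; they sum to -dt."""
    return x_t - (5.0 / 6.0) * dt * y_t - (5.0 / 6.0) * dt * y_tp1 \
               + (2.0 / 3.0) * dt * y_tp2
```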
Multistep-wise Approximation. Prior work (Liu et al., 2025b) empirically divides the denoising trajectory into a semantic-planning stage and a subsequent fidelity-improving stage, corresponding to regions where the data prediction $x_0^t$ is inherently stable (see Figure 4). In these stable regions, one can safely use larger effective step sizes by compensating with higher-order interpolation. Building on this insight, we implement a uniform step-wise pruning strategy with Lagrange interpolation once the trajectory enters the stable regime. For example, consider a 50-step process. To achieve a step-wise pruning interval of 4 after stabilization (i.e., compute every 4th step fully and interpolate the skipped steps via Lagrange interpolation), we store $x_0^t$ every 4 steps before stabilization. Their indices define the fixed-size set $I$, which is a rolling buffer to limit memory usage. For any skipped $t$:

Theorem 3.7. Let $I = \{0, 1, \dots, k\}$ be $k+1$ distinct indexes with known cached values $\{x_0^{t_i}\}_{i \in I}$. For any skipped timestep $t \notin \{t_i\}_{i \in I}$, define the interpolated reconstruction as:

$$\hat{x}_0^t = \sum_{i \in I} \left( \prod_{j \in I,\, j \neq i} \frac{t - t_j}{t_i - t_j} \right) x_0^{t_i}. \quad (16)$$

Then, under the assumption that $x_0^\tau$ is $(k+1)$-times continuously differentiable over $\tau \in [t_{\min}, t_{\max}]$, the interpolation error satisfies:

$$\|\hat{x}_0^t - x_0^t\| = O(h^{k+1}), \quad (17)$$

where $h$ is the maximum step spacing among $\{t_i\}$.

This multistep cache-assisted pruning strategy significantly induces step-wise sparsity while preserving sample fidelity.

3.5. Token-wise Cache-Assisted Pruning

At timestep $t$, if Criterion 3.4 returns False, we continue to evaluate our stability measure at a higher granularity level. The core idea of our token-wise algorithm is aligned with its step-wise counterpart: (i) fix unstable tokens for full calculation, and (ii) reduce the stable tokens and approximate them by their previous representations in the latent cache. This procedure partitions the tokens into two sets, $I_{\text{fix}}$ (unstable tokens) and $I_{\text{reduce}}$ (stable tokens), which define our adaptive pruning configuration for the subsequent timestep.

Figure 5. Illustration of the proposed token-wise strategy. The criterion at the token level generates a mask to guide the pruning process at the input of the $l$-th attention layer. The pruned feature map is then reconstructed at the output of the layer using its cached representation $C_l$.

Our algorithm is formulated as follows (a code sketch follows at the end of this subsection):

(i) Cache Initialization: Let $T$ denote the starting timestep for cache-assisted pruning and $i$ the caching interval. For an input at timestep $t-1$ and transformer layer $l$, if $(t - 1 - T) \bmod i = 0$, we initialize the cache after a full computation:

$$C_l = A\big(x_{t-1}^{(l)}\big) \in \mathbb{R}^{B \times N \times C}, \quad (18)$$

where $C_l$ is the feature map stored in the cache for layer $l$, and it is updated throughout the denoising process.

(ii) Cache Update: When $(t - 1 - T) \bmod i \neq 0$, we prune the input data into $\tilde{x}_{t-1}^{(l)} \in \mathbb{R}^{B \times N' \times C}$, where its length $N' = |I_{\text{fix}}|$. After the transformer (/attention) layer, we update $C_l$ with fresh tokens:

$$C_l[i] = A\big(\tilde{x}_{t-1}^{(l)}\big)\left[I(i)\right] \quad \text{for } i \in I_{\text{fix}}. \quad (19)$$

(iii) Cache-Assisted Reconstruction: The pruned tokens are approximated by their cached representations:

$$\hat{x}_{t-1}^{(l)}[i] = \begin{cases} A\big(\tilde{x}_{t-1}^{(l)}\big)\left[I(i)\right], & \text{if } i \in I_{\text{fix}}, \\ C_l[i], & \text{if } i \in I_{\text{reduce}}. \end{cases} \quad (20)$$

We keep the reconstructed sequence $\hat{x}_{t-1}^{(l)}$ synchronized with $C_l$ for subsequent timesteps.
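A minimal sketch of the per-layer cache logic in Eqs. (18)-(20): `attn` stands for the layer's attention computation $A(\cdot)$, and the class layout is illustrative, not the released implementation.

```python
import torch

class TokenCache:
    """Per-layer cache for token-wise cache-assisted pruning (Sec. 3.5).
    `fix_idx` holds the unstable token indices kept for full computation;
    pruned tokens are filled from the cached feature map."""
    def __init__(self):
        self.C = None                      # cached feature map, (B, N, C)

    def full_step(self, attn, x):          # Eq. (18): cache initialization
        self.C = attn(x)
        return self.C

    def pruned_step(self, attn, x, fix_idx):
        out = attn(x[:, fix_idx])          # attend over kept tokens only
        recon = self.C.clone()             # Eq. (20): start from the cache
        recon[:, fix_idx] = out            # Eq. (19): refresh with new tokens
        self.C = recon                     # keep cache synchronized
        return recon
```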
4. Experiments

4.1. Experiment Settings

Model Configurations. To verify the generalization of the proposed approach, we evaluate a set of widely used text-to-image models employing different backbone architectures: SD-2 (U-Net), SDXL (Podell et al.) (modified U-Net), and Flux.1-dev (Black-Forest-Labs, 2024) (DiT). We perform evaluations using two sampling schedulers, the Euler Discrete Multistep (EDM) solver (Karras et al., 2022) (first-order) and DPM-Solver++ (Lu et al., 2022b) (second-order), each configured with 50 sampling steps. All pipelines are implemented using the Huggingface Diffusers framework. Experiments with Flux.1-dev are executed on a single NVIDIA A100 GPU, while the remaining experiments are run on a single NVIDIA A5000 GPU.

Evaluation Metrics. We compare our proposed paradigm with widely adopted training-free acceleration strategies: DeepCache, AdaptiveDiffusion, and TeaCache (Ma et al., 2024b; Ye et al., 2024; Liu et al., 2025a). DeepCache caches and reuses the latent feature in the middle layer of the U-Net architecture. AdaptiveDiffusion skips the noise predictor and reuses the previously predicted noise guided by a third-order estimator. TeaCache introduces a caching threshold that measures the error accumulation. All experiments are conducted using the MS-COCO 2017 validation set as generation prompts under identical conditions to assess efficiency and quality. We report speedup ratios compared to the baseline as a measure of generation efficiency. Generation quality is evaluated using the Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS), and Fréchet Inception Distance (FID) between original generated and accelerated samples.

Table 1. Quantitative results on MS-COCO 2017 (Lin et al., 2014).

| Model | Scheduler | Method | PSNR ↑ | LPIPS ↓ | FID ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|
| SD-2 | DPM++ | DeepCache | 17.70 | 0.271 | 7.83 | 1.43× |
| SD-2 | DPM++ | AdaptiveDiffusion | 24.30 | 0.100 | 4.35 | 1.45× |
| SD-2 | DPM++ | SADA | 26.34 | 0.094 | 4.02 | 1.80× |
| SD-2 | Euler | DeepCache | 18.90 | 0.239 | 7.40 | 1.45× |
| SD-2 | Euler | AdaptiveDiffusion | 21.90 | 0.173 | 7.58 | 1.89× |
| SD-2 | Euler | SADA | 26.25 | 0.100 | 4.26 | 1.81× |
| SDXL | DPM++ | DeepCache | 21.30 | 0.255 | 8.48 | 1.74× |
| SDXL | DPM++ | AdaptiveDiffusion | 26.10 | 0.125 | 4.59 | 1.65× |
| SDXL | DPM++ | SADA | 29.36 | 0.084 | 3.51 | 1.86× |
| SDXL | Euler | DeepCache | 22.00 | 0.223 | 7.36 | 2.16× |
| SDXL | Euler | AdaptiveDiffusion | 24.33 | 0.168 | 6.11 | 2.01× |
| SDXL | Euler | SADA | 28.97 | 0.093 | 3.76 | 1.85× |
| Flux | Flow-matching | TeaCache | 19.14 | 0.216 | 4.89 | 2.00× |
| Flux | Flow-matching | SADA | 29.44 | 0.060 | 1.95 | 2.02× |

Table 2. Ablation study on few-step sampling across schedulers. Results on MS-COCO 2017.

| Model | Scheduler | Steps | PSNR ↑ | LPIPS ↓ | FID ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|
| SD-2 | DPM++ | 50 | 26.34 | 0.094 | 4.02 | 1.80× |
| SD-2 | DPM++ | 25 | 28.15 | 0.073 | 3.13 | 1.48× |
| SD-2 | DPM++ | 15 | 29.84 | 0.072 | 3.05 | 1.24× |
| SD-2 | Euler | 50 | 26.25 | 0.100 | 4.26 | 1.81× |
| SD-2 | Euler | 25 | 26.83 | 0.088 | 3.87 | 1.48× |
| SD-2 | Euler | 15 | 29.34 | 0.076 | 3.70 | 1.25× |
| SDXL | DPM++ | 50 | 29.36 | 0.084 | 3.51 | 1.86× |
| SDXL | DPM++ | 25 | 30.84 | 0.073 | 2.80 | 1.52× |
| SDXL | DPM++ | 15 | 31.91 | 0.073 | 2.54 | 1.29× |
| SDXL | Euler | 50 | 28.97 | 0.093 | 3.76 | 1.85× |
| SDXL | Euler | 25 | 29.42 | 0.085 | 3.13 | 1.50× |
| SDXL | Euler | 15 | 31.28 | 0.084 | 3.26 | 1.26× |

4.2. Main Results

As shown in Table 1, SADA consistently outperforms DeepCache, AdaptiveDiffusion, and TeaCache across all settings. Compared to other acceleration strategies, it drives FID down from 8.48 to 3.51 (a 59% reduction) on SDXL with DPM-Solver++, and on Flux.1-dev from 4.89 to 1.95 (a 60% reduction). Across every configuration, SADA maintains LPIPS ≤ 0.100 relative to the original samples, substantially better than competing methods, demonstrating only negligible perceptual deviation from the unmodified baseline. Crucially, these significant quality improvements incur no extra compute beyond the acceleration itself, yielding consistent 1.8-2× speedups and outperforming the competing methods in the majority of cases, underscoring SADA's effectiveness as a loss-free acceleration framework.

4.3. Ablation Studies

We provide justification for our choice of base step T = 50 in Figure A.3. To evaluate SADA's performance under few-step sampling, Table 2 reports results on {SD-2, SDXL} using {Euler, DPM-Solver++} while varying the number of inference steps {50, 25, 15}. As the step count decreases, SADA achieves higher similarity: PSNR rises from 26.25 dB to 29.34 dB, FID falls from 4.26 to 3.70, and LPIPS drops from 0.100 to 0.076. We attribute this trend to reduced error accumulation when using fewer steps. Meanwhile, SADA can further accelerate sampling in the few-step setting: the speedup ratio remains about 1.5× with 25 denoising steps and about 1.25× with 15 denoising steps, highlighting SADA's effectiveness under a low computational budget. Note that the Lagrange interpolation parameters are slightly adjusted to match the shorter denoising schedules in these few-step settings.
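For reference, the faithfulness metrics in Tables 1 and 2 can be reproduced with a few lines. This is a sketch using the third-party `lpips` package (an assumption about tooling; the paper does not specify its implementation), with images scaled to [-1, 1]. FID requires a feature-distribution comparison and is omitted here.

```python
import torch
import lpips  # third-party package: pip install lpips

# PSNR and LPIPS between baseline and accelerated samples.
# `ref` and `fast` are (B, 3, H, W) image tensors in [-1, 1].
lpips_fn = lpips.LPIPS(net='alex')

def faithfulness(ref: torch.Tensor, fast: torch.Tensor):
    mse = torch.mean((ref - fast) ** 2)
    psnr = 10.0 * torch.log10(4.0 / mse)   # peak-to-peak range 2 => peak^2 = 4
    return psnr.item(), lpips_fn(ref, fast).mean().item()
```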
4.4. Downstream Tasks and Data Modalities

This section demonstrates that SADA has the potential to accelerate any generative model with an iterative generative process, regardless of the downstream task or data modality.

Figure 6. SADA deployment on MusicLDM with different text prompts. SADA accelerates MusicLDM by 1.81× while maintaining the spectrogram LPIPS under 0.020.

Figure 7. SADA deployment on ControlNet. We demonstrate the SD-1.5-based ControlNet pipeline trained on canny edges as conditional input. SADA accelerates ControlNet by 1.41× while preserving fidelity.

Data Modality. We evaluate SADA on music and audio generation using the MusicLDM (Chen et al., 2024c) pipeline to synthesize 8-second clips. We compare both perceptual quality and spectrogram similarity between accelerated samples and the unmodified baseline. As shown in Figure 6, SADA achieves an LPIPS of 0.010 between accelerated and baseline audio while reducing sampling time by 1.81×. This implies that SADA has the potential to be implemented in future baseline models for any-modality generation with little to no modification.

Downstream Task. To validate cross-pipeline compatibility, we apply SADA directly on top of ControlNet (Zhang et al., 2023) without any additional fine-tuning or architectural changes. Figure 7 demonstrates that SADA preserves the faithfulness of ControlNet-conditioned outputs while accelerating sampling by 1.41×. This implies that SADA could be seamlessly deployed in professional workflows.

5. Conclusion

We present Stability-guided Adaptive Diffusion Acceleration (SADA), a training-free paradigm that adaptively accelerates the sampling process of ODE-based generative models (mainly Diffusion and Flow-matching). Leveraging the trajectory gradient calculated by the numerical solver, SADA dynamically exploits step-wise and token-wise sparsity during sampling with principled, error-bounded approximation, bridging the sampling trajectory and sparsity-aware optimizations. Extensive experiments on SD-2, SDXL, and Flux across EDM and DPM++ solvers demonstrate consistent ≥ 1.8× speedups with negligible fidelity loss (LPIPS ≤ 0.10, FID ≤ 4.5) compared to unmodified baselines, significantly outperforming existing strategies. Moreover, we show that SADA generalizes across modalities, achieving 1.81× acceleration on MusicLDM and 1.41× on ControlNet without additional tuning.

Acknowledgment

This material is based upon work supported by the U.S. National Science Foundation under award No. 2112562. This work is also supported by ARO W911NF-23-2-0224 and NAIRR Pilot project NAIRR240270. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the U.S. National Science Foundation, ARO, NAIRR, and their contractors. In addition, we thank the area chair and reviewers for their valuable comments.

Impact Statement

This paper aims to advance the field of adaptive acceleration of generative models. We believe that our work has significant potential impact for the deployment and inference of generative models in any modality.
While there are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

Black-Forest-Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

Bolya, D. and Hoffman, J. Token merging for fast stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4599-4603, 2023.

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., and Li, Z. PixArt-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pp. 74-91. Springer, 2024a.

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wang, Z., Kwok, J. T., Luo, P., Lu, H., and Li, Z. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International Conference on Learning Representations, 2024b.

Chen, K., Wu, Y., Liu, H., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1206-1210. IEEE, 2024c.

Chen, P., Shen, M., Ye, P., Cao, J., Tu, C., Bouganis, C.-S., Zhao, Y., and Chen, T. Δ-DiT: A training-free acceleration method tailored for diffusion transformers. arXiv preprint arXiv:2406.01125, 2024d.

Chen, S., Daras, G., and Dimakis, A. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for DDIM-type samplers. In International Conference on Machine Learning, pp. 4462-4484. PMLR, 2023.

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

Feller, W. et al. An Introduction to Probability Theory and Its Applications. 1971.

Gao, P., Zhuo, L., Liu, D., Du, R., Luo, X., Qiu, L., Zhang, Y., Lin, C., Huang, R., Geng, S., et al. Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Iserles, A. A First Course in the Numerical Analysis of Differential Equations. Number 44. Cambridge University Press, 2009.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565-26577, 2022.

Kim, D., Sony, A., Lai, C.-H., Liao, W.-H., Murata, N., Takida, Y., He, Y., Mitsufuji, Y., and Ermon, S. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion.
Kim, M., Gao, S., Hsu, Y.-C., Shen, Y., and Jin, H. Token fusion: Bridging the gap between token pruning and token merging. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1383-1392, 2024.

Kingma, D. and Gao, R. Understanding diffusion objectives as the ELBO with simple data augmentation. Advances in Neural Information Processing Systems, 36:65484-65516, 2023.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

Liu, E., Ning, X., Yang, H., and Wang, Y. A unified sampling framework for solver searching of diffusion probabilistic models. In The Twelfth International Conference on Learning Representations, 2023.

Liu, F., Zhang, S., Wang, X., Wei, Y., Qiu, H., Zhao, Y., Zhang, Y., Ye, Q., and Wan, F. Timestep embedding tells: It's time to cache for video diffusion model. arXiv preprint arXiv:2411.19108, 2024.

Liu, F., Zhang, S., Wang, X., Wei, Y., Qiu, H., Zhao, Y., Zhang, Y., Ye, Q., and Wan, F. Timestep embedding tells: It's time to cache for video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025a. URL https://cvpr.thecvf.com/virtual/2025/poster/33872.

Liu, H., Zhang, W., Xie, J., Faccio, F., Xu, M., Xiang, T., Shou, M. Z., Perez-Rua, J.-M., and Schmidhuber, J. Faster diffusion through temporal attention decomposition. Transactions on Machine Learning Research, 2025b. ISSN 2835-8856. URL https://openreview.net/forum?id=xXs2GKXPnH.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

Lu, C. and Song, Y. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775-5787, 2022a.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.

Ma, X., Fang, G., Bi Mi, M., and Wang, X. Learning-to-cache: Accelerating diffusion transformer via layer caching. Advances in Neural Information Processing Systems, 37:133282-133304, 2024a.

Ma, X., Fang, G., and Wang, X. DeepCache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15762-15772, 2024b.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, pp. 1-31, 2024.
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195-4205, 2023.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pp. 234-241. Springer, 2015.

Saghatchian, O., Moghadam, A. G., and Nickabadi, A. Cached adaptive token merging: Dynamic token reduction and redundant computation elimination in diffusion model. arXiv preprint arXiv:2501.00946, 2025.

Shen, M., Chen, P., Ye, P., Xia, G., Chen, T., Bouganis, C.-S., and Zhao, Y. MD-DiT: Step-aware mixture-of-depths for efficient diffusion transformers. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, a.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, b.

Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In International Conference on Machine Learning, pp. 32211-32252. PMLR, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, G., Cai, Y., Peng, W., Su, S.-Z., et al. PFDiff: Training-free acceleration of diffusion models combining past and future scores. In The Thirteenth International Conference on Learning Representations, 2024.

Wimbauer, F., Wu, B., Schoenfeld, E., Dai, X., Hou, J., He, Z., Sanakoyeu, A., Zhang, P., Tsai, S., Kohler, J., et al. Cache me if you can: Accelerating diffusion models through block caching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6211-6220, 2024.

Yang, Z., Feng, R., Zhang, H., Shen, Y., Zhu, K., Huang, L., Zhang, Y., Liu, Y., Zhao, D., Zhou, J., et al. Lipschitz singularities in diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
Ye, H., Yuan, J., Xia, R., Yan, X., Chen, T., Yan, J., Shi, B., and Zhang, B. Training-free adaptive diffusion with bounded difference approximation strategy. arXiv preprint arXiv:2410.09873, 2024.

Yuan, Z., Zhang, H., Lu, P., Ning, X., Zhang, L., Zhao, T., Yan, S., Dai, G., and Wang, Y. DiTFastAttn: Attention compression for diffusion transformer models. arXiv preprint arXiv:2406.08552, 2024.

Zhang, E., Xiao, B., Tang, J., Ma, Q., Zou, C., Ning, X., Hu, X., and Zhang, L. Token pruning for caching better: 9 times acceleration on stable diffusion for free. arXiv preprint arXiv:2501.00375, 2024.

Zhang, H., Su, R., Yuan, Z., Chen, P., Fan, M. S. Y., Yan, S., Dai, G., and Wang, Y. DiTFastAttnV2: Head-wise attention compression for multi-modality diffusion transformers, 2025. URL https://arxiv.org/abs/2503.22796.

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836-3847, 2023.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-595, 2018.

Zhao, X., Jin, X., Wang, K., and You, Y. Real-time video generation with pyramid attention broadcast. CoRR, 2024.

Zhen, T., Cao, J., Sun, X., Pan, J., Ji, Z., and Pang, Y. Token-aware and step-aware acceleration for stable diffusion. Pattern Recognition, 164:111479, 2025.

Zheng, K., Lu, C., Chen, J., and Zhu, J. DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics. Advances in Neural Information Processing Systems, 36:55502-55542, 2023.

Zou, C., Liu, X., Liu, T., Huang, S., and Zhang, L. Accelerating diffusion transformers with token-wise feature caching. arXiv preprint arXiv:2410.05317, 2024.

A. Appendix

Due to the page limit of the main manuscript, we provide more theoretical and implementation details in the appendix, organized as follows:

Sec. B: Mathematical Foundations
Sec. B.1: Theoretical Assumptions.
Sec. B.2: Proofs of Main Theorems.
Sec. C: Analysis on Existing Token Reduction Methods.
Sec. D: Additional Experiments.

B. Mathematical Foundations

B.1. Theoretical Assumptions

Assumption 1. The network output $\epsilon_\theta(x_t, t)$ is jointly Lipschitz continuous in both $x_t$ and $t$. In our experiments, we skip the first and last time steps to avoid potential issues with infinite Lipschitz constants (Yang et al., 2023) near the boundaries.

B.2. Proofs of Main Theorems

Theorem 3.2. Let $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $x_0 \sim p(x_0)$ is independent of $\epsilon$. Then the expected value of $x_t$ over the joint distribution of $x_0$, $\epsilon$, and timestep $t$ satisfies

$$\mathbb{E}_{x_0,\epsilon,t}[x_t] = \sqrt{\bar{\alpha}_t}\, \mathbb{E}_{x_0}[x_0]. \quad (A.1)$$

Proof. By the definition of the forward process in diffusion models, the latent variable at timestep $t$ is given by:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon. \quad (A.2)$$

Taking expectation over $x_0 \sim p(x_0)$, $\epsilon \sim \mathcal{N}(0, I)$, and $t \sim \text{Uniform}(\{1, \dots, T\})$, we get:

$$\mathbb{E}_{x_0,\epsilon,t}[x_t] = \mathbb{E}_t\left[\mathbb{E}_{x_0}\big[\sqrt{\bar{\alpha}_t}\, x_0\big] + \mathbb{E}_\epsilon\big[\sqrt{1-\bar{\alpha}_t}\, \epsilon\big]\right]. \quad (A.3)$$

Since $\epsilon \sim \mathcal{N}(0, I)$, we have $\mathbb{E}[\epsilon] = 0$, and thus:

$$\mathbb{E}_{x_0,\epsilon,t}[x_t] = \mathbb{E}_t\left[\sqrt{\bar{\alpha}_t}\right] \mathbb{E}_{x_0}[x_0] = \sqrt{\bar{\alpha}_t}\, \mathbb{E}_{x_0}[x_0]. \quad (A.4)$$

This concludes the proof.

Theorem 3.3 (Consistency of the network estimator under trajectory approximation). Let the network $\epsilon_\theta(x, t)$ be trained with the standard mean-squared error (MSE) objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right], \quad x_t = \alpha_t x_0 + \sigma_t \epsilon. \quad (A.5)$$

Suppose $\theta^*$ minimizes $\mathcal{L}$, and the training is sufficiently converged.
Then, following Assumption 1 that $\epsilon_\theta(x_t, t)$ is Lipschitz in $x_t$ and $t$, we have the following consistency property for sampling-time inputs $\hat{x}_t$:

$$\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(\hat{x}_t, t)] \to 0 \quad \text{as} \quad \|\hat{x}_t - x_t\| \to 0. \quad (A.6)$$

Proof. Since $x_t$ is a linear deterministic combination of $(x_0, \epsilon)$ at time step $t$, the distribution $p(x_t, t)$ is induced from the joint distribution of $x_0$, $\epsilon$, and $t$. Therefore, minimizing the MSE loss over $(x_0, \epsilon, t)$ is equivalent to minimizing it over $(x_t, t)$. The optimal solution in the $L_2$ sense is the conditional expectation:

$$\epsilon_{\theta^*}(x_t, t) = \mathbb{E}[\epsilon \mid x_t, t]. \quad (A.7)$$

Taking expectation again:

$$\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(x_t, t)] = 0. \quad (A.8)$$

Now, for any approximation $\hat{x}_t$, by Lipschitz continuity:

$$\|\epsilon_{\theta^*}(\hat{x}_t, t) - \epsilon_{\theta^*}(x_t, t)\| \le L\, \|\hat{x}_t - x_t\|. \quad (A.9)$$

Hence,

$$\left\|\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(\hat{x}_t, t)]\right\| \le \left\|\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(x_t, t)]\right\| + \mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon_{\theta^*}(x_t, t) - \epsilon_{\theta^*}(\hat{x}_t, t)\|\right]. \quad (A.10)$$

The first term is zero; the second vanishes as $\hat{x}_t \to x_t$. Therefore,

$$\mathbb{E}_{x_0,\epsilon,t}[\epsilon - \epsilon_{\theta^*}(\hat{x}_t, t)] \to 0 \quad (A.11)$$

as $\|\hat{x}_t - x_t\| \to 0$.

Theorem 3.7 (Lagrange Interpolation for Cache-Assisted Reconstruction). Let $I = \{0, 1, \dots, k\}$ be $k+1$ distinct indexes with known cached values $\{x_0^{t_i}\}_{i \in I}$. For any skipped timestep $t \notin \{t_i\}_{i \in I}$, define the interpolated reconstruction as:

$$\hat{x}_0^t = \sum_{i \in I} \left( \prod_{j \in I,\, j \neq i} \frac{t - t_j}{t_i - t_j} \right) x_0^{t_i}. \quad (A.12)$$

Then, under the assumption that $x_0^\tau$ is $(k+1)$-times continuously differentiable over $\tau \in [t_{\min}, t_{\max}]$, the interpolation error satisfies:

$$\|\hat{x}_0^t - x_0^t\| = O(h^{k+1}), \quad (A.13)$$

where $h$ is the maximum step spacing among $\{t_i\}$.

Proof. The expression in Equation (A.12) is the Lagrange interpolation formula for approximating a function at an unobserved point $t$ using $k+1$ known values $\{x_0^{t_i}\}$ at points $\{t_i\}_{i \in I}$. By classical interpolation theory, the interpolation error at $t$ for a $(k+1)$-times continuously differentiable function $x_0^\tau$ satisfies:

$$x_0^t - \hat{x}_0^t = \frac{1}{(k+1)!} \frac{d^{k+1} x_0^\xi}{d\xi^{k+1}} \prod_{i \in I} (t - t_i), \quad \text{for some } \xi \in [t_{\min}, t_{\max}]. \quad (A.14)$$

Taking norms and bounding the derivative gives:

$$\|\hat{x}_0^t - x_0^t\| \le \frac{1}{(k+1)!} \max_\xi \left\| \frac{d^{k+1} x_0^\xi}{d\xi^{k+1}} \right\| \prod_{i \in I} |t - t_i|. \quad (A.15)$$

If all time steps are approximately uniformly spaced with step size $h$, then $|t - t_i| \le C h$ for some constant $C$, and hence:

$$\|\hat{x}_0^t - x_0^t\| = O(h^{k+1}), \quad (A.16)$$

as claimed.

Theorem 3.1 (Backward Extrapolation via Lagrange Interpolation). Let $f \in C^k[a, b]$ be a smooth function and let $x_0 := x$, with equally spaced grid points $x_i := x + ih$ for $i = 0, 1, \dots, k-1$. Define $P_{k-1}(t)$ as the degree-$(k-1)$ Lagrange interpolant of $f$ at $\{x_i\}_{i=0}^{k-1}$. Then, the extrapolated value at $x - h$ satisfies:

$$f(x-h) = \sum_{i=0}^{k-1} \alpha_i f(x_i) + R_k(h), \quad (A.17)$$

where the weights are given by:

$$\alpha_i = (-1)^i \binom{k}{i+1}, \quad i = 0, 1, \dots, k-1, \quad (A.18)$$

and the remainder term is:

$$R_k(h) = \frac{f^{(k)}(\xi)}{k!} \prod_{j=0}^{k-1} (x - h - x_j), \quad \xi \in [x-h,\, x+(k-1)h], \quad (A.19)$$

and the error bound of the remainder term is:

$$R_k(h) = O(h^k). \quad (A.20)$$

Proof. Let $x_i := x + ih$ for $i = 0, \dots, k-1$. We construct the Lagrange interpolating polynomial:

$$P_{k-1}(t) = \sum_{i=0}^{k-1} f(x_i)\, \ell_i(t), \quad (A.21)$$

where the Lagrange basis functions are:

$$\ell_i(t) = \prod_{j \neq i} \frac{t - x_j}{x_i - x_j}. \quad (A.22)$$

Evaluating at $t = x - h$ gives:

$$f(x-h) = P_{k-1}(x-h) + R_k(h), \quad (A.23)$$

where the remainder term is the standard Lagrange error:

$$R_k(h) = \frac{f^{(k)}(\xi)}{k!} \prod_{j=0}^{k-1} (x - h - x_j), \quad \xi \in [x-h,\, x+(k-1)h]. \quad (A.24)$$

Since $x_j = x + jh$, we have $\prod_{j=0}^{k-1} (x - h - x_j) = \prod_{j=0}^{k-1} \big({-}(j+1)h\big) = (-1)^k\, k!\, h^k$, and hence $R_k(h) = O(h^k)$. Now compute the weights $\alpha_i := \ell_i(x-h)$:

$$\ell_i(x-h) = \prod_{j \neq i} \frac{(x-h) - (x+jh)}{(x+ih) - (x+jh)} = \prod_{j \neq i} \frac{-(j+1)h}{(i-j)h}. \quad (A.25)$$

Canceling $h$ and simplifying signs:

$$\ell_i(x-h) = (-1)^i \frac{k!}{(i+1)!\,(k-1-i)!} = (-1)^i \binom{k}{i+1}. \quad (A.26)$$

Therefore:

$$f(x-h) = \sum_{i=0}^{k-1} (-1)^i \binom{k}{i+1} f(x_i) + R_k(h), \quad (A.27)$$

as claimed.
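The closed-form weights of Theorem 3.1 are easy to check numerically. The snippet below is an illustrative verification, not part of the method: it extrapolates a smooth function to $x - h$ with $\alpha_i = (-1)^i\binom{k}{i+1}$ and confirms the $O(h^k)$ error.

```python
import numpy as np
from math import comb

# Numerical check of Theorem 3.1: extrapolating a smooth f to x - h with
# alpha_i = (-1)^i * C(k, i+1) reproduces f(x - h) up to O(h^k).
f = np.cos
x, h, k = 0.3, 1e-2, 4
alphas = [(-1) ** i * comb(k, i + 1) for i in range(k)]   # [4, -6, 4, -1]
approx = sum(a * f(x + i * h) for i, a in enumerate(alphas))
print(abs(approx - f(x - h)))   # on the order of h^k = 1e-8
```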
Proposition B.1 (High-order backward difference as weighted combination). Let $\alpha_i := (-1)^i \binom{k}{i+1}$ be the extrapolation coefficients in Theorem 3.1. Then the following linear combination defines the $k$-th order backward finite difference:

$$f(x-h) - \sum_{i=0}^{k-1} \alpha_i f(x+ih) = \Delta^{(k)} f(x-h), \quad (A.28)$$

where $\Delta^{(k)}$ is the standard $k$-th order backward difference operator anchored at $x-h$:

$$\Delta^{(k)} f(x-h) := \sum_{i=0}^{k} (-1)^i \binom{k}{i} f\big(x + (i-1)h\big). \quad (A.29)$$

Proof. The result follows directly by substituting the expression for $\alpha_i$ from Theorem 3.1 into Equation (A.28) and matching terms with the standard definition of $\Delta^{(k)}$.

Theorem B.2 (Adams-Moulton Method via Forward Lagrange Quadrature). Let $y(x) \in C^k[a, b]$ and let $x_0 := x$ with equally spaced grid points $x_i := x + ih$ for $i = 0, 1, \dots, k-1$. Let $P_{k-1}(t)$ be the degree-$(k-1)$ Lagrange interpolant of $f$, the derivative of $y$, at $\{x_i\}_{i=0}^{k-1}$. Define the one-step backward value through the integral:

$$y(x-h) = y(x) - \int_{x-h}^{x} f(s)\, ds. \quad (A.30)$$

Then, the Adams-Moulton method of order $k$ is:

$$\hat{y}_{n-1} = y_n - h\left[\beta_{-1} f_{n-1} + \sum_{i=0}^{k-1} \beta_i f_{n+i}\right], \quad (A.31)$$

where $y_{n+j} = y(x_j)$, $f_{n+j} = f(x_j)$, and the quadrature weights $\beta_{-1}, \beta_0, \dots, \beta_{k-1}$ are given by:

$$\beta_j = \int_{-1}^{0} \ell_j(\tau)\, d\tau, \quad \text{for } j = -1, 0, \dots, k-1, \quad (A.32)$$

where $\ell_j(\tau)$ are the Lagrange basis polynomials defined on the nodes $\tau_j = j$, with $\tau_{-1} = -1$ for the future value $y_{n-1}$. The local truncation error is:

$$\hat{y}_{n-1} - y_{n-1} = O(h^{k+1}). \quad (A.33)$$

Proof. We aim to approximate the integral:

$$y(x-h) = y(x) - \int_{x-h}^{x} f(s)\, ds. \quad (A.34)$$

Let us define the variable substitution $s = x + \tau h$, so that:

$$\int_{x-h}^{x} f(s)\, ds = h \int_{-1}^{0} f(x + \tau h)\, d\tau. \quad (A.35)$$

Let $\tau_j = j$ for $j = 0, \dots, k-1$ and $\tau_{-1} = -1$ (the future point), and define the Lagrange basis polynomials:

$$\ell_j(\tau) = \prod_{i \neq j} \frac{\tau - \tau_i}{\tau_j - \tau_i}. \quad (A.36)$$

Let $P_k(\tau) = \sum_{j=-1}^{k-1} f(x + \tau_j h)\, \ell_j(\tau)$ be the Lagrange interpolant of $f(x + \tau h)$ using the nodes $\tau_j$. Now approximate the integral:

$$\int_{-1}^{0} f(x + \tau h)\, d\tau \approx \int_{-1}^{0} P_k(\tau)\, d\tau = \sum_{j=-1}^{k-1} \left[\int_{-1}^{0} \ell_j(\tau)\, d\tau\right] f(x + \tau_j h). \quad (A.37)$$

Defining

$$\beta_j = \int_{-1}^{0} \ell_j(\tau)\, d\tau, \quad j = -1, 0, \dots, k-1, \quad (A.38)$$

we get:

$$\int_{x-h}^{x} f(s)\, ds \approx h \sum_{j=-1}^{k-1} \beta_j f(x + jh) = h\left[\beta_{-1} f(x-h) + \sum_{j=0}^{k-1} \beta_j f(x + jh)\right]. \quad (A.39)$$

Substituting into the ODE integral:

$$\hat{y}_{n-1} = y_n - h\left[\beta_{-1} f_{n-1} + \sum_{j=0}^{k-1} \beta_j f_{n+j}\right], \quad (A.40)$$

which is exactly the Adams-Moulton method of order $k$. Finally, since this is based on Lagrange interpolation over $k+1$ nodes (including the implicit node at the future time), the quadrature error is $O(h^{k+1})$, leading to a method of order $k$.

We now instantiate the backward Adams-Moulton method of order 3 from Theorem B.2, and quantify its effect in discrete difference form, including error propagation.

Theorem 3.5 (Third-order backward difference via Adams-Moulton estimation). Let $x_t$ be the trajectory satisfying the ODE $\frac{dx_t}{dt} = y_t$. Consider estimating $x_{t-1}$ by the third-order backward difference $\Delta^{(3)}x_{t-1}$ using the forward values $x_t, x_{t+1}, x_{t+2}$, where the step size is $\Delta t$. Furthermore, using the second- and third-order Adams-Moulton methods, we define the estimator:

$$\hat{x}_{t-1} := x_t - \frac{5\Delta t}{6}\, y_t - \frac{5\Delta t}{6}\, y_{t+1} + \frac{2\Delta t}{3}\, y_{t+2}. \quad (A.41)$$

This estimate is consistent with the discrete ODE, and the local truncation error satisfies:

$$\|\hat{x}_{t-1} - x_{t-1}\| = O(\Delta t^2). \quad (A.42)$$

Proof. By Theorem B.2, the third-order Adams-Moulton method in reverse time (backward extrapolation) reads:

$$x_{t-1} - x_t = -\frac{\Delta t}{12}\left(8 y_t - y_{t+1} + 5 y_{t-1}\right). \quad (A.43)$$

In our scenario, we aim to express everything in terms of $x_t, x_{t+1}, x_{t+2}$ and $y_t, y_{t+1}, y_{t+2}$. Using the identity:

$$\Delta^{(3)}x_{t-1} = x_{t-1} - 3x_t + 3x_{t+1} - x_{t+2}, \quad (A.44)$$

we reverse it to get the original estimation of $x_{t-1}$:

$$x_{t-1} \approx 3x_t - 3x_{t+1} + x_{t+2} = x_t + 2(x_t - x_{t+1}) - (x_{t+1} - x_{t+2}). \quad (A.45)$$

Then we plug in higher-order corrections from the Adams-Moulton method:

$$x_t - x_{t+1} = -\frac{\Delta t}{12}\left(5 y_t + 8 y_{t+1} - y_{t+2}\right) + O(\Delta t^3), \qquad x_{t+1} - x_{t+2} = -\frac{\Delta t}{2}\left(y_{t+1} + y_{t+2}\right) + O(\Delta t^2). \quad (A.46)$$
Combining these, we define the estimator $\hat{x}_{t-1}$:

$$\hat{x}_{t-1} = x_t - \frac{5\Delta t}{6}\, y_t - \frac{5\Delta t}{6}\, y_{t+1} + \frac{2\Delta t}{3}\, y_{t+2}. \quad (A.47)$$

Thus we have the error bound:

$$\|\hat{x}_{t-1} - x_{t-1}\| = \left\| x_t - \frac{5\Delta t}{6} y_t - \frac{5\Delta t}{6} y_{t+1} + \frac{2\Delta t}{3} y_{t+2} - x_{t-1} \right\| = \left\| x_{t-1} + O(\Delta t^2) - x_{t-1} \right\| = O(\Delta t^2), \quad (A.48)$$

as claimed.

Theorem 3.6 (Error bound for final reconstruction $\hat{x}_0^t$). Let $\hat{x}_0^t$ denote the reconstruction of $x_0$ at time $t$, obtained using the extrapolated estimate $\hat{x}_{t-1}$ from Theorem 3.5, which incurs an error of order $O(\Delta t^2)$. Then, the final reconstruction $\hat{x}_0^t$ satisfies the following error bound:

$$\|\hat{x}_0^t - x_0^t\| = O(\Delta t) + O(\Delta x_t). \quad (A.49)$$

Proof. This follows from the linearity of the prediction formula for $\hat{x}_0^t$, which is a linear combination of $x_t$ and $\hat{\epsilon}_t$. Therefore, following Assumption 1 that $\epsilon_\theta(x_t, t)$ is Lipschitz in $x_t$ and $t$:

$$\|\hat{x}_0^t - x_0^t\| \le C_1 \|\hat{x}_t - x_t\| + C_2 \|\hat{\epsilon}_t - \epsilon_t\| = O(\Delta t) + O(\Delta x_t), \quad (A.50)$$

where $C_1$ and $C_2$ are constants depending on the schedule.

C. Analysis on Existing Token Reduction Methods

Compared to the baseline, token merging with a high merging ratio significantly loses detailed information, while token pruning at the same ratio partially preserves fine-grained details. However, pruning still leads to the loss of a substantial amount of information in the generated image. In this section, we provide a short analysis of token merging, token pruning, and the unmerging process (a small numerical illustration appears at the end of the appendix).

Token merging. We represent the token-merging procedure via an $N' \times N$ matrix $M$:

$$M_{ji} = \begin{cases} \frac{1}{|S_j|}, & \text{if } i \in S_j, \\ 0, & \text{otherwise}, \end{cases}$$

where each set $S_j \subseteq \{1, \dots, N\}$ groups tokens deemed similar. The merged sequence $\tilde{x}$ is then obtained by $\tilde{x} = Mx$. Within each subset $S_j$, the operation is an average: $h_j = \frac{1}{|S_j|}\mathbf{1}^\top$. From a signal-processing perspective, such an averaging operation has a low-pass frequency response described by the sinc function:

$$H(u) = \frac{\sin\!\left(\pi u |S_j|\right)}{\pi u |S_j|}.$$

Hence, merging tokens suppresses high-frequency components while retaining lower-frequency content.

Token pruning. Token pruning also causes information loss, as any unique features captured by the pruned tokens are no longer available for subsequent computations. Since it removes tokens without averaging, token pruning does not exhibit the low-pass filtering effect, resulting in the loss of specific spatial information rather than a smoothing of the input sequence.

Unmerging process. Furthermore, due to the non-linear mapping of self-attention, the high similarity of two input tokens $x[i], x[j]$ (where $i \in S_j$) does not guarantee high similarity of the corresponding outputs after the attention computation. Therefore, the unmerging procedure, which duplicates the processed merged tokens back to their original positions, results in an inherent downsampling effect. This duplication does not recover the high-frequency details lost during merging, thus further degrading the detailed features in the output.

D. Additional Experiments

Figure A.1. Comparison between SADA and TeaCache on Flux.1-dev. Our method shows significantly better faithfulness under an identical speedup ratio of 2.0×.

Figure A.2. Comparison between SADA and AdaptiveDiffusion on SDXL with 50-step DPM++. Our method shows better faithfulness with a much faster speedup of 1.81× vs. 1.65×.

Figure A.3. The generated samples from DPM-Solver first change dramatically, then demonstrate convergence after setting the sample step to 50. Therefore, we believe our baseline setting is reasonable.
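Finally, as referenced in Sec. C, a small numerical illustration of the low-pass claim: averaging $m = |S_j|$ tokens has a sinc-shaped magnitude response that decays with normalized spatial frequency, whereas pruning applies no such smoothing. The value of `m` below is an example, not taken from the experiments.

```python
import numpy as np

# Low-pass behaviour of token merging (Sec. C): an m-token average has a
# sinc-shaped magnitude response over normalized frequency u.
m = 4                                   # example merge-group size |S_j|
u = np.linspace(0.0, 0.5, 6)            # normalized frequency samples
H = np.abs(np.sinc(u * m))              # np.sinc(x) = sin(pi x) / (pi x)
for ui, Hi in zip(u, H):
    print(f"u = {ui:.1f}  |H(u)| = {Hi:.3f}")  # ~1 at DC, 0 at u = 0.5
```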