Temporal Difference Flows

Jesse Farebrother 1 2, Matteo Pirotta 3, Andrea Tirinzoni 3, Rémi Munos 3, Alessandro Lazaric 3, Ahmed Touati 3

Abstract

Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks, including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.

1. Introduction

Predictive modeling lies at the heart of intelligent decision-making, enabling agents to reason and plan in complex environments. In Reinforcement Learning (RL), this predictive capability has traditionally been achieved through world models that capture the transition structure of the environment. These models have enabled significant advances across numerous domains, from robotic manipulation employing model-predictive control (Sikchi et al., 2021; Hafner et al., 2023; Hansen et al., 2022; 2024), to sample-efficient exploration strategies (Schmidhuber, 1991; Stadie et al., 2016; Pathak et al., 2017), and sophisticated planning algorithms (Silver et al., 2016; 2017; Schrittwieser et al., 2020).

Work done at Meta. 1McGill University, 2Mila - Québec AI Institute, 3FAIR at Meta. Correspondence to: Jesse Farebrother, Ahmed Touati. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

However, while world models have demonstrated impressive results, they face fundamental limitations when deployed for long-horizon reasoning. The standard approach of unrolling predictions step-by-step leads to compounding errors, as small inaccuracies in each prediction accumulate and propagate forward in time (Talvitie, 2014; Jafferjee et al., 2020; Lambert et al., 2022). This curse of horizon presents a significant challenge for applications requiring reliable long-range predictions. An alternative approach is to learn a generative model of future states directly, avoiding compounding errors during inference. These models, usually referred to as Geometric Horizon Models (GHM; Thakoor et al., 2022) or gamma-models (Janner et al., 2020), are learned by leveraging the temporal difference structure of the successor measure (Blier et al., 2021).
However, their reliance on bootstrapped predictions during training can lead to instability and growing inaccuracy over long horizons. As a result, current methods struggle to make accurate predictions beyond 20-50 steps, limiting their utility for long-term decision-making.

In this paper, we show that while state-of-the-art generative methods like flow matching (Lipman et al., 2023) and denoising diffusion (Ho et al., 2020) cannot be directly applied to learn long-horizon GHMs, their iterative nature can be leveraged to better exploit the temporal difference structure of the problem. This insight yields a new class of methods that provably converges to the successor measure while reducing the variance of their sample-based gradient estimates, enabling stable long-horizon predictions. Empirically, our approach produces significantly more accurate GHMs at all horizons, consistently outperforming state-of-the-art algorithms across domains and metrics, including prediction accuracy, value function estimation, and generalized policy improvement.

2. Background

In the following, we use capital letters to denote random variables, sans-serif fonts for sets, and $\mathcal{P}(A)$ to denote the space of probability measures over a measurable set $A$.

Markov Decision Process
We consider a reward-free discounted Markov decision process $M = (S, A, P, \gamma)$, which characterizes the dynamics of a sequential decision-making problem. At each step, the agent selects an action $a \in A$ in state $s \in S$ according to its policy $\pi : S \to A$. This action influences the transition to the next state $s' \in S$, governed by the transition kernel $P : S \times A \to \mathcal{P}(S)$, which defines a probability measure over successor states. The discount factor $\gamma \in [0, 1)$ can be interpreted as implying a process that either continues with probability $\gamma$ or terminates with probability $1 - \gamma$. This interpretation naturally defines a geometric distribution over the future states the agent will occupy, where states reached after $k$ steps are discounted by $\gamma^k$.

Successor Measure
The normalized successor measure (Dayan, 1993; Blier et al., 2021) of a policy $\pi$ describes the discounted distribution of future states visited by $\pi$ starting from an initial state-action pair $(s, a)$. For a measurable subset $X \subseteq S$, the successor measure $m^\pi(X \mid s, a)$ represents the probability that future states fall within $X$, geometrically discounted by $\gamma$ according to the time of visitation. Formally, it is defined as:

$$m^\pi(X \mid s, a) = (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k \Pr(S_{k+1} \in X \mid S_0 = s, A_0 = a, \pi), \quad (1)$$

where $\Pr(\cdot \mid S_0, A_0, \pi)$ denotes the probability over state-action sequences $(S_k, A_k)_{k \ge 0}$ generated from $(S_0, A_0)$ following $S_k \sim P(\cdot \mid S_{k-1}, A_{k-1})$ and $A_k = \pi(S_k)$. The successor measure encapsulates the long-term dynamics of $\pi$, enabling value estimation for any reward function $r : S \to \mathbb{R}$. Specifically, the value of taking action $a \in A$ in state $s \in S$ is the expected reward under states visited by $\pi$, amplified by the effective horizon $(1-\gamma)^{-1}$:

$$Q^\pi(s, a) = (1 - \gamma)^{-1}\, \mathbb{E}_{X \sim m^\pi(\cdot \mid s, a)}[r(X)]. \quad (2)$$

Moreover, $m^\pi$ is the fixed point of the Bellman operator $\mathcal{T}^\pi : \mathcal{P}(S)^{S \times A} \to \mathcal{P}(S)^{S \times A}$ (Thakoor et al., 2022):

$$m^\pi(\cdot \mid s, a) = (\mathcal{T}^\pi m^\pi)(\cdot \mid s, a) := (1 - \gamma) P(\cdot \mid s, a) + \gamma\, (P^\pi m^\pi)(\cdot \mid s, a). \quad (3)$$

The operator $P^\pi$ applied to $m$ mixes the one-step kernel with the successor measure, accounting for transitioning from $(s, a)$ to a new state-action pair $(s', \pi(s'))$ and querying the successor measure $m(\cdot \mid s', \pi(s'))$ thereafter:

$$(P^\pi m)(dx \mid s, a) = \int_{s'} P(ds' \mid s, a)\, m(dx \mid s', \pi(s')).$$
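To make the geometric-discounting interpretation of Eqs. (1)-(2) concrete, the following minimal sketch draws samples from $m^\pi$ by rolling out the policy for a geometric number of steps and then forms the Monte-Carlo value estimate of Eq. (2). This is an illustration only, not the paper's code; `env_step`, `policy`, and `reward_fn` are hypothetical stand-ins for the environment's transition function, the agent's policy, and a reward function.

```python
import numpy as np

def sample_successor(env_step, policy, s0, a0, gamma, rng):
    # One sample X ~ m^pi(. | s0, a0): interpret gamma as a per-step
    # continuation probability, i.e. stop after t ~ Geom(1 - gamma) steps.
    t = rng.geometric(1.0 - gamma)        # t >= 1, so X ranges over S_{k+1}
    s = env_step(s0, a0)                  # first step uses the queried (s0, a0)
    for _ in range(t - 1):
        s = env_step(s, policy(s))
    return s

def q_estimate(env_step, policy, reward_fn, s0, a0, gamma, n=1024, seed=0):
    # Monte-Carlo estimate of Eq. (2): Q(s, a) = (1 - gamma)^{-1} E[r(X)].
    rng = np.random.default_rng(seed)
    rewards = [reward_fn(sample_successor(env_step, policy, s0, a0, gamma, rng))
               for _ in range(n)]
    return float(np.mean(rewards)) / (1.0 - gamma)
```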
Geometric Horizon Model
A Geometric Horizon Model (GHM; Thakoor et al., 2022) or gamma-model (Janner et al., 2020) is a generative model of the normalized successor measure. To learn the parametric model $\tilde{m}(\cdot\,; \theta) \approx m^\pi$, we can minimize a Monte-Carlo cross-entropy objective over source states drawn from the empirical distribution $\rho$:

$$\mathbb{E}_{(S,A) \sim \rho,\, X \sim m^\pi(\cdot \mid S, A)}\big[{-\log \tilde{m}(X \mid S, A; \theta)}\big].$$

To sample from $m^\pi$, we deploy policy $\pi$ for $t \sim \mathrm{Geom}(1-\gamma)$ steps, resulting in state $X = S_t$. Similar to other Monte Carlo methods in RL, this approach is problematic when learning from off-policy data, often resulting in high-variance estimators that rely on importance sampling. Alternatively, we can leverage the Bellman equation (3) to construct an off-policy iterative method for estimating $m^\pi$. Given initial weights $\theta^{(0)}$, each iteration updates $\theta$ by minimizing the following temporal-difference cross-entropy objective over transitions that need not come from policy $\pi$:

$$\mathbb{E}_{(S,A) \sim \rho,\, X \sim (\mathcal{T}^\pi \tilde{m}^{(n)})(\cdot \mid S, A)}\big[{-\log \tilde{m}(X \mid S, A; \theta)}\big]. \quad (4)$$

In the equation above and throughout the paper, we adopt the shorthand $\tilde{m}^{(n)} = \tilde{m}(\cdot\,; \theta^{(n)})$. To generate samples $X \sim (\mathcal{T}^\pi \tilde{m}^{(n)})(\cdot \mid S, A)$, we first draw a successor state $S' \sim P(\cdot \mid S, A)$; then, with probability $1 - \gamma$, we return $S'$; otherwise, with probability $\gamma$, we return a bootstrapped sample drawn from $\tilde{m}^{(n)}(\cdot \mid S', \pi(S'))$. Several probabilistic models have been applied to this problem, including generative adversarial networks (e.g., Janner et al., 2020; Wiltzer et al., 2024b), normalizing flows (e.g., Janner et al., 2020), and variational auto-encoders (e.g., Thakoor et al., 2022; Tomar et al., 2024). We now turn our attention to a class of generative models based on the flow-matching framework specifically designed to leverage the underlying structure of the Bellman equation (3), enabling more effective generative models of the successor measure.

3. Temporal Difference Flows

Flow Matching (FM; Lipman et al., 2023; 2024; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023) constructs a time-dependent probability path $m_t : S \times A \to \mathcal{P}(S)$ for $t \in [0, 1]$ that evolves smoothly from the source distribution $m_0 = p_0 \in \mathcal{P}(S)$ to the target distribution $m_1 \approx m^\pi$. This evolution is governed by a vector field $v_t : S \times S \times A \to S$, which dictates the instantaneous movement of samples along $m_t$. The relationship between $v_t$ and the resulting probability path $m_t$ is established through a time-dependent flow $\psi_t : S \times S \times A \to S$, defined by the following ODE:

$$\frac{d}{dt}\psi_t(x \mid s, a) = v_t\big(\psi_t(x \mid s, a) \mid s, a\big), \quad \psi_0(x \mid s, a) = x,$$
$$\psi_t(x \mid s, a) = x + \int_0^t v_\tau\big(\psi_\tau(x \mid s, a) \mid s, a\big)\, d\tau.$$

Figure 1. Visual depiction of TD-Flow variants. Samples are mapped from $m_0$ to the target distribution $m_1^{(n)}$ through the neural ODE $\psi_t^{(n)}$. Dashed lines depict the neural ODE trajectory; solid lines show the conditional probability path $u_t$. (Left) TD-CFM maps $X_0$ to $X_1$ before creating a separate conditional path between $X_0'$ and $X_1$, resulting in crossing paths. (Middle) TD-CFM(C) directly couples the $X_0$ used to generate $X_1$ when constructing the conditional probability path. (Right) TD2-CFM solves the neural ODE up to time $t$ to directly obtain the target velocity $v_t$.

We say that $v_t$ generates $m_t$ if its flow $\psi_t$ satisfies $X_t := \psi_t(X_0 \mid S, A) \sim m_t(\cdot \mid S, A)$ for $X_0 \sim m_0$. In words, the flow $\psi_t$ pushes samples forward through time, ensuring they are distributed according to $m_t$ at time $t$.
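As a concrete reading of the flow ODE above, the sketch below transports a batch of source samples through a learned vector field with fixed-step Euler integration. This is a hedged toy, not the paper's implementation; the signature `v_theta(t, x, s, a)` and the tensor layout are assumed conventions.

```python
import torch

@torch.no_grad()
def push_forward(v_theta, x0, s, a, n_steps=10):
    """Euler discretization of psi_t: integrate dX/dt = v_t(X | s, a)
    from t = 0 to t = 1, so the output approximates X_1 ~ m_1(. | s, a)
    whenever v_theta generates the path m_t."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + dt * v_theta(t, x, s, a)
    return x
```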
To learn this transformation, we can minimize the squared $L^2$ distance between a parameterized vector field $v_t(\cdot\,; \theta)$ and the true vector field $v_t$ over $t \sim U([0, 1])$, yielding the Monte-Carlo Flow Matching (MC-FM) loss $\ell_{\text{MC-FM}}(\theta)$:

$$\mathbb{E}_{\rho, t, X_t}\big[\| v_t(X_t \mid S, A; \theta) - v_t(X_t \mid S, A) \|^2\big], \quad X_t \sim m_t(\cdot \mid S, A). \quad \text{(MC-FM; 5)}$$

Despite its conceptual simplicity, direct optimization of the flow matching objective above proves challenging due to the inaccessibility of the true probability path $m_t$ and its associated vector field $v_t$. Alternatively, Lipman et al. (2023) show that we can sidestep this problem entirely by introducing additional conditioning information. Instead of directly modeling the probability path $m_t$, we can introduce a random variable $Z$ and define a conditional path on $Z$ as $p_{t|Z} : S \times Z \to \mathcal{P}(S)$ (Lipman et al., 2024; Tong et al., 2024). The conditional velocity field $u_{t|Z} : S \times Z \to S$ that generates $p_{t|Z}$ can now be computed in closed form for many simple choices of $Z$ and $p_{t|Z}$. One such choice is taking $Z = X_1$ and performing a linear Gaussian interpolation from $X_0$ to $X_1$, resulting in $p_{t|1}(\cdot \mid X_1) = \mathcal{N}(\cdot \mid t X_1, (1-t)^2 I)$ with the corresponding vector field given by $u_{t|1}(x \mid X_1) = (X_1 - x)/(1 - t)$. Armed with the ability to sample from $p_{t|1}$ and to compute $u_{t|1}$, we can directly learn $v_t$ by optimizing the Monte-Carlo Conditional Flow Matching (MC-CFM) objective $\ell_{\text{MC-CFM}}(\theta)$:

$$\mathbb{E}_{\rho, t, Z, X_t}\big[\| v_t(X_t \mid S, A; \theta) - u_{t|Z}(X_t \mid Z) \|^2\big], \quad Z = X_1 \sim m^\pi(\cdot \mid S, A), \; X_t \sim p_{t|Z}(\cdot \mid Z). \quad \text{(MC-CFM; 6)}$$

Remarkably, both (MC-FM; 5) and (MC-CFM; 6) share the same gradient and converge to the same solution.

Proposition 1 (Lipman et al. 2024). Given a conditional probability path $p_{t|Z}$ and vector field $u_{t|Z}$ with their associated marginal counterparts $p_t(x)$ and $v_t(x)$, we have $\nabla_\theta\, \ell_{\text{MC-FM}}(\theta) = \nabla_\theta\, \ell_{\text{MC-CFM}}(\theta)$.

TD-CFM
While (MC-CFM; 6) requires direct access to samples from the target distribution $m^\pi$, we can instead learn from an offline dataset $\rho$ containing only one-step transitions $(S, A, S')$ through an iterative process similar to (4). Starting with initial parameters $\theta^{(0)}$, at each iteration we minimize the TD-Conditional Flow Matching (TD-CFM) loss $\ell_{\text{TD-CFM}}$, an extension of (MC-CFM; 6) that differs only in its sampling procedure:

$$X_0 \sim p_0, \quad Z = X_1 \sim (1 - \gamma)\, \delta_{S'} + \gamma\, \delta_{\tilde{\psi}_1^{(n)}(X_0 \mid S', \pi(S'))}. \quad \text{(TD-CFM; 7)}$$

In this procedure, with probability $1 - \gamma$, we return the successor state $S'$. Otherwise, with probability $\gamma$, we sample from the neural ordinary differential equation (Chen et al., 2018) $\tilde{\psi}_t^{(n)}$ with corresponding vector field $v_t^{(n)}(X_t \mid S', \pi(S'))$ from $X_0 \sim p_0$ to produce a sample $X_1 \sim \tilde{m}^{(n)}(\cdot \mid S', \pi(S'))$.
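A minimal sketch of one TD-CFM update under the linear Gaussian path, reusing the `push_forward` Euler sketch above. Tensor shapes, the `v_theta`/`v_target` signatures, and the batch layout are assumptions: the flow target $X_1$ is the observed successor with probability $1-\gamma$ and a bootstrapped sample from the frozen target network otherwise, after which a fresh $X_0$ defines the conditional path.

```python
import torch
import torch.nn.functional as F

def td_cfm_loss(v_theta, v_target, batch, gamma, n_steps=10):
    s, a, s_next, a_next = batch                 # a_next = pi(s_next), precomputed
    # Bootstrapped candidate: X1 = psi^{(n)}_1(X0' | S', pi(S')).
    x1_boot = push_forward(v_target, torch.randn_like(s_next), s_next, a_next, n_steps)
    keep = (torch.rand(s.shape[0], 1) < 1.0 - gamma).float()
    x1 = keep * s_next + (1.0 - keep) * x1_boot  # Z = X1 per (TD-CFM; 7)
    x0 = torch.randn_like(x1)                    # fresh noise: the uncoupled variant
    t = torch.rand(s.shape[0], 1)
    xt = t * x1 + (1.0 - t) * x0                 # X_t ~ N(t X1, (1-t)^2 I)
    u = x1 - x0                                  # equals (X1 - X_t)/(1 - t) at this X_t
    return F.mse_loss(v_theta(t, xt, s, a), u)
```

The coupled variant discussed next differs only in reusing the same $X_0$ that produced the bootstrapped $X_1$, instead of drawing fresh noise.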
Coupled TD-CFM
Although (TD-CFM; 7) offers a principled way of learning the flow from noise to data, an increasingly popular strategy to improve flow matching methods is to correlate noise and data whenever a natural coupling is available (e.g., Liu et al., 2023; Shi et al., 2023; Pooladian et al., 2023; Tong et al., 2024; De Bortoli et al., 2024). Motivated by this idea, we observe that the process used to generate $X_1$ described above already provides a direct coupling between $X_0$ and $X_1$. We can leverage this coupling by conditioning the probability path $p_{t|Z}$ on both endpoints, i.e., $Z = (X_0, X_1)$, rather than just conditioning on $Z = X_1$ as in TD-CFM. As illustrated in Figure 1, this coupling helps align $X_t$ with the path generated by $\tilde{\psi}_t^{(n)}$, potentially simplifying the regression problem. This procedure gives rise to the Coupled TD-Conditional Flow Matching (TD-CFM(C)) loss $\ell_{\text{TD-CFM(C)}}$, which again extends $\ell_{\text{TD-CFM}}$, differing only in its sampling procedure:

$$X_0 \sim p_0, \quad X_1 \sim (1 - \gamma)\, \delta_{S'} + \gamma\, \delta_{\tilde{\psi}_1^{(n)}(X_0 \mid S', \pi(S'))}, \quad Z = (X_0, X_1). \quad \text{(TD-CFM(C); 8)}$$

A convenient approach to specifying the conditional path $p_{t|Z}$ is to define $X_t = \phi_t(X_0, X_1) = \alpha_t X_1 + \beta_t X_0$ as the affine interpolant between $X_0$ and $X_1$, with the interpolation coefficients satisfying the boundary conditions $\alpha_0 = \beta_1 = 0$, $\alpha_1 = \beta_0 = 1$, and monotonicity constraints $\dot{\alpha}_t > 0$, $\dot{\beta}_t < 0$, where the over-dot denotes the time derivative. From this definition, the conditional vector field arises as the time derivative of this interpolant, $u_{t|0,1}(X_t \mid X_0, X_1) = \dot{\phi}_t(X_0, X_1) = \dot{\alpha}_t X_1 + \dot{\beta}_t X_0$ (Albergo et al., 2023). A simple choice of the interpolation coefficients that yields a linear (straight-line) conditional path is given by $\beta_t = 1 - \alpha_t = 1 - t$.

TD2-CFM
While (TD-CFM(C); 8) improves upon (TD-CFM; 7) by accounting for the coupling between bootstrapped samples and their generating noise, both methods rely upon fitting an ad-hoc conditional vector field $u_{t|Z}$ that generates the surrogate conditional path $p_{t|Z}$. To formulate a more structured approach, we exploit the linearity of the Bellman equation, as detailed in the following result.

Lemma 1. Let $p_t$ be a probability path for $P$ generated by vector field $v_t$ and $\bar{p}_t^{(n)}$ be a probability path for $P^\pi m_1^{(n)}$ generated by $\bar{v}_t^{(n)}$ such that $p_0 = \bar{p}_0^{(n)} = m_0$. For any $t \in [0, 1]$ and $(s, a)$, let $v_t^{(n+1)}(\cdot \mid s, a)$ be the solution of (1)

$$\arg\min_{v : \mathbb{R}^d \to \mathbb{R}^d}\; (1-\gamma)\, \mathbb{E}_{X_t \sim p_t(\cdot \mid s, a)}\big[\| v(X_t) - v_t(X_t \mid s, a) \|^2\big] + \gamma\, \mathbb{E}_{\bar{X}_t \sim \bar{p}_t^{(n)}(\cdot \mid s, a)}\big[\| v(\bar{X}_t) - \bar{v}_t^{(n)}(\bar{X}_t \mid s, a) \|^2\big].$$

Then $v_t^{(n+1)}$ induces a probability path $m_t^{(n+1)}$ such that $m_0^{(n+1)} = m_0$ and $m_1^{(n+1)} = \mathcal{T}^\pi m_1^{(n)}$.

(1) Notice here that the minimization is over the space of all functions and not the parameterized vector fields $v_t(\cdot\,; \theta)$.

This result shows that it is possible to use two independent probability paths for the two terms in the sampling process induced by the Bellman operator. For the first term, we can use a standard CFM approach for $Z = X_1$ with conditional path $p_{t|1}$ and vector field $u_{t|1}$, which induces the marginal

$$v_t(x \mid s, a) = \frac{\int u_{t|1}(x \mid x_1)\, p_{t|1}(x \mid x_1)\, P(dx_1 \mid s, a)}{p_t(x \mid s, a)},$$

where $p_t(x \mid s, a) = \int p_{t|1}(x \mid s')\, P(ds' \mid s, a)$. For the second term, we can leverage the GHM $m_t^{(n)}$ learned at the previous iteration to construct the marginal

$$\bar{v}_t^{(n)}(x \mid s, a) = \frac{\int v_t^{(n)}(x \mid s', a')\, m_t^{(n)}(x \mid s', a')\, P(ds' \mid s, a)}{\bar{p}_t^{(n)}(x \mid s, a)},$$

where $\bar{p}_t^{(n)}(x \mid s, a) = \int m_t^{(n)}(x \mid s', a')\, P(ds' \mid s, a)$ and $a' = \pi(s')$. This shows that $m_t^{(n)}$ plays the role of a conditional probability path for the bootstrapped term and $v_t^{(n)}$ is its associated conditional vector field. We can then use the equivalence between FM and CFM in Proposition 1 to replace the marginal probability paths and vector fields in Lemma 1 with their conditional counterparts to obtain the loss:

$$\ell(\theta) = \mathbb{E}_{\rho, t, Z, X_t}\big[\| v_t(X_t \mid S, A; \theta) - u_{t|Z}(X_t \mid Z) \|^2\big], \quad Z = X_1 \sim P(\cdot \mid S, A), \; X_t \sim p_{t|Z}(\cdot \mid Z),$$
$$\bar{\ell}(\theta) = \mathbb{E}_{\rho, t, \bar{X}_t}\big[\| v_t(\bar{X}_t \mid S, A; \theta) - v_t^{(n)}(\bar{X}_t \mid S', \pi(S')) \|^2\big], \quad X_0 \sim p_0, \; S' \sim P(\cdot \mid S, A), \; \bar{X}_t = \tilde{\psi}_t^{(n)}(X_0 \mid S', \pi(S')),$$
$$\ell_{\text{TD2-CFM}}(\theta) = (1 - \gamma)\, \ell(\theta) + \gamma\, \bar{\ell}(\theta). \quad \text{(TD2-CFM; 9)}$$

Since we now bootstrap the previous estimate not only in the sampling process but also in the objective function, we refer to this method as TD2-Conditional Flow Matching (TD2-CFM). The right panel of Figure 1 depicts the process of obtaining the bootstrapped vector field $v_t^{(n)}$ for TD2-CFM. We provide further implementation details and pseudo-code for all TD-Flow methods in Appendix C.3.1.
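The following sketch shows how the two terms of (TD2-CFM; 9) could combine in a single training step, under the same assumed signatures as the earlier sketches. The partial-time integrator `push_forward_to` (a hypothetical helper) solves the target ODE only up to a per-sample time $t$, which is the key structural difference from TD-CFM.

```python
import torch

@torch.no_grad()
def push_forward_to(v_target, x0, s, a, t_end, n_steps=10):
    # Integrate the target ODE only up to (per-sample) time t_end, giving
    # Xbar_t = psi^{(n)}_t(X0 | s, a); constant step t_end / n_steps.
    x, h = x0, t_end / n_steps
    for k in range(n_steps):
        x = x + h * v_target(k * h, x, s, a)
    return x

def td2_cfm_loss(v_theta, v_target, batch, gamma, n_steps=10):
    s, a, s_next, a_next = batch
    t = torch.rand(s.shape[0], 1)
    # (1 - gamma) term: plain CFM toward Z = X1 = S' on the linear path.
    x0 = torch.randn_like(s_next)
    xt = t * s_next + (1.0 - t) * x0
    loss_p = ((v_theta(t, xt, s, a) - (s_next - x0)) ** 2).sum(-1)
    # gamma term: regress onto the frozen vector field along its own path.
    xbar = push_forward_to(v_target, torch.randn_like(s_next),
                           s_next, a_next, t, n_steps)
    with torch.no_grad():
        v_boot = v_target(t, xbar, s_next, a_next)
    loss_b = ((v_theta(t, xbar, s, a) - v_boot) ** 2).sum(-1)
    return ((1.0 - gamma) * loss_p + gamma * loss_b).mean()
```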
Next, we extend our TD2 result to the class of denoising diffusion models.

3.1. Extension to Diffusion Models

Denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) build a diffusion process starting from a data sample $X_0 \sim q_0 = m^\pi(\cdot \mid S, A)$ (2) and corrupting it via a stochastic differential equation (SDE),

$$dX_t = f(t)\, X_t\, dt + g(t)\, dW_t, \quad (10)$$

where $t \in [0, T]$ for some time horizon $T$, $f, g : [0, T] \to \mathbb{R}$ are the drift and diffusion coefficients, and $W_t \in \mathbb{R}^d$ is a standard Brownian motion. The forward process of the linear SDE (10) has an analytic Gaussian kernel $q_{t|0}(\cdot \mid X_0) = \mathcal{N}(\cdot \mid \alpha_t X_0, \sigma_t^2 I)$, where $\alpha_t$ and $\sigma_t$ can be computed in closed form. To sample from the target data distribution $q_0$, we can solve the reverse SDE (Song & Ermon, 2019) from time $T$ to $0$:

$$dX_t = \big[f(t)\, X_t - g(t)^2\, \nabla_{X_t} \log q_t(X_t \mid S, A)\big]\, dt + g(t)\, d\bar{W}_t, \quad (11)$$

where $\bar{W}_t$ is the reverse-time Brownian motion and $q_t$ is the marginal distribution of both the forward (16) and reverse (17) processes. To simulate (11), we can train a parametrized score function $s_t(x \mid s, a; \theta)$ to approximate $\nabla_{x_t} \log q_t(x_t \mid s, a)$ using the denoising diffusion / score matching objective (Vincent, 2011) $\ell_{\text{DD}}(\theta)$:

$$\mathbb{E}_{\rho, t, X_0, X_t}\big[\| s_t(X_t \mid S, A; \theta) - \nabla_{X_t} \log q_{t|0}(X_t \mid X_0) \|^2\big], \quad X_0 \sim m^\pi(\cdot \mid S, A), \; X_t \sim q_{t|0}(\cdot \mid X_0). \quad \text{(DD; 12)}$$

(2) Different to flow matching, time is inverted in diffusion models and ranges from $0$ to $T$.

Temporal Difference Diffusion
Following the blueprint in Section 3, we define an iterative process starting from $s^{(0)} = s(\cdot\,; \theta^{(0)})$ and minimize at each iteration the Temporal Difference Denoising Diffusion (TD-DD) loss $\ell_{\text{TD-DD}}(\theta)$:

$$\mathbb{E}_{\rho, t, X_0, X_t}\big[\| s_t(X_t \mid S, A; \theta) - \nabla_{X_t} \log q_{t|0}(X_t \mid X_0) \|^2\big], \quad X_0 \sim \mathcal{T}^\pi \tilde{m}^{(n)}_{0|T}(\cdot \mid S, A), \; X_t \sim q_{t|0}(\cdot \mid X_0). \quad \text{(TD-DD; 13)}$$

Once again, to sample $X_0 \sim \mathcal{T}^\pi \tilde{m}^{(n)}_{0|T}(\cdot \mid S, A)$, we proceed as follows: with probability $1 - \gamma$, we draw a successor state $S' \sim P(\cdot \mid S, A)$; conversely, with probability $\gamma$, we sample from the bootstrapped model by solving the reverse SDE with score function $s^{(n)}$, initiated from $X_T$. Following an approach analogous to Lemma 1, we demonstrate in Appendix B that we can employ two distinct diffusion processes for the two terms involved in the Bellman operator, which consequently leads to the TD2-DD objective:

$$\ell(\theta) = \mathbb{E}_{\rho, t, \tilde{X}_t}\big[\| s_t(\tilde{X}_t \mid S, A; \theta) - \nabla_{\tilde{X}_t} \log q_{t|0}(\tilde{X}_t \mid S') \|^2\big], \quad \tilde{X}_t \sim q_{t|0}(\cdot \mid S'),$$
$$\bar{\ell}(\theta) = \mathbb{E}_{\rho, t, \bar{X}_t}\big[\| s_t(\bar{X}_t \mid S, A; \theta) - s_t^{(n)}(\bar{X}_t \mid S', \pi(S')) \|^2\big], \quad \bar{X}_T \sim q_T, \; \bar{X}_t \sim \bar{q}_{t|T}^{(n)}(\cdot \mid S', \pi(S')),$$
$$\ell_{\text{TD2-DD}}(\theta) = (1 - \gamma)\, \ell(\theta) + \gamma\, \bar{\ell}(\theta). \quad \text{(TD2-DD; 14)}$$
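A hedged sketch of how (TD2-DD; 14) could look for a discrete-time DDPM with noise schedules `alpha[t]`, `sigma[t]`; the helper `reverse_sample_to`, which would run the frozen target network's reverse process from $X_T$ down to step $t$, is hypothetical, as are all signatures. Note that $\nabla \log q_{t|0}(x_t \mid x_0) = -\epsilon / \sigma_t$ when $x_t = \alpha_t x_0 + \sigma_t \epsilon$.

```python
import torch

def td2_dd_loss(s_theta, s_target, reverse_sample_to, batch, gamma, alpha, sigma, T):
    s, a, s_next, a_next = batch
    t = torch.randint(1, T, (s.shape[0],))
    a_t, sig_t = alpha[t].unsqueeze(-1), sigma[t].unsqueeze(-1)
    # (1 - gamma) term: denoising score matching with X0 = S'.
    eps = torch.randn_like(s_next)
    xt = a_t * s_next + sig_t * eps              # X_t ~ q_{t|0}(. | S')
    loss_p = ((s_theta(t, xt, s, a) + eps / sig_t) ** 2).sum(-1)
    # gamma term: regress onto the frozen score along its own reverse path.
    xbar = reverse_sample_to(s_target, s_next, a_next, t)   # hypothetical helper
    with torch.no_grad():
        score_boot = s_target(t, xbar, s_next, a_next)
    loss_b = ((s_theta(t, xbar, s, a) - score_boot) ** 2).sum(-1)
    return ((1.0 - gamma) * loss_p + gamma * loss_b).mean()
```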
4. Theoretical Analysis

We now study the learning dynamics of an idealized version of the TD-Flow methods, assuming that the flow-matching loss is minimized exactly at each iteration. Under this assumption, at each iteration we compute a probability path $m_t^{(n)}$ such that $m_1^{(n)} = \mathcal{T}^\pi m_1^{(n-1)}$, which implies that $m_1^{(n)} \to m^\pi$ by the contraction property of $\mathcal{T}^\pi$. The following result shows that the overall probability paths $m_t^{(n)}$ follow a similar process. Proofs are deferred to Appendix E.

Theorem 1. For any $n \ge 1$, the probability paths generated by TD-CFM, TD-CFM(C), or TD2-CFM satisfy

$$m_t^{(n+1)}(x \mid s, a) = \big(\mathcal{B}_t^\pi m_t^{(n)}\big)(x \mid s, a), \quad t \in [0, 1],$$

where $\mathcal{B}_t^\pi m := (1 - \gamma) P_t + \gamma P^\pi m$ and $P_t(x \mid s, a) := \int p_{t|1}(x \mid x_1)\, P(dx_1 \mid s, a)$. For any $t \in [0, 1]$, the operator $\mathcal{B}_t^\pi$ is a $\gamma$-contraction in 1-Wasserstein distance; that is, for any couple of probability paths $p_t, q_t$,

$$\sup_{s,a} W_1\big((\mathcal{B}_t^\pi p_t)(\cdot \mid s, a), (\mathcal{B}_t^\pi q_t)(\cdot \mid s, a)\big) \le \gamma\, \sup_{s,a} W_1\big(p_t(\cdot \mid s, a), q_t(\cdot \mid s, a)\big).$$

Theorem 1 shows that all TD-Flow methods fundamentally implement the same update, where the probability path at $t \in [0, 1]$ is obtained by applying a Bellman-like operator $\mathcal{B}_t^\pi$ to the previous iteration. This operator is a $\gamma$-contraction, like $\mathcal{T}^\pi$, directly implying the following result.

Corollary 1. Let $\{m_t^{(n)}\}_{n \ge 0}$ be the sequence of probability paths produced by TD-CFM, TD-CFM(C), or TD2-CFM starting from an arbitrary vector field $v_t^{(0)}$. Then $\lim_{n \to \infty} m_t^{(n)} = m_t^\star = \mathcal{B}_t^\pi m_t^\star$, where $m_t^\star$ is the unique fixed point of $\mathcal{B}_t^\pi$, and $m_t^\star = m_t^{\text{MC}}$, where $m_t^{\text{MC}}(\cdot \mid s, a) = \int p_{t|1}(\cdot \mid x_1)\, m^\pi(dx_1 \mid s, a)$ is the probability path of the Monte-Carlo approach (MC-CFM; 6).

This corollary shows that the fixed point of $\mathcal{B}_t^\pi$ coincides with the probability path generated in Monte-Carlo Conditional Flow Matching (MC-CFM; 6), which assumes direct access to samples of $m^\pi$. An important subtlety in Theorem 1 is that all algorithms apply the same operator for $n \ge 1$, but the result holds for $n = 0$ only for TD2-CFM. This means that even starting from the same $\theta^{(0)}$, the three algorithms may generate different sequences $\{m_t^{(n)}\}_{n \ge 0}$, while still converging to $m_t^\star$. In Theorems 5 and 6, we show that we can reconcile TD-CFM(C) and TD-CFM with TD2-CFM under a mild assumption on the form of the initial vector field.

While Theorem 1 analyzes an idealized version of the algorithms, in practice gradients are estimated from samples, and the following analysis reveals important differences in their variance. We introduce the (unbiased) sample-based gradients for each of the algorithms,

$$\mathbb{E}\big[g_{\text{TD-CFM}}(Y_{\text{TD-CFM}})\big] = \nabla_\theta\, \ell_{\text{TD-CFM}}(\theta), \quad \mathbb{E}\big[g_{\text{TD-CFM(C)}}(Y_{\text{TD-CFM(C)}})\big] = \nabla_\theta\, \ell_{\text{TD-CFM(C)}}(\theta), \quad \mathbb{E}\big[g_{\text{TD2-CFM}}(Y_{\text{TD2-CFM}})\big] = \nabla_\theta\, \ell_{\text{TD2-CFM}}(\theta),$$

where $Y$ summarizes the random variables involved in the loss definitions in (TD-CFM; 7), (TD-CFM(C); 8), and (TD2-CFM; 9) (see Appendix E.6 for a formal definition of the gradients). We want to compare the total variance of the gradient estimates $\sigma^2 = \operatorname{Tr} \operatorname{Cov}_Y[g(Y)]$, where $\operatorname{Tr}$ denotes the trace.

Theorem 2. For any $n \ge 1$ and $t \in [0, 1]$, assume that $m_t^{(n)}(x \mid s, a) = \int p_{t|1}(x \mid x_1)\, m_1^{(n)}(x_1 \mid s, a)\, dx_1$; then

$$\sigma^2_{\text{TD-CFM}} = \sigma^2_{\text{TD2-CFM}} + \gamma^2\, \mathbb{E}\Big[\operatorname{Tr} \operatorname{Cov}_{X_1 \mid s, a, X_t}\big[\nabla_\theta v_t(X_t \mid s, a; \theta)^\top u_{t|1}(X_t \mid X_1)\big]\Big].$$

Theorem 3. For any $n \ge 1$ and $t \in [0, 1]$, assume that $m_t^{(n)}(x \mid s, a) = \int p_{t|0,1}(x \mid x_0, x_1)\, m_{0,1}^{(n)}(x_0, x_1 \mid s, a)\, dx_0\, dx_1$; (3) then we obtain

$$\sigma^2_{\text{TD-CFM(C)}} = \sigma^2_{\text{TD2-CFM}} + \gamma^2\, \mathbb{E}\Big[\operatorname{Tr} \operatorname{Cov}_{Z \mid S, A, X_t}\big[\nabla_\theta v_t(X_t \mid S, A; \theta)^\top u_{t|Z}(X_t \mid Z)\big]\Big], \quad Z = (X_0, X_1).$$

Furthermore, if we use straight conditional paths, i.e., $X_t = t X_1 + (1 - t) X_0$, and the linear interpolant $X_t$ does not intersect for any $s, a, s'$, then $\sigma^2_{\text{TD-CFM(C)}} = \sigma^2_{\text{TD2-CFM}}$.

(3) $m_{0,1}^{(n)}(x_0, x_1 \mid s, a) = m_0(x_0)\, \delta_{\psi_1^{(n)}(x_0 \mid s, a)}(x_1)$ is the joint distribution of $(X_0, X_1)$, i.e., the endpoints of the ODE.

In both results, the probability path $m_t^{(n)}$ from the previous iteration must be identical for the algorithms being compared. The analysis reveals that TD-CFM and TD-CFM(C) suffer from a larger variance compared to TD2-CFM, which uses the vector field $v^{(n)}$ both to sample $\bar{X}_t$ and as a target for the regression problem. This variance gap is discounted by $\gamma^2$, which suggests that the performance of these algorithms would be similar for problems with small horizons, while the gap would increase as $\gamma \to 1$. The extra variance in both cases stems from samples generated by the algorithm (i.e., they do not depend on the transitions available in the dataset). In this sense, we can refer to it as computational variance, and in principle it could be reduced by increasing the number of samples $X_0$, $X_1$, and $X_t$ used in gradient computation. While the variance of TD-CFM and TD-CFM(C) cannot be directly compared, we expect that constructing $X_t$ from $X_0$ and $X_1$ (instead of $X_1$ only) will tend to reduce its variance. Specifically, when $X_t$ is obtained by linear interpolation between $X_0$ and $X_1$ and does not generate crossing paths, the variance of TD-CFM(C) reduces to that of TD2-CFM.
5. Experiments

We now present a series of experiments to assess the efficacy of our TD-based flow and diffusion approaches against baselines employing Generative Adversarial Networks (Goodfellow et al., 2014) and beta-Variational Auto-Encoders (Higgins et al., 2017). Following the methodology from Touati et al. (2023) and Pirotta et al. (2024), we benchmark 22 tasks spanning 4 domains (Maze, Walker, Cheetah, Quadruped) from the DeepMind Control Suite (Tunyasuvunakool et al., 2020).

For a single policy, we evaluate how well each method models its i) successor measure and ii) value function. While lower errors in estimating the successor measure are expected to lead to better value estimation, this is not always the case, since modeling errors may disproportionally affect states with negligible rewards. Additionally, motivated by our theoretical results, we explore how the probability path's design affects our proposed methods' relative performance. Finally, we examine the scalability of our approach by learning a generative model of the successor measure across a class of parameterized policies derived from the Forward-Backward (FB) representation (Touati & Ollivier, 2021; Touati et al., 2023), a non-generative model of the successor measure. We conclude by demonstrating how TD2 enables more effective planning for task-relevant policies when performing Generalized Policy Improvement (GPI; Barreto et al., 2017), far surpassing the capabilities of FB alone.

5.1. Empirical Evaluation of Geometric Horizon Models

Before benchmarking, we must first obtain a policy to evaluate. We follow the approach taken in Thakoor et al. (2022) and pre-train a set of deterministic policies, one for each task, using TD3 (Fujimoto et al., 2018). The final policy obtained from this pre-training phase is then fixed for the remainder of our experiments. GHM training proceeds in an off-policy manner, where we learn the successor measure of a TD3 policy using transition data from the ExoRL dataset (Yarats et al., 2022); specifically, we use a dataset of 10M transitions collected by a random network distillation policy (Burda et al., 2019). All GHM methods are trained for 3M gradient steps using the AdamW optimizer (Loshchilov & Hutter, 2019) with a batch size of 1024 and weight decay of 0.001. We maintain a target network using an exponential moving average of the training parameters with a step size of 0.001. Special care was taken to match the capacity of the neural networks between methods, with a UNet-style architecture employed for all flow and diffusion methods, while the GAN and VAE baselines use an MLP with residual connections for all their respective networks. Full details for the training methodology, network architecture, and hyperparameters can be found in Appendix C.

We implement all conditional flow matching methods (TD-CFM, TD-CFM(C), TD2-CFM) with the Optimal Transport Gaussian conditional path from Lipman et al. (2023). When constructing our bootstrap targets, we sample from the neural ODE using the midpoint solver with a constant step size of $t/10$ for a maximum of 10 steps. For TD2-CFM, we sample $t \sim U([0, 1])$; otherwise, we integrate to $t = 1$ and construct $X_t$ using the conditional path.
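Two implementation details above, the EMA target network and the midpoint solver, could look as follows. This is a hedged sketch under assumed signatures, not the authors' code; `v` is a vector field with the same convention as earlier sketches.

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, step=0.001):
    # Exponential moving average of the online weights (step size 0.001).
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.lerp_(p_o, step)

@torch.no_grad()
def midpoint_solve(v, x0, s, a, t_end, n_steps=10):
    # Midpoint (RK2) integration of the bootstrap ODE up to time t_end
    # with constant step t_end / n_steps, matching the described setup.
    x, h = x0, t_end / n_steps
    for k in range(n_steps):
        x_mid = x + 0.5 * h * v(k * h, x, s, a)
        x = x + h * v(k * h + 0.5 * h, x_mid, s, a)
    return x
```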
For Denoising Diffusion methods (TD-DD, TD2-DD), we train a DDPM (Ho et al., 2020) by discretizing $\beta \in (0.1, 20)$ using $T = 1{,}000$ steps. We construct diffusion bootstrapped targets using 20 steps of the DDIM (Song et al., 2021a) sampler. For TD-DD, we solve to $t = 0$ and regress towards the noise that re-corrupted our sample. Alternatively, TD2-DD directly regresses towards the noise prediction from the target network at a randomly selected noise level.

Figure 2. Value-function prediction error as a function of the effective horizon $(1-\gamma)^{-1}$ for $\gamma \in \{0.8, 0.9, 0.95, 0.98, 0.99\}$ on the POINTMASS loop task. TD2 methods show impressive robustness to increasingly long-horizon predictions.

The first baseline we consider is a GHM instantiated as a Generative Adversarial Network (Goodfellow et al., 2014), similar to the one found in Janner et al. (2020). We follow the best practices from Huang et al. (2024), with the primary modification being a relativistic discriminator (Jolicoeur-Martineau, 2019) equipped with a zero-centered gradient penalty on both real and fake samples. For our second baseline, we implement a beta-VAE (Higgins et al., 2017) following the practices outlined in Thakoor et al. (2022).

To evaluate the quality of our models, we first generate samples from the ground-truth successor measure $m^\pi$ according to the following procedure. We first randomly sample 64 source states $S_0$ from the initial state distribution and execute policy $\pi$ for 1,000 steps. Along each trajectory, we resample 2048 states with replacement according to the stopping time $t \sim \mathrm{Geometric}(1-\gamma)$. For the same 64 source states, we generate a matching set of 2048 samples from each GHM. Now in possession of these two sets of samples, we evaluate: 1) the log-likelihood of the true samples for models with tractable densities (i.e., diffusion and flow methods); 2) the Earth Mover's Distance (EMD; Rubner et al., 2000), which quantifies the minimal transport cost between the two empirical distributions; and 3) the mean-squared error between a Monte-Carlo estimate of the true value function $Q^\pi$ and the value function derived from GHM samples using (2). Full details can be found in Appendix C.1.
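The EMD metric can be computed with the POT library (Flamary et al., 2021), which appears in the paper's reference list; a minimal sketch, assuming the two sample sets are numpy arrays of flattened state vectors with uniform weights:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

def earth_movers_distance(samples_true, samples_model):
    # Exact EMD between two empirical distributions with uniform weights
    # and Euclidean ground cost (metric 2 of the evaluation protocol).
    n, m = len(samples_true), len(samples_model)
    cost = ot.dist(samples_true, samples_model, metric='euclidean')
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    return ot.emd2(a, b, cost)  # optimal transport cost
```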
Having established our training framework, baselines, and evaluation protocol, we proceed to investigate a key prediction from our theoretical analysis. Our variance analysis suggests that our TD-Flow framework should enable more stable training across extended temporal horizons.

| Method | EMD | Norm. NLL | MSE(V) |
| --- | --- | --- | --- |
| TD-DD | 20.22 (0.26) | 2.824 (0.195) | 454.49 (131.97) |
| TD2-DD | 14.14 (1.08) | 0.806 (0.016) | 189.15 (23.63) |
| TD-CFM | 12.26 (0.02) | 0.886 (0.024) | 228.77 (2.20) |
| TD-CFM(C) | 10.51 (0.06) | 0.447 (0.020) | 140.78 (18.72) |
| TD2-CFM | 10.57 (0.07) | 0.422 (0.014) | 135.22 (19.79) |
| GAN | 23.97 (0.46) | n/a | 2463.22 (628.05) |
| VAE | 83.77 (0.41) | n/a | 1284.27 (37.62) |

| Method | EMD | Norm. NLL | MSE(V) |
| --- | --- | --- | --- |
| TD-DD | 0.149 (0.001) | 2.974 (0.100) | 1245.20 (29.27) |
| TD2-DD | 0.027 (0.001) | 0.761 (0.082) | 11.13 (3.09) |
| TD-CFM | 0.062 (0.003) | 0.554 (0.033) | 355.56 (82.83) |
| TD-CFM(C) | 0.022 (0.002) | 0.696 (0.094) | 11.89 (3.16) |
| TD2-CFM | 0.021 (0.000) | 0.843 (0.027) | 8.74 (2.09) |
| GAN | 0.203 (0.037) | n/a | 1257.26 (112.86) |
| VAE | 0.410 (0.036) | n/a | 1821.89 (69.78) |

| Method | EMD | Norm. NLL | MSE(V) |
| --- | --- | --- | --- |
| TD-DD | 28.33 (0.33) | 1.908 (0.041) | 1490.75 (444.49) |
| TD2-DD | 22.64 (2.47) | 0.861 (0.028) | 159.03 (14.64) |
| TD-CFM | 15.73 (0.06) | 1.056 (0.002) | 525.06 (28.90) |
| TD-CFM(C) | 14.38 (0.03) | 0.488 (0.003) | 155.25 (5.58) |
| TD2-CFM | 14.51 (0.05) | 0.379 (0.011) | 141.77 (3.10) |
| GAN | 36772.12 (13898.25) | n/a | 2634.69 (798.38) |
| VAE | 60.27 (0.28) | n/a | 1156.33 (36.52) |

| Method | EMD | Norm. NLL | MSE(V) |
| --- | --- | --- | --- |
| TD-DD | 20.58 (0.24) | 2.649 (0.137) | 382.40 (458.63) |
| TD2-DD | 12.09 (0.12) | 0.537 (0.060) | 39.04 (6.08) |
| TD-CFM | 13.53 (0.11) | 0.713 (0.028) | 225.27 (42.43) |
| TD-CFM(C) | 11.91 (0.02) | 0.219 (0.016) | 30.71 (3.44) |
| TD2-CFM | 11.92 (0.10) | 0.104 (0.001) | 28.35 (6.10) |
| GAN | 24.51 (0.89) | n/a | 3690.65 (1117.94) |
| VAE | 111.73 (2.53) | n/a | 2457.61 (16.25) |

Table 1. Evaluation results comparing our TD-based methods along with GAN and VAE baselines for a single policy, with one block per domain. Results are computed over 19 tasks from 4 domains and further averaged across 3 seeds. For each metric we highlight the best-performing methods. NLL is reported only for models with tractable densities.

| Method | EMD | Norm. NLL | MSE(V) |
| --- | --- | --- | --- |
| TD-CFM(C) | 14.08 (12.42) | 1.79 (1.98) | 310.45 (258.94) |
| TD2-CFM | -0.09 (0.09) | -0.01 (0.04) | -3.36 (7.76) |

Table 2. Performance difference for TD-CFM(C) and TD2-CFM when employing a curved instead of straight conditional path. Lower is better, with negative values indicating a net improvement for using a curved path.

To validate this hypothesis, we train each GHM for 3 seeds on the loop task in the Maze domain while varying the effective horizon $(1-\gamma)^{-1}$ across five values: $\{5, 10, 20, 50, 100\}$. Figure 2 illustrates the relationship between value-function MSE and the effective horizon. The results demonstrate that TD2-based methods maintain consistent performance even as the effective horizon increases, while alternative approaches show significant performance degradation. Notably, at an effective horizon of 100, TD2-based methods maintain their accuracy and achieve performance improvements of nearly four orders of magnitude compared to their naive implementations. These results empirically support our initial hypothesis, with the stability of TD2 methods aligning with our predictions.

In the following, we shift our attention to a more in-depth analysis of the largest horizon of 100 ($\gamma = 0.99$).
For each algorithm, we train a GHM for 3 independent seeds for all domains and tasks. Table 1 reports aggregate performance across our full suite of metrics. For each domain and metric, we highlight results within a 1% range of the best-performing method. The results demonstrate a clear pattern of superior performance for TD2-based algorithms: TD2-CFM achieves significant improvements over TD-CFM with a 10x reduction in value-function MSE, a 1.5x reduction in EMD, and a 3x reduction in log-likelihood, averaged across all four domains. In line with our theoretical predictions, the coupled variant of TD-CFM performs comparably to TD2-CFM, given straight conditional paths. While a comparison between flow matching and diffusion is not at the core of this paper, in our experiments flow matching generally outperforms diffusion across all metrics. We posit this is primarily due to noise in the diffusion process adversely impacting an already noisy prediction problem for large horizons.

Given the comparable performance between TD-CFM(C) and TD2-CFM with straight conditional paths, we next examine how these methods behave with alternative path geometries. Our theoretical analysis suggests an important distinction: TD2-CFM should maintain its effectiveness with non-straight paths, while the performance of TD-CFM(C) should degrade. To test this prediction, we maintain the methodology above while replacing the conditional path in (TD2-CFM; 9) with the following curved path $p_{t|1}(\cdot \mid X_1) = \mathcal{N}(\cdot \mid \alpha_t X_1, \beta_t^2 I)$ with coefficients $\alpha_t = \sin\big(\frac{\pi}{2} t\big)$ and $\beta_t = \cos\big(\frac{\pi}{2} t\big)$. The corresponding conditional vector field is now given by

$$u_{t|1}(X_t \mid X_1) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\, \alpha_t\Big) X_1 + \frac{\dot{\beta}_t}{\beta_t}\, X_t.$$

Additionally, for TD-CFM(C) we condition the curved path above on $X_0$ and $X_1$, resulting in the conditional vector field $u_{t|0,1}(X_t \mid X_0, X_1) = \frac{\pi}{2}\big(\beta_t X_1 - \alpha_t X_0\big)$. Table 2 illustrates the performance difference relative to the straight-path results (Table 1), averaged across all domains and tasks. The results strongly support our theoretical prediction: TD2-CFM not only maintained but surprisingly improved performance compared to the linear path. In contrast, TD-CFM(C) showed significant performance degradation, confirming our hypothesis about its limitations with non-straight paths.
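For reference, the curved path and its conditional vector field could be implemented as below; a sketch only, with the same assumed tensor conventions as earlier, and note the field is singular at $t = 1$ where $\beta_t = 0$.

```python
import torch

def curved_path_sample(x1, t):
    # p_{t|1}(. | X1) = N(alpha_t X1, beta_t^2 I) with trigonometric
    # coefficients alpha_t = sin(pi t / 2), beta_t = cos(pi t / 2).
    alpha, beta = torch.sin(torch.pi / 2 * t), torch.cos(torch.pi / 2 * t)
    return alpha * x1 + beta * torch.randn_like(x1)

def curved_u_cond(xt, x1, t):
    # u_{t|1}(x | X1) = (alpha' - (beta'/beta) alpha) X1 + (beta'/beta) x,
    # with alpha' = (pi/2) beta and beta' = -(pi/2) alpha.
    alpha, beta = torch.sin(torch.pi / 2 * t), torch.cos(torch.pi / 2 * t)
    d_alpha, d_beta = torch.pi / 2 * beta, -torch.pi / 2 * alpha
    return (d_alpha - d_beta / beta * alpha) * x1 + (d_beta / beta) * xt
```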
5.2. Planning via Generalized Policy Improvement

We now turn our attention towards training policy-conditioned GHMs which can be utilized for test-time planning. To accomplish this, we first pre-train a Forward-Backward (FB; Touati & Ollivier, 2021; Touati et al., 2023) representation using the same dataset of 10M transitions as described in Section 5.1. This pre-training yields a class of $w$-conditioned policies $\pi_w$, where each $w \in \mathcal{W} = S^{d-1}(\sqrt{d})$ represents an embedding of a reward function situated on a $d$-dimensional hypersphere of radius $\sqrt{d}$. We then train the GHM $m^{\pi_w}$ conditioned on the policy by incorporating the embedding $w$ directly into the model's input. All GHM methods are trained for 8M gradient steps, maintaining the same parameters used in Section 5.1, with the exception of a higher weight decay coefficient of 0.01. For additional insights into the accuracy of the policy-conditioned GHMs, we direct the reader to Appendix D. Overall, we observed similar trends to those seen in our single-policy experiments.

Given that both FB and $w$-conditioned GHM models enable estimation of a policy's value function $Q^{\pi_w}$, we can utilize this information to perform Generalized Policy Improvement (GPI; Barreto et al., 2017) during evaluation. Specifically, at each time step $t$, we choose an action $a_t = \pi_{w_t}(s_t)$, where $w_t$ is derived as follows:

$$w_t \in \arg\max_{w \sim D(\mathcal{W})}\; \underbrace{(1-\gamma)^{-1}\, \mathbb{E}_{X \sim m^{\pi_w}(\cdot \mid s_t, \pi_w(s_t))}\big[r(X)\big]}_{Q^{\pi_w}(s_t,\, \pi_w(s_t))}. \quad (15)$$

Here $D(\mathcal{W})$ is a sampling distribution over $\mathcal{W}$. We consider three such distributions: i) Random: uniform distribution over $\mathcal{W}$; ii) Local Perturbation: we perturb the embedding $w_r$ of the task reward $r$ by the uniform distribution; iii) Train Distribution: we sample $w$ from the training distribution used by FB. To approximate (15), we sample 255 embeddings from $D(\mathcal{W})$ and explicitly include the task embedding $w_r$, resulting in a maximization over 256 policies. To estimate $Q^{\pi_w}$, we average the reward over 128 states sampled from $m^{\pi_w}$. Performance is measured by averaging returns over 100 episodes, each lasting 1000 steps.

Figure 3 illustrates the average percentage of improvement for each algorithm and $w$-sampling strategy relative to the performance of the FB policy $\pi_{w_r}$ for the task reward $r$. We refer to Appendix D for a more detailed view of these results.

Figure 3. Performance improvement over the zero-shot Forward-Backward (FB; Touati & Ollivier, 2021) policies when planning with Generalized Policy Improvement (GPI; Barreto et al., 2017). FB-GPI performs GPI over the FB value function $Q^{\pi_w}$. DD-GPI and FM-GPI perform GPI with the value function implied by the GHM $m^{\pi_w}$ for our diffusion-based and flow-based methods, respectively. Results are averaged over 22 tasks across 4 domains.

All TD-based GHM approaches lead to a significant improvement over the base FB policy, with TD-CFM(C) and TD2-CFM providing 30%+ improvement with all sampling approaches. TD2-DD also leads to significant performance gains but is still dominated by the flow matching methods. Notably, FB-based GPI not only fails to improve performance but actually deteriorates it on average, with significant degradation observed in three out of four domains (detailed results available in Appendix D). When comparing different distributions $D(\mathcal{W})$, we observe that while FB-GPI's performance fluctuates considerably, GHM methods maintain their robustness across distributions, showing only minor variation. These results underscore the ability of our improved GHMs to make long-term predictions, enabling powerful planning capabilities.
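The GPI step of Eq. (15) amounts to a one-step argmax over candidate policy embeddings, scored by a GHM-based Monte-Carlo value estimate. A hedged sketch, where `fb_policy`, `ghm_sample`, `reward_fn`, and `draw_w` are hypothetical stand-ins for the FB policy network, the GHM sampler, the task reward, and the embedding distribution $D(\mathcal{W})$:

```python
import torch

@torch.no_grad()
def gpi_action(fb_policy, ghm_sample, reward_fn, s_t, w_task, draw_w,
               gamma, n_w=255, n_x=128):
    # Score the task embedding plus n_w sampled embeddings by the GHM-based
    # value estimate of Eq. (15), then act with the argmax policy.
    ws = torch.cat([w_task.unsqueeze(0), draw_w(n_w)], dim=0)  # 256 candidates
    q_vals = []
    for w in ws:
        a = fb_policy(s_t, w)
        xs = ghm_sample(s_t, a, w, n_x)          # X ~ m^{pi_w}(. | s_t, a)
        q_vals.append(reward_fn(xs).mean() / (1.0 - gamma))
    w_best = ws[torch.stack(q_vals).argmax()]
    return fb_policy(s_t, w_best)
```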
6. Discussion

In this paper, we introduced Temporal Difference Flows, a novel generative modeling approach that significantly advances long-horizon predictive models of state. By leveraging the successor measure's temporal difference structure both in its sampling procedure and learning objective, TD2-CFM and TD2-DD effectively address challenges associated with modeling long-range state dynamics. The methods developed in this paper provide a robust theoretical and empirical foundation that demonstrates the advantages of our framework across a range of tasks, metrics, and domains. We envision numerous exciting applications emerging from this work, particularly around imitation learning (Wu et al., 2025; Jain et al., 2025), planning (Sutton, 1991; Thakoor et al., 2022; Zhu et al., 2024), and off-policy evaluation (Precup et al., 2000; 2001; Nachum et al., 2019; Fujimoto et al., 2021). Furthermore, recent work on consistency models (Song et al., 2023; Yang et al., 2024) and self-distillation (Frans et al., 2025) suggests promising avenues for tackling the computational burden of sampling, a limitation common to the family of iterative generative models that our approach builds upon.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations (ICLR), 2023.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. CoRR, abs/2303.08797, 2023.

Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982.

Ba, J., Kiros, J., and Hinton, G. E. Layer normalization. CoRR, abs/1607.06450, 2016.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., Silver, D., and van Hasselt, H. Successor features for transfer in reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2017.

Barreto, A., Hou, S., Borsa, D., Silver, D., and Precup, D. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences (PNAS), 117(48):30079-30087, 2020.

Blier, L., Tallec, C., and Ollivier, Y. Learning successor states and goal-dependent values: A mathematical viewpoint. CoRR, abs/2101.07123, 2021.

Borsa, D., Barreto, A., Quan, J., Mankowitz, D. J., van Hasselt, H., Munos, R., Silver, D., and Schaul, T. Universal successor features approximators. In International Conference on Learning Representations (ICLR), 2019.

Burda, Y., Edwards, H., Storkey, A. J., and Klimov, O. Exploration by random network distillation. In International Conference on Learning Representations (ICLR), 2019.

Cetin, E., Touati, A., and Ollivier, Y. Finer behavioral foundation models via auto-regressive features and advantage weighting. CoRR, abs/2412.04368, 2024.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Neural Information Processing Systems (NeurIPS), 2018.

Dayan, P. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 1993.

De Bortoli, V., Korshunova, I., Mnih, A., and Doucet, A. Schrödinger bridge flow for unpaired data translation. In Neural Information Processing Systems (NeurIPS), 2024.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In International Conference on Learning Representations (ICLR), Workshop Track Proceedings, 2015.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.

Farebrother, J., Greaves, J., Agarwal, R., Le Lan, C., Goroshin, R., Castro, P. S., and Bellemare, M. G. Proto-value networks: Scaling representation learning with auxiliary tasks. In International Conference on Learning Representations (ICLR), 2023.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1-8, 2021.

Frans, K., Hafner, D., Levine, S., and Abbeel, P. One step diffusion via shortcut models. In International Conference on Learning Representations (ICLR), 2025.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), 2018.

Fujimoto, S., Meger, D., and Precup, D. A deep reinforcement learning approach to marginalized importance sampling with the successor representation. In International Conference on Machine Learning (ICML), 2021.

Ghosh, D., Bhateja, C. A., and Levine, S. Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning (ICML), 2023.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems (NeurIPS), 2014.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations (ICLR), 2019.

Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. P. Mastering diverse domains through world models. CoRR, abs/2301.04104, 2023.

Hansen, N., Su, H., and Wang, X. Temporal difference learning for model predictive control. In International Conference on Machine Learning (ICML), 2022.

Hansen, N., Su, H., and Wang, X. TD-MPC2: Scalable, robust world models for continuous control. In International Conference on Learning Representations (ICLR), 2024.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neural Information Processing Systems (NeurIPS), 2020.

Huang, N., Gokaslan, A., Kuleshov, V., and Tompkin, J. The GAN is dead; long live the GAN! A modern GAN baseline. In Neural Information Processing Systems (NeurIPS), 2024.

Jafferjee, T., Imani, E., Talvitie, E., White, M., and Bowling, M. Hallucinating value: A pitfall of dyna-style planning with imperfect environment models. CoRR, abs/2006.04363, 2020.

Jain, A. K., Lehnert, L., Rish, I., and Berseth, G. Maximum state entropy exploration using predecessor and successor representations. In Neural Information Processing Systems (NeurIPS), 2023.

Jain, A. K., Wiltzer, H., Farebrother, J., Rish, I., Berseth, G., and Choudhury, S. Non-adversarial inverse reinforcement learning via successor feature matching. In International Conference on Learning Representations (ICLR), 2025.

Janner, M., Mordatch, I., and Levine, S. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In Neural Information Processing Systems (NeurIPS), 2020.

Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. In International Conference on Learning Representations (ICLR), 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.

Lambert, N., Pister, K., and Calandra, R. Investigating compounding prediction errors in learned dynamics models. CoRR, abs/2203.09637, 2022.

Le Lan, C., Tu, S., Oberman, A., Agarwal, R., and Bellemare, M. G. On the generalization of representations in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Le Lan, C., Greaves, J., Farebrother, J., Rowland, M., Pedregosa, F., Agarwal, R., and Bellemare, M. G. A novel stochastic gradient descent algorithm for learning principal subspaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023a.

Le Lan, C., Tu, S., Rowland, M., Harutyunyan, A., Agarwal, R., Bellemare, M. G., and Dabney, W. Bootstrapped representations in reinforcement learning. In International Conference on Machine Learning (ICML), 2023b.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T. Q., Lopez-Paz, D., Ben-Hamu, H., and Gat, I. Flow matching guide and code. CoRR, abs/2412.06264, 2024.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations (ICLR), 2018.

Machado, M. C., Bellemare, M. G., and Bowling, M. Count-based exploration with the successor representation. In AAAI Conference on Artificial Intelligence, 2020.

Machado, M. C., Barreto, A., Precup, D., and Bowling, M. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research (JMLR), 24:80:1-80:69, 2023.

Misra, D. Mish: A self regularized non-monotonic neural activation function. CoRR, abs/1908.08681, 2019.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Neural Information Processing Systems (NeurIPS), 2019.

Park, S., Kreiman, T., and Levine, S. Foundation policies with hilbert representations. In International Conference on Machine Learning (ICML), 2024.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, 2018.

Pirotta, M., Tirinzoni, A., Touati, A., Lazaric, A., and Ollivier, Y. Fast imitation via behavior foundation models. In International Conference on Learning Representations (ICLR), 2024.

Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. T. Q. Multisample flow matching: Straightening flows with minibatch couplings. In International Conference on Machine Learning (ICML), 2023.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning (ICML), 2000.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning (ICML), 2001.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), 2015.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pp. 234-241, 2015.

Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.
Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In International Conference on Simulation of Adaptive Behavior, 1991.

Schramm, L. and Boularias, A. Bellman diffusion models. CoRR, abs/2407.12163, 2024.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604-609, 2020.

Shi, Y., De Bortoli, V., Campbell, A., and Doucet, A. Diffusion Schrödinger bridge matching. In Neural Information Processing Systems (NeurIPS), 2023.

Sikchi, H., Zhou, W., and Held, D. Learning off-policy with online planning. In Conference on Robot Learning (CoRL), 2021.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, 550(7676):354-359, 2017.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021a.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Neural Information Processing Systems (NeurIPS), 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021b.

Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In International Conference on Machine Learning (ICML), 2023.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. In International Conference on Learning Representations (ICLR), 2016.

Sutton, R. S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART, 2(4):160-163, 1991.

Talvitie, E. Model regularization for stable sample rollouts. In Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

Thakoor, S., Rowland, M., Borsa, D., Dabney, W., Munos, R., and Barreto, A. Generalised policy improvement with geometric policy composition. In International Conference on Machine Learning (ICML), 2022.

Tirinzoni, A., Touati, A., Farebrother, J., Guzek, M., Kanervisto, A., Xu, Y., Lazaric, A., and Pirotta, M. Zero-shot whole-body humanoid control via behavioral foundation models. In International Conference on Learning Representations (ICLR), 2025.

Tomar, M., Hansen-Estruch, P., Bachman, P., Lamb, A., Langford, J., Taylor, M. E., and Levine, S. Video occupancy models. CoRR, abs/2407.09533, 2024.
Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research (TMLR), 2024.

Touati, A. and Ollivier, Y. Learning one representation to optimize all rewards. In Neural Information Processing Systems (NeurIPS), 2021.

Touati, A., Rapin, J., and Ollivier, Y. Does zero-shot reinforcement learning exist? In International Conference on Learning Representations (ICLR), 2023.

Tunyasuvunakool, S., Muldal, A., Doron, Y., Liu, S., Bohez, S., Merel, J., Erez, T., Lillicrap, T., Heess, N., and Tassa, Y. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Neural Information Processing Systems (NeurIPS), 2017.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Wiltzer, H., Farebrother, J., Gretton, A., and Rowland, M. Foundations of multivariate distributional reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2024a.

Wiltzer, H., Farebrother, J., Gretton, A., Tang, Y., Barreto, A., Dabney, W., Bellemare, M. G., and Rowland, M. A distributional analogue to the successor representation. In International Conference on Machine Learning (ICML), 2024b.

Wu, R., Chen, Y., Swamy, G., Brantley, K., and Sun, W. Diffusing states and matching scores: A new framework for imitation learning. In International Conference on Learning Representations (ICLR), 2025.

Yang, L., Zhang, Z., Zhang, Z., Liu, X., Xu, M., Zhang, W., Meng, C., Ermon, S., and Cui, B. Consistency flow matching: Defining straight flows with velocity consistency. CoRR, abs/2407.02398, 2024.

Yarats, D., Brandfonbrener, D., Liu, H., Laskin, M., Abbeel, P., Lazaric, A., and Pinto, L. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. CoRR, abs/2201.13425, 2022.

Zhang, P., Chen, X., Zhao, L., Xiong, W., Qin, T., and Liu, T.-Y. Distributional reinforcement learning for multidimensional reward functions. In Neural Information Processing Systems (NeurIPS), 2021.

Zhu, C., Wang, X., Han, T., Du, S. S., and Gupta, A. Distributional successor features enable zero-shot policy optimization. In Neural Information Processing Systems (NeurIPS), 2024.

Appendices

A. Related Work

The Successor Representation (Dayan, 1993) was originally proposed for tabular MDPs and was later generalized to continuous state spaces with the Successor Measure (Blier et al., 2021). Successor Features (Barreto et al., 2017; 2020) extend these ideas by instead modeling the evolution of multi-dimensional features, assuming rewards decompose linearly over these features. Prior works have leveraged these methods for zero-shot policy evaluation (Dayan, 1993; Barreto et al., 2017; Wiltzer et al., 2024b), zero-shot policy optimization (Borsa et al., 2019; Touati & Ollivier, 2021; Touati et al., 2023; Park et al., 2024; Zhu et al., 2024; Cetin et al., 2024; Tirinzoni et al., 2025), imitation learning (Pirotta et al., 2024; Jain et al., 2025), exploration (Machado et al., 2020; Jain et al., 2023), representation learning (Le Lan et al., 2022; 2023a;b; Farebrother et al., 2023; Ghosh et al., 2023), and building temporal abstractions (Machado et al., 2018; 2023).
(2020) originally proposed a method to learn a generative model of the successor measure with modeling techniques spanning from Generative Adversarial Networks (Goodfellow et al., 2014) to Normalizing Flows (Dinh et al., 2015; Rezende & Mohamed, 2015) like RealNVP (Dinh et al., 2017). Follow-up work (e.g., Thakoor et al., 2022; Tomar et al., 2024) explored other generative modeling techniques, including various types of auto-encoders (e.g., Higgins et al., 2017; van den Oord et al., 2017). Also of note is recent work learning generative models of multi-dimensional cumulants, including features (Wiltzer et al., 2024a; Zhu et al., 2024) and multi-variate reward functions (Zhang et al., 2021). Prior work by Wiltzer et al. (2024b) sought to deal with the instability of long-horizon predictions in GHMs by employing an n-step mixture distribution where they sample $t \sim \mathrm{Geometric}(1 - \gamma)$ and bootstrap if $t > n$, otherwise returning the state at time $t$ along the trajectory. Without resorting to importance sampling, this approach is limited to the on-policy setting. Finally, most closely related to our work is that of Schramm & Boularias (2024), who provide a preliminary and limited derivation of what we term TD2-DD. In contrast, our work not only rigorously formalizes and significantly extends these ideas but also integrates them into the more general flow-matching framework (Lipman et al., 2023; 2024), additionally incorporating extensions to score matching (Song et al., 2021b) and diffusion (Sohl-Dickstein et al., 2015; Ho et al., 2020). Moreover, we conduct an extensive empirical analysis demonstrating the efficacy of our approach, an aspect notably absent from Schramm & Boularias (2024).

B. Extension to Score Matching and Diffusion Models

This section extends our framework to score matching and denoising diffusion models. We leverage the unification of these methods under stochastic differential equations (Song et al., 2021b), introducing an analogous class of Temporal Difference Diffusion methods.

B.1. Background

Both score-based generative modeling (Song & Ermon, 2019) and diffusion probabilistic modeling (Sohl-Dickstein et al., 2015; Ho et al., 2020) can be unified under the framework of stochastic differential equations (SDEs) introduced in Song et al. (2021b). Unlike in flow matching, time is inverted in diffusion models and ranges from 0 to T. Given the data distribution $q_0$ and a simple prior distribution $q_T$ (the noise distribution), we construct a diffusion process $\{X_t\}_{t \in [0,T]}$ such that $X_0 \sim q_0$ and $X_T \sim q_T$. This diffusion can be modeled as the solution to an Itô SDE:

$$\mathrm{d}X_t = f(t)\,X_t\,\mathrm{d}t + g(t)\,\mathrm{d}W_t\,, \quad X_0 \sim q_0\,, \tag{16}$$

where $W_t$ is a standard Brownian motion, $f : [0, T] \to \mathbb{R}$ is a scalar function called the drift coefficient, and $g : [0, T] \to \mathbb{R}$ is a scalar function known as the diffusion coefficient. Generating samples $X_0 \sim q_0$ consists of sampling $X_T \sim q_T$ and reversing the forward-SDE process in (16). A known result from Anderson (1982) states that the reverse of a diffusion process is also a diffusion process, running backward in time and given by the reverse-time SDE:

$$\mathrm{d}X_t = \left[f(t)\,X_t - g(t)^2\,\nabla_{X_t} \log q_t(X_t)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{W}_t\,, \quad X_T \sim q_T\,, \tag{17}$$

where $\bar{W}_t$ is a Brownian motion when time flows backwards from T to 0, $\mathrm{d}t$ is an infinitesimal negative timestep, and $q_t$ is the marginal distribution of $X_t$. Therefore, once we learn the score of the marginal distribution $\nabla_x \log q_t(x)$, we can sample from $q_0$ by simulating the reverse diffusion process (17), as sketched below.
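To make the sampling step concrete, the following is a minimal NumPy sketch of Euler–Maruyama integration of the reverse-time SDE (17); `score_fn`, `f`, and `g` are placeholder callables for a learned score model and the forward coefficients, not part of any released implementation.

```python
import numpy as np

def reverse_sde_sample(score_fn, x_T, f, g, T=1.0, n_steps=1000, rng=None):
    """Euler-Maruyama integration of the reverse-time SDE (17), from t = T to t = 0.

    score_fn(t, x) approximates the marginal score grad_x log q_t(x);
    f(t) and g(t) are the scalar drift/diffusion coefficients of the forward SDE (16).
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = f(t) * x - g(t) ** 2 * score_fn(t, x)  # bracketed drift of Eq. (17)
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x  # approximate sample from q_0
```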
To estimate $\nabla_x \log q_t(x)$, we can train a time-dependent score-based model $s(\cdot\,; \theta) : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ via the denoising diffusion / score matching objective (Vincent, 2011; Song & Ermon, 2019):

$$\ell_{\mathrm{DD}}(\theta) = \mathbb{E}_{t \sim U([0,1]),\, X_0 \sim q_0}\, \mathbb{E}_{X_t \sim q_{t|0}(\cdot\,|X_0)} \left[ \left\| s_t(X_t; \theta) - \nabla_{X_t} \log q_{t|0}(X_t \mid X_0) \right\|^2 \right]. \tag{18}$$

For $\ell_{\mathrm{DD}}$ to be tractable, we need to know the conditional probability $q_{t|0}$. Usually, specific choices of the drift and diffusion coefficients $f_t$ and $g_t$ are used such that $q_{t|0}$ is always a Gaussian distribution $\mathcal{N}(\cdot \mid \alpha_t x_0, \sigma_t^2)$, where the mean $\alpha_t$ and variance $\sigma_t^2$ can be computed in closed form. The global minimizer of $\ell_{\mathrm{DD}}(\theta)$, denoted by $s_t^\star(x)$, is equal to the score function $\nabla_x \log q_t(x)$, thanks to the following proposition:

Proposition 2 (Vincent 2011). Let $q_t(x) = \int q_0(x_0)\, q_{t|0}(x \mid x_0)\, \mathrm{d}x_0$; then we have:

$$\nabla_\theta\, \ell_{\mathrm{DD}}(\theta) = \nabla_\theta\, \mathbb{E}_{t,\, X_t \sim q_t} \left[ \left\| s_t(X_t; \theta) - \nabla_{X_t} \log q_t(X_t) \right\|^2 \right]. \tag{19}$$

B.2. Temporal Difference Diffusion

To learn a predictive model of $m^\pi$ using diffusion from an offline dataset, we follow an approach similar to the one presented in Section 3 and define an iterative process that starts from initial weights $\theta^{(0)}$ and at each iteration minimizes the Temporal-Difference Denoising Diffusion (TD-DD) loss:

$$\ell_{\mathrm{TD\text{-}DD}}(\theta) = \mathbb{E}_{\rho,\, t,\, X_0,\, X_t} \left[ \left\| s_t(X_t \mid S, A; \theta) - \nabla_x \log q_{t|0}(X_t \mid X_0) \right\|^2 \right], \quad \text{where } X_0 \sim \mathcal{T}^\pi \tilde m^{(n)}_{0|T}(\cdot \mid S, A)\,,\; X_t \sim q_{t|0}(\cdot \mid X_0)\,. \tag{TD-DD; 20}$$

In order to sample $X_0 \sim \mathcal{T}^\pi \tilde m^{(n)}_{0|T}(\cdot \mid s, a)$: with probability $1 - \gamma$, we return the successor state $S' \sim P(\cdot \mid S, A)$; otherwise, with probability $\gamma$, we solve the following reverse-time SDE from $X_T$ using the score $s^{(n)}_t$:

$$\mathrm{d}X_t = \left[f(t)\,X_t - g(t)^2\, s^{(n)}_t(X_t \mid S', \pi(S'))\right]\mathrm{d}t + g(t)\, \mathrm{d}\bar{W}_t\,. \tag{21}$$

Minimizing $\ell_{\mathrm{TD\text{-}DD}}(\theta)$ yields a score function $s^{(n+1)}_t(\cdot \mid s, a)$ generating a marginal probability $q^{(n+1)}_t$ that approximates $\mathcal{T}^\pi q^{(n)}_0$ at $t = 0$.

Following the TD2-CFM blueprint, we can further exploit the structure of the target bootstrapped distribution to design an improved diffusion process that converts Gaussian noise to $\mathcal{T}^\pi q^{(n)}_0$. First, we show below that a mixture of diffusion processes is also a diffusion process with modified drift and diffusion functions.

Lemma 2. Consider two diffusion processes with drift functions $f$ and $\tilde f$, sharing the same diffusion coefficient $g$:

$$\mathrm{d}X_t = f_t(X_t)\, \mathrm{d}t + g(t)\, \mathrm{d}W_t\,, \qquad \mathrm{d}\tilde X_t = \tilde f_t(\tilde X_t)\, \mathrm{d}t + g(t)\, \mathrm{d}W_t\,.$$

Let $q_t$ and $\tilde q_t$ be their marginal distributions; then the diffusion process corresponding to the mixture marginal distribution $\bar q_t = (1 - \gamma)\, q_t + \gamma\, \tilde q_t$ is:

$$\mathrm{d}X_t = \frac{(1 - \gamma)\, q_t\, f_t + \gamma\, \tilde q_t\, \tilde f_t}{(1 - \gamma)\, q_t + \gamma\, \tilde q_t}(X_t)\, \mathrm{d}t + g(t)\, \mathrm{d}W_t\,.$$

Proof. The marginal probabilities $q_t$ and $\tilde q_t$ are characterized by the Fokker-Planck equations:

$$\frac{\partial q_t}{\partial t} = -\mathrm{div}(q_t f_t) + \frac{g_t^2}{2} \Delta q_t\,, \qquad \frac{\partial \tilde q_t}{\partial t} = -\mathrm{div}(\tilde q_t \tilde f_t) + \frac{g_t^2}{2} \Delta \tilde q_t\,,$$

where $\mathrm{div}$ is the divergence operator and $\Delta = \mathrm{div}\, \nabla$ is the Laplace operator. Therefore,

$$\frac{\partial \bar q_t}{\partial t} = (1 - \gamma)\frac{\partial q_t}{\partial t} + \gamma \frac{\partial \tilde q_t}{\partial t} = -\mathrm{div}\big((1 - \gamma)\, q_t f_t + \gamma\, \tilde q_t \tilde f_t\big) + \frac{g_t^2}{2} \Delta \bar q_t = -\mathrm{div}\left(\frac{(1 - \gamma)\, q_t f_t + \gamma\, \tilde q_t \tilde f_t}{(1 - \gamma)\, q_t + \gamma\, \tilde q_t}\, \bar q_t\right) + \frac{g_t^2}{2} \Delta \bar q_t\,.$$

The drift $\frac{(1-\gamma)\, q_t f_t + \gamma\, \tilde q_t \tilde f_t}{(1-\gamma)\, q_t + \gamma\, \tilde q_t}$ and the diffusion coefficient $g_t$ therefore satisfy the Fokker-Planck equation with the probability path $\bar q_t$, and hence their associated diffusion process generates $\bar q_t$.

Lemma 2 can be easily extended to the case of a continuous mixture of diffusion processes. This result shows that it is possible to use two independent diffusion processes for the two terms in the sampling process induced by the Bellman operator. For the first, we can use the standard noising diffusion process

$$q_t(x \mid s, a) = \int q_{t|0}(x \mid s')\, P(\mathrm{d}s' \mid s, a)\,,$$

where we sample $X_t \sim q_{t|0}(\cdot \mid s')$ by simulating a simple forward diffusion process (16), as sketched below.
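As a concrete illustration of this first term, here is a minimal sketch of the closed-form forward (noising) sample $X_t \sim q_{t|0}(\cdot \mid x_0)$, assuming the linear-β VP-SDE schedule of Song et al. (2021b) with the $\beta_{\min}/\beta_{\max}$ values reported in Appendix C.3.2; the helper name is ours.

```python
import numpy as np

def forward_vp_sample(x0, t, beta_min=0.1, beta_max=20.0, rng=None):
    """Closed-form X_t ~ N(alpha_t * x0, sigma_t^2 I) for the VP-SDE, where
    alpha_t = exp(-0.5 * int_0^t beta(s) ds) and sigma_t^2 = 1 - alpha_t^2."""
    rng = np.random.default_rng() if rng is None else rng
    log_alpha = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    alpha_t = np.exp(log_alpha)
    sigma_t = np.sqrt(1.0 - np.exp(2.0 * log_alpha))
    eps = rng.standard_normal(np.shape(x0))
    return alpha_t * np.asarray(x0) + sigma_t * eps
```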
For the second term, we can leverage the GHM $m^{(n)}_t$ at the previous iteration to construct the process

$$\tilde q^{(n)}_t(x \mid s, a) = \int m^{(n)}_t(x \mid s', \pi(s'))\, P(\mathrm{d}s' \mid s, a)\,,$$

where $m^{(n)}_t(x \mid s', a')$ is the marginal probability of the reverse SDE induced by the score $s^{(n)}$:

$$\mathrm{d}X_t = \left[f(t)\, X_t - g(t)^2\, s^{(n)}_t(X_t \mid s', a')\right]\mathrm{d}t + g(t)\, \mathrm{d}\bar{W}_t\,.$$

Additionally, $\tilde q^{(n)}_t(x \mid s, a)$, as a continuous mixture of the diffusion marginals $m^{(n)}_t(x \mid s', \pi(s'))$ weighted by $P(s' \mid s, a)$, can be generated by the diffusion process

$$\mathrm{d}X_t = \left[f(t)\, X_t - g(t)^2\, \bar s_t(X_t \mid s, a)\right]\mathrm{d}t + g(t)\, \mathrm{d}\bar{W}_t\,, \quad \text{where } \bar s_t(x_t \mid s, a) = \frac{\int P(\mathrm{d}s' \mid s, a)\, m^{(n)}_t(x_t \mid s', \pi(s'))\, s^{(n)}_t(x_t \mid s', \pi(s'))}{\int P(\mathrm{d}s' \mid s, a)\, m^{(n)}_t(x_t \mid s', \pi(s'))}\,.$$

Given these two diffusion processes, the target probability $q^{(n+1)}_t = (1 - \gamma)\, q_t + \gamma\, \tilde q^{(n)}_t$ can be generated by the following reverse SDE:

$$\mathrm{d}X_t = \left[f(t)\, X_t - g(t)^2\, s^{(n+1)}_t(X_t \mid s, a)\right]\mathrm{d}t + g(t)\, \mathrm{d}\bar{W}_t\,, \quad \text{where } s^{(n+1)}_t(x \mid s, a) = \frac{(1 - \gamma)\, q_t\, \nabla_x \log q_t + \gamma\, \tilde q^{(n)}_t\, \bar s^{(n)}_t}{(1 - \gamma)\, q_t + \gamma\, \tilde q^{(n)}_t}(x \mid s, a)\,.$$

Therefore, we can learn $s_t(\cdot\,; \theta)$ to approximate $s^{(n+1)}_t$ by minimizing the loss

$$\ell(\theta) = (1 - \gamma)\, \mathbb{E}_{\rho, t, X_t \sim q_t(\cdot|S,A)} \left[ \left\| s(X_t \mid S, A; \theta) - \nabla_{X_t} \log q_t(X_t \mid S, A) \right\|^2 \right] + \gamma\, \mathbb{E}_{\rho, t, X_t \sim \tilde q^{(n)}_t(\cdot|S,A)} \left[ \left\| s(X_t \mid S, A; \theta) - \bar s^{(n)}_t(X_t \mid S, A) \right\|^2 \right]. \tag{22}$$

We can simplify the first term via Proposition 2 (since $q_t(x \mid s, a) = \int q_{t|0}(x \mid s')\, P(\mathrm{d}s' \mid s, a)$); hence we have

$$\nabla_\theta\, \mathbb{E}_{\rho, t, X_t \sim q_t(\cdot|s,a)} \left[ \left\| s(X_t \mid s, a; \theta) - \nabla_{X_t} \log q_t(X_t \mid S, A) \right\|^2 \right] = \nabla_\theta\, \mathbb{E}_{\rho, t, X_t \sim q_{t|0}(\cdot|S')} \left[ \left\| s(X_t \mid S, A; \theta) - \nabla_{X_t} \log q_{t|0}(X_t \mid S') \right\|^2 \right].$$

Moreover, using a similar argument to the equivalence between the gradients of the marginal and conditional flow-matching objectives, we can show that

$$\nabla_\theta\, \mathbb{E}_{\rho, t, X_t \sim \tilde q^{(n)}_t(\cdot|S,A)} \left[ \left\| s(X_t \mid S, A; \theta) - \bar s^{(n)}_t(X_t \mid S, A) \right\|^2 \right] = \nabla_\theta\, \mathbb{E}_{\rho, t, X_T \sim q_T,\, X_t \sim q^{(n)}_{t|T}(\cdot|S',\pi(S'))} \left[ \left\| s(X_t \mid S, A; \theta) - s^{(n)}_t(X_t \mid S', \pi(S')) \right\|^2 \right].$$

This leads us to the final TD2-DD loss function:

$$\ell_{\mathrm{TD^2\text{-}DD}}(\theta) = (1 - \gamma)\, \mathbb{E}_{\rho, t, X_t \sim q_{t|0}(\cdot|S')} \left[ \left\| s_t(X_t \mid S, A; \theta) - \nabla_x \log q_{t|0}(X_t \mid S') \right\|^2 \right] + \gamma\, \mathbb{E}_{\rho, t, X_t \sim q^{(n)}_{t|T}(\cdot|S',\pi(S'))} \left[ \left\| s(X_t \mid S, A; \theta) - s^{(n)}_t(X_t \mid S', \pi(S')) \right\|^2 \right]. \tag{23}$$

C. Experimental Details

C.1. Evaluation

Table 3. Evaluation hyper-parameters for both single and multi-policy experiments.

Evaluation
  Number of states s0: 64
  Number of mπ-samples per state: 2048
  Number of GHM-samples per state: 2048
  Number of episodes per state: 1
  Episode length: 1000
GPI
  Number of z samples: 256
  Number of GHM samples: 128
  Number of FB inference samples: 250,000

Evaluating a GHM can be challenging: TD-based losses employing bootstrapping do not provide a good signal as to the quality of the learned model. Instead, we opt to measure 1) the likelihood of a trajectory coming from the true discounted occupancy of a given policy, 2) the Earth Mover's Distance (EMD; Rubner et al., 2000) between samples from the true occupancy and our GHM, which provides an estimate of the distance between these two probability distributions, and 3) the value-function approximation error. In all cases, to obtain samples from the true discounted occupancy, we collect trajectories $\{(s_0, s_1, \ldots, s_T)\}_{i=1}^N$ from policy $\pi$ and subsequently resample states according to $t \sim \mathrm{Geometric}(1 - \gamma)$ for a particular discount factor $\gamma \in [0, 1)$. Armed with samples from $m^\pi$, we compute the aforementioned metrics following the procedures stated below, along with the parameter values outlined in Table 3. A sketch of this resampling step follows.
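This is an illustrative sketch of the resampling step just described; the helper name and the truncation at the trajectory length are our assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def occupancy_samples(trajectory, gamma, n_samples, rng=None):
    """Draw states from a trajectory according to t ~ Geometric(1 - gamma),
    approximating samples from the discounted occupancy m_pi.

    trajectory: array of successor states (s_1, ..., s_T); draws beyond the
    trajectory length are truncated to T, a small bias of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(trajectory)
    t = np.minimum(rng.geometric(1.0 - gamma, size=n_samples), T)  # support {1, 2, ...}
    return trajectory[t - 1]
```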
Normalized Negative Log-Likelihood. To compute the log-likelihood of our flow matching and diffusion methods, we take advantage of the following change of variables formula (Dinh et al., 2015; Rezende & Mohamed, 2015; Chen et al., 2018):

$$\log \tilde m(x_1 \mid s, a; \theta) = \log \varphi(x_0) + \int_0^1 \frac{\partial}{\partial t} \log \tilde m(x_t \mid s, a; \theta)\, \mathrm{d}t\,,$$

where $\varphi$ is the probability density function of a standard Gaussian distribution, which acts as the prior on $x_0$. The change in log-density over time can be written as the following differential equation, called the instantaneous change of variables formula (Chen et al., 2018, Theorem 1):

$$\frac{\partial \log \tilde m(x_t \mid s, a; \theta)}{\partial t} = -\mathrm{Tr}\left(\frac{\partial v_t(x_t \mid s, a; \theta)}{\partial x_t}\right).$$

We can now compute the log-likelihood for a sample $X \sim m^\pi(\cdot \mid s, a)$ by integrating the total change in log-density backward in time from $x_1 = X$ to obtain $x_0$, which has a tractable likelihood. In practice, we solve the following coupled initial value problem using numerical integration (Grathwohl et al., 2019):

$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix} x_t \\ \delta_t \end{bmatrix} = \begin{bmatrix} v_t(x_t \mid s, a; \theta) \\ -\mathrm{Tr}\left(\partial v_t(x_t \mid s, a; \theta) / \partial x_t\right) \end{bmatrix}, \qquad \begin{bmatrix} x_1 \\ \delta_1 \end{bmatrix} = \begin{bmatrix} X \\ 0 \end{bmatrix}, \tag{24}$$

whose solution at $t = 0$ yields $x_0$ and the accumulated change in log-density $\delta_0$, so that $\log \tilde m(x_1 \mid s, a; \theta) = \log \varphi(x_0) - \delta_0$. For all experiments, we report the negative log-likelihood normalized by the dimension of the observation space.

Earth Mover's Distance. We compute the Earth Mover's Distance (EMD; Rubner et al., 2000), also known as the Wasserstein-1 distance, between $m = 2048$ samples from the ground-truth distribution $X \sim m^\pi(\cdot \mid S_k, A_k)$ and our learned GHM $\tilde X \sim \tilde m(\cdot \mid S_k, A_k; \theta)$ for a set of randomly sampled state-action pairs $\{(S_k, A_k)\}_{k=1}^n$. Intuitively, the EMD quantifies the minimum cost required to transform one distribution into another, where the cost is defined in terms of the Euclidean distance between states $X^{(i)}, \tilde X^{(j)}$. Formally, we have

$$\mathrm{EMD}\big(\{X^{(1)}, \ldots, X^{(m)}\}, \{\tilde X^{(1)}, \ldots, \tilde X^{(m)}\}\big) = \min_{\xi \in \Xi} \sum_{i,j} \xi_{ij}\, \big\| X^{(i)} - \tilde X^{(j)} \big\|_2\,,$$

where $\xi$ is a transport plan such that $\xi_{ij}$ specifies the proportion of mass moved from $X^{(i)}$ to $\tilde X^{(j)}$. We report the average EMD across $n = 64$ source states using the Python Optimal Transport (Flamary et al., 2021) library.

Value Function Mean Square Error (MSE(V)). We compute the mean square error between a Monte-Carlo estimate $\tilde V^\pi_{\mathrm{MC}}$ of the value function $V^\pi(s)$ and the estimate $\tilde V^\pi_{\mathrm{GHM}}$ obtained using the learned model. We obtain $\tilde V^\pi_{\mathrm{MC}}$ by collecting a trajectory $\{(s_0, s_1, \ldots, s_T)\}$ from policy $\pi$ and computing the discounted sum of rewards. We generate a single trajectory since both the policy and the environment are deterministic. The GHM estimate is given by (2), i.e., $\tilde V^\pi_{\mathrm{GHM}}(s) = (1 - \gamma)^{-1}\, \mathbb{E}_{\tilde X \sim \tilde m(\cdot|s, \pi(s))}\big[r(\tilde X)\big]$. Then,

$$\mathrm{MSE}(\tilde V^\pi_{\mathrm{MC}}, \tilde V^\pi_{\mathrm{GHM}}) = \mathbb{E}_{S_0 \sim \nu}\Big[\big(\tilde V^\pi_{\mathrm{GHM}}(S_0) - \tilde V^\pi_{\mathrm{MC}}(S_0)\big)^2\Big].$$

We average our results over 64 initial states $S_0$ sampled from the initial state distribution $\nu$.

Planning with GPI. We evaluate planning performance by computing the average return over 100 episodes, each lasting 1,000 steps, for every task. For the Forward-Backward representation (Touati & Ollivier, 2021), we directly follow the policy $\pi_{w_r}$ (thus $a_t = \pi_{w_r}(s_t)$), where $w_r = \mathbb{E}_{(S,R) \sim \rho}[B(S)\, R]$ is the zero-shot policy embedding inferred using 250,000 transitions labeled with the task reward function $r$. Given that FB provides a direct way of estimating the value function of a policy (i.e., $Q^{\pi_w}(s, a) = F(s, a, w)^\top w_r$), we can plan in the policy-embedding space by solving the following problem:

$$w_t^{\mathrm{FB\text{-}GPI}} \in \arg\max_{w \in D(W)} F(s_t, \pi_w(s_t), w)^\top w_r\,.$$

This optimization problem requires no generation except sampling from $D(W)$. We approximate the max using 255 samples from $D(W)$, additionally incorporating $w_r$ to ultimately maximize over 256 policies. A sketch of this maximization follows.
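A minimal sketch of the FB-GPI maximization; `F_net` and `policy` stand in for the pretrained Forward-Backward networks, and all shapes are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def fb_gpi_action(F_net, policy, s_t, w_r, w_candidates):
    """Select the action of the policy embedding maximizing
    Q^{pi_w}(s_t, pi_w(s_t)) = F(s_t, pi_w(s_t), w)^T w_r."""
    n = w_candidates.shape[0]
    s = s_t.unsqueeze(0).expand(n, -1)              # (n, state_dim)
    a = policy(s, w_candidates)                     # (n, action_dim)
    q = (F_net(s, a, w_candidates) * w_r).sum(-1)   # inner products with w_r
    return a[q.argmax()]
```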
On the other hand, for GHM-GPI, we solve the following optimization problem:

$$w_t^{\mathrm{GHM\text{-}GPI}} \in \arg\max_{w \in D(W)}\; \underbrace{(1 - \gamma)^{-1}\, \mathbb{E}_{X \sim m^{\pi_w}(\cdot \mid s_t, \pi_w(s_t))}[r(X)]}_{Q^{\pi_w}(s_t,\, \pi_w(s_t))}\,,$$

which requires generating samples from $m^{\pi_w}$. In our experiments we generate 128 samples from $m^{\pi_w}$.

C.2. Environments

Experiments in this paper were conducted with a subset of domains from the DeepMind Control Suite (Tunyasuvunakool et al., 2020) highlighted in Figure 4.

Figure 4. A visual depiction of each domain used in our experiments from the DeepMind Control Suite (Tunyasuvunakool et al., 2020). From left to right: MAZE, CHEETAH, QUADRUPED, WALKER.

C.3. Geometric Horizon Models

This section describes each class of generative model used for our empirical experiments.

C.3.1. FLOW MATCHING

Algorithm 1: Template for TD-Flow algorithms
1: Inputs: offline dataset D, policy π, batch size K, Polyak coefficient ζ, weight decay λ, randomly initialized weights θ, discount factor γ, learning rate η, one-step conditional path $p_{t|1}$ and conditional vector field $u_{t|1}$, bootstrap path $\bar p_t$ and vector field $\bar v_t$.
2: for n = 1, . . . do
3:   Sample mini-batch $\{(S_k, A_k, S'_k)\}_{k=1}^K$ from D
4:   for k = 1, . . . , K do
5:     Sample $t_k \sim U([0, 1])$
6:     Sample $X_k \sim p_{t_k|1}(\cdot \mid S'_k)$
7:     $\ell_k(\theta) = \| v_{t_k}(X_k \mid S_k, A_k; \theta) - u_{t_k|1}(X_k \mid S'_k) \|^2$
8:     Sample $\bar X_k \sim \bar p_{t_k}(\cdot \mid S'_k, \pi(S'_k); \theta^-)$
9:     $\bar\ell_k(\theta) = \| v_{t_k}(\bar X_k \mid S_k, A_k; \theta) - \bar v_{t_k}(\bar X_k \mid S'_k, \pi(S'_k); \theta^-) \|^2$
10:  end for
11:  # Compute loss
12:  $\ell(\theta) = \frac{1}{K} \sum_{k=1}^K \big[(1 - \gamma)\, \ell_k(\theta) + \gamma\, \bar\ell_k(\theta)\big]$
13:  # Perform gradient step
14:  $\theta \leftarrow \theta - \eta\, \nabla_\theta\big(\ell(\theta) + \lambda \|\theta\|^2\big)$
15:  # Update parameters of target vector field
16:  $\theta^- \leftarrow \zeta\, \theta^- + (1 - \zeta)\, \theta$
17: end for

Table 4. Summary of how the different TD-Flow algorithms generate the target probability path and vector field. The neural ODE $\bar\psi_t$ is defined by the target vector field $\bar v_t$ computed at iteration n.

Method      Target vector field                      Target path sampling
TD-CFM      $u_{t|1}(X_t \mid X_1)$                  $X_1 = \bar\psi_1(X_0 \mid S', A'; \theta^-)$, $X_t \sim p_{t|1}(\cdot \mid X_1)$
TD-CFM(C)   $u_{t|0,1}(X_t \mid X_0, X_1)$           $X_0 \sim m_0$, $X_1 = \bar\psi_1(X_0 \mid S', A'; \theta^-)$, $X_t \sim p_{t|0,1}(\cdot \mid X_0, X_1)$
TD2-CFM     $\bar v_t(X_t \mid S', A'; \theta^-)$    $X_t = \bar\psi_t(X_0 \mid S', A'; \theta^-)$

To discuss the TD-Flow methods introduced herein, we first unify the loss function by defining a general template for the loss:

$$\ell(\theta) = (1 - \gamma)\, \mathbb{E}_{\rho, t, X_t \sim p_{t|1}(\cdot|S')} \left[ \left\| v_t(X_t \mid S, A; \theta) - u_{t|1}(X_t \mid S') \right\|^2 \right] + \gamma\, \mathbb{E}_{\rho, t, X_t \sim \bar p^{(n)}_t(\cdot|Z)} \left[ \left\| v_t(X_t \mid S, A; \theta) - \bar v^{(n)}_t(X_t \mid Z) \right\|^2 \right].$$

We can now recover each algorithm by a specific choice of the target probability path $\bar p^{(n)}_t$ and vector field $\bar v^{(n)}_t$, as illustrated in Table 4. Based on this unified structure, we present pseudo-code for the TD-Flow methods in Algorithm 1. In practice, instead of proceeding through full iterations, we use standard mini-batch gradient updates with a target network $\theta^-$ updated as a moving average of $\theta$. When employing the conditional probability path $p_{t|1}$ and vector field $u_{t|1}$, we use the standard Gaussian linear interpolation defined as $p_{t|1}(\cdot \mid X_1) = \mathcal{N}(\cdot \mid t X_1, (1 - t)^2 I)$, hence $X_t = t X_1 + (1 - t) X_0 \sim p_{t|1}$, resulting in $u_{t|1}(X_t \mid X_1) = (X_1 - X_t)/(1 - t)$ (Lipman et al., 2023). The source distribution for all experiments is $m_0(\cdot) = \mathcal{N}(\cdot \mid 0, I)$. To sample from the neural ODE we use the midpoint method with a constant step size of 1/10 for a total of 10 steps. We found both coupled and TD2 methods do not require many solver steps and hypothesize this is due to the reduction in transport cost as analyzed in Appendix E.7. A minimal training-step sketch follows.
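The following is a hedged PyTorch sketch of one mini-batch of the TD²-CFM instantiation of Algorithm 1; `v_net`, `v_target`, and `ode_sample` are placeholder callables (the online field, the target field, and a solver that integrates the target field to produce bootstrapped samples), not the paper's released code.

```python
import torch

def td2_cfm_loss(v_net, v_target, ode_sample, batch, gamma):
    """One TD^2-CFM loss evaluation following Algorithm 1 (lines 3-12)."""
    s, a, s_next, a_next = batch                      # a_next = pi(s_next)
    t = torch.rand(s.shape[0], 1)

    # (1 - gamma) term: conditional flow matching on the one-step transition,
    # with p_{t|1} = N(t x1, (1 - t)^2 I) and u_{t|1}(x_t | x1) = (x1 - x_t) / (1 - t).
    x0 = torch.randn_like(s_next)
    x_t = t * s_next + (1.0 - t) * x0
    u = (s_next - x_t) / (1.0 - t)
    loss_one = (v_net(t, x_t, s, a) - u).pow(2).sum(-1)

    # gamma term: regress onto the target vector field along its own path,
    # evaluated at samples X_t = psi_t(X_0 | s', pi(s')) from the target ODE.
    with torch.no_grad():
        xb_t = ode_sample(v_target, t, s_next, a_next)
        v_boot = v_target(t, xb_t, s_next, a_next)
    loss_boot = (v_net(t, xb_t, s, a) - v_boot).pow(2).sum(-1)

    return ((1.0 - gamma) * loss_one + gamma * loss_boot).mean()
```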
For all flow and diffusion-based methods, we employ a U-Net-style architecture (Ronneberger et al., 2015) that has hierarchical skip connections throughout an MLP. We embed the timestep t by first increasing its dimensionality with a sinusoidal embedding before transforming it through a two-layer MLP with mish activations (Misra, 2019). We further process additional conditioning information, such as the state-action pair and the Forward-Backward embedding z, through an additional two-layer MLP, whose result then gets concatenated with our time embedding. Finally, the network integrates all prior conditioning information through FiLM modulation (Perez et al., 2018) that replaces the learned affine transformation of layer normalization (Ba et al., 2016).

C.3.2. DENOISING DIFFUSION

We train a Denoising Diffusion Probabilistic Model (DDPM; Ho et al., 2020) using the same architecture as our flow matching model above, with the output now being interpreted as a prediction of the noise seed $\epsilon_0$ that began the diffusion process. We discretize the diffusion process using 1,000 steps with $\beta_{\min} = 0.1$ and $\beta_{\max} = 20$. We employ the DDIM sampler (Song et al., 2021a) with 50 sampling steps for both training and evaluation. For evaluating our DDPM model, we compute exact log-likelihoods using the instantaneous change of variables formula (Chen et al., 2018) along with the probability flow ODE from Song et al. (2021b). That is, we solve the initial value problem in (24) using the vector field

$$v_t(x_t \mid s, a; \theta) = -\frac{1}{2}\big(\beta_{\min} + t\,(\beta_{\max} - \beta_{\min})\big)\left[x_t - \frac{1}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_t(x_t \mid s, a; \theta)\right].$$

We now outline the losses for each of the TD-DPM experiments in the paper:

TD-DD. To train our vanilla diffusion GHM we employ the standard DDPM-style objective, that is, we optimize the following loss:

$$\mathbb{E}_{\rho,\, t,\, \epsilon \sim \mathcal{N}(\cdot|0,I),\, X_0 \sim (\mathcal{T}^\pi \tilde m^{(n)})(\cdot|S,A)} \left[ \left\| \epsilon - \epsilon_t\big(\sqrt{\bar\alpha_t}\, X_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \mid S, A; \theta\big) \right\|^2 \right], \tag{25}$$

where the bootstrap samples $X_0$ are generated using the target parameters $\theta^-$, and $\bar\alpha_t$ are the standard diffusion coefficients as in Ho et al. (2020).

TD2-DD. As outlined in Section 3.1, we can split our DDPM loss into two terms: one that uses standard DDPM training on one-step transitions, and a second that regresses onto the target network's noise prediction. This materializes as

$$\ell(\theta) = \mathbb{E}_{\rho, t, \epsilon, X_0} \left[ \left\| \epsilon_t\big(\sqrt{\bar\alpha_t}\, X_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \mid S, A; \theta\big) - \epsilon \right\|^2 \right], \quad \text{where } X_0 \sim P(\cdot \mid S, A)\,,$$

$$\bar\ell(\theta) = \mathbb{E}_{\rho, t, \bar X_t} \left[ \big\| \epsilon_t(\bar X_t \mid S, A; \theta) - \epsilon^{(n)}_t\big(\bar X_t \mid S', \pi(S')\big) \big\|^2 \right], \quad \text{where } \bar X_t \sim q^{(n)}_{t|T}(\cdot \mid S', \pi(S'))\,,$$

$$\ell_{\mathrm{TD^2\text{-}DD}}(\theta) = (1 - \gamma)\, \ell(\theta) + \gamma\, \bar\ell(\theta)\,.$$

C.3.3. GENERATIVE ADVERSARIAL NETWORK

We implement a modern Generative Adversarial Network (GAN; Goodfellow et al., 2014) baseline based on the recommendations of Huang et al. (2024). Specifically, we train a relativistic GAN (Jolicoeur-Martineau, 2019), resulting in the following loss:

$$\ell_{\mathrm{GAN}}(\theta_G, \theta_D) = \mathbb{E}_{\rho, X_0, X_1}\left[ f\big( D(G(X_0 \mid S, A; \theta_G); \theta_D) - D(X_1 \mid S, A; \theta_D) \big) \right], \quad \text{where } X_0 \sim \mathcal{N}(\cdot \mid 0, I)\,,\; X_1 \sim \mathcal{T}^\pi \tilde m^{(n)}(\cdot \mid S, A)\,.$$

We take $f(x) = \log(1 + \exp(-x))$ to be the log-sigmoid function (Jolicoeur-Martineau, 2019) and further add the following zero-centered gradient penalties on the discriminator:

$$R_1(\theta_D) = \mathbb{E}_{\rho,\, X \sim (\mathcal{T}^\pi \tilde m^{(n)})(\cdot|S,A)} \left\| \nabla_X D(X \mid S, A) \right\|^2, \qquad R_2(\theta_G, \theta_D) = \mathbb{E}_{\rho,\, X \sim (\mathcal{T}^\pi \tilde m)(\cdot|S,A;\theta_G)} \left\| \nabla_X D(X \mid S, A) \right\|^2.$$

The penalty $R_1$ penalizes the gradient norm of the discriminator $D$ on real data sampled from our current iterate $\tilde m^{(n)}$, whereas $R_2$ penalizes the gradient norm on fake data generated directly from the current generator. We experimented with different coefficients and schedules for these gradient penalties and settled on a linear decay schedule from 0.05 to 0.005 throughout training. Furthermore, as is common practice, we impose a schedule on the second-moment EMA coefficient $\beta_2$ of Adam (Kingma & Ba, 2015), increasing it from 0.9 to 0.99 throughout training. A sketch of the relativistic pairing and the $R_1$ penalty follows.
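To make the relativistic pairing and the zero-centered penalty concrete, here is a brief PyTorch sketch; `D`, `G`, and the conditioning arguments are placeholders, and only the generator direction and the $R_1$ penalty are shown.

```python
import torch
import torch.nn.functional as F

def rgan_loss(D, G, z, x_real, s, a):
    """f(D(fake) - D(real)) with f(x) = log(1 + exp(-x)) = softplus(-x)."""
    x_fake = G(z, s, a)
    return F.softplus(-(D(x_fake, s, a) - D(x_real, s, a))).mean()

def r1_penalty(D, x_real, s, a):
    """Zero-centered gradient penalty on 'real' samples (R1 above)."""
    x = x_real.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(D(x, s, a).sum(), x, create_graph=True)
    return grad.pow(2).sum(-1).mean()
```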
The generator and discriminator architectures in our GAN are implemented as residual MLPs with leaky ReLU activations, with the same FiLM-style conditioning (Perez et al., 2018) as our flow and diffusion models. The input to our generator is random noise sampled from an isotropic Gaussian with dimensionality equal to the number of state dimensions in the environment.

C.3.4. VARIATIONAL AUTO-ENCODER

We implement a β-Variational Auto-Encoder (Kingma & Welling, 2014; Higgins et al., 2017) following the best practices outlined in Thakoor et al. (2022). That is, we train our VAE to minimize the following loss:

$$\ell_{\mathrm{VAE}}(\theta_E, \theta_D) = \mathbb{E}_{\rho, X_1} \Big[ \mathbb{E}_{X_0 \sim q_{\theta_E}(\cdot|S,A,X_1)}\big[ -\log p_{\theta_D}(X_1 \mid S, A, X_0) \big] + \beta\, D_{\mathrm{KL}}\big(q_{\theta_E} \,\|\, p_0\big) \Big], \quad \text{where } X_1 \sim \mathcal{T}^\pi \tilde m^{(n)}(\cdot \mid S, A)\,.$$

We employ a similar architecture to our GAN-GHM and use a residual MLP for the encoder and decoder. We use an isotropic Gaussian latent space with the number of latents equal to the number of state dimensions in the environment. We also swept over β ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} on the MAZE domain and chose β = 0.5 for the rest of our experiments. Overall, we found the β-VAE-based GHM to be very unstable; it likely requires very careful fine-tuning of β to achieve adequate performance at long horizons. A sketch of this objective follows.
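As a reference point for the loss above, here is a minimal PyTorch sketch of the β-VAE objective with a Gaussian reparameterized posterior; the encoder/decoder interfaces are assumptions of this sketch.

```python
import torch

def beta_vae_loss(encoder, decoder, s, a, x1, beta):
    """Reconstruction + beta * KL(q(x0 | s, a, x1) || N(0, I))."""
    mu, log_var = encoder(s, a, x1)
    x0 = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
    recon = (decoder(s, a, x0) - x1).pow(2).sum(-1)             # Gaussian NLL up to constants
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)
    return (recon + beta * kl).mean()
```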
C.4. Hyperparameters

We report the hyper-parameters for training the GHM models used in the single and multi-policy experiments. Table 5 shows the parameters for Flow Matching and Denoising Diffusion. We also report the hyper-parameters for pre-training the Forward-Backward representation (Touati & Ollivier, 2021) utilized in the multi-policy GHM experiments in Table 8.

Table 5. Flow Matching and Denoising Diffusion hyper-parameters used for the single and multi-policy experiments across tasks and domains. We highlight any differences depending on the training context.

Hyperparameter                 Single Policy               Multi-Policy
Flow Matching (Lipman et al., 2023)
  ODE Solver                   Midpoint                    Midpoint
  ODE dt (train)               0.1                         0.1
  ODE dt (eval)                0.1                         0.05 (0.1 for GPI)
Diffusion (DDPM) (Ho et al., 2020)
  βmin                         0.1                         0.1
  Discretization Steps         1,000                       1,000
  SDE Solver                   DDIM (Song et al., 2021a)   DDIM (Song et al., 2021a)
  SDE Solver Steps (train)     20                          20
  SDE Solver Steps (eval)      20                          20
Network (U-Net) (Ronneberger et al., 2015)
  t-Positional Embedding Dim.  256                         256
  t-Positional Embedding MLP   (256, 256)                  (256, 256)
  Hidden Activation            mish (Misra, 2019)          mish (Misra, 2019)
  Blocks per Stage             1                           1
  Block Dimensions             (512, 512, 512)             (1024, 1024, 1024)
Conditional Encoder
  Encoder Input                s, a                        s, a, z
  Encoder MLP                  (512, 512, 512)             (1024, 1024, 1024)
  Encoder Activation           mish (Misra, 2019)          mish (Misra, 2019)
Optimizer (AdamW) (Loshchilov & Hutter, 2019)
  AdamW β1                     0.9                         0.9
  AdamW β2                     0.999                       0.999
  AdamW ϵ                      10^-4                       10^-4
  Learning Rate                10^-4                       10^-4
  Weight Decay                 10^-3                       10^-2
Common
  Gradient Steps               3M                          8M
  Batch Size                   1024                        1024
  Target Network EMA           10^-3                       10^-4

Table 6. β-VAE (Higgins et al., 2017) hyper-parameters for single policy experiments across tasks and domains.

Hyperparameter                 Value
β-VAE (Higgins et al., 2017)
  β                            10
  Latent Prior                 N(0, I)
  Latent Dimension             |S|
  Encoder                      Residual MLP
  Decoder                      Residual MLP
  Hidden Activation            mish (Misra, 2019)
  Blocks per Stage             1
  Block Dimensions             (512, 512, 512)
Conditional Encoder
  Encoder Input                s, a
  Encoder MLP                  (512, 512, 512)
  Encoder Activation           mish (Misra, 2019)
Optimizer (AdamW) (Loshchilov & Hutter, 2019)
  AdamW β1                     0.9
  AdamW β2                     0.999
  AdamW ϵ                      10^-4
  Learning Rate                10^-4
  Weight Decay                 10^-3
Common
  Gradient Steps               3M
  Batch Size                   1024
  Target Network EMA           10^-3

Table 7. GAN hyper-parameters for single policy experiments across tasks and domains.

Hyperparameter                 Value
RGAN (Jolicoeur-Martineau, 2019)
  Grad. Penalty Coef.          Linear(0.05 → 0.005)
  Latent Prior                 N(0, I)
  Latent Dimension             |S|
  Generator                    Residual MLP
  Discriminator                Residual MLP
  Hidden Activation            Leaky ReLU
  Blocks per Stage             1
  Block Dimensions             (512, 512, 512)
Conditional Encoder
  Encoder Input                s, a
  Encoder MLP                  (512, 512, 512)
  Encoder Activation           Leaky ReLU
Optimizer (AdamW) (Loshchilov & Hutter, 2019)
  AdamW β1                     0.9
  AdamW β2                     Linear(0.9 → 0.99)
  AdamW ϵ                      10^-4
  Learning Rate                10^-4
  Weight Decay                 10^-3
Common
  Gradient Steps               3M
  Batch Size                   1024
  Target Network EMA           10^-3

Table 8. Forward-Backward representation hyper-parameters. We largely reuse the hyper-parameters from Pirotta et al. (2024) and highlight any deviations.

Hyperparameter                 Walker            Cheetah           Quadruped         Maze
Forward-Backward (Touati & Ollivier, 2021)
  Embedding Dimension d        100               50                50                100
  Embedding Prior              S^d               S^d               S^d               S^d
  Embedding Prior Goal Prob.   0                 0                 0                 1/2
  B Normalization              ℓ2                ℓ2                ℓ2                ℓ2
  Orthonormal Loss Coeff.      1                 1                 1                 1
Policy (TD3) (Fujimoto et al., 2018)
  Target Policy Noise          N(0, 0.2)         N(0, 0.2)         N(0, 0.2)         N(0, 0.2)
  Target Policy Clipping       0.3               0.3               0.3               0.3
  Policy Update Frequency      1                 1                 1                 1
Optimizer (Adam) (Kingma & Ba, 2015)
  Learning Rate (F, B)         (10^-4, 10^-4)    (10^-4, 10^-4)    (10^-4, 10^-4)    (10^-4, 10^-6)
  Learning Rate (π)            10^-4             10^-4             10^-4             10^-6
  Adam β1                      0.9               0.9               0.9               0.9
  Adam β2                      0.999             0.999             0.999             0.999
  Adam ϵ                       10^-8             10^-8             10^-8             10^-8
  Batch Size                   2048              1024              2048              1024
  Gradient Steps               3M                3M                3M                5M
  Discount Factor γ            0.98              0.98              0.98              0.99
  Target Network EMA           0.99              0.99              0.99              0.99
  Reward Inference Samples     250,000           250,000           250,000           250,000

D. Additional Experimental Results

In this section, we report additional results for our experiments.

Single policy. We report metrics averaged over tasks using a curved conditional path in Table 12, and performance per task in Table 13. Table 11 shows the performance of the single-policy experiments (Section 5.1 in the main paper) expanded for each task. While the performance of TD-based methods is reasonably stable across tasks, the VAE and GAN have large variance across tasks. For example, the EMD of the GAN diverges in 2 tasks out of 4 in QUADRUPED.

Multiple policies and planning. We report aggregate performance across our full suite of evaluation metrics for the multi-policy experiments in Table 14, and per-task metrics in Table 16. We note that TD2-DD achieves quite a high EMD compared to TD-DD while achieving a better MSE(V). By further inspecting the generated samples (see Figure 5), we found that TD-DD tends to generate highly concentrated samples, while TD2-DD is more diffuse. However, the samples generated by TD-DD appear better on visual inspection. This may explain the discrepancy between the two metrics. Finally, we report aggregate planning performance in Table 15 and per-task results in Table 17.

Comparison with planning with a one-step world model. We include in Table 9 results for a Model Predictive Path Integral (MPPI) controller with a learned dynamics model. We train a dynamics model of similar capacity to TD2-CFM before evaluating MPPI with a finite horizon of 32 for locomotion tasks and 128 for the maze, where at each step we sample 256 action candidates and perform 10 optimization rounds with 64 elites (top-k actions) per round. The results show that GPI with TD2-CFM significantly outperforms MPPI in 3/4 domains, with comparable results on Walker. MPPI notably displayed instability related to compounding errors in environments with difficult-to-model dynamics. A sketch of this planner follows.
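For reference, the following is a hedged sketch of the sampling-based planner just described, using the candidate/round/elite counts above. The exact weighting of the MPPI update is not specified in the text, so this sketch uses a CEM-style elite refit as one plausible instantiation, with `dynamics` and `reward` as placeholder callables.

```python
import numpy as np

def plan(dynamics, reward, s0, horizon, action_dim,
         n_candidates=256, n_rounds=10, n_elites=64, rng=None):
    """Refine a Gaussian over action sequences with a learned one-step model."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_rounds):
        acts = mu + std * rng.standard_normal((n_candidates, horizon, action_dim))
        s = np.repeat(s0[None], n_candidates, axis=0)
        returns = np.zeros(n_candidates)
        for h in range(horizon):              # step-by-step unroll: errors compound here
            s = dynamics(s, acts[:, h])
            returns += reward(s)
        elites = acts[np.argsort(returns)[-n_elites:]]
        mu, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]                              # execute the first planned action
```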
Impact of the number of ODE integration steps. We report in Table 10 an empirical analysis showing how prediction quality degrades as we reduce the number of integration steps on the LOOP task in POINTMASS MAZE. The results show that TD2-CFM remains robust even at coarse discretizations of the ODE, with as few as 5 integration steps, while we observe a predictable degradation when the number of steps is too small.

Table 9. Comparison with planning with one-step world models.

Domain      FB               TD²-CFM-GPI      MPPI
Cheetah     479.35 (14.56)   693.63 (5.50)    541.22 (5.28)
Pointmass   472.45 (14.40)   800.99 (8.56)    286.43 (54.95)
Quadruped   627.28 (1.98)    695.73 (2.07)    156.80 (122.89)
Walker      526.66 (5.94)    627.63 (7.97)    658.15 (21.46)

Table 10. Ablation of the number of ODE integration steps.

ODE Steps   NLL            EMD              MSE(V)
2           -0.48 (0.21)   0.076 (0.003)    379.52 (81.75)
5           -2.75 (0.15)   0.036 (0.000)    23.82 (2.05)
10          -2.85 (0.17)   0.025 (0.001)    7.71 (2.75)
20          -2.99 (0.04)   0.0218 (0.001)   4.40 (0.82)

Table 11. Per task results for the single policy experiments. Task Method EMD NLL MSE(V) TD-DD 20.06 (0.27) 2.713 (0.189) 120.21 (52.91) TD2-DD 11.05 (0.01) 0.543 (0.164) 24.02 (25.54) TD-CFM 12.46 (0.35) 0.608 (0.026) 148.56 (29.24) TD-CFM(C) 10.90 (0.05) 0.112 (0.018) 9.53 (1.37) TD2-CFM 10.59 (0.13) 0.026 (0.005) 11.90 (7.59) GAN 23.99 (1.15) 827.79 (130.38) VAE 114.95 (2.51) 646.96 (21.57) TD-DD 21.55 (0.17) 2.754 (0.062) 1812.90 (2016.68) TD2-DD 13.57 (0.02) 0.561 (0.014) 21.81 (9.55) TD-CFM 15.46 (0.26) 0.838 (0.021) 379.13 (180.63) TD-CFM(C) 12.94 (0.08) 0.321 (0.009) 22.36 (4.99) TD2-CFM 13.27 (0.15) 0.200 (0.007) 7.14 (1.72) GAN 26.85 (1.98) 2948.74 (4541.66) VAE 103.73 (6.86) 431.70 (87.04) TD-DD 19.82 (0.07) 2.579 (0.180) 56.11 (18.32) TD2-DD 12.15 (0.20) 0.487 (0.040) 21.65 (4.76) TD-CFM 13.30 (0.17) 0.669 (0.041) 32.95 (8.33) TD-CFM(C) 12.25 (0.11) 0.218 (0.002) 12.76 (3.17) TD2-CFM 12.27 (0.14) 0.126 (0.019) 14.96 (10.23) GAN 22.98 (1.31) 5041.85 (654.87) VAE 114.46 (0.28) 3863.70 (38.24) TD-DD 21.29 (0.46) 2.635 (0.072) 121.50 (34.67) TD2-DD 11.59 (0.34) 0.558 (0.080) 88.67 (28.94) TD-CFM 12.91 (0.16) 0.738 (0.084) 340.43 (63.65) TD-CFM(C) 11.55 (0.02) 0.225 (0.036) 78.20 (14.20) TD2-CFM 11.54 (0.27) 0.118 (0.022) 79.41 (16.24) GAN 24.21 (1.43) 5944.23 (302.73) VAE 113.79 (1.65) 4888.07 (78.85) Task Method EMD NLL MSE(V) TD-DD 27.89 (0.67) 1.890 (0.025) 1778.78 (611.15) TD2-DD 25.62 (3.75) 0.906 (0.013) 12.88 (2.07) TD-CFM 15.68 (0.15) 1.068 (0.006) 523.10 (42.47) TD-CFM(C) 14.12 (0.00) 0.518 (0.002) 10.10 (1.32) TD2-CFM 14.27 (0.06) 0.426 (0.005) 12.89 (2.86) GAN 18.23 (0.34) 3546.34 (984.61) VAE 60.54 (0.29) 1939.62 (22.15) TD-DD 28.01 (1.02) 1.975 (0.061) 438.92 (310.44) TD2-DD 22.79 (3.08) 0.856 (0.033) 32.38 (4.36) TD-CFM 15.74 (0.05) 1.051 (0.026) 170.86 (19.61) TD-CFM(C) 14.62 (0.11) 0.457 (0.006) 26.01 (4.44) TD2-CFM 14.75 (0.05) 0.338 (0.004) 18.36 (2.62) GAN 19.21 (0.13) 195.11 (144.29) VAE 60.56 (0.21) 428.69 (10.48) TD-DD 28.57 (0.50) 1.832 (0.034) 2083.77 (1767.03) TD2-DD 20.81 (1.81) 0.867 (0.040) 20.09 (19.08) TD-CFM 15.03 (0.18) 1.003 (0.026) 505.51 (88.47) TD-CFM(C) 13.91 (0.02) 0.483 (0.005) 12.86 (4.65) TD2-CFM 14.07 (0.12) 0.393 (0.021) 7.77 (0.91) GAN 91273.39 (81559.61) 3631.15 (2289.14) VAE 59.42 (0.49) 859.51 (101.82) TD-DD 28.83 (0.41) 1.934 (0.075) 1661.52 (402.07) TD2-DD 21.36 (1.70) 0.815 (0.040) 570.75 (35.38) TD-CFM 16.48 (0.09) 1.103 (0.006) 900.78
(85.36) TD-CFM(C) 14.89 (0.01) 0.494 (0.006) 572.02 (24.55) TD2-CFM 14.96 (0.13) 0.361 (0.022) 528.06 (11.32) GAN 55777.67 (28193.15) 3166.15 (54.62) VAE 60.57 (0.54) 1397.52 (100.28) Pointmass Maze Task Method EMD NLL MSE(V) TD-DD 0.189 (0.003) 3.462 (0.232) 4717.87 (83.53) TD2-DD 0.031 (0.003) 0.577 (0.027) 4.27 (1.36) TD-CFM 0.071 (0.007) 0.748 (0.070) 677.48 (154.81) TD-CFM(C) 0.025 (0.002) 0.703 (0.032) 10.91 (2.35) TD2-CFM 0.020 (0.001) 0.674 (0.072) 1.75 (0.13) GAN 0.225 (0.014) 2276.26 (361.04) VAE 0.456 (0.045) 4011.19 (85.44) BOTTOM LEFT TD-DD 0.139 (0.002) 2.808 (0.058) 320.80 (27.06) TD2-DD 0.025 (0.001) 0.980 (0.174) 5.76 (3.15) TD-CFM 0.059 (0.001) 0.520 (0.031) 224.13 (33.19) TD-CFM(C) 0.024 (0.002) 0.729 (0.167) 16.58 (12.10) TD2-CFM 0.020 (0.002) 0.984 (0.053) 10.44 (7.08) GAN 0.269 (0.150) 1199.80 (212.47) VAE 0.313 (0.029) 981.22 (195.70) REACH BOTTOM RIGHT TD-DD 0.174 (0.004) 3.270 (0.257) 230.79 (18.24) TD2-DD 0.025 (0.001) 0.640 (0.283) 4.82 (2.61) TD-CFM 0.066 (0.004) 0.549 (0.040) 166.07 (35.75) TD-CFM(C) 0.023 (0.001) 0.759 (0.034) 10.95 (2.63) TD2-CFM 0.020 (0.002) 0.855 (0.022) 4.84 (3.08) GAN 0.170 (0.018) 416.75 (54.72) VAE 0.505 (0.051) 489.06 (6.44) REACH TOP LEFT TD-DD 0.102 (0.001) 2.407 (0.059) 593.98 (72.33) TD2-DD 0.033 (0.003) 0.863 (0.255) 34.43 (10.96) TD-CFM 0.055 (0.006) 0.454 (0.167) 472.54 (308.65) TD-CFM(C) 0.021 (0.003) 0.517 (0.445) 14.85 (3.28) TD2-CFM 0.025 (0.002) 0.797 (0.057) 23.48 (5.46) GAN 0.132 (0.022) 1350.49 (716.52) VAE 0.321 (0.029) 2404.42 (498.13) REACH TOP RIGHT TD-DD 0.141 (0.002) 2.924 (0.243) 362.56 (8.06) TD2-DD 0.023 (0.003) 0.743 (0.259) 6.38 (1.55) TD-CFM 0.059 (0.002) 0.501 (0.018) 237.57 (47.18) TD-CFM(C) 0.020 (0.001) 0.771 (0.090) 6.18 (3.37) TD2-CFM 0.018 (0.001) 0.903 (0.074) 3.21 (2.22) GAN 0.218 (0.044) 1043.01 (337.10) VAE 0.453 (0.106) 1223.57 (80.69) Task Method EMD NLL MSE(V) TD-DD 20.31 (0.31) 2.669 (0.086) 601.62 (314.84) TD2-DD 14.44 (1.79) 0.758 (0.028) 172.03 (35.51) TD-CFM 11.90 (0.03) 0.868 (0.008) 211.92 (26.25) TD-CFM(C) 10.55 (0.03) 0.485 (0.024) 124.08 (17.89) TD2-CFM 10.67 (0.04) 0.447 (0.021) 67.76 (21.99) GAN 23.55 (2.52) 3608.55 (1948.65) VAE 83.00 (1.02) 3339.01 (44.80) TD-DD 16.67 (0.02) 2.647 (0.186) 1043.27 (369.92) TD2-DD 12.99 (2.64) 0.894 (0.025) 463.04 (89.08) TD-CFM 10.91 (0.12) 0.927 (0.047) 398.66 (59.04) TD-CFM(C) 9.90 (0.07) 0.542 (0.023) 410.49 (77.16) TD2-CFM 10.11 (0.14) 0.542 (0.006) 370.69 (112.59) GAN 20.80 (1.56) 3761.79 (785.37) VAE 84.65 (0.31) 918.32 (25.62) TD-DD 20.26 (0.06) 2.907 (0.336) 46.48 (13.06) TD2-DD 16.91 (4.04) 0.813 (0.028) 86.53 (55.44) TD-CFM 12.21 (0.05) 0.872 (0.032) 54.98 (11.01) TD-CFM(C) 10.44 (0.08) 0.434 (0.018) 24.52 (5.89) TD2-CFM 10.53 (0.08) 0.412 (0.020) 27.69 (5.44) GAN 25.48 (2.01) 183.47 (72.39) VAE 83.91 (0.57) 109.45 (9.86) RUN BACKWARD TD-DD 21.47 (0.32) 3.074 (0.376) 20.28 (5.95) TD2-DD 13.04 (1.22) 0.818 (0.016) 14.87 (2.34) TD-CFM 13.38 (0.20) 0.989 (0.056) 37.90 (2.98) TD-CFM(C) 11.02 (0.05) 0.452 (0.023) 8.71 (1.05) TD2-CFM 11.06 (0.08) 0.414 (0.016) 8.33 (1.89) GAN 24.77 (0.43) 270.21 (4.08) VAE 82.91 (0.36) 734.77 (22.94) TD-DD 21.57 (0.84) 2.790 (0.151) 546.05 (86.30) TD2-DD 12.85 (1.67) 0.780 (0.047) 238.01 (11.17) TD-CFM 12.27 (0.12) 0.802 (0.034) 377.45 (101.61) TD-CFM(C) 10.24 (0.17) 0.354 (0.021) 176.99 (28.54) TD2-CFM 10.18 (0.08) 0.336 (0.021) 229.89 (21.93) GAN 24.39 (1.11) 3520.88 (1050.76) VAE 84.39 (0.41) 2138.32 (233.01) WALK BACKWARD TD-DD 21.05 (0.30) 2.854 (0.094) 469.23 (133.50) TD2-DD 14.64 (2.48) 0.771 (0.019) 
160.42 (42.25) TD-CFM 12.89 (0.14) 0.857 (0.033) 291.71 (66.89) TD-CFM(C) 10.88 (0.01) 0.412 (0.023) 99.90 (4.20) TD2-CFM 10.86 (0.12) 0.381 (0.014) 106.97 (10.45) GAN 24.86 (0.34) 3434.43 (189.45) VAE 83.73 (0.63) 465.72 (16.06) Temporal Difference Flows Table 12. Results averaged over tasks for the single policy experiments with a curved conditional path. Domain Method EMD NLL MSE(V) TD-CFM 13.91 (0.73) 1.354 (0.017) 477.89 (40.53) TD-CFM(C) 25.86 (18.91) 1.295 (0.067) 189.21 (17.69) TD2-CFM 10.79 (0.03) 0.412 (0.014) 121.67 (5.68) POINTMASS MAZE TD-CFM 0.091 (0.003) 1.156 (0.081) 758.16 (103.54) TD-CFM(C) 0.089 (0.008) 4.340 (0.456) 679.29 (20.11) TD2-CFM 0.021 (0.000) 0.806 (0.017) 9.22 (1.40) TD-CFM 15.63 (0.09) 1.478 (0.088) 273.68 (34.07) TD-CFM(C) 34.00 (6.96) 0.930 (0.036) 522.28 (155.42) TD2-CFM 14.56 (0.02) 0.327 (0.014) 142.18 (9.38) TD-CFM 13.10 (0.11) 1.147 (0.042) 608.47 (124.62) TD-CFM(C) 33.20 (4.66) 1.039 (0.052) 189.66 (102.83) TD2-CFM 12.00 (0.05) 0.099 (0.005) 27.56 (0.53) Table 13. Per task results for the single policy experiments with a curved conditional path. Task Method EMD NLL MSE(V) TD-CFM 12.39 (0.17) 1.218 (0.107) 326.14 (56.12) TD-CFM(C) 23.80 (4.91) 0.923 (0.191) 69.39 (8.84) TD2-CFM 10.69 (0.06) 0.040 (0.008) 11.69 (4.01) TD-CFM 14.08 (0.12) 1.410 (0.189) 896.83 (278.52) TD-CFM(C) 47.39 (3.14) 1.801 (0.186) 401.22 (321.52) TD2-CFM 13.37 (0.11) 0.198 (0.008) 7.65 (2.29) TD-CFM 13.24 (0.23) 0.896 (0.053) 274.60 (121.20) TD-CFM(C) 36.32 (13.80) 0.625 (0.053) 159.74 (31.53) TD2-CFM 12.50 (0.17) 0.119 (0.008) 9.42 (1.95) TD-CFM 12.69 (0.20) 1.067 (0.015) 936.30 (86.71) TD-CFM(C) 25.29 (7.62) 0.808 (0.049) 128.28 (65.70) TD2-CFM 11.42 (0.20) 0.119 (0.026) 81.47 (3.53) Task Method EMD NLL MSE(V) TD-CFM 15.31 (0.17) 1.460 (0.188) 115.99 (138.59) TD-CFM(C) 39.28 (8.90) 0.980 (0.062) 686.51 (314.49) TD2-CFM 14.36 (0.07) 0.358 (0.010) 10.84 (3.05) TD-CFM 15.61 (0.16) 1.450 (0.060) 104.52 (33.53) TD-CFM(C) 40.27 (7.59) 0.898 (0.040) 240.50 (58.83) TD2-CFM 14.73 (0.06) 0.288 (0.015) 21.13 (3.52) TD-CFM 15.24 (0.11) 1.515 (0.215) 173.07 (34.09) TD-CFM(C) 22.77 (6.86) 0.924 (0.053) 275.03 (249.91) TD2-CFM 14.17 (0.09) 0.342 (0.019) 7.05 (1.80) TD-CFM 16.37 (0.10) 1.486 (0.022) 701.13 (83.58) TD-CFM(C) 33.68 (4.69) 0.917 (0.036) 887.11 (120.92) TD2-CFM 14.99 (0.08) 0.318 (0.016) 529.71 (35.40) Pointmass Maze Task Method EMD NLL MSE(V) TD-CFM 0.112 (0.015) 1.465 (0.171) 1888.54 (444.66) TD-CFM(C) 0.132 (0.031) 5.191 (1.328) 1354.09 (102.55) TD2-CFM 0.020 (0.000) 0.708 (0.013) 2.31 (0.59) REACH BOTTOM LEFT TD-CFM 0.096 (0.012) 1.091 (0.142) 628.74 (118.04) TD-CFM(C) 0.078 (0.001) 3.942 (0.576) 820.02 (52.88) TD2-CFM 0.022 (0.001) 0.883 (0.057) 10.55 (9.13) REACH BOTTOM RIGHT TD-CFM 0.097 (0.001) 1.296 (0.220) 290.21 (29.94) TD-CFM(C) 0.109 (0.009) 5.310 (0.552) 409.28 (10.79) TD2-CFM 0.019 (0.001) 0.833 (0.049) 2.64 (0.30) REACH TOP LEFT TD-CFM 0.070 (0.003) 0.894 (0.139) 500.63 (142.18) TD-CFM(C) 0.048 (0.002) 2.821 (0.268) 75.79 (20.06) TD2-CFM 0.025 (0.002) 0.738 (0.011) 26.56 (9.99) REACH TOP RIGHT TD-CFM 0.083 (0.004) 1.035 (0.138) 482.68 (128.45) TD-CFM(C) 0.080 (0.001) 4.436 (0.305) 737.30 (23.75) TD2-CFM 0.019 (0.001) 0.866 (0.026) 4.02 (1.75) Task Method EMD NLL MSE(V) TD-CFM 12.92 (1.25) 1.324 (0.042) 342.71 (129.09) TD-CFM(C) 22.90 (15.00) 1.364 (0.108) 140.32 (42.14) TD2-CFM 10.89 (0.08) 0.433 (0.012) 74.34 (6.50) FLIP BACKWARD TD-CFM 14.52 (4.08) 1.346 (0.190) 576.31 (169.57) TD-CFM(C) 25.46 (25.58) 1.427 (0.027) 388.45 (87.18) TD2-CFM 10.48 (0.23) 0.538 
(0.034) 283.84 (40.81) TD-CFM 14.00 (0.77) 1.390 (0.043) 114.51 (3.11) TD-CFM(C) 17.42 (5.78) 1.423 (0.091) 37.23 (8.74) TD2-CFM 10.85 (0.08) 0.405 (0.010) 32.58 (8.42) RUN BACKWARD TD-CFM 14.50 (0.31) 1.439 (0.102) 109.32 (5.35) TD-CFM(C) 38.06 (28.90) 1.283 (0.110) 101.24 (149.88) TD2-CFM 11.06 (0.05) 0.399 (0.007) 12.32 (2.34) TD-CFM 13.66 (0.71) 1.290 (0.041) 1040.43 (147.86) TD-CFM(C) 21.01 (16.43) 1.096 (0.028) 343.71 (66.91) TD2-CFM 10.45 (0.04) 0.323 (0.010) 213.87 (23.09) WALK BACKWARD TD-CFM 13.83 (0.89) 1.336 (0.033) 684.05 (21.31) TD-CFM(C) 30.29 (22.58) 1.178 (0.206) 124.29 (17.61) TD2-CFM 11.00 (0.11) 0.372 (0.015) 113.09 (22.45) Temporal Difference Flows Table 14. Per domain results for the quantitative multipolicy experiments. Domain Method EMD NLL MSE(V) TD-DD 17.79 (0.40) 1.442 (0.042) 534.82 (107.81) TD2-DD 74.35 (7.49) 0.771 (0.020) 253.89 (21.42) TD-CFM 12.54 (0.04) 1.044 (0.044) 826.54 (58.01) TD-CFM(C) 11.19 (0.11) 0.581 (0.011) 249.02 (19.81) TD2-CFM 11.06 (0.08) 0.481 (0.008) 230.34 (44.81) TD-DD 0.152 (0.006) 2.048 (0.093) 662.96 (76.86) TD2-DD 0.349 (0.037) 0.666 (0.027) 312.98 (66.46) TD-CFM 0.087 (0.003) 0.771 (0.025) 580.94 (41.28) TD-CFM(C) 0.063 (0.000) 0.174 (0.021) 220.11 (100.36) TD2-CFM 0.060 (0.002) 0.043 (0.022) 169.74 (85.76) TD-DD 20.21 (1.76) 1.403 (0.022) 499.88 (292.17) TD2-DD 135.79 (9.24) 0.901 (0.051) 415.29 (101.86) TD-CFM 15.06 (0.08) 0.950 (0.024) 391.12 (141.00) TD-CFM(C) 14.98 (0.15) 0.528 (0.016) 176.62 (13.73) TD2-CFM 14.74 (0.12) 0.340 (0.010) 178.95 (30.43) TD-DD 21.49 (0.64) 1.441 (0.009) 571.72 (196.76) TD2-DD 104.44 (2.84) 0.688 (0.009) 180.45 (47.82) TD-CFM 15.08 (0.28) 0.920 (0.023) 768.13 (66.48) TD-CFM(C) 13.57 (0.09) 0.414 (0.019) 179.39 (24.52) TD2-CFM 13.70 (0.33) 0.307 (0.008) 154.75 (8.70) Table 15. Per domain results for the multi-policy experiments evaluating planning performance with generalized policy improvement. Domain Method Planner Z-Distribution D(Z) Random Local Perturbation Train Distribution FB 479.35 (14.56) FB GPI 275.32 (2.50) 401.08 (5.92) 269.59 (8.18) TD-DD GPI 574.05 (3.88) 604.53 (11.87) 620.72 (14.29) TD2-DD GPI 662.17 (0.94) 680.22 (5.98) 678.98 (3.67) TD-CFM GPI 403.54 (81.24) 426.46 (81.69) 372.40 (99.68) TD-CFM(C) GPI 681.52 (6.49) 700.97 (6.57) 697.81 (3.16) TD2-CFM GPI 682.21 (5.41) 692.72 (7.96) 693.63 (5.50) FB 472.45 (14.40) FB GPI 0.64 (7.70) 240.54 (23.69) 17.74 (4.34) TD-DD GPI 569.05 (37.58) 599.92 (37.26) 537.69 (47.54) TD2-DD GPI 763.95 (38.02) 805.72 (2.23) 788.87 (17.13) TD-CFM GPI 625.44 (23.12) 671.53 (52.75) 695.70 (27.88) TD-CFM(C) GPI 800.87 (3.46) 812.44 (1.58) 808.03 (2.77) TD2-CFM GPI 790.34 (14.16) 813.90 (1.62) 800.99 (8.56) FB 627.28 (1.98) FB GPI 671.95 (0.58) 674.09 (0.53) 646.05 (2.28) TD-DD GPI 657.98 (1.87) 662.29 (1.46) 657.44 (4.71) TD2-DD GPI 667.24 (6.32) 671.54 (1.40) 665.52 (5.12) TD-CFM GPI 669.35 (5.82) 672.46 (4.96) 668.61 (5.74) TD-CFM(C) GPI 695.52 (4.51) 697.65 (5.21) 696.18 (3.29) TD2-CFM GPI 696.58 (4.10) 696.57 (2.36) 695.73 (2.07) FB 526.66 (5.94) FB GPI 35.23 (0.98) 37.51 (1.20) 39.04 (1.48) TD-DD GPI 512.65 (19.19) 553.35 (14.28) 533.37 (27.24) TD2-DD GPI 509.39 (10.26) 598.40 (6.44) 609.28 (5.87) TD-CFM GPI 506.62 (15.84) 524.34 (4.75) 537.24 (17.20) TD-CFM(C) GPI 513.24 (17.77) 608.80 (16.14) 624.19 (19.45) TD2-CFM GPI 518.07 (20.74) 617.08 (6.55) 627.63 (7.97) Temporal Difference Flows Table 16. Per task results for the quantitative multi-policy experiments. 
Task Method EMD NLL MSE(V) TD-DD 24.22 (0.37) 1.595 (0.021) 494.85 (221.39) TD2-DD 108.16 (1.64) 0.893 (0.065) 103.71 (34.77) TD-CFM 16.01 (0.33) 1.120 (0.037) 431.62 (64.40) TD-CFM(C) 14.77 (0.38) 0.704 (0.083) 74.42 (13.13) TD2-CFM 14.81 (0.56) 0.546 (0.012) 73.86 (26.41) TD-DD 21.28 (0.97) 1.389 (0.005) 53.28 (20.52) TD2-DD 102.69 (3.60) 0.546 (0.070) 6.35 (0.88) TD-CFM 14.99 (0.65) 0.845 (0.085) 209.80 (54.21) TD-CFM(C) 13.01 (0.35) 0.260 (0.089) 32.84 (8.26) TD2-CFM 13.20 (0.36) 0.180 (0.076) 34.61 (21.58) TD-DD 21.31 (0.65) 1.482 (0.015) 1093.50 (700.34) TD2-DD 103.72 (1.69) 0.903 (0.067) 115.58 (28.18) TD-CFM 15.16 (0.53) 1.020 (0.036) 482.78 (24.82) TD-CFM(C) 14.22 (0.06) 0.605 (0.076) 170.20 (48.23) TD2-CFM 14.34 (0.20) 0.449 (0.056) 197.13 (26.98) TD-DD 21.34 (0.66) 1.459 (0.029) 594.94 (219.72) TD2-DD 103.86 (4.22) 0.630 (0.030) 250.96 (79.14) TD-CFM 14.28 (0.32) 0.829 (0.107) 1371.68 (326.61) TD-CFM(C) 13.43 (0.34) 0.335 (0.067) 265.09 (12.84) TD2-CFM 13.52 (0.61) 0.284 (0.062) 166.16 (17.51) TD-DD 19.30 (0.80) 1.282 (0.033) 622.04 (186.99) TD2-DD 103.79 (3.80) 0.471 (0.055) 425.65 (131.20) TD-CFM 14.97 (0.14) 0.787 (0.070) 1344.77 (149.38) TD-CFM(C) 12.39 (0.25) 0.165 (0.042) 354.40 (114.40) TD2-CFM 12.63 (0.46) 0.078 (0.072) 301.97 (21.93) Task Method EMD NLL MSE(V) TD-DD 20.23 (1.67) 1.394 (0.024) 279.84 (165.15) TD2-DD 135.62 (9.10) 0.921 (0.044) 562.83 (170.42) TD-CFM 15.25 (0.02) 0.960 (0.006) 365.14 (177.15) TD-CFM(C) 15.24 (0.13) 0.548 (0.008) 129.02 (23.63) TD2-CFM 15.00 (0.08) 0.369 (0.004) 139.10 (9.66) TD-DD 20.06 (1.67) 1.405 (0.013) 273.65 (192.14) TD2-DD 135.28 (9.10) 0.909 (0.049) 171.76 (48.29) TD-CFM 15.04 (0.02) 0.961 (0.031) 189.56 (63.62) TD-CFM(C) 14.92 (0.17) 0.538 (0.017) 84.74 (6.77) TD2-CFM 14.71 (0.12) 0.351 (0.008) 90.48 (10.33) TD-DD 20.01 (1.78) 1.401 (0.033) 1131.49 (863.23) TD2-DD 135.81 (9.19) 0.875 (0.054) 669.65 (148.88) TD-CFM 14.91 (0.10) 0.931 (0.033) 735.43 (274.17) TD-CFM(C) 14.75 (0.30) 0.508 (0.017) 336.02 (16.08) TD2-CFM 14.49 (0.24) 0.309 (0.015) 325.59 (79.83) TD-DD 20.55 (1.93) 1.412 (0.035) 314.53 (84.50) TD2-DD 136.45 (9.57) 0.901 (0.056) 256.91 (58.22) TD-CFM 15.06 (0.22) 0.949 (0.030) 274.37 (58.65) TD-CFM(C) 15.02 (0.09) 0.518 (0.024) 156.72 (41.37) TD2-CFM 14.76 (0.19) 0.331 (0.019) 160.62 (23.36) Pointmass Maze Task Method EMD NLL MSE(V) TD-DD 0.164 (0.013) 2.012 (0.089) 1642.91 (26.55) TD2-DD 0.350 (0.038) 0.637 (0.046) 236.52 (58.27) TD-CFM 0.082 (0.004) 0.772 (0.065) 575.00 (75.51) TD-CFM(C) 0.061 (0.002) 0.083 (0.013) 93.04 (5.55) TD2-CFM 0.060 (0.003) 0.010 (0.059) 61.08 (20.86) TD-DD 0.151 (0.007) 2.094 (0.119) 537.80 (22.89) TD2-DD 0.337 (0.040) 0.659 (0.028) 213.93 (51.75) TD-CFM 0.088 (0.003) 0.782 (0.018) 225.96 (59.92) TD-CFM(C) 0.070 (0.007) 0.266 (0.066) 86.12 (23.47) TD2-CFM 0.065 (0.003) 0.101 (0.074) 102.65 (27.28) BOTTOM LEFT TD-DD 0.131 (0.008) 1.969 (0.143) 207.45 (38.79) TD2-DD 0.339 (0.050) 0.510 (0.043) 89.56 (41.56) TD-CFM 0.078 (0.005) 0.659 (0.044) 376.64 (67.43) TD-CFM(C) 0.048 (0.002) 0.099 (0.054) 73.84 (7.92) TD2-CFM 0.042 (0.002) 0.261 (0.024) 14.20 (0.47) REACH BOTTOM LEFT LONG TD-DD 0.144 (0.005) 2.037 (0.062) 1239.65 (627.94) TD2-DD 0.355 (0.042) 0.807 (0.010) 1431.05 (342.62) TD-CFM 0.105 (0.004) 0.987 (0.060) 2212.63 (504.47) TD-CFM(C) 0.078 (0.002) 0.457 (0.058) 993.60 (639.56) TD2-CFM 0.074 (0.005) 0.310 (0.083) 896.15 (598.65) REACH BOTTOM RIGHT TD-DD 0.180 (0.004) 2.106 (0.114) 194.15 (75.47) TD2-DD 0.369 (0.053) 0.618 (0.035) 112.24 (12.17) TD-CFM 0.096 (0.004) 0.724 (0.055) 272.13 
(33.13) TD-CFM(C) 0.070 (0.005) 0.063 (0.043) 103.89 (10.69) TD2-CFM 0.067 (0.007) 0.104 (0.024) 49.28 (13.90) REACH TOP LEFT TD-DD 0.122 (0.005) 2.083 (0.149) 433.81 (37.98) TD2-DD 0.343 (0.036) 0.631 (0.046) 158.15 (35.48) TD-CFM 0.076 (0.003) 0.679 (0.086) 453.53 (88.84) TD-CFM(C) 0.051 (0.003) 0.092 (0.071) 54.48 (4.15) TD2-CFM 0.052 (0.003) 0.022 (0.047) 31.01 (7.91) TD-DD 0.149 (0.004) 1.994 (0.093) 221.28 (26.98) TD2-DD 0.350 (0.022) 0.563 (0.121) 69.97 (34.75) TD-CFM 0.074 (0.004) 0.700 (0.078) 250.17 (36.85) TD-CFM(C) 0.051 (0.000) 0.032 (0.010) 39.79 (5.50) TD2-CFM 0.047 (0.002) 0.131 (0.020) 17.01 (6.82) TD-DD 0.175 (0.008) 2.088 (0.105) 826.61 (162.87) TD2-DD 0.347 (0.030) 0.906 (0.033) 192.41 (36.20) TD-CFM 0.093 (0.002) 0.869 (0.026) 281.43 (52.24) TD-CFM(C) 0.077 (0.005) 0.566 (0.027) 210.94 (97.86) TD2-CFM 0.075 (0.006) 0.392 (0.043) 186.53 (58.91) Task Method EMD NLL MSE(V) TD-DD 16.97 (0.45) 1.358 (0.033) 903.42 (267.90) TD2-DD 73.44 (9.89) 0.782 (0.065) 308.54 (36.36) TD-CFM 13.06 (0.46) 0.964 (0.073) 911.92 (135.18) TD-CFM(C) 10.96 (0.58) 0.564 (0.050) 328.99 (27.34) TD2-CFM 10.95 (0.32) 0.443 (0.047) 222.71 (32.96) FLIP BACKWARD TD-DD 18.64 (0.48) 1.442 (0.052) 678.24 (40.56) TD2-DD 75.09 (6.07) 0.753 (0.007) 215.67 (39.77) TD-CFM 12.83 (0.38) 0.966 (0.020) 381.99 (112.95) TD-CFM(C) 11.36 (0.21) 0.582 (0.005) 230.92 (14.13) TD2-CFM 11.06 (0.18) 0.476 (0.028) 255.25 (57.02) TD-DD 17.61 (0.43) 1.489 (0.054) 87.64 (30.75) TD2-DD 73.06 (6.95) 0.742 (0.039) 111.78 (49.51) TD-CFM 12.22 (0.30) 1.103 (0.066) 194.36 (19.68) TD-CFM(C) 10.75 (0.17) 0.535 (0.028) 34.90 (21.47) TD2-CFM 10.74 (0.07) 0.445 (0.021) 24.71 (10.91) RUN BACKWARD TD-DD 18.75 (0.31) 1.475 (0.036) 57.65 (8.84) TD2-DD 76.43 (7.28) 0.802 (0.013) 90.76 (20.35) TD-CFM 12.59 (0.11) 1.083 (0.041) 82.43 (5.56) TD-CFM(C) 11.78 (0.17) 0.632 (0.008) 30.50 (4.34) TD2-CFM 11.53 (0.15) 0.534 (0.013) 33.52 (3.75) TD-DD 16.77 (0.46) 1.461 (0.037) 805.51 (158.64) TD2-DD 72.44 (7.86) 0.757 (0.020) 348.70 (58.96) TD-CFM 11.91 (0.18) 1.095 (0.042) 1899.15 (131.04) TD-CFM(C) 10.72 (0.14) 0.551 (0.024) 277.74 (117.41) TD2-CFM 10.66 (0.18) 0.464 (0.029) 260.01 (153.43) WALK BACKWARD TD-DD 18.00 (1.11) 1.427 (0.073) 676.44 (296.20) TD2-DD 75.66 (7.41) 0.787 (0.036) 447.90 (49.14) TD-CFM 12.62 (0.28) 1.056 (0.067) 1489.41 (90.89) TD-CFM(C) 11.60 (0.17) 0.621 (0.030) 591.06 (12.67) TD2-CFM 11.41 (0.18) 0.523 (0.021) 585.86 (103.16) Temporal Difference Flows Table 17. Per task results for planning with GPI. 
Domain Method Planner Z-Distribution D(Z) Random Local Perturbation Train Distribution FB 326.94 (7.00) FB GPI 14.13 (0.51) 14.06 (0.43) 14.57 (0.33) TD-DD GPI 328.61 (2.66) 303.45 (44.57) 292.96 (52.91) TD2-DD GPI 316.41 (4.59) 338.14 (14.23) 349.77 (12.78) TD-CFM GPI 301.88 (17.94) 221.07 (3.80) 199.50 (4.09) TD-CFM(C) GPI 325.92 (14.31) 368.97 (16.10) 367.38 (11.48) TD2-CFM GPI 325.76 (12.88) 362.79 (19.09) 358.73 (24.84) FB 338.41 (2.98) FB GPI 29.32 (2.15) 36.99 (4.28) 39.09 (5.90) TD-DD GPI 281.76 (74.27) 304.28 (65.49) 299.92 (72.32) TD2-DD GPI 287.70 (50.25) 298.07 (51.86) 298.78 (34.52) TD-CFM GPI 323.56 (32.38) 323.90 (34.69) 328.47 (15.90) TD-CFM(C) GPI 266.80 (80.22) 251.21 (64.39) 284.37 (73.57) TD2-CFM GPI 287.26 (84.37) 313.07 (33.89) 320.22 (96.14) FB 852.55 (19.44) FB GPI 79.24 (2.32) 80.60 (3.28) 82.58 (3.06) TD-DD GPI 852.49 (26.17) 806.55 (7.62) 872.85 (19.73) TD2-DD GPI 839.00 (22.97) 914.47 (6.14) 936.41 (11.77) TD-CFM GPI 823.10 (15.06) 758.91 (19.84) 846.08 (7.85) TD-CFM(C) GPI 858.42 (5.92) 931.74 (12.01) 947.69 (6.45) TD2-CFM GPI 863.16 (4.15) 923.77 (9.88) 963.10 (6.69) FB 588.74 (5.30) FB GPI 18.23 (0.68) 18.36 (1.15) 19.94 (0.87) TD-DD GPI 587.76 (6.46) 799.12 (11.85) 667.74 (81.98) TD2-DD GPI 594.48 (3.74) 842.94 (28.16) 852.17 (12.09) TD-CFM GPI 577.95 (2.89) 793.47 (21.27) 774.91 (55.37) TD-CFM(C) GPI 601.81 (8.58) 883.26 (4.06) 897.33 (10.88) TD2-CFM GPI 596.08 (3.41) 868.69 (14.42) 868.46 (44.44) Domain Method Planner Z-Distribution D(Z) Random Local Perturbation Train Distribution FB 683.96 (2.09) FB GPI 742.71 (1.01) 746.48 (1.63) 718.52 (2.65) TD-DD GPI 673.33 (6.07) 690.13 (6.34) 677.58 (5.71) TD2-DD GPI 744.92 (0.69) 750.42 (2.30) 745.29 (1.12) TD-CFM GPI 748.19 (10.47) 753.72 (0.58) 745.93 (12.60) TD-CFM(C) GPI 790.56 (14.06) 795.84 (16.14) 785.20 (13.69) TD2-CFM GPI 796.39 (13.27) 800.34 (9.63) 791.43 (11.66) FB 452.38 (3.25) FB GPI 486.71 (0.64) 488.23 (0.48) 469.03 (2.35) TD-DD GPI 484.45 (1.07) 482.81 (2.55) 482.53 (2.38) TD2-DD GPI 485.26 (1.63) 486.35 (0.93) 484.89 (2.43) TD-CFM GPI 488.93 (1.08) 488.45 (0.62) 488.98 (0.28) TD-CFM(C) GPI 491.66 (2.75) 490.89 (2.05) 491.81 (2.14) TD2-CFM GPI 488.89 (1.35) 488.65 (1.19) 489.31 (1.03) FB 896.43 (5.80) FB GPI 975.01 (1.40) 977.94 (0.76) 938.44 (7.04) TD-DD GPI 976.59 (2.78) 976.75 (0.86) 975.25 (2.49) TD2-DD GPI 981.26 (1.56) 981.59 (1.45) 979.46 (0.93) TD-CFM GPI 982.08 (1.27) 981.06 (0.26) 981.29 (1.34) TD-CFM(C) GPI 984.03 (1.20) 984.50 (1.49) 983.33 (1.20) TD2-CFM GPI 984.36 (0.25) 985.52 (0.89) 984.36 (1.21) FB 476.34 (4.71) FB GPI 483.37 (1.05) 483.73 (3.02) 458.20 (6.62) TD-DD GPI 497.55 (10.40) 499.45 (11.65) 494.38 (19.75) TD2-DD GPI 457.54 (23.37) 467.78 (4.58) 452.44 (17.14) TD-CFM GPI 458.20 (29.01) 466.62 (19.30) 458.24 (30.28) TD-CFM(C) GPI 515.84 (5.84) 519.36 (14.37) 524.37 (1.56) TD2-CFM GPI 516.67 (3.49) 511.77 (3.70) 517.82 (2.58) Pointmass Maze Domain Method Planner Z-Distribution D(Z) Random Local Perturbation Train Distribution FB 223.85 (23.81) FB GPI 1.67 (0.30) 74.52 (2.24) 1.24 (0.28) TD-DD GPI 169.55 (74.06) 363.47 (23.78) 148.59 (43.53) TD2-DD GPI 781.84 (1.20) 769.02 (5.03) 768.67 (11.19) TD-CFM GPI 254.07 (85.86) 546.75 (191.17) 359.50 (144.41) TD-CFM(C) GPI 763.24 (15.57) 776.51 (12.37) 769.87 (13.18) TD2-CFM GPI 773.51 (2.71) 773.81 (4.71) 772.22 (3.11) FB 317.59 (8.55) FB GPI 81.99 (5.11) 315.10 (1.95) 61.41 (3.58) TD-DD GPI 462.86 (5.90) 430.51 (72.79) 593.64 (56.15) TD2-DD GPI 876.91 (9.21) 889.03 (2.40) 878.78 (2.43) TD-CFM GPI 832.91 (27.77) 797.10 (57.17) 
852.81 (16.74) TD-CFM(C) GPI 873.85 (21.16) 885.90 (4.21) 875.45 (3.43) TD2-CFM GPI 885.07 (2.79) 887.18 (5.27) 878.26 (0.64) REACH BOTTOM LEFT FB 830.60 (0.63) FB GPI 0.18 (0.17) 127.90 (20.14) 0.11 (0.10) TD-DD GPI 781.69 (8.09) 797.98 (3.52) 795.12 (3.88) TD2-DD GPI 823.28 (2.76) 820.15 (1.89) 824.00 (1.40) TD-CFM GPI 808.61 (7.06) 801.97 (2.97) 813.36 (6.35) TD-CFM(C) GPI 824.02 (0.73) 824.17 (1.77) 824.18 (3.84) TD2-CFM GPI 827.85 (1.45) 820.98 (3.63) 828.45 (3.10) REACH BOTTOM LEFT LONG FB 49.31 (0.09) FB GPI 464.55 (19.21) 0.58 (1.79) 401.26 (28.43) TD-DD GPI 461.30 (7.43) 468.73 (26.94) 252.28 (241.97) TD2-DD GPI 609.10 (11.64) 597.03 (6.46) 668.76 (4.02) TD-CFM GPI 180.27 (35.66) 311.59 (152.06) 439.47 (230.80) TD-CFM(C) GPI 631.52 (11.58) 614.90 (8.82) 688.44 (4.05) TD2-CFM GPI 646.67 (9.38) 639.90 (13.22) 691.68 (2.99) REACH BOTTOM RIGHT FB 366.39 (27.01) FB GPI 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) TD-DD GPI 343.62 (112.70) 470.97 (42.25) 398.94 (81.80) TD2-DD GPI 360.71 (312.31) 674.54 (7.35) 529.97 (137.98) TD-CFM GPI 394.78 (159.98) 356.58 (69.31) 548.65 (73.62) TD-CFM(C) GPI 642.67 (6.59) 686.08 (4.46) 679.75 (2.98) TD2-CFM GPI 534.62 (57.49) 687.66 (1.75) 641.45 (2.00) FB 895.88 (1.26) FB GPI 351.72 (17.68) 837.14 (2.07) 185.50 (13.00) TD-DD GPI 941.32 (16.86) 812.40 (152.88) 920.44 (28.63) TD2-DD GPI 940.90 (5.41) 967.44 (3.52) 939.49 (10.02) TD-CFM GPI 964.27 (0.34) 948.83 (11.63) 955.82 (9.32) TD-CFM(C) GPI 940.00 (29.18) 967.03 (3.38) 931.02 (18.53) TD2-CFM GPI 943.43 (19.57) 967.06 (1.42) 940.05 (18.13) REACH TOP RIGHT FB 715.25 (4.47) FB GPI 0.72 (0.96) 358.22 (20.05) 1.35 (0.85) TD-DD GPI 766.59 (6.78) 771.64 (9.55) 733.83 (44.76) TD2-DD GPI 822.44 (1.74) 818.06 (6.60) 823.09 (1.76) TD-CFM GPI 777.94 (46.86) 765.68 (41.55) 754.73 (45.71) TD-CFM(C) GPI 826.30 (1.36) 824.87 (2.61) 821.51 (5.64) TD2-CFM GPI 809.75 (28.90) 824.23 (1.88) 788.98 (45.69) FB 337.33 (9.46) FB GPI 4.89 (1.03) 148.93 (0.90) 2.97 (0.82) TD-DD GPI 585.01 (39.45) 587.71 (46.77) 451.92 (5.35) TD2-DD GPI 896.45 (7.94) 910.52 (2.63) 878.19 (11.18) TD-CFM GPI 790.65 (3.06) 843.76 (8.91) 841.25 (22.73) TD-CFM(C) GPI 905.41 (1.13) 920.06 (2.51) 874.00 (10.23) TD2-CFM GPI 901.82 (1.29) 910.41 (1.66) 866.85 (7.95) Domain Method Planner Z-Distribution D(Z) Random Local Perturbation Train Distribution FB 221.55 (44.79) FB GPI 355.27 (5.95) 356.52 (9.99) 355.94 (5.10) TD-DD GPI 451.93 (81.15) 445.10 (100.81) 424.78 (100.74) TD2-DD GPI 702.98 (27.77) 712.72 (16.66) 683.62 (35.04) TD-CFM GPI 355.69 (110.25) 420.53 (184.00) 341.40 (124.16) TD-CFM(C) GPI 724.85 (8.19) 710.02 (4.51) 711.16 (13.29) TD2-CFM GPI 722.08 (7.50) 718.74 (14.51) 713.66 (14.14) FLIP BACKWARD FB 463.12 (5.73) FB GPI 238.33 (9.74) 388.33 (25.98) 249.60 (5.64) TD-DD GPI 620.00 (69.42) 596.45 (38.20) 595.59 (34.96) TD2-DD GPI 706.99 (8.08) 690.83 (3.20) 706.75 (8.34) TD-CFM GPI 545.12 (184.05) 540.36 (186.74) 492.55 (173.13) TD-CFM(C) GPI 727.23 (25.25) 716.22 (29.49) 711.11 (20.97) TD2-CFM GPI 709.19 (16.76) 684.33 (37.92) 694.16 (15.24) FB 310.39 (35.44) FB GPI 200.65 (4.44) 301.34 (11.26) 191.10 (6.56) TD-DD GPI 436.74 (3.52) 438.90 (4.92) 434.94 (3.02) TD2-DD GPI 427.15 (16.50) 429.98 (13.04) 421.92 (14.83) TD-CFM GPI 206.96 (45.56) 243.53 (60.37) 238.96 (66.97) TD-CFM(C) GPI 465.08 (2.50) 470.44 (5.05) 462.89 (3.15) TD2-CFM GPI 462.71 (9.73) 467.25 (14.78) 454.90 (10.61) RUN BACKWARD FB 201.07 (10.72) FB GPI 5.31 (2.02) 102.20 (5.73) 19.11 (2.52) TD-DD GPI 165.02 (4.50) 246.72 (12.09) 325.40 (0.86) TD2-DD GPI 224.90 (21.33) 310.10 (22.82) 
322.33 (4.05) TD-CFM GPI 90.83 (28.26) 92.46 (15.59) 49.88 (29.15) TD-CFM(C) GPI 222.14 (36.05) 342.15 (2.02) 333.90 (3.00) TD2-CFM GPI 252.70 (10.86) 319.46 (35.05) 332.21 (0.77) FB 792.89 (52.74) FB GPI 830.00 (15.20) 889.84 (5.00) 733.11 (34.27) TD-DD GPI 977.30 (3.13) 978.74 (2.47) 979.48 (3.47) TD2-DD GPI 959.18 (30.39) 955.97 (25.64) 956.79 (29.06) TD-CFM GPI 767.47 (96.47) 805.68 (104.96) 853.73 (117.82) TD-CFM(C) GPI 985.04 (0.10) 985.06 (0.29) 984.90 (0.18) TD2-CFM GPI 984.21 (0.03) 984.46 (0.09) 984.23 (0.07) FB 897.16 (32.19) FB GPI 22.40 (10.18) 373.19 (13.71) 78.60 (18.72) TD-DD GPI 793.32 (52.67) 946.70 (12.57) 982.37 (0.21) TD2-DD GPI 951.82 (11.09) 981.74 (0.27) 982.45 (0.07) TD-CFM GPI 455.18 (190.16) 456.19 (140.79) 257.85 (173.06) TD-CFM(C) GPI 964.75 (4.54) 981.93 (0.26) 982.89 (0.25) TD2-CFM GPI 962.41 (4.32) 982.08 (0.16) 982.64 (0.05)

Figure 5. Qualitative samples generated with TD-CFM, TD-DD, VAE, and GAN methods for various discount factors γ on the LOOP task in the POINTMASS MAZE domain. The last row depicts ground truth discounted occupancies.

E. Theoretical Results

E.1. Proofs of Main Results

Lemma 1. Let $p_t$ be a probability path for $P$ generated by vector field $v_t$ and $\bar p^{(n)}_t$ be a probability path for $P^\pi m^{(n)}_1$ generated by $\bar v^{(n)}_t$ such that $p_0 = \bar p^{(n)}_0 = m_0$. For any $t \in [0, 1]$ and $(s, a)$, let $v^{(n+1)}_t(\cdot \mid s, a)$ be the solution of (4):

$$\arg\min_{v : \mathbb{R}^d \to \mathbb{R}^d}\; (1 - \gamma)\, \mathbb{E}_{X_t \sim p_t(\cdot|s,a)}\left[ \| v(X_t) - v_t(X_t \mid s, a) \|^2 \right] + \gamma\, \mathbb{E}_{\bar X_t \sim \bar p^{(n)}_t(\cdot|s,a)}\left[ \| v(\bar X_t) - \bar v^{(n)}_t(\bar X_t \mid s, a) \|^2 \right].$$

Then $v^{(n+1)}_t$ induces a probability path $m^{(n+1)}_t$ such that $m^{(n+1)}_0 = m_0$ and $m^{(n+1)}_1 = \mathcal{T}^\pi m^{(n)}_1$.

Proof. By Lemma 4, we have that

$$v^{(n+1)}_t(x \mid s, a) = \frac{(1 - \gamma)\, p_t(x \mid s, a)\, v_t(x \mid s, a) + \gamma\, \bar p^{(n)}_t(x \mid s, a)\, \bar v^{(n)}_t(x \mid s, a)}{m^{(n+1)}_t(x \mid s, a)}\,,$$

where $m^{(n+1)}_t(x \mid s, a) = (1 - \gamma)\, p_t(x \mid s, a) + \gamma\, \bar p^{(n)}_t(x \mid s, a)$. Lemma 3 implies that $m^{(n+1)}_t$ is the probability path generated by $v^{(n+1)}_t$. It is easy to see that $m^{(n+1)}_0 = m_0$ since $p_0 = \bar p^{(n)}_0 = m_0$. Moreover, since $p_1 = P$ and $\bar p^{(n)}_1 = P^\pi m^{(n)}_1$ by assumption, $m^{(n+1)}_1 = (1 - \gamma) P + \gamma P^\pi m^{(n)}_1 = \mathcal{T}^\pi m^{(n)}_1$, which proves the result.

Theorem 1. For any $n \geq 1$, the probability paths generated by TD-CFM, TD-CFM(C), or TD2-CFM satisfy

$$m^{(n+1)}_t(x \mid s, a) = \mathcal{B}^\pi_t m^{(n)}_t(x \mid s, a)\,, \quad t \in [0, 1]\,,$$

where $\mathcal{B}^\pi_t m := (1 - \gamma) P_t + \gamma P^\pi m$ and $P_t(x \mid s, a) := \int p_{t|1}(x \mid x_1)\, P(x_1 \mid s, a)\, \mathrm{d}x_1$. For any $t \in [0, 1]$, the operator $\mathcal{B}^\pi_t$ is a γ-contraction in 1-Wasserstein distance; that is, for any pair of probability paths $p_t, q_t$,

$$\sup_{s,a} W_1\big((\mathcal{B}^\pi_t p_t)(\cdot \mid s, a),\, (\mathcal{B}^\pi_t q_t)(\cdot \mid s, a)\big) \leq \gamma\, \sup_{s,a} W_1\big(p_t(\cdot \mid s, a),\, q_t(\cdot \mid s, a)\big)\,.$$

Proof. To prove that the iterates of the three algorithms satisfy a Bellman-like update through the operator $\mathcal{B}^\pi_t$, we only need to apply Proposition 3 for TD2-CFM, Theorem 5 for TD-CFM, and Theorem 6 for TD-CFM(C). That $\mathcal{B}^\pi_t$ is a γ-contraction in 1-Wasserstein distance can be seen by applying Theorem 4 with $k = 1$.

Corollary 1. Let $\{m^{(n)}_t\}_{n \geq 0}$ be the sequence of probability paths produced by TD-CFM, TD-CFM(C), or TD2-CFM starting from an arbitrary vector field $v^{(0)}_t$. Then $\lim_{n \to \infty} m^{(n)}_t = m^\star_t = \mathcal{B}^\pi_t m^\star_t$, where $m^\star_t$ is the unique fixed point of $\mathcal{B}^\pi_t$, and $m^\star_t = m^{\mathrm{MC}}_t$, where $m^{\mathrm{MC}}_t(\cdot \mid s, a) = \int p_{t|1}(\cdot \mid x_1)\, m^\pi(x_1 \mid s, a)\, \mathrm{d}x_1$ is the probability path of the Monte-Carlo approach (MC-CFM; 6).

Proof.
Corollary 1. Let $\{m_t^{(n)}\}_{n\ge0}$ be the sequence of probability paths produced by TD-CFM, TD-CFM(C), or TD2-CFM starting from an arbitrary vector field $v_t^{(0)}$. Then $\lim_{n\to\infty} m_t^{(n)} = m_t^\star = B_t^\pi m_t^\star$, where $m_t^\star$ is the unique fixed point of $B_t^\pi$, and $m_t^\star = m_t^{\mathrm{MC}}$, where $m_t^{\mathrm{MC}}(\cdot \mid s,a) = \int p_{t|1}(\cdot \mid x_1)\, m^\pi(dx_1 \mid s,a)$ is the probability path of the Monte-Carlo approach (MC-CFM; 6).

Proof. That $B_t^\pi$ has a unique fixed point $m_t^\star$ to which every sequence $m_t^{(n)}$ converges is a consequence of the Banach fixed-point theorem applied on the space of all probability paths $m_t : S \times A \to \mathcal{P}(\mathbb{R}^d)$ equipped with the sup-1-Wasserstein metric. By inspecting the definition of $B_t^\pi$, it is easy to see that $m_t^\star = (1-\gamma)(I - \gamma P^\pi)^{-1} P_t$. Since $P_t(x|s,a) = \int p_{t|1}(x|x_1)P(x_1|s,a)\,dx_1$,
$$m_t^\star(x|s,a) = \int p_{t|1}(x|x_1)\,\underbrace{\big[(1-\gamma)(I-\gamma P^\pi)^{-1}P\big](x_1|s,a)}_{=\,m^\pi(x_1|s,a)}\,dx_1 = m_t^{\mathrm{MC}}(x|s,a).$$

Theorem 2. For any $n \ge 1$ and $t \in [0,1]$, assume that $m_t^{(n)}(x \mid s,a) = \int p_{t|1}(x \mid x_1)\, m_1^{(n)}(x_1 \mid s,a)\,dx_1$. Then
$$\sigma^2_{\mathrm{TD\text{-}CFM}} = \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\, \mathbb{E}\Big[\operatorname{Tr}\operatorname{Cov}_{X_1 \mid s,a,X_t}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|1}(X_t|X_1)\big)\Big].$$
Proof. See Theorem 7.

Theorem 3. For any $n \ge 1$ and $t \in [0,1]$, assume that $m_t^{(n)}(x \mid s,a) = \int p_{t|0,1}(x \mid x_0,x_1)\, m_{0,1}^{(n)}(x_0,x_1 \mid s,a)\,dx_0\,dx_1$. Then
$$\sigma^2_{\mathrm{TD\text{-}CFM(C)}} = \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\,\mathbb{E}\Big[\operatorname{Tr}\operatorname{Cov}_{Z \mid S,A,X_t}\big(\nabla_\theta v_t(X_t|S,A;\theta)^\top u_{t|Z}(X_t|Z)\big)\Big],$$
where $Z = (X_0,X_1)$. Furthermore, if we use straight conditional paths, i.e., $X_t = tX_1 + (1-t)X_0$, and the linear interpolant $X_t$ does not intersect for any $s,a,s'$, then $\sigma^2_{\mathrm{TD\text{-}CFM(C)}} = \sigma^2_{\mathrm{TD2\text{-}CFM}}$.
Proof. See Theorem 8.

E.2. General Results

Lemma 3. Let $v_t^1$ and $v_t^2$ be vector fields that generate the probability paths $p_t^1$ and $p_t^2$, respectively. Then, for any $\gamma \in [0,1]$, the mixture probability path $p_t = (1-\gamma)p_t^1 + \gamma p_t^2$ is generated by the vector field
$$v_t := \frac{(1-\gamma)\,p_t^1 v_t^1 + \gamma\, p_t^2 v_t^2}{(1-\gamma)\,p_t^1 + \gamma\, p_t^2}. \qquad (27)$$

Proof. Since $v_t^1$ (resp. $v_t^2$) generates $p_t^1$ (resp. $p_t^2$), we know from the continuity equation that
$$\frac{\partial p_t^1}{\partial t} = -\operatorname{div}(p_t^1 v_t^1), \qquad \frac{\partial p_t^2}{\partial t} = -\operatorname{div}(p_t^2 v_t^2),$$
where $\operatorname{div}$ denotes the divergence operator. Then, by linearity of $\operatorname{div}$,
$$\frac{\partial p_t}{\partial t} = (1-\gamma)\frac{\partial p_t^1}{\partial t} + \gamma\frac{\partial p_t^2}{\partial t} = -\operatorname{div}\big((1-\gamma)p_t^1v_t^1 + \gamma p_t^2v_t^2\big) = -\operatorname{div}\Big(\frac{(1-\gamma)p_t^1v_t^1 + \gamma p_t^2v_t^2}{(1-\gamma)p_t^1 + \gamma p_t^2}\,\big[(1-\gamma)p_t^1 + \gamma p_t^2\big]\Big) = -\operatorname{div}(v_t\,p_t).$$
Hence $(v_t, p_t)$ satisfies the continuity equation, which implies that $v_t$ generates $p_t$.

Lemma 4. Let $v_t^1$ and $v_t^2$ be vector fields that generate the probability paths $p_t^1$ and $p_t^2$, respectively. For $\gamma \in [0,1]$, the vector field $v_t = \frac{(1-\gamma)p_t^1 v_t^1 + \gamma p_t^2 v_t^2}{(1-\gamma)p_t^1 + \gamma p_t^2}$ satisfies
$$v_t = \arg\min_{v:\mathbb{R}^d\to\mathbb{R}^d}\Big\{(1-\gamma)\,\mathbb{E}_{x_t\sim p_t^1}\|v(x_t)-v_t^1(x_t)\|^2 + \gamma\,\mathbb{E}_{x_t\sim p_t^2}\|v(x_t)-v_t^2(x_t)\|^2\Big\}.$$

Proof. Let $\ell_t(v) := (1-\gamma)\,\mathbb{E}_{x_t\sim p_t^1}\|v(x_t)-v_t^1(x_t)\|^2 + \gamma\,\mathbb{E}_{x_t\sim p_t^2}\|v(x_t)-v_t^2(x_t)\|^2$. The functional derivative of this quantity w.r.t. $v$ evaluated at a point $x$ is
$$\nabla_v \ell_t(v)(x) = 2\big[(1-\gamma)\,p_t^1(x)\,(v(x)-v_t^1(x)) + \gamma\,p_t^2(x)\,(v(x)-v_t^2(x))\big].$$
Setting this to zero and solving for $v(x)$ yields the result.
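Lemmas 3 and 4 say that the density-weighted average of two vector fields is exactly the minimizer of the mixture regression loss. The short sketch below checks this pointwise identity on a one-dimensional grid; the densities and fields are arbitrary toy choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 200, 0.7
x = np.linspace(-3, 3, n)

# two arbitrary densities and vector fields on a 1-D grid (toy stand-ins)
p1 = np.exp(-0.5 * (x - 1) ** 2); p1 /= p1.sum()
p2 = np.exp(-0.5 * (x + 1) ** 2); p2 /= p2.sum()
v1, v2 = np.sin(x), np.cos(x)

def objective(v):
    # (1-gamma) E_{p1} ||v - v1||^2 + gamma E_{p2} ||v - v2||^2 on the grid
    return (1 - gamma) * np.sum(p1 * (v - v1) ** 2) + gamma * np.sum(p2 * (v - v2) ** 2)

# the density-weighted average of Lemma 4
v_star = ((1 - gamma) * p1 * v1 + gamma * p2 * v2) / ((1 - gamma) * p1 + gamma * p2)

for _ in range(5):  # any perturbation strictly increases the loss
    assert objective(v_star) <= objective(v_star + 0.1 * rng.standard_normal(n))
print("pointwise weighted average attains the minimum:", objective(v_star))
```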
E.3. Analysis of TD2-CFM

We study the learning dynamics of an idealized variant of TD2-CFM which minimizes the flow-matching loss exactly. Starting from an arbitrary vector field $v_t^{(0)}$, at each iteration $n \ge 0$ we compute
$$v_t^{(n+1)}(\cdot|s,a) \in \arg\min_{v:\mathbb{R}^d\to\mathbb{R}^d} \ell^{(n)}_{\mathrm{TD2\text{-}CFM}}(t,s,a), \qquad (28)$$
$$\ell^{(n)}_{\mathrm{TD2\text{-}CFM}}(t,s,a) := (1-\gamma)\,\tilde\ell(t,s,a) + \gamma\,\bar\ell(t,s,a),$$
$$\tilde\ell(t,s,a) := \mathbb{E}_{S'\sim P(\cdot|s,a),\;X_t\sim p_{t|1}(\cdot|S')}\big[\|v(X_t|s,a) - u_{t|1}(X_t|S')\|^2\big],$$
$$\bar\ell(t,s,a) := \mathbb{E}_{S'\sim P(\cdot|s,a),\;X_t\sim m_t^{(n)}(\cdot|S',\pi(S'))}\big[\|v(X_t|s,a) - v_t^{(n)}(X_t|S',\pi(S'))\|^2\big],$$
where $m_t^{(n)}(x|s,a)$ is the probability path generated by $v_t^{(n)}(x|s,a)$.

Lemma 5. For any $n \ge 0$, the vector field minimizing (28) is
$$v_t^{(n+1)}(x \mid s,a) = \frac{(1-\gamma)\int u_{t|1}(x|x_1)\,p_{t|1}(x|x_1)\,P(x_1|s,a)\,dx_1 + \gamma\,\mathbb{E}_{S'\sim P(\cdot|s,a)}\big[m_t^{(n)}(x|S',\pi(S'))\,v_t^{(n)}(x|S',\pi(S'))\big]}{m_t^{(n+1)}(x|s,a)},$$
where we define $m_t^{(n+1)}(x|s,a) := (1-\gamma)P_t(x|s,a) + \gamma\,\mathbb{E}_{S'\sim P(\cdot|s,a)}\big[m_t^{(n)}(x|S',\pi(S'))\big]$ and $P_t(x|s,a) := \int p_{t|1}(x|x_1)P(x_1|s,a)\,dx_1$. Moreover, $v_t^{(n+1)}$ generates $m_t^{(n+1)}$.

Proof. By Theorem 2 of Lipman et al. (2023), for the first term in $\ell^{(n)}_{\mathrm{TD2\text{-}CFM}}$ we have
$$\nabla_\theta\,\tilde\ell(t,s,a) = \nabla_\theta\,\mathbb{E}_{X_t\sim P_t(\cdot|s,a)}\big[\|v_t(X_t|s,a) - \tilde v_t(X_t|s,a)\|^2\big],$$
where $\tilde v_t(x|s,a) = \frac{\int u_{t|1}(x|x_1)\,p_{t|1}(x|x_1)\,P(x_1|s,a)\,dx_1}{P_t(x|s,a)}$. Similarly, for the second term,
$$\nabla_\theta\,\bar\ell(t,s,a) = \nabla_\theta\,\mathbb{E}_{X_t\sim \bar p_t^{(n)}(\cdot|s,a)}\big[\|v_t(X_t|s,a) - \bar v_t(X_t|s,a)\|^2\big],$$
where $\bar p_t^{(n)} = P^\pi m_t^{(n)}$ and $\bar v_t = \frac{P^\pi\big(m_t^{(n)} v_t^{(n)}\big)}{P^\pi m_t^{(n)}}$. Therefore $\ell^{(n)}_{\mathrm{TD2\text{-}CFM}}(t,s,a)$ is equivalent, in terms of gradients, to a mixture of two marginal flow-matching losses, which implies that $v_t^{(n+1)}$ has the stated expression by Lemma 4. The fact that it generates $m_t^{(n+1)}$ is a consequence of Lemma 3.

We then define the following operator to characterize the iterates of TD2-CFM.

Definition 1 (Bellman operator for probability paths). For any $t \in [0,1]$, we define the operator $B_t^\pi m := (1-\gamma)P_t + \gamma P^\pi m$, where $P_t(x|s,a) := \int p_{t|1}(x \mid x_1)P(x_1|s,a)\,dx_1$.

The following observation is then immediate from Lemma 5.

Proposition 3. For any $n \ge 0$, the probability path generated by TD2-CFM satisfies $m_t^{(n+1)}(x|s,a) = [B_t^\pi m_t^{(n)}](x \mid s,a)$, where $B_t^\pi$ is the operator of Definition 1. Moreover, $m_1^{(n+1)}(x|s,a) = [T^\pi m_1^{(n)}](x \mid s,a)$.

Theorem 4. For any $t \in [0,1]$, the operator $B_t^\pi$ of Definition 1 is a $\gamma^{1/k}$-contraction in Wasserstein $k$-distance, i.e., for any couple of probability paths $p_t, q_t$ and $k \in [1,\infty)$,
$$\sup_{s,a} W_k\big((B_t^\pi p_t)(\cdot \mid s,a),\,(B_t^\pi q_t)(\cdot \mid s,a)\big) \le \gamma^{1/k} \sup_{s,a} W_k\big(p_t(\cdot \mid s,a),\, q_t(\cdot \mid s,a)\big).$$

Proof. Recall that the Wasserstein $k$-distance between $p_t$ and $q_t$ induced by a metric $d$ is
$$W_k\big(p_t(\cdot|s,a), q_t(\cdot|s,a)\big) := \inf_{\Gamma(\cdot|s,a)\,\in\, C(p_t(\cdot|s,a),\,q_t(\cdot|s,a))} \mathbb{E}_{(X,Y)\sim\Gamma(\cdot|s,a)}\big[d(X,Y)^k\big]^{1/k},$$
where $C(p_t(\cdot|s,a), q_t(\cdot|s,a))$ is the set of all couplings between the two measures. Now take any coupling $\Gamma(\cdot|s,a) \in C(p_t(\cdot|s,a), q_t(\cdot|s,a))$ for each $(s,a)$. Then the quantity
$$\Theta(x,y|s,a) = (1-\gamma)\,P_t(x|s,a)\,\delta(x-y) + \gamma\,(P^\pi\Gamma)(x,y|s,a)$$
is a valid coupling between $(B_t^\pi p_t)(\cdot \mid s,a)$ and $(B_t^\pi q_t)(\cdot \mid s,a)$. In fact,
$$\int \Theta(x,y|s,a)\,dx = (1-\gamma)P_t(y|s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\int \Gamma(x,y|s',\pi(s'))\,dx\Big] = (1-\gamma)P_t(y|s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[q_t(y|s',\pi(s'))\big] = \big[B_t^\pi q_t\big](y|s,a).$$
Analogously, we can prove that $\int \Theta(x,y|s,a)\,dy = [B_t^\pi p_t](x|s,a)$. Then
$$W_k\big((B_t^\pi p_t)(\cdot|s,a), (B_t^\pi q_t)(\cdot|s,a)\big) \le \mathbb{E}_{(X,Y)\sim\Theta(\cdot|s,a)}\big[d(X,Y)^k\big]^{1/k} = \Big((1-\gamma)\,\mathbb{E}_{X\sim P_t(\cdot|s,a),\,Y=X}\big[d(X,Y)^k\big] + \gamma\,\mathbb{E}_{(X,Y)\sim(P^\pi\Gamma)(\cdot|s,a)}\big[d(X,Y)^k\big]\Big)^{1/k} = \gamma^{1/k}\,\mathbb{E}_{s'\sim P(\cdot|s,a),\,(X,Y)\sim\Gamma(\cdot|s',\pi(s'))}\big[d(X,Y)^k\big]^{1/k},$$
where the first term vanishes because $Y = X$ under the diagonal coupling. Since this holds for any family of couplings $\Gamma(\cdot|s,a) \in C(p_t(\cdot|s,a), q_t(\cdot|s,a))$, we can take the infimum over all such couplings on the right-hand side, so that
$$W_k\big((B_t^\pi p_t)(\cdot|s,a), (B_t^\pi q_t)(\cdot|s,a)\big) \le \gamma^{1/k}\Big(\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[W_k\big(p_t(\cdot|s',\pi(s')),\,q_t(\cdot|s',\pi(s'))\big)^k\big]\Big)^{1/k} \le \gamma^{1/k}\,\sup_{s,a} W_k\big(p_t(\cdot|s,a),\,q_t(\cdot|s,a)\big).$$
Taking the supremum over $(s,a)$ of the left-hand side concludes the proof.
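As a concrete reading of the idealized update (28), a sample-based TD2-CFM loss could be sketched as follows. This is a hedged PyTorch rendering, not the authors' released code: `sample_path_point` is an assumed helper that integrates the frozen ODE from noise up to time $t$, and the Gaussian path with $X_t = tX_1 + (1-t)X_0$ is assumed throughout.

```python
import torch

def td2_cfm_loss(v, v_frozen, sample_path_point, s, a, s_next, a_next, gamma):
    """One stochastic evaluation of the idealized TD2-CFM objective (Eq. 28).

    v:        current vector field v_t(x | s, a; theta)
    v_frozen: target network v_t^{(n)} (queried without gradients)
    sample_path_point(t, s, a): integrates the frozen ODE from X_0 ~ m_0
                                up to time t, returning X_t ~ m_t^{(n)}(.|s,a).
    All helper names are illustrative assumptions, not the paper's API.
    """
    batch = s.shape[0]
    t = torch.rand(batch, 1)

    # (1 - gamma) term: standard conditional flow matching toward X_1 = s'
    x0 = torch.randn_like(s_next)                  # X_0 ~ m_0 = N(0, I)
    xt = t * s_next + (1 - t) * x0                 # Gaussian path N(t s', (1-t)^2 I)
    u = s_next - x0                                # conditional target u_{t|1}
    fm_term = ((v(t, xt, s, a) - u) ** 2).sum(-1)

    # gamma term: bootstrap against the frozen field at the frozen model's own X_t
    with torch.no_grad():
        xt_boot = sample_path_point(t, s_next, a_next)       # X_t ~ m_t^{(n)}(.|s', pi(s'))
        target = v_frozen(t, xt_boot, s_next, a_next)        # v_t^{(n)}(X_t | s', pi(s'))
    boot_term = ((v(t, xt_boot, s, a) - target) ** 2).sum(-1)

    return ((1 - gamma) * fm_term + gamma * boot_term).mean()
```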
E.4. Analysis of TD-CFM

We study the learning dynamics of an idealized variant of TD-CFM which minimizes the flow-matching loss exactly. Starting from an arbitrary vector field $v_t^{(0)}$, at each iteration $n \ge 0$ we compute
$$v_t^{(n+1)}(\cdot|s,a) \in \arg\min_{v:\mathbb{R}^d\to\mathbb{R}^d} \ell^{(n)}_{\mathrm{TD\text{-}CFM}}(t,s,a) := \mathbb{E}_{X_1\sim [T^\pi m_1^{(n)}](\cdot|s,a),\;X_t\sim p_{t|1}(\cdot|X_1)}\big[\|v(X_t) - u_{t|1}(X_t|X_1)\|^2\big], \qquad (29)$$
where $m_t^{(n)}(x|s,a)$ is the probability path generated by $v_t^{(n)}(x|s,a)$.

Lemma 6. For any $n \ge 0$, the vector field minimizing (29) is
$$v_t^{(n+1)}(x \mid s,a) = \int u_{t|1}(x|x_1)\,\frac{p_{t|1}(x \mid x_1)\,[T^\pi m_1^{(n)}](x_1 \mid s,a)}{m_t^{(n+1)}(x|s,a)}\,dx_1,$$
where $m_t^{(n+1)}(x|s,a) := \int p_{t|1}(x \mid x_1)\,[T^\pi m_1^{(n)}](x_1 \mid s,a)\,dx_1$. Moreover, $v_t^{(n+1)}$ generates $m_t^{(n+1)}$.

Proof. Note that (29) is a standard flow-matching loss for the target distribution $T^\pi m_1^{(n)}$. The expression of $v_t^{(n+1)}(x \mid s,a)$ given in the statement is exactly the vector field obtained by marginalization of the conditional vector field $u_{t|1}$, which we know to be the minimizer of the loss from Theorem 2 of Lipman et al. (2023). The fact that $v_t^{(n+1)}$ generates $m_t^{(n+1)}$ is a consequence of Theorem 1 of Lipman et al. (2023).

Lemma 7. For any $n \ge 0$, the probability path generated by (29) satisfies $m_1^{(n+1)}(x|s,a) = [T^\pi m_1^{(n)}](x|s,a)$.

Proof. This is immediate from the definition of conditional probability path, as we set $p_{1|1}(x \mid x_1) = \delta(x - x_1)$ by construction, where $\delta(\cdot)$ is the Dirac delta function.

Theorem 5. For any $n \ge 1$, the probability path generated by (29) satisfies $m_t^{(n+1)}(x|s,a) = [B_t^\pi m_t^{(n)}](x|s,a)$, where $B_t^\pi$ is the operator of Definition 1. Moreover, if the initial vector field $v_t^{(0)}$ satisfies
$$v_t^{(0)}(x \mid s,a) = \int u_{t|1}(x|x_1)\,\frac{p_{t|1}(x \mid x_1)\,m_1^{(0)}(x_1 \mid s,a)}{m_t^{(0)}(x|s,a)}\,dx_1,$$
with $m_t^{(0)}$ being its generated probability path, then this result is valid at all $n \ge 0$.

Proof. We know that, for all $n \ge 0$, $v_t^{(n+1)}$ generates $m_t^{(n+1)}$ (Lemma 6) and that $m_1^{(n+1)} = T^\pi m_1^{(n)}$ (Lemma 7). Note that $m_t^{(n+1)}$ is written as a function of $m_1^{(n)}$ only, i.e., at each iteration we keep only the distribution generated at time $t=1$ ($m_1^{(n)}$) and discard the associated probability path ($m_t^{(n)}$ for $t<1$). We can however express $m_t^{(n+1)}$ as a function of $m_t^{(n)}$ thanks to the linearity of the Bellman operator and the definition of marginal paths. For any $n \ge 1$,
$$m_t^{(n+1)}(x \mid s,a) := \int p_{t|1}(x \mid x_1)\,[T^\pi m_1^{(n)}](x_1 \mid s,a)\,dx_1 = \int p_{t|1}(x \mid x_1)\Big[(1-\gamma)P(x_1 \mid s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[m_1^{(n)}(x_1 \mid s',\pi(s'))\big]\Big]dx_1$$
$$= (1-\gamma)\int p_{t|1}(x \mid x_1)P(x_1 \mid s,a)\,dx_1 + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\int p_{t|1}(x \mid x_1)\,[T^\pi m_1^{(n-1)}](x_1 \mid s',\pi(s'))\,dx_1\Big]$$
$$= (1-\gamma)P_t(x|s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[m_t^{(n)}(x \mid s',\pi(s'))\big] = \big[B_t^\pi m_t^{(n)}\big](x \mid s,a),$$
where the second line uses $m_1^{(n)} = T^\pi m_1^{(n-1)}$ (Lemma 7, valid for $n \ge 1$) and the last uses the definition of $m_t^{(n)}$ from Lemma 6. This proves the first part of the statement. For the second part, we only need to prove that the result also holds at $n = 0$. Note that the assumption on $v_t^{(0)}$ implies that $m_t^{(0)}(x \mid s,a) := \int p_{t|1}(x \mid x_1)\,m_1^{(0)}(x_1 \mid s,a)\,dx_1$. Thus,
$$m_t^{(1)}(x \mid s,a) := \int p_{t|1}(x \mid x_1)\,[T^\pi m_1^{(0)}](x_1 \mid s,a)\,dx_1 = (1-\gamma)P_t(x|s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[m_t^{(0)}(x \mid s',\pi(s'))\big] = \big[B_t^\pi m_t^{(0)}\big](x \mid s,a).$$
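The TD-CFM update (29) only differs from standard conditional flow matching in how the target endpoint $X_1 \sim [T^\pi m_1^{(n)}](\cdot|s,a)$ is drawn: with probability $1-\gamma$ it is the observed next state, otherwise it is generated by the frozen model. A minimal sketch of that sampling step, where `flow_endpoint` is a hypothetical helper integrating the frozen ODE to $t = 1$:

```python
import torch

def sample_target_state(s_next, a_next, flow_endpoint, gamma):
    """Draw X_1 ~ (T^pi m_1^{(n)})(.|s,a) for the TD-CFM loss (Eq. 29).

    With probability 1 - gamma the target is the observed next state s';
    otherwise it is the frozen model's terminal sample psi_1^{(n)}(X_0 | s', pi(s')).
    flow_endpoint is an assumed helper, not the paper's API.
    """
    with torch.no_grad():
        x0 = torch.randn_like(s_next)                    # X_0 ~ m_0 = N(0, I)
        boot = flow_endpoint(x0, s_next, a_next)         # psi_1^{(n)}(X_0 | s', pi(s'))
        keep = (torch.rand(s_next.shape[0], 1) < 1 - gamma).float()
        return keep * s_next + (1 - keep) * boot

# the loss then reduces to plain conditional flow matching toward X_1:
#   E || v(t, t*X1 + (1-t)*X0', s, a) - (X1 - X0') ||^2, with a fresh X0' ~ m_0
```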
E.5. Analysis of TD-CFM(C)

The idealized update of TD-CFM(C) is, for any $n \ge 0$,
$$v_t^{(n+1)}(\cdot|s,a) \in \arg\min_{v:\mathbb{R}^d\to\mathbb{R}^d} \ell^{(n)}_{\mathrm{TD\text{-}CFM(C)}}(t,s,a),$$
where
$$\ell^{(n)}_{\mathrm{TD\text{-}CFM(C)}}(t,s,a) := \mathbb{E}_{(X_0,X_1)\sim\Gamma^{(n)}_{0,1}(\cdot|s,a),\;X_t\sim p_{t|0,1}(\cdot|X_0,X_1)}\big[\|v(X_t) - u_{t|0,1}(X_t \mid X_0,X_1)\|^2\big], \qquad (30)$$
and $\Gamma^{(n)}_{0,1}(\cdot \mid s,a)$ is the coupling between $m_0$ and $T^\pi m_1^{(n)}$, while $p_{t|0,1}, u_{t|0,1}$ are such that $u_{t|0,1}(x \mid x_0,x_1)$ generates $p_{t|0,1}(x \mid x_0,x_1)$, $p_{1|0,1}(x \mid x_0,x_1) = \delta_{x_1}(x)$, and
$$p_{t|1}(x \mid x_1) = \int p_{t|0,1}(x \mid x_0,x_1)\,m_0(x_0)\,dx_0. \qquad (31)$$

Lemma 8. The coupling $\Gamma^{(n)}_{0,1}(\cdot \mid s,a)$ satisfies
$$\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a) = (1-\gamma)\,P(x_1 \mid s,a)\,m_0(x_0) + \gamma\,\mathbb{E}_{S'\sim P(\cdot|s,a)}\big[m^{(n)}_{0,1}(x_0,x_1 \mid S',\pi(S'))\big],$$
where $m^{(n)}_{0,1}(x_0,x_1 \mid s,a) = m_0(x_0)\,\delta_{\psi_1^{(n)}(x_0|s,a)}(x_1)$ is the joint distribution of $(X_0,X_1)$, i.e., the endpoints of the ODE.

Proof. For any $x_0, x_1$ we can write $\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a) = \Gamma^{(n)}_1(x_1 \mid s,a,x_0)\,m_0(x_0)$, where $\Gamma^{(n)}_1$ is the corresponding conditional distribution. By definition,
$$\Gamma^{(n)}_1(x_1 \mid s,a,x_0) = (1-\gamma)\,P(x_1 \mid s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\delta_{\psi_1^{(n)}(x_0|s',\pi(s'))}(x_1)\big],$$
where $\psi_1^{(n)}$ is the flow that generates $m_1^{(n)}$. Multiplying both sides by $m_0(x_0)$ and using $m^{(n)}_{0,1}(x_0,x_1 \mid s,a) = m_0(x_0)\,\delta_{\psi_1^{(n)}(x_0|s,a)}(x_1)$ concludes the proof.

Lemma 9. For any $n \ge 0$, the vector field minimizing (30) is
$$v_t^{(n+1)}(x \mid s,a) = \int\!\!\int u_{t|0,1}(x \mid x_0,x_1)\,\frac{p_{t|0,1}(x \mid x_0,x_1)\,\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a)}{m_t^{(n+1)}(x \mid s,a)}\,dx_0\,dx_1,$$
where $m_t^{(n+1)}(x \mid s,a) := \int\!\int p_{t|0,1}(x \mid x_0,x_1)\,\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0\,dx_1$. Moreover, $v_t^{(n+1)}$ generates $m_t^{(n+1)}$.

Proof. Note that (30) is a standard conditional flow-matching loss, since $u_{t|0,1}(x \mid x_0,x_1)$ generates $p_{t|0,1}(x \mid x_0,x_1)$ and $p_{1|0,1}(x \mid x_0,x_1) = \delta_{x_1}(x)$. The expression of $v_t^{(n+1)}(x \mid s,a)$ given in the statement is exactly the vector field obtained by marginalization of the conditional vector field $u_{t|0,1}$, which we know to be the minimizer of the loss from Theorem 2 of Lipman et al. (2023). The fact that $v_t^{(n+1)}$ generates $m_t^{(n+1)}$ is a consequence of Theorem 1 of Lipman et al. (2023).

Lemma 10. For any $n \ge 0$, the probability path generated by (30) satisfies $m_1^{(n+1)}(x \mid s,a) = [T^\pi m_1^{(n)}](x \mid s,a)$.

Proof. By Lemma 9 and the fact that $p_{1|0,1}(x \mid x_0,x_1) = \delta_{x_1}(x)$,
$$m_1^{(n+1)}(x \mid s,a) := \int\!\!\int p_{1|0,1}(x \mid x_0,x_1)\,\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0\,dx_1 = \int \Gamma^{(n)}_{0,1}(x_0,x \mid s,a)\,dx_0 = \big[T^\pi m_1^{(n)}\big](x \mid s,a).$$
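Lemma 8 is the operational heart of TD-CFM(C): the bootstrapped endpoint is the frozen ODE's terminus started from the same noise sample $X_0$, rather than an independently generated sample. A hedged sketch of drawing from $\Gamma^{(n)}_{0,1}$, again with a hypothetical `flow_endpoint` helper:

```python
import torch

def sample_coupling(s_next, a_next, flow_endpoint, gamma):
    """Draw (X_0, X_1) ~ Gamma^{(n)}_{0,1}(.|s,a) as characterized in Lemma 8.

    Unlike TD-CFM, the noise X_0 is coupled to the bootstrapped endpoint:
    in the gamma-branch, X_1 = psi_1^{(n)}(X_0 | s', pi(s')) is the terminus of
    the frozen ODE started at the *same* X_0. flow_endpoint is an assumption.
    """
    with torch.no_grad():
        x0 = torch.randn_like(s_next)                    # X_0 ~ m_0
        boot = flow_endpoint(x0, s_next, a_next)         # frozen ODE endpoint
        keep = (torch.rand(s_next.shape[0], 1) < 1 - gamma).float()
        x1 = keep * s_next + (1 - keep) * boot           # mixture of Lemma 8
    return x0, x1

# regression target: u_{t|0,1}(X_t | X_0, X_1) = X_1 - X_0 with X_t = t X_1 + (1-t) X_0
```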
Theorem 6. For any $n \ge 1$, the probability path generated by (30) satisfies $m_t^{(n+1)}(x \mid s,a) = [B_t^\pi m_t^{(n)}](x \mid s,a)$, where $B_t^\pi$ is the operator of Definition 1. Moreover, if the initial vector field $v_t^{(0)}$ satisfies
$$v_t^{(0)}(x \mid s,a) = \int\!\!\int u_{t|0,1}(x \mid x_0,x_1)\,\frac{p_{t|0,1}(x \mid x_0,x_1)\,m^{(0)}_{0,1}(x_0,x_1 \mid s,a)}{m_t^{(0)}(x \mid s,a)}\,dx_0\,dx_1,$$
with $m_t^{(0)}$ being its generated probability path, then this result is valid at all $n \ge 0$.

Proof. We know that, for all $n \ge 0$, $v_t^{(n+1)}$ generates $m_t^{(n+1)}$ (Lemma 9) and that $m_1^{(n+1)} = T^\pi m_1^{(n)}$ (Lemma 10). While $m_t^{(n+1)}$ is written as a function of $\Gamma^{(n)}_{0,1}$ only, we can rewrite it as a function of $m_t^{(n)}$ thanks to the linearity of the Bellman operator and the definition of marginal paths. For any $n \ge 1$, by Lemma 8,
$$m_t^{(n+1)}(x \mid s,a) := \int\!\!\int p_{t|0,1}(x \mid x_0,x_1)\,\Gamma^{(n)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0dx_1 = \underbrace{(1-\gamma)\int\!\!\int p_{t|0,1}(x \mid x_0,x_1)\,P(x_1 \mid s,a)\,m_0(x_0)\,dx_0dx_1}_{(i)} + \underbrace{\gamma\,\mathbb{E}_{S'\sim P(\cdot|s,a)}\Big[\int\!\!\int p_{t|0,1}(x \mid x_0,x_1)\,m^{(n)}_{0,1}(x_0,x_1 \mid S',\pi(S'))\,dx_0dx_1\Big]}_{(ii)}.$$
By (31), $(i) = (1-\gamma)\int p_{t|1}(x \mid x_1)P(x_1 \mid s,a)\,dx_1 = (1-\gamma)P_t(x \mid s,a)$. For $(ii)$, by Lemma 9 we have $m_t^{(n)}(x \mid s,a) = \int\!\int p_{t|0,1}(x \mid x_0,x_1)\,\Gamma^{(n-1)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0dx_1$ for all $n \ge 1$, which implies $m^{(n)}_{0,1}(x_0,x_1 \mid s',\pi(s')) = \Gamma^{(n-1)}_{0,1}(x_0,x_1 \mid s',\pi(s'))$. Therefore, again by the definition of $m_t^{(n)}$ (Lemma 9),
$$(ii) = \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\int\!\!\int p_{t|0,1}(x \mid x_0,x_1)\,\Gamma^{(n-1)}_{0,1}(x_0,x_1 \mid s',\pi(s'))\,dx_0dx_1\Big] = \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[m_t^{(n)}(x \mid s',\pi(s'))\big].$$
Plugging the expressions of $(i)$ and $(ii)$ into that of $m_t^{(n+1)}(x \mid s,a)$ yields the first part of the statement. For the second part, we only need to prove that the result also holds at $n = 0$. Note that the assumption on $v_t^{(0)}$ implies that $m_t^{(0)}(x \mid s,a) = \int\!\int p_{t|0,1}(x \mid x_0,x_1)\,m^{(0)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0dx_1$. Thus, using the same decomposition as above,
$$m_t^{(1)}(x \mid s,a) = (1-\gamma)P_t(x \mid s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\int\!\!\int p_{t|0,1}(x \mid x_0,x_1)\,m^{(0)}_{0,1}(x_0,x_1 \mid s',\pi(s'))\,dx_0dx_1\Big] = (1-\gamma)P_t(x \mid s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[m_t^{(0)}(x \mid s',\pi(s'))\big],$$
which proves the result.

E.6. Variance Analysis

Theorem 7. Let us define the random variables
$$\hat g_{\mathrm{TD2\text{-}CFM}}(t,s,a,s',\tilde X_t, X_t^{(n)}) := (1-\gamma)\,\nabla_\theta v_t(\tilde X_t|s,a;\theta)^\top\big(v_t(\tilde X_t|s,a;\theta) - u_{t|1}(\tilde X_t|s')\big) + \gamma\,\nabla_\theta v_t(X_t^{(n)}|s,a;\theta)^\top\big(v_t(X_t^{(n)}|s,a;\theta) - v_t^{(n)}(X_t^{(n)}|s',\pi(s'))\big),$$
$$\hat g_{\mathrm{TD\text{-}CFM}}(t,s,a,s',\tilde X_t, X_1, X_t) := (1-\gamma)\,\nabla_\theta v_t(\tilde X_t|s,a;\theta)^\top\big(v_t(\tilde X_t|s,a;\theta) - u_{t|1}(\tilde X_t|s')\big) + \gamma\,\nabla_\theta v_t(X_t|s,a;\theta)^\top\big(v_t(X_t|s,a;\theta) - u_{t|1}(X_t|X_1)\big),$$
where $t\sim U([0,1])$, $(s,a)\sim\rho$, $s'\sim P(\cdot|s,a)$, $\tilde X_t\sim p_{t|1}(\cdot|s')$, $X_t^{(n)}\sim m_t^{(n)}(\cdot \mid s',\pi(s'))$, $X_1\sim m_1^{(n)}(\cdot \mid s',\pi(s'))$, and $X_t\sim p_{t|1}(\cdot|X_1)$. Then $\hat g_{\mathrm{TD2\text{-}CFM}}$ and $\hat g_{\mathrm{TD\text{-}CFM}}$ are respectively unbiased estimates of the gradients $\nabla_\theta\ell_{\mathrm{TD2\text{-}CFM}}(\theta)$ and $\nabla_\theta\ell_{\mathrm{TD\text{-}CFM}}(\theta)$. Moreover, if we consider their respective total variations,
$$\sigma^2_{\mathrm{TD2\text{-}CFM}} = \operatorname{Tr}\operatorname{Cov}_{t,s,a,s',\tilde X_t,X_t^{(n)}}\big[\hat g_{\mathrm{TD2\text{-}CFM}}\big], \qquad \sigma^2_{\mathrm{TD\text{-}CFM}} = \operatorname{Tr}\operatorname{Cov}_{t,s,a,s',\tilde X_t,X_1,X_t}\big[\hat g_{\mathrm{TD\text{-}CFM}}\big],$$
and we assume that $m_t^{(n)}(x \mid s,a) = \int p_{t|1}(x \mid x_1)\,m_1^{(n)}(x_1 \mid s,a)\,dx_1$, then we obtain
$$\sigma^2_{\mathrm{TD\text{-}CFM}} = \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\,\mathbb{E}_{t,s,a,X_t}\Big[\operatorname{Tr}\operatorname{Cov}_{X_1|s,a,X_t}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|1}(X_t \mid X_1)\big)\Big].$$

Proof. Recall the TD2-CFM and TD-CFM objectives:
$$\ell_{\mathrm{TD2\text{-}CFM}}(\theta) = (1-\gamma)\,\mathbb{E}_{t,s,a,s',X_t\sim p_{t|1}(\cdot|s')}\big[\|v_t(X_t|s,a;\theta) - u_{t|1}(X_t|s')\|^2\big] + \gamma\,\mathbb{E}_{t,s,a,s',X_t\sim m_t^{(n)}(\cdot|s',\pi(s'))}\big[\|v_t(X_t|s,a;\theta) - v_t^{(n)}(X_t|s',\pi(s'))\|^2\big],$$
$$\ell_{\mathrm{TD\text{-}CFM}}(\theta) = (1-\gamma)\,\mathbb{E}_{t,s,a,s',X_t\sim p_{t|1}(\cdot|s')}\big[\|v_t(X_t|s,a;\theta) - u_{t|1}(X_t|s')\|^2\big] + \gamma\,\mathbb{E}_{t,s,a,s',X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_t\sim p_{t|1}(\cdot|X_1)}\big[\|v_t(X_t|s,a;\theta) - u_{t|1}(X_t|X_1)\|^2\big].$$
Computing the gradients of these quantities w.r.t. $\theta$, it is easy to check that $\hat g_{\mathrm{TD2\text{-}CFM}}$ and $\hat g_{\mathrm{TD\text{-}CFM}}$ are their unbiased estimates. Let us now analyze the total variation of these estimators. By assumption, $m_t^{(n)}(x \mid s,a) = \int p_{t|1}(x \mid x_1)\,m_1^{(n)}(x_1 \mid s,a)\,dx_1$, which implies that $X_t^{(n)}$ and $X_t$ follow the same law. Moreover, we obtain the following identities:
$$v_t^{(n)}(x \mid s',\pi(s')) = \mathbb{E}_{X_1|x,s'}\big[u_{t|1}(x \mid X_1)\big],$$
$$\hat g_{\mathrm{TD2\text{-}CFM}}(t,s,a,s',\tilde X_t, X_t) = \mathbb{E}_{X_1|X_t,s'}\big[\hat g_{\mathrm{TD\text{-}CFM}}(t,s,a,s',\tilde X_t, X_1, X_t)\big],$$
$$\mathbb{E}_{X_t\sim m_t^{(n)}(\cdot|s',\pi(s'))}\big[\hat g_{\mathrm{TD2\text{-}CFM}}(t,s,a,s',\tilde X_t, X_t)\big] = \mathbb{E}_{X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_t\sim p_{t|1}(\cdot|X_1)}\big[\hat g_{\mathrm{TD\text{-}CFM}}(t,s,a,s',\tilde X_t, X_1, X_t)\big],$$
where $X_1 \mid x,s' \sim \frac{p_{t|1}(x|X_1)\,m_1^{(n)}(X_1|s',\pi(s'))}{m_t^{(n)}(x|s',\pi(s'))}$ is the posterior distribution of $X_1$ given $x$ and $s'$. To simplify notation, we denote by $Y$ the random variable $(t,s,a,s',\tilde X_t, X_t)$. Using the decomposition of variance into conditional variances, $\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X|Y)] + \operatorname{Var}(\mathbb{E}[X|Y])$, we conclude that
$$\sigma^2_{\mathrm{TD\text{-}CFM}} = \operatorname{Tr}\operatorname{Cov}_{Y,X_1}\big[\hat g_{\mathrm{TD\text{-}CFM}}(Y,X_1)\big] = \operatorname{Tr}\operatorname{Cov}_Y\big[\mathbb{E}_{X_1|Y}[\hat g_{\mathrm{TD\text{-}CFM}}(Y,X_1)]\big] + \mathbb{E}_Y\Big[\operatorname{Tr}\operatorname{Cov}_{X_1|Y}\big[\hat g_{\mathrm{TD\text{-}CFM}}(Y,X_1)\big]\Big]$$
$$= \operatorname{Tr}\operatorname{Cov}_Y\big[\hat g_{\mathrm{TD2\text{-}CFM}}(Y)\big] + \gamma^2\,\mathbb{E}_Y\Big[\operatorname{Tr}\operatorname{Cov}_{X_1|Y}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|1}(X_t \mid X_1)\big)\Big] = \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\,\mathbb{E}_{t,s,a,X_t}\Big[\operatorname{Tr}\operatorname{Cov}_{X_1|s,a,X_t}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|1}(X_t \mid X_1)\big)\Big],$$
where the second equality uses that only the bootstrapped term of $\hat g_{\mathrm{TD\text{-}CFM}}$ depends on $X_1$, through $u_{t|1}(X_t \mid X_1)$.
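The variance gap in Theorem 7 is nothing more than the law of total variance applied to the bootstrapped regression target. The toy one-dimensional check below, with $m_1^{(n)}$ a Gaussian and the posterior mean approximated by binning $X_t$, is only an illustration under these simplifications, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
N, t = 200_000, 0.5

# toy bootstrapped term: X1 ~ m_1^{(n)}, Xt ~ N(t X1, (1-t)^2) as in the Gaussian path
x1 = rng.normal(1.0, 1.0, N)
xt = t * x1 + (1 - t) * rng.standard_normal(N)

# TD-CFM-style target is u_{t|1}(Xt|X1); TD2-CFM-style target is its posterior mean
u = (x1 - xt) / (1 - t)                    # u_{t|1} for the path N(t x1, (1-t)^2 I)

# posterior mean E[u | Xt] approximated by binning Xt (crude but illustrative)
bins = np.digitize(xt, np.quantile(xt, np.linspace(0, 1, 200)[1:-1]))
cond_mean = np.zeros_like(u)
for b in np.unique(bins):
    idx = bins == b
    cond_mean[idx] = u[idx].mean()

var_td = u.var()                           # variance with X1 resampled (TD-CFM analogue)
var_td2 = cond_mean.var()                  # variance of the marginal field (TD2-CFM analogue)
excess = ((u - cond_mean) ** 2).mean()     # E[Var(u | Xt)]
print(var_td, "~", var_td2 + excess)       # law of total variance, up to binning error
```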
Theorem 8. Let us define the random variable
$$\hat g_{\mathrm{TD\text{-}CFM(C)}}(t,s,a,s',\tilde X_t, X_0, X_1, X_t) := (1-\gamma)\,\nabla_\theta v_t(\tilde X_t|s,a;\theta)^\top\big(v_t(\tilde X_t|s,a;\theta) - u_{t|1}(\tilde X_t|s')\big) + \gamma\,\nabla_\theta v_t(X_t|s,a;\theta)^\top\big(v_t(X_t|s,a;\theta) - u_{t|0,1}(X_t \mid X_0,X_1)\big),$$
where $t\sim U([0,1])$, $(s,a)\sim\rho$, $s'\sim P(\cdot|s,a)$, $\tilde X_t\sim p_{t|1}(\cdot|s')$, $(X_0,X_1)\sim m^{(n)}_{0,1}(\cdot \mid s',\pi(s'))$, and $X_t\sim p_{t|0,1}(\cdot|X_0,X_1)$. Then $\hat g_{\mathrm{TD\text{-}CFM(C)}}$ is an unbiased estimate of the gradient $\nabla_\theta\ell_{\mathrm{TD\text{-}CFM(C)}}(\theta)$. Moreover, if we consider its total variation,
$$\sigma^2_{\mathrm{TD\text{-}CFM(C)}} = \operatorname{Tr}\operatorname{Cov}_{t,s,a,s',\tilde X_t,X_0,X_1,X_t}\big[\hat g_{\mathrm{TD\text{-}CFM(C)}}\big],$$
and we assume that $m_t^{(n)}(x \mid s,a) = \int\!\int p_{t|0,1}(x \mid x_0,x_1)\,m^{(n)}_{0,1}(x_0,x_1 \mid s,a)\,dx_0\,dx_1$, then we obtain
$$\sigma^2_{\mathrm{TD\text{-}CFM(C)}} = \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\,\mathbb{E}_{t,s,a,X_t}\Big[\operatorname{Tr}\operatorname{Cov}_{(X_0,X_1)|s,a,X_t}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|0,1}(X_t \mid X_0,X_1)\big)\Big].$$
Furthermore, if we use straight conditional paths, i.e., $p_{t|0,1}(x|x_0,x_1) = \delta\big(tx_1 + (1-t)x_0 - x\big)$, then
$$\sigma^2_{\mathrm{TD\text{-}CFM(C)}} \le \sigma^2_{\mathrm{TD2\text{-}CFM}} + \gamma^2\,\sup_{t,s,a,x}\|\nabla_\theta v_t(x|s,a;\theta)\|^2\;\mathbb{E}_{t,s,a,s',X_0,X_1,X_t}\Big[\big\|X_1 - X_0 - \mathbb{E}_{(X_1,X_0)|s,a,s',X_t}[X_1 - X_0]\big\|^2\Big].$$
In particular, when the paths of the linear interpolation $X_t$ do not intersect for any $s,a,s'$, we have $\mathbb{E}_{t,s,a,s',X_0,X_1,X_t}\big[\|X_1 - X_0 - \mathbb{E}_{(X_1,X_0)|s,a,s',X_t}[X_1 - X_0]\|^2\big] = 0$ and $\sigma^2_{\mathrm{TD\text{-}CFM(C)}} = \sigma^2_{\mathrm{TD2\text{-}CFM}}$.

Proof. The first two statements can be checked by repeating the proof of Theorem 7 with conditional paths $p_{t|0,1}$ and vector fields $u_{t|0,1}$. Let us thus prove the third statement. We know that the flow $\phi_t(x_0,x_1)$ that generates the conditional path $p_{t|0,1}(x|x_0,x_1) = \delta_{tx_1+(1-t)x_0}(x)$ is $\phi_t(x_0,x_1) = tx_1 + (1-t)x_0$. Its associated vector field $u_{t|0,1}$ is thus
$$u_{t|0,1}\big(\phi_t(x_0,x_1) \mid x_0,x_1\big) = \frac{d}{dt}\phi_t(x_0,x_1) = x_1 - x_0.$$
Therefore, denoting $Y = (t,s,a,s',\tilde X_t)$, we can bound the second term in the decomposition of $\sigma^2_{\mathrm{TD\text{-}CFM(C)}}$ as
$$\mathbb{E}_{Y,X_t}\Big[\operatorname{Tr}\operatorname{Cov}_{(X_0,X_1)|Y,X_t}\big(\nabla_\theta v_t(X_t|s,a;\theta)^\top u_{t|0,1}(X_t \mid X_0,X_1)\big)\Big] = \mathbb{E}_{Y,X_t}\,\mathbb{E}_{(X_0,X_1)|Y,X_t}\Big[\big\|\nabla_\theta v_t(X_t|s,a;\theta)^\top\big(u_{t|0,1}(X_t \mid X_0,X_1) - \mathbb{E}_{(X_0,X_1)|Y,X_t}[u_{t|0,1}(X_t \mid X_0,X_1)]\big)\big\|^2\Big]$$
$$\le \mathbb{E}_{Y,X_t}\Big[\|\nabla_\theta v_t(X_t|s,a;\theta)\|^2\,\mathbb{E}_{(X_0,X_1)|Y,X_t}\big\|X_1 - X_0 - \mathbb{E}_{(X_0,X_1)|Y,X_t}[X_1 - X_0]\big\|^2\Big] \le \sup_{t,s,a,x}\|\nabla_\theta v_t(x|s,a;\theta)\|^2\;\mathbb{E}_{Y,X_t}\,\mathbb{E}_{(X_0,X_1)|Y,X_t}\big\|X_1 - X_0 - \mathbb{E}_{(X_0,X_1)|Y,X_t}[X_1 - X_0]\big\|^2.$$
This proves the third statement. For the last point, simply note that if the paths generating $X_t$ do not cross, then the distribution of $X_0,X_1 \mid Y,X_t$ is supported on a single couple $(X_0,X_1)$, which means that its variance is zero.
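The last claim of Theorem 8 can be visualized directly: with straight paths, $(t, X_t)$ pins down $X_1 - X_0$ whenever the interpolants do not cross. The sketch below contrasts a monotone (hence non-crossing) coupling with an independent one; the conditional variance is estimated by binning $X_t$, so the zero is only approximate.

```python
import numpy as np

rng = np.random.default_rng(3)
N, t = 200_000, 0.5
x0 = rng.standard_normal(N)

# non-crossing coupling: X1 a monotone map of X0, so t X1 + (1-t) X0 is invertible in X0
x1_coupled = 2.0 * x0 + 1.0
# independent coupling with the same marginal for X1
x1_indep = 2.0 * rng.standard_normal(N) + 1.0

def avg_cond_var(x0, x1, t, nbins=200):
    """E[ Var(X1 - X0 | Xt) ] estimated by binning Xt."""
    xt = t * x1 + (1 - t) * x0
    u = x1 - x0
    edges = np.quantile(xt, np.linspace(0, 1, nbins + 1)[1:-1])
    bins = np.digitize(xt, edges)
    return np.mean([u[bins == b].var() for b in np.unique(bins)])

print("coupled (non-crossing):", avg_cond_var(x0, x1_coupled, t))   # ~ 0
print("independent coupling:  ", avg_cond_var(x0, x1_indep, t))     # strictly positive
```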
E.7. Transport Cost Analysis

Theorem 9. Assume that $m_t^{(n)}(x \mid s,a) = \int p_{t|1}(x \mid x_1)\,m_1^{(n)}(x_1 \mid s,a)\,dx_1$, where $p_{t|1}(\cdot \mid x_1) = \mathcal{N}(tx_1, (1-t)^2 I)$ is a Gaussian path. Then the conditional paths built by TD-CFM(C) and TD2-CFM to generate $m_1^{(n+1)} = T^\pi m_1^{(n)}$ induce a smaller transport cost than those built by TD-CFM (see the footnote below). Formally, for every $t,s,a$,
$$\mathbb{E}_{t,s,a,s',\,X_0\sim m_0,\;X_1\sim(1-\gamma)\delta_{s'}+\gamma\delta_{\psi_1^{(n)}(X_0|s',\pi(s'))}}\big[\|X_1 - X_0\|^2\big] \;\le\; \mathbb{E}_{t,s,a,s',\,X_0\sim m_0,\;X_1\sim[T^\pi m_1^{(n)}](\cdot|s,a)}\big[\|X_1 - X_0\|^2\big].$$

Proof. The paths generated by TD-CFM(C) and TD2-CFM induce the same transport cost, since both algorithms connect the endpoints of the ODE path $m_t^{(n)}$ in the bootstrapped term. Hence,
$$\mathbb{E}_{t,s,a,s',X_0\sim m_0,\;X_1\sim(1-\gamma)\delta_{s'}+\gamma\delta_{\psi_1^{(n)}(X_0|s',\pi(s'))}}\big[\|X_1 - X_0\|^2\big] = (1-\gamma)\,\mathbb{E}\big[\|s' - X_0\|^2\big] + \gamma\,\mathbb{E}\big[\|\psi_1^{(n)}(X_0 \mid s',\pi(s')) - X_0\|^2\big]$$
$$\overset{(a)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\,\mathbb{E}\Big\|\int_0^1 v_t^{(n)}\big(\psi_t^{(n)}(X_0 \mid s',\pi(s'))\big)\,dt\Big\|^2$$
$$\overset{(b)}{\le} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}\big\|v_t^{(n)}\big(\psi_t^{(n)}(X_0 \mid s',\pi(s'))\big)\big\|^2\,dt$$
$$\overset{(c)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_t\sim m_t^{(n)}(\cdot|s',\pi(s'))}\big\|v_t^{(n)}(X_t \mid s',\pi(s'))\big\|^2\,dt$$
$$\overset{(d)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_t\sim m_t^{(n)}(\cdot|s',\pi(s'))}\big\|\mathbb{E}_{X_1|s',X_t}\big[u_{t|1}(X_t \mid X_1)\big]\big\|^2\,dt$$
$$\overset{(e)}{\le} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_t\sim m_t^{(n)}(\cdot|s',\pi(s'))}\,\mathbb{E}_{X_1|s',X_t}\big\|u_{t|1}(X_t \mid X_1)\big\|^2\,dt$$
$$\overset{(f)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_t\sim p_{t|1}(\cdot|X_1)}\big\|u_{t|1}(X_t \mid X_1)\big\|^2\,dt$$
$$\overset{(g)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_0}\big\|u_{t|1}\big(tX_1 + (1-t)X_0 \mid X_1\big)\big\|^2\,dt$$
$$\overset{(h)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\int_0^1 \mathbb{E}_{X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_0}\big\|X_1 - X_0\big\|^2\,dt$$
$$\overset{(i)}{=} (1-\gamma)\,\mathbb{E}\|s' - X_0\|^2 + \gamma\,\mathbb{E}_{X_1\sim m_1^{(n)}(\cdot|s',\pi(s')),\,X_0}\big\|X_1 - X_0\big\|^2 \overset{(j)}{=} \mathbb{E}_{t,s,a,s',\,X_0\sim m_0,\;X_1\sim[T^\pi m_1^{(n)}](\cdot|s,a)}\big[\|X_1 - X_0\|^2\big],$$
where all outer expectations are over $t,s,a,s',X_0\sim m_0$; (a) uses the definition of the flow as the integral of its vector field; (b) uses the Cauchy-Schwarz inequality; (c) uses that $(\psi_t^{(n)})_{\#}m_0$ is the pushforward measure generating $m_t^{(n)}$; (d) defines $X_1 \mid x,s' \sim \frac{p_{t|1}(x|X_1)\,m_1^{(n)}(X_1|s',\pi(s'))}{m_t^{(n)}(x|s',\pi(s'))}$ as the posterior distribution of $X_1$ given $x,s'$ and uses that $v_t^{(n)}$ is in marginal form by assumption; (e) uses Jensen's inequality; (f) uses the tower property of expectations; (g) uses the definition of $p_{t|1}$ and the corresponding linear-interpolation flow $X_t = tX_1 + (1-t)X_0$ with $X_0\sim m_0 = \mathcal{N}(0,I)$; (h) uses the definition of $u_{t|1}$, for which $u_{t|1}(tx_1 + (1-t)x_0 \mid x_1) = x_1 - x_0$; (i) is trivial since the integrand no longer depends on $t$; and (j) simply combines the two terms using the definition of the Bellman operator $T^\pi$.

Footnote 6: Recall that, given a marginal probability path $m_t^{(n)}(x \mid s,a)$, the conditional probability path built by TD-CFM(C) and TD2-CFM to generate $T^\pi m_1^{(n)}$ is a linear interpolation between noise $X_0\sim m_0$ and $X_1\sim(1-\gamma)\delta_{s'} + \gamma\,\delta_{\psi_1^{(n)}(X_0|s',\pi(s'))}$, while the one built by TD-CFM is a linear interpolation between noise $X_0\sim m_0$ and a sample $X_1\sim[T^\pi m_1^{(n)}](\cdot \mid s,a)$ from the target distribution.
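A quick numerical reading of Theorem 9: reusing the same noise $X_0$ for the bootstrapped endpoint (as TD-CFM(C) and TD2-CFM do) yields a smaller quadratic transport cost than resampling $X_1$ independently from the same target (as TD-CFM does). The frozen flow endpoint below is a toy one-dimensional affine map; nothing here reproduces the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(4)
N, gamma = 200_000, 0.9
s_next = 0.5                                  # a fixed observed next state (toy, 1-D)
psi = lambda x0: 0.2 * x0 + 1.0               # stand-in for the frozen flow endpoint psi_1^{(n)}

x0 = rng.standard_normal(N)
coin = rng.random(N) < 1 - gamma

# TD-CFM(C) / TD2-CFM: bootstrapped endpoint rides the ODE started at the same X_0
x1_coupled = np.where(coin, s_next, psi(x0))
# TD-CFM: X_1 is an independent sample from the same target T^pi m_1^{(n)}
x1_indep = np.where(rng.random(N) < 1 - gamma, s_next, psi(rng.standard_normal(N)))

print("coupled cost  E||X1-X0||^2:", np.mean((x1_coupled - x0) ** 2))
print("independent cost          :", np.mean((x1_indep - x0) ** 2))   # larger
```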