# On the Guidance of Flow Matching

Ruiqi Feng¹, Chenglei Yu¹, Wenhao Deng¹, Peiyan Hu², Tailin Wu¹

Abstract: Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where generation under energy guidance (abbreviated as guidance in the following) is pivotal. However, the guidance of flow matching is more general than, and thus substantially different from, that of its predecessor, diffusion models. The challenge of guidance for general flow matching therefore remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching, including a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/flow_guidance.

*Equal contribution. ¹Department of Artificial Intelligence, Westlake University, Hangzhou, China. ²Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. Correspondence to: Tailin Wu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## 1. Introduction

Flow matching has emerged as a prominent class of generative models. It features the ability to use a vector field to transform samples from a source distribution into samples following a target distribution, thus realizing generative modeling (Lipman et al., 2023). The probability distribution the samples follow during the flow is called the probability path. By designing the probability path in a large design space, flow matching has shown improved generative-modeling fidelity as well as higher sampling efficiency in a variety of generative tasks including image generation (Lipman et al., 2023), decision-making (Zheng et al., 2023), audio generation, and molecular structure design (Gat et al., 2024; Chen & Lipman, 2024; Ben-Hamu et al., 2024).

Flow matching substantially extends diffusion models (Ho et al., 2020; Song et al., 2021). Most diffusion models leverage the score matching process (Song & Ermon, 2019; Song et al., 2020; 2021), inherently limiting them to using the Gaussian distribution as the source distribution to construct a special probability path. Meanwhile, flow matching can learn the mapping between any source and target distributions (Lipman et al., 2023; 2024; Chen & Lipman, 2024; Gat et al., 2024).

Guiding flow matching models refers to steering the generated samples toward desired properties, e.g., sampling from a distribution weighted with some objective function (Lu et al., 2023) or conditioned on class labels (Song et al., 2021).¹
Guidance is vital in many generative modeling applications (Song et al., 2023b; Zheng et al., 2023), but contrary to the well-studied guidance of diffusion models (Song et al., 2023b; Chung et al., 2023; Dhariwal & Nichol, 2021; Song et al., 2023a; Zheng et al., 2024; Lu et al., 2023; Dou & Song, 2024; Trippe et al., 2023), the guidance of flow matching remains less investigated. Most existing guidance methods apply only to a subset of flow matching models that assume a Gaussian source distribution and a probability path of a certain simple form (Lipman et al., 2024; Zheng et al., 2023; Pokle et al., 2024; Zhang et al., 2025; Kollovieh et al., 2025). In these cases, the guidance of flow matching can be simplified to essentially that of diffusion models, but flow matching's power of generating more flexible probability paths than diffusion models (Tong et al., 2024; Chen & Lipman, 2024; Gat et al., 2024) is restricted. There have been other controlled-generation methods for flow matching, a notable stream of which follows the paradigm of optimizing an objective function by differentiating through the sampling process (Ben-Hamu et al., 2024; Liu et al., 2023b; Wang et al., 2025). However, their goal differs from our guidance of weighting the generated distribution. Guidance for flow matching models therefore remains unexplored in the rest of the ample design space.

¹Mathematically, the two are essentially equivalent (Section 3.1).

To fill this gap, in this work, we start from an assumption similar to that of diffusion guidance and propose a general framework for flow matching guidance. From the perspective of this framework, we propose a Monte Carlo-based, training-free, asymptotically exact guidance for flow matching. We also propose different training losses for exact training-based guidance, one of which covers existing losses as special cases (Lu et al., 2023; Zhang et al., 2025). For approximate guidance methods, we theoretically derive from our framework many well-known guidance methods from the literature, including DPS (Chung et al., 2023), ΠGDM (Song et al., 2023a), and LGD (Song et al., 2023b), as well as their flow matching extensions, which are theoretically justified for general flow matching models. We demonstrate the effectiveness of our proposed methods on both synthetic datasets and decision-making (offline RL) benchmarks. Furthermore, more extensive experiments are conducted on image inverse problems to provide an empirical comparison of different types of guidance methods, yielding a guideline for choosing among them.

We summarize our contributions as follows:

1. We propose a theoretically justified unified framework to construct guidance for general flow matching, i.e., with arbitrary source distribution, coupling, and conditional paths.
2. The framework inspires a family of new guidance methods, including Monte Carlo sampling-based asymptotically exact guidance and training-based exact guidance for flow matching.
3. The framework exactly covers multiple classical guidance methods in flow matching and diffusion models. Contrary to previous derivations relying on the flow having a Gaussian source distribution, our derivation provides theoretical justification of these methods for general flow matching.
4. Empirical comparisons between guidance methods are conducted on different tasks, providing insights into choosing appropriate guidance methods for different generative modeling tasks.
## 2. Background

Let $\mathbb{R}^d$ denote the data space where the data samples $x_t \in \mathbb{R}^d$ live. The subscript $t \in [0,1]$ denotes inference time, such that $p_1(x_1)$ is the target distribution we want to generate from, and $p_0(x_0)$ is a base distribution that is easy to sample from. Flow-based generative models (Lipman et al., 2023; 2024) define a vector field $v_t(x_t): [0,1]\times\mathbb{R}^d \to \mathbb{R}^d$ that generates a probability path $p_t(x_t): [0,1]\times\mathbb{R}^d \to \mathbb{R}_{>0}$ connecting the tractable base distribution $p_0(x_0)$ and the target distribution $p_1(x_1)$. By first sampling $x_0$ from $p_0(x_0)$ and then solving the ordinary differential equation (ODE) $\frac{d}{dt}x_t = v_t(x_t)$, one can generate clean samples $x_1 := x_t|_{t=1}$ that follow the target distribution $p_1(x_1)$.

An efficient way to learn the vector field $v_t(\cdot)$ with a model $v_\theta(\cdot, t)$ is flow matching (Lipman et al., 2023; 2024; Tong et al., 2024). It works by first finding a conditional vector field $v_{t|z}(x_t|z)$ that generates a conditional probability path $p_t(x_t|z)$, where $z$ denotes sample pairs $(x_0, x_1)$.² The pairs (couplings) follow the probability distribution $p(z) = \pi(x_0, x_1)$.³ We use $(x_0, x_1)$ and $z$ interchangeably throughout the paper. It has been proved that the marginal vector field
$$v_t(x_t) := \int v_{t|z}(x_t|z)\, p(z|x_t)\, dz, \qquad \text{where } p(z|x_t) = \frac{p_t(x_t|z)\,p(z)}{p_t(x_t)},$$
generates the marginal probability path $p_t(x_t) = \int p_t(x_t|z)\,p(z)\,dz$ (Lipman et al., 2023). Thus, one only needs to fit the marginal vector field $v_t(x_t) = \int v_{t|z}(x_t|z)\,p_t(z|x_t)\,dz$ using a model $v_\theta(x_t, t)$. It is intuitive to construct the loss
$$\mathbb{E}_{t\sim U(0,1),\, x_t\sim p(x_t)}\Big[\big\|v_\theta(x_t,t) - \underbrace{v_t(x_t)}_{\text{intractable}}\big\|_2^2\Big],$$
which is, unfortunately, intractable. To cope with this problem, an equivalent conditional flow matching loss has been proposed (Lipman et al., 2023; Tong et al., 2024):
$$\mathbb{E}_{t\sim U(0,1),\, z\sim p(z),\, x_t\sim p(x_t|z)}\Big[\big\|v_\theta(x_t,t) - v_{t|z}(x_t|z)\big\|_2^2\Big],$$
which is tractable and can be used to train $v_\theta$.

²The notation $z$ can also represent $x_1$ alone (Lipman et al., 2024). Our analysis is in the general setting where $z = (x_0, x_1)$, but for ease of interpretation, one may consider $z$ as simply $x_1$.

³We use $\pi$ to denote the probability density of data couplings.
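To make the training recipe concrete, here is a minimal PyTorch sketch of the conditional flow matching loss for the common affine path $x_t = t\,x_1 + (1-t)\,x_0$, whose conditional VF is $v_{t|z}(x_t|z) = x_1 - x_0$. The toy network, source, and target distributions are illustrative assumptions, not the paper's released code.

```python
import torch

def cfm_loss(v_model, x0, x1):
    """Conditional flow matching loss for paired samples z = (x0, x1)."""
    t = torch.rand(x0.shape[0], 1)           # t ~ U(0, 1)
    xt = t * x1 + (1.0 - t) * x0             # x_t ~ p_t(x_t | z), noise-free affine path
    target = x1 - x0                         # conditional VF v_{t|z}(x_t | z)
    return ((v_model(xt, t) - target) ** 2).sum(dim=-1).mean()

# Usage: fit a tiny MLP between a non-Gaussian (uniform) source and a toy target.
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 2))
v_model = lambda x, t: net(torch.cat([x, t], dim=-1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    x0 = torch.rand(256, 2) * 2 - 1          # uniform source on [-1, 1]^2
    x1 = torch.randn(256, 2) * 0.3 + 2.0     # toy target samples
    loss = cfm_loss(v_model, x0, x1)
    opt.zero_grad(); loss.backward(); opt.step()
```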
## 3. Guidance Vector Fields

This section is organized as follows. Section 3.1 proposes the general flow matching guidance framework. In Section 3.2, we derive a new guidance method $g^{\mathrm{MC}}$ based on Monte Carlo estimation, which is asymptotically exact. In Section 3.3, we derive a guidance $g_t^{\mathrm{local}}$ proportional to the gradient of the energy function $J$ by approximating $g_t$ with a Taylor expansion. Then, we introduce the affine path assumption (Assumption 3.2) to obtain a tractable $g^{\mathrm{cov}} \approx g_t^{\mathrm{local}}$, and show that under the stronger uncoupled affine Gaussian path assumption (Assumption 3.3), $g^{\mathrm{local}}$ covers classical diffusion guidance methods. Section 3.4 introduces an alternative approximation in which $p(z|x_t)$ is a Gaussian; under this approximation and the affine path assumption (Assumption 3.2), we derive a derivative-free guidance $g^{\text{sim-MC}}$ and a specialized guidance $g^{\text{sim-A}}$ for inverse problems. Finally, Section 3.5 provides different training losses for training-based exact guidance $g_\phi$. Figure 1 visualizes this outline and the relations between these guidance methods.

Figure 1: Overview of guidance methods in the paper. We start with a unified guidance expression and derive different guidance methods, including training-free and training-based methods, and cover many classical diffusion guidance methods.

### 3.1. General Guidance Vector Fields

Guided generative modeling aims to generate samples according to specific requirements, and energy-guided sampling is one of the primary approaches. Appendix A.1 shows that energy guidance is fundamentally equivalent to conditional sampling or posterior sampling. Accordingly, we use the notation of energy guidance throughout, with the theoretical results directly extending to these tasks. Given an energy function $J: \mathbb{R}^d \to \mathbb{R}$ and a pre-trained generative model for $p(x)$, energy-guided samples follow $x \sim p'(x) = \frac{1}{Z}p(x)e^{-J(x)}$, where $Z = \int p(x)e^{-J(x)}\,dx$ is the normalizing constant. Samples with lower energy values $J$ are more likely to be generated. Thus, the problem of guidance for flow matching models becomes: how can we alter the original vector field (VF) $v_t$ that generates $p(x)$ such that the new VF $v'_t$⁴ generates samples from the new distribution $p'(x) = \frac{1}{Z}p(x)e^{-J(x)}$?

A natural choice, commonly used in diffusion models (Dhariwal & Nichol, 2021), is to add a guidance VF $g_t(x_t): [0,1]\times\mathbb{R}^d \to \mathbb{R}^d$ to the original VF $v_t(x_t)$, such that the new VF $v'_t(x_t) = g_t(x_t) + v_t(x_t)$ is a VF formed with the same flow matching hyperparameters but generating the new probability path arriving at the new distribution $p'(x)$. Therefore,
$$g_t(x_t) = \int v'_{t|z}(x_t|z)\,p'(z|x_t)\,dz - \int v_{t|z}(x_t|z)\,p(z|x_t)\,dz,$$
where $p(z|x_t) = \frac{p_t(x_t|z)p(z)}{p(x_t)}$ and $p'(z|x_t) = \frac{p'_t(x_t|z)p'(z)}{p'(x_t)}$. Recall that the new VF $v'_t(x_t)$ has the same conditional probability path and conditional VF as the original VF, i.e., $v'_{t|z}(x_t|z) = v_{t|z}(x_t|z)$ and $p'_t(x_t|z) = p_t(x_t|z)$. Then, we have the following theorem (proof in Appendix A.2).

**Theorem 3.1.** Adding the guidance VF $g_t(x_t)$ to the original VF $v_t(x_t)$ will form a VF $v'_t(x_t)$ that generates $p'_t(x_t) = \int p_t(x_t|z)\,p'(z)\,dz$, as long as $g_t(x_t)$ follows:
$$g_t(x_t) = \int P\left(\frac{e^{-J(x_1)}}{Z_t(x_t)} - 1\right) v_{t|z}(x_t|z)\, p(z|x_t)\, dz, \tag{1}$$
where
$$Z_t(x_t) = \int P\, e^{-J(x_1)}\, p(z|x_t)\, dz, \tag{2}$$
and $P = \frac{\pi'(x_0|x_1)}{\pi(x_0|x_1)}$ is the reverse coupling ratio, where $\pi'(x_0|x_1)$ is the reverse data coupling for the new VF, i.e., the distribution of $x_0$ given $x_1$ sampled from the target distribution.

⁴The prime symbol denotes probability distributions and vector fields corresponding to the new distribution $p'(x)$.

In this paper, we consider the case where $P$ is, or can be approximated as, $1$. $P$ is exactly $1$ when the coupling is independent, $\pi(x_0,x_1) = p(x_0)p(x_1)$, which results in $\pi'(x_0|x_1) = \pi(x_0|x_1) = p(x_0)$. $P \approx 1$ is also reasonable in many practical flow matching methods with dependent couplings, such as mini-batch optimal transport (OT) conditional flow matching (Tong et al., 2024), which we elaborate on in Appendix A.3. However, there may be cases where the VF of dependent couplings differs significantly from that of independent couplings; in these cases, the impact of $P$ can be non-negligible, which we leave for future work. Later, we will show this approximation allows us to cover many existing guidance techniques for flow matching and diffusion models.

Uncoupled affine Gaussian path flow matching, where $\pi(x_0,x_1) = p(x_0)p(x_1)$, the conditional path is affine ($x_t = \alpha_t x_1 + \beta_t x_0$), and $p(x_0) = \mathcal{N}(x_0; \mu, \Sigma)$, is known to be equivalent to diffusion models up to a difference in the noise schedule (Zheng et al., 2023; Ma et al., 2024). In this case, our general guidance for flow matching, Eq. (1), reduces to a commonly used guidance term in diffusion models: $g_t(x_t) \propto \nabla_{x_t}\log Z_t(x_t)$ (proof in Appendix A.4). Thus, most existing works that only consider uncoupled affine Gaussian path flow matching essentially apply the same guidance techniques as diffusion models (Dhariwal & Nichol, 2021; Song et al., 2023b; Chung et al., 2023). Next, we explore more challenging scenarios of flow matching guidance, substantially different from diffusion guidance: either the coupling is dependent, or the source distribution is non-Gaussian.
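Throughout, guided sampling only changes the VF fed to the ODE solver: we integrate $v'_t = v_t + g_t$ instead of $v_t$. A minimal sketch with an Euler solver (the solver choice and the `guidance_fn` callback are assumptions of this sketch; any of the guidance VFs derived below can be plugged in):

```python
import torch

def sample_guided(v_model, guidance_fn, x0, n_steps=100):
    """Euler integration of dx/dt = v_t(x) + g_t(x) from t = 0 to t = 1."""
    xt, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((xt.shape[0], 1), (i + 0.5) * dt)  # midpoint time stamp
        with torch.no_grad():
            v = v_model(xt, t)
        g = guidance_fn(xt, t)           # may use autograd internally (e.g., g^{cov-G})
        xt = (xt + dt * (v + g)).detach()
    return xt                            # approximately ~ p'(x) = p(x) exp(-J(x)) / Z
```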
### 3.2. Monte Carlo Estimation, $g_t^{\mathrm{MC}}$

To start with, we discuss the Monte Carlo (MC) method to estimate the guidance $g_t(x_t)$ in Eq. (1), which is asymptotically exact while being training-free. MC estimation of the integrals in Eq. (1) and Eq. (2) requires sampling from the intractable $p(z|x_t)$, but they can be converted into:
$$g_t^{\mathrm{MC}}(x_t) = \mathbb{E}_{(x_0,x_1)\sim p(z)}\left[\left(\frac{e^{-J(x_1)}}{Z_t(x_t)} - 1\right) v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)}{p_t(x_t)}\right], \tag{3}$$
$$Z_t(x_t) = \mathbb{E}_{(x_0,x_1)\sim p(z)}\left[e^{-J(x_1)}\,\frac{p_t(x_t|z)}{p_t(x_t)}\right],$$
where only $p_t(x_t)$ remains intractable. We can use the same MC samples to self-normalize the distribution $p_t(x_t|z)$, using $p_t(x_t) = \mathbb{E}_{(x_0,x_1)\sim p(z)}[p_t(x_t|z)]$. In the expressions above, $p_t(x_t|z)$ is usually designed to be a simple, known distribution (Tong et al., 2024), and $p(z)$ can be sampled either using the learned generative model or from the training distribution if accessible. The above method can be understood as executing importance sampling to convert the expectation under the intractable distribution, $\mathbb{E}_{z\sim p(z|x_t)}[\,\cdot\,]$, into one under a tractable distribution, $\mathbb{E}_{z\sim p(z)}\big[\frac{p(x_t|z)}{p(x_t)}(\,\cdot\,)\big]$. The pseudocode for computing $g_t^{\mathrm{MC}}(x_t)$ can be found in Algorithm 1, and a simplified version of $g^{\mathrm{MC}}$ under the assumption of independent coupling is provided in Appendix A.8; a compact sketch of the estimator is also given below.
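In the spirit of Algorithm 1 (whose exact pseudocode lives in the paper's appendix), a minimal self-normalized estimator under the affine Gaussian conditional path $p_t(x_t|z) = \mathcal{N}(x_t;\, t\,x_1 + (1-t)\,x_0,\, \sigma^2 I)$ might look as follows. `J`, `sigma`, and the coupling sampler are assumptions of this sketch; the softmax makes the constant Gaussian normalizer cancel.

```python
import torch

def g_mc(xt, t, x0s, x1s, J, sigma=0.05):
    """xt: (d,); t: float; (x0s, x1s): N coupling samples z^i ~ p(z), shape (N, d)."""
    mean = t * x1s + (1.0 - t) * x0s                      # mean of p_t(xt | z^i)
    logp = -((xt - mean) ** 2).sum(-1) / (2 * sigma ** 2) # log p_t(xt | z^i) + const
    w = torch.softmax(logp, dim=0)                        # w_i ∝ p_t(xt|z^i): self-normalization
    e = torch.exp(-J(x1s))                                # e^{-J(x1^i)}, shape (N,)
    Z_hat = (w * e).sum()                                 # MC estimate of Z_t(xt)
    v_cond = x1s - x0s                                    # conditional VFs v_{t|z}(xt | z^i)
    return ((w * (e / Z_hat - 1.0)).unsqueeze(-1) * v_cond).sum(0)
```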
It should be noted that this method is unbiased and applicable to any source distribution. This enables guidance for flow matching with different source distributions such as uniform (Chen & Lipman, 2024), Gaussian process (Andrae et al., 2025), mixture of Gaussians (Hiranaka et al., 2025; Papamakarios et al., 2017), and others (Mathieu & Nickel, 2020; Stimper et al., 2022). The method can also be applied to flow matching with dependent couplings of $(x_0, x_1)$, e.g., optimal transport couplings (Tong et al., 2024) and rectified flow (Liu et al., 2023a). To our knowledge, $g_t^{\mathrm{MC}}$ is the first asymptotically exact, training-free estimation of the guidance VF for flow matching with a non-Gaussian source distribution.

However, due to the high variance of MC estimation given a limited number of samples, $g^{\mathrm{MC}}$ is more suitable for tasks where the energy function $J$ varies gently and the data is relatively low-dimensional. Since the efficiency of $g^{\mathrm{MC}}$ is restricted by this high variance, it performs unsatisfactorily in high-dimensional generation tasks like image inverse problems. To improve the scalability of $g^{\mathrm{MC}}$, many techniques can be readily applied. For example, we can adopt importance sampling in Eq. (3) to reduce the variance of the MC estimation:
$$g_t^{\text{MC-IS}}(x_t) = \mathbb{E}_{(x_0,x_1)\sim \tilde p(z)}\left[\frac{p(z)}{\tilde p(z)}\left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)\frac{p_t(x_t|z)}{p_t(x_t)}\, v_{t|z}(x_t|z)\right],$$
$$Z_t(x_t) = \mathbb{E}_{(x_0,x_1)\sim \tilde p(z)}\left[\frac{p(z)}{\tilde p(z)}\, e^{-J(x_1)}\,\frac{p_t(x_t|z)}{p_t(x_t)}\right],$$
where $\tilde p(z)$ is an alternative distribution that can be sampled from. To enhance the scalability of $g^{\text{MC-IS}}$, we need to select a $\tilde p$ such that $e^{-J(x_1)}p(z)/\tilde p(z)$ has low variance, i.e., where $e^{-J(x_1)}p(z)$ is large, $\tilde p(z)$ should also be large, and vice versa. This can be achieved by sampling using another guided VF, such as those to be proposed subsequently, which generates samples close to (but not exactly following) $\frac{1}{Z}p(x_1)e^{-J(x_1)}$. The probability density ratio $p(z)/\tilde p(z)$ can be estimated using, for example, the Hutchinson trace estimator to preserve scalability (Song et al., 2021). It should be noted that this estimation is still unbiased.⁵

⁵We leave empirical investigation into $g_t^{\text{MC-IS}}$ to future work.

### 3.3. Localized $p(x_1|x_t)$

Many practical guidance methods rely on approximations (Song et al., 2023b; Pokle et al., 2024), so contrary to the unbiased MC estimation of $g_t$, we investigate approximate (and thus biased) guidance methods in this subsection. We start from the intuitive assumption that the probability mass of $p(x_1|x_t)$ is centered around its mean. Following this, it is natural to approximate the integrals in Eq. (1) with a Taylor expansion that captures local behavior around the mean. Thus, Eq. (1) can be simplified and becomes tractable. First, we approximate the normalizing constant $Z_t(x_t)$ in Eq. (2) as (Appendix A.9):
$$Z_t(x_t) = \int p(z|x_t)\,e^{-J(x_1)}\,dz \approx e^{-J(\hat x_1)}, \tag{4}$$
where $\hat x_1 := \mathbb{E}_{(x_0,x_1)\sim p(z|x_t)}[x_1]$. Likewise, Eq. (1) can be approximated as $g_t(x_t) \approx g_t^{\mathrm{local}}$, with
$$g_t^{\mathrm{local}}(x_t) = -\,\mathbb{E}_{z\sim p(z|x_t)}\!\left[v_{t|z}(x_t|z)\,(x_1 - \hat x_1)^{\top}\right]\nabla_{\hat x_1} J(\hat x_1),$$
where $\hat x_1$ is defined as above. The approximation error $\|\delta g\|_2^2$ can be bounded: $\|\delta g\|_2^2 \le \big|\lambda_h \sigma_1 d \,/\, e^{-J(\hat x_1)}\big|^2 (C_1 + C_2)$ (Appendix A.9), where $\lambda_h$ is the maximum eigenvalue of the Hessian of $e^{-J(x)}$, $\sigma_1$ is the $L_2$ norm of the covariance matrix of the distribution $p(x_1|x_t)$, $d$ is the dimensionality of the data, and $C_1, C_2$ are constants depending on the variance of the original conditional VF and the guided VF. The error bound gives insights into the approximation quality of $g_t^{\mathrm{local}}$ (detailed discussion in Appendix A.9):

- The error is small when $J$ is smooth, in which case the Hessian of $e^{-J(x)}$ approaches zero. This corresponds to mild guidance, where the approximation-based $g^{\mathrm{local}}$ works well.
- The error is small when $\sigma_1$ is small, i.e., the covariance matrix $\Sigma_{11}$ has a small Frobenius norm. This is the case when the flow time $t \to 1$ (and $\sigma_t \to 0$), where $x_t$ predicts $x_1$ well.

The gradient in $g_t^{\mathrm{local}}$ is a natural outcome of the Taylor expansion near the mean of $p(x_1|x_t)$. $g_t^{\mathrm{local}}$ is not only applicable to more general flow matching but also originates differently from diffusion guidance (Dhariwal & Nichol, 2021), where the gradient emerges from the score function $\nabla_{x_t}\log p(x_t)$. Moreover, the error bound we provide here is more practical than those previously proposed for diffusion guidance, e.g., the Jensen gap in diffusion posterior sampling (DPS) (Chung et al., 2023), which only bounds the error in $\mathbb{E}[e^{-J(x_1)}]$, whereas it is $\nabla_{x_t}\log Z_t$ that is the guidance VF, whose error is not bounded.

In order to obtain $\hat x_1$, we need the following assumption:

**Assumption 3.2** (the affine path assumption). We assume the conditional probability path to be affine, i.e., $x_t = \alpha_t x_1 + \beta_t x_0 + \sigma_t \varepsilon$, where $\varepsilon$ is a random noise and $\sigma_t$, $\dot\sigma_t$ are both sufficiently small.⁶

⁶This choice is a widely used one (Lipman et al., 2023; Tong et al., 2024). With a small random noise $\sigma_t\varepsilon$, $x_t$ under the conditional VF flows almost exactly from $x_0$ to $x_1$. Note that this assumption does not prevent the samples from having dependent coupling: $\pi(x_0|x_1) \ne p(x_0)$.

Under Assumption 3.2, we can use the $x_1$-parameterization (Lipman et al., 2024) to express $\hat x_1$ with the VF $v_t$ that is learned by the model $v_\theta$ (Appendix A.10):
$$\hat x_1 \approx \frac{-\dot\beta_t\, x_t + \beta_t\, v_t}{\dot\alpha_t\beta_t - \alpha_t\dot\beta_t}. \tag{5}$$
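Eq. (5) in code, for a general affine schedule and for the common specialization discussed next; `v_model` is the trained VF, an assumption of this sketch.

```python
def x1_hat_general(xt, v, a, da, b, db):
    """Eq. (5): a, b are alpha_t, beta_t; da, db their time derivatives."""
    return (-db * xt + b * v) / (da * b - a * db)

def x1_hat(v_model, xt, t):
    """Specialization to alpha_t = t, beta_t = 1 - t: the 1-step Euler prediction."""
    return xt + (1.0 - t) * v_model(xt, t)
```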
With the commonly chosen schedule $\alpha_t = t$, $\beta_t = 1-t$ (Lipman et al., 2023; Tong et al., 2024), $\hat x_1 = x_t + (1-t)\,v_t$ coincides with the 1-step generated $x_1$ under the Euler scheme. Also under this affine path assumption, $g_t^{\mathrm{local}}$ can be expressed with the covariance matrix of $p(x_1|x_t)$, $\Sigma_{1|t}$, and the gradient $\nabla_{\hat x_1}J(\hat x_1)$ (proof in Appendix A.11):
$$g_t^{\mathrm{cov}} = -\underbrace{\frac{\dot\alpha_t\beta_t - \alpha_t\dot\beta_t}{\beta_t}}_{\text{schedule}}\;\Sigma_{1|t}\,\nabla_{\hat x_1} J(\hat x_1). \tag{6}$$
Intuitively, the guidance is the gradient of the estimated $J$ preconditioned with the covariance matrix of $p(z|x_t)$. The preconditioning squeezes the guidance vector onto the $p(z|x_t)$ manifold. Next, we discuss different ways to obtain this covariance term, resulting in $g^{\text{cov-A}}$, $g^{\text{cov-L}}$, and $g^{\text{cov-G}}$.

The simplest way is to approximate the covariance matrix with a manually set schedule $\lambda_t I$, resulting in $g_t^{\text{cov-A}}$. This allows us to tune the guidance's schedule, a common practice in diffusion model guidance (Song et al., 2023b;a):
$$g_t^{\text{cov-A}} = -\lambda_t^{\text{cov-A}}\,\nabla_{\hat x_1} J(\hat x_1),$$
where the original schedule in $g_t^{\mathrm{cov}}$ is already absorbed into the hyperparameter $\lambda_t^{\text{cov-A}}$. Since $p(x_1|x_t)$ is localized when $t\to 1$, a general guideline is to set $\lambda_t^{\text{cov-A}}$ decaying. $g_t^{\text{cov-A}}$ only assumes the affine path (Assumption 3.2), and is thus theoretically justified for mini-batch optimal transport conditional flow matching (Tong et al., 2024), for which few existing guidance methods have been proposed. Besides, $g_t^{\text{cov-A}}$ is efficient, as its computation involves no extra number of function evaluations (NFE) compared to unguided sampling.

Alternatively, we can use Proposition 3.5 to train a model to fit the actual covariance matrix of $p(x_1|x_t)$ and acquire $g_t^{\text{cov-L}}$. Note that this covariance matrix is determined by the original distribution and is agnostic to the energy function $J$. The original flow matching essentially learns $\mathbb{E}_{x_1\sim p(x_1|x_t)}[x_1]$; to achieve better approximate guided generation, more detailed information about the distribution $p(x_1|x_t)$, namely its covariance, also needs to be learned.

The important special case where flow matching is a diffusion model with a new schedule is formalized with:

**Assumption 3.3** (the uncoupled affine Gaussian path assumption). In addition to Assumption 3.2, the source distribution $p(x_0)$ is a standard Gaussian and the coupling is independent, $\pi(x_0,x_1) = p(x_0)p(x_1)$.

We will demonstrate that under Assumption 3.3, $g_t^{\mathrm{cov}}$ simplifies into $g_t^{\text{cov-G}}$ (Eq. (7)), which covers classical guidance methods in diffusion models. Specifically, using the second-order Tweedie's formula, we can express the covariance matrix $\Sigma_{1|t}$ in Eq. (6) in terms of the Hessian of the log-probability, $\nabla_{x_t}\nabla_{x_t}\log p(x_t)$ (Rozet et al., 2024; Boys et al., 2023; Ye et al., 2024). Then, under Assumption 3.3, $\nabla_{x_t}\log p(x_t)$ depends affinely on the VF $v_t$, which by Eq. (5) depends affinely on $\hat x_1$. Therefore, the covariance matrix $\Sigma_{1|t}$ can be expressed with the Jacobian matrix $\frac{\partial \hat x_1}{\partial x_t}$. We refer to this relationship as the Jacobian trick (proof in Appendix A.12):

**Proposition 3.4** (the Jacobian trick). Under Assumption 3.3, the covariance matrix of $p(x_1|x_t)$, $\Sigma_{1|t}$, depends affinely on the Jacobian of the VF, $\frac{\partial v_t}{\partial x_t}$, and is proportional to the Jacobian $\frac{\partial \hat x_1}{\partial x_t}$:
$$\Sigma_{1|t} = \frac{\beta_t^2}{\alpha_t(\dot\alpha_t\beta_t - \alpha_t\dot\beta_t)}\left(-\dot\beta_t I + \beta_t\,\frac{\partial v_t}{\partial x_t}\right) = \frac{\beta_t^2}{\alpha_t}\,\frac{\partial \hat x_1}{\partial x_t}.$$

Inserting back into Eq. (6), we have:
$$g_t^{\text{cov-G}} = -\lambda_t^{\text{cov-G}}\,\nabla_{x_t} J(\hat x_1), \tag{7}$$
where $\lambda_t^{\text{cov-G}} = \beta_t(\dot\alpha_t\beta_t - \alpha_t\dot\beta_t)/\alpha_t$ is the schedule. $g_t^{\text{cov-G}}$ covers classical diffusion model guidance, including loss-guided diffusion⁷ (LGD) (Song et al., 2023b) and diffusion posterior sampling (DPS) (Chung et al., 2023). In diffusion models, the guidance can be expressed as $\nabla_{x_t}\log \mathbb{E}_{z\sim p(z|x_t)}[e^{-J(x_1)}]$; DPS approximates this by neglecting the gap in Jensen's inequality (Chung et al., 2023) to move the expectation into $e^{-J(\cdot)}$, and LGD approximates the expectation with a point estimate. Both methods arrive at a guidance along $\nabla_{x_t} J(\hat x_1)$, which is covered by our Eq. (7).

⁷We refer to the simplest version, LGD-MC ($n=1$), as LGD here. LGD-MC with $n>1$ will be covered in Section 3.4.
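As a concrete illustration under the common schedule $\alpha_t = t$, $\beta_t = 1-t$ (where $\lambda_t^{\text{cov-G}} = (1-t)/t$), here is a minimal sketch of $g^{\text{cov-G}}$. The sign convention (descending the energy $J$) and the clipping of the schedule near $t=0$ are assumptions of this sketch; differentiating through the VF model is what distinguishes $g^{\text{cov-G}}$ from the cheaper $g^{\text{cov-A}}$, which would replace `lam` by a hand-tuned decaying schedule and take the gradient w.r.t. $\hat x_1$ only.

```python
import torch

def g_cov_g(v_model, xt, t, J, t_min=1e-3):
    """Gradient guidance -lambda_t * grad_{x_t} J(x1_hat(x_t)), cf. Eq. (7)."""
    xt = xt.detach().requires_grad_(True)
    x1 = xt + (1.0 - t) * v_model(xt, t)             # x1_hat via Eq. (5)
    grad = torch.autograd.grad(J(x1).sum(), xt)[0]   # backprop through the VF model
    lam = (1.0 - t) / max(t, t_min)                  # schedule, clipped near t = 0
    return -lam * grad                               # push samples toward lower energy
```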
### 3.4. Gaussian Approximation of $p(z|x_t)$

Instead of approximating the guidance by expanding it near its mean, we can alternatively approximate $p(z|x_t)$ with a Gaussian distribution $p(z|x_t) \approx \mathcal{N}(z;\, \mu_t(x_t),\, \Sigma_t(x_t))$, and obtain $g_t \approx g_t^{\text{sim}}$. To minimize the KL divergence between the approximate distribution and the original distribution, we need to match $\mu_t(x_t)$ and $\Sigma_t(x_t)$ to the moments of the actual distribution $p(z|x_t)$ (Rozet et al., 2024; Boys et al., 2023). Under Assumption 3.2, $\hat x_1$ can be estimated using Eq. (5), $\hat x_0$ can be estimated similarly (Appendix A.10), and $\Sigma_t(x_t)$ can either be set as a hyperparameter, as in $g_t^{\text{cov-A}}$, or computed as in $g_t^{\text{cov-G}}$ for uncoupled affine Gaussian path flow matching.

When $J$ has no known analytical expression, we can use Monte Carlo (MC) sampling to estimate $g_t^{\text{sim}} \approx g_t^{\text{sim-MC}}$, since we can sample from the Gaussian approximation of $p(z|x_t)$:
$$g_t^{\text{sim-MC}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{e^{-J(x_1^i)}}{\hat Z_t} - 1\right) v_{t|z}(x_t|z^i), \qquad z^i \sim \mathcal{N}(\mu_t(x_t), \Sigma_t(x_t)), \tag{8}$$
where $\hat Z_t := \frac{1}{N}\sum_{i=1}^{N} e^{-J(x_1^i)}$ is an estimated normalizing constant. This shares the spirit of LGD-MC (Song et al., 2023b), which uses MC to estimate $Z_t$ and then computes the diffusion guidance $\nabla_{x_t}\log Z_t$.

In some guided generation tasks, like inverse problems, where $J$ is known analytically, we can derive analytical expressions of $g_t^{\text{sim}}$. For example, when the measurement involves a known degradation operator $H$ applied to $x$, followed by Gaussian noise with scale $\sigma_y$ added to $Hx$, we have $J(x_1) = \frac{1}{2\sigma_y^2}\|y - Hx_1\|_2^2$. Thus we can derive $g_t^{\text{sim}}$ under the affine path assumption (Assumption 3.2), and propose a practical approximation $g_t^{\text{sim-inv-A}}$ (Appendix A.13):
$$g_t^{\text{sim-inv-A}} = \lambda_t\left(\frac{\sigma_y^2}{r_t^2}\, I + H^{\top}H\right)^{-1} H^{\top}\big(y - H\hat x_1\big), \tag{9}$$
where $\lambda_t$ and $r_t$ are both hyperparameters. $g_t^{\text{sim-inv-A}}$ extends ΠGDM (Song et al., 2023a) to affine flow matching but requires the further approximation $\frac{\partial \hat x_1}{\partial x_t} \approx I$, which is accurate when $t \to 1$. In the uncoupled affine Gaussian path setting (Assumption 3.3), $g_t^{\text{sim}}$ in our framework exactly covers ΠGDM and OT-ODE (Song et al., 2023a; Pokle et al., 2024) (Appendix A.13). Note that our $g_t^{\text{sim}}$ is theoretically justified for dependent couplings, such as optimal transport conditional flow matching (OT-CFM) (Tong et al., 2024).
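A minimal sketch of Eq. (9) for a linear inverse problem $y = Hx + \sigma_y\,\text{noise}$. A dense $H$ is used for clarity (image-scale operators would use matrix-free solves); `lam_t` and `r_t` are the schedule hyperparameters from Eq. (9), and `x1_hat` is the one-step prediction from Eq. (5).

```python
import torch

def g_sim_inv_a(x1_hat, y, H, sigma_y, r_t, lam_t):
    """Eq. (9): lam_t * ((sigma_y^2 / r_t^2) I + H^T H)^{-1} H^T (y - H x1_hat)."""
    d = H.shape[1]
    A = (sigma_y ** 2 / r_t ** 2) * torch.eye(d) + H.T @ H
    return lam_t * torch.linalg.solve(A, H.T @ (y - H @ x1_hat))
```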
### 3.5. Training-based Guidance $g_\phi$

Previously, we have discussed training-free guidance methods. In this subsection, we discuss how to train a neural network $g_\phi$ to fit the guidance $g_t$. To construct a tractable training loss, we extend the conditional loss in flow matching to arbitrary conditional variables in the following proposition (proof in Appendix A.5).

**Proposition 3.5.** Any marginal variable $f(x_t,t) := \mathbb{E}_{z\sim p_t(z|x_t)}[f_{t|z}(x_t,z,t)]$, $z = (x_0,x_1)$, has an intractable marginal loss
$$\mathcal{L}_t = \mathbb{E}_{x_t\sim p(x_t)}\Big[\big\|f_\theta(x_t,t) - \mathbb{E}_{z\sim p_t(z|x_t)}[f_{t|z}(x_t,z,t)]\big\|_2^2\Big],$$
whose gradient w.r.t. $\theta$ is identical to that of the tractable conditional loss
$$\mathcal{L}_{t|z} = \mathbb{E}_{(x_t,z)\sim p(x_t,z)}\Big[\big\|f_\theta(x_t,t) - f_{t|z}(x_t,z,t)\big\|_2^2\Big].$$

**Guidance Matching.** Based on Proposition 3.5, we can train $g_\phi$ by first learning a surrogate model $Z_{\phi_Z}(x_t,t) \approx Z_t$, and then training $g_\phi \approx g_t$ with the following guidance matching (GM) losses:
$$\mathcal{L}_{\phi_Z,\phi} = \mathbb{E}_{t\sim U(0,1),\, z\sim p(z),\, x_t\sim p(x_t|z)}\big[\ell_{\phi_Z} + \ell^{\mathrm{GM}}_{\phi}\big], \tag{10}$$
$$\ell^{\mathrm{GM}}_{\phi} = \left\|g_\phi(x_t,t) - \left(\frac{e^{-J(x_1)}}{Z_{\phi_Z,\mathrm{sg}}(x_t,t)} - 1\right) v_{t|z}(x_t|z)\right\|_2^2, \tag{11}$$
$$\ell_{\phi_Z} = \left\|Z_{\phi_Z}(x_t,t) - e^{-J(x_1)}\right\|_2^2,$$
where sg denotes stopping the gradient in automatic differentiation. We prove in Appendix A.7 that the minimizer of $\mathcal{L}_{\phi_Z,\phi}$ is indeed $Z_{\phi_Z} = Z_t$ and $g_\phi = g_t$.

In fact, there are other methods to learn $Z_{\phi_Z}$. In the diffusion model literature, Lu et al. (2023) proposed contrastive learning to obtain $Z_t(x_t)$, which can be extended to general flow matching paths (Appendix A.6). Alternatively, Monte Carlo estimation can be applied to obtain $Z_{\phi_Z}$ (Appendix A.6). As for learning $g_\phi$, alternative to the loss $\ell^{\mathrm{GM}}_\phi$ in Eq. (11), there are other losses $\ell^{\mathrm{VGM}}$, $\ell^{\mathrm{RGM}}$, and $\ell^{\mathrm{MRGM}}$ which, when substituted into Eq. (10), produce the same minimizer. We provide detailed analysis and proofs for all these losses in Appendix A.7. Notably, the $\ell^{\mathrm{MRGM}}$ that we derive is identical to a newly proposed training-based flow matching guidance loss in Zhang et al. (2025). For uncoupled affine Gaussian path flow matching (Assumption 3.3), since the guidance is $g_t \propto \nabla_{x_t}\log Z_t(x_t)$ (Appendix A.4), learning $Z_{\phi_Z}$ in Eq. (11) is adequate, and Lu et al. (2023) also learn $Z_{\phi_Z}$.

These different training-based guidance losses open the design space of classifier guidance (Dhariwal & Nichol, 2021) for flow matching. For diffusion models, one only needs to train a classifier on noisy input to produce the correct guidance, whereas for general flow matching, one needs the training losses proposed above to train a network that gives accurate guidance. In summary, we have proposed guidance matching methods for training-based guidance in this subsection. We derive different losses for learning $Z_t$ and $g_t$, all of which provide unbiased estimations of the gradient and can be utilized without specific assumptions on flow matching; a minimal training sketch follows.
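A minimal sketch of the guidance matching loss of Eqs. (10)-(11) under the affine path $x_t = t\,x_1 + (1-t)\,x_0$ (so $v_{t|z} = x_1 - x_0$). `Z_phi` and `g_phi` are the networks to be trained and `J` is the energy; all names and the numerical clamp are assumptions of this sketch.

```python
import torch

def guidance_matching_loss(Z_phi, g_phi, x0, x1, J):
    t = torch.rand(x0.shape[0], 1)
    xt = t * x1 + (1.0 - t) * x0
    v_cond = x1 - x0                                        # v_{t|z}(x_t | z)
    e = torch.exp(-J(x1)).unsqueeze(-1)                     # e^{-J(x1)}
    loss_Z = ((Z_phi(xt, t) - e) ** 2).sum(-1).mean()       # ell_{phi_Z}: regress e^{-J}
    Z_sg = Z_phi(xt, t).detach().clamp_min(1e-8)            # stop-gradient surrogate Z
    target = (e / Z_sg - 1.0) * v_cond                      # conditional GM target
    loss_g = ((g_phi(xt, t) - target) ** 2).sum(-1).mean()  # ell^{GM}_phi
    return loss_Z + loss_g
```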
## 4. Experiments

In the experiments, we benchmark different guidance methods for flow matching models on different tasks, including synthetic datasets, generative decision-making, and image inverse problems. These tasks all fall into the category of energy guidance (Janner et al., 2022; Lu et al., 2023) or posterior sampling (Chung et al., 2023). With these experiments, we aim to answer the following questions: (1) Can the proposed learning-based exact guidance method $g_\phi$ learn the correct guidance VF $g_t$ for general (including non-uncoupled-affine-Gaussian-path) flow matching? (2) Can the asymptotically exact MC guidance method produce the correct guidance in non-Gaussian flow matching, and is it exact given a sufficiently large sample budget? (3) On the practical side, how do different types of guidance methods (approximate/exact, training-free/training-based) perform on more realistic tasks, and how do we choose the appropriate flow matching guidance for different tasks?

### 4.1. Synthetic Dataset

We first compare different guidance methods on 2-dimensional synthetic datasets where the source distributions differ from the Gaussian. The base flow matching model is trained to learn the flow with source distributions other than Gaussian. During inference, different guidance methods are applied to perform guided sampling with a different objective function $J$ for each dataset. All of the source distributions are non-Gaussian, so traditional diffusion guidance should not be applied. In Figure 2, we compare the performance of $g^{\mathrm{MC}}$, $g_\phi$, an exact diffusion guidance called contrastive energy guidance (CEG) proposed by Lu et al. (2023), and the approximate guidance methods $g^{\text{cov-A}}$, $g^{\text{cov-G}}$, and $g^{\text{sim-MC}}$. The original target distributions (w/o $g_t$) and the $J$-weighted distributions (ground truth) are shown in the first and second columns. The details of the experiment are provided in Appendix B.

Figure 2: Results on the synthetic datasets with different source (blue) and target (red) distributions. We visualize the start/end points and the flow trajectories. $g^{\mathrm{MC}}$ and $g_\phi$ yield the best guidance across different settings, while diffusion guidance fails.

It can be seen from Figure 2 that $g^{\mathrm{MC}}$ and $g_\phi$ generate samples that almost exactly match the ground-truth distribution, and the performance is consistent across datasets. Note that the generated samples maintain the correct data manifold instead of concentrating on a few points, as gradient-based approximate guidance methods do. As mentioned in Section 3, CEG is essentially $\nabla_{x_t}\log Z_t$, which is exact under the uncoupled affine Gaussian path assumption (Assumption 3.3). However, none of the flow matching paths here are uncoupled affine Gaussian paths, so the exact diffusion guidance performs poorly compared to $g_\phi$ and $g^{\mathrm{MC}}$, showing a largely distorted generated distribution.

We also investigated the asymptotic exactness of $g^{\mathrm{MC}}$. To quantify how guidance precision increases with sample size, we report the Wasserstein-2 distance between the guided generation distribution and the ground-truth target distribution, estimated using 1000 samples. As shown in Figure 4, the error decreases as the number of samples for computing $g^{\mathrm{MC}}$ ($N$ in Algorithm 1) increases from 5 to $10^4$, eventually approaching or surpassing the error of the learned generative model on the original distribution, which can be seen as an approximate lower bound of the guided generation error.

### 4.2. Planning

We also conduct experiments on offline RL tasks where generative models have been used as planners (Janner et al., 2022; Chen & Lipman, 2024). The planning process is realized by sampling from $\frac{1}{Z}p(x_1)e^{R(x_1)}$ (Levine, 2018), which aligns with the goal of our guidance by setting $J = -R$, with $R$ being the return.

Table 1: Results of the D4RL Locomotion experiments (columns 5-10: OT-CFM; columns 11-16: CFM). Entries within 95% of the best result per task (excluding baselines) are in bold. Standard deviations are deferred to Appendix B.2 due to limited space.

| Dataset | Environment | BC | Diffuser | w/o $g$ | $g^{\text{cov-A}}$ | $g^{\text{cov-G}}$ | $g^{\text{sim-MC}}$ | $g^{\mathrm{MC}}$ | $g_\phi$ | w/o $g$ | $g^{\text{cov-A}}$ | $g^{\text{cov-G}}$ | $g^{\text{sim-MC}}$ | $g^{\mathrm{MC}}$ | $g_\phi$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medium-Expert | HalfCheetah | 55.2 | 88.9 | 61.9 | 64.8 | 73.2 | 78.1 | **86.4** | 70.2 | 46.4 | 63.4 | 68.5 | **83.5** | **87.7** | 81.5 |
| Medium-Expert | Hopper | 52.5 | 103.3 | 95.2 | 101.8 | **112.3** | **112.3** | **112.7** | 98.1 | 83.4 | 93.9 | **113.3** | 88.5 | **113.3** | 91.5 |
| Medium-Expert | Walker2d | 107.5 | 106.9 | 79.1 | 97.3 | **107.2** | 101.0 | **107.2** | 91.3 | 65.7 | 100.4 | **106.9** | **107.0** | **107.1** | **102.3** |
| Medium | HalfCheetah | 42.6 | 42.8 | 34.7 | **42.2** | **42.9** | **43.1** | **43.1** | **43.4** | **41.8** | **43.6** | **43.3** | **43.8** | **43.8** | **43.8** |
| Medium | Hopper | 52.9 | 74.3 | 63.3 | 75.1 | **89.8** | 76.2 | 79.8 | 79.7 | 73.2 | 79.1 | 82.7 | 82.1 | **88.0** | 85.2 |
| Medium | Walker2d | 75.3 | 79.6 | 72.4 | **82.7** | **81.3** | **83.4** | **83.0** | **80.6** | 72.2 | **80.7** | **82.5** | **81.9** | **81.9** | 72.9 |
| Medium-Replay | HalfCheetah | 36.6 | 37.7 | 25.6 | 31.7 | 36.1 | 36.8 | **40.0** | 35.5 | 22.2 | 33.4 | **39.3** | 37.9 | **40.6** | **39.1** |
| Medium-Replay | Hopper | 18.1 | 93.6 | 40.1 | 57.7 | 74.1 | 60.9 | **88.6** | 55.3 | 55.1 | 63.0 | 69.3 | 61.0 | 80.9 | 63.5 |
| Medium-Replay | Walker2d | 26.0 | 70.6 | 31.2 | 62.5 | 82.5 | 64.4 | **88.1** | 52.4 | 28.3 | 64.9 | 76.6 | 58.9 | 70.9 | 70.3 |
| Average | | 51.9 | 77.5 | 55.9 | 68.4 | **77.7** | 72.9 | **81.0** | 67.4 | 54.3 | 69.2 | 75.8 | 71.6 | **79.4** | 72.2 |
We report experiment results on the Locomotion tasks of the D4RL dataset (Fu et al., 2020), with experiment-setting details and complete results provided in Appendix B.2. The average normalized scores across five seeded runs of each guidance method are reported in Table 1, where a score of 100 corresponds to expert performance. The conclusions for CFM and OT-CFM are consistent: among gradient-based methods, $g^{\text{cov-G}}$ is generally better than $g^{\text{cov-A}}$, with an average score increase of 8.0. $g^{\text{sim-MC}}$ performs between $g^{\text{cov-A}}$ and $g^{\text{cov-G}}$. The improved performance of $g^{\text{cov-G}}$ comes at the higher computational cost of differentiating through the VF model, though. The MC-based guidance, despite being gradient-free, outperforms the second-best method $g^{\text{cov-G}}$ by 3.5 on average and is the best method in 7 out of 9 tasks. For $g_\phi$, we report the result of the best loss $\ell$, but its performance is still relatively weak, falling behind the best by 10.4 on average. We attribute this to the unstable joint training of two networks and provide the complete results in Appendix B.2.

To investigate the effectiveness of $g^{\mathrm{MC}}$, we collect an ensemble of plans generated under guidance, compute the critic-predicted objective function value (estimated return $R$), and plot the density distribution of the estimated return $R$. An ideal guidance will yield an $R$ distribution of $\frac{1}{Z}p(R)e^{R}$, where $p(R)$ is the distribution generated without guidance (Appendix B.2). As can be seen from Figure 3, the samples generated with $g^{\mathrm{MC}}$ have a density distribution that best matches the ground-truth target distribution.

Figure 3: $R$ distribution of generated trajectories in Locomotion (x-axis: estimated $R$; y-axis: probability density of generated samples; the gray dashed line is the target $\frac{1}{Z}p(R)e^{R}$). $g^{\mathrm{MC}}$ matches the target well.
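As a minimal sketch of this evaluation, the ideal guided density $\frac{1}{Z}p(R)e^{R}$ can be computed by reweighting the unguided return samples by $e^{R}$ and renormalizing; `R_unguided` is an assumed array of critic-estimated returns from unguided plans.

```python
import numpy as np

def target_density(R_unguided, bins=50):
    """Reweight the unguided return histogram p(R) by exp(R) to get (1/Z) p(R) e^R."""
    hist, edges = np.histogram(R_unguided, bins=bins, density=True)  # estimate p(R)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = hist * np.exp(centers)                   # unnormalized p(R) e^R
    return centers, w / np.trapz(w, centers)     # normalize to a proper density
```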
### 4.3. Image Inverse Problems

We conduct experiments on image inverse problems on the CelebA-HQ (256×256) dataset to reflect guidance performance on higher-dimensional generative tasks. We consider three types of noised inverse problems: box inpainting, super-resolution, and Gaussian deblurring, and compute the metrics FID, LPIPS, PSNR, and SSIM on 3000 samples from the test set. The details of the settings and result visualizations are included in Appendix B.3.

Table 2: Image inverse problem results on CelebA-HQ. Column groups (left to right): Inpainting-Center ($\sigma_y=0.05$), Super-Resolution ×4 ($\sigma_y=0.05$), Gaussian Deblurring ($\sigma_y=0.05$).

| Model | Method | FID↓ | LPIPS↓ | PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OT-CFM | $g^{\text{cov-A}}$ | 7.330 | 0.1904 | 25.70 | 0.8432 | 26.06 | 0.3016 | 26.64 | 0.7292 | 14.34 | 0.2968 | 24.46 | 0.6982 |
| OT-CFM | $g^{\text{sim-A}}$ | 10.67 | 0.1716 | 25.42 | 0.8681 | 31.78 | 0.3717 | 23.88 | 0.6072 | 11.83 | 0.2837 | 24.54 | 0.7109 |
| OT-CFM | $g^{\text{cov-G}}$ | 18.43 | 0.2390 | 26.96 | 0.8125 | 15.42 | 0.2533 | 27.40 | 0.7783 | 17.04 | 0.2873 | 24.96 | 0.7196 |
| OT-CFM | ΠGDM | 12.56 | 0.1717 | 27.74 | 0.8723 | 9.828 | 0.2322 | 27.24 | 0.7891 | 12.95 | 0.2226 | 28.43 | 0.7952 |
| OT-CFM | $g^{\mathrm{MC}}$ | 22.75 | 0.5589 | 8.67 | 0.3484 | 22.92 | 0.5596 | 8.59 | 0.3468 | 22.82 | 0.5596 | 8.64 | 0.3469 |
| CFM | $g^{\text{cov-A}}$ | 7.678 | 0.1920 | 25.95 | 0.8414 | 31.76 | 0.3770 | 22.79 | 0.5899 | 16.09 | 0.3052 | 24.21 | 0.6825 |
| CFM | $g^{\text{sim-A}}$ | 11.54 | 0.1785 | 28.00 | 0.8686 | 33.02 | 0.3749 | 23.75 | 0.6015 | 13.37 | 0.2947 | 24.34 | 0.6926 |
| CFM | $g^{\text{cov-G}}$ | 19.65 | 0.2377 | 27.03 | 0.8140 | 13.89 | 0.2461 | 27.15 | 0.7864 | 16.89 | 0.2908 | 24.84 | 0.7112 |
| CFM | ΠGDM | 15.27 | 0.1753 | 25.48 | 0.8700 | 10.52 | 0.2435 | 26.96 | 0.7755 | 12.60 | 0.2244 | 28.27 | 0.7893 |
| CFM | $g^{\mathrm{MC}}$ | 26.37 | 0.5615 | 9.06 | 0.3495 | 26.75 | 0.5492 | 10.05 | 0.3689 | 26.84 | 0.5494 | 10.04 | 0.3684 |

The results demonstrate that ΠGDM is generally better across all tasks, being the best in 8 out of 12 metrics. $g^{\text{cov-G}}$ has similar but slightly worse performance than ΠGDM, being the best or runner-up in all 4 metrics of the super-resolution task (when ranking the results of CFM and OT-CFM separately) and in 3 out of 4 metrics of the deblurring task. $g^{\text{sim-A}}$, which does not involve the Jacobian, shows remarkable performance in inpainting and deblurring, with an FID score lower than ΠGDM's by 1.5 on average, and LPIPS, PSNR, and SSIM within 3% relative difference of the best or runner-up method in 5 out of 6 metrics; this is especially notable given the efficiency of $g^{\text{sim-A}}$, which needs no backpropagation through the model. $g^{\text{cov-A}}$ is the worst excluding $g^{\mathrm{MC}}$, being the worst or second worst in 11 out of 12 metrics when ranking CFM and OT-CFM separately. It should be noted that $g^{\mathrm{MC}}$ performs poorly here, as $J$ in the inverse problem is highly complex, requiring an infeasibly large sample budget to compute an accurate $g^{\mathrm{MC}}$. A more detailed explanation is deferred to Appendix B.3.

## 5. Limitations and Future Works

The major limitation of our work lies in the assumption that $P \approx 1$. When the coupling is strong, the guidance VF no longer has the correct direction. In future work, we plan to address this problem by estimating $P$, which would enable guidance for flow matching models with exact coupling. In addition, we plan to improve the specific guidance methods. For example, $g^{\mathrm{MC}}$ suffers from low sample efficiency on high-dimensional datasets; it is thus worthwhile to comprehensively investigate and improve the effectiveness of $g_t^{\text{MC-IS}}$, or to explore other techniques (such as control variates) to further lower the variance. Besides, the guided VF $v'_t(x_t)$ can be chosen to have different properties, such as straightness; in this way, there may be add-on VFs that improve sampling efficiency, which we also leave for future work.

## 6. Conclusion

In this work, we have proposed the first framework for general flow matching guidance, from which the new MC-based guidance $g^{\mathrm{MC}}$, several approximate guidance methods, and the learned guidance $g_\phi$ are derived, all verified by experiments. Many classical guidance methods are covered as special cases, and we provide a theoretical derivation for each guidance method. We believe this work will facilitate the application of flow matching by opening novel design spaces for its guidance methods.

## Acknowledgements

We thank Qianyi Chen, Tengfei Xu, Long Wei, Chuanrui Wang, Jiashu Pan, and Tao Zhang for the discussions and for providing feedback on our manuscript. We gratefully acknowledge the support of the Westlake University Research Center for Industries of the Future and the Westlake University Center for High-performance Computing. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? In The Eleventh International Conference on Learning Representations, 2023.

Andrae, M., Landelius, T., Oskarsson, J., and Lindsten, F. Continuous ensemble weather forecasting with diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ePEZvQNFDW.
Ben-Hamu, H., Puny, O., Gat, I., Karrer, B., Singer, U., and Lipman, Y. D-Flow: Differentiating through flows for controlled generation. In Proceedings of the 41st International Conference on Machine Learning, 2024.

Boys, B., Girolami, M., Pidstrigach, J., Reich, S., Mosca, A., and Akyildiz, O. D. Tweedie moment projected diffusions for inverse problems. In The Twelfth International Conference on Learning Representations, 2023.

Chen, R. T. Q. and Lipman, Y. Flow matching on general geometries. In The Twelfth International Conference on Learning Representations, 2024.

Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In The Thirty-Fifth Annual Conference on Neural Information Processing Systems, 2021.

Dou, Z. and Song, Y. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In The Twelfth International Conference on Learning Representations, 2024.

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., and Lee, K. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. In The Thirty-seventh Annual Conference on Neural Information Processing Systems, 2023.

Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T. Q., Synnaeve, G., Adi, Y., and Lipman, Y. Discrete flow matching. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.

Hiranaka, A., Chen, S.-F., Lai, C.-H., Kim, D., Murata, N., Shibuya, T., Liao, W.-H., Sun, S.-H., and Mitsufuji, Y. HERO: Human-feedback efficient reinforcement learning for online diffusion model finetuning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=yMHe9SRvxk.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In The Thirty-Fourth Annual Conference on Neural Information Processing Systems, 2020.

Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. In Proceedings of the 39th International Conference on Machine Learning, 2022.

Kollovieh, M., Lienen, M., Lüdke, D., Schwinn, L., and Günnemann, S. Flow matching with Gaussian process priors for probabilistic time series forecasting. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uxVBbSlKQ4.

Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T. Q., Lopez-Paz, D., Ben-Hamu, H., and Gat, I. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023a.

Liu, X., Wu, L., Zhang, S., Gong, C., Ping, W., and Liu, Q. FlowGrad: Controlling the output of generative ODEs with gradients. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.

Lu, C., Chen, H., Chen, J., Su, H., Li, C., and Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, 2023.

Lu, H., Han, D., Shen, Y., and Li, D. What makes a good diffusion planner for decision making? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=7BQkXXM8Fy.

Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In The 18th European Conference on Computer Vision, 2024.

Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. In The Thirty-Fourth Annual Conference on Neural Information Processing Systems, 2020.

Owen, A. B. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In The Thirty-First Annual Conference on Neural Information Processing Systems, 2017.

Pokle, A., Muckley, M. J., Chen, R. T. Q., and Karrer, B. Training-free linear image inverses via flows. In The Twelfth International Conference on Learning Representations, 2024.

Rozet, F., Andry, G., Lanusse, F., and Louppe, G. Learning diffusion priors from observations by expectation maximization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.

Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In The Eleventh International Conference on Learning Representations, 2023a.

Song, J., Zhang, Q., Yin, H., Mardani, M., Liu, M.-Y., Kautz, J., Chen, Y., and Vahdat, A. Loss-guided diffusion models for plug-and-play controllable generation. In Proceedings of the 40th International Conference on Machine Learning, 2023b.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In The Thirty-Third Annual Conference on Neural Information Processing Systems, 2019.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 2020.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021.

Stimper, V., Schölkopf, B., and Hernández-Lobato, J. M. Resampling base distributions of normalizing flows. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022.

Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, March 2024.

Trippe, B. L., Yim, J., Tischer, D., Baker, D., Broderick, T., Barzilay, R., and Jaakkola, T. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. In The Eleventh International Conference on Learning Representations, 2023.
Wang, L., Cheng, C., Liao, Y., Qu, Y., and Liu, G. Training-free guided flow-matching with optimal control. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=61ss5RA1MM.

Ye, H., Lin, H., Han, J., Xu, M., Liu, S., Liang, Y., Ma, J., Zou, J., and Ermon, S. TFG: Unified training-free guidance for diffusion models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.

Zhang, S., Zhang, W., and Gu, Q. Energy-weighted flow matching for offline reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=HA0oLUvuGI.

Zheng, H., Chu, W., Wang, A., Kovachki, N., Baptista, R., and Yue, Y. Ensemble Kalman diffusion guidance: A derivative-free method for inverse problems. arXiv preprint arXiv:2409.20175v1, 2024.

Zheng, Q., Le, M., Shaul, N., Lipman, Y., Grover, A., and Chen, R. T. Q. Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443, 2023.

Zhou, G., Swaminathan, S., Raju, R. V., Guntupalli, J. S., Lehrach, W., Ortiz, J., Dedieu, A., Lázaro-Gredilla, M., and Murphy, K. Diffusion model predictive control. arXiv preprint arXiv:2410.05364, 2024.

## A. Complete Theoretical Derivations and Proofs

Here, we provide the theoretical analysis deferred from the main text, organized into the following subsections:

- Appendix A.1 proves that energy-guided sampling from $p(x)e^{-J(x)}/Z$ is equivalent to conditional sampling from $p(x|y)$.
- Appendix A.2 proves Theorem 3.1 by showing that $g_t + v_t$ equals $v'_t$, which generates the correctly guided probability path.
- Appendix A.3 discusses the $P = 1$ approximation.
- Appendix A.4 proves that under uncoupled affine Gaussian paths, our general guidance $g_t$ is equivalent to the diffusion guidance $\nabla_{x_t}\log Z_t$.
- Appendix A.5 proves Proposition 3.5.
- Appendix A.6 discusses other ways to obtain $Z_t$, including contrastive learning and Monte Carlo estimation.
- Appendix A.7 proposes three other training losses, $\ell^{\mathrm{VGM}}$, $\ell^{\mathrm{RGM}}$, and $\ell^{\mathrm{MRGM}}$, for $g_\phi$, and proves that these losses produce the correct gradient.
- Appendix A.8 includes a more detailed explanation of $g^{\mathrm{MC}}$ and a variant of it under uncoupled paths.
- Appendix A.9 derives $g^{\mathrm{local}}$ and proves its error bound.
- Appendix A.10 shows how to estimate $\hat x_1$ using $x_t$ and $v_\theta(x_t,t)$ under the affine path assumption.
- Appendix A.11 proves that $g^{\mathrm{local}}$ becomes $g^{\mathrm{cov}}$ under affine paths.
- Appendix A.12 includes the proof of the Jacobian trick (Proposition 3.4).
- Appendix A.13 includes a proof of $g_t^{\text{sim-inv}}$ for the image inverse problem, how to derive $g^{\text{sim-inv-A}}$, and how to recover ΠGDM under the uncoupled affine Gaussian path assumption.

### A.1. Energy-Guided Sampling as Posterior Sampling

There exists a $J(x)$ such that sampling from $\frac{1}{Z}p(x)e^{-J(x)}$ is equivalent to sampling from $p(x|y)$. Simply take $J(x) = -\log p(y|x)$. Plugging it in gives
$$\frac{1}{Z}p(x)e^{-J(x)} = \frac{p(x)e^{\log p(y|x)}}{\int p(x)e^{\log p(y|x)}\,dx} = \frac{p(x)e^{\log p(y|x)}/p(y)}{\int p(x)e^{\log p(y|x)}/p(y)\,dx} = p(x|y). \tag{12}$$
This result can be interpreted as follows: given a classifier $p(y|x)$ and an energy guidance algorithm, one can directly use the algorithm to perform conditional generation from $p(x|y)$ by setting the energy $J(x) = -\log p(y|x)$. Similar approaches have been used in probabilistic inference and reinforcement learning (Levine, 2018), and Diffuser (Janner et al., 2022) uses this to convert return-conditioned sampling into energy-guided sampling.
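A minimal numeric check of Eq. (12) on a discrete toy problem: weighting a prior by $e^{-J}$ with $J(x) = -\log p(y|x)$ reproduces the Bayesian posterior. The distributions below are made up for illustration.

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])          # prior over 3 states
p_y_given_x = np.array([0.9, 0.1, 0.4])  # likelihood of a fixed observation y
J = -np.log(p_y_given_x)                 # energy J(x) = -log p(y|x)

guided = p_x * np.exp(-J); guided /= guided.sum()            # p(x) e^{-J} / Z
posterior = p_x * p_y_given_x; posterior /= posterior.sum()  # Bayes: p(x|y)
assert np.allclose(guided, posterior)    # the two sampling targets coincide
```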
### A.2. General Guidance

We prove Theorem 3.1 here.

**Theorem A.1.** Adding the guidance VF $g_t(x_t)$ to the original VF $v_t(x_t)$ will form a VF $v'_t(x_t)$ that generates $p'_t(x_t) = \int p_t(x_t|z)\,p'(z)\,dz$, as long as $g_t(x_t)$ follows:
$$g_t(x_t) = \int P\left(\frac{e^{-J(x_1)}}{Z_t(x_t)} - 1\right) v_{t|z}(x_t|z)\, p(z|x_t)\, dz, \tag{13}$$
where
$$Z_t(x_t) = \int P\, e^{-J(x_1)}\, p(z|x_t)\, dz, \tag{14}$$
and $P = \frac{\pi'(x_0|x_1)}{\pi(x_0|x_1)}$ is the reverse coupling ratio, where $\pi'(x_0|x_1)$ is the reverse data coupling for the new VF, i.e., the distribution of $x_0$ given $x_1$ sampled from the target distribution.

*Proof.* We can subtract $v_t(x_t)$ from $v'_t(x_t)$ to construct $g_t(x_t)$:
$$g_t(x_t) = v'_t(x_t) - \int v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)p(z)}{p_t(x_t)}\,dz. \tag{15}$$
$v'_t(x_t)$ generates $p'_t(x_t) = \int p_t(x_t|z)\,p'(z)\,dz$ if
$$v'_t(x_t) = \int v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)\,p'(z)}{p'_t(x_t)}\,dz, \tag{16}$$
where $p'(z) = \pi'(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}$; this follows from conditional flow matching, i.e., a VF marginalizing a conditional VF generates the corresponding marginal probability path (Lipman et al., 2023; Tong et al., 2024). Then, we have a possible form of $g_t(x_t)$:
$$\begin{aligned} g_t(x_t) &= \int v_{t|z}(x_t|z)\left(\frac{p_t(x_t|z)\,p'(z)}{p'_t(x_t)} - \frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\right)dz \\ &= \int v_{t|z}(x_t|z)\left(\frac{p_t(x_t|z)\,\pi'(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}}{p'_t(x_t)} - \frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\right)dz \\ &= \int v_{t|z}(x_t|z)\left(\frac{P(z)\,p_t(x_t|z)\,p(z)\,\frac{1}{Z}e^{-J(x_1)}}{p'_t(x_t)} - \frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\right)dz \\ &= \int v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\left(P(z)\,\frac{1}{Z}e^{-J(x_1)}\,\frac{p_t(x_t)}{p'_t(x_t)} - 1\right)dz, \end{aligned} \tag{17}$$
where $p_t(x_t) = \int p(x_t,z)\,dz$, $p'_t(x_t) = \int p'(x_t,z)\,dz$, and $P(z) = \frac{\pi'(x_0|x_1)}{\pi(x_0|x_1)}$. Since
$$p(x_t) = \int p(x_t,z)\,dz = p(x_t)\int p(z|x_t)\,dz \tag{18}$$
and
$$p'(x_t) = \int p'(x_t|z)\,p'(z)\,dz = \int p(x_t|z)\,\pi'(x_0|x_1)\,\tfrac{1}{Z}p(x_1)e^{-J(x_1)}\,dz = p(x_t)\int P(z)\,p(z|x_t)\,\tfrac{1}{Z}e^{-J(x_1)}\,dz, \tag{19}$$
plugging Eq. (18) and Eq. (19) into Eq. (17) gives
$$g_t(x_t) = \int v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\left(\frac{P(z)\,e^{-J(x_1)}\,\overbrace{\int p(z|x_t)\,dz}^{=1}}{\int P(z)\,p(z|x_t)\,e^{-J(x_1)}\,dz} - 1\right)dz = \int v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\left(\frac{P(z)\,e^{-J(x_1)}}{\mathbb{E}_{z\sim p(z|x_t)}[P(z)\,e^{-J(x_1)}]} - 1\right)dz. \tag{20}$$
Finally, denote $Z_t(x_t) = \mathbb{E}_{z\sim p(z|x_t)}[P(z)\,e^{-J(x_1)}]$ to complete the proof. ∎

**Remark.** The theorem states that $v'_t = g_t + v_t$ not only generates the desired terminal distribution $\frac{1}{Z}p(x_1)e^{-J(x_1)}$ at time $t=1$, but also generates a probability path $p'_t(x_t)$ similar to the original one. Specifically, the noising process $p(x_t|x_1)$, the conditional vector field $v(x_t|x_1)$, and the reverse coupling $p(x_0|x_1)$ are the same. These are all hyperparameters of flow matching: one can choose an arbitrary conditional vector field satisfying the boundary conditions, and the conditional vector field uniquely determines the conditional probability path; the reverse coupling, given the target (dataset) distribution $p(x_1)$ or $\frac{1}{Z}p(x_1)e^{-J(x_1)}$, composes the data coupling $p(x_0,x_1) = p(z)$ for flow matching training. It should be noted that the $g_t$ and $v'_t$ we construct here are only one of infinitely many (Lipman et al., 2023) possible vector fields that generate $\frac{1}{Z}p(x_1)e^{-J(x_1)}$ at $t=1$. It remains an interesting question whether there exist better $v'_t$ that, for example, simplify $g_t$ or improve the vector field by straightening the flow.
Table 3: The relative $L_2$ norm of the difference between the VFs of OT-CFM and CFM at different time steps on the sampling trajectory.

| Flow time | 0.05 | 0.25 | 0.5 | 0.75 | 0.95 |
|---|---|---|---|---|---|
| Relative $L_2$ | 0.0382 ± 0.0076 | 0.0297 ± 0.0043 | 0.0271 ± 0.0032 | 0.0312 ± 0.0033 | 0.0717 ± 0.0096 |

### A.3. The P = 1 Approximation

From the empirical perspective, we found that $P = 1$ is a valid choice for dependent couplings on realistic datasets. As we show in Table 3, the VFs of OT-CFM (batch size 128) and uncoupled CFM trained on CelebA 256×256 have small relative error at all flow time steps, so their ideal guidance VFs are approximately the same, validating our approximation of $P = 1$.

Next, we give a more intuitive understanding of the approximation. Our framework allows us to choose any $\pi'(x_0|x_1)$ (and hence $P$) as long as the source distribution is consistent: $p(x_0) = \int \pi'(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}\,dx_1$. In other words, the error induced by setting $P = 1$ can be characterized by the deviation in either the source distribution or the vector field. In the former case, the guidance VF is essentially considered exact, which implies $\pi'(x_0|x_1) = \pi(x_0|x_1)$; the error is then induced by the fact that we should have sampled from $\int \pi'(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}\,dx_1$ rather than the original $p(x_0)$. In the latter case, we assume the source distribution to be unchanged, i.e., we need $\pi'(x_0|x_1) = p(x_0)$ to make the source distributions compatible automatically; the error is then caused by approximating $P = \frac{p(x_0)}{\pi(x_0|x_1)}$ with $1$.

We now discuss the practical effect of the approximation, i.e., when it is a good approximation and when it is not. In the case of strongly dependent couplings, $P \approx 1$ still holds as long as $J$ varies slowly. This is demonstrated by the small deviation in the source distribution (assuming $\pi'(x_0|x_1) = \pi(x_0|x_1)$), as discussed above: if $J$ is always near its average value, the new source distribution $\int \pi(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}\,dx_1$ is close to $\int \pi(x_0|x_1)\,p(x_1)\,dx_1 = p(x_0)$, the original source. Nevertheless, when the coupling is strong and $J$ varies intensely, a more complicated treatment is required for exact guidance. For example, one could try to sample from the new source distribution $\int \pi(x_0|x_1)\,\frac{1}{Z}p(x_1)e^{-J(x_1)}\,dx_1$. Although this may seem as difficult as sampling exactly from the target distribution $\frac{1}{Z}p(x_1)e^{-J(x_1)}$, it may be learned more easily, as the target distribution is potentially smoothed after being convolved with the kernel $\pi(x_0|x_1)$. We leave this more detailed treatment of $P$ to future work.

### A.4. Uncoupled Affine Gaussian Guidance

Here we prove that $\nabla_{x_t}\log Z_t(x_t)$ is proportional to the guidance $g_t$ in Eq. (1). Note that the term $\nabla_{x_t}\log Z_t(x_t)$ is widely used as guidance in the diffusion model literature (Dhariwal & Nichol, 2021; Ho & Salimans, 2022; Song et al., 2021; 2023b;a; Janner et al., 2022; Ajay et al., 2023). Therefore, we prove that our general flow matching guidance exactly covers the original diffusion guidance under the affine Gaussian path assumption, i.e., when flow matching falls back to a diffusion model. Our proof also elucidates how the gradient $\nabla_{x_t}$ emerges from the original expression of the general guidance in Eq. (1), where no gradient is apparent. We restate Eq. (1) here:
$$g_t(x_t) = \int \left(\frac{e^{-J(x_1)}}{Z_t(x_t)} - 1\right) v_{t|z}(x_t|z)\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\,dz, \tag{21}$$
$$Z_t(x_t) = \int e^{-J(x_1)}\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\,dz. \tag{22}$$
Assuming the flow matching to be of uncoupled affine path, we have
$$x_t = \sigma_t x_0 + \alpha_t x_1, \tag{23}$$
where $\sigma_t$ and $\alpha_t$ are schedules satisfying the boundary conditions $\sigma_0 = \alpha_1 = 1$, $\sigma_1 = \alpha_0 = 0$. Thus,
$$v_{t|z}(x_t|z) = \dot\sigma_t x_0 + \dot\alpha_t x_1 = \underbrace{\frac{\dot\sigma_t}{\sigma_t}}_{a_t}\, x_t + \underbrace{\frac{1}{\sigma_t}\big(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t\big)}_{b_t}\, x_1, \tag{24}$$
where $\dot f := \frac{df}{dt}$ denotes the derivative w.r.t. time $t$, and we define $a_t := \frac{\dot\sigma_t}{\sigma_t}$ and $b_t := \frac{1}{\sigma_t}(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t)$.

First, we demonstrate a useful technique for the later proof. Inserting Eq. (24) into $g_t(x_t)$ in Eq. (21), we have
$$g_t(x_t) = \int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)(a_t x_t + b_t x_1)\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\,dz. \tag{25}$$
Since $Z_t = \mathbb{E}_{z\sim p(z|x_t)}[e^{-J(x_1)}]$,
$$\int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right) a_t x_t\,\frac{p_t(x_t|z)\,p(z)}{p_t(x_t)}\,dz = 0. \tag{26}$$
That is to say, $x_t$ inside the integral of Eq. (26) integrates to zero, and we can freely remove or add it to construct desired terms. Recall the assumption of the uncoupled Gaussian path, i.e., $p(x_0,x_1) = p(x_0)p(x_1)$ with $p(x_0) = \mathcal{N}(x_0; 0, I)$. We can utilize the important fact that the conditional probability path for affine Gaussian path flows satisfies $x_t \sim \mathcal{N}(x_t; \alpha_t x_1, \sigma_t^2 I)$, which connects the conditional score to $x_1$:
$$\nabla_{x_t}\log p(x_t|x_1) = -\frac{x_t - \alpha_t x_1}{\sigma_t^2}. \tag{27}$$
Using Eq. (26) and Eq. (27), Eq. (25) can be further converted to
$$\begin{aligned} g_t(x_t) &= \int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)\Big(b_t x_1 \underbrace{- \tfrac{b_t}{\alpha_t} x_t}_{\text{added via Eq. (26)}}\Big)\frac{p_t(x_t|x_1)\,p(x_1)}{p_t(x_t)}\,dx_1 \\ &= \frac{b_t\sigma_t^2}{\alpha_t}\int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)\nabla_{x_t}\log p(x_t|x_1)\,\frac{p_t(x_t|x_1)\,p(x_1)}{p_t(x_t)}\,dx_1 \tag{28} \\ &= \frac{b_t\sigma_t^2}{\alpha_t}\int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)\Big(\underbrace{\nabla_{x_t}\log p(x_t)}_{\text{integrates to zero as in Eq. (26)}} + \underbrace{\nabla_{x_t}\log p(x_1|x_t)}_{\text{Bayes' rule}}\Big)\frac{p_t(x_t|x_1)\,p(x_1)}{p_t(x_t)}\,dx_1 \tag{29} \\ &= \frac{b_t\sigma_t^2}{\alpha_t}\int \left(\frac{e^{-J(x_1)}}{Z_t} - 1\right)\nabla_{x_t}\log p(x_1|x_t)\,\frac{p_t(x_t|x_1)\,p(x_1)}{p_t(x_t)}\,dx_1. \tag{30} \end{aligned}$$
Notice in Eq. (30) that, using $f\nabla\log f = \nabla f$,
$$\int \nabla_{x_t}\log p(x_1|x_t)\,\frac{p_t(x_t|x_1)\,p(x_1)}{p_t(x_t)}\,dx_1 = \int \nabla_{x_t} p(x_1|x_t)\,dx_1 = \nabla_{x_t}\int p(x_1|x_t)\,dx_1 = 0. \tag{31}$$
Therefore,
$$\begin{aligned} g_t(x_t) &= \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\int e^{-J(x_1)}\,p(x_1|x_t)\,\nabla_{x_t}\log p(x_1|x_t)\,dx_1 = \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\int e^{-J(x_1)}\,\nabla_{x_t}\,p(x_1|x_t)\,dx_1 \\ &= \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\,\nabla_{x_t}\int e^{-J(x_1)}\,p(x_1|x_t)\,dx_1 = \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{\nabla_{x_t} Z_t(x_t)}{Z_t(x_t)} = \frac{b_t\sigma_t^2}{\alpha_t}\,\nabla_{x_t}\log Z_t(x_t), \tag{32} \end{aligned}$$
where $e^{-J(x_1)}$ can be exchanged with the gradient because it does not depend on $x_t$, and the last step uses the definition of $Z_t$ in Eq. (1). Another possible route is to first prove that the vector field in affine Gaussian path flow matching is affine in the marginal score; we direct interested readers to (Zheng et al., 2023).

**Remark A.2.** The above derivation opens the possibility of using diffusion guidance in affine Gaussian path flow matching, i.e., by multiplying the diffusion classifier guidance by the schedule $\frac{b_t\sigma_t^2}{\alpha_t} = \frac{\sigma_t(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t)}{\alpha_t}$. The most common schedule for flow matching is $\sigma_t = 1-t$, $\alpha_t = t$ (Lipman et al., 2023; Pokle et al., 2024; Zheng et al., 2023; Tong et al., 2024; Liu et al., 2023a; Lipman et al., 2024); in this case, the guidance schedule is $\frac{1-t}{t}$. It should be noted that this schedule explodes near $t = 0$ and is thus unstable; the flow matching schedules $\sigma_t$ and $\alpha_t$ can be chosen differently to avoid this instability.

**Remark A.3.** Note that this guidance cannot be applied to coupled paths. Central to the proof is that, in uncoupled affine Gaussian paths, we can convert the conditional vector field to the conditional score. If we could do this for coupled paths, we would require (1) $p_t(x|z)$ to be Gaussian, $\mathcal{N}(x; \mu_t, \sigma_t I)$, such that $\nabla_{x_t}\log p_t(x_t|z) \propto x_t - \mu_t$, and (2) $v_{t|z} = \dot\mu_t + \frac{\dot\sigma_t}{\sigma_t}(x_t - \mu_t)$, such that in Eq. (28) the conditional vector field can be converted to the conditional score. Therefore, the relation
$$\dot\mu_t + \frac{\dot\sigma_t}{\sigma_t}(x_t - \mu_t) \propto x_t - \mu_t \tag{33}$$
would have to hold for all $x_t, x_0, x_1$ inside the integral of Eq. (28), where $\mu_t = \xi_t x_0 + \eta_t x_1$. This is equivalent to requiring that
$$\Big(\dot\xi_t - \frac{\dot\sigma_t}{\sigma_t}\xi_t\Big)x_0 + \Big(\dot\eta_t - \frac{\dot\sigma_t}{\sigma_t}\eta_t\Big)x_1 + \frac{\dot\sigma_t}{\sigma_t}x_t \propto \xi_t x_0 + \eta_t x_1 \tag{34}$$
holds for all $x_t, x_0, x_1$ inside the integral of Eq. (28). According to Eq. (26), the $x_t$ terms integrate to zero, so we would need
$$\eta_t\Big(\dot\xi_t - \frac{\dot\sigma_t}{\sigma_t}\xi_t\Big) = \xi_t\Big(\dot\eta_t - \frac{\dot\sigma_t}{\sigma_t}\eta_t\Big), \tag{35}$$
which cannot hold:⁸ Eq. (35) is equivalent to $\frac{d\log\xi_t}{dt} = \frac{d\log\eta_t}{dt}$ for $t\in(0,1)$, i.e., $\xi_t \propto \eta_t$, which contradicts the boundary conditions $\xi_0 = \eta_1 = 1$ and $\xi_1 = \eta_0 = 0$. It can be observed that the reason this guidance does not apply to coupled paths is that $x_0$, $x_1$, and $x_t$ are all independent variables here, preventing us from canceling two of them to avoid matching the schedules' ratio.

⁸Unless $\xi_t = 1$, which falls back to the uncoupled case.
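Following Remark A.2, a minimal sketch of reusing a diffusion classifier guidance $\nabla_{x_t}\log Z_t$ inside an affine Gaussian path flow: rescale it by $\frac{\sigma_t(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t)}{\alpha_t}$, which equals $(1-t)/t$ for $\alpha_t = t$, $\sigma_t = 1-t$. The clipping constant is an assumption added to tame the blow-up near $t = 0$ noted in the remark.

```python
def fm_guidance_from_score_guidance(score_guidance, t, t_min=1e-3):
    """Rescale diffusion guidance grad_x log Z_t into a flow matching guidance VF
    for the common schedule alpha_t = t, sigma_t = 1 - t (schedule = (1 - t)/t)."""
    lam = (1.0 - t) / max(t, t_min)  # clipped: the schedule explodes as t -> 0
    return lam * score_guidance
```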
$$g_t(x_t) = \int \Big(\frac{e^{-J(x_1)}}{Z_t(x_t)} - 1\Big)\frac{v_{t\mid z}(x_t\mid z)\,p_t(x_t\mid z)\,p(z)}{p_t(x_t)}\,\mathrm{d}z, \qquad (21)$$
$$Z_t(x_t) = \int e^{-J(x_1)}\,\frac{p_t(x_t\mid z)\,p(z)}{p_t(x_t)}\,\mathrm{d}z. \qquad (22)$$
Assuming the flow matching to have an uncoupled affine path, we have
$$x_t = \sigma_t x_0 + \alpha_t x_1, \qquad (23)$$
where $\sigma_t$ and $\alpha_t$ are schedulers satisfying the boundary conditions $\sigma_0 = \alpha_1 = 1$, $\sigma_1 = \alpha_0 = 0$. Thus,
$$v_{t\mid z}(x_t\mid z) = \dot\sigma_t x_0 + \dot\alpha_t x_1 = \underbrace{\frac{\dot\sigma_t}{\sigma_t}}_{a_t}\,x_t + \underbrace{\frac{1}{\sigma_t}\big(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t\big)}_{b_t}\,x_1, \qquad (24)$$
where $\dot f_t := \mathrm{d}f/\mathrm{d}t$ denotes the derivative with respect to time $t$, and we define $a_t := \dot\sigma_t/\sigma_t$, $b_t := \frac{1}{\sigma_t}(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t)$.

First, we demonstrate a useful technique for the proof. Inserting Eq. (24) into $g_t(x_t)$ in Eq. (21), we have
$$g_t(x_t) = \int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)(a_t x_t + b_t x_1)\,\frac{p_t(x_t\mid z)\,p(z)}{p_t(x_t)}\,\mathrm{d}z. \qquad (25)$$
Since $Z_t = \mathbb{E}_{z\sim p(z\mid x_t)}[e^{-J(x_1)}]$,
$$\int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\,a_t x_t\,\frac{p_t(x_t\mid z)\,p(z)}{p_t(x_t)}\,\mathrm{d}z = 0. \qquad (26)$$
That is to say, terms proportional to $x_t$ inside the integral of Eq. (26) integrate to zero, and we can freely remove or add them to construct desired terms. Recall the assumption of the uncoupled Gaussian path, i.e., $p(x_0, x_1) = p(x_0)p(x_1)$ with $p(x_0) = N(x_0; 0, I)$. We can then use the fact that the conditional probability path for affine Gaussian flows satisfies $x_t \sim N(x_t; \alpha_t x_1, \sigma_t^2 I)$, which connects the conditional score to $x_1$:
$$\nabla_{x_t}\log p(x_t\mid x_1) = -\frac{x_t - \alpha_t x_1}{\sigma_t^2}. \qquad (27)$$
Using Eq. (26) and Eq. (27), Eq. (25) can be further converted to
$$g_t(x_t) = \int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\Big(b_t x_1 \underbrace{- \frac{b_t}{\alpha_t}x_t}_{\text{added via Eq. (26)}}\Big)\frac{p_t(x_t\mid x_1)\,p(x_1)}{p_t(x_t)}\,\mathrm{d}x_1 \qquad (28)$$
$$= \frac{b_t\sigma_t^2}{\alpha_t}\int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\nabla_{x_t}\log p(x_t\mid x_1)\,\frac{p_t(x_t\mid x_1)\,p(x_1)}{p_t(x_t)}\,\mathrm{d}x_1$$
$$= \frac{b_t\sigma_t^2}{\alpha_t}\int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\Big(\underbrace{\nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p(x_1\mid x_t)}_{\text{Bayes' rule}}\Big)\frac{p_t(x_t\mid x_1)\,p(x_1)}{p_t(x_t)}\,\mathrm{d}x_1 \qquad (29)$$
$$= \frac{b_t\sigma_t^2}{\alpha_t}\int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\nabla_{x_t}\log p(x_1\mid x_t)\,\frac{p_t(x_t\mid x_1)\,p(x_1)}{p_t(x_t)}\,\mathrm{d}x_1, \qquad (30)$$
where the $\nabla_{x_t}\log p(x_t)$ term integrates to zero as in Eq. (26). Notice in Eq. (30) that
$$\int \nabla_{x_t}\log p(x_1\mid x_t)\,\frac{p_t(x_t\mid x_1)\,p(x_1)}{p_t(x_t)}\,\mathrm{d}x_1 = \int p(x_1\mid x_t)\,\nabla_{x_t}\log p(x_1\mid x_t)\,\mathrm{d}x_1 = \nabla_{x_t}\int p(x_1\mid x_t)\,\mathrm{d}x_1 = 0, \qquad (31)$$
using $f\,\nabla\log f = \nabla f$. Hence
$$\begin{aligned}
g_t(x_t) &= \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\int e^{-J(x_1)}\,p(x_1\mid x_t)\,\nabla_{x_t}\log p(x_1\mid x_t)\,\mathrm{d}x_1 \qquad \text{(using again } f\,\nabla\log f = \nabla f\text{)}\\
&= \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\,\nabla_{x_t}\int e^{-J(x_1)}\,p(x_1\mid x_t)\,\mathrm{d}x_1 \qquad \text{(}e^{-J(x_1)}\text{ is independent of } x_t\text{; exchange with the integral)}\\
&= \frac{b_t\sigma_t^2}{\alpha_t}\,\frac{1}{Z_t(x_t)}\,\nabla_{x_t}Z_t(x_t) \qquad \text{(}Z_t\text{'s definition in Eq. (1))}\\
&= \frac{b_t\sigma_t^2}{\alpha_t}\,\nabla_{x_t}\log Z_t(x_t). \qquad (32)
\end{aligned}$$
Another possible way to derive this is to first prove that the vector field in affine Gaussian path flow matching is affine in the marginal score; we direct interested readers to Zheng et al. (2023).

Remark A.2. The above derivation opens the possibility of using diffusion guidance in affine Gaussian path flow matching, i.e., by multiplying the diffusion classifier guidance by the scheduler $\frac{b_t\sigma_t^2}{\alpha_t} = \frac{\sigma_t(\dot\alpha_t\sigma_t - \dot\sigma_t\alpha_t)}{\alpha_t}$. The most common scheduler for flow matching is $\sigma_t = 1 - t$, $\alpha_t = t$ (Lipman et al., 2023; Pokle et al., 2024; Zheng et al., 2023; Tong et al., 2024; Liu et al., 2023a; Lipman et al., 2024), in which case the guidance scheduler is $\frac{1-t}{t}$. It should be noted that this scheduler explodes near $t = 0$ and is thus unstable there; the flow matching schedulers $\sigma_t$ and $\alpha_t$ can be chosen differently to avoid this instability.
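Remark A.2 amounts to a one-line rescaling in practice. The sketch below is a minimal PyTorch illustration under the stated affine Gaussian path assumption; `grad_log_Zt` is a hypothetical user-supplied callable (e.g., a classifier-guidance network) returning $\nabla_{x_t}\log Z_t(x_t)$.

```python
import torch

def fm_guidance_from_diffusion_guidance(xt, t, grad_log_Zt, t_min=1e-3):
    """Rescale a diffusion-style gradient guidance into a flow matching
    guidance VF for the affine Gaussian path with sigma_t = 1 - t, alpha_t = t,
    where b_t * sigma_t**2 / alpha_t = (1 - t) / t (Remark A.2).
    grad_log_Zt(xt, t) is assumed to return grad_x log Z_t(x_t)."""
    t = torch.as_tensor(t, dtype=xt.dtype).clamp(min=t_min)  # avoid blow-up at t = 0
    return ((1.0 - t) / t) * grad_log_Zt(xt, t)
```

The clamp implements the stabilization suggested in the remark; alternative schedulers change only the scalar prefactor.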
Remark A.3. Note that this guidance cannot be applied to coupled paths. Central to the proof is that, in uncoupled affine Gaussian paths, we can convert the conditional vector field to the conditional score. If we could do this in coupled paths, we would require (1) $p_t(x\mid z)$ to be Gaussian $N(x; \mu_t, \sigma_t I)$, such that $\nabla_{x_t}\log p_t(x_t\mid z) \propto x_t - \mu_t$, and (2) $v_{t\mid z} = \dot\mu_t + \frac{\dot\sigma_t}{\sigma_t}(x_t - \mu_t)$, such that in Eq. (28) the conditional vector field can be converted to the conditional score. Therefore,
$$\dot\mu_t + \frac{\dot\sigma_t}{\sigma_t}(x_t - \mu_t) \propto x_t - \mu_t \qquad (33)$$
must hold (up to terms handled by Eq. (26)) for any $x_t, x_0, x_1$ inside the integral of Eq. (28), where $\mu_t = \xi_t x_0 + \eta_t x_1$. This is equivalent to requiring that
$$\Big(\dot\xi_t - \frac{\dot\sigma_t}{\sigma_t}\xi_t\Big)x_0 + \Big(\dot\eta_t - \frac{\dot\sigma_t}{\sigma_t}\eta_t\Big)x_1 + \frac{\dot\sigma_t}{\sigma_t}x_t \propto \xi_t x_0 + \eta_t x_1 \qquad (34)$$
hold for any $x_t, x_0, x_1$ inside the integral of Eq. (28). According to Eq. (26), the $x_t$ terms integrate to zero, thus
$$\eta_t\Big(\dot\xi_t - \frac{\dot\sigma_t}{\sigma_t}\xi_t\Big) = \xi_t\Big(\dot\eta_t - \frac{\dot\sigma_t}{\sigma_t}\eta_t\Big), \qquad (35)$$
which cannot hold (unless $\xi_t \equiv 0$, which falls back to the uncoupled case): because of the boundary conditions $\xi_0 = \eta_1 = 1$ and $\xi_1 = \eta_0 = 0$, we cannot have $\frac{\mathrm{d}\log\xi_t}{\mathrm{d}t} = \frac{\mathrm{d}\log\eta_t}{\mathrm{d}t}$ for all $t \in (0, 1)$. It can be observed that the reason this guidance does not apply to coupled paths is that $x_0$, $x_1$, and $x_t$ are all independent variables here, preventing us from canceling $x_0$ to match the schedulers' ratio.

A.5. Proof of Proposition 3.5

We prove Proposition 3.5 here.

Proposition A.4. Any marginal variable $f(x_t, t) := \mathbb{E}_{z\sim p_t(z\mid x_t)}[f_{t\mid z}(x_t, z, t)]$, $z = (x_0, x_1)$, has an intractable marginal loss
$$L = \mathbb{E}_{x_t\sim p(x_t)}\Big[\big\|f_\theta(x_t, t) - \mathbb{E}_{z\sim p_t(z\mid x_t)}[f_{t\mid z}(x_t, z, t)]\big\|_2^2\Big] \qquad (36)$$
whose gradient is identical to that of the tractable conditional loss
$$L_{t\mid z} = \mathbb{E}_{x_t, z\sim p(x_t, z)}\Big[\big\|f_\theta(x_t, t) - f_{t\mid z}(x_t, z, t)\big\|_2^2\Big]. \qquad (37)$$

Proof. Expand Eq. (36) and take the gradient with respect to $\theta$:
$$\begin{aligned}
\nabla_\theta L &= \nabla_\theta\,\mathbb{E}_{x_t\sim p(x_t)}\Big[\big\|f_\theta(x_t, t) - \mathbb{E}_{z\sim p_t(z\mid x_t)}[f(x_t, z, t)]\big\|_2^2\Big]\\
&= \int p(x_t)\,\nabla_\theta\Big(\|f_\theta(x_t, t)\|_2^2 - 2\big\langle f_\theta(x_t, t),\ \textstyle\int p_t(z\mid x_t)\,f(x_t, z, t)\,\mathrm{d}z\big\rangle\Big)\,\mathrm{d}x_t\\
&= \iint p_t(z\mid x_t)\,p(x_t)\,\nabla_\theta\Big(\|f_\theta(x_t, t)\|_2^2 - 2\langle f_\theta(x_t, t), f(x_t, z, t)\rangle\Big)\,\mathrm{d}x_t\,\mathrm{d}z\\
&= \nabla_\theta\,\mathbb{E}_{z, x_t\sim p_t(z\mid x_t)p(x_t)}\Big[\big\|f_\theta(x_t, t) - f(x_t, z, t)\big\|^2\Big] = \nabla_\theta L_{t\mid z}, \qquad (38)
\end{aligned}$$
where terms independent of $\theta$ are freely added or dropped. Thus, the gradient of the marginal loss is identical to the gradient of the conditional loss.

A.6. Other Ways to Obtain $Z_\phi$

Lu et al. (2023) proposed contrastive learning to train $Z_\phi$. The proof already applies to any uncoupled path, and we show that $Z_t$ does not depend on the coupling:
$$Z_t = \mathbb{E}_{(x_0, x_1)\sim p(x_0, x_1\mid x_t)}[e^{-J(x_1)}] = \int e^{-J(x_1)}\,p(x_0\mid x_1, x_t)\,p(x_1\mid x_t)\,\mathrm{d}x_0\,\mathrm{d}x_1 \qquad (39)$$
$$= \int e^{-J(x_1)}\,p(x_1\mid x_t)\,\mathrm{d}x_1 = \mathbb{E}_{x_1\sim p(x_1\mid x_t)}[e^{-J(x_1)}]. \qquad (40)$$
That is, instead of sampling from $p(x_0, x_1\mid x_t)$, sampling from $p(x_1\mid x_t)$ yields the same $Z_t$. In the coupled-path case, the marginalized distribution is identical to the uncoupled case, so the contrastive learning method can readily be applied to train $Z_\phi$.

Besides a trained $Z_\phi$, we can also use Monte Carlo estimation to obtain $Z_t$. Notice that by importance sampling,
$$Z_t = \mathbb{E}_{z\sim p(x_0, x_1\mid x_t)}[e^{-J(x_1)}] = \mathbb{E}_{z\sim p(x_0, x_1)}\Big[\frac{p(x_t\mid x_0, x_1)}{p(x_t)}\,e^{-J(x_1)}\Big]. \qquad (41)$$
As long as $p(x_t\mid x_0, x_1)$ is known (which is often the case (Lipman et al., 2023; Tong et al., 2024)), we can estimate $Z_t$ by sampling $N$ pairs $(x_0^i, x_1^i)$ from $p(x_0, x_1)$ and computing the self-normalized estimate
$$\hat Z_t = \sum_{i=1}^N \frac{p(x_t\mid x_0^i, x_1^i)}{\sum_{j=1}^N p(x_t\mid x_0^j, x_1^j)}\,e^{-J(x_1^i)}. \qquad (42)$$
A similar technique is used in Section 3.2.
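A minimal sketch of the Monte Carlo estimator in Eqs. (41)-(42), written in self-normalized form; `log_pt` and `J` are placeholder callables for the known conditional log-density $\log p_t(x_t\mid x_0, x_1)$ and the objective.

```python
import torch

def estimate_Zt(xt, t, z_samples, log_pt, J):
    """Self-normalized importance-sampling estimate of
    Z_t(x_t) = E_{z ~ p(z|x_t)}[exp(-J(x_1))] using z = (x0, x1) ~ p(z).
    z_samples: tuple (x0s, x1s), each of shape (N, d);
    log_pt(xt, x0s, x1s, t) -> (N,) conditional log-densities."""
    x0s, x1s = z_samples
    logw = log_pt(xt, x0s, x1s, t)     # log p_t(x_t | z^i), known in closed form
    w = torch.softmax(logw, dim=0)     # weights proportional to p(z^i | x_t)
    return (w * torch.exp(-J(x1s))).sum()
```

The softmax normalization cancels the unknown $p(x_t)$ in Eq. (41), matching the ratio form of Eq. (42).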
A.7. Guidance Matching Losses

Here, we prove that the loss in guidance matching is correct and show that there are three other equivalent training losses ℓVGM, ℓRGM, ℓMRGM. The expressions of the different losses are summarized below, and their proofs follow.

VF-added Guidance Matching (VGM) Loss. Substituting the learned VF $v_\theta(x_t, t)$ into Eq. (11), we have
$$\ell^{\mathrm{VGM}}_\phi = \Big\|g_\phi(x_t, t) + v_\theta(x_t, t) - \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,v_{t\mid z}(x_t\mid z)\Big\|_2^2. \qquad (43)$$

Reweight Guidance Matching (RGM) Loss. $\ell^{\mathrm{VGM}}_\phi$ can further be shown equivalent to
$$\ell^{\mathrm{RGM}}_\phi = \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,\big\|g_\phi(x_t, t) + v_\theta(x_t, t) - v_{t\mid z}(x_t\mid z)\big\|_2^2. \qquad (44)$$
This training loss steers $g_\phi$ towards regions where $e^{-J(x_1)}$ is large by assigning them a larger loss weight.

Marginalized Reweight Guidance Matching (MRGM) Loss. The above loss can be re-weighted while preserving the same optimal $g_\phi(x_t, t)$ as in Eq. (11). Specifically, replacing $Z_{\phi,\mathrm{sg}}(x_t, t)$ by its expectation under $p_t(x_t)$ yields the equivalent loss
$$\ell^{\mathrm{MRGM}}_\phi = \frac{e^{-J(x_1)}}{Z}\,\big\|g_\phi(x_t, t) + v_\theta(x_t, t) - v_{t\mid z}(x_t\mid z)\big\|_2^2, \qquad (45)$$
where $Z = \mathbb{E}_{x_1\sim p(x_1)}[e^{-J(x_1)}]$. $\ell^{\mathrm{MRGM}}_\phi$ is identical to a recently proposed fine-tuning loss in Zhang et al. (2025). It can also be derived via importance sampling in Eq. (10), and similar reweighting-based fine-tuning losses have been studied in the diffusion model literature (Fan et al., 2023).

(1) Guidance Matching Loss ℓGM. By Proposition 3.5, the conditional loss
$$L^{\mathrm{GM}}_\phi = \mathbb{E}_{t\sim U(0,1),\,z\sim p(z),\,x_t\sim p(x_t\mid z)}\underbrace{\Big\|g_\phi(x_t, t) - \Big(\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)} - 1\Big)v_{t\mid z}(x_t\mid z)\Big\|^2}_{=\ell^{\mathrm{GM}}} \qquad (46)$$
has a gradient equal to that of the marginal loss
$$\mathbb{E}_{t\sim U(0,1),\,x_t\sim p(x_t)}\Big\|g_\phi(x_t, t) - \underbrace{\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\Big(\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)} - 1\Big)v_{t\mid z}(x_t\mid z)\Big]}_{=g_t(x_t)}\Big\|^2. \qquad (47)$$
Therefore, minimizing $L^{\mathrm{GM}}_\phi$ trains $g_\phi$ to match $g_t$. Recall that $L$ in Eq. (11) is identical to $L^{\mathrm{GM}}_\phi$, which proves the validity of guidance matching training.

(2) VF-added Guidance Matching Loss ℓVGM. Substituting the learned VF $v_\theta(x_t, t)$ into Eq. (46), we show that
$$L^{\mathrm{VGM}}_\phi = \mathbb{E}_{t\sim U(0,1),\,z\sim p(z),\,x_t\sim p(x_t\mid z)}\Big\|g_\phi(x_t, t) + v_\theta(x_t, t) - \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,v_{t\mid z}(x_t\mid z)\Big\|^2 \qquad (48)$$
has a gradient equal to that of $L^{\mathrm{GM}}_\phi$ in Eq. (46). Expanding Eq. (46):
$$\begin{aligned}
L^{\mathrm{GM}}_\phi = \mathbb{E}_{t,z,x_t}\Big[&\underbrace{\|g_\phi\|_2^2}_{\text{dependent on }\phi} + \Big\|\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\|_2^2 + \|v_{t\mid z}\|_2^2 \underbrace{- 2\Big\langle g_\phi, \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\rangle}_{\text{dependent on }\phi}\\
&- 2\Big\langle \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z},\ v_{t\mid z}\Big\rangle \underbrace{+ 2\langle v_{t\mid z}, g_\phi\rangle}_{\text{dependent on }\phi}\Big]. \qquad (49)
\end{aligned}$$
After taking the gradient with respect to $\phi$, only the terms
$$\nabla_\phi\,\mathbb{E}_{t,z,x_t}\Big[\|g_\phi\|_2^2 - 2\Big\langle g_\phi, \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\rangle + 2\langle v_{t\mid z}, g_\phi\rangle\Big] \qquad (50)$$
survive. Assuming a perfectly learned $v_\theta(x_t, t)$, i.e.,
$$v_\theta(x_t, t) = \mathbb{E}_{z\sim p(z\mid x_t)}\big[v_{t\mid z}(x_t\mid z)\big], \qquad (51)$$
we have, by the tower property,
$$\mathbb{E}_{t,z\sim p(z),x_t\sim p(x_t\mid z)}\langle v_{t\mid z}(x_t\mid z), g_\phi(x_t, t)\rangle = \mathbb{E}_{t,x_t\sim p(x_t)}\big\langle \mathbb{E}_{z\sim p(z\mid x_t)}[v_{t\mid z}(x_t\mid z)],\ g_\phi(x_t, t)\big\rangle = \mathbb{E}_{t,x_t\sim p(x_t)}\langle v_\theta(x_t, t), g_\phi(x_t, t)\rangle, \qquad (52)$$
so by adding back terms whose gradient vanishes, the new loss in Eq. (48) is equivalent to Eq. (46):
$$\begin{aligned}
\nabla_\phi L^{\mathrm{GM}}_\phi &= \nabla_\phi\,\mathbb{E}_{t,z,x_t}\Big[\|g_\phi\|_2^2 + \Big\|\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\|_2^2 + \underbrace{\|v_\theta\|_2^2}_{\text{vanishes under }\nabla_\phi} - 2\Big\langle g_\phi, \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\rangle - 2\Big\langle \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z},\ v_{t\mid z}\Big\rangle \underbrace{+ 2\langle v_\theta, g_\phi\rangle}_{\text{changed } v_{t\mid z}\text{ to } v_\theta \text{ via Eq. (52)}}\Big] \qquad (53)\\
&= \nabla_\phi\,\mathbb{E}_{t,z,x_t}\underbrace{\Big\|g_\phi(x_t, t) + v_\theta(x_t, t) - \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,v_{t\mid z}(x_t\mid z)\Big\|^2}_{:=\ell^{\mathrm{VGM}}} = \nabla_\phi L^{\mathrm{VGM}}_\phi. \qquad (54)
\end{aligned}$$
(3) Reweighted Guidance Matching Loss ℓRGM. Replacing $\ell^{\mathrm{GM}}$ in Eq. (46) with
$$\ell^{\mathrm{RGM}}_\phi = \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,\big\|g_\phi(x_t, t) + v_\theta(x_t, t) - v_{t\mid z}(x_t\mid z)\big\|_2^2, \qquad (55)$$
the loss $L^{\mathrm{GM}}_\phi$ becomes $L^{\mathrm{RGM}}_\phi$; we show these are equivalent in the following. Starting from Eq. (49), we can extract the factor $\frac{e^{-J(x_1)}}{Z_\phi}$ from the three terms that depend on $\phi$. Notice that because $Z_\phi = \mathbb{E}_{z\sim p(z\mid x_t)}[e^{-J(x_1)}]$,
$$\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\frac{e^{-J(x_1)}}{Z_\phi(x_t)}\,f(x_t, t)\Big] = f(x_t, t) \qquad (56)$$
for any $f$ that does not depend on $z$. Thus,
$$\begin{aligned}
\nabla_\phi L^{\mathrm{GM}}_\phi &= \nabla_\phi\,\mathbb{E}_{t,z,x_t}\Big[\frac{e^{-J(x_1)}}{Z_\phi(x_t)}\|g_\phi\|_2^2 + \Big\|\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\|_2^2 + \|v_{t\mid z}\|_2^2 - 2\Big\langle g_\phi, \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z}\Big\rangle\\
&\qquad\qquad\qquad - 2\Big\langle \frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}}\,v_{t\mid z},\ v_{t\mid z}\Big\rangle + 2\,\frac{e^{-J(x_1)}}{Z_\phi(x_t)}\langle v_\theta, g_\phi\rangle\Big] \qquad (57)\\
&= \nabla_\phi\,\mathbb{E}_{t,z,x_t}\underbrace{\Big[\frac{e^{-J(x_1)}}{Z_{\phi,\mathrm{sg}}(x_t, t)}\,\|g_\phi + v_\theta - v_{t\mid z}\|_2^2\Big]}_{=\ell^{\mathrm{RGM}}} = \nabla_\phi L^{\mathrm{RGM}}_\phi, \qquad (58)
\end{aligned}$$
where we used the conclusion of Eq. (52) and inserted terms that vanish under $\nabla_\phi$ to complete $\ell^{\mathrm{RGM}}$.

(4) Marginalized Reweighted Guidance Matching Loss ℓMRGM. The loss in Eq. (55) can be re-weighted while preserving the same optimal $g_\phi(x_t, t)$ as in Eq. (11). Specifically, replacing $\frac{1}{Z_{\phi,\mathrm{sg}}(x_t, t)}$ with $\frac{1}{\mathbb{E}_{x_t\sim p(x_t)}[Z_{\phi,\mathrm{sg}}(x_t, t)]} = \frac{1}{\int p(x_t)\,Z_{\phi,\mathrm{sg}}(x_t, t)\,\mathrm{d}x_t}$ yields
$$\frac{e^{-J(x_1)}}{\int e^{-J(x_1)}\,p(z)\,\mathrm{d}z}\,\big\|g_\phi(x_t, t) + v_\theta(x_t, t) - v_{t\mid z}(x_t\mid z)\big\|_2^2. \qquad (59)$$
We only need to prove that
$$\int p(x_t)\,Z_t(x_t)\,\mathrm{d}x_t = \int e^{-J(x_1)}\,p(z)\,\mathrm{d}z. \qquad (60)$$
Recall that
$$Z_t(x_t) = \int p(z\mid x_t)\,e^{-J(x_1)}\,\mathrm{d}z, \qquad (61)$$
so
$$\int p(x_t)\,Z_t(x_t)\,\mathrm{d}x_t = \iint p(x_t)\,p(z\mid x_t)\,e^{-J(x_1)}\,\mathrm{d}x_t\,\mathrm{d}z \qquad (62)$$
$$= \int p(z)\,e^{-J(x_1)}\,\mathrm{d}z = Z. \qquad (63)$$
Eq. (45) can also be derived by applying importance sampling, $\mathbb{E}_{z\sim \frac{1}{Z}p(z)e^{-J(x_1)}}\big[\|\cdot\|_2^2\big] = \mathbb{E}_{z\sim p(z)}\big[\frac{e^{-J(x_1)}}{Z}\|\cdot\|_2^2\big]$, to the flow matching objective of the new VF for the new target distribution $p'(x_1) = \frac{1}{Z}p(x_1)e^{-J(x_1)}$.

Discussions. The losses share the same expected gradient, but their practical performance may differ. Among the four losses, $\ell^{\mathrm{MRGM}}_\phi$ is the only one that does not require the auxiliary model $Z_\phi$. $\ell^{\mathrm{RGM}}_\phi$, in contrast, assigns a loss weight dependent on $x_t$; the weight is emphasized when the expectation of $e^{-J(x_1)}$ under $p(x_1\mid x_t)$ is small. Compared with these two, $\ell^{\mathrm{GM}}_\phi$ and $\ell^{\mathrm{VGM}}_\phi$ do not reweight the loss: the variance of $\ell^{\mathrm{GM}}_\phi$ is smaller when $J$ is smooth, while $\ell^{\mathrm{VGM}}_\phi$ is preferable when $v_t$ is more complex.
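For reference, the following batched PyTorch sketch writes out the four losses side by side under the conventions above; it is an illustration, not the authors' training code, and `g_phi`, `v_theta`, `Z_phi`, `J`, and `Z_marg` are placeholder callables and constants.

```python
import torch

def guidance_matching_losses(g_phi, v_theta, Z_phi, xt, t, x1, v_cond, J, Z_marg):
    """Batched sketches of the four equivalent losses (Eqs. (43)-(46)); their
    expected gradients w.r.t. g_phi's parameters coincide. Z_phi plays the role
    of Z_{phi,sg} (hence the detach); Z_marg is the scalar E_{x1~p(x1)}[e^{-J(x1)}];
    v_cond is v_{t|z}(x_t|z) of shape (B, d)."""
    g = g_phi(xt, t)                                   # (B, d)
    v = v_theta(xt, t).detach()                        # pretrained VF, frozen
    r = (torch.exp(-J(x1)) / Z_phi(xt, t).detach()).unsqueeze(-1)   # e^{-J}/Z_sg
    l_gm   = ((g - (r - 1.0) * v_cond) ** 2).sum(-1).mean()         # Eq. (46)
    l_vgm  = ((g + v - r * v_cond) ** 2).sum(-1).mean()             # Eq. (43)
    l_rgm  = (r.squeeze(-1) * ((g + v - v_cond) ** 2).sum(-1)).mean()   # Eq. (44)
    w_mrgm = torch.exp(-J(x1)) / Z_marg
    l_mrgm = (w_mrgm * ((g + v - v_cond) ** 2).sum(-1)).mean()      # Eq. (45)
    return l_gm, l_vgm, l_rgm, l_mrgm
```

In practice one would pick a single loss per the discussion above rather than compute all four.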
A.8. Algorithm Details and Variants of g MC

The pseudocode for computing $g^{\mathrm{MC}}$ is as follows.

Algorithm 1 Monte Carlo estimation of the guidance $g_t(x_t)$
Require: current $t$, $x_t$, known $p_t(x_t\mid z)$.
1: Sample $z^i \sim p(z)$, $i = 1, 2, \ldots, N$  // recall $z^i = (x_0^i, x_1^i)$
2: $\hat p_t(x_t) \leftarrow \frac{1}{N}\sum_i p_t(x_t\mid z^i)$
3: $\hat Z_t(x_t) \leftarrow \frac{1}{N}\sum_i e^{-J(x_1^i)}\,\frac{p_t(x_t\mid z^i)}{\hat p_t(x_t)}$
4: $\hat g^{\mathrm{MC}}_t(x_t) \leftarrow \frac{1}{N}\sum_i \big(\frac{e^{-J(x_1^i)}}{\hat Z_t(x_t)} - 1\big)\,v_{t\mid z}(x_t\mid z^i)\,\frac{p_t(x_t\mid z^i)}{\hat p_t(x_t)}$
5: return $\hat g^{\mathrm{MC}}_t(x_t)$

Independent Couplings. Although we introduced flow matching using the condition $z = (x_0, x_1)$, the condition can also be chosen as $x_0$ or $x_1$ alone (Lipman et al., 2024). When $z := x_1$, Algorithm 1 can be readily adapted (and likewise for the $x_0$ condition). This reduces the dimensionality of the Monte Carlo integration region by half, making the estimation more efficient. In the case where $z = (x_0, x_1)$ and the data coupling is independent, $\pi(x_0\mid x_1) = p(x_0)$, we show here that the MC estimation simplifies to the more efficient $x_1$-conditioned case:
$$g^{\mathrm{MC}\text{-}x_1}_t(x_t) = \mathbb{E}_{x_1\sim p(x_1)}\Big[\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\,v_{t\mid x_1}(x_t\mid x_1)\,\frac{p_t(x_t\mid x_1)}{p_t(x_t)}\Big], \qquad (64)$$
$$Z^{\mathrm{MC}\text{-}x_1}_t(x_t) = \mathbb{E}_{x_1\sim p(x_1)}\Big[e^{-J(x_1)}\,\frac{p_t(x_t\mid x_1)}{p_t(x_t)}\Big]. \qquad (65)$$
Obviously, as $Z_t = \int p(x_0, x_1\mid x_t)\,e^{-J(x_1)}\,\mathrm{d}x_0\,\mathrm{d}x_1 = \int p(x_0\mid x_1, x_t)\,p(x_1\mid x_t)\,e^{-J(x_1)}\,\mathrm{d}x_0\,\mathrm{d}x_1$, integrating out $x_0$ gives $Z_t = Z^{\mathrm{MC}\text{-}x_1}_t$. Therefore, to prove the above simplification, we only need to prove that
$$\mathbb{E}_{(x_0, x_1)\sim p(x_0, x_1\mid x_t)}\Big[\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)v(x_t\mid x_0, x_1)\Big] = \mathbb{E}_{x_1\sim p(x_1\mid x_t)}\Big[\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)v(x_t\mid x_1)\Big]. \qquad (66)$$
The proof is simply integrating out $x_0$:
$$\begin{aligned}
&\int p(x_0, x_1\mid x_t)\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)v(x_t\mid x_0, x_1)\,\mathrm{d}x_0\,\mathrm{d}x_1\\
&= \int \underbrace{\Big(\int p(x_0\mid x_1, x_t)\,v(x_t\mid x_0, x_1)\,\mathrm{d}x_0\Big)}_{:=v(x_t\mid x_1)}\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)p(x_1\mid x_t)\,\mathrm{d}x_1\\
&= \int v(x_t\mid x_1)\Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)p(x_1\mid x_t)\,\mathrm{d}x_1. \qquad (67)
\end{aligned}$$
It should be noted that $v(x_t\mid x_1)$, as defined, is in general different from $v(x_t\mid x_0, x_1)$, and to perform MC estimation via importance sampling we need the forward probability path $p(x_t\mid x_1)$ to have a known density. The variance-reducing variant of $g^{\mathrm{MC}}$ is summarized in Algorithm 2.

Algorithm 2 Monte Carlo estimation of the guidance $g_t(x_t)$ ($x_1$-conditioned)
Require: current $t$, $x_t$, known $p_t(x_t\mid x_1)$.
1: Sample $x_1^i \sim p(x_1)$, $i = 1, 2, \ldots, N$
2: $\hat p_t(x_t) \leftarrow \frac{1}{N}\sum_i p_t(x_t\mid x_1^i)$
3: $\hat Z_t(x_t) \leftarrow \frac{1}{N}\sum_i e^{-J(x_1^i)}\,\frac{p_t(x_t\mid x_1^i)}{\hat p_t(x_t)}$
4: $\hat g_t(x_t) \leftarrow \frac{1}{N}\sum_i \big(\frac{e^{-J(x_1^i)}}{\hat Z_t(x_t)} - 1\big)\,v_{t\mid x_1}(x_t\mid x_1^i)\,\frac{p_t(x_t\mid x_1^i)}{\hat p_t(x_t)}$
5: return $\hat g_t(x_t)$
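A direct PyTorch transcription of Algorithm 2 might look as follows; this is a sketch assuming the conditional density $p_t(x_t\mid x_1)$ has a tractable log-density, with `log_pt`, `v_cond`, and `J` as placeholder callables. The weights are self-normalized, so the $1/N$ factors and the $\hat p_t(x_t)$ normalizer cancel.

```python
import torch

def g_mc_x1(xt, t, x1_bank, log_pt, v_cond, J):
    """x1-conditioned Monte Carlo guidance (Algorithm 2).
    x1_bank: (N, d) samples from p(x1); log_pt(xt, x1_bank, t) -> (N,);
    v_cond(xt, x1_bank, t) -> (N, d) conditional vector field."""
    logw = log_pt(xt, x1_bank, t)
    w = torch.softmax(logw, dim=0)       # p_t(x_t|x1^i) / sum_j p_t(x_t|x1^j)
    e = torch.exp(-J(x1_bank))           # e^{-J(x1^i)}
    Zt = (w * e).sum()                   # step 3 of Algorithm 2
    coef = w * (e / Zt - 1.0)
    return (coef[:, None] * v_cond(xt, x1_bank, t)).sum(dim=0)   # step 4
```

Working in log space before the softmax avoids the numerical underflow that plagues the raw density ratios at small $t$.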
A.9. Localized Approximation

To obtain $g^{\mathrm{local}}$, we presume $p(z\mid x_t)$ is localized and use a point estimate to approximate $Z_t$:
$$Z_t(x_t) = \int p(z\mid x_t)\,e^{-J(x_1)}\,\mathrm{d}z \approx e^{-J(\hat x_1)}, \qquad (68)$$
where $\hat x_1 := \mathbb{E}_{(x_0, x_1)\sim p(z\mid x_t)}[x_1]$; expanding $g_t$ to first order,
$$g_t(x_t) \approx g^{\mathrm{local}}_t(x_t) = \mathbb{E}_{z\sim p(z\mid x_t)}\Big[\Big(\frac{e^{-J(x_1)}}{e^{-J(\hat x_1)}} - 1\Big)v_{t\mid z}(x_t\mid z)\Big] \approx -\,\mathbb{E}_{z\sim p(z\mid x_t)}\big[v_{t\mid z}(x_t\mid z)\,(x_1 - \hat x_1)^\top\big]\,\nabla_{\hat x_1}J(\hat x_1). \qquad (69)$$
To quantify the approximation error, we have
$$\|\delta g\|^2 := \|g_t - g^{\mathrm{local}}_t\|_2^2 = \Big\|\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\Big(\frac{e^{-J(x_1)}}{Z_t(x_t)} - \frac{e^{-J(\hat x_1)}\big(1 - \nabla_{\hat x_1}J(\hat x_1)^\top(x_1 - \hat x_1)\big)}{e^{-J(\hat x_1)}}\Big)v_{t\mid z}(x_t\mid z)\Big]\Big\|^2, \qquad (70)$$
where
$$Z_t(x_t) = \mathbb{E}_{z\sim p(z\mid x_t)}[e^{-J(x_1)}]. \qquad (71)$$
We start by computing the error bound of approximating $Z_t$ with $e^{-J(\hat x_1)}$. Using the Taylor expansion with remainder (the notation here neglects the order of vector/matrix products; this does not matter since all terms are eventually bounded via the triangle inequality),
$$\big\|Z_t(x_t) - e^{-J(\hat x_1)}\big\|_2 \le \Big\|\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\tfrac{1}{2}(x_1 - \hat x_1)^\top\,\underbrace{\nabla^2_{x_1} e^{-J(x)}\big|_{x = \hat x_1 + t'(x_1 - \hat x_1)}}_{h^{(J)}_{t'}}\,(x_1 - \hat x_1)\Big]\Big\|, \qquad (72)$$
where $t' \in [0, 1]$. Setting $\sigma_1$ as the L2 norm of the covariance matrix $\Sigma_{11} := \mathbb{E}_{z\sim p(z\mid x_t)}[(x_1 - \hat x_1)(x_1 - \hat x_1)^\top]$ and $\lambda_h$ as the eigenvalue of largest absolute value of $\max_{t', x_1}|h^{(J)}_{t'}|$, we can show
$$\big\|Z_t(x_t) - e^{-J(\hat x_1)}\big\|_2 \le \big|\mathbb{E}_{z\sim p(z\mid x_t)}\big[(x_1 - \hat x_1)^\top h^{(J)}_{t'}(x_1 - \hat x_1)\big]\big| \le \lambda_h\,\mathbb{E}\big[(x_1 - \hat x_1)^\top(x_1 - \hat x_1)\big] = \lambda_h\,\mathrm{tr}\,\Sigma_{11} \le \lambda_h\sigma_1 d, \qquad (73)$$
where $d$ is the dimensionality of $x \in \mathbb{R}^d$. The last inequality follows from $\mathrm{tr}\,A = \sum_i \lambda_i \le n\max_i\lambda_i$, and the L2 norm of a covariance matrix is its largest eigenvalue.

Next, split $\delta g$ into the error of $Z_t$ and the second-order Taylor remainder of $e^{-J(x_1)}$ around $\hat x_1$:
$$\delta g = \mathbb{E}_{z\sim p(z\mid x_t)}\Big[\Big(\underbrace{\frac{e^{-J(x_1)}}{Z_t(x_t)} - \frac{e^{-J(x_1)}}{e^{-J(\hat x_1)}}}_{\text{error of } Z_t \approx e^{-J(\hat x_1)}} + \underbrace{\frac{e^{-J(x_1)} - e^{-J(\hat x_1)}\big(1 - \nabla J(\hat x_1)^\top(x_1 - \hat x_1)\big)}{e^{-J(\hat x_1)}}}_{= R_2 / e^{-J(\hat x_1)}}\Big)v_{t\mid z}\Big], \qquad (74{-}75)$$
where, by the Taylor remainder theorem again,
$$R_2 = \tfrac{1}{2}(x_1 - \hat x_1)^\top\,\nabla^2_\xi e^{-J(\xi)}\big|_{\xi = \hat x_1 + t'(x_1 - \hat x_1)}\,(x_1 - \hat x_1). \qquad (76)$$
Bounding the first part with Eq. (73) and the second with the Hessian bound of Eq. (72), and applying the Cauchy-Schwarz inequality, we obtain
$$\|\delta g\|_2^2 \le \frac{\lambda_h\sigma_1 d}{Z_t(x_t)\,e^{-J(\hat x_1)}}\,\big\|\mathbb{E}\big[e^{-J(x_1)}\,v(x_t\mid z)\big]\big\|_2 + \frac{\lambda_h\sigma_1 d}{2\,e^{-J(\hat x_1)}}\,\mathbb{E}\big[\|v(x_t\mid z)\|_2^2\big]^{1/2} \qquad (77)$$
$$= \lambda_h\sigma_1 d\,\Big(\underbrace{\frac{\big\|\mathbb{E}\big[e^{-J(x_1)}\,v(x_t\mid z)\big]\big\|_2}{\mathbb{E}\big[e^{-J(x_1)}\big]\,e^{-J(\mathbb{E}[x_1])}}}_{C_1} + \underbrace{\frac{\mathbb{E}\big[\|v(x_t\mid z)\|_2^2\big]^{1/2}}{2\,e^{-J(\mathbb{E}[x_1])}}}_{C_2}\Big), \qquad (78)$$
where we omit $z\sim p(z\mid x_t)$ in $\mathbb{E}_{z\sim p(z\mid x_t)}[\cdot]$ and write $\mathbb{E}[\cdot]$. Therefore, the approximation error of $g^{\mathrm{local}}$ is bounded by $\lambda_h\sigma_1 d\,(C_1 + C_2)$, where $\lambda_h$ is the largest eigenvalue of $h^{(J)}_t$, the Hessian of the objective weight $e^{-J}$; $\sigma_1$ is the L2 norm of the covariance matrix; $d$ is the sample dimensionality; $C_1$ is a constant related to the norm of the new VF; and $C_2$ is related to the variance of the original conditional VF; both $C_1$ and $C_2$ are inversely scaled by $e^{-J(\mathbb{E}[x_1])}$. Some intuitions can be emphasized:

1. The error is small when J is smooth, in which case the Hessian of $e^{-J}$ approaches zero. This corresponds to mild guidance, where the approximation-based $g^{\mathrm{local}}$ works well.

2. The error is small when $\sigma_1$ is small, i.e., when the covariance matrix $\Sigma_{11}$ has a small norm. This is the case as the flow time $t \to 1$ (and $\sigma_t \to 0$), where $x_t$ predicts $x_1$ well.

3. The magnitude of $e^{-J(\mathbb{E}[x_1])}$ reflects how well the estimated generated sample $\mathbb{E}[x_1]$ matches the objective J given the current $x_t$. If $\mathbb{E}[x_1]$ lies in a region where J is small, i.e., $\mathbb{E}[x_1]$ is a good sample, the approximate guidance is more accurate: the optimization is conducted locally, and the gradient reflects the landscape well. If $J(\mathbb{E}[x_1])$ is large, the gradient almost randomly explores the sample space, producing a larger approximation error.

4. Cases where $C_1$ and $C_2$ are small are not necessarily those where the guidance is more accurate: because of the small norm of the VF, an error in the guidance VF will likely cause a larger deviation due to the increased relative error.
A.10. Estimation of x̂1

Under the affine path assumption (Assumption 3.2), we can estimate the expectation of $x_1$ under the distribution $p(z\mid x_t)$. This is a well-known trick (Lipman et al., 2024; Pokle et al., 2024), but our analysis includes the dependent-coupling case. Since the flow matching model learns
$$v_\theta(x_t, t) \approx v_t(x_t) = \mathbb{E}_{z\sim p(z\mid x_t)}\big[v(x_t\mid z)\big], \qquad (79)$$
using the affine path assumption ($x_t = \alpha_t x_1 + \beta_t x_0 + \sigma_t\varepsilon$),
$$v(x_t\mid z) = \frac{\mathrm{d}}{\mathrm{d}t}x_t = \dot\alpha_t x_1 + \dot\beta_t x_0 + \dot\sigma_t\varepsilon, \qquad (80)$$
so
$$v_t(x_t) = \mathbb{E}_{z\sim p(z\mid x_t)}\big[\dot\alpha_t x_1 + \dot\beta_t x_0 + \dot\sigma_t\varepsilon\big]. \qquad (81)$$
Meanwhile, taking the expectation of $x_t$ under $p(z\mid x_t)$ yields
$$x_t = \mathbb{E}_{z\sim p(z\mid x_t)}[x_t] = \mathbb{E}_{z\sim p(z\mid x_t)}\big[\alpha_t x_1 + \beta_t x_0 + \sigma_t\varepsilon\big], \qquad (82)$$
where the first equality holds because $\int p(z\mid x_t)\,\mathrm{d}z = 1$. Then, using Eqs. (81) and (82), we can eliminate either $\hat x_0$ or $\hat x_1$ from each other's expression:
$$\hat x_0 := \mathbb{E}_{z\sim p(z\mid x_t)}[x_0] = \frac{\dot\alpha_t x_t - \alpha_t v_t(x_t)}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t} + \underbrace{\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\frac{\alpha_t\dot\sigma_t - \dot\alpha_t\sigma_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,\varepsilon\Big]}_{:=\zeta^0_t}, \qquad (83)$$
$$\hat x_1 := \mathbb{E}_{z\sim p(z\mid x_t)}[x_1] = \frac{\beta_t v_t(x_t) - \dot\beta_t x_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t} + \underbrace{\mathbb{E}_{z\sim p(z\mid x_t)}\Big[\frac{\beta_t\dot\sigma_t - \dot\beta_t\sigma_t}{\dot\beta_t\alpha_t - \dot\alpha_t\beta_t}\,\varepsilon\Big]}_{:=\zeta^1_t}. \qquad (84)$$
It should be noted that we have assumed $\sigma_t$ to be small, so that $\zeta^0_t$ and $\zeta^1_t$ are also small, in the affine path assumption (Assumption 3.2):
$$\zeta^0_t = \frac{\alpha_t\dot\sigma_t - \dot\alpha_t\sigma_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,\mathbb{E}_{z\sim p(z\mid x_t)}[\varepsilon] = \frac{\alpha_t\dot\sigma_t - \dot\alpha_t\sigma_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\int \frac{p(x_t\mid x_0, x_1)\,\pi(x_0, x_1)}{p(x_t)}\,\varepsilon\,\mathrm{d}x_0\,\mathrm{d}x_1 = \frac{\alpha_t\dot\sigma_t - \dot\alpha_t\sigma_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\int \frac{1}{p(x_t)}\,\mathbb{E}_{\varepsilon\sim p_\varepsilon(\varepsilon)}\big[\pi(x_0(x_t, x_1, \varepsilon), x_1)\,\varepsilon\big]\,\mathrm{d}x_1. \qquad (85)$$
Since $\pi(x_0(x_t, x_1, \varepsilon), x_1)$ is a probability density assumed to be bounded, we denote $\max_\varepsilon \pi(x_0(x_t, x_1, \varepsilon), x_1) \le M(x_1, x_t)$, and thus
$$\lim_{\sigma_t\to 0,\,\dot\sigma_t\to 0}\|\zeta^0_t\| \le \lim_{\sigma_t\to 0,\,\dot\sigma_t\to 0}\Big|\frac{\alpha_t\dot\sigma_t - \dot\alpha_t\sigma_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\Big|\int \frac{M(x_1, x_t)}{p(x_t)}\,\mathbb{E}_{\varepsilon\sim p_\varepsilon}\big[\|\varepsilon\|_2\big]\,\mathrm{d}x_1 = 0, \qquad (86)$$
since everything inside the integral is independent of $\varepsilon$, $x_0$, and $\sigma_t$; as $\sigma_t \to 0$, $\zeta^0_t$ simply converges to zero. A similar argument proves $\lim_{\sigma_t\to 0,\,\dot\sigma_t\to 0}\|\zeta^1_t\|_2 = 0$.

Next, we explain why we specifically care about the case where the small-$\sigma_t$ assumption holds. In independent-coupling flow matching, $\sigma_t\varepsilon$ is exactly zero, since any two of $x_t$, $x_0$, and $x_1$ determine the third. In dependent-coupling flow matching, the assumption also holds for popular methods such as optimal transport conditional flow matching and Schrödinger bridge conditional flow matching (Tong et al., 2024), where $\varepsilon\sim N(0, I)$ and $\sigma_t$ is a small constant. Therefore, the assumption that $\sigma_t\varepsilon$ is small in the affine path assumption is general and applies to many existing flow matching methods. Hence, approximating $\zeta^0_t$ and $\zeta^1_t$ by zero, we have the final estimates
$$\hat x_0 \approx \frac{\dot\alpha_t x_t - \alpha_t v_\theta(x_t, t)}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}, \qquad (87)$$
$$\hat x_1 \approx \frac{\beta_t v_\theta(x_t, t) - \dot\beta_t x_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}, \qquad (88)$$
where Eq. (88) is just Eq. (5). Note that the approximations become exact under Assumption 3.3.
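Eqs. (87)-(88) translate directly into code. The sketch below is our own illustration; `v_theta` is the learned VF, and the scheduler callables are placeholders for the chosen path.

```python
import torch

def predict_endpoints(xt, t, v_theta, alpha, beta, dalpha, dbeta):
    """Estimate x0_hat, x1_hat from the learned VF under the affine path
    x_t = alpha_t x_1 + beta_t x_0 in the small-sigma_t regime (Eqs. (87)-(88)).
    alpha/beta/dalpha/dbeta are scheduler functions of t."""
    v = v_theta(xt, t)
    a, b, da, db = alpha(t), beta(t), dalpha(t), dbeta(t)
    denom = da * b - db * a          # = 1 for alpha_t = t, beta_t = 1 - t
    x0_hat = (da * xt - a * v) / denom
    x1_hat = (b * v - db * xt) / denom
    return x0_hat, x1_hat
```

For the common choice $\alpha_t = t$, $\beta_t = 1 - t$ the denominator is 1, and Eq. (88) reduces to $\hat x_1 = x_t + (1 - t)\,v_\theta(x_t, t)$.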
A.11. Proof of gcov

Here, we prove that under the affine path assumption (Assumption 3.2), Eq. (6) holds:
$$g^{\mathrm{local}}_t \approx g^{\mathrm{cov}}_t = -\underbrace{\frac{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}{\beta_t}}_{\text{schedule}}\,\Sigma_{1\mid t}\,\nabla_{\hat x_1}J(\hat x_1), \qquad (89)$$
where
$$g^{\mathrm{local}}_t = -\,\mathbb{E}_{z\sim p(z\mid x_t)}\big[(x_1 - \hat x_1)\,v_{t\mid z}(x_t\mid z)^\top\big]^\top\nabla_{\hat x_1}J(\hat x_1). \qquad (90)$$
Under the affine path $x_t = \alpha_t x_1 + \beta_t x_0 + \sigma_t\varepsilon$, the conditional vector field follows
$$v_{t\mid z}(x_t\mid z) = \dot\alpha_t x_1 + \dot\beta_t x_0 + \dot\sigma_t\varepsilon. \qquad (91)$$
Plugging this into the definition of $g^{\mathrm{local}}$ and substituting $x_0$ with $x_1$, $\sigma_t\varepsilon$, and $x_t$:
$$\begin{aligned}
g^{\mathrm{local}}_t &= -\,\mathbb{E}_{z\sim p(z\mid x_t)}\Big[(x_1 - \hat x_1)\Big(\dot\alpha_t x_1 + \frac{\dot\beta_t}{\beta_t}(x_t - \alpha_t x_1 - \sigma_t\varepsilon) + \dot\sigma_t\varepsilon\Big)^{\!\top}\Big]^\top\nabla_{\hat x_1}J(\hat x_1)\\
&= -\,\mathbb{E}_{z\sim p(z\mid x_t)}\Big[(x_1 - \hat x_1)\Big(\frac{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}{\beta_t}\,x_1 + \frac{\dot\beta_t}{\beta_t}\,x_t + \Big(\dot\sigma_t - \frac{\dot\beta_t}{\beta_t}\sigma_t\Big)\varepsilon\Big)^{\!\top}\Big]^\top\nabla_{\hat x_1}J(\hat x_1)\\
&= -\frac{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}{\beta_t}\,\Sigma_{1\mid t}\,\nabla_{\hat x_1}J(\hat x_1) + \underbrace{\Upsilon}_{\lim_{\sigma_t\to 0,\,\dot\sigma_t\to 0}\|\Upsilon\|_2^2 = 0}, \qquad (92)
\end{aligned}$$
where the $x_t$ term cancels because $\int p(z\mid x_t)\big(x_1 - \mathbb{E}_{z\sim p(z\mid x_t)}[x_1]\big)\,\mathrm{d}z = \int p(z\mid x_t)\,x_1\,\mathrm{d}z - \mathbb{E}_{z\sim p(z\mid x_t)}[x_1] = 0$, $\Sigma_{1\mid t} := \mathbb{E}_{z\sim p(z\mid x_t)}\big[(x_1 - \hat x_1)(x_1 - \hat x_1)^\top\big]$, and the residual term characterizing the approximation error in Eq. (6) (restated in Eq. (89)) is
$$\Upsilon = -\Big(\dot\sigma_t - \frac{\dot\beta_t}{\beta_t}\sigma_t\Big)\,\mathbb{E}_{z\sim p(z\mid x_t)}\big[(x_1 - \hat x_1)\,\varepsilon^\top\big]^\top\nabla_{\hat x_1}J(\hat x_1). \qquad (93)$$
Writing the expectation as an integral and bounding the coupling density $\pi(x_0\mid x_1)$ by $M(x_1, x_t)$ (a function independent of $\varepsilon$), all remaining factors are independent of $x_0$ and $\sigma_t$, so
$$\|\Upsilon\|_2^2 \le \Big|\dot\sigma_t - \frac{\dot\beta_t}{\beta_t}\sigma_t\Big|\;\underbrace{\big\|\nabla_{\hat x_1}J(\hat x_1)\big\|_2^2}_{:=G(x_t)}\int \underbrace{\Big\|\frac{p(x_1)}{p(x_t)}(x_1 - \hat x_1)\Big\|_2}_{:=Q(x_1, x_t)}\,M^2(x_1, x_t)\,\mathrm{Var}_{p_\varepsilon}\,\mathrm{d}x_1, \qquad (94)$$
and thus
$$\lim_{\sigma_t\to 0,\,\dot\sigma_t\to 0}\|\Upsilon\|_2^2 = 0. \qquad (95)$$

A.12. The Jacobian Trick

We prove the Jacobian trick here.

Proposition A.5 (The Jacobian trick). Under Assumption 3.3, the covariance matrix of $p(x_1\mid x_t)$, $\Sigma_{1\mid t}$, is affine in the Jacobian of the VF, $\frac{\partial v_t}{\partial x_t}$, and proportional to the Jacobian of $\hat x_1$:
$$\Sigma_{1\mid t} = \frac{\beta_t^2}{\alpha_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}\Big(-\dot\beta_t I + \beta_t\frac{\partial v_t}{\partial x_t}\Big) = \frac{\beta_t^2}{\alpha_t}\,\frac{\partial\hat x_1}{\partial x_t}.$$

To begin with, we prove $\Sigma_{1\mid t} = \frac{\beta_t^2}{\alpha_t}\frac{\partial\hat x_1}{\partial x_t}$. A similar conclusion was proved in Ye et al. (2024); we generalize their proof to affine Gaussian path flow matching. Recall from Eq. (88) that
$$\hat x_1 = \frac{-\dot\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,x_t + \frac{\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,v_t \qquad (96)$$
and
$$\hat x_0 = \frac{\dot\alpha_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,x_t - \frac{\alpha_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,v_t. \qquad (97)$$
So we have the Jacobian
$$\frac{\partial\hat x_1}{\partial x_t} = \frac{-\dot\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,I + \frac{\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,\frac{\partial v_t}{\partial x_t}, \qquad (98)$$
and, because the VF is associated with the score,
$$v_t(x_t) = \frac{\beta_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}{\alpha_t}\,\nabla_{x_t}\log p_t(x_t) + \frac{\dot\alpha_t}{\alpha_t}\,x_t. \qquad (99)$$
Next, we prove
$$\nabla^2_{x_t}\log p_t(x_t) = -\frac{1}{\beta_t^2}\,I + \frac{\alpha_t^2}{\beta_t^4}\,\Sigma_{x_1x_1}, \qquad (100)$$
which connects the derivative of the score $\nabla^2_{x_t}\log p(x_t)$ (we use $\nabla$ and $\nabla^2$ interchangeably, with a slight abuse of notation; this should not cause confusion since the sizes of the terms must match) with the covariance matrix $\Sigma_{x_1x_1}$:
$$\begin{aligned}
\nabla^2_{x_t}\log p(x_t) &= \frac{\nabla^2_{x_t}p(x_t)}{p(x_t)} - \nabla_{x_t}\log p(x_t)\,\nabla_{x_t}\log p(x_t)^\top\\
&= \frac{1}{p(x_t)}\int p(x_1)\,\underbrace{\nabla^2_{x_t}p(x_t\mid x_1)}_{\text{using }\nabla^2 p\, =\, p\,\nabla^2\log p\, +\, p\,(\nabla\log p)(\nabla\log p)^\top}\,\mathrm{d}x_1 - \nabla_{x_t}\log p(x_t)\,\nabla_{x_t}\log p(x_t)^\top\\
&= \mathbb{E}_{x_1\sim p(x_1\mid x_t)}\Big[\nabla^2_{x_t}\log p(x_t\mid x_1) + \nabla_{x_t}\log p(x_t\mid x_1)\,\nabla_{x_t}\log p(x_t\mid x_1)^\top\Big] - \nabla_{x_t}\log p(x_t)\,\nabla_{x_t}\log p(x_t)^\top\\
&= -\frac{1}{\beta_t^2}\,I + \frac{\alpha_t^2}{\beta_t^4}\Big(\mathbb{E}[x_1x_1^\top] - \mathbb{E}[x_1]\,\mathbb{E}[x_1]^\top\Big) = -\frac{1}{\beta_t^2}\,I + \frac{\alpha_t^2}{\beta_t^4}\,\Sigma_{x_1x_1}. \qquad (101)
\end{aligned}$$
Then, combining Eqs. (99) and (100),
$$\frac{\partial\hat x_1}{\partial x_t} = \frac{-\dot\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\,I + \frac{\beta_t}{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}\Big(\frac{\beta_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}{\alpha_t}\Big(-\frac{1}{\beta_t^2}\,I + \frac{\alpha_t^2}{\beta_t^4}\,\Sigma_{x_1x_1}\Big) + \frac{\dot\alpha_t}{\alpha_t}\,I\Big) = \frac{\alpha_t}{\beta_t^2}\,\Sigma_{x_1x_1}. \qquad (102)$$
Inserting back Eq. (98) proves
$$\Sigma_{1\mid t} = \frac{\beta_t^2}{\alpha_t}\,\frac{\partial\hat x_1}{\partial x_t} = \frac{\beta_t^2}{\alpha_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}\Big(-\dot\beta_t I + \beta_t\frac{\partial v_t}{\partial x_t}\Big). \qquad (103)$$
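Proposition A.5 means the posterior covariance is available from the trained model by automatic differentiation. A minimal sketch, assuming a single flattened sample and an `x1_hat_fn` such as the `predict_endpoints` wrapper above; the dense Jacobian is $O(d^2)$, so in practice one would rely on vector-Jacobian products instead.

```python
import torch

def sigma_1_given_t(xt, t, x1_hat_fn, alpha, beta):
    """Jacobian trick (Proposition A.5):
    Sigma_{1|t} = (beta_t**2 / alpha_t) * d x1_hat / d x_t.
    xt: tensor of shape (d,); x1_hat_fn(x, t) -> (d,)."""
    jac = torch.autograd.functional.jacobian(lambda x: x1_hat_fn(x, t), xt)  # (d, d)
    return (beta(t) ** 2 / alpha(t)) * jac
```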
A.13. Proof for gsim-inv

We begin with
$$g^{\mathrm{sim\text{-}inv}}_t(x_t) = \int \Big(\frac{e^{-J(x_1)}}{Z_t} - 1\Big)\,v_{t\mid z}(x_t\mid z)\,p(z\mid x_t)\,\mathrm{d}z, \qquad (104)$$
where
$$Z_t = \int e^{-J(x_1)}\,p(z\mid x_t)\,\mathrm{d}z, \qquad (105)$$
and approximate $p(z\mid x_t)$ with a Gaussian $N(z; \hat z, \Sigma_t)$, where $\hat z = (\hat x_0, \hat x_1)$ is the expectation of $z$ under $p(z\mid x_t)$ and $\Sigma_t$ is assumed known. With Assumption 3.2 and assuming $e^{-J(x_1)} = N(Hx_1; y, \sigma_y^2 I)$, we have
$$Z_t = \int e^{-J(x_1)}\,p(z\mid x_t)\,\mathrm{d}z \propto \int \exp\Big(-\frac{1}{2\sigma_y^2}\|y - Hx_1\|_2^2 - \frac{1}{2}(z - \hat z)^\top\Sigma_t^{-1}(z - \hat z)\Big)\,\mathrm{d}z. \qquad (106)$$
Note that $H$ operates on $x_1$ only; we pad the blocks related to $x_0$ with zeros in $H$. Then, inserting the Gaussian approximation,
$$g^{\mathrm{sim\text{-}inv}}_t(x_t) \approx \int \frac{1}{Z_t}\underbrace{\exp\Big(-\frac{1}{2\sigma_y^2}\|y - Hx_1\|_2^2 - \frac{1}{2}(z - \hat z)^\top\Sigma_t^{-1}(z - \hat z)\Big)}_{\propto\,\tilde p(z\mid x_t)}\,(\dot\alpha_t x_1 + \dot\beta_t x_0)\,\mathrm{d}z \;-\; v_t(x_t). \qquad (108)$$

Remark A.6. Note that $\Sigma_t^{-1}$ couples $x_0$ and $x_1$. This is a fundamental feature of dependent couplings $\pi(x_0, x_1)$. It may seem tempting to further assume that $\Sigma_t^{-1}$ is diagonal or even a scalar, but that assumption completely discards the dependency between $x_0$ and $x_1$ in the coupling, so we avoid it in the dependent-coupling case.

For clarity, we express $\Sigma_t^{-1}$ in blocks:
$$\Sigma_t^{-1} = \begin{pmatrix}\Xi_{00} & \Xi_{01}\\ \Xi_{10} & \Xi_{11}\end{pmatrix}. \qquad (109)$$
Then the distribution $\propto \exp\big(-\frac{1}{2\sigma_y^2}\|y - Hx_1\|_2^2 - \frac{1}{2}(z - \hat z)^\top\Sigma_t^{-1}(z - \hat z)\big)$ is still Gaussian, and to estimate the expectation of $z = (x_0, x_1)$ we simply bring its density into standard form:
$$Z_t\,\tilde p(z\mid x_t) = \exp\Big(-\frac{1}{2}z^\top\Big(\frac{H^\top H}{\sigma_y^2} + \Sigma_t^{-1}\Big)z + \Big\langle z,\ \frac{H^\top y}{\sigma_y^2} + \Sigma_t^{-1}\hat z\Big\rangle - \frac{1}{2\sigma_y^2}\|y\|^2 - \frac{1}{2}\hat z^\top\Sigma_t^{-1}\hat z\Big), \qquad (110)$$
where we used $\Sigma_t^{-1} = (\Sigma_t^{-1})^\top$. It is obvious that the mean of this Gaussian is
$$\mu = \underbrace{\Big(\frac{H^\top H}{\sigma_y^2} + \Sigma_t^{-1}\Big)^{-1}}_{:=P^{-1}}\Big(\frac{H^\top y}{\sigma_y^2} + \Sigma_t^{-1}\hat z\Big), \qquad (111)$$
where the blocks of $P$ are
$$P = \begin{pmatrix}\Xi_{00} & \Xi_{01}\\ \Xi_{10} & \Xi_{11} + \frac{H^\top H}{\sigma_y^2}\end{pmatrix}. \qquad (112)$$
Then, by computing $\mu = (\hat{\hat x}_0, \hat{\hat x}_1)$, where $\hat{\hat x}_0$ and $\hat{\hat x}_1$ are the $x_0$ and $x_1$ means in the integral of Eq. (108), we can compute $g^{\mathrm{sim\text{-}inv}}_t + v_t$, because
$$g^{\mathrm{sim\text{-}inv}}_t + v_t = \mathbb{E}_{z\sim\tilde p(z\mid x_t)}\big[\dot\alpha_t x_1 + \dot\beta_t x_0\big]. \qquad (113)$$
To simplify, insert back $v_t = \mathbb{E}_{z\sim p(z\mid x_t)}[\dot\alpha_t x_1 + \dot\beta_t x_0]$ to get
$$g^{\mathrm{sim\text{-}inv}}_t = \mathbb{E}_{\tilde p}\big[\dot\alpha_t x_1 + \dot\beta_t x_0\big] - \mathbb{E}_{p}\big[\dot\alpha_t x_1 + \dot\beta_t x_0\big] = \begin{pmatrix}\dot\beta_t I & \dot\alpha_t I\end{pmatrix}P^{-1}\begin{pmatrix}0\\ \frac{H^\top}{\sigma_y^2}\end{pmatrix}(y - H\hat x_1). \qquad (114)$$
Usually, $\big(\dot\beta_t I\ \ \dot\alpha_t I\big)P^{-1}$ is difficult to obtain:
$$g^{\mathrm{sim\text{-}inv}}_t = \Big(\dot\beta_t\,(P^{-1})_{01} + \dot\alpha_t\,(P^{-1})_{11}\Big)\frac{H^\top}{\sigma_y^2}\,(y - H\hat x_1), \qquad (115)$$
where $(P^{-1})_{01}$ and $(P^{-1})_{11}$ require computing the inverse of $P$ and are thus in general intractable. Using block matrix inversion, we have
$$g^{\mathrm{sim\text{-}inv}}_t = \Big(-\dot\beta_t\,\Xi_{00}^{-1}\Xi_{01}\Big(\Xi_{11} + \frac{H^\top H}{\sigma_y^2} - \Xi_{10}\Xi_{00}^{-1}\Xi_{01}\Big)^{-1} + \dot\alpha_t\Big(\Xi_{11} + \frac{H^\top H}{\sigma_y^2} - \Xi_{10}\Xi_{00}^{-1}\Xi_{01}\Big)^{-1}\Big)\frac{H^\top}{\sigma_y^2}\,(y - H\hat x_1). \qquad (116)$$
For general (possibly coupled) affine path flow matching, we can make approximations and set the blocks of $\Sigma_t^{-1}$ to scalars. It should be noted that this Gaussian assumption can still capture some coupling between $x_0$ and $x_1$, since the off-diagonal blocks $\Xi_{01}$ and $\Xi_{10}$ are not set to zero. Specifically, we have
$$g^{\mathrm{sim\text{-}inv\text{-}A}}_t = \lambda_t\Big(\frac{\sigma_y^2}{r_t^2}\,I + H^\top H\Big)^{-1}H^\top(y - H\hat x_1), \qquad (117)$$
where $\lambda_t$ and $r_t^2$ are hyperparameters; $\lambda_t$ approximates the scalarized $\dot\alpha_t - \dot\beta_t\,\Xi_{00}^{-1}\Xi_{01}$ factor, absorbing the flow schedule.
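Eq. (117) is cheap to apply when the measurement matrix is small or structured. The following sketch implements it with a dense $H$ for clarity; a matrix-free linear solve would replace the explicit system in realistic image sizes, and `lam_t`, `r_t` are the tuning hyperparameters named in the text.

```python
import torch

def g_sim_inv_A(x1_hat, H, y, sigma_y, r_t, lam_t):
    """Approximate inverse-problem guidance of Eq. (117):
    g = lam_t * (sigma_y^2 / r_t^2 * I + H^T H)^{-1} H^T (y - H x1_hat).
    H: (m, d) measurement matrix; x1_hat: (d,) endpoint prediction."""
    d = H.shape[1]
    A = (sigma_y**2 / r_t**2) * torch.eye(d, dtype=H.dtype) + H.T @ H
    residual = H.T @ (y - H @ x1_hat)
    return lam_t * torch.linalg.solve(A, residual)
```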
Special case: the uncoupled affine Gaussian path. Next, we prove that $g^{\mathrm{sim\text{-}inv}}_t$ covers ΠGDM (Song et al., 2023a) and OT-ODE (Pokle et al., 2024) as special cases. Under the uncoupled affine Gaussian path assumption (Assumption 3.3), one may think that the covariance matrix is block diagonal, but this is false: $x_0$ and $x_1$ are still dependent in the distribution $p(z\mid x_t) = p(x_0, x_1\mid x_t)$, even if the coupling is independent. In the uncoupled case, the probabilistic graph is $x_0 \to x_t \leftarrow x_1$: although $x_0$ and $x_1$ are marginally independent ($\pi(x_0, x_1) = p(x_0)p(x_1)$), their conditional can be dependent, $p(x_0, x_1\mid x_t) \neq p(x_0\mid x_t)\,p(x_1\mid x_t)$. We then notice that the uncoupled path is
$$x_t = \alpha_t x_1 + \beta_t x_0, \qquad (118)$$
so we actually should not have approximated the degenerate $p(z\mid x_t)$ by a Gaussian in the uncoupled case. Fortunately, there is a workaround that makes $x_0$ almost entirely dependent on $x_1$: set $x_0 = \frac{x_t - \alpha_t x_1}{\beta_t} + \xi\epsilon$ with $\epsilon \sim N(0, I)$, and letting $\xi \to 0$ recovers the desired uncoupled path. The covariance matrix of $(x_0, x_1)$ given $x_t$ becomes
$$\Sigma_t = \begin{pmatrix}\frac{\alpha_t^2}{\beta_t^2}\,\Sigma_{x_1x_1} + \xi^2 I & -\frac{\alpha_t}{\beta_t}\,\Sigma_{x_1x_1}\\ -\frac{\alpha_t}{\beta_t}\,\Sigma_{x_1x_1} & \Sigma_{x_1x_1}\end{pmatrix}. \qquad (119)$$
Note that $\Sigma_{x_1x_1}^{-1} \neq \Xi_{11}$ in general, as $\Xi_{11}$ is a block of the inverse of the larger matrix $\Sigma_t$. Computing $\Sigma_t^{-1}$ (with $a = \alpha_t/\beta_t$):
$$\Xi_{00} = \frac{1}{\xi^2}\,I, \qquad \Xi_{01} = \Xi_{10} = \frac{\alpha_t}{\beta_t\xi^2}\,I, \qquad \Xi_{11} = \Big(\Sigma_{x_1x_1} - \Sigma_{x_1x_1}\Big(\Sigma_{x_1x_1} + \frac{\beta_t^2}{\alpha_t^2}\xi^2 I\Big)^{-1}\Sigma_{x_1x_1}\Big)^{-1}. \qquad (120)$$
Thus, a more detailed calculation is needed:
$$g^{\mathrm{sim\text{-}inv\text{-}diffusion}}_t = \Big(-\dot\beta_t\,\underbrace{\Xi_{00}^{-1}\Xi_{01}}_{=\,\frac{\alpha_t}{\beta_t}I} + \dot\alpha_t\Big)\Big(\underbrace{\Xi_{11} - \Xi_{10}\Xi_{00}^{-1}\Xi_{01}}_{\text{finite as }\xi\to 0} + \frac{H^\top H}{\sigma_y^2}\Big)^{-1}\frac{H^\top}{\sigma_y^2}\,(y - H\hat x_1). \qquad (121)$$
Obviously, we want to find the finite term left in $\Xi_{11} - \Xi_{10}\Xi_{00}^{-1}\Xi_{01}$:
$$\lim_{\xi\to 0}\big(\Xi_{11} - \Xi_{10}\Xi_{00}^{-1}\Xi_{01}\big) = \lim_{\xi\to 0}\bigg(\Big(\Sigma_{x_1x_1} - \Sigma_{x_1x_1}\Big(\Sigma_{x_1x_1} + \frac{\beta_t^2}{\alpha_t^2}\xi^2 I\Big)^{-1}\Sigma_{x_1x_1}\Big)^{-1} - \frac{\alpha_t^2}{\beta_t^2\xi^2}\,I\bigg) = \Sigma_{x_1x_1}^{-1}. \qquad (122)$$
Now we have
$$g^{\mathrm{sim\text{-}inv\text{-}diffusion}}_t = \frac{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}{\beta_t}\Big(\Sigma_{x_1x_1}^{-1} + \frac{H^\top H}{\sigma_y^2}\Big)^{-1}\frac{H^\top}{\sigma_y^2}\,(y - H\hat x_1). \qquad (123)$$
This is essentially the same formulation as in ΠGDM (Song et al., 2023a) and OT-ODE (Pokle et al., 2024). Next, we make some trivial conversions to cover those formulations exactly. For diffusion paths (Assumption 3.4), we proved
$$\frac{\partial\hat x_1}{\partial x_t} = \frac{\alpha_t}{\beta_t^2}\,\Sigma_{x_1x_1}, \qquad (124)$$
where $\Sigma_{x_1x_1}$ is what we denoted $\Sigma_{1\mid t}$ before. Equivalently,
$$\begin{aligned}
g^{\mathrm{sim\text{-}inv\text{-}diffusion}}_t &= \frac{\dot\alpha_t\beta_t - \dot\beta_t\alpha_t}{\beta_t}\,\Sigma_{x_1x_1}\big(\sigma_y^2 I + \Sigma_{x_1x_1}H^\top H\big)^{-1}H^\top(y - H\hat x_1)\\
&= \frac{\beta_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}{\alpha_t}\,\frac{\partial\hat x_1}{\partial x_t}\,\big(\sigma_y^2 I + \Sigma_{x_1x_1}H^\top H\big)^{-1}H^\top(y - H\hat x_1), \qquad (125)
\end{aligned}$$
which, read as a vector-Jacobian product, matches the familiar form $\big((y - H\hat x_1)^\top(\cdot)^{-1}H\,\frac{\partial\hat x_1}{\partial x_t}\big)^\top$. Now we make the same approximation as ΠGDM, $\Sigma_{x_1x_1} = r_t^2 I$. Then, by noticing the push-through identity
$$\big(\sigma_y^2 I + r_t^2 HH^\top\big)^{-1}H = H\big(\sigma_y^2 I + r_t^2 H^\top H\big)^{-1}, \qquad (126)$$
we exactly cover
$$g^{\mathrm{sim\text{-}inv\text{-}\Pi GDM}}_t = \frac{\beta_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}{\alpha_t}\Big((y - H\hat x_1)^\top\big(\sigma_y^2 I + r_t^2 HH^\top\big)^{-1}H\,\frac{\partial\hat x_1}{\partial x_t}\Big)^{\!\top}, \qquad (127)$$
and the scheduler $\frac{\beta_t(\dot\alpha_t\beta_t - \dot\beta_t\alpha_t)}{\alpha_t}$ in the path $\alpha_t = t$, $\beta_t = 1 - t$ becomes $\frac{1-t}{t}$, exactly covering the schedule in OT-ODE, which takes the same path. In addition, we can also directly compute $\Sigma_{x_1x_1}$ using $\frac{\partial\hat x_1}{\partial x_t}$ instead of approximating it with $r_t$. This corresponds to the approach of Boys et al. (2023), which uses the Jacobian to acquire the covariance and thus removes the approximation error in computing $\big(\sigma_y^2 I + \Sigma_{x_1x_1}H^\top H\big)^{-1}$.

Remark A.7. Starting from the more general assumption of affine path flow matching, we derived guidance compatible with dependent-coupling flow matching, including OT-CFM. The fact that our guidance exactly covers classical diffusion guidance such as ΠGDM, and affine Gaussian path flow matching guidance such as OT-ODE, verifies the validity of our theory.

B. Experiment Details

B.1. Synthetic Dataset Experiment Details

The training of flow matching models involves sampling $x_0$ from a source distribution in {Circle, 8 Gaussians, Uniform, Gaussian} and $x_1$ from a target distribution in {S-Curve, Moons, 8 Gaussians}. The model backbone is a 4-layer MLP with a hidden dimension of 256. The models are trained for 1e5 steps. To evaluate the asymptotic exactness of $g^{\mathrm{MC}}$, we compute the Wasserstein-2 (W2) distance between the samples generated under guidance and the ground-truth energy-weighted distribution $p(x_1)e^{-J(x_1)}/Z$. Since the target distribution $p(x_1)$ is learned, the flow matching model itself has a small error $w$, which can also be quantified using the W2 distance. In principle, this error characterizes the performance upper bound of the guided distribution: the W2 distance of the guided distribution will generally not be significantly lower than $w$. The result is shown in Figure 4, where $w$ is indicated by the dashed line.
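A minimal sketch of the W2 evaluation described above, assuming the POT package (`pip install pot`) is available; `guided_samples` and `target_samples` are placeholder arrays of shape (n, d), the latter drawn (e.g., by rejection or reweighting) from $p(x_1)e^{-J(x_1)}/Z$.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def w2_distance(guided_samples, target_samples):
    """Exact empirical 2-Wasserstein distance between two sample sets."""
    n, m = len(guided_samples), len(target_samples)
    M = ot.dist(guided_samples, target_samples)       # squared Euclidean costs
    a = np.full(n, 1.0 / n)                           # uniform weights
    b = np.full(m, 1.0 / m)
    return np.sqrt(ot.emd2(a, b, M))                  # sqrt of the OT cost
```

Exact EMD scales cubically in the sample count, so for large evaluation sets a sliced or entropic approximation would be substituted.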
B.2. Planning Experiment Details

Settings. The experiments leverage the D4RL (Datasets for Deep Data-Driven Reinforcement Learning) benchmark (Fu et al., 2020), specifically the Locomotion datasets, a common choice for evaluating offline reinforcement learning and offline planning methods (Janner et al., 2022; Dou & Song, 2024). The datasets contain non-expert behaviors from which the model is required to learn the optimal policy, such as a mixture of expert and medium-level trajectories or the training replay buffer of an RL agent. To evaluate the performance of different guidance methods, we conduct experiments on offline RL tasks where generative models have been used as planners (Janner et al., 2022; Chen & Lipman, 2024). Our setting is based on that of a classical generative planner, Diffuser (Janner et al., 2022), where a generative model generates a state-action sequence over multiple future steps, and a critic model predicts the future reward (formally, the discounted return-to-go) of the generated plans. The generative model then optimizes its plans using guidance for higher future rewards. Following the formulation in Levine (2018) and Janner et al. (2022), the optimization is realized by sampling from $\frac{1}{Z}p(x_1)e^{R(x_1)}$, where $R$ is the critic model. This aligns with the goal of guidance discussed in this paper, so we chose this experiment to evaluate different guidance methods.

Baseline Methods. The results of the two baselines are collected from the literature (Janner et al., 2022). Behavior cloning refers to fitting the offline behavior distribution with a Gaussian distribution. Diffuser refers to learning the offline behavior with a diffusion model and then guiding the model to generate plans with higher expected future returns. Note that Diffuser also adopts training-based guidance, which requires re-training the guidance model when switching to a new objective function. In contrast, the training-free guidances gcov-A, gcov-G, gsim-MC, and g MC have zero-shot generalization ability with respect to the objective function (Zhou et al., 2024). In the experiment results, we do not highlight the baselines, as we focus on the comparison between different guidance methods.

[Figure 4: Error scaling with Monte Carlo sample number (x-axis: number of samples, 10^1 to 10^3; settings: Circle to Moons, Gaussian to Moons, 8 Gaussians to Moons). In the synthetic datasets, the guidance error (W2 distance between the generated distribution and the ground-truth energy-weighted distribution $p(x_1)e^{-J(x_1)}/Z$) decreases as the number of Monte Carlo samples increases. The dashed lines denote the W2 distance between the learned unguided distribution and the original ground-truth distribution $p(x_1)$. The guided-generation errors (crosses) do not converge to the dashed lines because the two measure the W2 distance to $p(x_1)e^{-J(x_1)}/Z$ and to $p(x_1)$, respectively.]

Hyperparameters. As mentioned, a generative model is first trained on the offline dataset, like a behavior-cloning method but capturing the actual action distribution rather than approximating it with a Gaussian. The generative model we consider here is CFM or mini-batch optimal transport CFM with the affine path $\alpha_t = t$, $\beta_t = 1 - t$, whose backbone is an 8-layer Transformer with a hidden dimension of 256. The models are trained for 1e5 steps with a batch size of 32, a learning rate of 2e-4, and a cosine annealing learning rate schedule. The critic model uses the same backbone, with the last token as the value output, and is trained for 1e4 steps with a batch size of 64 and a learning rate of 2e-4. The value discount factor is set to 0.99 for all three datasets. We use a planning horizon of 20 steps and a planning stride of 1. We exclude tricks such as the inverse dynamics model, planning with stride, and sample-and-select methods (Lu et al., 2025). During evaluation, the same base model is used for all guidance methods to ensure a fair comparison. We report the normalized score (Fu et al., 2020), where 100 corresponds to an expert RL agent's return. Different hyperparameter combinations are tuned for different guidance methods; we elaborate on them here.

gcov-A: We tune λt in {constant, cosine decay, exponential decay, linear decay} with a scale in {0.01, 0.1, 1.0, 10.0}, where the schedule functions are normalized to [0, 1].

gcov-G: The same hyperparameters are tuned as in gcov-A.

g MC: We tune the scale of J in {0.2, 1, 2, 3, 5}. The Monte Carlo sample number is limited to at most 128. We also include a small number ϵ to enhance numerical stability; an ablation study (Table 6) shows that the performance is insensitive to ϵ.

gsim-MC: We tune the scale of J in {0.1, 1, 10} and the assumed standard deviation of p(x1|xt) in {0.1, 0.5, 1, 10}, and additionally schedule and scale the estimated guidance, with the schedule tuned in {linear decay, constant} and the scale tuned in {0.1, 1, 10}.
It is worth noting that if the objective function J is properly normalized, the scale does not require extensive tuning.

gϕ: Training-based methods have many hyperparameters involving model architecture and training settings. We switch between different model depths and hidden dimensions, and different training losses. The best results for each loss are provided in Tables 4 and 5.

Finally, for each pre-trained VF (CFM or OT-CFM), we try adding different guidance VFs (also CFM or OT-CFM), and report the best result.

Estimation of the ground-truth distribution of J. Suppose the unguided model generates $x \sim p(x)$ and $J(x)$ follows the distribution $p_J(J)$. If instead $x \sim \frac{1}{Z}p(x)e^{-J(x)}$, then J follows
$$p'(J) = p'(x)\,\Big|\det\frac{\partial x}{\partial J(x)}\Big| = \frac{1}{Z}\,p(x)\,e^{-J(x)}\,\Big|\det\frac{\partial x}{\partial J(x)}\Big|. \qquad (128{-}129)$$
Since for the original distribution $p(J) = p(x)\,\big|\det\frac{\partial x}{\partial J(x)}\big|$,
$$p'(J) = \frac{1}{Z}\,e^{-J}\,p(J). \qquad (130)$$
Therefore, by sampling from the unguided model and then reweighting the distribution of J, we can compute the ground-truth distribution p'(J) of J under ideal guidance. Note that in the planning experiment, J = −R.

It should be noted that although the gradient-based guidances gcov-A and gcov-G result in distributions where the estimated return R is higher, this does not necessarily mean their performance is better. First, the goal of the guidance is the gray line: we assume a method producing a distribution closer to the gray line is better. Second, practically speaking, the high return is predicted by the critic model, but gradient methods may produce plans that the critic has not seen during training (the distribution shift problem), thus cheating the critic. In contrast, the target guided distribution $p(x_1)e^{R(x_1)}/Z$ regularizes the guided distribution onto the support of $p(x_1)$, alleviating the problem of distribution shift.

Additional Results on the Distribution of Generated R. Additional results on the distribution of R in different environments with different guidance scales (the α in $p'(R) = p(R)e^{\alpha R}/Z$) are shown in Figure 5.

Additional Results. The complete results, including standard deviations, are provided in Tables 4 and 5.

Table 4: Full experiment results on D4RL Locomotion datasets. The base model is mini-batch optimal transport conditional flow matching. Entries within 95% of the best score per task are highlighted in bold. Baselines are excluded from the ranking.

                        w/o g        gcov-A       gcov-G       gsim-MC      g MC         gϕ GM        gϕ VGM       gϕ RGM       gϕ MRGM
Medium-Expert
  Half Cheetah          61.9 ± 13.3  64.8 ± 12.7  73.2 ± 9.5   78.1 ± 3.2   86.4 ± 0.8   59.5 ± 18.4  57.7 ± 14.1  57.5 ± 13.1  70.2 ± 18.1
  Hopper                95.2 ± 20.4  101.8 ± 22.2 112.3 ± 1.8  112.3 ± 0.8  112.7 ± 0.9  85.2 ± 23.3  98.1 ± 16.3  90.3 ± 24.1  89.3 ± 18.7
  Walker2d              79.1 ± 35.2  97.3 ± 9.4   107.2 ± 1.4  101.0 ± 10.2 107.5 ± 1.0  87.0 ± 16.7  90.5 ± 10.0  88.0 ± 17.2  91.3 ± 11.7
Medium
  Half Cheetah          34.7 ± 9.6   42.2 ± 0.8   42.9 ± 0.9   43.1 ± 1.7   43.1 ± 0.4   42.7 ± 1.4   43.0 ± 1.2   42.7 ± 0.8   43.4 ± 0.9
  Hopper                63.3 ± 4.6   75.1 ± 14.9  89.8 ± 13.6  76.2 ± 13.2  79.8 ± 14.8  79.7 ± 12.4  71.6 ± 9.2   77.4 ± 6.6   72.5 ± 6.0
  Walker2d              72.4 ± 13.3  82.7 ± 5.3   81.3 ± 2.0   83.4 ± 1.9   83.0 ± 3.4   80.6 ± 2.2   80.2 ± 2.0   78.4 ± 3.9   76.6 ± 6.1
Medium-Replay
  Half Cheetah          25.6 ± 13.0  31.7 ± 3.4   36.1 ± 5.1   36.8 ± 1.8   40.0 ± 1.6   33.4 ± 2.6   35.5 ± 1.8   32.9 ± 1.8   34.7 ± 3.1
  Hopper                40.1 ± 3.7   57.7 ± 15.4  74.1 ± 5.1   60.9 ± 13.1  88.6 ± 11.6  54.6 ± 14.8  48.1 ± 15.2  46.6 ± 11.9  55.3 ± 19.4
  Walker2d              31.2 ± 6.0   62.5 ± 16.8  82.5 ± 10.8  64.4 ± 9.7   88.1 ± 2.1   45.6 ± 17.2  37.8 ± 14.5  44.3 ± 23.1  52.4 ± 20.6
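The reweighting in Eq. (130) is a one-liner in practice. A minimal NumPy sketch of the weighted-histogram estimate of the ground-truth guided distribution; for the planning experiment, where J = −R, the same code applies with weights $e^{\alpha R}$.

```python
import numpy as np

def reweighted_J_hist(J_samples, bins=50, alpha=1.0):
    """Estimate p'(J) = e^{-alpha J} p(J) / Z (Eq. (130)) by reweighting
    J values computed on samples from the *unguided* model."""
    w = np.exp(-alpha * J_samples)
    w /= w.sum()                                       # self-normalize (the 1/Z)
    hist, edges = np.histogram(J_samples, bins=bins, weights=w, density=True)
    return hist, edges
```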
B.3. Image Experiment Details

We pre-trained a CFM and a mini-batch optimal transport CFM model with the affine path $\alpha_t = t$, $\beta_t = 1 - t$ on the CelebA-HQ 256×256 dataset. The flow matching model uses a U-Net backbone following Pokle et al. (2024). Pretraining was conducted with a learning rate of 1e-4 and a batch size of 128 for 500 epochs; the run took roughly 3 days on two H800 GPUs.

[Figure 5: The complete results of the distribution of R. Panels (a)-(i) show Half Cheetah, Walker2d, and Hopper at guidance scales 0.001, 0.01, and 0.05; each panel plots the probability density of the estimated return for the target distribution $\frac{1}{Z}p(R)e^{\alpha R}$, the unguided $p(R)$, and the generated samples. The distribution of R under g MC matches the ground-truth value (gray dashed line) very well.]

Table 5: Full experiment results on D4RL Locomotion datasets. The base model is conditional flow matching. Entries within 95% of the best score per task are highlighted in bold. Baselines are excluded from the ranking.

                        w/o g        gcov-A       gcov-G       gsim-MC      g MC         gϕ GM        gϕ VGM       gϕ RGM       gϕ MRGM
Medium-Expert
  Half Cheetah          46.4 ± 10.1  63.4 ± 17.8  68.5 ± 6.2   83.5 ± 4.2   87.7 ± 1.8   61.2 ± 17.1  81.5 ± 7.7   64.2 ± 20.7  66.4 ± 19.2
  Hopper                83.4 ± 19.2  93.9 ± 22.5  113.3 ± 1.8  88.5 ± 28.0  113.3 ± 0.9  91.4 ± 16.1  84.2 ± 22.2  86.2 ± 18.7  86.5 ± 22.4
  Walker2d              65.7 ± 12.1  100.4 ± 10.7 106.9 ± 0.8  107.0 ± 0.8  107.1 ± 0.4  96.2 ± 13.7  102.3 ± 9.2  100.9 ± 9.4  98.2 ± 12.3
Medium
  Half Cheetah          41.8 ± 0.9   43.6 ± 0.6   43.3 ± 1.2   43.8 ± 1.0   43.8 ± 1.0   43.7 ± 0.7   43.7 ± 1.0   42.9 ± 0.7   43.8 ± 1.1
  Hopper                73.2 ± 5.4   79.1 ± 9.8   82.7 ± 11.9  82.1 ± 9.2   88.0 ± 11.3  81.4 ± 14.0  85.2 ± 15.5  74.0 ± 16.3  71.6 ± 14.7
  Walker2d              72.2 ± 5.9   80.7 ± 1.1   82.5 ± 2.8   81.9 ± 5.4   81.9 ± 1.4   67.5 ± 23.7  72.9 ± 2.1   63.3 ± 21.4  55.9 ± 28.6
Medium-Replay
  Half Cheetah          22.2 ± 14.9  33.4 ± 7.1   39.3 ± 2.0   37.9 ± 1.4   40.6 ± 2.0   36.9 ± 2.3   39.1 ± 1.7   37.3 ± 2.1   36.8 ± 4.4
  Hopper                55.1 ± 17.2  63.0 ± 17.5  69.3 ± 16.4  61.0 ± 22.6  80.9 ± 15.7  51.4 ± 22.6  63.5 ± 12.7  48.6 ± 10.9  57.3 ± 24.0
  Walker2d              28.3 ± 7.3   64.9 ± 20.2  76.6 ± 10.7  58.9 ± 18.9  70.9 ± 21.7  57.5 ± 20.6  70.3 ± 4.9   53.1 ± 22.0  54.6 ± 18.7

Table 6: Ablation study of the impact of ϵ on the performance of g MC. The best and second-best results per task are highlighted in bold and underlined.

                        ϵ = 1        1e-3         5e-3         1e-2         5e-2
Half Cheetah
  Medium                42.5 ± 1.6   43.1 ± 0.6   39.8 ± 8.4   41.7 ± 2.9   43.2 ± 1.5
  Medium-Expert         68.2 ± 15.5  66.9 ± 17.1  75.6 ± 12.6  74.6 ± 12.1  72.7 ± 16.6
  Medium-Replay         33.5 ± 9.9   31.7 ± 12.6  39.7 ± 2.0   34.7 ± 10.9  37.3 ± 6.3
Hopper
  Medium                69.0 ± 10.6  67.6 ± 4.9   73.1 ± 9.4   72.2 ± 10.0  72.6 ± 12.3
  Medium-Expert         95.4 ± 19.5  103.2 ± 19.2 108.0 ± 12.3 107.1 ± 14.3 101.5 ± 20.8
  Medium-Replay         53.8 ± 18.7  59.1 ± 17.0  68.2 ± 18.7  68.1 ± 18.1  74.3 ± 18.8
Walker2d
  Medium                74.6 ± 6.5   75.1 ± 15.2  74.0 ± 10.6  71.1 ± 13.8  74.4 ± 11.0
  Medium-Expert         79.0 ± 26.4  95.7 ± 17.8  94.2 ± 22.0  100.9 ± 17.6 103.0 ± 7.9
  Medium-Replay         48.7 ± 21.5  49.5 ± 16.7  54.7 ± 20.6  57.6 ± 22.3  56.7 ± 18.6
For the CelebA dataset, we employed a train-validation-test split of 8:1:1. The test data was subsequently used for three downstream tasks: central box inpainting, 4× super-resolution, and Gaussian deblurring, all of which are common benchmarks.

Settings for the experiments. We evaluated the guidance methods using 3,000 images randomly sampled from the test set across the three inverse problems. Specifically, for deblurring we apply a 61×61 Gaussian kernel with standard deviation σb = 1.0; for super-resolution we perform 4× downsampling of the CelebA images; and for box inpainting we use a centered 40×40 mask. Furthermore, for all three tasks, we add Gaussian noise with standard deviation σ = 0.05 to the images after the degradation operation.

Metrics. In this paper, we use four commonly adopted metrics for image quality assessment: FID (Fréchet Inception Distance), which measures the distance between generated and real image distributions; LPIPS (Learned Perceptual Image Patch Similarity), which evaluates the perceptual similarity between images; and PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), which quantify image quality in terms of signal preservation and structural consistency, respectively.
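For reproducibility, a plain PyTorch re-implementation of the three measurement operators as stated above might look as follows. This is our own sketch of the stated settings, not the authors' released code; in particular, the choice of average pooling for the 4× downsampling and zero-filling for the inpainting mask are assumptions where the text does not pin down the operator.

```python
import torch
import torch.nn.functional as F

def degrade(x, task, sigma_noise=0.05):
    """Degradation operators for images x of shape (B, 3, 256, 256) in [0, 1]."""
    if task == "deblur":                       # 61x61 Gaussian kernel, sigma_b = 1.0
        ax = torch.arange(61, dtype=x.dtype) - 30
        k1d = torch.exp(-0.5 * (ax / 1.0) ** 2)
        k = torch.outer(k1d, k1d)
        k = (k / k.sum()).view(1, 1, 61, 61).repeat(3, 1, 1, 1)
        y = F.conv2d(x, k, padding=30, groups=3)       # depthwise blur
    elif task == "sr4x":                       # 4x downsampling (average pooling assumed)
        y = F.avg_pool2d(x, 4)
    elif task == "inpaint":                    # centered 40x40 box mask (zeroed assumed)
        y = x.clone()
        y[:, :, 108:148, 108:148] = 0.0
    return y + sigma_noise * torch.randn_like(y)       # measurement noise, sigma = 0.05
```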
Why is g MC bad at image inverse problems? As can be seen from Figures 6 and 7, the images generated by g MC do not respect the reference degraded image. This is largely because the variance of the MC estimation is too high given the limited number of samples. Specifically, to estimate gt one needs samples from regions where $e^{-J}$ is significantly higher than average, which correspond to images that already resemble the degraded image; sampling such images requires an infeasibly large number of samples. More advanced MC sampling techniques, such as control variates (Owen, 2013), may help address this shortcoming. Combining g MC with methods that are biased but have lower variance, such as glocal or gsim, may also boost performance. However, on tasks such as conditional generation, as long as the condition appears often in the dataset, it is easier to obtain an accurate estimate of gt using g MC. Such scenarios include property-conditioned molecular structure generation, label-conditioned image generation, and decision-making tasks, all of which are included in our experiments.

Visualizations. We provide visualizations of the inverse problem results in Figures 6 and 7.

[Figure 6: Visualization of the image inverse problems with the base flow matching model being mini-batch optimal transport conditional flow matching (OT-CFM). Columns compare the clean image, the degraded image, ΠGDM, and several of our guidance variants (column labels illegible in extraction); the three rows show Gaussian deblurring, box inpainting, and super-resolution from top to bottom.]

[Figure 7: Visualization of the image inverse problems with the base flow matching model being conditional flow matching (CFM). Columns compare the clean image, the degraded image, ΠGDM, and several of our guidance variants; the three rows show Gaussian deblurring, box inpainting, and super-resolution from top to bottom.]