# towards_hierarchical_rectified_flow__0f259ee5.pdf Published as a conference paper at ICLR 2025 TOWARDS HIERARCHICAL RECTIFIED FLOW Yichi Zhang1, Yici Yan1, Alex Schwing1, Zhizhen Zhao1 1University of Illinois Urbana-Champaign We formulate a hierarchical rectified flow to model data distributions. It hierarchically couples multiple ordinary differential equations (ODEs) and defines a timedifferentiable stochastic process that generates a data distribution from a known source distribution. Each ODE resembles the ODE that is solved in a classic rectified flow, but differs in its domain, i.e., location, velocity, acceleration, etc. Unlike the classic rectified flow formulation, which formulates a single ODE in the location domain and only captures the expected velocity field (sufficient to capture a multi-modal data distribution), the hierarchical rectified flow formulation models the multi-modal random velocity field, acceleration field, etc., in their entirety. This more faithful modeling of the random velocity field enables integration paths to intersect when the underlying ODE is solved during data generation. Intersecting paths in turn lead to integration trajectories that are more straight than those obtained in the classic rectified flow formulation, where integration paths cannot intersect. This leads to modeling of data distributions with fewer neural function evaluations. We empirically verify this on synthetic 1D and 2D data as well as MNIST, CIFAR-10, and Image Net-32 data. Our code is available at: https://riccizz.github.io/HRF/. 1 INTRODUCTION Diffusion models (Ho et al., 2020; Song et al., 2021a;b) and particularly also flow matching (Liu et al., 2023; Lipman et al., 2023; Albergo & Vanden-Eijnden, 2023; Albergo et al., 2023) have gained significant attention recently. This is partly due to impressive results that have been reported across domains from computer vision (Ho et al., 2020) and medical imaging (Song et al., 2022) to robotics (Kapelyukh et al., 2023) and computational biology (Guo et al., 2024). Beyond impressive results, flow matching was also reported to faithfully model multimodal data distributions. In addition, sampling is reasonably straightforward: it requires to solve an ordinary differential equation (ODE) via forward integration of a set of source distribution points along an estimated velocity field from time zero to time one. The source distribution points are sampled from a simple and known source distribution, e.g., a standard Gaussian. The velocity field is obtained by matching velocities from a constructed ground-truth integration path with a parametric deep net using a mean squared error (MSE) objective. See Fig. 1(a) for the ground-truth integration paths of classic rectified flow. Studying the ground-truth velocity distribution at a distinct location and time for rectified flow reveals a multimodal distribution. We derive an analytic expression for the multimodal velocity distribution in case of a mixture-of-Gaussian data distribution in Section 3.1. It is known that the MSE objective used in classic rectified flow does not permit to capture this multimodal distribution. Instead, classic rectified flow leads to a model that aims to capture the mean of the velocity distribution. This is illustrated in Fig. 1(b). We do want to emphasize that capturing the mean of the velocity distribution is sufficient for characterizing a multimodal data distribution (Liu et al., 2023). However, only capturing the mean velocity also leads to unnecessarily curved forward integration paths. This is due to the fact that integration paths cannot intersect when using an MSE objective, as can be observed in Fig. 1(b). In this paper, we hence wonder whether it is possible to capture the velocity distribution in its entirety. This enables integration paths to intersect during data generation, as illustrated in Fig. 1(c). Intuitively, and as detailed in Section 3.2, we can capture the velocity distribution by formulating a rectified flow objective in the velocity space rather than the location space. Hence, instead of training Published as a conference paper at ICLR 2025 (a) Linear Interpolation (b) Rectified Flow (c) Ours Figure 1: Particles flow from starting points (grey) to endpoints (blue) as time increases from 0 to 1. Ideally, the trajectories (green) are straight lines connecting two ends as shown in (a). Rectified Flow captures the expected velocity field while our Hierarchical Rectified Flow can model the true velocity field thus generating intersecting and more straight paths. a deep net to estimate the velocity for integration in location space, as done in classic rectified flow, we train a deep net that estimates the acceleration for integration in velocity space. Sampling can then be done by forward integrating two hierarchically coupled processes: first, forward integrate in velocity space to obtain a sample from the velocity distribution; then use the velocity sample to perform a step in location space. While this nested integration of two processes seems computationally more demanding at first, it turns out that fewer integration steps are needed, particularly in the latter process. This is due to the fact that the integration path is indeed less curved, as shown in Fig. 1(c). We also show in Section 3.3 that capturing the velocity distribution in its entirety permits to capture a multimodal data distribution. The data generation process is governed by a random differential equation (RDE) (Han, 2018) with learned random velocity field. Going forward, instead of using just two hierarchically coupled processes we can extend the formulation to an arbitrary depth, which is detailed in Section 3.4. Using a depth of one defaults to classic rectified flow (deep net captures the expected velocity field), while a depth of two leads to a deep net that captures the acceleration, etc. We refer to this construction of hierarchically coupled processes as a hierarchical rectified flow. Empirically, we find that the studied hierarchical rectified flow leads to samples that better fit the data distribution. Specifically, we find that this hierarchical rectified flow leads to slightly better results than the vanilla rectified flow. 2 PRELIMINARIES Given a dataset D = {(x1)} consisting of samples x1 ρ1, e.g., images, drawn from an unknown target data distribution ρ1, the goal of generative modeling is to learn a model that faithfully captures the unknown target data distribution ρ1 and permits to sample from the learned distribution. Since we focus primarily on rectified flow, we provide its formulation in the following. At inference time, rectified flow starts from samples x0 ρ0 drawn from a known source distribution ρ0, e.g., a standard Gaussian. The source distribution samples are pushed forward from time t = 0 to target time t = 1 via integration along a trajectory specified via a learned velocity field v(zt, t). This learned velocity field depends on the current time t and the sample location zt at time t. Formally, we obtain samples by numerically solving the ordinary differential equation (ODE) dzt = v(zt, t)dt, with z0 ρ0, t [0, 1]. (1) Notably, this sampling procedure is able to capture multimodal dataset distributions, as one expects from a generative model. To learn the velocity field, at training time, rectified flow constructs random pairs (x0, x1), consisting of a source distribution sample x0 ρ0 and a target distribution sample x1 D. The latter is drawn from a given dataset D consisting of samples which are assumed to be drawn from the unknown target distribution ρ1. For a uniformly drawn time t U[0, 1], the time-dependent location xt is computed from the pair (x0, x1) using linear interpolation of (x0, x1), i.e., xt = (1 t)x0 + tx1, where x0 ρ0, x1 D. (2) Published as a conference paper at ICLR 2025 At this location xt and time t, the ground-truth velocity vgt(xt, t) = xt/ t = x1 x0 is readily available. It is then matched during training with a velocity model v(xt, t) via a standard ℓ2 loss, i.e., during training we address inf v Ex0 ρ0,x1 D,t U[0,1] x1 x0 v(xt, t) 2 2 , (3) where the optimization is over the set of all measurable velocity fields. In practice, the functional velocity model v(xt, t) is often parameterized via a deep net with trainable parameters θ, i.e., v(xt, t) vθ(xt, t), and the infimum resorts to a minimization over parameters θ. Considering the training procedure more carefully, it is easy to see that different random pairs (x0, x1) can lead to different ground-truth velocity directions at the same time t and at the same location xt. The aforementioned ℓ2 loss hence asks the functional velocity model v(xt, t) to regress to different ground-truth velocity directions. This leads to averaging, i.e., the optimal functional velocity model v (xt, t) = E{(x0,x1,t):(1 t)x0+tx1=xt} [v(xt, t)]. According to Theorem 3.3 by Liu et al. (2023), if we use v for the ODE in Eq. (1), then the stochastic process associated with Eq. (1) has the same marginal distributions for all t [0, 1] as the stochastic process associated with the linear interpolation characterized in Eq. (2). Nonetheless, to avoid the averaging, in this paper we wonder whether it is possible to capture the multimodal velocity distribution at each time t and at each location xt, and whether there are any potential benefits to doing so. 3 TOWARDS HIERARCHICAL RECTIFIED FLOW In the following Section 3.1, we first discuss the multimodality of the velocity distribution and provide a case study with Gaussian mixtures. The case study is designed to provide insights regarding the velocity distribution. We then discuss in Section 3.2 a simple way to capture the multimodal velocity distribution and how to use it to sample from the data distribution. Then, we show in Section 3.3 that the proposed procedure indeed faithfully captures the data marginals. Finally, we discuss in Section 3.4 an extension towards a hierarchical rectified flow formulation. 3.1 VELOCITY DISTRIBUTION AND CASE STUDY WITH GAUSSIAN MIXTURES The linear interpolation in Eq. (2) defines a time-differentiable stochastic process with the random velocity field v(xt, t) = x1 x0, where x0 ρ0 and x1 ρ1. Note, the source and target distributions are independent. The following theorem characterizes the distribution of the velocity at a specific space time location (xt, t): Theorem 1 The velocity distribution π1(v; xt, t) at the space time location (xt, t) induced by the linear interpolation in Eq. (2) is π1(v; xt, t) = p V |Xt(v|xt) = ρ0(xt tv)ρ1(xt + (1 t)v) ρt(xt) , (4) for ρt(xt) = 0 with ( * denotes convolution) ρ0(x0) for t = 0, 1 t(1 t)ρ0 xt 1 t ρ1 xt t for t (0, 1), ρ1(x1) for t = 1. The distribution π1(v; xt, t) is undefined if ρt(xt) = 0. The proof of Theorem 1 is deferred to Appendix A. Note that since ρ1 is typically multimodal, the resulting π1(v; xt, t) is also multimodal. At t = 0, we have π1(v; xt, t) = ρ1(xt + v), which corresponds to the data distribution shifted by xt. At t = 1, we have π1(v; xt, t) = ρ0(xt v), which corresponds to the flipped source distribution shifted by xt. To illustrate the multimodality of the velocity distribution, we consider a simple 1-dimensional example. The source distribution is a standard Gaussian (zero mean, unit variance). The target distribution is a Gaussian mixture. The following corollary provides the ground-truth velocity distribution at any location xt. Published as a conference paper at ICLR 2025 0.5 0.0 0.5 1.0 1.5 2.0 2.5 Velocity analytical samples 2 1 0 1 2 Velocity analytical samples 4 3 2 1 0 1 2 Velocity analytical samples 2 1 0 1 2 3 4 Velocity analytical samples (a) (xt, t) = ( 1.0, 0.0) (b) (xt, t) = (0.0, 0.4) (c) (xt, t) = (0.5, 0.6) (d) (xt, t) = (1.0, 1.0) Figure 2: We verify the derived velocity distribution by comparing its probability density function (blue) to the empirical sample histogram (orange) at different times t and locations xt. Corollary 1 Assume ρ0 = N(x; 0, 1) and ρ1 = PK k=1 wk N(x; µk, σ2 k), then π1(v; xt, t) = v; (1 t)(µk xt) + tσ2 kxt σ2 k,t , σ2 k σ2 k,t where σ2 k,t = (1 t)2 + t2σ2 k and wk,t = wk N(xt;tµk, σ2 k,t) PK k =1 wk N(xt;tµk , σ2 k ,t). We defer the proof of Corollary 1 to Appendix B. To empirically check the fit of Corollary 1, in Fig. 2, we compare the derived velocity distribution with empirical estimates at different locations (xt, t). We observe a great fit and very clearly multimodal distributions. It is very much worthwhile to study these distributions a bit more. In particular, we observe that the velocity distribution at time t = 1 collapses to a single Gaussian, more specifically a shifted source distribution. This can be seen from Fig. 2(d). Further, at time t = 0, we observe the velocity distribution to be identical to a shifted data distribution. This can be seen from Fig. 2 (a). This is valuable to know as it suggests that the velocity distribution is at least as complex as the data distribution. Indeed, at t = 0, the velocity distribution is identical to a shifted data distribution. 3.2 MODELING THE VELOCITY DISTRIBUTION The previous section showed that the velocity distributions can be multimodal. Knowing that the optimal velocity model v (xt, t) of classic rectified flow averages ground-truth velocities, we can t expect classic rectified flow to capture this distribution. We hence wonder: 1) is it possible to capture the multimodal velocity distribution at each time t and at each location xt; 2) are there any benefits to capturing the multimodal velocity distribution as opposed to just capturing its mean as done by classic rectified flow. Intuitively, an accurate characterization of the velocity distribution might be beneficial because we obtain straighter integration paths, which in turn may lead to easier integration with fewer neural function evaluations (NFE). In addition, capturing the velocity distribution provides additional modeling flexibility (an additional time axis), which might yield to improved results. Notably, modeling of the velocity distribution does not lead to modeling of a simpler distribution. As mentioned in Section 3.1, at time t = 0 the velocity distribution is identical to a shifted data distribution. To accurately model the ground-truth velocity distribution, we can use rectified flow for velocities rather than locations, which are used in the classic rectified flow formulation. This is equivalent to learning the acceleration. To see this, first, consider classic rectified flow again: we construct a timedependent location xt from pairs (x0, x1), compute the ground-truth velocity vgt(xt, t) = xt/ t, and train a velocity model vθ(xt, t) to match this ground-truth velocity vgt(xt, t). To learn the acceleration, we introduce a source velocity sample v0 π0 drawn from a known source velocity distribution π0. We also construct a target velocity sample v1(xt, t) π1(v; xt, t) at time t and at location xt, which follows the target velocity distribution π1(v; xt, t) at time t and at location xt. Note, the target velocity sample at time t and at location xt = (1 t)x0 + tx1 is obtained via v1(xt, t) = x1 x0, when considering a rectified flow. The samples v1(xt, t) follow the ground-truth velocity distribution π1(v; xt, t) at time t and at location xt. Published as a conference paper at ICLR 2025 Algorithm 1: Hierarchical Rectified Flow Training 1 The source distributions ρ0 and π0 and the dataset D 2 while stopping conditions not satisfied do 3 Sample x0 ρ0, x1 D, and v0 π0 ; //better to sample a mini-batch 4 Sample t U[0, 1] and τ U[0, 1] ; //different t and τ for each mini-batch sample 5 Compute loss following Eq. (8); 6 Perform gradient update on θ Using both the source velocity sample v0 and the target velocity sample v1(xt, t), and following classic rectified flow, we introduce a new time-axis τ [0, 1] and construct a time-dependent velocity vτ(xt, t) = (1 τ)v0 + τv1(xt, t) at time t and at location xt. Using it, we obtain the groundtruth acceleration from the time-dependent velocity vτ(xt, t) via agt(xt, t, vτ, τ) = vτ/ τ = v1(xt, t) v0 = x1 x0 v0. Note, for a specific (xt, t), we can get the following ODE induced from the linear interpolation of the target velocity distribution to convert u0 π0 to u1 π1(v; xt, t), duτ(xt, t) = a(xt, t, uτ, τ)dτ, with u0 π0. (7) Here, a(xt, t, uτ, τ)=Eπ0,π1(v;xt,t)[V1 V0|Vτ =uτ]=Eπ0,ρ0,ρ1[X1 X0 V0|Vτ =uτ, Xt =xt] is the expected acceleration vector field. Our approach aims to learn the acceleration vector field a though flow matching for all (xt, t), i.e., matching the ground-truth acceleration by addressing inf a Ex0 ρ0,x1 D,t U[0,1],v0 π0,τ U[0,1] (x1 x0 v0) a(xt, t, vτ, τ) 2 2 . (8) In practice, we use a parametric model aθ(xt, t, vτ, τ) to match the target ground-truth acceleration by minimizing the objective w.r.t. the trainable parameters θ. Training of the parametric acceleration model is straightforward. It is summarized in Algorithm 1. It remains to answer how we use the trained acceleration model aθ(xt, t, vτ, τ) during sampling. We have the following coupled ODEs induced from the coupled linear interpolations: duτ(zt, t) = a(zt, t, uτ, τ)dτ, with u0(zt, t) π0, τ [0, 1], dzt = u1(zt, t)dt, with z0 ρ0, t [0, 1]. (9) Those coupled ODEs convert z0 ρ0 to z1 ρ1. After training, the ODEs in Eq. (9) are simulated using the vanilla Euler method and aθ, as detailed in Algorithm 2. We first draw two random samples: v0 π0 from the source velocity distribution and x0 ρ0 from the source location distribution. We then integrate the velocity forward to time τ = 1 to obtain a sample from the modeled velocity distribution v1(x0, 0). Subsequently, we use this sample to perform one integration step on the location. We continue this procedure until we arrive at a sample x1. Remark. The generation of data is governed by a random differential equation (RDE) with the random velocity field, where Eq. (9) can be viewed as dzt = g(zt, t, u0(zt, t))dt, where u0(zt, t) π0, z0 ρ0, (10) and g is a deterministic function. The randomness comes from the initial conditions for data and velocity. This is different from the sampling process governed by a stochastic differential equation (SDE) used in diffusion models (Song et al., 2021b), where the randomness comes from the Wiener process and the initial condition for data. Note that our use of the term acceleration is not due to second-order derivatives of the location, but rather due to two hierarchically coupled linear processes. We hence refer to this construction as a hierarchical rectified flow. It remains to show that the obtained samples indeed follow the target data distribution. We will dive into this topic next. Published as a conference paper at ICLR 2025 Algorithm 2: Hierarchical Rectified Flow Sampling Input : The source distributions ρ0 and π0, the number of t-discretization steps J, the number of τ-discretization steps L, and the trained network parameters θ. 1 Sample z0 ρ0 and u0 π0; 2 Compute t = 1 J 1 and τ = 1 L 1; 3 for j = 1, . . . , J do 4 for l = 1, . . . , L do 5 Compute ul = ul 1 + aθ(ztj 1, tj 1, ul 1, τl 1) τ 7 Compute zj = zj 1 + u L t 3.3 DISCUSSIONS ON THE GENERATED DATA DISTRIBUTION We discuss below the property of the hierarchical rectified flow defined in Eq. (9). According to rectified flow theory, we can generate samples from the velocity distribution using the expected acceleration field. The following theorem states that the generation process defined in Eq. (9), which uses the velocity distribution, leads to correct marginals for all times t [0, 1]. Theorem 2 The time-differentiable stochastic process Z = {Zt : t [0, 1]} generated by Eq. (9) has the same marginal distribution as the time-differentiable stochastic process X = {Xt : t [0, 1]} generated by the linear interpolation in Eq. (2). We defer the proof of Theorem 2 to Appendix C. Intuitively, the marginal preserving property is because at each time t [0, 1], we can express zt as the linear interpolation of an x0 ρ0 and an x1 ρ1 according to Eq. (2). A key benefit of our approach is that the process Z can be piece-wise straight. Starting with samples zt from ρt for t [0, 1], we propagate each sample by v(zt, t) t, where v(zt, t) π1(v; zt, t). Since v(zt, t) = x1 x0, where tx1 + (1 t)x0 = zt, the straight path following v(zt, t) will lead to a sample from the data distribution. In other words, t can be chosen arbitrarily in the interval (0, 1 t]. In practice, the learned velocity distribution is not perfect. Therefore, instead of one-step generation from the initial distribution, we choose to propagate the samples for a couple of steps. As shown in Section 4, we typically only use 2-5 steps in the numerical integration for data generation. Computationally, straight paths are very attractive as trajectories with nearly straight paths incur small time-discretization error in numerical simulation. 3.4 EXTENDING TOWARDS HIERARCHICAL RECTIFIED FLOW Consider the training objective for acceleration matching discussed in Eq. (8), and further consider the coupled ODE solved when sampling from the constructed process as specified in Eq. (9). It is straightforward to extend both to an arbitrary depth. I.e., instead of modeling the velocity distribution by matching accelerations, we can model the acceleration distribution by matching jerk or go even deeper towards snap, crackle, pop, and beyond. Formally, the training objective of a hierarchical rectified flow of depth D is given by inf f Ex0 ρ0,x1 ρ1,t U[0,1]D h x1 1T Dx0 f (xt, t) 2 2 Here, 1D is the D-dimensional all-ones vector and t = t(1), . . . , t(D) T is a D-dimensional vector of time variables drawn from a D-dimensional unit cube U[0, 1]D. Moreover, we use the D-dimensional vector of source distribution samples x0 = [x(1) 0 , . . . , x(D) 0 ]T , drawn from a D-dimensional source distribution ρ0, e.g., a D-dimensional standard Gaussian. We further use the D-dimensional location vector xt = [x(1) t , . . . , x(D) t ]T , with its d-th entry given as x(d) t = (1 t(d))x(d) 0 + t(d)(x1 Pd 1 k=1 x(k) 0 ). In addition, we refer to f as the functional field of directions. Note that Eq. (11) is identical to Eq. (3) if D = 1 or Eq. (8) if D = 2. Published as a conference paper at ICLR 2025 0.5 0.0 0.5 1.0 1.5 2.0 2.5 Velocity analytical 5 steps WD=0.095 10 steps WD=0.048 20 steps WD=0.031 100 steps WD=0.019 2 1 0 1 2 Velocity analytical 5 steps WD=0.157 10 steps WD=0.088 20 steps WD=0.058 100 steps WD=0.031 4 3 2 1 0 1 2 Velocity analytical 5 steps WD=0.143 10 steps WD=0.074 20 steps WD=0.038 100 steps WD=0.023 2 1 0 1 2 3 4 Velocity analytical 5 steps WD=0.135 10 steps WD=0.048 20 steps WD=0.011 100 steps WD=0.040 (a) (xt, t) = ( 1.0, 0.0) (b) (xt, t) = (0.0, 0.4) (c) (xt, t) = (0.5, 0.6) (d) (xt, t) = (1.0, 1.0) Figure 3: Numerical estimation of π1(xt, t) in HRF2 with different number of v integration steps. The blue line shows the ground-truth π1, where ρ0 is a standard Gaussian and ρ1 is a mixture of two Gaussians. The 1-Wasserstein distances (WD) for the estimates w.r.t. π1 are shown in the legend. Before discussing inference we want to highlight the importance of the first term in Eq. (11). Subtracting a large number of Gaussians from a data sample x1 leads to a smoothed distribution. This is another potential benefit of a hierarchical rectified flow formulation. Given a trained functional field of directions f we sample from the defined process via numerical simulation according to the following coupled ODEs: dz(D) t z(1:D 1) t , t(1:D 1) = f(zt, t)dt(D), with z(D) 0 ρ(D) 0 , dz(D 1) t z(1:D 2) t , t(1:D 2) = z(D) 1 z(1:D 1) t , t(1:D 1) dt(D 1), with z(D 1) 0 ρ(D 1) 0 , ... dz(1) t = z(2) 1 z(1) t , t(1) dt(1), with z(1) 0 ρ(1) 0 . (12) Note that Eq. (12) is identical to Eq. (1) if D = 1 or Eq. (9) if D = 2. Again, note that our use of the terms acceleration, jerk, etc. is not due to second, third, and higherorder derivatives of the location, but rather due to hierarchically coupled linear processes. 4 EXPERIMENTS The studied hierarchical rectified flow (HRF) formulation couples multiple ODEs to accurately model the multimodal velocity distribution. To assess efficacy of this formulation, we first validate the approach in low-dimensional settings, where the analytical form of the velocity distribution is straightforward to compute. This allows us to verify that the model can indeed capture the velocity distribution accurately. We then investigate whether fitting the velocity distribution enhances the model s ability to fit the data distribution in generative tasks. We perform experiments on 1D data (Section 4.1), 2D data (Section 4.2), and high-dimensional image data (Section 4.3) with depth two HRF (HRF2) models: the models not only fit the velocity distribution but also enhance the quality of the generative process. We also include results for depth three HRF (HRF3) models on low dimensional data to show the potential for exploring deeper hierarchical structures. Importantly, for all experiments we report total number of function evaluations (NFEs), i.e., the product of the number of integration steps at all HRF levels. 4.1 SYNTHETIC 1D DATA For the 1D experiments, we first consider a standard Gaussian source distribution and a target distribution represented by a mixture of two Gaussians. Using Eq. (6), we can compute the analytical form of the velocity distribution. As shown in Fig. 3, our model captures the analytic velocity distribution with high accuracy. As expected, an increasing number of velocity integration steps increases the accuracy of the estimated velocity. The only exception occurs when t approaches 1. The model performance deteriorates, and excessive steps accumulate errors, leading to less accurate results. Next, we examine the data generation quality. We use a mixture of five Gaussians as illustrated in Fig. 4(a) and compare HRF to a baseline rectified flow (RF). We use the 1-Wasserstein distance (WD) as a metric to assess the quality of the generated data. As shown in Fig. 4(b), for the same Published as a conference paper at ICLR 2025 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 x Ground Truth RF WD=0.042 HRF2 WD=0.022 HRF3 WD=0.014 0 25 50 75 100 125 150 175 200 Total NFEs Wasserstein Distance RF HRF2 5 v steps HRF2 10 v steps HRF2 20 v steps HRF3 (a) Data distribution (b) Metrics (c) RF trajectories (d) HRF2 trajectories Figure 4: Results on 1D example, where ρ0 is a standard Gaussian and ρ1 is a mixture of 5 Gaussians. (a) Histograms of generated samples and ρ1. (b) The 1-Wasserstein distance vs. NFE. (c) and (d) The trajectories of particles flowing from source distribution (grey) to target distribution (blue). 0 25 50 75 100 125 150 175 200 Total NFEs Sliced Wasserstein Distance RF HRF2 5 v steps HRF2 10 v steps HRF2 20 v steps HRF3 0 25 50 75 100 125 150 175 200 Total NFEs Sliced Wasserstein Distance RF HRF2 5 v steps HRF2 10 v steps HRF2 20 v steps HRF3 (a) Metrics (b) RF trajectories (c) HRF2 trajectories (d) HRF3 trajectories Figure 5: Results on 2D data. Top row: ρ0 is a standard Gaussian and ρ1 is a mixture of 6 Gaussians. Bottom row: ρ0 is a mixture 8 Gaussians and ρ1 is represented by the moons data. (a) Sliced 2-Wasserstein distance with respect to NFE. (b) and (c) show the trajectories (green) of sample particles flowing from source distribution (grey) to target distribution (blue). neural function evaluations (NFEs), the HRF models outperform the baseline, producing data distributions with a lower WD, indicating superior quality. In Fig. 4 s legend, the term v steps refers to the number of velocity integration steps. In this 1D experiment, HRF3 demonstrates better performance compared to HRF2. More 1D results are provided in Appendix H.1. Additionally, we observe a fundamental difference in the generated trajectories. Since rectified flow estimates only the mean of the velocity distribution, it tends to move towards the center of the target distributions initially. In contrast, the HRF model determines the next direction at each space-time location based on the current velocity distribution. As shown in Fig. 4(d), the HRF2 trajectories are nearly linear and can intersect, which permits to use fewer data sampling steps during generation. For the deep net, we use simple embedding layers and linear layers to first process the space and time information separately. Afterward, we concatenate these representations. This combined input is then passed through a series of fully connected layers, allowing the model to capture complex interactions and extract high-level features essential for accurate velocity prediction. We use the same architecture for the baseline model but increase the dimension of the hidden layers to optimize its performance. In contrast, the HRF2 model contains only 74,497 parameters compared to 297,089 parameters for the baseline model. This demonstrates the potential efficiency of HRF in handling higher-dimensional data while maintaining a more compact architecture. More details of the experiments are provided in Appendix F. 4.2 SYNTHETIC 2D DATA For the 2D experiments, we consider two settings: 1) a standard Gaussian source distribution and a target distribution consisting of a mixture of six Gaussians; and 2) a mixture of eight Gaussians as the source distribution and the moons dataset as the target distribution. We employ the same network architecture as used in the 1D experiments. Due to the 2D data, we now have 76,674 parameters for Published as a conference paper at ICLR 2025 10 100 Total NFEs RF (1.08M) HRF2 (1.07M) 10 100 Total NFEs RF (35.75M) HRF2 (44.81M) 10 100 Total NFEs RF (37.06M) HRF2 (46.21M) (a) MNIST (b) CIFAR-10 (c) Image Net-32 Figure 6: Experimental results on (a) MNIST, (b) CIFAR-10, and (c) Image Net-32 datasets. Top row: samples of generated images, bottom row: FID scores with respect to total NFEs. the HRF2 model and 329,986 parameters for the baseline RF model. We measure the quality of data generation using the sliced 2-Wasserstein distance (SWD). Fig. 5 shows the results. It is evident that on these more complex datasets, the performance gap between an HRF model and the rectified flow baseline is more pronounced. The trajectories demonstrate similar patterns to those observed in the 1D experiments: HRF2 produces significantly straighter paths, while the rectified flow baseline often exhibits large directional changes. Additionally, the HRF models consistently achieve higher quality in data generation compared to the baseline. HRF3 outperforms HRF2 for generating the moon data from a mixture of 8 Gaussians. However, HRF2 works better for the simpler mixture of Gaussian target. There is room to improve the training and scheduling of the integration steps among different layers for deeper HRF models. 4.3 IMAGE DATA In addition to low-dimensional data, we also conduct experiments on high-dimensional image datasets including MNIST (Le Cun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and Image Net32 (Deng et al., 2009). We employ the Fr echet Inception Distance (FID) as the metric for evaluating image generation quality. For the baseline, we use the same U-Net architecture as Lipman et al. (2023) and successfully reproduced the state-of-the-art results across all datasets. The HRF model builds upon this U-Net structure. We follow the parameter settings and training procedures from Tong et al. (2024) and Lipman et al. (2023). Further details on the architecture and training setup are provided in Appendix F. As shown in Fig. 6, for the same total NFEs, the HRF2 model demonstrates better performance on MNIST and CIFAR-10, and on-par performance on Image Net-32 when compared to the baseline. 5 RELATED WORK Generative Modeling: GANs (Goodfellow et al., 2014; Arjovsky et al., 2017), VAEs (Kingma & Welling, 2014), and normalizing flows (Tabak & Turner, 2013; Rezende & Mohamed, 2015; Dinh et al., 2017; Huang et al., 2018; Durkan et al., 2019) are classic methods for learning deep generative models. GANs excel in generating high-quality images but face challenges like training instability and mode collapse due to their min-max update mechanism. VAEs and normalizing flows rely on maximum likelihood estimation (MLE) for training, which necessitates architectural constraints or special approximations to ensure manageable likelihood computations. VAEs often employ a conditional Gaussian distribution alongside variational approximations, while the discrete normalizing flows utilize specifically designed invertible architectures and require costly Jacobian matrix calculations. Extending the discrete normalizing flow to continuous cases enabled the Jacobian to be unstructured yet estimatable using trace estimation methods (Hutchinson, 1990; Chen et al., 2018; Grathwohl et al., 2019). However, using maximum likelihood estimation (MLE) for this mapping Published as a conference paper at ICLR 2025 requires costly backpropagation through numerical integration. Regularizing the path can minimize solver calls (Finlay et al., 2020; Onken et al., 2021), but it doesn t resolve the fundamental optimization challenges. Rozen et al. (2021); Ben-Hamu et al. (2022) considered simulation-free training by fitting a velocity field, but still present scalability issues (Rozen et al., 2021) and biased optimization (Ben-Hamu et al., 2022). Recent research has utilized diffusion processes, particularly the Ornstein-Uhlenbeck (OU) process, to link the target distribution ρ1 with a source distribution ρ0. This involves a stochastic differential equation (SDE) that evolves over infinite time, framing generative model learning as fitting the reverse evolution of the SDE from Gaussian noise to ρ1 (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021b). This method learns the velocity field by estimating the score function log(ρt(x)) using the Fischer divergence instead of the maximum likelihood estimation (MLE) objective. Although diffusion models have demonstrated significant potential for modeling highdimensional distributions (Rombach et al., 2022; Hoogeboom et al., 2022; Saharia et al., 2022), the requirement for infinite time evolution, heuristic time step parameterization (Xiao et al., 2022), and the unclear significance of noise and score (Bansal et al., 2024; Lu et al., 2022) pose challenges. Notably, the score-based diffusion models typically require a large number of time steps to generate data samples. In addition, calculating the actual likelihoods necessitates using the ODE probability flow linked to the SDE (Song et al., 2021b). These highlight the need for further exploration of effective ODE-driven methods for learning the data distribution. Flow Matching: Concurrently, Liu et al. (2023); Lipman et al. (2023); Albergo & Vanden-Eijnden (2023) presented an alternative to score-based diffusion models by learning the ODE velocity through a time-differentiable stochastic process defined by interpolating between samples from the source and data distributions, i.e., xt = ψt(x0, x1), with x0 ρ0 and x1 ρ1, instead of the OU process. This offers greater simplicity and flexibility by enabling precise connections between any two densities over finite time intervals. Liu et al. (2023) concentrated on a linear interpolation with ψt(x0, x1) = (1 t)x0 + tx1, i.e., straight paths connecting points from the source and the target distributions. Lipman et al. (2023) introduced the interpolation through the lens of conditional probability paths leading to a Gaussian. Extensions of Lipman et al. (2023) were detailed by Tong et al. (2024), generalizing the method beyond a Gaussian source distribution. Albergo & Vanden-Eijnden (2023); Albergo et al. (2023) introduced stochastic interpolants with more general forms. Straightening Flows: Liu et al. (2023) outlined an iterative process called Re Flow for coupling the points from the source and target distributions to straighten the transport path and demonstrated that repeating this procedure leads to an optimal transport map. Other related studies bypass the iterations by modifying how noise and data are sampled during training. For example, Pooladian et al. (2023); Tong et al. (2024) calculated mini-batch optimal transport couplings between the Gaussian and data distributions to minimize transport costs and gradient variance. Note that these approaches are orthogonal to our approach and can be adopted in our formulation (see Appendix H). Modeling Velocity Distribution: Concurrently, Guo & Schwing (2025) also study a method to model multi-modal velocity vector fields. In this paper, we discuss use of a hierarchy of ordinary differential equations. Differently, Guo & Schwing (2025) study how to use a lower-dimensional latent space to enable modeling of the velocity distribution via a variational approach. The hierarchy of ordinary differential equations permits to more accurately model the velocity distribution while use of the variational approach enables to capture semantics. 6 DISCUSSION & CONCLUSION We study a hierarchical rectified flow formulation that hierarchically couples linear ODEs, each akin to a classic rectified flow formulation. We find this formulation to accurately model multimodal distributions for velocity, etc., which in turn enables integration paths to intersect during data generation. As a consequence, integration paths are less curved leading to compelling results with fewer neural function evaluations. Currently, our sampling process is relatively simple, relying on the Euler method for multiple integrations. We have only performed a basic grid search regarding possible integration schedules and we have not explored other solvers. We suspect, better strategies exist and we leave their exploration to future work. Acknowledgements: Work supported in part by NSF grants 1934757, 2008387, 2045586, 2106825, MRI 1725729, NIFA award 2020-67021-32799, and the Alfred P. Sloan Foundation. Published as a conference paper at ICLR 2025 M. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In Proc. ICLR, 2023. M. Albergo, N. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. ar Xiv preprint ar Xiv:2303.08797, 2023. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proc. ICML, 2017. A. Bansal, E. Borgnia, H.-M. Chu, J. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. In Proc. Neur IPS, 2024. H. Ben-Hamu, S. Cohen, J. Bose, B. Amos, M. Nickel, A. Grover, R. Chen, and Y. Lipman. Matching normalizing flows and probability paths on manifolds. In Proc. ICML, 2022. P. Bromiley. Products and convolutions of gaussian probability density functions. Tina-Vision Memo, 2003. R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In Proc. Neur IPS, 2018. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. In Proc. ICLR, 2017. John R Dormand and Peter J Prince. A family of embedded runge-kutta formulae. Journal of computational and applied mathematics, 6(1):19 26, 1980. C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. In Proc. Neur IPS, 2019. C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. Oberman. How to train your neural ode: the world of jacobian and kinetic regularization. In Proc. ICML, 2020. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Neur IPS, 2014. W. Grathwohl, R. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In Proc. ICLR, 2018. W. Grathwohl, R. Chen, J. Bettencourt, and D. Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In Proc. ICLR, 2019. P. Guo and A. G. Schwing. Variational Rectified Flow Matching. In https://arxiv.org/abs/2502.09616, 2025. Z. Guo, J. Liu, Y. Wang, M. Chen, D. Wang, D. Xu, and J. Cheng. Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering, 2024. Xiaoying. Han. Random Ordinary Differential Equations and Their Numerical Solution. Springer, 2018. J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. Neur IPS, 2020. E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling. Equivariant diffusion for molecule generation in 3d. In Proc. ICML, 2022. C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In Proc. ICML, 2018. Published as a conference paper at ICLR 2025 M. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 1990. I. Kapelyukh, V. Vosylius, and E. Johns. DALL-E-Bot: Introducing web-scale diffusion models to robotics. IEEE Robotics and Automation Letters, 2023. D. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proc. ICLR, 2014. A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. Y. Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In Proc. ICLR, 2023. X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Proc. ICLR, 2023. C. Lu, K. Zheng, F. Bao, J. Chen, C. Li, and J. Zhu. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In Proc. ICML, 2022. D. Onken, S. W. Fung, X. Li, and L. Ruthotto. OT-Flow: Fast and accurate continuous normalizing flows via optimal transport. In Proc. AAAI, 2021. A.-A. Pooladian, H. Ben-Hamu, C. Domingo-Enrich, B. Amos, Y. Lipman, and R. Chen. Multisample flow matching: Straightening flows with minibatch couplings. In Proc. ICML, 2023. D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022. N. Rozen, A. Grover, M. Nickel, and Y. Lipman. Moser flow: Divergence-based generative modeling on manifolds. In Proc. Neur IPS, 2021. C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Proc. Neur IPS, 2022. J. Skilling. The eigenvalues of mega-dimensional matrices. Maximum Entropy and Bayesian Methods, 1989. J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, 2015. J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021a. Y. Song, J. Sohl-Dickstein, D. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021b. Y. Song, L. Shen, L. Xing, and S. Ermon. Solving inverse problems in medical imaging with scorebased generative models. In Proc. ICLR, 2022. E. Tabak and C. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 2013. A. Tong, N. Malkin, G. Huguet, Y. Zhang, J. Rector-Brooks, K. Fatras, G. Wolf, and Y. Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. TMLR, 2024. Z. Xiao, K. Kreis, and A. Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In Proc. ICLR, 2022. Published as a conference paper at ICLR 2025 APPENDIX: TOWARDS HIERARCHICAL RECTIFIED FLOW The appendix is organized as follows. We first provide a proof of Theorem 1 (the velocity distribution given xt) in Appendix A. We then provide a proof of Corollary 1 (velocity distribution for the special case of a mixture of Gaussians target distribution) in Appendix B. Afterwards we provide the proof of Theorem 2 (correctness of the marginals) in Appendix C. Then we discuss density estimation for HRF models in Appendix D. Next we provide more details regarding the hierarchical rectified flow formulation in Appendix E. Subsequently, we discuss experimental and implementation details in Appendix F. Finally, we provide additional ablation studies in Appendix G and additional experimental results in Appendix H. A PROOF OF THEOREM 1 Proof of Theorem 1: The velocity at location xt and time t is v = x1 x0 = x1 xt 1 t . The last equality holds because (1 t)x0 + tx1 = xt. Recall that for a random variable Y = αX + β with α, β R and α = 0, we have p Y (y) = 1 α . Since the random variable V is a linear transform of the random variable X1, we get π1(v; xt, t) = p V |Xt(v|xt) = (1 t)p X1|Xt ((1 t)v + xt|xt) . (13) Therefore, we need to evaluate p X1|Xt. Using Bayes formula, p X1|Xt(x1|xt) = p Xt|X1(xt|x1)p X1(x1) p Xt(xt) , (14) assuming that p Xt(xt) = 0. It is undefined if p Xt(xt) = 0. Now it remains to find p Xt|X1 and we have p Xt|X1(xt|x1) = p(1 t)X0+tx1(xt) = 1 1 tp X0 Plugging Eq. (14) and Eq. (15) into Eq. (13) and using x1 = xt + (1 t)v, we have π1(v; xt, t) = p V |Xt(v|xt) = p X0(xt tv)p X1(xt + (1 t)v) = ρ0(xt tv)ρ1(xt + (1 t)v) ρt(xt) . (16) Since the random variable Xt is a linear combination of two independent random variables X0 and X1 as defined in Eq. (2), we have ρt(xt) = p(1 t)X0(xt) pt X1(xt) = Z p(1 t)X0(z)pt X1(xt z)dz = Z 1 1 tp X0 = 1 t(1 t)ρ0 , for t (0, 1). (17) At t = 0, ρt = ρ0 since xt = x0. At t = 1, ρt = ρ1, since xt = x1. π1(v; xt, t) is undefined if ρt(xt) = 0. This completes the proof. B PROOF OF COROLLARY 1 Bromiley (2003) summarizes a few useful properties for the product and convolution of Gaussian distributions. We state the relevant results here for our proof of Corollary 1. Lemma 1 For the linear transform of a Gaussian random variable, we have N(ax + b; µ, σ2) = 1 Published as a conference paper at ICLR 2025 Lemma 2 For the convolution of two Gaussian distributions, we have N(x; µ1, σ2 1) N(x; µ2, σ2 2) = N(x; µ1 + µ2, σ2 1 + σ2 2). Lemma 3 For the product of two Gaussian distributions, we have N(x; µ1, σ2 1) N(x; µ2, σ2 2) = 1 p 2π(σ2 1 + σ2 2) exp (µ1 µ2)2 σ2 1 + σ2 2 N x; µ1σ2 2 + µ2σ2 1 σ2 1 + σ2 2 , σ2 1σ2 2 σ2 1 + σ2 2 The proofs of the Lemmas are detailed by Bromiley (2003). Proof of Corollary 1: We first compute the density of Xt using Theorem 1 with the specific ρ0 and ρ1: ρt(xt) = 1 t(1 t)ρ0 = 1 t(1 t)N xt k=1 wk N xt t ; µk, σ2 k ! By applying Lemma 1 and Lemma 2 to Eq. (18), we get ρt(xt) = N xt; 0, (1 t)2 k=1 wk N xt; tµk, t2σ2 k ! k=1 wk N xt; 0, (1 t)2 N xt; tµk, t2σ2 k k=1 wk N xt; tµk, σ2 k,t . (19) Using Theorem 1 and Eq. (19), we have p V |Xt(v|xt) = N (xt tv; 0, 1) PK k=1 wk N xt + (1 t)v; µk, σ2 k PK k =1 wk N xt; tµk , σ2 k ,t t2 PK k=1 wk N v; µk xt 1 t , σ2 k (1 t)2 PK k =1 wk N xt; tµk , σ2 k ,t PK k=1 wk N v; xt t2 N v; µk xt 1 t , σ2 k (1 t)2 t(1 t) PK k =1 wk N xt; tµk , σ2 k ,t PK k=1 wk t(1 t) 2π((1 t)2+t2σ2 k) exp (xt tµk)2 (1 t)2+t2σ2 k N v; (1 t)(µk xt)+tσ2 kxt σ2 k,t , σ2 k σ2 k,t t(1 t) PK k =1 wk N xt; tµk , σ2 k ,t PK k=1 wk N xt; tµk, σ2 k,t N v; (1 t)(µk xt)+tσ2 kxt σ2 k,t , σ2 k σ2 k,t PK k =1 wk N xt; tµk , σ2 k ,t v; (1 t)(µk xt) + tσ2 kxt σ2 k,t , σ2 k σ2 k,t The equality a holds by applying Lemma 1. The equality b is derived by applying Lemma 3 to the product of two Gaussian distributions. Simplifying the expressions, we get equality c and the final expression of p V |Xt(v|xt). This completes the proof. Published as a conference paper at ICLR 2025 C PROOF OF THEOREM 2 According to Theorem 3.3 of Liu et al. (2023), the ODE in Eq. (7) generates the samples from the ground-truth velocity distributions at space time location (xt, t). In other words, the random variable V π1. Now we consider the characteristic function of Zt+ t = Zt +V t for t [0, 1] and t [0, 1 t], assuming that Zt has the same distribution as Xt. If the characteristic functions of Zt+ t and Xt+ t agree, then Zt+ t and Xt+ t have the same distribution. To show this, we evaluate the characteristic function of Zt+ t, E h eı k,Zt+ t i = Eρt,π1 h eı k,Xt+V t i = Z Z eı k,xt+v t π1(v; xt, t)ρt(xt)dvdxt a= Z Z eı k,xt+v t ρ0(xt tv)ρ1(xt + (1 t)v) ρt(xt) ρt(xt)dvdxt = Z Z eı k,(xt+v t) ρ0(xt tv)ρ1(xt + (1 t)v)dvdxt b= Z Z eı k,(1 t t)x0+(t+ t)x1 ρ0(x0)ρ1(x1)dx0dx1 = Eρt+ t h eı k,Xt+ t i . (21) We use the notation , to denote the inner product. The equality a is valid due to Theorem 1. The equality b holds because x0 = xt tv and x1 = xt+(1 t)v with the linear interpolation. Therefore, we find that Zt+ t and Xt+ t follow the same distribution. In addition, since Z0 and X0 follow the same distribution ρ0, we can conclude that Zt and Xt follow the same marginal distribution at t for t [0, 1]. This completes the proof. D DENSITY ESTIMATION In the following, we describe two approaches for density estimation. The resulting procedures are summarized in Algorithm 3 and Algorithm 4. To empirically verify the correctness of the density estimation procedures, we train an RF baseline and an HRF2 model using a bimodal Gaussian target distribution and a standard Gaussian source distribution (see Appendix H.1 for more details). In Fig. 7 we compare 1) the ground truth density, 2) the density estimated for the RF baseline model, and 3) the densities estimated for the HRF2 model with both procedures. We also report bits per dimension (bpd) for experiments on the 1D 1N 2N, 2D 8N moon, CIFAR-10, and Image Net32 data. The results are shown in Table 1. We observe that HRF2 consistently outperforms the RF baseline. To estimate the density, according to Eq. (4) in Theorem 1, we have log ρ1(z1) = log π1(u; zt, t) + log ρt(zt) log ρ0(zt tu), with u = z1 zt This implies that for any given t [0, 1], we can use Eq. (22) to estimate the density for a generated sample z1. We can choose zt using the linear interpolation in Eq. (2) with z0 ρ0. For t = 0, we observe that ρ1(z1) = π1(z1 z0; z0, 0), where z0 ρ0. In this case, we can directly evaluate the likelihood of the generated sample via the velocity distribution. We discuss evaluation of the likelihood below. The procedure to compute the density is summarized in Algorithm 3. For t = 1, the right-hand side of Eq. (22) becomes log ρ1(z1) because log π1(u; z1, 1) = log ρ0(z1 u), which cancels out with the last term in Eq. (22). Hence, t = 1 can t be used to estimate the density. For t (0, 1), we need to evaluate ρt(zt) to estimate the likelihood of z1. Considering a one step linear flow from z0 at time 0 to zt at t, we have zt = z0 + vt and ρt(zt|z0) = 1 t π1(v; z0, 0). Using Published as a conference paper at ICLR 2025 Algorithm 3: Density Estimation 1 (t = 0) Input : Generated sample z1 and the source distribution ρ0. 1 Sample z0 ρ0 ; 2 Compute u = z1 z0 ; 3 Compute ˆρ1(z1) = π1(u; z0, 0) according to Eq. (24) ; 4 (Optional) Compute ˆρ1(z1) = 1 N PN i=1 π1(u(i); z(i) 0 , 0), with u(i) = z1 z(i) 0 and z(i) 0 ρ0 ; Output: ˆρ1(z1) Algorithm 4: Density Estimation 2 (t (0, 1)) Input : Generated sample z1 and the source distributions ρ0 and π0. 1 Draw random t Unif(0, 1) ; 2 Sample z0 ρ0 ; 3 Compute zt = tz1 + (1 t)z0 and u = z1 zt 4 Evaluate ρ0(zt tu), ρt(zt) according to Eq. (23), and π1(u; zt, t) according to Eq. (24) ; 5 Compute the log likelihood according to Eq. (22) ; Output: ˆρ1(z1) it, the density at time t can be computed according to ρt(zt) = Z 1 t ; z0, 0 ρ0(z0) dz0 1 zt z(i) 0 t ; z(i) 0 , 0 where z(i) 0 ρ0. Algorithm 4 outlines the procedure for the likelihood computation with a randomly drawn t (0, 1). Optionally, we can average across randomly drawn t (0, 1). To evaluate the (log-)likelihood of a velocity u at location zt and time t, which is needed in both cases (t = 0 and t (0, 1)), we follow the approach introduced by Chen et al. (2018); Song et al. (2021b) and numerically evaluate log π1(u; zt, t) = log π0(u0; zt, t) Z 0 1 uτ aθ(zt, t, uτ, τ) dτ. (24) Here, the random variable uτ as a function of τ can be obtained by solving the ODE in Eq. (7) backward with a fixed u at τ = 1. The term uτ aθ(zt, t, uτ, τ) is computed by using the Skilling Hutchinson trace estimator Ep(ϵ) ϵT uτ a(zt, t, uτ, τ)ϵ (Skilling, 1989; Hutchinson, 1990; Grathwohl et al., 2018). The vector-Jacobian product ϵT vτ a(zt, t, uτ, τ) can be efficiently computed by using reverse mode automatic differentiation, at approximately the same cost as evaluating a(zt, t, uτ, τ). In our experiments, we use the RK45 ODE solver (Dormand & Prince, 1980) provided by the scipy.integrate.solve ivp package. We use atol = 1e 5 and rtol = 1e 5. When implementing Algorithm 4, we use N = 1000 to evaluate ρt(xt). As mentioned above, to empirically verify the correctness of the density estimation procedures, we train an RF baseline and an HRF2 model using a bimodal Gaussian target distribution and a standard Gaussian source distribution. We compare the density estimated for the RF baseline model and the densities estimated for the HRF2 model with both Algorithm 3 and Algorithm 4. Fig. 7(a) compares the results obtained with Algorithm 3 to the RF baseline and the ground truth. Fig. 7(b) compares the density estimated for different times t with Algorithm 4 to the RF baseline and the ground truth. Regardless of the choice of algorithm and time, we observe that the HRF2 model obtains a better estimation of the likelihood. Importantly, both procedures provide a compelling way to estimate densities. In Table 1, we report bits per dimension (bpd) for experiments on the 1D 1N 2N, 2D 8N moon, CIFAR-10, and Image Net-32 data. For 1D data, z0 = 0 suffices for compelling results. For higher dimensional data, we use 20 samples of z0 as shown in the optional line 4 of Algorithm 3 to compute the bits per dimension. We observe that HRF2 consistently outperforms the RF baseline. Published as a conference paper at ICLR 2025 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Data 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Data GT RF HRF2 t=0.2 HRF2 t=0.4 HRF2 t=0.6 (a) Algorithm 3 (b) Algorithm 4 Figure 7: Density estimation results and comparison to ground truth. Irrespective of the choise of algorithm and the choice of time, we observe compelling density estimation results. We also note that the HRF2 model improves upon the RF baseline. NLL (BPD ) 1N 2N 8N moon CIFAR-10 Image Net-32 Baseline (RF) 0.275 2.119 2.980 3.416 Ours (HRF2) 0.261 2.113 2.975 3.397 Table 1: Density estimation on 1D 1N 2N, 2D 8N moon, CIFAR-10, and Image Net-32 data using bits per dimension (bpd). We observe a consistently better density estimation with the HRF2 model. E HIERARCHICAL RECTIFIED FLOW FORMULATION DETAILS In this section, we show how Eq. (8) can be derived from Eq. (11). For convenience we re-state Eq. (11): inf f Ex0 ρ0,x1 ρ1,t U[0,1]D h x1 1T Dx0 f (xt, t) 2 2 For D = 2, we note that x1 1T Dx0 is equivalent to x1 x(1) 0 x(2) 0 . Letting x0 = x(1) 0 and v0 = x(2) 0 , we obtain x1 1T Dx0 = x1 x0 v0. Further note that we obtain the time variables t = [t(1), t(2)] = [t, τ] U[0, 1]2, since t and τ are drawn independently from a uniform distribution U[0, 1]. Also, x0 = [x(1) 0 , x(2) 0 ] = [x0, v0] ρ0, where x0 and v0 are drawn independently from standard Gaussian source distributions ρ0 and π0 because ρ0 is a D-dimensional standard Gaussian. Based on the general expression x(d) t = (1 t(d))x(d) 0 + t(d)(x1 Pd 1 k=1 x(k) 0 ) and the previous results, we have xt = x(1) t = (1 t(1))x(1) 0 + t(1)x1 = (1 t)x0 + tx1 and vτ = x(2) t = (1 t(2))x(2) 0 + t(2)(x1 x(1) 0 ) = (1 τ)v0 + τv1. This is identical to the computation of xt and vτ. Combining all of these results while renaming the function from f to a, we arrive at inf a Ex0 ρ0,x1 D,t U[0,1],v0 π0,τ U[0,1] (x1 x0 v0) a(xt, t, vτ, τ) 2 2 . (26) This program is identical to the one stated in Eq. (8). F EXPERIMENTAL AND IMPLEMENTATION DETAILS F.1 LOW DIMENSIONAL EXPERIMENTS For the 1D and 2D experiments, we use the same neural network. It consists of two parts. The first part processes the space and time information separately using sinusoidal positional embedding and Published as a conference paper at ICLR 2025 linear layers. In the second part, the processed information is concatenated and passed through a series of linear layers to produce the final output. Compared to the baseline, our HRF model with depth D takes D times more space and time information as input. Therefore, the first part of the network has D times more embedding and linear layers to handle spatial and temporal information from different depths. However, by adjusting the dimensions of the hidden layers, we reduced the network size to just one-fourth of the baseline, while achieving superior performance. For each dataset in the low-dimensional experiments, we use 100,000 data points for training and another 100,000 data points for evaluation. For each set of experiments, we train five different models using five random seeds. During the evaluation, we performed a total of 125 experiments and averaged the results to ensure the fairness and validity of our findings. F.2 HIGH DIMENSIONAL EXPERIMENTS In the high-dimensional image experiments, we used the U-Net architecture described by Lipman et al. (2023) for the baseline model. To handle extra inputs vτ and τ, we designed new U-Net-based network architectures for MNIST, CIFAR-10, and Image Net-32 data. MNIST. For MNIST, we use a single U-Net and modify the Res Net block. Similar to the neural network used in our low-dimensional experiments, each Res Net block has two parts. In the first part, we handle two sets of space-time information, i.e., (xt, t) and (vτ, τ), separately with 2 distinct pathways: convolutional layers for spatial data and linear layers for time embeddings. In the second part, all the spatial data and time embeddings are added together and passed through a series of linear layers to capture the space-time dependencies. For a fair evaluation, we adjusted the number of channels such that the model sizes approximately match (ours: 1.07M parameters vs. baseline: 1.08M parameters). We note that the HRF formulation significantly outperforms the baseline. The results were shown in Fig. 6. More results are provided in Appendix H.3. CIFAR-10. For CIFAR-10, we use two U-Nets with the same number of layers but different channel sizes. We use a larger U-Net with channel size 128 to process the velocity vτ and time τ. We use another smaller U-Net with channel size 32 to process the location xt and time t. We merge the output of each Res Net block of the smaller U-Net with the corresponding Res Net block of the bigger U-Net. The size of this new U-Net structure is 1.25 larger than the baseline (44.81M parameters in our model and 35.75M parameters in the baseline). Our model achieves a slightly better generation quality (see Fig. 6 in Section 4 and Table 7 in Appendix H.3). Image Net-32. For Image Net-32, we adopt the same architectural setup as for CIFAR-10 but modify the attention resolution to 16,8 instead of just 16 to better capture the increased multimodality of the Image Net-32 dataset. Our U-Net model has a parameter size of 46.21M, compared to 37.06M for the baseline. It demonstrates slightly improved generation quality (see Fig. 6 in Section 4 and Table 7 in Appendix H.3). For training, we adopt the procedure and parameter settings from Tong et al. (2024) and Lipman et al. (2023). We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and ϵ = 10 8, with no weight decay. For MNIST, the U-Net has channel multipliers [1, 2, 2], while for CIFAR-10 and Image Net-32, the channel multipliers are [1, 2, 2, 2]. The learning rate is set to 1 10 4 with a batch size 128 for MNIST and CIFAR-10. For Image Net-32, we increase the batch size to 512 and adjust the learning rate to 2 10 4. We train all models on a single NVIDIA RTX A6000 GPU. For MNIST, we train both the baseline and our model for 150,000 steps while we use 400,000 steps for CIFAR-10. G ABLATION STUDIES G.1 ABLATION STUDY FOR NFE The sampling process of HRF with depth D involves integrating D ODEs using Euler s method. The total number of neural function evaluations (NFE) is defined as NFE = Q d N (d) where N (d) is the number of integration steps at depth d. Note, for a constant NFE budget, varying the N (d) values can lead to different results. Therefore, we conduct an ablation study to understand suitable choices for N (d). Published as a conference paper at ICLR 2025 Total NFEs Sampling Steps N 2N N 5N 2N 2N N 6N(2D) 8N moon 1-WD 1-WD 1-WD 2-SWD 2-SWD 100 (1, 100) 0.020 0.031 0.045 0.070 0.172 100 (2, 50) 0.025 0.019 0.011 0.037 0.107 100 (5, 20) 0.022 0.020 0.010 0.045 0.119 100 (10, 10) 0.025 0.019 0.017 0.053 0.163 100 (20, 5) 0.026 0.017 0.030 0.062 0.201 100 (50, 2) 0.047 0.030 0.075 0.081 0.222 100 (100, 1) 0.032 0.030 0.050 0.085 0.177 Table 2: HRF2 performance for low dimensional experiments under the same NFE = 100 budget with different choices of sampling steps. Sampling steps (J, L) indicates that we use J steps to integrate x and L steps to integrate v. 1-WD refers to the 1-Wasserstein distance and 2-SWD refers to the Sliced 2-Wasserstein distance. Bold for the best. Underline for the runner-up. 0 2500 5000 7500 10000 12500 15000 17500 20000 Training Steps RF HRF2 HRF3 HRF4 HRF5 Figure 8: Training losses of HRF with different depths on 1D data, a standard Gaussian source distribution to a mixture of 2 Gaussians target distribution. We observe training to remain stable. As shown in Fig. 3, increasing the number of integration steps improves the sampling of the velocity distribution. However, beyond a certain threshold, the benefit of additional steps does not justify the increased computational cost. Table 2 further illustrates that, for a fixed NFE budget, a compelling strategy is to allocate a sufficient number of steps to accurately sample v for a precise velocity distribution while using fewer steps to integrate over x. G.2 ABLATION STUDY FOR DEPTH Our HRF framework can be extended to an arbitrary depth D. Here, we compare the training loss convergence of HRF with depths ranging from 1 to 5, where HRF1 corresponds to the baseline RF. As illustrated by the training losses shown in Fig. 8, training stability remains consistent across different depths, with higher-depth HRFs demonstrating comparable stability to lower-depth models. Importantly, note that Fig. 8 mainly serves to compare convergence behavior and not loss magnitudes as those magnitudes reflect different objects, i.e., velocity for a depth of 1, acceleration for a depth of 2, etc. Moreover, the deep net structure for the functional field of directions f depends on the depth, which makes a comparison more challenging. Table 3 and Table 4 indicate that increasing the depth results in manageable model size, training time, and inference time. These trade-offs are justified by the significant performance improvements observed in Fig. 4 and Fig. 5. See Appendix H.1 for details regarding the training data. Published as a conference paper at ICLR 2025 Training 1D data 2D data RF (0.30M) HRF2 (0.07M) HRF3 (0.67M) RF (0.33M) HRF2 (0.08M) HRF3 (0.71M) Time ( 10 2 s/iter) 1.292 0.736 2.202 1.503 0.737 2.252 Memory (MB) 2011 1763 2417 2091 1803 2605 Param. Counts 297,089 74,497 673,793 329,986 76,674 711,042 Table 3: Computational requirements for training on synthetic datasets. All models in this table are trained for 15000 iterations with a batch size of 51200. Inference Time (s) 1D data 2D data Total NFEs RF (0.30M) HRF2 (0.07M) HRF3 (0.67M) RF (0.33M) HRF2 (0.08M) HRF3 (0.71M) 5 0.030 0.014 0.014 0.005 0.037 0.030 0.035 0.017 0.017 0.006 0.041 0.034 10 0.069 0.020 0.033 0.000 0.128 0.001 0.078 0.025 0.039 0.000 0.145 0.001 50 0.372 0.024 0.164 0.000 0.642 0.001 0.440 0.001 0.193 0.000 0.727 0.001 100 0.755 0.001 0.327 0.000 1.291 0.002 0.884 0.001 0.385 0.000 1.455 0.003 Table 4: Inference time comparison for synthetic data using a varying NFE budget. For HRF2, we used sampling step combinations: (1, 5), (2, 5), (5, 10), (10, 10). For HRF3, we used sampling step combinations: (1, 1, 5), (1, 2, 5), (1, 5, 10), (2, 5, 10). For all experiments, we set our batch size to 100,000. H ADDITIONAL EXPERIMENTAL RESULTS H.1 ADDITIONAL 1D RESULTS The results for experiments used in Fig. 1 and Fig. 3 are shown in Fig. 9. H.2 HIERARCHICAL RECTIFIED FLOW WITH OTCFM As mentioned in Section 5, various approaches for straightening the paths in flow matching models exist. These approaches are orthogonal to our work and can be easily incorporated in the HRF formulation. To demonstrate this, we incorporate the minibatch optimal transport conditional flow matching (OTCFM) (Tong et al., 2024) into the two layered hierarchical rectified flow (HRF2). In OTCFM, for each batch of data ({x(i) 0 }B i=1, {x(i) 1 }B i=1) seen during training, we sample pairs of points from the joint distribution γbatch(x0, x1) given by the optimal transport plan between the source and target points in the batch. We follow the same procedure to couple noise with the data points and use the batch-wise coupled x0 and x1 to learn the parameters in aθ. We refer to this approach as HOTCFM2. We test its performance on two synthetic examples: 1) a 1D example with a standard Gaussian source distribution and a mixture of two Gaussians as the target distribution; and 2) a 2D example with a mixture of eight Gaussians as the source distribution and the moons dataset as the target distribution. Fig. 10 and Fig. 11 show that hierarchical rectified flow improves the performance of OTCFM. H.3 ADDITIONAL RESULTS ON MNIST, CIFAR-10, AND IMAGENET-32 Here we show additional results for experiments with MNIST, CIFAR-10, and Image Net-32 data. From Tables 5 to 7, we can observe the following: For MNIST, our model is comparable in size, comparable in training times, and comparable in inference times, while outperforming the baseline. For CIFAR-10 and Image Net-32, our model is 1.25 larger and has a slower inference time. However, as shown in Table 7, it still outperforms the baseline. We believe that the modest trade-off in model size and inference time is acceptable given the performance gains. Published as a conference paper at ICLR 2025 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 x Ground Truth RF HRF 0 25 50 75 100 125 150 175 200 Total NFEs Wasserstein Distance RF HRF2 5 v steps HRF2 10 v steps HRF2 20 v steps 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 x Ground Truth RF HRF 0 25 50 75 100 125 150 175 200 Total NFEs Wasserstein Distance RF HRF2 5 v steps HRF2 10 v steps HRF2 20 v steps (a) Data distribution (b) Metrics (c) RF trajectories (d) HRF trajectories Figure 9: More experiments on 1D data: top row shows results for a standard Gaussian source distribution and a mixture of 2 Gaussians target distribution; bottom row shows results for a mixture of 2 Gaussians source distribution and the same mixture of 2 Gaussians target distribution. 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 x Ground Truth OTCFM WD=0.069 HOTCFM2 WD=0.025 0 25 50 75 100 125 150 175 200 Total NFEs Wasserstein Distance OTCFM HOTCFM2 5 v steps HOTCFM2 10 v steps HOTCFM2 20 v steps 3 2 1 0 1 2 3 Space 3 2 1 0 1 2 3 Space (a) Data distribution (b) Metrics (c) OTCFM trajectories (d) HOTCFM2 trajectories Figure 10: Results for 1D data, with ρ0 being a standard Gaussian and ρ1 being a mixture of 2 Gaussians. (a) Histograms of generated samples and ρ1. (b) The 1-Wasserstein distance vs. total NFEs. (c,d) The trajectories of particles flowing from source distribution (grey) to target distribution (blue). 0 25 50 75 100 125 150 175 200 Total NFEs Sliced Wasserstein Distance OTCFM HOTCFM2 5 v steps HOTCFM2 10 v steps HOTCFM2 20 v steps (a) Metrics (b) OTCFM trajectories (c) HOTCFM2 trajectories Figure 11: Results for 2D data, with ρ0 being a mixture of 8 Gaussians and ρ1 being represented by the moons data. (a) Sliced 2-Wasserstein distance vs. total NFEs. (b) and (c) show the trajectories (green) of sample particles flowing from source distribution (grey) to target distribution (blue). Training MNIST CIFAR-10 Image Net-32 RF (1.08M) HRF2 (1.07M) RF (35.75M) HRF2 (44.81M) RF (37.06M) HRF2 (46.21M) Time (s/iter) 0.1 0.1 0.3 0.4 0.7 0.8 Memory (MB) 3935 3931 8743 10639 27234 33838 Param. Counts 1,075,361 1,065,698 35,746,307 44,807,843 37,064,707 46,210,083 Table 5: Computational requirements during training on image datasets. Published as a conference paper at ICLR 2025 Inference time (s) MNIST CIFAR-10 Image Net-32 Total NFEs RF (1.08M) HRF2 (1.07M) RF (35.75M) HRF2 (44.81M) RF (37.06M) HRF2 (46.21M) 5 0.084 0.001 0.085 0.001 0.221 0.000 0.295 0.000 0.229 0.000 0.301 0.000 10 0.168 0.000 0.169 0.000 0.441 0.001 0.589 0.001 0.458 0.000 0.601 0.000 20 0.336 0.000 0.339 0.000 0.889 0.001 1.176 0.001 0.918 0.001 1.207 0.001 50 0.843 0.001 0.851 0.002 2.229 0.001 2.953 0.004 2.302 0.002 3.029 0.004 100 1.693 0.002 1.706 0.003 4.471 0.004 5.921 0.003 4.618 0.003 6.100 0.014 500 8.538 0.030 8.598 0.010 22.375 0.011 29.701 0.011 23.110 0.005 30.863 0.083 Table 6: Inference time comparison for MNIST, CIFAR-10, and Image Net-32 datasets using a varying total NFEs budget. For HRF2 on MNIST we used sampling step combinations: (1, 5), (2, 5), (5, 4), (5, 10), (5, 20), (5, 100). For HRF2 on CIFAR-10 and Image Net-32 we used sampling step combinations: (1, 5), (1, 10), (1, 20), (1, 50), (2, 50), (2, 250). All experiments are conducted with a batch size of 128. Performance (FID) MNIST CIFAR-10 Image Net-32 Total NFEs RF (1.08M) HRF2 (1.07M) RF (35.75M) HRF2 (44.81M) RF (37.06M) HRF2 (46.21M) 5 19.187 0.188 15.798 0.151 36.209 0.142 30.884 0.104 69.233 0.166 48.933 0.177 10 7.974 0.119 6.644 0.076 14.113 0.092 12.065 0.024 21.744 0.045 20.286 0.022 20 6.151 0.090 3.408 0.076 8.355 0.065 7.129 0.027 12.411 0.002 12.492 0.100 50 5.605 0.057 2.664 0.058 5.514 0.034 4.847 0.028 8.910 0.137 9.024 0.112 100 5.563 0.049 2.588 0.075 4.588 0.013 4.334 0.054 7.873 0.110 7.679 0.022 500 5.453 0.047 2.574 0.121 3.887 0.035 3.706 0.043 6.962 0.087 6.503 0.035 Table 7: Performance comparison for MNIST, CIFAR-10, and Image Net-32 datasets using a varying total NFEs budget. For HRF2 on MNIST we used sampling step combinations: (5, 1), (10, 1), (5, 4), (10, 5), (10, 10), (100, 5). For HRF2 on CIFAR-10 and Image Net-32 we used sampling step combinations: (1, 5), (1, 10), (1, 20), (1, 50), (2, 50), (2, 250). Bold for lower mean.