
Diversified Flow Matching with Translation Identifiability

Sagar Shrestha 1 Xiao Fu 1

Abstract Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information, yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.

1. Introduction

Unpaired domain translation (UDT) aims to translate samples from one domain to another (e.g., photographs to cartoons) while keeping the high-level semantic meaning (or "content"). Here, "unpaired" means that the translation is done without using paired cross-domain samples. UDT has achieved significant empirical successes in various applications, such as unpaired image-to-image translation (Zhu et al., 2017; Choi et al., 2018; Huang et al., 2018; Yang et al., 2023), medical imaging (Kong et al., 2021; Song et al., 2024), and single-cell data analysis (Tong et al., 2023; Kapuśniak et al., 2024).

1School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA. Correspondence to: Xiao Fu <xiao.fu@oregonstate.edu>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

UDT is commonly realized by transporting the distribution of the source domain to that of the target domain (see, e.g., (Zhu et al., 2017; Liu et al., 2023b)). Distribution transport is a core task in modern machine learning. It is heavily studied in the context of domain adaptation, transfer learning, and generative models. Distribution transport can be realized by popular tools such as MMD (Long et al., 2016), GANs (Goodfellow et al., 2014), Schrödinger bridge (De Bortoli et al., 2021), and continuous normalizing flows (CNF) (Lipman et al., 2023). However, it has been noted that UDT often loses control of the content to be produced in the target domain. For example, in writing style translation, a handwritten digit 7 could be translated to a printed 3 under perfect distribution transport (Moriakov et al., 2020; Shrestha & Fu, 2024). Similar issues are seen in Fig. 1, where source pictures are translated to cartoons of unrelated persons. This content misalignment issue arises as UDT does not have identifiability of the intended translation function (i.e., the function that translates handwritten 7 to printed 7 and the function that converts profile pictures to cartoon faces without losing identities). There could be an infinite number of translation functions that can attain perfect distribution transport among the domains (Moriakov et al., 2020; Shrestha & Fu, 2024).

Many attempts have been made to address the content misalignment issue; see, e.g., (de Bézenac et al., 2019; Zhu et al., 2017; Taigman et al., 2017; Liu et al., 2017; Xu et al., 2022) for various regularization strategies to empirically enforce translation identifiability. Notably, the recent work (Shrestha & Fu, 2024) proposed the so-called diversified distribution matching (DDM) approach, which showed that UDT can be made identifiable if the translation function is learned from simultaneously transporting a number of diverse conditional distribution pairs.

Challenges. The DDM criterion provides a theory-driven approach to avoid translation non-identifiability. However, similar to many works in this domain (e.g., (Zhu et al., 2017; Xu et al., 2022; de Bézenac et al., 2019; Xie et al., 2023)), the DDM approach imposes a structural regularization on the translation function realized by GANs. GAN-based methods sometimes suffer from numerical instability due to their adversarial training nature. More importantly, GANs learn the translation functions but not the continuous intermediate states between the source and target domains, i.e., the transport trajectories, yet the latter are critical in many applications such as robot navigation and single-cell evolution inference (Tong et al., 2023; Liu et al., 2023a; 2018).

To overcome these challenges, a natural thought is to use flow matching (FM) (Lipman et al., 2023; Albergo et al., 2023) for UDT. FM learns the velocity of the translation function and thus easily recovers the trajectories. In addition, FM methods are friendly to train, using nonlinear least squares instead of min-max adversarial criteria. Nonetheless, using FM in identifiability-driven UDT turns out to be quite nontrivial, as most existing works (e.g., (Zhu et al., 2017; Xu et al., 2022; de Bézenac et al., 2019; Xie et al., 2023; Shrestha & Fu, 2024)) enforce identifiability via imposing constraints/regularization on the translation function, yet, unlike GANs, FM does not have an explicit expression of the function. Shifting such constraints onto the function velocity/trajectory requires completely different designs, which have been elusive in the literature.

Contributions. In this work, our interest lies in an FM-based learning framework for DDM-based distribution transport, which we call diversified flow matching (DFM). Our detailed contributions are as follows:

DFM with Translation Identifiability. We custom design the loss function and interpolant (i.e., the function to guide the trajectory of the flow transport) for identifiability-guaranteed DFM. Conventional FM uses nonlinear least squares losses and linear interpolants (Liu et al., 2023b; Tong et al., 2023), which fail to realize DDM as they generate conflicting trajectories among different conditional distribution pairs. We propose to use a nonlinear, private interpolant function for each conditional distribution pair, and show that a bilevel optimization loss with this design provably retains the translation identifiability of DDM.

Tangible Implementation. We propose an implementation that exploits the non-overlapping property of conditional distributions. This way, we show that the computationally demanding bilevel optimization loss can be recast into a more manageable two-stage approach, consisting of an interpolant learning stage and a flow training stage, both of which admit differentiable unconstrained losses and can be solved using simple back-propagation.

We test our method on synthetic data and real-world applications (i.e., robot crowd route planning and unpaired image translation). The results corroborate our theoretical analyses and algorithm design.

Notation. We largely adhere to established conventions in machine learning; see also Appendix A.

Figure 1: [Columns 2-4] Content misalignment issues in both GAN and FM based UDT (CycleGAN (Zhu et al., 2017), FM (Lipman et al., 2023), FM-OT (Tong et al., 2023)). [Column 5] Result by DDM-GAN (Shrestha & Fu, 2024).

2. Background

Unsupervised Domain Translation. Consider two data domains (e.g., photos and sketches), denoted by $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}^d$, respectively. Assume that there exists a deterministic continuous mapping that translates every $x \in \mathcal{X}$ into its content-aligned counterpart $y \in \mathcal{Y}$, i.e.,

$$x \sim p_x, \quad y = g^\star(x), \tag{1}$$

where $p_x$ is the distribution of $x$ and $g^\star : \mathcal{X} \to \mathcal{Y}$ is a differentiable bijective map. We stress that there might exist many $g$'s such that $g(x) \in \mathcal{Y}$ for all $x \in \mathcal{X}$, but our interest lies in a content-preserving $g^\star$ (e.g., the one that changes the style of handwritten 7 to its printed version but keeps its identity). The goal of UDT is to estimate $g^\star$ from the unpaired samples of $p_x$ and $p_y$.

Translation via Distribution Transport. In the literature (Zhu et al., 2017; Park et al., 2020), the UDT problem is generally addressed by finding an invertible translation function g such that the distribution of the translated samples g(x) matches that of y, i.e.,

$$\text{find invertible } g \quad \text{subject to: } g_{\#} p_x = p_y, \tag{2}$$

where $g_{\#} p_x$ represents the distribution of $g(x)$. The criterion can be realized by many computational tools, e.g., GANs (Goodfellow et al., 2014), and more recently, diffusion-based tools such as Schrödinger bridge (De Bortoli et al., 2021), and FM (Lipman et al., 2023). It was noticed in the literature (Galanti et al., 2018; Moriakov et al., 2020) that solving (2) sometimes produces content-misaligned translations, meaning that the desired $g^\star$ in the ground-truth translation model (1) is not identified. Fig. 1 (columns 2-4) shows the content misalignment issue that exists in both GAN and FM based UDT.

Figure 2: The idea of DDM. The variable $u^{(q)}$ can often be defined as attributes that are not supposed to change across domains. In (Shrestha & Fu, 2024), it was shown that $Q \geq 2$ suffices to underpin the translation identifiability.

Many works showed that imposing more structural information on $g$ in (2) could establish content alignment (or, the identifiability of the translation function $g^\star$); see, e.g., Benaim & Wolf (2017); Benaim et al. (2018); Xu et al. (2022); Moriakov et al. (2020). Among them, Shrestha & Fu (2024) showed an interesting theoretical result. To elaborate, the method in (Shrestha & Fu, 2024) first defines corresponding conditional distributions $p_{x|u=u^{(q)}}$ and $p_{y|u=u^{(q)}}$ for $q \in [Q]$, where $u^{(q)}$ is auxiliary information. For example, for face to cartoon translation in Fig. 2, $u^{(q)}$ could be designed to represent some face attributes, e.g., gender, that are not supposed to change during the translation. Then, they proposed to learn a unified $g$ for matching $Q$ pairs of such conditional distributions using the so-called diversified distribution matching (DDM) criterion:

$$\text{(DDM)} \quad \text{find invertible } g \quad \text{subject to: } g_{\#} p_{x|u^{(q)}} = p_{y|u^{(q)}}, \ \forall q \in [Q], \tag{3}$$

where we used $p_{x|u^{(q)}} = p_{x|u=u^{(q)}}$. Shrestha & Fu (2024) introduced the following condition:

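Definition 2.1 (Sufficiently Diverse Condition (SDC)). For any two disjoint sets $\mathcal{A}, \mathcal{B} \subset \mathcal{X}$, where $\mathcal{A}$ and $\mathcal{B}$ are connected, open, and non-empty, there exists a $u_{(\mathcal{A},\mathcal{B})} \in \{u^{(1)}, \ldots, u^{(Q)}\}$ such that $\int_{\mathcal{A}} p_{x|u_{(\mathcal{A},\mathcal{B})}}(x)\,dx \neq \int_{\mathcal{B}} p_{x|u_{(\mathcal{A},\mathcal{B})}}(x)\,dx$. Then, the set of conditional distributions $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$ is called sufficiently diverse.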

Under the SDC, there are at least two PDFs among $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$ that are sufficiently different over any $\mathcal{A}$ and $\mathcal{B}$. It was also shown that

Theorem 2.2 (Translation Identifiability). (Shrestha & Fu, 2024) Suppose that $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$ satisfies the SDC. Let $\hat{g}$ be any optimal solution of the DDM criterion (3). Then, we have $\hat{g} = g^\star$, a.e.

In addition, it was also shown that the DDM criterion is robust to small violations of the SDC (see Appendix B.1).

DDM-GAN and Challenges. In (Shrestha & Fu, 2024), Problem (3) was solved using a GAN-based framework:

$$\min_{g} \max_{\{d^{(q)}\}} \ \sum_{q=1}^{Q} \Pr(u^{(q)})\, \mathbb{E}_{y \sim p_{y|u^{(q)}}} \left[ \log d^{(q)}(y) \right] + \mathbb{E}_{x \sim p_{x|u^{(q)}}} \left[ \log\left(1 - d^{(q)}(g(x))\right) \right], \tag{4}$$

where $d^{(q)}$ is a discriminator for the $q$th pair of conditional distributions. Cycle-consistency and backward translation were also used to enforce invertibility of $g$, which are omitted here for conciseness. The DDM-GAN formulation showed promising results (see Fig. 1), but two main challenges exist: First, GAN-based training is sometimes numerically unstable due to the min-max nature. Second, more importantly, the learned deterministic $g$ does not contain the trajectory reflecting how $p_x$ was changed to $p_y$, but such trajectories are critical information for a number of domain translation problems, e.g., robot navigation and single-cell evolution inference (Tong et al., 2023; Liu et al., 2023a; 2018).

3. Proposed Approach

As mentioned, besides GANs, diffusion and flow based methods can also be used for distribution transport. The latter genre is known for its relatively simple training processes and the ability to reveal the transport trajectories. Hence, we are motivated to design an FM (Lipman et al., 2023) based UDT method for carrying out the DDM principle (3). This turns out to be a nontrivial task. To see this, we start with some preliminaries of FM.

3.1. Preliminaries on Flow Matching

FM is an instance of a class of generative models called continuous normalizing flow (CNF) (Lipman et al., 2023). CNFs learn a time-varying differentiable map $f_t : \mathbb{R}^d \to \mathbb{R}^d$, $t \in [0, 1]$, called the flow, such that $f_t(x) = z_t$, where $z_0 = x$ and $z_1 = y$. The flow $f_t$ has a velocity field $v_t : \mathbb{R}^d \to \mathbb{R}^d$:

$$v_t(f_t(x)) = \frac{d}{dt} f_t(x). \tag{5}$$

Using the vector field $v_t$, one can easily translate a given sample $x$ to any intermediate state in the trajectory from $x$ to its corresponding $y$:

$$f_t(x) = x + \int_{0}^{t} v_s(z_s)\,ds, \quad t \in [0, 1], \tag{6}$$

and we have $f_1(x) = g(x) = y$. In practice, $v_t$ for transporting $p_x$ to $p_y$ is parameterized by a neural network and learned via the following nonlinear least squares loss (Albergo et al., 2023; Lipman et al., 2023; Liu et al., 2023b):

$$\min_{v_t} \ \underbrace{\mathbb{E}_{\substack{(x,y) \sim \rho(x,y),\ t \sim \mathrm{Unif}([0,1]) \\ z_t = I(x,y,t)}} \left\| v_t(z_t) - \partial_t I(x, y, t) \right\|_2^2}_{L_{\rm FM}(v_t, I, \rho)} \tag{7}$$

where $\rho(x, y)$ is any joint distribution of $x, y$ such that $\int_{\mathcal{X}} \rho(x, y)\,dx = p_y(y)$ and $\int_{\mathcal{Y}} \rho(x, y)\,dy = p_x(x)$, and $I : \mathbb{R}^d \times \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ is the so-called interpolant function, which is differentiable with respect to $t$ and satisfies $I(x, y, 0) = x$ and $I(x, y, 1) = y$. The independent coupling $\rho(x, y) = p_x(x)p_y(y)$ and the linear interpolant

$$I_{\rm linear}(x, y, t) = (1 - t)x + ty \tag{8}$$

are most commonly used in the FM literature.
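As a rough illustration of the FM loss with the linear interpolant in (8) and of the integration in (6), the following NumPy sketch builds the regression pairs and integrates a velocity field with forward Euler steps. The function names, the Euler discretization, and the constant-shift toy problem are illustrative assumptions and are not taken from the paper; in practice $v_t$ would be a trained neural network.

```python
import numpy as np

def linear_interpolant(x, y, t):
    """I_linear(x, y, t) = (1 - t) x + t y, cf. Eq. (8); t has shape (n, 1)."""
    return (1.0 - t) * x + t * y

def fm_regression_pairs(x, y, rng):
    """Draw t ~ Unif([0, 1]) and return (z_t, t, target) for the FM loss.

    For the linear interpolant the time derivative is simply y - x.
    Rows of x and y are paired as given; shuffling y beforehand would
    give an independent coupling.
    """
    t = rng.uniform(0.0, 1.0, size=(x.shape[0], 1))
    z_t = linear_interpolant(x, y, t)
    target = y - x
    return z_t, t, target

def fm_loss(velocity, x, y, rng):
    """Monte Carlo estimate of E || v_t(z_t) - d/dt I(x, y, t) ||_2^2."""
    z_t, t, target = fm_regression_pairs(x, y, rng)
    residual = velocity(z_t, t) - target
    return float(np.mean(np.sum(residual ** 2, axis=1)))

def euler_flow(velocity, x, n_steps=100):
    """Approximate f_1(x) = x + int_0^1 v_s(z_s) ds with forward Euler steps."""
    z = np.array(x, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((z.shape[0], 1), k * dt)
        z = z + dt * velocity(z, t)
    return z

# Toy check (assumed setup): y = x + shift, so the constant field v_t(z) = shift
# attains zero loss and transports every x to its paired y.
rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])
x = rng.standard_normal((256, 2))
y = x + shift
constant_velocity = lambda z, t: np.broadcast_to(shift, z.shape)
loss = fm_loss(constant_velocity, x, y, rng)
y_hat = euler_flow(constant_velocity, x)
```

The toy field is known in closed form only because the map is a pure shift; the sketch is meant to show the shapes of the quantities entering the loss, not a training loop.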

3.2. Challenges of FM-based DDM

While FM appears to circumvent some well-known challenges of GANs, such as the instability of min-max optimization, it introduces its own distinct difficulties when used to realize the DDM criterion in (3). To explain, let us start with the following definitions:

Definition 3.1 (Transport of Measures). A vector field $v_t$ is said to transport a distribution $\omega$ to $\eta$ if the corresponding flow $g_v = f_1$ (cf. Eq. (6)) satisfies $[g_v]_{\#}\omega = \eta$.

Definition 3.2 (DDM Satisfaction). A vector field $v_t$ is said to satisfy DDM in (3) if it transports $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$ for all $q \in [Q]$, i.e., $[g_v]_{\#} p_{x|u^{(q)}} = p_{y|u^{(q)}}$, $\forall q \in [Q]$.

Next, we show that the classical interpolant in (8) fails to attain DDM satisfaction.

Classic Linear Interpolant Fails. A naive way to implement DDM using FM is as follows:

$$\min_{v_t} \ \sum_{q=1}^{Q} L_{\rm FM}(v_t, I_{\rm linear}, \rho_q), \tag{9}$$

where $\rho_q := \rho(x, y|u^{(q)}) = p_{x|u^{(q)}}(x)\,p_{y|u^{(q)}}(y)$. The goal of (9) is to enforce a unified vector field to transport $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$ for each $u^{(q)}$. Unfortunately, the loss (9) cannot attain this goal in general. Fig. 3 shows a typical failure case. The source and target distributions are both Gaussian mixtures with two modes. There, the red source samples are supposed to be translated to the red target samples (same for the blue modes). However, the linear interpolants for the two pairs of conditional distributions intersect at $t = 1/2$. At this intersection, $\hat{v}_{1/2}$ points towards $\mathbb{E}[x]$, where $\hat{v}_t$ is the optimal solution of (9) (see Fig. 3(d) and Appendix 3.4 for derivation; also see similar visualizations in (Liu et al., 2023b)). This implies that the samples from $p_{x|u_1}$ get reflected back to $p_{y|u_2}$ and those from $p_{x|u_2}$ to $p_{y|u_1}$, i.e., each source mode is sent to the target mode of the other color, as shown in Fig. 3(c).

Figure 3: (a) Samples of two pairs of conditional distributions (left: $p_{y|u^{(q)}}$; right: $p_{x|u^{(q)}}$). (b) Linear interpolant $I_{\rm linear}(x, y, t)$ at $t \in [0, 1]$ as interpolant trajectories. (c) Actual trajectories learned by solving (9). (d) $\hat{v}_{1/2}(0.5x + 0.5y)$ points towards $\mathbb{E}[x]$.
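The averaging effect behind this failure can be checked numerically. The sketch below is a toy Monte Carlo experiment, not the paper's setup: the mode locations, the standard deviation, and the neighborhood radius are all assumed values, chosen so that the straight-line paths of two conditional pairs cross at the origin at $t = 1/2$. It estimates the conditional mean of the regression target $y - x$ near the crossing point, which is what a single velocity field fitted by (9) would approach there.

```python
import numpy as np

rng = np.random.default_rng(0)
n, std = 20000, 0.3

# Hypothetical layout: pair 0 goes (2, 1) -> (-2, -1), pair 1 goes (2, -1) -> (-2, 1),
# so the straight-line paths of the two pairs cross at the origin at t = 1/2.
src_means = np.array([[2.0, 1.0], [2.0, -1.0]])
tgt_means = np.array([[-2.0, -1.0], [-2.0, 1.0]])

t = 0.5
z_parts, target_parts = [], []
for q in range(2):
    x = src_means[q] + std * rng.standard_normal((n, 2))
    y = tgt_means[q] + std * rng.standard_normal((n, 2))  # independent coupling within pair q
    z_parts.append((1.0 - t) * x + t * y)   # I_linear(x, y, 1/2)
    target_parts.append(y - x)              # time derivative of I_linear
z = np.concatenate(z_parts)
target = np.concatenate(target_parts)
pair_id = np.repeat([0, 1], n)

# Empirical conditional mean of the regression target near the crossing point.
near = np.linalg.norm(z, axis=1) < 0.1
pooled_velocity = target[near].mean(axis=0)
per_pair_velocity = [target[near & (pair_id == q)].mean(axis=0) for q in range(2)]

print("pooled :", pooled_velocity)
print("pair 0 :", per_pair_velocity[0])
print("pair 1 :", per_pair_velocity[1])
```

In this toy layout the two pairs ask for opposite vertical velocities at the crossing point, so the pooled estimate keeps only the shared horizontal component. A single field regressed on both pairs therefore cannot let the paths cross, which is consistent with the reflection seen in Fig. 3(c).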

Using Private and Learnable Interpolants. The previous example shows that the commonly used interpolant $I_{\rm linear}(x, y, t) = (1 - t)x + ty$ does not work for DDM. Note that the interpolant guides the velocity, and it is well known that the velocity is the conditional mean of the time derivative of the interpolant (Albergo et al., 2023). Hence, to learn a legitimate $v_t$ that transports $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$ for all $q$ and thus satisfies the DDM criterion (3), a suitable interpolant needs to be selected.

To this end, we propose the following strategy: First, we let each pair $p_{x|u^{(q)}}$ and $p_{y|u^{(q)}}$ use their own private interpolant, denoted by $I^{(q)}$. Second, we design the $I^{(q)}$'s to be nonlinear, learnable interpolants. To proceed, we define a set of learnable interpolants $\mathcal{I}$ as follows:

$$\mathcal{I} = \left\{ I_\theta(x, y, t) : \mathbb{R}^d \times \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d \ \middle|\ I_\theta(x, y, 0) = x,\ I_\theta(x, y, 1) = y,\ I_\theta \text{ differentiable w.r.t. } t \right\}.$$

Using the private interpolant, a natural formulation to realize the DDM (3) appears to be the following:

$$\min_{v_t,\, \{I^{(q)}\}_{q=1}^{Q}} \ \sum_{q=1}^{Q} L_{\rm FM}\left(v_t, I^{(q)}, \rho(x, y|u^{(q)})\right), \tag{10}$$

where $I^{(q)} \in \mathcal{I}$ has learnable parameters $\theta_q$. The hope is that $I^{(q)}$ will be learned in a way such that $v_t$ simultaneously minimizes each $L^{(q)}_{\rm FM}$ for $q = 1, \ldots, Q$ in (10) in order to attain DDM satisfaction.

Figure 4: [Left] Success case of solving (10). [Right] Failure case of solving (10).

Pitfall of Loss (10). However, it turns out that (10) is not a correct criterion. Fig. 4 shows the result of solving (10) on the Gaussian mixture example (see details in Appendix B.4). One can see that it sometimes works [Left] yet sometimes fails [Right] no matter how hard one tunes the optimization algorithm for solving (10). This drives us to discover the following fact:

Fact 3.3. The problem formulation in (10) is not equivalent to the DDM criterion in (3).

To explain, note that we hope to use the loss in (10) to find

$$\hat{v}_t(z) = \mathbb{E}\left[ \partial_t \hat{I}^{(q)}(x, y, t) \ \middle|\ \hat{I}^{(q)}(x, y, t) = z \right], \tag{11}$$

for all $q \in [Q]$, where the expectation is taken over $\rho(x, y|u^{(q)})$. Under the model in (1), such $\hat{v}_t$ and $\hat{I}^{(q)}$ for all $q$ could exist (which we will discuss later in more detail). However, when minimizing the loss (10), it is possible that one finds $(\tilde{v}_t, \tilde{I}^{(q)})$ for certain $q$'s such that $L^{(q)}_{\rm FM}(\tilde{v}_t, \tilde{I}^{(q)}) \leq L^{(q)}_{\rm FM}(\hat{v}_t, \hat{I}^{(q)})$ yet $\tilde{v}_t \neq \mathbb{E}[\partial_t \tilde{I}^{(q)}(x, y, t) \mid \tilde{I}^{(q)}(x, y, t) = z]$. This pathological case could happen because $I^{(q)}$ is a learnable term whose scale can be changed to attain an overall small $L^{(q)}_{\rm FM}$. However, such small loss values do not reflect the real goal of distribution transport. In other words, due to the changeable $I^{(q)}$ and the setting $Q > 1$, having a small value of $L^{(q)}_{\rm FM}(\tilde{v}_t, \tilde{I}^{(q)})$ does not mean (11) is met for $q$, which makes (10) problematic for enforcing DDM.

3.3. Proposed Criterion: A Bilevel Learning Loss

The above discussion shows that we need (11) to hold for each $q$. Hence, instead of using a sum of LS-type losses, we propose the following diversified flow matching (DFM) formulation:

$$\min_{v_t,\, v_t^{(q)},\, I^{(q)}} \ \sum_{q=1}^{Q} \left\| v_t^{(q)} - v_t \right\|^2 + \mathbb{1}_{\mathcal{I}}[I^{(q)}] \tag{12a}$$

$$\text{subject to: } v_t^{(q)} = \arg\min_{w_t^{(q)}} L_{\rm FM}\left(w_t^{(q)}, I^{(q)}, \rho^{(q)}\right), \tag{12b}$$

where $\rho^{(q)} = \rho(x, y|u^{(q)})$ for short and $\mathbb{1}_{\mathcal{I}}[I]$ is the indicator function of the set $\mathcal{I}$. Problem (12) is a bilevel optimization problem, where we ensure DDM satisfaction by using the constraints in (12b). The lower level optimization (12b) introduces a private vector field $v_t^{(q)}$ for each $q$, and the upper level optimization forces consensus among all $v_t^{(q)}$. Consequently, (11) holds for all $q \in [Q]$ when (12) is optimally solved, recovering the DDM criterion in (3). This leads to the following proposition:

Proposition 3.4. Suppose that there exists a flow $f_t : \mathbb{R}^d \to \mathbb{R}^d$, continuously differentiable in time and space, from $p_x$ to $p_y$ such that $f_1 = g^\star$. Suppose there exists a diffeomorphism from the standard Gaussian $\mathcal{N}(0, I_d)$ to $p_{x|u^{(q)}}$, $\forall q$. Let $\hat{v}_t$ denote a solution of Problem (12) and denote $g_{\hat{v}}(x) = x + \int_{t=0}^{t=1} \hat{v}_t(z_t)\,dt$. Then, under the same model in (1), when $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$ satisfies the SDC, we have $g_{\hat{v}} = g^\star$ a.e.

Proposition 3.4 shows that it is viable to use an FM-based loss to attain the same conclusion of Theorem 2.2. Note that Theorem 2.2 was established under the premise that DDM is attained, yet Proposition 3.4 specifically needs the distribution matching part to be realized by FM. This gap is filled by the bilevel loss design in (12).

3.4. Implementation: Exploiting Structural Constraint

Unlike (10), which suffers from theoretical flaws, (12) is theoretically sound in establishing translation identifiability. However, it requires solving a bilevel optimization problem. While this can be tackled using off-the-shelf techniques such as implicit gradients and gradient unrolling (see (Zhang et al., 2024)), computational efficiency remains a concern. In addition, for problems where the lower level optimization part is nonconvex and intractable, convergence of bilevel optimization is hard to guarantee.

To avoid these computational barriers, we propose to simplify Problem (12) by exploiting the structural property of conditional distributions that naturally arises in many cases. Specifically, the auxiliary information $u$ often corresponds to semantic attributes/labels (e.g., gender for image translation), which induce roughly non-overlapping clusters. To utilize this structure, let us assume the following:

Assumption 3.5 (Non-overlapping Supports). For $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$, we have $\mathrm{supp}(p_{x|u_i}) \cap \mathrm{supp}(p_{x|u_j}) = \emptyset$ and $\mathrm{supp}(p_{y|u_i}) \cap \mathrm{supp}(p_{y|u_j}) = \emptyset$, $\forall i \neq j$, where $\mathrm{supp}(p) = \{z \mid p(z) > 0\}$.

Under the above assumption, it is possible to design probability paths $p_t^{u^{(q)}}$, $\forall q$, that satisfy the boundary conditions $p_0^{u^{(q)}} = p_{x|u^{(q)}}$ and $p_1^{u^{(q)}} = p_{y|u^{(q)}}$, such that the supports of $p_t^{u^{(q)}}$ are non-overlapping among all $q \in [Q]$. Learning a unified vector field that simultaneously follows the $p_t^{u^{(q)}}$ for all $q \in [Q]$ will then ensure that the vector field satisfies DDM.

Designing such non-intersecting $p_t^{u^{(q)}}$ can be achieved by designing the interpolants themselves. This is due to the well-known fact:

Fact 3.6. (Albergo et al., 2023) The distribution $p_t^{u^{(q)}}$ is the same as the probability distribution of the random variable $I^{(q)}(x, y, t)$, where $x, y \sim \rho(x, y|u^{(q)})$.

Therefore, it suffices to select interpolants $I^{(q)}$ for all $q$ that do not intersect with each other in the following sense:

Definition 3.7 (Non-intersecting Interpolants). The interpolants $\{I^{(q)}\}_{q=1}^{Q}$ are non-intersecting if, $\forall i, j \in [Q]$, $i \neq j$,

$$I^{(i)}(x^{(i)}, y^{(i)}, t) \neq I^{(j)}(x^{(j)}, y^{(j)}, t),$$

for all $(x^{(q)}, y^{(q)}) \in \mathrm{supp}(\rho(x, y|u^{(q)}))$, $q \in \{i, j\}$.

As we will see, non-intersecting interpolants will allow us to greatly simplify the bilevel loss so that the unified vector field can be learnt efficiently.

Learning Non-intersecting Interpolants. A major benefit of exploiting the non-overlapping support structure in Assumption 3.5 is as follows: Under Assumption 3.5, assume that $I^{(q)}$ for all $q \in [Q]$ are learnable universal path representers. Then, there exists a set of $\check{I}^{(q)}$ for $q \in [Q]$ that are non-intersecting. This is simply because the starting and ending points mapped by $\check{I}^{(i)}$ and $\check{I}^{(j)}$ are completely different. Further, using Assumption 3.5, one can use a unified $I = I^{(q)}$ to express the non-intersecting interpolants for transporting $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$ for all $q \in [Q]$; that is,

$$\check{I}(x, y, t) = \sum_{q=1}^{Q} \mathbb{1}_{(x, y) \in \mathrm{supp}(\rho^{(q)})}\, \check{I}^{(q)}(x, y, t). \tag{13}$$

To find a non-intersecting unified interpolant, we use the following learning criterion:

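$$\check{I} = \arg\min_{I} \ \sum_{i=1}^{Q} \sum_{j=i+1}^{Q} \mathbb{E}_{t_1, t_2}\left[ \gamma_{\sigma_2}(|t_1 - t_2|)\, \gamma_{\sigma_1}\left( \left\| I(x^{(i)}, y^{(i)}, t_1) - I(x^{(j)}, y^{(j)}, t_2) \right\|_2 \right) \right], \tag{14}$$

where $\gamma_\sigma(a) = \max(\exp(-a^2/2\sigma^2), \eta)$ and $\eta > 0$ is a small constant. Simply speaking, Problem (14) tries to push the values of $I(x^{(i)}, y^{(i)}, t_1)$ and $I(x^{(j)}, y^{(j)}, t_2)$ apart if the two values are close in time and space.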
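A minimal NumPy sketch of the pairwise penalty inside (14) is given below, for one pair $(i, j)$ and a finite set of sampled path points. The function names, the default values of $\sigma_1$, $\sigma_2$, and $\eta$, and the two straight-line toy paths are assumptions made for illustration; the actual training would evaluate this quantity on a learnable interpolant and minimize it by back-propagation.

```python
import numpy as np

def gamma(a, sigma, eta=1e-3):
    """gamma_sigma(a) = max(exp(-a^2 / (2 sigma^2)), eta)."""
    return np.maximum(np.exp(-np.square(a) / (2.0 * sigma ** 2)), eta)

def repulsion_penalty(points_i, times_i, points_j, times_j,
                      sigma_space=0.5, sigma_time=0.1, eta=1e-3):
    """Average of gamma_time(|t1 - t2|) * gamma_space(||p_i - p_j||_2) over all sample pairs.

    points_i: (n_i, d) interpolant values of pair i at times_i (n_i,)
    points_j: (n_j, d) interpolant values of pair j at times_j (n_j,)
    """
    space_gap = np.linalg.norm(points_i[:, None, :] - points_j[None, :, :], axis=-1)
    time_gap = np.abs(times_i[:, None] - times_j[None, :])
    weights = gamma(time_gap, sigma_time, eta) * gamma(space_gap, sigma_space, eta)
    return float(np.mean(weights))

def straight_path(start, end, times):
    """Points of the straight line from start to end at the given times."""
    return (1.0 - times[:, None]) * start + times[:, None] * end

times = np.linspace(0.0, 1.0, 50)

# Two straight paths that cross near t = 1/2 (assumed toy layout).
crossing = repulsion_penalty(
    straight_path(np.array([2.0, 1.0]), np.array([-2.0, -1.0]), times), times,
    straight_path(np.array([2.0, -1.0]), np.array([-2.0, 1.0]), times), times)

# Two parallel paths that always stay two units apart.
parallel = repulsion_penalty(
    straight_path(np.array([2.0, 1.0]), np.array([-2.0, 1.0]), times), times,
    straight_path(np.array([2.0, -1.0]), np.array([-2.0, -1.0]), times), times)

print("crossing penalty:", crossing)
print("parallel penalty:", parallel)
```

The crossing configuration receives a much larger penalty than the parallel one, since only path points that are close in both time and space contribute noticeably. The floor $\eta$ keeps the penalty bounded away from zero for far-apart points.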

Simplifying the Bilevel Loss. Let $\check{I}$ denote the learned non-intersecting interpolant. Then (12) can be reformulated as follows:

$$\min_{v_t,\, v_t^{(q)}} \ \sum_{q=1}^{Q} \left\| v_t^{(q)} - v_t \right\|^2 \tag{15a}$$

$$\text{subject to: } v_t^{(q)} = \arg\min_{w_t} L_{\rm FM}\left(w_t, \check{I}, \rho^{(q)}\right). \tag{15b}$$

In contrast to (12), $\check{I}$ is not learnable in Problem (15). This allows us to further re-express the formulation as:

$$\min_{v_t} \ \sum_{q=1}^{Q} L_{\rm FM}\left(v_t, \check{I}, \rho(x, y|u^{(q)})\right). \tag{16}$$

Note that we have eliminated the slack variable representing the private $v^{(q)}$ and the consensus loss in (15a), as no private $I^{(q)}$ is needed in our loss function.

The proposed algorithm is referred to as DFM and is detailed in Algorithm 1 in Appendix B.

4. Related Works

UDT via Distribution Transport. Distribution transport is arguably the most widely used approach in UDT (as well as closely related techniques such as domain adaptation (Long et al., 2016)). In the literature, distribution transport is realized using various methods such as minimization of maximum mean discrepancy (MMD) (Long et al., 2016), GANs (Zhu et al., 2017; Huang et al., 2018; Choi et al., 2020), bridge matching (De Bortoli et al., 2021; Liu et al., 2023a; De Bortoli et al., 2024), and FM (Liu et al., 2023b; Eyring et al., 2024).

Translation Identifiability. The non-uniqueness of the distribution transport maps in UDT is a well-known issue (Moriakov et al., 2020; Shrestha & Fu, 2024; Galanti et al., 2018). Many hypothesized that the desired map $g^\star$ is likely to be the optimal transport (OT) map; see (Liu et al., 2023b) and (De Bortoli et al., 2021; 2024). While OT maps have been shown to be effective in some applications such as single cell analysis (Bunne et al., 2024), it is in general not clear if $g^\star$ is the OT map. There also exist many empirical ways to constrain the feasible set of the transport maps so that the found $\hat{g}$ exhibits the intended behaviors (Amodio & Krishnaswamy, 2019; Xu et al., 2022). Recently, (Shrestha & Fu, 2024) showed that it is possible to identify a general (non-OT) translation function under the SDC.

Bridge and Flow Matching for UDT. Following the rise of diffusion and FM based generative models, many works have emerged to apply the continuous density transport perspective to UDT. Two major classes are notable. The first class is the FM-based approaches that use a deterministic continuous-time process characterized by an ODE (Liu et al., 2023b; Eyring et al., 2024; Kapuśniak et al., 2024; Kornilov et al., 2024; Gazdieva et al., 2023). The second is diffusion and Schrödinger bridges (Liu et al., 2023a; De Bortoli et al., 2024; Sasaki et al., 2021), characterized by SDEs. These methods do not consider the translation identifiability and thus issues in Fig. 1 arise. In addition, they are designed to transport among only one pair of distributions, which is hard to use for DDM as in this work.

Using Auxiliary Information in FM. Auxiliary information has been incorporated into FM in the context of learning conditional generative models (Atanackovic et al., 2024; Zhu & Lin, 2024). These approaches typically learn separate vector fields $v_t(\cdot \mid u^{(q)})$ for each conditioning variable $u^{(q)}$, treating the auxiliary information as a condition. As a result, they avoid the need to construct a unified $v_t$ across all conditional pairs and do not require designing nonlinear, learnable interpolants. This makes them fundamentally different from the setting considered in this work.

5. Experiments

Interpolant Construction. In order to construct the learnable $I$, we parametrize $I_\theta$ using the following form:

$$I_\theta(x, y, t) = (1 - t)x + ty + t(1 - t)\gamma_\theta(x, y, t), \tag{17}$$

where $\gamma_\theta : \mathbb{R}^d \times \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ is a neural network. Note that $I_\theta$ constructed this way satisfies the boundary conditions by design, i.e., $I_\theta(x, y, 0) = x$ and $I_\theta(x, y, 1) = y$. Similar constructions have been used in the literature (Kapuśniak et al., 2024).
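The boundary conditions of the parametrization in (17) can be verified with a small NumPy sketch. Here `gamma_theta` is a hypothetical stand-in for the neural network (a tanh of a linear map with random weights), and the time derivative is approximated by a central finite difference purely for illustration; an actual implementation would more likely rely on automatic differentiation.

```python
import numpy as np

def gamma_theta(x, y, t, weights):
    """Hypothetical stand-in for the network: tanh of a linear map of [x, y, t]."""
    features = np.concatenate([x, y, t], axis=1)
    return np.tanh(features @ weights)

def interpolant(x, y, t, weights):
    """I_theta(x, y, t) = (1 - t) x + t y + t (1 - t) gamma_theta(x, y, t), cf. Eq. (17)."""
    return (1.0 - t) * x + t * y + t * (1.0 - t) * gamma_theta(x, y, t, weights)

def interpolant_time_derivative(x, y, t, weights, eps=1e-5):
    """Central finite-difference approximation of the time derivative of I_theta."""
    forward = interpolant(x, y, t + eps, weights)
    backward = interpolant(x, y, t - eps, weights)
    return (forward - backward) / (2.0 * eps)

rng = np.random.default_rng(0)
d, n = 2, 8
x = rng.standard_normal((n, d))
y = rng.standard_normal((n, d))
weights = rng.standard_normal((2 * d + 1, d))  # assumed toy parameters

t0 = np.zeros((n, 1))
t1 = np.ones((n, 1))
start = interpolant(x, y, t0, weights)   # should equal x
end = interpolant(x, y, t1, weights)     # should equal y
velocity_at_start = interpolant_time_derivative(x, y, t0, weights)
```

Because the correction term is multiplied by $t(1 - t)$, it vanishes at both endpoints regardless of the network output, while the regression target at $t = 0$ becomes $(y - x) + \gamma_\theta(x, y, 0)$ instead of the constant $y - x$ of the linear interpolant.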

Baselines. The major baselines used throughout this section are plain-vanilla flow matching (FM) (Albergo et al., 2023; Liu et al., 2023b; Lipman et al., 2023) and FM with mini-batch optimal transport coupling (FM-OT) (Pooladian et al., 2023; Tong et al., 2023). We also modify the plain-vanilla FM and FM-OT to incorporate auxiliary variables following Eq. (9) to show that without custom designed interpolants, simply using auxiliary variables does not attain DDM satisfaction or translation identifiability. The two modified algorithms are referred to as FM-cond and FM-cond-OT, respectively. For Sec. 5.3 and the navigation experiment there, we further use metric flow matching (MFM-OT), and its auxiliary-variable modified version MFM-cond-OT (also following Eq. (9)). For Sec. 5.2 and the image translation task there, we also include the GAN-based CycleGAN (Zhu et al., 2017) and DDM-GAN (Shrestha & Fu, 2024), and the diffusion-based baselines SDEdit (Meng et al., 2021) and EGSDE (Zhao et al., 2022).

5.1. Synthetic Data Validation

Setting. We use a two-layer MLP with 64 hidden units and SELU activations to represent $v_t(\cdot; \phi)$ as well as $I_\theta$. We use an Adam optimizer with an initial learning rate of 0.001 for $v_t$ and 0.0001 for $I_\theta$. We use a batch size of 512. Details of the proposed algorithm are presented in Algorithm 1. We run both phases of Algorithm 1 for 2000 iterations.

Metrics. We use the Earth Mover's Distance (EMD) to assess the distribution matching between $p_{y|u^{(q)}}$ and $p_{\hat{y}|u^{(q)}}$, $\forall q$. Similarly, we use translation error to assess the identifiability with respect to the true translation function $g^\star$, which is defined as follows:

$$\text{Translation Error (TE)} = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{y}_n - y_n \right\|_2,$$

where $y_n = g^\star(x_n)$ is the target translation and $\hat{y}_n = g_{\hat{v}}(x_n) = x_n + \int_{t=0}^{1} \hat{v}_t(z_t)\,dt$ is the predicted translation.
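As a small illustration of the translation error, the sketch below averages the Euclidean distance between predicted and target translations. Whether the norm is squared is not recoverable from the text here, so the unsquared norm is an assumption, as are the function name and the toy inputs.

```python
import numpy as np

def translation_error(y_hat, y_true):
    """Mean Euclidean distance between predicted and target translations.

    Assumption: the per-sample error is the (unsquared) 2-norm.
    """
    return float(np.mean(np.linalg.norm(y_hat - y_true, axis=1)))

# Toy inputs: every prediction is off by the vector (3, 4), whose 2-norm is 5.
y_true = np.array([[0.0, 0.0], [1.0, 1.0]])
y_hat = y_true + np.array([3.0, 4.0])
te = translation_error(y_hat, y_true)
```

In an actual evaluation, `y_hat` would come from integrating the learned velocity field from each source sample, and `y_true` from applying the known ground-truth map of the synthetic setup.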

3D Gaussian blobs. Fig. 5 shows the results of a setting where we generate Gaussian mixtures $p(x)$ with $d = 3$ and use $y = g^\star(x) = -x$ to generate samples from $p(y)$, where each Gaussian component has unit variance. Trajectories obtained by the baselines FM-cond and FM-cond-OT show the reflection effect discussed in Fig. 3, which results in $p(\hat{y}|u^{(q)})$ being very different from $p(y|u^{(q)})$. However, the proposed method successfully finds vector fields that circumvent the reflection issue associated with this setting, correctly transporting $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q \in \{1, 2\}$.

2D Gaussian blobs. Fig. 6 shows the performance of DFM and baselines when $d = 2$. Note that the $d = 2$ case is arguably more challenging than the $d = 3$ case, as the transport trajectories have less space to explore for collision avoidance. Nonetheless, one can see that DFM learns a $v_t$ such that the two Gaussian blobs travel at different speeds (see the color bar of $t$ in Fig. 6) to avoid collision and reflection, successfully transporting $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q \in \{1, 2\}$. This interesting behavior is not acquired by FM-cond or FM-cond-OT, showing the importance of designing $I$.

Quantitative Results. Table 1 shows the means and standard deviations of EMD and TE attained by various methods over 10 trials. It shows that the proposed method successfully transports $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q$, and identifies $g^\star$ more accurately relative to the baselines.

5.2. Image Translation

In this subsection, we demonstrate the efficacy of the proposed method on an important UDT task, namely, unpaired image-to-image translation. We use images of human faces from the CelebA-HQ dataset (Karras et al., 2017) with 30,000 images as the source data $p_x$ and Bitmoji faces with 4084 images (Mozafari, 2020) as the target data $p_y$. To avoid data imbalance, we use only 5000 randomly selected images from CelebA-HQ. Both image domains are resized to $256 \times 256$. The Bitmoji faces are first center-cropped to 80% of the original image before resizing. All FM-based methods are trained on the latent space of the VAE from Stable Diffusion v1 (Rombach et al., 2022). We use FID to measure the distribution matching performance, while translation identifiability is visually checked from content alignment. The auxiliary information used for this task is the gender, i.e., $u_1$ = "male" and $u_2$ = "female". More details and hyperparameter settings are in Appendix D.1.

Figure 5: Trajectories returned by all methods for the 3D synthetic data. Panels: Input, FM-cond, FM-cond-OT, Proposed (front view), Proposed (top view).

Figure 6: Trajectories returned by all methods for the 2D synthetic data. Colorbar indicates time $t$.

Table 1: EMD and TE attained by the proposed method and baselines for the synthetic data.

| Method | EMD (2D Gaussians) | EMD (3D Gaussians) | TE (2D Gaussians) | TE (3D Gaussians) |
|---|---|---|---|---|
| FM-cond | 9.64 ± 0.51 | 9.59 ± 0.45 | 9.95 ± 0.43 | 10.115 ± 0.388 |
| FM-cond-OT | 9.56 ± 0.41 | 9.44 ± 0.49 | 9.86 ± 0.37 | 9.994 ± 0.428 |
| DFM (proposed) | 0.31 ± 0.06 | 0.54 ± 0.05 | 2.42 ± 0.04 | 3.196 ± 0.034 |

DDM-GAN's Convergence Issues. As discussed in Sec. 2, GANs can encounter convergence challenges in some scenarios. Fig. 7 shows a case where the dataset size is up to 5000 and $Q = 2$. The FID behaves erratically as the number of iterations increases. Note that an FID at the level of 70 corresponds to unacceptable quality (see Appendix B for an example). As we mentioned, such hardness in adversarial optimization is a reason motivating our DFM approach. Therefore, we present the best FID attained by DDM-GAN in the sequel.

Figure 7: Convergence issues of DDM-GAN.

We observe this issue in the considered experiment (see Appendix B.5 for results).

Result. Fig. 8 shows the qualitative translation results obtained by all methods. CycleGAN, FM, and FM-OT, which do not use the auxiliary gender information, suffer from content misalignment. DDM-GAN does not show satisfactory content alignment either, probably due to the numerical instability demonstrated in Fig. 7. FM-cond, the naive FM-based implementation of DDM using the linear interpolants in (9), shows better content alignment than the other baselines (e.g., see the first row). Nonetheless, DFM shows the best alignment, supporting the translation identifiability claims of DDM.

Figure 8: Human faces to Bitmoji translation by all methods.

Table 2: Quantitative results on the Face to Bitmoji task.

| Category | Method | FID | DreamSim | Train Time (hrs) | Infer. Time (s) |
|---|---|---|---|---|---|
| GAN-based | DDM-GAN (Shrestha & Fu, 2024) | 26.20 | 0.58 (0.05) | 13.60 | 0.011 |
| GAN-based | CycleGAN (Zhu et al., 2017) | 31.14 | 0.63 (0.06) | 22.12 | 0.011 |
| Diffusion-based | SDEdit (k=3) | 63.37 | 0.62 (0.06) | 27.10 | 27.22 |
| Diffusion-based | EGSDE (k=3) | 69.93 | 0.56 (0.06) | 31.77 | 65.82 |
| FM-based | FM (Albergo et al., 2023) | 43.23 | 0.66 (0.06) | 3.08 | 5.28 |
| FM-based | FM-cond (w/ Eq. 9) | 22.65 | 0.60 (0.06) | 3.15 | 5.30 |
| FM-based | FM-OT (Tong et al., 2023) | 44.60 | 0.66 (0.06) | 12.42 | 5.21 |
| FM-based | DFM (Ours) | 22.20 | 0.59 (0.05) | 6.17 | 5.27 |

Table 2 presents the quantitative results of all methods. FID assesses the distributional similarity between the translated and target images, while DreamSim evaluates content alignment between the source and translated images. The proposed method achieves the best balance between FID and DreamSim, indicating that it preserves high-quality domain translation without sacrificing content consistency.

5.3. Swarm Navigation

We also test our algorithm on an interesting robot swarm navigation problem (Liu et al., 2018; 2023a); please refer to Appendix C.1 for details.

6. Conclusion

In this work, we introduced DFM, a computational framework that integrates FM into the DDM criterion. DDM is a powerful criterion for UDT, which provably attains translation identifiability in UDT, solving content misalignment issues. DDM had only been realized by GANs, encountering numerical instability and losing transport trajectory information. This motivated us to design an FM-based DDM approach. We first revealed the unique challenges of imposing DDM-induced constraints on flows. Then, we designed a bilevel formulation with private learnable nonlinear interpolants, provably recovering the DDM criterion using FM. We also provided an efficient two-stage implementation, avoiding computational barriers. Experiments demonstrated that DFM effectively computes the DDM criterion, serving as the first translation identifiability-guaranteed flow model.

Limitations. First, our UDT framework is restricted to one-to-one translations, whereas many applications, such as image translation, can benefit from one-to-many mappings. Extending FM-based methods to enable identifiable one-to-many translations is a promising yet challenging direction. Second, our efficient computing scheme relies on non-overlapping supports of the conditional distributions. For overlapped cases, how to efficiently realize the bilevel loss is worth considering in the future.

Acknowledgment

This work was supported in part by the National Science Foundation CAREER Award ECCS-2144889.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

Amodio, M. and Krishnaswamy, S. TravelGAN: Image-to-image translation by transformation vector learning. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition (CVPR), pp. 8983–8992, 2019.

Arvanitidis, G., Hansen, L. K., and Hauberg, S. A locally adaptive normal distribution. Advances in Neural Information Processing Systems (NeurIPS), 29, 2016.

Atanackovic, L., Zhang, X., Amos, B., Blanchette, M., Lee, L. J., Bengio, Y., Tong, A., and Neklyudov, K. Meta flow matching: Integrating vector fields on the Wasserstein manifold. In ICML Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2024.

Benaim, S. and Wolf, L. One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

Benaim, S., Galanti, T., and Wolf, L. Estimating the success of unsupervised image to image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 218–233, 2018.

Bunne, C., Schiebinger, G., Krause, A., Regev, A., and Cuturi, M. Optimal transport for single-cell and spatial omics. Nature Reviews Methods Primers, 4(1):58, 2024.

Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition (CVPR), pp. 8789–8797, 2018.

Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition (CVPR), pp. 8188–8197, 2020.

de Bézenac, E., Ayed, I., and Gallinari, P. Optimal unsupervised domain translation. arXiv preprint arXiv:1906.01292, 2019.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems (NeurIPS), 34:17695–17709, 2021.

De Bortoli, V., Korshunova, I., Mnih, A., and Doucet, A. Schrödinger bridge flow for unpaired data translation. Advances in Neural Information Processing Systems (NeurIPS), 37:103384–103441, 2024.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 34:8780–8794, 2021.

Eyring, L., Klein, D., Uscidda, T., Palla, G., Kilbertus, N., Akata, Z., and Theis, F. J. Unbalancedness in neural Monge maps improves unpaired domain translation. In International Conference on Learning Representations (ICLR), 2024.

Galanti, T., Wolf, L., and Benaim, S. The role of minimal complexity functions in unsupervised learning of semantic mappings. In Proceedings of International Conference on Learning Representations (ICLR), 2018.

Gazdieva, M., Korotin, A., Selikhanovych, D., and Burnaev, E. Extremal domain translation with neural optimal transport. Advances in Neural Information Processing Systems (NeurIPS), 36:40381–40413, 2023.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems (NeurIPS), 27, 2014.

Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 172–189, 2018.

Kapuśniak, K., Potaptchik, P., Reu, T., Zhang, L., Tong, A., Bronstein, M., Bose, A. J., and Di Giovanni, F. Metric flow matching for smooth interpolations on the data manifold. arXiv preprint arXiv:2405.14780, 2024.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Kong, L., Lian, C., Huang, D., Hu, Y., Zhou, Q., et al. Breaking the dilemma of medical image-to-image translation. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pp. 1964–1978, 2021.

Kornilov, N., Mokrov, P., Gasnikov, A., and Korotin, A. Optimal flow matching: Learning straight trajectories in just one step. Advances in Neural Information Processing Systems (NeurIPS), 37:104180–104204, 2024.

Legg, N. and Anderson, S. Southwest flank of Mt. Rainier, WA, 2013. https://opentopography.org/meta/OT.052013.26910.1, 2013. Accessed on January 28, 2025.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.


Liu, G.-H., Lipman, Y., Nickel, M., Karrer, B., Theodorou, E. A., and Chen, R. T. Generalized Schrödinger bridge matching. arXiv preprint arXiv:2310.02233, 2023a.

Liu, M.-Y., Breuel, T., and Kautz, J. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

Liu, X., Gong, C., et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023b.

Liu, Z., Wu, B., and Lin, H. A mean field game approach to swarming robots control. In Annual American Control Conference (ACC), pp. 4293–4298, 2018.

Long, M., Zhu, H., Wang, J., and Jordan, M. I. Unsupervised domain adaptation with residual transfer networks. Advances in Neural Information Processing Systems (NeurIPS), 29, 2016.

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Moriakov, N., Adler, J., and Teuwen, J. Kernel of CycleGAN as a principal homogeneous space. In Proceedings of International Conference on Learning Representations (ICLR), 2020.

Mozafari, M. Bitmoji faces. https://www.kaggle.com/datasets/mostafamozafari/bitmoji-faces, 2020. Accessed on January 20th, 2025.

Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y. Contrastive learning for unpaired image-to-image translation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 319–345, 2020.

Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. Multisample flow matching: Straightening flows with minibatch couplings. arXiv preprint arXiv:2304.14772, 2023.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Springer, 2015.

Sasaki, H., Willcocks, C. G., and Breckon, T. P. UNIT-DDPM: Unpaired image translation with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358, 2021.

Shrestha, S. and Fu, X. Towards identifiable unsupervised domain translation: A diversified distribution matching approach. In International Conference on Learning Representations (ICLR), 2024.

Song, J., Shrestha, S., Li, X., Gan, Y., and Fu, X. Translation identifiability-guided unsupervised cross-platform super-resolution for OCT images. In IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1–5, 2024.

Taigman, Y., Polyak, A., and Wolf, L. Unsupervised cross-domain image generation. In Proceedings of International Conference on Learning Representations (ICLR), 2017.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.

Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482, 2023.

Villani, C. et al. Optimal transport: old and new, volume 338. Springer, 2009.

Xie, S., Xu, Y., Gong, M., and Zhang, K. Unpaired image-to-image translation with shortest path regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10177–10187, 2023.

Xu, Y., Xie, S., Wu, W., Zhang, K., Gong, M., and Batmanghelich, K. Maximum spatial perturbation consistency for unpaired image-to-image translation. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition (CVPR), pp. 18311–18320, 2022.

Yang, S., Jiang, L., Liu, Z., and Loy, C. C. GP-UNIT: Generative prior for versatile unsupervised image-to-image translation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.


Zhang, Y., Khanduri, P., Tsaknakis, I., Yao, Y., Hong, M., and Liu, S. An introduction to bilevel optimization: Foundations and applications in signal processing and machine learning. IEEE Signal Processing Magazine, 41(1):38–59, 2024.

Zhao, M., Bao, F., Li, C., and Zhu, J. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems (NeurIPS), 35:3609–3623, 2022.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition (CVPR), pp. 2223–2232, 2017.

Zhu, Q. and Lin, W. Switched flow matching: Eliminating singularities via switching ODEs. In International Conference on Machine Learning (ICML), pp. 62443–62475, 2024.


Supplementary Material of Diversified Flow Matching with Translation Identifiability

A. Notation

1. $\partial_t f(t, \cdot)$ represents the partial derivative of $f$ with respect to $t$.

2. $[f]_{\#}p$ and $f_{\#}p$ represent the push-forward of the density $p$ by the map $f : \mathcal{X} \to \mathcal{Y}$, i.e., it satisfies $f_{\#}p(A) = p(f^{-1}(A))$ for a measurable set $A \subseteq \mathcal{Y}$, where $f^{-1}(A)$ is the pre-image of $A$ under $f$ and $p(B)$ denotes the measure of the set $B$ under the distribution specified by the density $p$.

3. $I_d \in \mathbb{R}^{d \times d}$ represents the identity matrix.

4. $\partial_t f_t$ represents the partial derivative of $f_t$ with respect to $t$.

5. $\nabla \cdot v$ represents the divergence of the vector field $v$.
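The push-forward identity in item 2 can be checked empirically. The following is a minimal sketch, not part of the original paper: the affine map `f`, the standard normal samples standing in for $p$, and the interval $A = [1, 3)$ are all hypothetical choices made only for illustration.

```python
import random

def f(x):
    """Hypothetical invertible map used only for illustration."""
    return 2.0 * x + 1.0

def f_inverse(y):
    """Inverse of f, so [f_inverse(lo), f_inverse(hi)) is the pre-image of [lo, hi)."""
    return (y - 1.0) / 2.0

def empirical_measure(samples, lo, hi):
    """Fraction of samples that fall in the interval [lo, hi)."""
    return sum(lo <= s < hi for s in samples) / len(samples)

random.seed(0)
x_samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # draws from p
y_samples = [f(x) for x in x_samples]                        # draws from f#p

# Measure of A = [1, 3) under f#p versus measure of f^{-1}(A) = [0, 1) under p.
lhs = empirical_measure(y_samples, 1.0, 3.0)
rhs = empirical_measure(x_samples, f_inverse(1.0), f_inverse(3.0))
print(lhs, rhs)
```

Both printed numbers estimate the same quantity, which is the content of $f_{\#}p(A) = p(f^{-1}(A))$.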

B. Details on Proposed method and Challenges

B.1. Robust Identifiability (Shrestha & Fu, 2024)

Theorem B.1. (Shrestha & Fu, 2024) Let $\hat{g}$ be any optimal solution of the DDM criterion (3). Let $\mathrm{dia}(A) = \sup_{w, z \in A} \|w - z\|_2$ measure the size of a set. Let $\mathcal{V} = \{(A, B) \mid \text{SDC is violated on } (A, B)\}$. Assume that $g^\star$ is an $L$-Lipschitz continuous function and that $\max_{(A,B) \in \mathcal{V}} \max(\mathrm{dia}(A), \mathrm{dia}(B)) \le r$. Then,

$$\|\hat{g}(x) - g^\star(x)\|_2 \le 2rL, \quad \forall x \in \mathcal{X}.$$

That is, $g^\star$ is identified to reasonable accuracy by the DDM criterion if the SDC holds approximately. Moreover, the translation error increases only linearly with the size of the sets on which the SDC is violated.
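To make the bound concrete, the sketch below evaluates $2rL$ for a toy case. It is only an illustration under stated assumptions: the violating sets are replaced by small finite point sets (so the supremum becomes a maximum), and the sets `A`, `B` and the Lipschitz constant are hypothetical values.

```python
import math
from itertools import combinations

def diameter(points):
    """dia(A) = max over pairs w, z in A of ||w - z||_2, for a finite point set."""
    return max((math.dist(w, z) for w, z in combinations(points, 2)), default=0.0)

def translation_error_bound(violating_pairs, lipschitz_const):
    """Return 2 * r * L, with r the largest diameter among the violating sets."""
    r = max(max(diameter(a), diameter(b)) for a, b in violating_pairs)
    return 2.0 * r * lipschitz_const

# Hypothetical violating pair (A, B): two small point sets in the plane.
A = [(0.0, 0.0), (0.3, 0.4)]   # dia(A) = 0.5
B = [(1.0, 1.0), (1.0, 1.2)]   # dia(B) = 0.2
bound = translation_error_bound([(A, B)], lipschitz_const=2.0)
print(bound)  # 2 * 0.5 * 2.0
```

Halving the diameter of the largest violating set halves the bound, which is the linear dependence stated after the theorem.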

B.2. Algorithm DFM

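Algorithm 1 DFM
Require: $\rho$, $\sigma_1$, $\sigma_2$, initialized parameters $\phi$, $\theta$

1: while stopping criterion for $I_\theta$ is not met do
2:   Sample $(x^{(q)}, y^{(q)}) \sim \rho(x, y|u^{(q)})$ and $t_q \sim U(0, 1)$, $\forall q \in [Q]$
3:   $z^{(q)}_{t_q,\theta} \leftarrow (1 - t_q)x^{(q)} + t_q y^{(q)} + t_q(1 - t_q)\gamma_\theta(x^{(q)}, y^{(q)})$, $\forall q \in [Q]$
4:   $\mathcal{L}_{\rm interp}(\theta) \leftarrow \sum_{i=1}^{Q-1} \sum_{j=i+1}^{Q} \gamma_{\sigma_2}(|t_i - t_j|)\, \gamma_{\sigma_1}(\|z^{(i)}_{t_i,\theta} - z^{(j)}_{t_j,\theta}\|_2)$
5:   $\theta \leftarrow \theta - \alpha_1 \nabla_\theta \mathcal{L}_{\rm interp}(\theta)$
6: end while
7: while stopping criterion for $v_t(\cdot; \phi)$ is not met do
8:   Sample $(x, y) \sim \rho(x, y)$ and $t \sim U(0, 1)$
9:   $z_t \leftarrow (1 - t)x + ty + t(1 - t)\gamma_\theta(x, y)$
10:  $\ell(\phi) = \|v_t(z_t) - \partial_t z_t\|_2^2$
11:  $\phi \leftarrow \phi - \alpha_2 \nabla_\phi \ell(\phi)$
12: end while
13: return $v_t(\cdot; \phi)$

In Algorithm 1, $\nabla$ represents the gradient-based update direction. The specific optimizers used in the experiments are described in their corresponding experiment sections.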
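The quantities in lines 3, 4, and 10 of Algorithm 1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the network output $\gamma_\theta(x, y)$ is replaced by a fixed vector `gamma`, the kernel $\gamma_\sigma$ is assumed to be Gaussian (the listing only names it), and the two crossing point pairs are hypothetical.

```python
import math

def interpolant(x, y, t, gamma):
    """Line 3: z_t = (1 - t) x + t y + t (1 - t) gamma, per coordinate."""
    return [(1 - t) * xi + t * yi + t * (1 - t) * gi
            for xi, yi, gi in zip(x, y, gamma)]

def interpolant_velocity(x, y, t, gamma):
    """Time derivative of z_t, the regression target of line 10:
    d/dt z_t = y - x + (1 - 2 t) gamma."""
    return [yi - xi + (1 - 2 * t) * gi for xi, yi, gi in zip(x, y, gamma)]

def gaussian_kernel(r, sigma):
    """Assumed kernel shape exp(-r^2 / (2 sigma^2)); not specified in the listing."""
    return math.exp(-r * r / (2.0 * sigma * sigma))

def interp_loss(zs, ts, sigma1, sigma2):
    """Line 4: pairwise penalty, large when two groups are close in space and time."""
    total = 0.0
    for i in range(len(zs) - 1):
        for j in range(i + 1, len(zs)):
            total += (gaussian_kernel(abs(ts[i] - ts[j]), sigma2)
                      * gaussian_kernel(math.dist(zs[i], zs[j]), sigma1))
    return total

# Hypothetical pair of groups whose straight paths cross at the origin at t = 0.5.
x1, y1 = (-1.0, 0.0), (1.0, 0.0)
x2, y2 = (1.0, 0.0), (-1.0, 0.0)
zero = (0.0, 0.0)
g1, g2 = (0.0, 2.0), (0.0, -2.0)   # stand-ins for gamma_theta outputs

ts = [0.5, 0.5]
loss_linear = interp_loss([interpolant(x1, y1, 0.5, zero),
                           interpolant(x2, y2, 0.5, zero)], ts, 0.1, 1.5)
loss_bent = interp_loss([interpolant(x1, y1, 0.5, g1),
                         interpolant(x2, y2, 0.5, g2)], ts, 0.1, 1.5)
print(loss_linear, loss_bent)
```

With the linear paths, the two samples coincide at $t = 0.5$ and the penalty is at its maximum for a single pair; bending the paths in opposite directions drives the penalty towards zero, which is the behavior the first phase of Algorithm 1 optimizes for.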


Figure 9: Image quality during the GAN training when convergence issues were encountered.

B.3. Derivation of $\hat{v}_{1/2}$ in Fig. 3

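If the vector field $v_t(z)$ is a solution to (8), then $v_t$ can be expressed as follows:

$$\begin{aligned}
\hat{v}_t(z) &= \mathbb{E}_{u^{(q)} \sim p_u,\, x \sim p_{x|u^{(q)}},\, y \sim p_{y|u^{(q)}}}\left[\partial_t I_{\rm linear}(x, y, t) \mid I_{\rm linear}(x, y, t) = z\right] \\
&= \mathbb{E}_{u^{(q)} \sim p_u,\, x \sim p_{x|u^{(q)}},\, y \sim p_{y|u^{(q)}}}\left[y - x \mid (1 - t)x + ty = z\right].
\end{aligned}$$

At $t = 1/2$, the condition $(x + y)/2 = z$ gives $y - x = 2z - 2x$, so

$$\begin{aligned}
\hat{v}_{1/2}(z) &= \mathbb{E}_{u^{(q)} \sim p_u,\, x \sim p_{x|u^{(q)}}}[2z - 2x] \\
&= \mathbb{E}_{x \sim p_x}[2z - 2x] \\
&= 2(z - \mathbb{E}[x]).
\end{aligned}$$

Hence, $\hat{v}_{1/2}$ points towards the mean of the two clusters, $p_{x|u^{(q)}}$, $q = 1, 2$, resulting in reflection.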

B.4. Pitfalls of Problem (23)

To evaluate how well Problem (23) works in practice, we run simulations on 2D and 3D synthetic data as presented in Fig. 4. We observed that directly using Problem (23) is challenging from an optimization perspective, since the unified vector field needs to minimize $\mathcal{L}_{\rm FM}$ simultaneously with respect to all conditional distributions. Instead, the following variable-splitting-based implementation, which decouples the $\mathcal{L}_{\rm FM}$ loss minimization from the vector field unification, was found to be optimization friendly:

$$\underset{v_t,\, \{\tilde{v}^{(q)}_t,\, \theta_q\}_{q=1}^{Q}}{\text{minimize}} \ \sum_{q=1}^{Q} \Big[ \mathcal{L}_{\rm FM}\big(\tilde{v}^{(q)}_t, I^{(q)}_{\theta_q}, \rho(x, y|u^{(q)})\big) + \lambda\, \mathbb{E}_{t,\, z_t \sim \rho^{(q)}_t} \big\| v_t(z_t) - \tilde{v}^{(q)}_t(z_t) \big\|^2 \Big],$$

where $z_t \sim \rho^{(q)}_t$ implies $z_t = I^{(q)}_{\theta_q}(x, y, t)$, $(x, y) \sim \rho(x, y|u^{(q)})$.

Experiment Details. The 2D Gaussian blobs for px|u1 and px|u2 for Fig. 4 are generated with unit variance at locations (1, 1) and (1, 1). y = x is used to generate py|u1 and py|u2. For 3D plot, the location of the blobs are (1, 1, 0) and (1, 1, 0).

B.5. Convergence issue of GAN-based UDT

Fig. 9 shows an example training trial of DDM-GAN for the setting considered in Sec. 5.2, along with sample translated images. Once the convergence issues are encountered, the translation quality is unacceptable, and the run can be considered an optimization failure.


C. Additional Experiments

C.1. Swarm Navigation

Navigating a large swarm of robots, on land or in air, requires computing the control policies (i.e., the velocities) for individual robots that meet the specifications for their collective behavior (Liu et al., 2018). CNF-based methods can be used to recover an individual robot's control policy based on the robot's location and time such that the entire swarm navigates from the source location to the destination (Liu et al., 2023a; Kapuśniak et al., 2024; Liu et al., 2018).

Problem Description. We consider swarm navigation on a complex land surface. The surface is specified by LiDAR measurements of Mt. Rainier (Legg & Anderson, 2013) containing 34,183 points. We consider a challenging scenario where there are two different swarms of robots with their own source and destination locations on the surface, as shown in Fig. 10 [Column 1]. The source locations of the first and second swarms are distributed as $p_{x|u_1}$ and $p_{x|u_2}$, and the destinations as $p_{y|u_1}$ and $p_{y|u_2}$, respectively. The goal is to transport samples from $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q$, which aligns with the objective of DFM, so this problem can be understood as a DDM problem. We want to control the clusters of robots so that they move from the source to the destination clusters while avoiding collisions among the clusters.

Neural Networks and Hyperparameter settings. We use a 3-layer MLP with 64 hidden units and SELU activation to represent both $I_\theta$ and $v_t(\cdot; \phi)$. We use the Adam optimizer for both the interpolant and the vector field, with initial learning rates of $10^{-4}$ and $10^{-3}$, respectively. We use a weight decay of $10^{-5}$ for both networks. We use $\sigma_1 = 0.1$, $\sigma_2 = 1.5$.

Regularization for surface adherence. Problem (12) itself does not encourage trajectories to stay close to the land surface represented by the LiDAR measurements. We therefore add a regularization term from Kapuśniak et al. (2024) in the first phase of our method, i.e., the interpolant training in Algorithm 1. To explain, let $D = \{m_i\}_{i=1}^{N}$ be the set of LiDAR measurements. Then, Lines 4 and 5 in Algorithm 1 are modified as follows:

$$\ell(\theta) = \lambda_1 \mathcal{L}_{\rm interp}(\theta) + \lambda_2 \mathcal{L}_{\rm mfm}(\theta),$$

$$\mathcal{L}_{\rm mfm}(\theta) = \big(\partial_t z^{(q)}_{t_q,\theta}\big)^{\top} G\big(z^{(q)}_{t_q,\theta}, D\big)\, \partial_t z^{(q)}_{t_q,\theta},$$

where $G(z^{(q)}_{t_q,\theta}, D)$ is a data-dependent metric introduced in Kapuśniak et al. (2024) to pull the interpolant paths closer to the manifold represented by $D$.

For the experiments in Sec. 5.3, we use LAND, originally introduced in Arvanitidis et al. (2016) (see details in Kapuśniak et al. (2024)), with hyperparameter $\sigma = 0.125$. We set $\lambda_1 = 5000$ and $\lambda_2 = 1$.

Dataset. In the experiment in Fig. 10, we use Gaussians to represent $p_{x|u^{(q)}}$ and $p_{y|u^{(q)}}$. We use a variance of 0.02 for $p_{x|u^{(q)}}$ and 0.03 for $p_{y|u^{(q)}}$, $q \in \{1, 2\}$. We use $K = 4000$ samples for each of the conditional distributions.

Metric. We use the surface adherence (SA) metric to measure how close the trajectory is, on average, to the surface specified by the LiDAR measurements. SA is defined as follows:

$$\mathrm{SA} = \frac{1}{T} \sum_{\tau=1}^{T} \left| [x_\tau]_3 - [\mathrm{NN}([x_\tau]_{1:2}; D_{1:2})]_3 \right|, \quad (19)$$

where $T$ is the number of trajectory points $x_\tau$, and $\mathrm{NN}([x_\tau]_{1:2}; D_{1:2})$ represents the nearest neighbor of $x_\tau$ in the set $D$ when only the first and second coordinates are considered, i.e., the $x, y$ location without the height.
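The SA metric can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: it uses a brute-force nearest-neighbor search, averages over the trajectory points (consistent with the "on average" description above), and evaluates a hypothetical toy surface and trajectory.

```python
def nearest_neighbor_xy(point, surface):
    """Surface point closest to `point` using only the first two coordinates."""
    return min(surface,
               key=lambda m: (m[0] - point[0]) ** 2 + (m[1] - point[1]) ** 2)

def surface_adherence(trajectory, surface):
    """Mean absolute height gap between each trajectory point and its
    xy-nearest surface measurement."""
    gaps = [abs(x[2] - nearest_neighbor_xy(x, surface)[2]) for x in trajectory]
    return sum(gaps) / len(gaps)

# Hypothetical toy surface (samples of the tilted plane z = x) and a short trajectory.
surface = [(float(i), float(j), float(i)) for i in range(3) for j in range(3)]
trajectory = [(0.1, 0.0, 0.5), (1.0, 1.1, 1.0), (1.9, 2.0, 1.0)]
sa = surface_adherence(trajectory, surface)
print(sa)  # height gaps are 0.5, 0.0 and 1.0
```

For a point cloud of the size used here (34,183 points), a spatial index such as a k-d tree would likely be preferable to the brute-force search in this sketch.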

Result. Fig. 10 shows the trajectories obtained by the proposed method and the baselines. The trajectories returned by the baselines almost overlap for different swarms and adhere poorly to the land surface (e.g., crossing underneath the mountain). In contrast, the proposed DFM method returns different trajectories for different swarms while staying close to the surface.

Table 3 shows the SA obtained by all methods averaged over 5 trials. Together, Table 3 and Fig. 10 indicate that the proposed method achieves better swarm navigation than the baselines in terms of both transport and surface adherence. Note that the color in Fig. 10 indicates the time $t$. This experiment also shows that DFM is useful in tasks requiring simultaneous trajectory estimation between multiple pairs of distributions.


[Figure 10 panels: columns include MFM-cond-OT and DFM (Proposed); rows show Top View and Front View]

Figure 10: Visualization of the trajectories obtained by different methods. [Column 1, Top] Source locations of the two swarms (red and blue) on the top and the corresponding destinations on the bottom of the plot. [Columns 2-5] Swarm trajectories obtained by different methods; color specifies the time, from red ($t = 0$) to blue ($t = 1$).

Table 3: Average SA with standard deviation for different methods.

| Method | SA |
|---|---|
| FM-OT (Tong et al., 2023) | 1.09 ± 0.027 |
| FM-cond-OT (Tong et al., 2023) w/ (9) | 1.08 ± 0.019 |
| MFM-OT (Kapuśniak et al., 2024) | 0.47 ± 0.015 |
| MFM-cond-OT (Kapuśniak et al., 2024) w/ (9) | 0.43 ± 0.055 |
| DFM (Proposed) | 0.32 ± 0.080 |

D. Experiment Details

D.1. Unpaired Image to Image Translation

Neural Networks and Hyperparameter settings. We use the UNet architecture (Ronneberger et al., 2015) to represent both neural networks. We adopt a similar hyperparameter configuration based on the UNet architecture (Dhariwal & Nichol, 2021). For the vector field, we use the AdamW optimizer (Loshchilov, 2017) with an initial learning rate of $10^{-4}$ and parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and no weight decay. We use a batch size of 64 and a dropout of 0.1. We take the exponential moving average (EMA) (Tarvainen & Valpola, 2017) of the weights with decay parameter 0.9999. We use the same hyperparameter settings for the interpolant, except that the learning rate is set to $10^{-8}$, the number of head channels is 32, and the attention resolution is 8. We use $\sigma_1 = 0.1$ and $\sigma_2 = 10$. We train the models for 100k iterations. All baselines are also trained for the same number of iterations.
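For readers unfamiliar with weight averaging, the EMA step mentioned above can be sketched as below. This shows the standard update rule only; whether the authors apply warm-up or bias correction is not stated, and the flat list of floats is a hypothetical stand-in for the network parameters.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy check with a smaller decay so the effect is visible in a few steps.
ema = [0.0]
for _ in range(10):
    ema = ema_update(ema, [1.0], decay=0.9)
print(ema[0])  # equals 1 - 0.9**10 for constant weights and a zero start
```

With a decay of 0.9999, the average has an effective horizon on the order of $10^4$ iterations, which is short relative to the 100k training iterations reported above.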

For the image translation experiment, we observed improved training stability and efficiency when modifying Algorithm 1 so that θ-updates and ϕ-updates are interleaved, resulting in Algorithm 2. The primary reason is that learning the vector field in the second phase of Algorithm 1 with respect to a complex nonlinear interpolant appears challenging, as evidenced by oscillations in the objective function. However, training the vector field and interpolant in tandem alleviates this issue, as the randomly initialized Iθ at the beginning of training is close to a linear interpolant. This results in a more gradual training of vector field from almost linear to increasingly nonlinear paths.

E.1. Proposition 3.4

To prove Proposition 3.4, first consider the following definition:


Table 4: Hyperparameters of the UNets for unpaired image translation task.

| Hyperparameter | Value |
|---|---|
| Attention resolution | 16 |
| Heads channels | 64 |
| Heads | 1 |
| Channels multiple | 2, 2, 2 |
| ResNet blocks | 4 |
| Channels | 128 |

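Algorithm 2 DFM with Interleaved Training
Require: $\rho$, $\sigma_1$, $\sigma_2$, initialized parameters $\phi$, $\theta$

1: while stopping criterion is not met do
2:   Sample $(x^{(q)}, y^{(q)}) \sim \rho(x, y|u^{(q)})$ and $t_q \sim U(0, 1)$, $\forall q \in [Q]$
3:   $z^{(q)}_{t_q,\theta} \leftarrow (1 - t_q)x^{(q)} + t_q y^{(q)} + t_q(1 - t_q)\gamma_\theta(x^{(q)}, y^{(q)})$, $\forall q \in [Q]$
4:   $\mathcal{L}_{\rm interp}(\theta) \leftarrow \sum_{i=1}^{Q-1} \sum_{j=i+1}^{Q} \gamma_{\sigma_2}(|t_i - t_j|)\, \gamma_{\sigma_1}(\|z^{(i)}_{t_i,\theta} - z^{(j)}_{t_j,\theta}\|_2)$
5:   $\theta \leftarrow \theta - \nabla_\theta \mathcal{L}_{\rm interp}(\theta)$
6:   $\mathcal{L}_{\rm FM}(\phi) \leftarrow \frac{1}{Q} \sum_{q=1}^{Q} \|v_t(z^{(q)}_{t_q,\theta}) - \partial_t z^{(q)}_{t_q,\theta}\|_2^2$
7:   $\phi \leftarrow \phi - \nabla_\phi \mathcal{L}_{\rm FM}(\phi)$
8: end while
9: return $v_t(\cdot; \phi)$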

Definition E.1 (Interpolable density, Albergo et al. (2023), Def. D.1). A probability path $p_t$ with $t \in [0, 1]$ is interpolable if there exists a time-dependent invertible map $\psi_t : \mathbb{R}^d \to \mathbb{R}^d$ with $t \in [0, 1]$, continuously differentiable in time and space, such that $p_t$ is the push-forward by $\psi_t$ of the standard normal density, i.e., $[\psi_t]_{\#}\mathcal{N}(0, I) = p_t$.

This definition will be helpful in proving Proposition 3.4.

Proposition 3.4. Suppose that there exists a flow $f_t : \mathbb{R}^d \to \mathbb{R}^d$, differentiable in time and space, from $p_x$ to $p_y$ such that $f_1 = g^\star$. Suppose there exists a diffeomorphism from $\mathcal{N}(0, I_d)$ to $p_{x|u^{(q)}}$, $\forall q$. Let $\hat{v}_t$ denote a solution of Problem (12) and denote $g_{\hat{v}}(x) = x + \int_{t=0}^{t=1} \hat{v}_t(z_t)\, dt$. Then, under the same model in (1), when $\{p_{x|u^{(q)}}\}_{q=1}^{Q}$ satisfies the SDC, we have $g_{\hat{v}} = g^\star$ a.e.

Proof. Let $\psi^{(q)}_0$ denote a diffeomorphism from $\mathcal{N}(0, I_d)$ to $p_{x|u^{(q)}}$. Further, let $\psi^{(q)}_t = f_t \circ \psi^{(q)}_0$. This implies that $[\psi^{(q)}_0]_{\#}\mathcal{N}(0, I) = p_{x|u^{(q)}}$ and $[\psi^{(q)}_1]_{\#}\mathcal{N}(0, I) = p_{y|u^{(q)}}$. Hence, $\psi^{(q)}_t$ is an invertible map, continuously differentiable in time and space, from $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$.

Moreover, let $p^{(q)}_t = [\psi^{(q)}_t]_{\#}\mathcal{N}(0, I)$. Then $p^{(q)}_t$, $\forall q$, are interpolable densities in the sense of Definition E.1.

Now, consider the following result from Albergo et al. (2023):

Proposition E.2 (Albergo et al. (2023), Proposition D.1). Let $\rho_t$ be an interpolable density in the sense of Definition E.1 with corresponding map $\psi_t$, i.e., $[\psi_t]_{\#}\mathcal{N}(0, I) = \rho_t$. Then

$$I(x, y, t) = \psi_t\left( \psi_0^{-1}(x) \cos\left(\tfrac{\pi}{2} t\right) + \psi_1^{-1}(y) \sin\left(\tfrac{\pi}{2} t\right) \right)$$

is such that $I \in \mathcal{I}$ for independent coupling, i.e., $x \sim \rho_0$ and $y \sim \rho_1$.

Invoking the above proposition, we have that for each $q \in [Q]$, there exists an interpolant function that can follow the probability path $p^{(q)}_t$.
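The endpoint behavior of the trigonometric interpolant in Proposition E.2 can be verified numerically. The sketch below is illustrative only: the one-dimensional affine map `psi` is a hypothetical choice of a time-dependent invertible map, not one used in the paper.

```python
import math

def psi(t, z):
    """Hypothetical time-dependent invertible map psi_t(z) = (1 + t) z + t."""
    return (1.0 + t) * z + t

def psi_inv(t, w):
    """Inverse of psi_t in its spatial argument."""
    return (w - t) / (1.0 + t)

def trig_interpolant(x, y, t):
    """I(x, y, t) = psi_t(psi_0^{-1}(x) cos(pi t / 2) + psi_1^{-1}(y) sin(pi t / 2))."""
    latent = (psi_inv(0.0, x) * math.cos(0.5 * math.pi * t)
              + psi_inv(1.0, y) * math.sin(0.5 * math.pi * t))
    return psi(t, latent)

x, y = 0.7, -1.3
print(trig_interpolant(x, y, 0.0), trig_interpolant(x, y, 1.0))
```

At $t = 0$ the cosine term selects $\psi_0^{-1}(x)$, so the interpolant returns $x$; at $t = 1$ the sine term selects $\psi_1^{-1}(y)$, so it returns $y$ up to floating-point rounding.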


We use a similar reformulation technique as in Albergo et al. (2023) (Appendix D) to re-express our constraints as follows:

$$\begin{aligned}
\underset{v_t,\, \rho^{(q)}_t,\, v^{(q)}_t}{\text{minimize}} \quad & \sum_{q=1}^{Q} \big\| v^{(q)}_t - v_t \big\|^2 && \text{(21a)} \\
\text{subject to:} \quad & \partial_t \rho^{(q)}_t + \nabla \cdot \big(v^{(q)}_t \rho^{(q)}_t\big) = 0, && \text{(21b)} \\
& \rho^{(q)}_0 = p_{x|u^{(q)}}, \quad \rho^{(q)}_1 = p_{y|u^{(q)}}, && \text{(21c)} \\
& \rho^{(q)}_t \ \text{interpolable}, && \text{(21d)}
\end{aligned}$$

where $\nabla \cdot u$ represents the divergence of the vector field $u$, i.e., $\nabla \cdot u(x) = \sum_{i=1}^{n} \frac{\partial u_i}{\partial x_i}$. Here, the optimization is over the probability path $\rho^{(q)}_t$ instead of the interpolants. Note that the constraints (21c)-(21d) restrict the search to probability paths that can be represented by interpolants in $\mathcal{I}$. In other words, $\rho^{(q)}_t$ in Problem (21) is a reparametrization of $I^{(q)}$ in Problem (12). Finally, the constraint (21b) is the continuity equation, which ensures that the private vector fields $v^{(q)}_t$ follow the probability path $\rho^{(q)}_t$ (Villani et al., 2009; Lipman et al., 2023; Albergo et al., 2023). Therefore, (21b) is equivalent to (12b).
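The continuity equation (21b) can be checked numerically on a simple case. The sketch below is an assumption-laden illustration, not part of the proof: it uses a hypothetical one-dimensional path $\rho_t = \mathcal{N}(t, 1)$ transported by the constant velocity $v = 1$, and central finite differences in place of exact derivatives.

```python
import math

def rho(t, x):
    """Density of N(t, 1): a unit-variance Gaussian whose mean moves with speed 1."""
    return math.exp(-0.5 * (x - t) ** 2) / math.sqrt(2.0 * math.pi)

def velocity(t, x):
    """Constant velocity field that translates the Gaussian above."""
    return 1.0

def continuity_residual(t, x, h=1e-4):
    """Central-difference estimate of d/dt rho + d/dx (v * rho) in one dimension."""
    d_rho_dt = (rho(t + h, x) - rho(t - h, x)) / (2.0 * h)
    flux = lambda s: velocity(t, s) * rho(t, s)
    d_flux_dx = (flux(x + h) - flux(x - h)) / (2.0 * h)
    return d_rho_dt + d_flux_dx

print(continuity_residual(0.3, 0.7))  # close to zero
```

A velocity field that does not match the path, for instance `velocity` returning 0.5, leaves a clearly nonzero residual, which is the sense in which (21b) ties each $v^{(q)}_t$ to its path $\rho^{(q)}_t$.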

Let $v^\star_t(f_t(x)) = \partial_t f_t(x)$. Then, $v^\star$, $p^{(q)}_t$, $\forall q$, constitute a feasible solution to Problem (21) that attains zero objective. This implies that Problem (12) can also attain an objective of zero at the minimum.

Let $\hat{v}^{(q)}$ and $\hat{I}^{(q)} \in \mathcal{I}$ denote the private vector fields and interpolants returned by solving Problem (12). Note that when the constraint (12b) is satisfied, $\hat{v}^{(q)}$ transports $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$ (Albergo et al., 2023).

Since there exists at least one solution to Problem (12) that attains zero objective value, $\hat{v}_t$ must also satisfy

$$\sum_{q=1}^{Q} \big\| \hat{v}_t - \hat{v}^{(q)}_t \big\|^2 = 0 \implies \hat{v}_t = \hat{v}^{(q)}_t, \ \forall q \in [Q].$$

Hence $\hat{v}_t$ is a unified vector field that transports $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q \in [Q]$.

Therefore, $g_{\hat{v}}$ satisfies

$$[g_{\hat{v}}]_{\#} p_{x|u} = p_{y|u}.$$

Hence $g_{\hat{v}}$ is a solution to Problem (4). Therefore, invoking Theorem 2.2, $g_{\hat{v}} = g^\star$ a.e.

E.2. Proof of Fact 3.3

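Let us define the solution sets $\mathcal{V}^{(q)}$ as follows:

$$\begin{aligned}
\mathcal{V}^{(q)} &= \{v_t \mid v_t \ \text{transports} \ p_{x|u^{(q)}} \ \text{to} \ p_{y|u^{(q)}}\} \\
&= \Big\{v_t \ \Big|\ v_t = \arg\min_{w_t} \mathcal{L}_q(w_t, I),\ I \in \mathcal{I}\Big\},
\end{aligned}$$

where $\mathcal{L}_q = \mathcal{L}_{\rm FM}(\cdot, \cdot, \rho(x, y|u^{(q)}))$ for brevity. Note that $\mathcal{V}^{(q)}$ is the set of all vector fields that can transport $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$. It is clear that $\mathcal{V}^{(i)} \neq \mathcal{V}^{(j)}$ for $i \neq j$ in general. Our goal is to find a unified vector field that can simultaneously transport $p_{x|u^{(q)}}$ to $p_{y|u^{(q)}}$, $\forall q$. To that end, let

$$\begin{aligned}
\mathcal{V} &= \{v_t \mid v_t \ \text{transports} \ p_{x|u^{(q)}} \ \text{to} \ p_{y|u^{(q)}},\ \forall q\} \\
&= \{v_t \mid v_t \in \mathcal{V}^{(1)} \ \& \ \ldots \ \& \ v_t \in \mathcal{V}^{(Q)}\} \\
&= \cap_{q=1}^{Q} \mathcal{V}^{(q)}. \qquad \text{(22)}
\end{aligned}$$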


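For ease of exposition, let $Q = 2$. Our argument generalizes to arbitrary $Q$. Let

$$\big(v^{(1)}_t, I^{(1)}\big) = \arg\min_{v_t,\, I \in \mathcal{I}} \mathcal{L}_1(v_t, I)$$

and

$$\big(\hat{v}_t, \hat{I}^{(1)}, \hat{I}^{(2)}\big) = \arg\min_{v_t \in \mathcal{V},\ I^{(1)}, I^{(2)} \in \mathcal{I}} \ \sum_{q=1}^{2} \mathcal{L}_q(v_t, I^{(q)}). \qquad \text{(23)}$$

Note that in Problem (23), $v_t \in \mathcal{V}$ is explicitly enforced, whereas in Problem (10), $v_t \in \mathcal{V}$ is not enforced but is what we hope for.

Nonetheless, the definition of $(v^{(1)}_t, I^{(1)})$ implies that

$$\mathcal{L}_1(v^{(1)}_t, I^{(1)}) \le \mathcal{L}_1(\hat{v}_t, \hat{I}^{(1)}).$$

However, if $v^{(1)}_t \notin \mathcal{V}$ and $\mathcal{L}_1(v^{(1)}_t, I^{(1)}) < \mathcal{L}_1(\hat{v}_t, \hat{I}^{(1)})$, then there may exist $\tilde{I}^{(2)} \in \mathcal{I}$ such that

$$\begin{aligned}
& \mathcal{L}_1(\hat{v}_t, \hat{I}^{(1)}) - \mathcal{L}_1(v^{(1)}_t, I^{(1)}) > \mathcal{L}_2(v^{(1)}_t, \tilde{I}^{(2)}) - \mathcal{L}_2(\hat{v}_t, \hat{I}^{(2)}) \\
\implies \ & \mathcal{L}_1(v^{(1)}_t, I^{(1)}) + \mathcal{L}_2(v^{(1)}_t, \tilde{I}^{(2)}) < \mathcal{L}_1(\hat{v}_t, \hat{I}^{(1)}) + \mathcal{L}_2(\hat{v}_t, \hat{I}^{(2)}).
\end{aligned}$$

Since $(v^{(1)}_t, I^{(1)}, \tilde{I}^{(2)})$ is also a feasible solution to Problem (10), we can conclude that Problem (10) is not equivalent to Problem (23). The vector field returned by Problem (10) does not guarantee DDM satisfaction, since its solution may not be from $\mathcal{V}$.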