Published as a conference paper at ICLR 2024

ENERGY-GUIDED ENTROPIC NEURAL OPTIMAL TRANSPORT

Petr Mokrov1, Alexander Korotin1,2, Alexander Kolesov1, Nikita Gushchin1, Evgeny Burnaev1,2
1Skolkovo Institute of Science and Technology, Moscow, Russia
2Artificial Intelligence Research Institute, Moscow, Russia
{petr.mokrov,a.korotin}@skoltech.ru

ABSTRACT

Energy-based models (EBMs) have been known in the Machine Learning community for decades. Since the seminal works on EBMs dating back to the noughties, many efficient methods have been developed which solve the generative modelling problem by means of energy potentials (unnormalized likelihood functions). In contrast, the realm of Optimal Transport (OT), and, in particular, neural OT solvers, is much less explored and limited to a few recent works (excluding WGAN-based approaches which utilize OT as a loss function and do not model OT maps themselves). In our work, we bridge the gap between EBMs and Entropy-regularized OT. We present a novel methodology which allows utilizing the recent developments and technical improvements of the former in order to enrich the latter. From the theoretical perspective, we prove generalization bounds for our technique. In practice, we validate its applicability in toy 2D and image domains. To showcase the scalability, we empower our method with a pre-trained StyleGAN and apply it to high-resolution AFHQ 512×512 unpaired I2I translation. For simplicity, we choose simple short- and long-run EBMs as a backbone of our Energy-guided Entropic OT approach, leaving the application of more sophisticated EBMs for future research. Our code is available at: https://github.com/PetrMokrov/Energy-guided-Entropic-OT

Figure 1: AFHQ 512×512 Cat → Dog unpaired translation by our Energy-guided EOT solver applied in the latent space of StyleGAN2-ADA. Our approach does not need data2latent encoding. Left: source samples; right: translated samples.

1 INTRODUCTION

The computational Optimal Transport (OT) field is an emergent and fruitful area of Machine Learning research which finds its applications in generative modelling (Arjovsky et al., 2017; Gulrajani et al., 2017; Deshpande et al., 2018), domain adaptation (Nguyen et al., 2021; Shen et al., 2018; Wang et al., 2022), unpaired image-to-image translation (Xie et al., 2019; Hu et al.), dataset manipulation (Alvarez-Melis & Fusi, 2020), population dynamics (Ma et al., 2021; Wang et al., 2018), gradient flow modelling (Alvarez-Melis et al., 2022; Mokrov et al., 2021) and barycenter estimation (Korotin et al., 2022a; Fan et al., 2021). The majority of the applications listed above utilize OT as a loss function, i.e., they have WGAN-like objectives which compare the generated (fake) and true data distributions. However, for some practical use cases, e.g., unpaired image-to-image translation (Korotin et al., 2023b), it is worth modelling the OT maps or plans themselves. The existing approaches which recover OT plans are based on various theoretically-advised techniques. Some of them (Makkuva et al., 2020; Korotin et al., 2021a) utilize the specific form of the cost function, e.g., the squared Euclidean distance. Others (Xie et al., 2019; Lu et al., 2020) modify GAN objectives with an additional OT regularizer, which results in biased OT solvers (Gazdieva et al., 2022, Thm. 1). The works (Fan et al., 2023; Korotin et al., 2023b; Rout et al., 2022) take advantage of the dual OT problem formulation.
They are capable of tackling unbiased large-scale continuous OT with general cost functions but may yield fake solutions (Korotin et al., 2023a). To overcome this issue, (Korotin et al., 2023a) propose to use strictly convex regularizers which guarantee the uniqueness of the recovered OT plans. One popular choice, which has been extensively studied in both the discrete (Cuturi, 2013) and continuous (Genevay et al., 2016; Clason et al., 2021) settings, is the Entropy. The well-studied methodological choices for modelling Entropy-regularized OT (EOT) include (a) the stochastic dual maximization approach which prescribes alternating optimization of dual potentials (Seguy et al., 2018; Daniels et al., 2021) and (b) the dynamic setup connected to the Schrödinger bridge problem (Bortoli et al., 2021; Gushchin et al., 2023; Chen et al., 2022). In contrast to the methods presented in the literature, we come up with an approach for solving EOT built upon EBMs.

Contributions. We propose a novel energy-based view on the EOT problem.
1. We take advantage of the weak dual formulation of the EOT problem and distinguish the EBM-related nature of the dual potential which originates from this formulation (§3.1).
2. We propose theoretically-grounded yet easy-to-implement modifications to the standard EBM training procedure which make it capable of recovering the EOT plans (§3.2).
3. We establish generalization bounds for the EOT plans learned via our proposed method (§3.3).
4. We showcase our algorithm's performance on low- and moderate-dimensional toy setups and large-scale 512×512 image transfer tasks solved with the help of a pre-trained StyleGAN (§5).

Notations. Throughout the paper, X and Y are compact subsets of Euclidean space, i.e., X ⊂ R^{D_x} and Y ⊂ R^{D_y}. The continuous functions on X are denoted by C(X). In turn, P(X) is the set of Borel probability distributions on X. Given distributions P ∈ P(X) and Q ∈ P(Y), Π(P, Q) designates the set of couplings between the distributions P and Q, i.e., probability distributions on the product space X × Y with the first and second marginals given by P and Q, respectively. We use Π(P) to denote the set of probability distributions on X × Y with the first marginal given by P. The absolutely continuous probability distributions on X are P_ac(X). For P ∈ P_ac(X) and Q ∈ P_ac(Y) we use dP(x)/dx and dQ(y)/dy to denote the corresponding probability density functions. Given distributions μ and ρ defined on a set Z, μ ≪ ρ means that μ is absolutely continuous with respect to ρ.

2 BACKGROUND

2.1 OPTIMAL TRANSPORT

The generic theory behind OT can be found in (Villani et al., 2009; Santambrogio, 2015). For the specific details on EOT, see (Genevay et al., 2016; Genevay, 2019). Let P ∈ P(X) and Q ∈ P(Y). The primal OT problem due to Kantorovich (Villani et al., 2009) is:

OT_c(P, Q) := inf_{π ∈ Π(P,Q)} ∫_{X×Y} c(x, y) dπ(x, y).   (1)

In the equation above, c : X × Y → R is a continuous cost function which reflects a practitioner's knowledge of how data from the source and target distributions should be aligned. Typically, the cost function c(x, y) is chosen to be the Euclidean norm ‖x − y‖₂, yielding the 1-Wasserstein distance (W₁), or the halved squared Euclidean norm ½ ‖x − y‖₂², yielding the square of the 2-Wasserstein distance (W₂²). The distributions π ∈ Π(P, Q) which minimize objective (1) are called the Optimal Transport plans.
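To make problem (1) concrete, below is a minimal sketch, based on the POT library cited later in §5 (Flamary et al., 2021), that computes a discrete OT plan between two small empirical point clouds under the halved squared Euclidean cost; the Sinkhorn call previews the entropic regularization discussed next. The variable names and data are our own illustration, not the paper's code.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 2))                          # samples from the source P
xt = rng.normal(size=(300, 2)) + np.array([4.0, 0.0])   # samples from the target Q

a = np.full(200, 1.0 / 200)      # uniform weights on the source samples
b = np.full(300, 1.0 / 300)      # uniform weights on the target samples
M = ot.dist(xs, xt) / 2.0        # cost matrix c(x, y) = 0.5 * ||x - y||^2

pi_exact = ot.emd(a, b, M)             # unregularized OT plan, problem (1)
pi_eot = ot.sinkhorn(a, b, M, reg=0.1) # entropy-regularized plan (Cuturi, 2013)

print("OT cost:", np.sum(pi_exact * M))
print("EOT transport cost:", np.sum(pi_eot * M))
```

The exact plan `pi_exact` is sparse (a permutation-like coupling), while the entropic plan `pi_eot` spreads mass more diffusely as the regularization strength grows.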
Problem (1) may have several OT plans (Peyré et al., 2019, Remark 2.3). In order to impose uniqueness and obtain a more tractable optimization problem, a common trick is to regularize (1) with a strictly convex (w.r.t. the distribution π) functional R : P(X × Y) → R.

Entropy-regularized Optimal Transport. In our work, we utilize the popular Entropic regularization (Cuturi, 2013), which has found its applications in various works (Solomon et al., 2015; Schiebinger et al., 2019; Rukhaia, 2021). This is mainly because of its amenable sample complexity (Genevay, 2019, §3) and the tractable dual representation of the Entropy-regularized OT problem, which can be leveraged by, e.g., Sinkhorn's algorithm (Cuturi, 2013; Vargas et al., 2021). Besides, the EOT objective is known to be strictly convex (Genevay et al., 2016) thanks to the strict convexity of the Entropy H and the KL divergence (Santambrogio, 2015; Nutz, 2021; Nishiyama, 2020) appearing in the EOT formulations. Let ε > 0. The EOT problem can be formulated in the following ways:

EOT^{(1)}_{c,ε}(P, Q) := min_{π ∈ Π(P,Q)} ∫_{X×Y} c(x, y) dπ(x, y) + ε KL(π ‖ P ⊗ Q),   (2)
EOT^{(2)}_{c,ε}(P, Q) := min_{π ∈ Π(P,Q)} ∫_{X×Y} c(x, y) dπ(x, y) − ε H(π),   (3)
EOT_{c,ε}(P, Q) := min_{π ∈ Π(P,Q)} ∫_{X×Y} c(x, y) dπ(x, y) − ε ∫_X H(π(·|x)) dP(x).   (4)

These formulations are equivalent when P and Q are absolutely continuous w.r.t. the corresponding standard Lebesgue measures, since KL(π ‖ P ⊗ Q) = −∫_X H(π(·|x)) dP(x) + H(Q) = −H(π) + H(Q) + H(P). In other words, equations (2), (3) and (4) are the same up to additive constants. In the remainder of the paper, we primarily work with the EOT formulation (4), and, henceforth, we additionally assume P ∈ P_ac(X), Q ∈ P_ac(Y) when necessary.

Let π* ∈ Π(P, Q) be the solution of the EOT problem. The measure disintegration theorem yields: dπ*(x, y) = dπ*(y|x) dπ*(x) = dπ*(y|x) dP(x). The distributions π*(·|x) will play an important role in our analysis. In fact, they constitute the only ingredient needed to (stochastically) transform a source point x ∈ X into target samples y_1, y_2, ... ∈ Y w.r.t. the EOT plan. We say that the distributions {π*(·|x)}_{x ∈ X} are the optimal conditional plans.

EOT problem as a weak OT (WOT) problem. The EOT problem (4) can be understood as a so-called weak OT problem (Gozlan et al., 2017; Backhoff-Veraguas et al., 2019). Given a weak transport cost C : X × P(Y) → R which penalizes the displacement of a point x ∈ X into a distribution π(·|x) ∈ P(Y), the weak OT problem is given by

WOT_C(P, Q) := inf_{π ∈ Π(P,Q)} ∫_X C(x, π(·|x)) dπ(x),   (5)

where dπ(x) = dP(x) due to the marginal constraint. The EOT formulation (4) is a particular case of the weak OT problem (5) for the weak transport cost

C_EOT(x, π(·|x)) = ∫_Y c(x, y) dπ(y|x) − ε H(π(·|x)).   (6)

Note that if the weak cost C is strictly convex and lower semicontinuous, as is the case for C_EOT, the solution of (5) exists and is unique (Backhoff-Veraguas et al., 2019).

Weak OT dual formulation of the EOT problem. Similar to the case of the classical Kantorovich OT (1), the weak OT problem permits a dual representation. Let f ∈ C(Y). Following (Backhoff-Veraguas et al., 2019, Eq. (1.3)), one introduces the weak C-transform f^C : X → R by

f^C(x) := inf_{μ ∈ P(Y)} { C(x, μ) − ∫_Y f(y) dμ(y) }.   (7)

For our particular case of the EOT-advised weak OT cost (6), equation (7) reads as

f^{C_EOT}(x) = min_{μ ∈ P(Y)} { ∫_Y c(x, y) dμ(y) − ε H(μ) − ∫_Y f(y) dμ(y) } =: min_{μ ∈ P(Y)} G_{x,f}(μ).   (8)

Note that the existence and uniqueness of the minimizer of (8) follow from the Weierstrass theorem (Santambrogio, 2015, Box 1.1) together with the lower semicontinuity and strict convexity of G_{x,f} in μ. The dual weak functional F^w_C : C(Y) → R for the primal WOT problem (5) is

F^w_C(f) := ∫_X f^C(x) dP(x) + ∫_Y f(y) dQ(y).
Thanks to the compactness of X and Y, strong duality holds (Gozlan et al., 2017, Thm. 9.5):

EOT_{c,ε}(P, Q) = sup_{f ∈ C(Y)} [ ∫_X min_{μ_x ∈ P(Y)} G_{x,f}(μ_x) dP(x) + ∫_Y f(y) dQ(y) ] = sup_{f ∈ C(Y)} F^w_{C_EOT}(f).   (9)

We say that (9) is the weak dual objective. It will play an important role in our further analysis.

2.2 ENERGY-BASED MODELS

EBMs are a fundamental class of deep generative modelling techniques (LeCun et al., 2006; Salakhutdinov et al., 2007) which parameterize a distribution of interest μ ∈ P(Y) by means of the Gibbs-Boltzmann distribution density:

dμ(y)/dy = (1/Z) exp(−E(y)).   (10)

In the equation above, E : Y → R is the Energy function (negative unnormalized log-likelihood), and Z = ∫_Y exp(−E(y)) dy is the normalization constant, known as the partition function. Let μ ∈ P(Y) be a true data distribution which is accessible by samples, and let μ_θ(y), θ ∈ Θ, be a parametric family of distributions given, e.g., by a deep neural network E_θ which imitates the Energy function in (10). In the EBM framework, one tries to bring the parametric distribution μ_θ close to the reference one μ by optimizing the KL divergence between them. The minimization of KL(μ ‖ μ_θ) is done via gradient descent by utilizing the well-known gradient (Xie et al., 2016):

∇_θ KL(μ ‖ μ_θ) = ∫_Y ∇_θ E_θ(y) dμ(y) − ∫_Y ∇_θ E_θ(y) dμ_θ(y).   (11)

The expectations on the right-hand side of (11) are estimated by Monte Carlo, which requires samples from μ and μ_θ. While the former are given, the latter are usually obtained via the Unadjusted Langevin Algorithm (ULA) (Roberts & Tweedie, 1996). It iterates the discretized Langevin dynamics

Y_{l+1} = Y_l − (η/2) ∇_y E_θ(Y_l) + √η ξ_l,   ξ_l ∼ N(0, 1),   (12)

starting from a simple prior distribution Y_0 ∼ μ_0, for L steps, with a small discretization step η > 0. In practice, a lot of methods have been developed which improve or circumvent the procedure above by informative initialization (Hinton, 2002; Du & Mordatch, 2019), more sophisticated MCMC approaches (Lawson et al., 2019; Qiu et al., 2020; Nijkamp et al., 2022), regularizations (Du et al., 2021; Kumar et al., 2019), or explicit auxiliary generators (Xie et al., 2018; Yin et al., 2022; Han et al., 2019; Gao et al., 2020). The application of these EBM improvements to the EOT problem is a fruitful avenue for future work. For a more in-depth discussion of methods for training EBMs, see the recent survey (Song & Kingma, 2021).
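As a reference point for the EOT-specific sampler derived in §3.2, here is a minimal PyTorch sketch of the unadjusted Langevin iteration (12) for a generic energy function; the toy Gaussian energy and the hyperparameter values are placeholders of ours, not settings used in the paper.

```python
import torch

def ula_sample(energy, y0, n_steps=500, step=1e-1):
    """ULA, Eq. (12): y <- y - (step/2) * grad E(y) + sqrt(step) * noise."""
    y = y0.clone()
    for _ in range(n_steps):
        y.requires_grad_(True)
        grad = torch.autograd.grad(energy(y).sum(), y)[0]
        y = (y - 0.5 * step * grad + step ** 0.5 * torch.randn_like(y)).detach()
    return y

# Toy usage: energy of a standard Gaussian, E(y) = 0.5 * ||y||^2.
energy = lambda y: 0.5 * (y ** 2).sum(dim=1)
samples = ula_sample(energy, torch.randn(512, 2))
print(samples.mean(dim=0), samples.std(dim=0))  # roughly 0 and 1, up to discretization bias
```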
3 TAKING UP EOT PROBLEM WITH EBMS

In this section, we connect EBMs and the EOT problem and exhibit our proposed methodology. First, we present theoretical results which characterize the weak dual objective (9) and its optimizers (§3.1). Second, we develop the optimization procedure (§3.2) and the corresponding algorithm capable of implicitly recovering EOT plans. Third, we establish generalization bounds for our proposed method (§3.3). All proofs are given in Appendix B.

3.1 ENERGY-GUIDED REFORMULATION OF WEAK DUAL EOT

We start our analysis by taking a close look at objective (9). The following proposition characterizes the inner min_{μ_x} optimization problem arising in (9).

Theorem 1 (Optimizer of the weak C_EOT-transform). Let f ∈ C(Y) and x ∈ X. Then the inner weak dual objective min_{μ ∈ P(Y)} G_{x,f}(μ) (8) admits a unique minimizer μ^f_x, which is given by

dμ^f_x(y)/dy := (1/Z(f, x)) exp( (f(y) − c(x, y)) / ε ),   where   Z(f, x) := ∫_Y exp( (f(y) − c(x, y)) / ε ) dy.   (13)

By substituting minimizer (13) into (8), we obtain the closed form of the weak C_EOT-transform:

f^{C_EOT}(x) = G_{x,f}(μ^f_x) = −ε log Z(f, x) = −ε log ∫_Y exp( (f(y) − c(x, y)) / ε ) dy.   (14)

Equation (14) resembles the (c, ε)-transform (Genevay, 2019, Eq. 4.15) appearing in the standard semi-dual EOT formulation (Genevay, 2019, §4.3). For completeness, we briefly introduce the dual EOT and semi-dual EOT problems in Appendix A, referring readers to (Genevay, 2019) for a more thorough introduction. In short, it is the particular form of the weak dual EOT objective, which differs from the semi-dual EOT objective, that allows us to utilize EBMs, as we show in §3.2.
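Theorem 1 can be sanity-checked numerically on a grid. The following NumPy sketch (our own illustration, not part of the paper) discretizes Y, evaluates G_{x,f} at the Gibbs candidate (13) and at random perturbations of it, and confirms that the candidate attains the value −ε log Z(f, x) from (14).

```python
import numpy as np

rng = np.random.default_rng(0)
eps, x = 0.5, 0.3
y = np.linspace(-3.0, 3.0, 2001)      # grid discretization of Y
dy = y[1] - y[0]
f = np.sin(2.0 * y)                    # an arbitrary potential f in C(Y)
c = 0.5 * (x - y) ** 2                 # quadratic cost c(x, y)

def G(mu):                             # discretized G_{x,f}(mu), Eq. (8)
    return np.sum((c - f) * mu) * dy + eps * np.sum(mu * np.log(mu)) * dy

Z = np.sum(np.exp((f - c) / eps)) * dy # partition function Z(f, x)
mu_star = np.exp((f - c) / eps) / Z    # Gibbs candidate, Eq. (13)

print(G(mu_star), -eps * np.log(Z))    # the two values coincide, Eq. (14)
for _ in range(5):                     # perturbed densities give strictly larger values
    mu = np.exp(np.log(mu_star) + 0.3 * rng.normal(size=y.size))
    mu /= np.sum(mu) * dy
    assert G(mu) > G(mu_star)
```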
Thanks to (14), objective (9) permits the reformulation

EOT_{c,ε}(P, Q) = sup_{f ∈ C(Y)} F^w_{C_EOT}(f) = sup_{f ∈ C(Y)} [ −ε ∫_X log Z(f, x) dP(x) + ∫_Y f(y) dQ(y) ].   (15)

For a given f ∈ C(Y), consider the distribution dπ^f(x, y) := dμ^f_x(y) dP(x). We prove that the optimization of the weak dual objective (15) brings π^f closer to the optimal plan π*.

Theorem 2 (Bound on the quality of the plan recovered from the dual variable). For brevity, denote the optimal value of (9) by F^{w,*}_{C_EOT} := EOT_{c,ε}(P, Q). For every f ∈ C(Y) it holds that

F^{w,*}_{C_EOT} − F^w_{C_EOT}(f) = ε ∫_X KL( π*(·|x) ‖ μ^f_x ) dP(x) = ε KL( π* ‖ π^f ).   (16)

From Theorem 2 it follows that, given an approximate maximizer f of the dual objective (15), one immediately obtains a distribution π^f which is close to the optimal plan π*. In fact, π^f is formed by the conditional distributions μ^f_x (Theorem 1), whose energy functions are given by f (adjusted with the transport cost c). Below we show that f in (15) can be optimized analogously to EBMs as well.

3.2 OPTIMIZATION PROCEDURE

Following standard machine learning practice, we parameterize the functions f ∈ C(Y) as neural networks f_θ with parameters θ ∈ Θ and derive the loss function corresponding to (15):

L(θ) := −ε ∫_X log Z(f_θ, x) dP(x) + ∫_Y f_θ(y) dQ(y).   (17)

The conventional way to optimize loss functions such as (17) is stochastic gradient ascent. In the following result, we derive the gradient of L(θ) w.r.t. θ.

Theorem 3 (Gradient of the weak dual loss L(θ)). It holds true that

∇_θ L(θ) = −∫_X ∫_Y ∇_θ f_θ(y) dμ^{f_θ}_x(y) dP(x) + ∫_Y ∇_θ f_θ(y) dQ(y).   (18)

Formula (18) resembles the gradient (11) of the Energy-based loss. This allows us to look at the EOT problem (4) from the perspective of EBMs. In order to emphasize the novelty of our approach and, simultaneously, establish the deep connection between the optimization of the weak dual objective in form (15) and EBMs, below we characterize the similarities and differences between the standard EBM optimization procedure and our proposed EOT-encouraged gradient ascent following ∂L(θ)/∂θ.

Differences. In contrast to the case of EBMs, the potential f_θ optimized by means of the loss function L does not represent an energy function by itself. However, the tandem of the cost function c and f_θ recovers the Energy functions of the conditional distributions μ^{f_θ}_x:

E_{μ^{f_θ}_x}(y) = ( c(x, y) − f_θ(y) ) / ε.

Therefore, one can sample from the distributions μ^{f_θ}_x following ULA (12) or using more advanced MCMC approaches (Girolami & Calderhead, 2011; Hoffman et al., 2014; Samsonov et al., 2022). In practice, when estimating (18), we need samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) from the distribution dπ^{f_θ}(x, y) := dμ^{f_θ}_x(y) dP(x). They can be obtained through a simple two-stage procedure:
1. Sample x_1, ..., x_N ∼ P, i.e., draw a random batch from the source dataset.
2. Sample y_1|x_1 ∼ μ^{f_θ}_{x_1}, ..., y_N|x_N ∼ μ^{f_θ}_{x_N}, e.g., by performing Langevin steps (12).

Similarities. Apart from the slightly more complicated two-stage procedure for sampling from the generative distribution π^{f_θ}, gradient ascent with (18) is similar to gradient descent with (11). This allows a practitioner to adopt existing practically efficient EBM architectures, e.g., (Du & Mordatch, 2019; Du et al., 2021; Gao et al., 2021; Zhao et al., 2021), in order to solve EOT.

Algorithm. We summarize our findings and detail our optimization procedure in Algorithm 1. The procedure is basic, i.e., for the sake of simplicity, we deliberately omit the technical tricks which are typically used when optimizing EBMs (persistent replay buffers (Tieleman, 2008), temperature adjustment, etc.). Particular implementation details are given in the experiments section (§5). We want to underline that our theoretical and practical setup allows performing theoretically-grounded, truly conditional data generation by means of EBMs, which unlocks data-to-data translation applications for the EBM community. Existing approaches to such applications with an Energy-inspired methodology lack theoretical interpretability, see the discussion in §4.1.

Algorithm 1: Entropic Optimal Transport via Energy-Based Modelling
Input: source and target distributions P and Q, accessible by samples; Entropy regularization coefficient ε > 0; cost function c : R^{D_x} × R^{D_y} → R; number of Langevin steps K > 0; Langevin discretization step size η > 0; basic noise std σ_0 > 0; potential network f_θ : R^{D_y} → R; batch size N > 0.
Output: trained potential network f_θ recovering the optimal conditional EOT plans.
for i = 1, 2, ... do
    Draw batches {x_n}_{n=1}^N = X ∼ P and {y_n}_{n=1}^N = Y ∼ Q of size N;
    Sample basic noise Y^(0) ∼ N(0, σ_0) of size N;
    for k = 1, 2, ..., K do
        Sample Z^(k) = {z^(k)_n}_{n=1}^N, where z^(k)_n ∼ N(0, 1);
        Obtain Y^(k) = {y^(k)_n}_{n=1}^N with the Langevin step:
            y^(k)_n ← y^(k−1)_n + (η / 2ε) · stop_grad( ∇_y [ f_θ(y) − c(x_n, y) ] |_{y = y^(k−1)_n} ) + √η · z^(k)_n;
    Estimate the loss: L̂ ← −(1/N) Σ_{y^(K)_n ∈ Y^(K)} f_θ(y^(K)_n) + (1/N) Σ_{y_n ∈ Y} f_θ(y_n);
    Perform a gradient ascent step over θ using ∂L̂/∂θ;
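Below is a minimal PyTorch sketch of one training iteration of Algorithm 1 for the quadratic cost. The MLP architecture, hyperparameter values, and helper names are our own placeholders and omit the engineering details of the actual experiments (§5, Appendix D).

```python
import torch
import torch.nn as nn

D_x = D_y = 2
eps, K, eta, sigma0, N = 0.1, 100, 1e-3, 1.0, 256
f_theta = nn.Sequential(nn.Linear(D_y, 128), nn.SiLU(),
                        nn.Linear(128, 128), nn.SiLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-4)

def cost(x, y):                       # c(x, y) = 0.5 * ||x - y||^2
    return 0.5 * ((x - y) ** 2).sum(dim=1)

def train_step(x, y_target):          # x ~ P, y_target ~ Q, shapes (N, D)
    # Stage 2 of the sampling procedure: Langevin steps (12) targeting mu^{f_theta}_x,
    # whose energy is (c(x, y) - f_theta(y)) / eps. Samples are detached (stop_grad).
    y = sigma0 * torch.randn(N, D_y)
    for _ in range(K):
        y.requires_grad_(True)
        inner = (f_theta(y).squeeze(-1) - cost(x, y)).sum()
        grad = torch.autograd.grad(inner, y)[0]
        y = (y + eta / (2 * eps) * grad + eta ** 0.5 * torch.randn_like(y)).detach()
    # Surrogate loss whose theta-gradient matches (18); we maximize it.
    loss_hat = -f_theta(y).mean() + f_theta(y_target).mean()
    opt.zero_grad()
    (-loss_hat).backward()            # Adam minimizes, so negate for ascent
    opt.step()
    return loss_hat.item()
```

A usage loop would simply draw `x` and `y_target` batches from the source and target datasets and call `train_step` repeatedly.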
3.3 GENERALIZATION BOUNDS FOR LEARNED ENTROPIC PLANS

Below, we address the question of how far a recovered plan is from the true optimal plan π*. In practice, the source and target distributions are given by empirical samples X_N = {x_n}_{n=1}^N ∼ P and Y_M = {y_m}_{m=1}^M ∼ Q, i.e., finite datasets. Besides, the available potentials f come from a restricted functional class F ⊂ C(Y), e.g., f are neural networks. Therefore, in practice, we actually optimize the following empirical counterpart of the weak dual objective (15):

max_{f ∈ F} F̂^w_{C_EOT}(f) := max_{f ∈ F} [ −(ε/N) Σ_{n=1}^N log Z(f, x_n) + (1/M) Σ_{m=1}^M f(y_m) ],

and recover f̂ := argmax_{f ∈ F} F̂^w_{C_EOT}(f). A question arises: how close is π^{f̂} to the OT plan π*? Our Theorem 4 below characterizes the error between π^{f̂} and π* arising due to the approximation error (F is restricted) and the estimation error (only finite samples of P and Q are available). To bound the estimation error, we employ the well-known Rademacher complexity (Shalev-Shwartz & Ben-David, 2014, §26).

Theorem 4 (Finite sample learning guarantees). Denote the functional class of weak C_EOT-transforms corresponding to F by F^{C_EOT} = { −ε log Z(f, ·) : f ∈ F }. Let R_N(F^{C_EOT}, P) and R_M(F, Q) denote the Rademacher complexities of the functional classes F^{C_EOT} and F w.r.t. P and Q for sample sizes N and M, respectively. Then the following upper bound on the error between π* and π^{f̂} holds:

E[ KL( π* ‖ π^{f̂} ) ] ≤ ε^{−1} ( 4 R_N(F^{C_EOT}, P) + 4 R_M(F, Q) )   [estimation error]
                       + ε^{−1} ( F^{w,*}_{C_EOT} − max_{f ∈ F} F^w_{C_EOT}(f) )   [approximation error],   (19)

where the expectation is taken over the random realizations of the datasets X_N ∼ P, Y_M ∼ Q of sizes N and M.

We note that there exist many statistical bounds for EOT (Genevay, 2019; Genevay et al., 2019; Rigollet & Stromme, 2022; Mena & Niles-Weed, 2019; Luise et al., 2019; del Barrio et al., 2023), yet they mostly concern the sample complexity of EOT, i.e., the estimation of the OT cost value, or the accuracy of the estimated barycentric projection x ↦ ∫_Y y dπ*(y|x) in the non-parametric setup. In contrast to these works, our result concerns the estimation of the entire OT plan π* in the parametric setup. Our Theorem 4 could be improved by deriving explicit numerical bounds. This can be done by analyzing particular NN architectures, similarly to (Klusowski & Barron, 2018; Sreekumar et al., 2021). We leave the corresponding analysis to follow-up research.

4 RELATED WORKS

In this section, we review the existing works which are the most relevant to our proposed method. We divide our survey into two main parts. First, we discuss the EBM approaches which tackle similar practical problem setups. Second, we give an overview of solvers dealing with Entropy-regularized OT. A discussion of general-purpose OT solvers is available in Appendix A.2.

4.1 ENERGY-BASED MODELS FOR UNPAIRED DATA-TO-DATA TRANSLATION

Given source and target domains X and Y, accessible by samples, the problem of unpaired data-to-data translation (Zhu et al., 2017) is to transform a point x ∈ X from the source domain into corresponding points y^x_1, y^x_2, ... ∈ Y from the target domain while preserving some notion of x's content. In order to solve this problem, (Zhao & Chen, 2021; Zhao et al., 2021) propose to utilize a pre-trained EBM of the target distribution Q, initialized with source samples x ∼ P. In spite of plausible-looking practical results, the theoretical properties of this approach remain unclear. Furthermore, after being passed through MCMC, the obtained samples may lose the conditioning on the source samples. In contrast, our proposed approach is free from the aforementioned problems and can be tuned to reach the desired trade-off between conditioning power and data variability. The authors of (Xie et al., 2021) propose to cooperatively train CycleGAN and EBMs to solve unpaired I2I problems. However, in their framework, the EBMs only help to stabilize the training of the I2I maps and cannot be considered as solvers of the primal problem.

4.2 ENTROPY-REGULARIZED OT

To the best of our knowledge, all continuous EOT solvers are based either on the KL-guided formulation (2) (Genevay et al., 2016; Seguy et al., 2018; Daniels et al., 2021) or on the unconditional entropic one (3) with its connection to the Schrödinger bridge problem (Finlay et al., 2020; Bortoli et al., 2021; Gushchin et al., 2023; Chen et al., 2022; Shi et al., 2023). Our approach seems to be the first which takes advantage of the conditional entropic formulation (4). The methods (Genevay et al., 2016; Seguy et al., 2018; Daniels et al., 2021) exploit the dual form of (2), see (Genevay, 2019, Eq. 4.2), which is an unconstrained optimization problem w.r.t. a couple of dual potentials (u, v).
However, (Genevay et al., 2016; Seguy et al., 2018) do not provide a direct way of sampling from the optimal conditional plans π*(y|x), since it requires the knowledge of the target distribution Q. In order to alleviate this issue, (Daniels et al., 2021) propose to employ a separate score-based model approximating Q. At the inference stage, (Daniels et al., 2021) utilizes MCMC sampling, which makes this work the closest to ours. A detailed comparison is given below:
1. The authors of (Daniels et al., 2021) optimize dual potentials (u, v) following the dual form of (2). This procedure is unstable for small ε, as it requires the exponentiation of large numbers of order ε^{−1}. At the same time, the small-ε regime is practically important for some downstream applications where one needs a close-to-deterministic plan between the X and Y domains. On the contrary, our Energy-based approach does not require computing exponents and can be adapted to small ε by properly adjusting the ULA (12) parameters (step size, number of steps, etc.).
2. In (Daniels et al., 2021), it is mandatory to have three models, including a third-party score-based model. Our algorithm results in a single potential f_θ capturing all the information about the OT conditional plans, which only optionally may be combined with an extra generative model (§5.3).

The alternative EOT solvers (Finlay et al., 2020; Bortoli et al., 2021; Gushchin et al., 2023; Chen et al., 2022; Vargas et al., 2021; Shi et al., 2023) are based on the connection between the primal EOT problem (3) and the Schrödinger bridge problem. The majority of these works model the EOT plan as a time-dependent stochastic process with learnable drift and diffusion terms, starting from P at the initial time and approaching Q at the final time. This requires resource-consuming techniques for solving stochastic differential equations. Moreover, the aforementioned methods work primarily with the quadratic cost and can hardly be accommodated to a more general case.

5 EXPERIMENTAL ILLUSTRATIONS

In what follows, we demonstrate the performance of our method on a toy 2D scenario, Gaussian-to-Gaussian transformation, and high-dimensional AFHQ Cat/Wild → Dog image translation problems solved using the latent space of a pre-trained StyleGAN2-ADA (Karras et al., 2020). In the first two experiments the cost function is chosen to be the halved squared ℓ2 norm c(x, y) = ½ ‖x − y‖₂², while in the latter case it is more intricate and involves the StyleGAN generator. An additional experiment with a Colored MNIST image translation setup is considered in Appendix C.1. Our code is written in PyTorch and publicly available at https://github.com/PetrMokrov/Energy-guided-Entropic-OT. The actual neural network architectures as well as the practical training setups are disclosed in the corresponding subsections of Appendix D.

We apply our method to the 2D Gaussian → Swissroll case and demonstrate the qualitative results in Figure 2 for Entropy regularization coefficients ε = 0.1, 0.001. Figure 2b shows that our method succeeds in transforming the source distribution P into the target distribution Q for both Entropy regularization coefficients. In order to ensure that our approach learns the optimal conditional plans π*(y|x) well and correctly solves the EOT problem, we provide Figures 2c and 2d. In these figures, we pick several points x ∈ X and demonstrate samples from the conditional plans π(·|x), obtained either by our method (π(·|x) = μ^{f_θ}_x) or by a discrete EOT solver (Flamary et al., 2021). In contrast to our approach, the samples generated by the discrete EOT solver come solely from the training dataset. Yet these samples can be considered as a fine approximation of the ground truth in 2D.

Figure 2: Performance of Energy-guided EOT on the Gaussian → Swissroll 2D setup. (a) Input and target distributions P and Q. (b) Our fitted distributions; ε = 10⁻¹ (top), ε = 10⁻³ (bottom). (c) Our fitted conditional plans π(·|x); ε = 10⁻¹ (top), ε = 10⁻³ (bottom). (d) Discrete conditional plans π_DOT(·|x); ε = 10⁻¹ (top), ε = 10⁻³ (bottom).
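For reference, discrete conditional plans of this kind can be obtained from a Sinkhorn plan by normalizing its rows. The following POT-based sketch is our own illustration of such a baseline, not the exact evaluation code of the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

def discrete_conditional_plan(xs, xt, x_query_idx, eps, n_samples=200):
    """Approximate pi(.|x) for the training point x = xs[x_query_idx] via discrete EOT."""
    n, m = len(xs), len(xt)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    M = 0.5 * ot.dist(xs, xt)                       # c(x, y) = 0.5 * ||x - y||^2
    pi = ot.sinkhorn(a, b, M, reg=eps)              # discrete EOT plan (Cuturi, 2013)
    cond = pi[x_query_idx] / pi[x_query_idx].sum()  # row normalization -> pi_DOT(.|x)
    idx = np.random.default_rng(0).choice(m, size=n_samples, p=cond)
    return xt[idx]                                  # samples restricted to the training set
```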
5.2 GAUSSIAN-TO-GAUSSIAN

Here we validate our method on Gaussian-to-Gaussian transformation tasks in various dimensions (D_x = D_y = 2, 16, 64, 128), for which the exact optimal EOT plans are analytically known (Janati et al., 2020). We choose ε = 0.1, 1, 10 and compare the performance of our approach with the baselines described in §4.2. We report the BW₂²-UVP metric (see Appendix D.2 for its definition) between the learned plan π̂ and the optimal plan π* in Table 1. As we can see, our method manages to recover the optimal plan rather well compared to the baselines. Technical peculiarities are disclosed in Appendix D.

(a) ε = 0.1
| Method | D = 2 | D = 16 | D = 64 | D = 128 |
| Ours | 0.01 | 0.2 | 0.56 | 0.97 |
| (Gushchin et al., 2023) | 0.02 | 0.08 | 0.19 | 0.34 |
| (Bortoli et al., 2021) | 1.97 | 4.22 | 3.44 | 5.87 |
| (Chen et al., 2022) (Alt) | 1.44 | 8.22 | 3.5 | 4.33 |
| (Chen et al., 2022) (Joint) | 0.45 | 4.8 | 5.6 | 5.28 |
| (Vargas et al., 2021) | 2.96 | 2.94 | 4.06 | 4.0 |
| (Daniels et al., 2021) | n/a | n/a | n/a | n/a |
| (Seguy et al., 2018) | 11.3 | 28.6 | 39.4 | 57.4 |

(b) ε = 1
| Method | D = 2 | D = 16 | D = 64 | D = 128 |
| Ours | 0.03 | 0.1 | 0.26 | 0.31 |
| (Gushchin et al., 2023) | 0.006 | 0.04 | 0.12 | 0.33 |
| (Bortoli et al., 2021) | 0.88 | 1.29 | 2.32 | 2.43 |
| (Chen et al., 2022) (Alt) | 0.75 | 1.7 | 2.45 | 2.64 |
| (Chen et al., 2022) (Joint) | 0.07 | 0.21 | 0.34 | 0.58 |
| (Vargas et al., 2021) | 0.3 | 0.9 | 1.34 | 1.8 |
| (Daniels et al., 2021) | 0.92 | 1.36 | 4.62 | 5.33 |
| (Seguy et al., 2018) | 6.77 | 14.6 | 25.6 | 47.1 |

(c) ε = 10
| Method | D = 2 | D = 16 | D = 64 | D = 128 |
| Ours | 0.11 | 0.09 | 0.18 | 0.29 |
| (Gushchin et al., 2023) | 0.22 | 0.15 | 0.4 | 0.86 |
| (Bortoli et al., 2021) | n/a | n/a | n/a | n/a |
| (Chen et al., 2022) (Alt) | n/a | n/a | n/a | n/a |
| (Chen et al., 2022) (Joint) | 0.88 | 2.12 | 2.77 | 3.45 |
| (Vargas et al., 2021) | 0.1 | 0.21 | 0.72 | 1.14 |
| (Daniels et al., 2021) | 3.22 | 5.57 | 3.13 | 4.98 |
| (Seguy et al., 2018) | 10.2 | 14.6 | 28.9 | 40.1 |

Table 1: Performance (BW₂²-UVP, lower is better) of Energy-guided EOT (ours) and the baselines on Gaussian → Gaussian tasks for dimensions D = 2, 16, 64, 128 and regularization coefficients ε = 0.1, 1, 10.

5.3 HIGH-DIMENSIONAL UNPAIRED IMAGE-TO-IMAGE TRANSLATION

Figure 3: AFHQ 512×512 Wild → Dog unpaired I2I by our method in the latent space of StyleGAN2-ADA. Left: source; right: translated.

In this subsection, we deal with the large-scale unpaired I2I setup. As in many other EBM works, e.g., (Zhao & Chen, 2021; Tiwary et al., 2022), we consider learning in the latent space. We pick a StyleGAN2-ADA (Karras et al., 2020) generator G pre-trained on the Dog part of the AFHQ 512×512 dataset and consider Cat → Dog (and Wild → Dog) unpaired translation. As P, we use the dataset of cat (or wild) images; as Q, we use N(0, I_512), i.e., the latent distribution of the StyleGAN. We use our method with ε = 1 to learn the EOT between P and Q with cost ½ ‖x − G(z)‖², i.e., the ℓ2 cost between the input image and the image generated from the latent code z. Note that our method trains only one MLP network f_θ acting on the latent space, which is then used for inference (combined with G). Moreover, our approach does not need a generative model of the source distribution P and does not need encoder (data2latent) networks. The qualitative results are provided in Figures 1 and 3. Our method allows us to translate images from one domain to the other while maintaining similarity with the input image. For more examples and qualitative comparisons, see Appendix C.2.
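To illustrate how the trained potential could be used at inference time in this latent-space setup, here is a hedged PyTorch sketch (our own reconstruction of the described procedure, with placeholder names G, f_theta and placeholder hyperparameters): given a source image x, it runs Langevin dynamics in the 512-dimensional latent space on the energy (½‖x − G(z)‖² − f_theta(z)) / ε and decodes the result with G.

```python
import torch

@torch.enable_grad()
def translate(x, G, f_theta, eps=1.0, n_steps=500, step=1e-4, latent_dim=512):
    """Sample z ~ mu^{f_theta}_x with energy (0.5 * ||x - G(z)||^2 - f_theta(z)) / eps, then decode."""
    z = torch.randn(1, latent_dim, device=x.device)
    for _ in range(n_steps):
        z.requires_grad_(True)
        energy = (0.5 * ((x - G(z)) ** 2).sum() - f_theta(z).sum()) / eps
        grad = torch.autograd.grad(energy, z)[0]
        z = (z - 0.5 * step * grad + step ** 0.5 * torch.randn_like(z)).detach()
    return G(z)  # translated image; the stochasticity of the plan comes from the Langevin noise
```

Different noise realizations yield different plausible translations of the same source image, reflecting the one-to-many nature of the conditional plan π(·|x).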
For the quantitative analysis, we compare our approach with the popular unpaired I2I models ILVR (Choi et al., 2021), SDEdit (Meng et al., 2022), EGSDE (Zhao et al., 2022), CycleGAN (Zhu et al., 2017), MUNIT (Huang et al., 2018) and StarGAN v2 (Choi et al., 2020); the obtained FID metrics are reported in Table 2. As we can see, our approach achieves comparable-to-SOTA quality.

| Method | Ours | ILVR | SDEdit | EGSDE | CycleGAN | MUNIT | StarGAN v2 |
| Cat → Dog FID | 56.6 | 74.37 | 74.17 | 51.04 | 85.9 | 104.4 | 54.88 |
| Wild → Dog FID | 65.8 | 75.33 | 68.51 | 50.43 | - | - | - |

Table 2: FID¹ of our method and the baselines for Cat → Dog and Wild → Dog.

6 DISCUSSION

Our work establishes a principled connection between EBMs and EOT. The latter is an emergent problem in generative modelling, with potential applications such as unpaired data-to-data translation (Korotin et al., 2023b). Our proposed EBM-based learning method for EOT is theoretically grounded, and we provide proof-of-concept experiments. We believe that our work will inspire future studies that further empower EOT with recent EBMs capable of efficiently handling truly large-scale setups (Du et al., 2021; Gao et al., 2021; Zhao et al., 2021).

The limitations of our method roughly match those of basic EBMs. Namely, our method requires MCMC both for training and for inference, which may be time-consuming. For an extended discussion of the limitations, see Appendix F. The broader impact of our work is the same as that of any generative modelling research. Generative models may be used for rendering, image editing, design, computer graphics, etc., and simplify the existing digital content creation pipelines. At the same time, it should be taken into account that the rapid development of generative models may also unexpectedly affect some jobs in the industry.

¹FID scores of the baselines are taken from (Zhao et al., 2022). In order to estimate FID, we downscale the generated images to 256×256 for a fair comparison with the alternative methods.

7 ACKNOWLEDGEMENTS

The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).

REFERENCES

David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428-21439, 2020.

David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=dpOYN7o8Jm.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.

Arip Asadulaev, Alexander Korotin, Vage Egiazarian, Petr Mokrov, and Evgeny Burnaev. Neural optimal transport with general cost functionals. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gIiz7tBtYZ.

Julio Backhoff-Veraguas, Mathias Beiglböck, and Gudmun Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):203, 2019.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021.
URL https://openreview.net/forum?id=9Bn Cwi XB0ty. Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nio Ad KCEd XB. Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. in 2021 ieee. In CVF international conference on computer vision (ICCV), pp. 14347 14356, 2021. Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188 8197, 2020. Christian Clason, Dirk A Lorenz, Hinrich Mahler, and Benedikt Wirth. Entropic regularization of continuous optimal transport problems. Journal of Mathematical Analysis and Applications, 494 (1):124432, 2021. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Max Daniels, Tyler Maunu, and Paul Hand. Score-based generative neural networks for large-scale optimal transport. Advances in neural information processing systems, 34:12955 12965, 2021. Eustasio del Barrio, Alberto González Sanz, Jean-Michel Loubes, and Jonathan Niles-Weed. An improved central limit theorem and fast convergence rates for entropic transportation costs. SIAM Journal on Mathematics of Data Science, 5(3):639 669, 2023. Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3483 3491, 2018. Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019. Published as a conference paper at ICLR 2024 Yilun Du, Shuang Li, B. Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy based models. In Proceedings of the 38th International Conference on Machine Learning (ICML-21), 2021. Jiaojiao Fan, Amirhossein Taghvaei, and Yongxin Chen Chen. Scalable computations of wasserstein barycenter via input convex neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1571 1581. PMLR, 18 24 Jul 2021. URL https://proceedings. mlr.press/v139/fan21d.html. Jiaojiao Fan, Shu Liu, Shaojun Ma, Hao-Min Zhou, and Yongxin Chen. Neural monge map estimation and its applications. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=2m ZSl Qscj3. Featured Certification. Chris Finlay, Augusto Gerolin, Adam M Oberman, and Aram-Alexandre Pooladian. Learning normalizing flows from entropy-kantorovich potentials. ar Xiv preprint ar Xiv:2006.06033, 2020. Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html. 
Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7518 7528, 2020. Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P Kingma. Learning energy-based models by diffusion recovery likelihood. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=v_1Soh8QUNc. Milena Gazdieva, Litu Rout, Alexander Korotin, Andrey Kravchenko, Alexander Filippov, and Evgeny Burnaev. An optimal transport perspective on unpaired image super-resolution. ar Xiv preprint ar Xiv:2202.01116, 2022. Aude Genevay. Entropy-regularized optimal transport for machine learning. Ph D thesis, Paris Sciences et Lettres (Com UE), 2019. Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for largescale optimal transport. Advances in neural information processing systems, 29, 2016. Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd international conference on artificial intelligence and statistics, pp. 1574 1583. PMLR, 2019. Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123 214, 2011. Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327 3405, 2017. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017. Nikita Gushchin, Alexander Kolesov, Alexander Korotin, Dmitry Vetrov, and Evgeny Burnaev. Entropic neural optimal transport via diffusion processes. In Advances in Neural Information Processing Systems, 2023. Published as a conference paper at ICLR 2024 Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8670 8679, 2019. Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771 1800, 2002. Matthew D Hoffman, Andrew Gelman, et al. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. Weining Hu, Meng Li, and Xiaomeng Ju. Improved cyclegan for image-to-image translation. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172 189, 2018. Hicham Janati, Boris Muzellec, Gabriel Peyré, and Marco Cuturi. Entropic optimal transport between unbalanced gaussian measures has a closed form. Advances in neural information processing systems, 33:10468 10479, 2020. Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in neural information processing systems, 33:12104 12114, 2020. Jason M Klusowski and Andrew R Barron. Approximation by combinations of relu and squared relu ridge functions with ℓ1 and ℓ0 controls. 
IEEE Transactions on Information Theory, 64(12): 7649 7656, 2018. Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=b Eoxz W_EXsa. Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=3t FAs5E-Pe. Alexander Korotin, Vage Egiazarian, Lingxiao Li, and Evgeny Burnaev. Wasserstein iterative networks for barycenter estimation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022a. URL https: //openreview.net/forum?id=Gi Enzx Tna MN. Alexander Korotin, Alexander Kolesov, and Evgeny Burnaev. Kantorovich strikes back! wasserstein GANs are not optimal transport? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/forum? id=Vt EEpi-d Glt. Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Kernel neural optimal transport. In The Eleventh International Conference on Learning Representations, 2023a. URL https: //openreview.net/forum?id=Zuc_MHt Uma4. Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. In The Eleventh International Conference on Learning Representations, 2023b. URL https: //openreview.net/forum?id=d8CBRl WNkq H. Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. ar Xiv preprint ar Xiv:1901.08508, 2019. John Lawson, George Tucker, Bo Dai, and Rajesh Ranganath. Energy-inspired models: Learning with sampler-induced distributions. Advances in Neural Information Processing Systems, 32, 2019. Yann Le Cun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006. Published as a conference paper at ICLR 2024 Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVj TT1nw5z. Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, and Yong Yu. Large-scale optimal transport via adversarial training with cycle-consistency. ar Xiv preprint ar Xiv:2003.06635, 2020. Giulia Luise, Saverio Salzo, Massimiliano Pontil, and Carlo Ciliberto. Sinkhorn barycenters with free support via frank-wolfe algorithm. Advances in neural information processing systems, 32, 2019. Shaojun Ma, Shu Liu, Hongyuan Zha, and Haomin Zhou. Learning stochastic behaviour from aggregate data. In International Conference on Machine Learning, pp. 7258 7267. PMLR, 2021. Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672 6681. PMLR, 2020. Simone Di Marino and Augusto Gerolin. An optimal transport approach for the schrödinger bridge problem and convergence of sinkhorn algorithm. Journal of Scientific Computing, 85(2):27, 2020. Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. 
Advances in Neural Information Processing Systems, 32, 2019. Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=a Bs Cjc Pu_t E. Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243 15256, 2021. Tuan Nguyen, Trung Le, He Zhao, Quan Hung Tran, Truyen Nguyen, and Dinh Phung. Most: Multi-source domain adaptation via optimal transport for student-teacher learning. In Uncertainty in Artificial Intelligence, pp. 225 235. PMLR, 2021. Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of mcmc-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5272 5280, 2020. Erik Nijkamp, Ruiqi Gao, Pavel Sountsov, Srinivas Vasudevan, Bo Pang, Song-Chun Zhu, and Ying Nian Wu. MCMC should mix: Learning energy-based model with neural transport latent space MCMC. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=4C93Qvn-tz. Tomohiro Nishiyama. Convex optimization on functionals of probability densities. ar Xiv preprint ar Xiv:2002.06488, 2020. Marcel Nutz. Introduction to entropic optimal transport. Lecture notes, Columbia University, 2021. Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Yixuan Qiu, Lingsong Zhang, and Xiao Wang. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations, 2020. Philippe Rigollet and Austin J Stromme. On the sample complexity of entropic optimal transport. ar Xiv preprint ar Xiv:2206.13472, 2022. Gareth O Roberts and Richard L Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Published as a conference paper at ICLR 2024 Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=5Jd LZg346Lw. Giorgi Rukhaia. A Free Form Optics Application of Entropic Optimal Transport. Ph D thesis, Université Paris sciences et lettres, 2021. Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pp. 791 798, 2007. Sergey Samsonov, Evgeny Lagutin, Marylou Gabrié, Alain Durmus, Alexey Naumov, and Eric Moulines. Local-global MCMC kernels: the best of both worlds. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=zb-xf Apk4ZK. Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. 
Cell, 176(4):928 943, 2019. Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id= B1zlp1b RW. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion schrödinger bridge matching. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=qy07OHs JT5. Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (To G), 34(4):1 11, 2015. Yang Song and Diederik P Kingma. How to train your energy-based models. ar Xiv preprint ar Xiv:2101.03288, 2021. Sreejith Sreekumar, Zhengxin Zhang, and Ziv Goldfeld. Non-asymptotic performance guarantees for neural estimation of f-divergences. In International Conference on Artificial Intelligence and Statistics, pp. 3322 3330. PMLR, 2021. Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. ar Xiv preprint ar Xiv:1902.07197, 2019. Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064 1071, 2008. Piyush Tiwary, Kinjawl Bhattacharyya, and Prathosh AP. Boundary preserving twin energy-basedmodels for image to image translation. 2022. Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021. Published as a conference paper at ICLR 2024 Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. Rui Wang, Ruiyi Zhang, and Ricardo Henao. Wasserstein uncertainty estimation for adversarial domain matching. Frontiers in big Data, 5, 2022. Yisen Wang, Bo Dai, Lingkai Kong, Sarah Monazam Erfani, James Bailey, and Hongyuan Zha. Learning deep hidden nonlinear dynamics from aggregate data. ar Xiv preprint ar Xiv:1807.08237, 2018. Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pp. 2635 2644. PMLR, 2016. Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence, 42(1):27 45, 2018. Jianwen Xie, Zilong Zheng, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu. Learning cycleconsistent cooperative networks via alternating mcmc teaching for unsupervised cross-domain translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10430 10440, 2021. Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, pp. 6882 6892. PMLR, 2019. Xuwang Yin, Shiying Li, and Gustavo K Rohde. 
Learning energy-based models with adversarial training. In Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23 27, 2022, Proceedings, Part V, pp. 209 226. Springer, 2022. Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems, 35:3609 3623, 2022. Yang Zhao and Changyou Chen. Unpaired image-to-image translation via latent energy transport. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16418 16427, 2021. Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling. In International Conference on Learning Representations, 2021. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223 2232, 2017. Published as a conference paper at ICLR 2024 A EXTENDED BACKGROUND AND RELATED WORKS A.1 DUAL/SEMI-DUAL EOT PROBLEMS AND THEIR RELATION TO WEAK DUAL EOT PROBLEM Dual formulation of the EOT problem. Primal EOT problem (2) has the dual reformulation (Genevay et al., 2016). Let u C(X) and v C(Y). Define the dual functional by Fε(u, v) def = Z X u(x)d P(x)+ Z Y v(y)d Q(y) ε Z X Y exp u(x) + v(y) c(x, y) d [P Q](x, y). Then the strong duality holds (Genevay et al., 2016, Proposition 2.1), i.e., EOT(1) c,ε(P, Q) = sup u C(X),v C(Y) Fε(u, v). (21) The sup here may not be attained in C(X), C(Y), yet it is common to relax the formulation and consider u L (P) and v L (Q) instead (Marino & Gerolin, 2020). In this case, the sup becomes max (Genevay, 2019, Theorem 7). The potentials u , v which constitute a solution of the relaxed (21) are called the (Entropic) Kantorovich potentials. The optimal transport plan π which solves the primal problem (2) could be recovered from a pair of Kantorovich potentials (u , v ) as dπ (x, y) = exp u (x) + v (y) c(x, y) d P(x)d Q(y). (22) From the practical viewpoint, the dual objective (21) is an unconstrained maximization problem which can be solved by conventional optimization procedures. The existing methods based on (21) as well as their limitations and drawbacks are discussed in the related works section, 4.2. Semi-dual formulation of the EOT problem. Objective (21) is a convex optimization problem. By fixing a function v C(Y) and applying the first-order optimality conditions for the marginal optimization problem supu C(X) Fε(u, v), one can recover the solution of vc,ε def = arg max u C(X) Fε(u, v) (23) in the closed-form: vc,ε(x) = ε log Z Y exp v(y) c(x, y) d Q(y) . (24) Function vc,ε is called (c, ε)-transform (Genevay, 2019). By substituting the argument u C(X) of the max with vc,ε in equation (21) and performing several simplifications, one can recover the objective EOT(1) c,ε(P, Q) = sup v C(Y) X vc,ε(x)d P(x) + Z Y v(y)d Q(y). (25) This one is called the semi-dual formulation of EOT problem (Genevay, 2019, 4.3). It is not so popular as the classical dual problem (21) since the estimation of vc,ε is non-trivial, yet it has a direct relation to formulation (9), which forms the basis of our proposed method. In order to comprehend this relation, below we compare (c, ε)-transform and minµ Gx,f(µ), given by (14). Correspondence between (semi-) dual EOT and weak dual EOT. 
As we already pointed out in the main part of the manuscript, equation (14) which then appears in weak dual objective (15) looks similar to (c, ε)-transform (24). The difference is the integration measure, which is Q in the case of (24) and the standard Lebesgue one in our case (14). From the theoretical point of view, such dissimilarity is not significant. That is why semi-dual (25) and weak dual (9) optimization problems are expected to share such properties as convergence, existence of optimizers and so on. In particular, there is a relation between potentials u, v that appear in (semi-) dual EOT problems (21, 25), and our optimized potential f from (9). Let Q Pac(Y), and EQ : Y R be the energy function of Q, i.e., d Q(y) dy exp ( EQ(y)). Consider the parameterization of potentials v by means of f as follows: v(y) f(y) + εEQ(y). (26) Published as a conference paper at ICLR 2024 vc,ε(x) = ε log Z Y exp v(y) c(x, y) Y exp f(y) c(x, y) exp (EQ(y)) d Q(y) Y exp f(y) c(x, y) exp (EQ(y)) exp ( EQ(y)) dy + Const(Q) see (14) = ε log Z(f, x) + Const(Q). (27) By substituting (26, 27) in (25) we obtain that semi-dual EOT objective (25) recovers our weak dual objective (15) up to reparameterization (26): EOT(1) c,ε(P, Q) = sup v C(Y) X vc,ε(x)d P(x) + Z Y v(y)d Q(y) = Const1(Q) | {z } =εH(Q) + sup f C(Y) εEQ ε log Z(f, x) d P(x) + Z Y f(y)d Q(y) . Strictly speaking, the optimization class C(Y) εEQ = {f εEQ, f C(Y)} in the equation above is different from C(Y) that appears in (15). However, under mild assumptions on EQ the corresponding optimization problems are similar. One important consequence of the observed equivalence is the existence of optimal potential f (not necessary to be continuous) which solves weak dual objective (9). It can be expressed through optimal Kantorovich potential v by f = v εEQ. From the practical point of view, the difference between (14) and (24) is much more influential. Actually, it is the particular form of internal weak dual problem solution (14) that allows us to utilize EBMs, see 3.2. A.2 DISCRETE AND CONTINUOUS OT SOLVERS REVIEW Discrete OT. The discrete OT (DOT) is the specific domain in OT research area, which deals with distributions supported on finite discrete sets. There have been developed various methods for solving DOT problems (Peyré et al., 2019), the most efficient of which deals with discrete EOT (Cuturi, 2013). In spite of good theoretically-grounded convergence guarantees, it is hard to adopt the DOT solvers for out-of-distribution sampling and mapping, which limits their applicability in some real-world scenarios. Continuous OT. In the continuous setting, the source and target distributions become accessible only by samples from (limited) budgets. In this case, OT plans are typically parameterized with neural networks and optimized with the help of SGD-like methods by deriving random batches from the datasets. The approaches dealing with such practical setup are called continuous OT solvers. There exists a lot of continuous OT solvers (Makkuva et al., 2020; Korotin et al., 2021a; Fan et al., 2023; Xie et al., 2019; Rout et al., 2022; Gazdieva et al., 2022; Korotin et al., 2022b; Taghvaei & Jalali, 2019; Liu et al., 2023). However, the majority of these methods model OT as a deterministic map which for each input point x assigns a single data point y rather than distribution π(y|x). Only a limited number of approaches are capable of solving OT problems which require stochastic mapping, and, therefore, potentially applicable for our EOT case. 
Here we exclude the methods designed specifically for EOT and cover them in the subsequent narrative. The recent line of works (Korotin et al., 2023b;a; Asadulaev et al., 2024) considers the dual formulation of the weak OT problem (Gozlan et al., 2017) and comes up with a max-min objective for various weak (Korotin et al., 2023b;a) and even general (Asadulaev et al., 2024) cost functionals. However, their proposed methodology requires the estimation of the weak cost from samples, which complicates its application to EOT. An alternative concept (Xie et al., 2019) works with the primal OT formulation (1) and lifts the boundary source and target distribution constraints via WGAN losses. It also utilizes sample estimation of the corresponding functionals and cannot be directly adapted to the EOT setup.

B.1 PROOF OF THEOREM 1

Proof. In what follows, we analyze the optimizers of the objective $\min_{\mu\in P(Y)} G_{x,f}(\mu)$ introduced in (8). Let $\mu \in P(Y)$. We have:
$$G_{x,f}(\mu) = \int_Y c(x,y)\,d\mu(y) + \varepsilon\int_Y \log\frac{d\mu(y)}{dy}\,d\mu(y) - \int_Y f(y)\,d\mu(y)$$
$$= \varepsilon\int_Y\Big[\frac{c(x,y)-f(y)}{\varepsilon} + \log\frac{d\mu(y)}{dy} + \underbrace{\log Z(f,x)}_{\text{does not depend on } y} - \log Z(f,x)\Big]\,d\mu(y)$$
$$= -\varepsilon\log Z(f,x) + \varepsilon\int_Y\Big[\log\frac{d\mu(y)}{dy} - \underbrace{\Big(\frac{f(y)-c(x,y)}{\varepsilon} - \log Z(f,x)\Big)}_{=\log\frac{d\mu^f_x(y)}{dy}}\Big]\,d\mu(y)$$
$$= -\varepsilon\log Z(f,x) + \varepsilon\int_Y\Big[\log\frac{d\mu(y)}{dy} - \log\frac{d\mu^f_x(y)}{dy}\Big]\,d\mu(y) = -\varepsilon\log Z(f,x) + \varepsilon\,\mathrm{KL}\big(\mu\,\|\,\mu^f_x\big). \quad (28)$$
The last equality holds true thanks to the fact that $\mu^f_x \in P_{ac}(Y)$ and $\forall y\in Y:\ \frac{d\mu^f_x(y)}{dy} > 0$. This leads to the conclusion that the absolute continuity of $\mu$ ($\mu \ll \lambda$, where $\lambda$ is the Lebesgue measure on $Y$) is equivalent to the absolute continuity of $\mu$ w.r.t. $\mu^f_x$ ($\mu \ll \mu^f_x$). In particular, if $\mu \notin P_{ac}(Y)$, then the last equality in the derivations above reads as $+\infty = +\infty$. From (28) we conclude that
$$\mu^f_x = \arg\min_{\mu\in P(Y)} G_{x,f}(\mu).$$

B.2 PROOF OF THEOREM 2

Proof. Recall that we denote the optimal value of the weak dual objective (9) by $F^{w,*}_{CEOT}$. It equals $\mathrm{EOT}_{c,\varepsilon}(P, Q)$ given by formula (4), thanks to the strong duality, i.e.,
$$F^{w,*}_{CEOT} = \int_{X\times Y} c(x,y)\,d\pi^*(x,y) - \varepsilon\int_X H(\pi^*(y|x))\,dP(x). \quad (29)$$
For a potential $f \in C(Y)$, our Theorem 1 yields:
$$F^{w}_{CEOT}(f) = \int_X G_{x,f}(\mu^f_x)\,dP(x) + \int_Y f(y)\,dQ(y) \stackrel{\text{Eq. }(14)}{=} -\varepsilon\int_X \log Z(f,x)\,dP(x) + \int_Y f(y)\,dQ(y). \quad (30)$$
In what follows, we combine (29) and (30):
$$F^{w,*}_{CEOT} - F^{w}_{CEOT}(f) = \int_{X\times Y} c(x,y)\,d\pi^*(x,y) - \varepsilon\int_X H(\pi^*(y|x))\,dP(x) + \varepsilon\int_X \log Z(f,x)\underbrace{dP(x)}_{=d\pi^*(x)} - \int_Y f(y)\underbrace{dQ(y)}_{=d\pi^*(y)}$$
$$= \int_{X\times Y}\big(c(x,y) - f(y)\big)\,d\pi^*(x,y) - \varepsilon\int_X H(\pi^*(y|x))\,dP(x) + \varepsilon\int_X \log Z(f,x)\,d\pi^*(x)$$
$$= -\varepsilon\int_{X\times Y}\underbrace{\frac{f(y)-c(x,y)}{\varepsilon}}_{=\log\exp\frac{f(y)-c(x,y)}{\varepsilon}}\,d\pi^*(x,y) + \varepsilon\int_{X\times Y}\log Z(f,x)\,d\pi^*(x,y) - \varepsilon\int_X H(\pi^*(y|x))\,dP(x)$$
$$= -\varepsilon\int_{X\times Y}\log\underbrace{\Big[\frac{1}{Z(f,x)}\exp\Big(\frac{f(y)-c(x,y)}{\varepsilon}\Big)\Big]}_{=\frac{d\mu^f_x(y)}{dy}}\,d\pi^*(x,y) - \varepsilon\int_X H(\pi^*(y|x))\,dP(x)$$
$$= -\varepsilon\int_X\Big[\int_Y\log\frac{d\mu^f_x(y)}{dy}\,d\pi^*(y|x)\Big]\underbrace{d\pi^*(x)}_{=dP(x)} + \varepsilon\int_X\Big[\int_Y\log\frac{d\pi^*(y|x)}{dy}\,d\pi^*(y|x)\Big]\,dP(x)$$
$$= \varepsilon\int_X\Big[\int_Y\Big(\log\frac{d\pi^*(y|x)}{dy} - \log\frac{d\mu^f_x(y)}{dy}\Big)\,d\pi^*(y|x)\Big]\,dP(x) = \varepsilon\int_X \mathrm{KL}\big(\pi^*(\cdot|x)\,\|\,\mu^f_x\big)\,dP(x).$$
In the last transition of the derivations above, we use the equality
$$\int_Y\Big(\log\frac{d\pi^*(y|x)}{dy} - \log\frac{d\mu^f_x(y)}{dy}\Big)\,d\pi^*(y|x) = \mathrm{KL}\big(\pi^*(\cdot|x)\,\|\,\mu^f_x\big).$$
It holds true for the same reasons as explained in the proof of Theorem 1, after formula (28). We are left to show that $\int_X \mathrm{KL}(\pi^*(\cdot|x)\,\|\,\mu^f_x)\,dP(x) = \mathrm{KL}(\pi^*\,\|\,\pi^f)$. Since $d\pi^*(x,y) = d\pi^*(y|x)\,dP(x)$ and $d\pi^f(x,y) = d\mu^f_x(y)\,dP(x)$, we derive:
$$\int_X \mathrm{KL}\big(\pi^*(\cdot|x)\,\|\,\mu^f_x\big)\,dP(x) = \int_{X\times Y}\log\frac{d\pi^*(y|x)}{d\mu^f_x(y)}\,d\pi^*(y|x)\,dP(x) = \int_{X\times Y}\log\frac{d\pi^*(y|x)\,dP(x)}{d\mu^f_x(y)\,dP(x)}\,d\pi^*(x,y)$$
$$= \int_{X\times Y}\log\frac{d\pi^*(x,y)}{d\pi^f(x,y)}\,d\pi^*(x,y) = \mathrm{KL}\big(\pi^*\,\|\,\pi^f\big),$$
which completes the proof.

B.3 PROOF OF THEOREM 3

Proof. The direct derivations for (18) read as follows:
$$\nabla_\theta L(\theta) = -\varepsilon\int_X \nabla_\theta\log Z(f_\theta, x)\,dP(x) + \int_Y \nabla_\theta f_\theta(y)\,dQ(y)$$
$$= -\varepsilon\int_X \frac{1}{Z(f_\theta,x)}\int_Y \exp\Big(\frac{f_\theta(y)-c(x,y)}{\varepsilon}\Big)\frac{1}{\varepsilon}\nabla_\theta f_\theta(y)\,dy\,dP(x) + \int_Y \nabla_\theta f_\theta(y)\,dQ(y)$$
$$= -\int_X\int_Y \nabla_\theta f_\theta(y)\underbrace{\frac{1}{Z(f_\theta,x)}\exp\Big(\frac{f_\theta(y)-c(x,y)}{\varepsilon}\Big)\,dy}_{=d\mu^{f_\theta}_x(y)}\,dP(x) + \int_Y \nabla_\theta f_\theta(y)\,dQ(y)$$
$$= -\int_X\int_Y \nabla_\theta f_\theta(y)\,d\mu^{f_\theta}_x(y)\,dP(x) + \int_Y \nabla_\theta f_\theta(y)\,dQ(y),$$
which finishes the proof.
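For readers who want to connect this formula to an implementation, below is a minimal PyTorch sketch of the resulting stochastic gradient estimator. It is our illustration, not the paper's Algorithm 1: the helper names (ula_sample, egeot_loss), the pure-noise chain initialization and the default hyperparameters are placeholders, and practical details from Appendix D (replay buffer, noise added to targets, step-size schedules) are omitted. The surrogate loss is built so that its autograd gradient equals the negation of the gradient derived above; minimizing it therefore performs gradient ascent on the weak dual objective, with MCMC samples treated as constants, as is standard in EBM training.

```python
import torch

def ula_sample(f_theta, cost, x, y0, eps, K, eta):
    """Unadjusted Langevin chain targeting mu^{f_theta}_x(y) ~ exp((f_theta(y) - c(x, y)) / eps).

    f_theta: module mapping a batch of y's to a batch of scalar potential values,
    cost:    callable c(x, y) returning one cost per (x, y) pair in the batch.
    """
    y = y0.clone()
    for _ in range(K):
        y = y.detach().requires_grad_(True)
        # conditional energy E(y) = (c(x, y) - f_theta(y)) / eps
        energy = ((cost(x, y) - f_theta(y).squeeze(-1)) / eps).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = y - eta * grad + (2.0 * eta) ** 0.5 * torch.randn_like(y)
    return y.detach()

def egeot_loss(f_theta, cost, x_batch, y_data, eps, K=100, eta=0.05, sigma0=1.0):
    """Surrogate loss whose gradient is E_{x~P, y~mu^{f_theta}_x}[grad f] - E_{y~Q}[grad f]."""
    y0 = sigma0 * torch.randn(x_batch.shape[0], y_data.shape[1], device=x_batch.device)
    y_model = ula_sample(f_theta, cost, x_batch, y0, eps, K, eta)  # approximate mu^{f_theta}_x samples
    return f_theta(y_model).mean() - f_theta(y_data).mean()
```

For instance, for the quadratic cost one may pass `cost = lambda x, y: 0.5 * ((x - y) ** 2).sum(dim=1)`.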
B.4 PROOF OF THEOREM 4

To begin with, for the ease of reading and comprehension, we recall the statistical learning setup from §3.3. In short, $X_N = \{x_n\}_{n=1}^N$ and $Y_M = \{y_m\}_{m=1}^M$ are empirical samples from $P$ and $Q$, respectively, and $F \subset C(Y)$ is a function class in which we are looking for a potential $\hat f$ which optimizes the empirical weak dual objective
$$\hat F^{w}_{CEOT}(f) = -\varepsilon\frac{1}{N}\sum_{n=1}^N \log Z(f, x_n) + \frac{1}{M}\sum_{m=1}^M f(y_m),$$
i.e., $\hat f = \arg\max_{f\in F}\hat F^{w}_{CEOT}(f)$. Additionally, we introduce $f_F \stackrel{\text{def}}{=} \arg\max_{f\in F} F^{w}_{CEOT}(f)$. It optimizes the weak dual objective (9) in the function class under consideration. Following statistical generalization practices, our analysis utilizes the Rademacher complexity. We recall the definition below.

Definition 1 (Rademacher complexity $R_N(F, \mu)$). Let $F$ be a function class and $\mu$ be a distribution. The Rademacher complexity of $F$ with respect to $\mu$ of sample size $N$ is
$$R_N(F, \mu) \stackrel{\text{def}}{=} \frac{1}{N}\,\mathbb{E}\sup_{f\in F}\sum_{n=1}^N f(x_n)\sigma_n,$$
where $\{x_n\}_{n=1}^N \sim \mu$ are mutually independent, $\{\sigma_n\}_{n=1}^N$ are mutually independent Rademacher random variables, i.e., $\mathrm{Prob}\{\sigma_n = 1\} = \mathrm{Prob}\{\sigma_n = -1\} = 0.5$, and the expectation is taken with respect to both $\{x_n\}_{n=1}^N$ and $\{\sigma_n\}_{n=1}^N$.

Now we are ready to verify our Theorem 4.

Proof. Our Theorem 2 yields that $\varepsilon\,\mathrm{KL}(\pi^*\,\|\,\pi^{\hat f}) = F^{w,*}_{CEOT} - F^{w}_{CEOT}(\hat f)$. Below we upper-bound the right-hand side of the equation above:
$$F^{w,*}_{CEOT} - F^{w}_{CEOT}(\hat f) = \big[F^{w,*}_{CEOT} - F^{w}_{CEOT}(f_F)\big] + \big[F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big] + \big[\hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f)\big]$$
$$\leq \big[F^{w,*}_{CEOT} - F^{w}_{CEOT}(f_F)\big] \quad (31)$$
$$+\ \big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| \quad (32)$$
$$+\ \big|\hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f)\big|. \quad (33)$$

Analysis of (31). Equation (31) relates to the approximation error and depends on the richness of the class $F$. The detailed investigation of how the approximation error could be treated in the context of our Energy-guided EOT setup and, more generally, EBMs is an interesting avenue for future work.

Analysis of (32). Similar to (Taghvaei & Jalali, 2019, Theorem 3.4), we estimate (32) using Rademacher complexity bounds. First, we need the following technical Lemma.

Lemma 1. For each particular pair of samples $X_N$, $Y_M$, there exists $\tilde f \in F$ such that
$$\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| \leq \big|F^{w}_{CEOT}(\tilde f) - \hat F^{w}_{CEOT}(\tilde f)\big|.$$

Proof. Consider $\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big|$. There are two possibilities.

1. $F^{w}_{CEOT}(f_F) \geq \hat F^{w}_{CEOT}(\hat f)$. Since $\forall f\in F:\ \hat F^{w}_{CEOT}(\hat f) \geq \hat F^{w}_{CEOT}(f)$ and, in particular, $\hat F^{w}_{CEOT}(\hat f) \geq \hat F^{w}_{CEOT}(f_F)$, it holds:
$$\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| = F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f) \leq F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(f_F) \leq \big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(f_F)\big|,$$
i.e., we set $\tilde f = f_F$.

2. $\hat F^{w}_{CEOT}(\hat f) > F^{w}_{CEOT}(f_F)$. Similar to the previous case, $\forall f\in F:\ F^{w}_{CEOT}(f_F) \geq F^{w}_{CEOT}(f)$ and, in particular, $F^{w}_{CEOT}(f_F) \geq F^{w}_{CEOT}(\hat f)$. Analogous derivations read as:
$$\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| = \hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(f_F) \leq \hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f) \leq \big|F^{w}_{CEOT}(\hat f) - \hat F^{w}_{CEOT}(\hat f)\big|,$$
i.e., we set $\tilde f = \hat f$, which finishes the proof of the Lemma.

Now we continue the proof of our Theorem. For particular samples $X_N$ and $Y_M$, consider $\tilde f$ from Lemma 1.
We derive:
$$\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| \leq \big|F^{w}_{CEOT}(\tilde f) - \hat F^{w}_{CEOT}(\tilde f)\big|$$
$$= \Big|-\varepsilon\int_X \log Z(\tilde f, x)\,dP(x) + \int_Y \tilde f(y)\,dQ(y) + \varepsilon\frac{1}{N}\sum_{n=1}^N \log Z(\tilde f, x_n) - \frac{1}{M}\sum_{m=1}^M \tilde f(y_m)\Big|$$
$$\leq \Big|-\varepsilon\int_X \log Z(\tilde f, x)\,dP(x) + \varepsilon\frac{1}{N}\sum_{n=1}^N \log Z(\tilde f, x_n)\Big| + \Big|\int_Y \tilde f(y)\,dQ(y) - \frac{1}{M}\sum_{m=1}^M \tilde f(y_m)\Big|$$
$$= \Big|\int_X \tilde f^{CEOT}(x)\,dP(x) - \frac{1}{N}\sum_{n=1}^N \tilde f^{CEOT}(x_n)\Big| + \Big|\int_Y \tilde f(y)\,dQ(y) - \frac{1}{M}\sum_{m=1}^M \tilde f(y_m)\Big|$$
$$\leq \sup_{h\in F_{CEOT}}\Big|\int_X h(x)\,dP(x) - \frac{1}{N}\sum_{n=1}^N h(x_n)\Big| + \sup_{f\in F}\Big|\int_Y f(y)\,dQ(y) - \frac{1}{M}\sum_{m=1}^M f(y_m)\Big|.$$
Recall that $F_{CEOT} = \{f^{CEOT} : f\in F\} = \{-\varepsilon\log Z(f, \cdot) : f\in F\}$. Thanks to the well-known Rademacher bound (Shalev-Shwartz & Ben-David, 2014, Lemma 26.2), it holds:
$$\mathbb{E}\sup_{h\in F_{CEOT}}\Big|\int_X h(x)\,dP(x) - \frac{1}{N}\sum_{n=1}^N h(x_n)\Big| \leq 2R_N(F_{CEOT}, P), \qquad \mathbb{E}\sup_{f\in F}\Big|\int_Y f(y)\,dQ(y) - \frac{1}{M}\sum_{m=1}^M f(y_m)\Big| \leq 2R_M(F, Q),$$
where the expectations are taken with respect to the samples $X_N$ and $Y_M$. Combining the results above, we conclude:
$$\mathbb{E}\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| \leq 2R_N(F_{CEOT}, P) + 2R_M(F, Q). \quad (34)$$

Analysis of (33). Similar to the previous case, we obtain the inequality
$$\big|\hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f)\big| \leq \sup_{h\in F_{CEOT}}\Big|\int_X h(x)\,dP(x) - \frac{1}{N}\sum_{n=1}^N h(x_n)\Big| + \sup_{f\in F}\Big|\int_Y f(y)\,dQ(y) - \frac{1}{M}\sum_{m=1}^M f(y_m)\Big|,$$
which yields
$$\mathbb{E}\big|\hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f)\big| \leq 2R_N(F_{CEOT}, P) + 2R_M(F, Q). \quad (35)$$
By gathering equations (31), (34), (35), we prove the theorem:
$$\mathbb{E}\,\mathrm{KL}\big(\pi^*\,\|\,\pi^{\hat f}\big) = \varepsilon^{-1}\mathbb{E}\big\{F^{w,*}_{CEOT} - F^{w}_{CEOT}(\hat f)\big\}$$
$$\leq \varepsilon^{-1}\big[F^{w,*}_{CEOT} - F^{w}_{CEOT}(f_F)\big] + \varepsilon^{-1}\mathbb{E}\big|F^{w}_{CEOT}(f_F) - \hat F^{w}_{CEOT}(\hat f)\big| + \varepsilon^{-1}\mathbb{E}\big|\hat F^{w}_{CEOT}(\hat f) - F^{w}_{CEOT}(\hat f)\big|$$
$$\leq \varepsilon^{-1}\big[F^{w,*}_{CEOT} - F^{w}_{CEOT}(f_F)\big] + \varepsilon^{-1}\big(4R_N(F_{CEOT}, P) + 4R_M(F, Q)\big).$$

C EXTENDED EXPERIMENTS

C.1 COLORED MNIST

In this subsection, we consider the Colored MNIST setup (Gushchin et al., 2023, §5.3). Following (Gushchin et al., 2023), we set the source and target distributions P and Q to be colored handwritten images of digits "2" and "3", respectively. The entropic regularization coefficients are in the range ε = 0.01, 0.1, 1.

ε             0.01    0.1     1
LPIPS (VGG)   0.043   0.063   0.11

Table 3: LPIPS variance of our method for Colored MNIST 2→3 transfer.

The qualitative results of learning our model directly in the data space are presented in Figure 4. As we can see, our learned EOT plans successfully preserve the color and geometry of the transformed images. The generated images (Figures 4b, 4c, 4d) are slightly noised since we add noise to the target images during training for stability. For the quantitative analysis and comparison with competitive methods, we borrow the results on the Colored MNIST transfer problem for (Bortoli et al., 2021), (Daniels et al., 2021) and (Gushchin et al., 2023) from (Gushchin et al., 2023). Additionally, we run the code for the recent (Shi et al., 2023) on our own. The methods generally work for different ε due to their principles, and we choose ε = 1 as an admissible entropic regularization power for all methods except (Daniels et al., 2021), which struggles for small ε, see the discussion in §4.2. For it, we choose ε = 25. The obtained FID metrics are reported in Table 4. For the qualitative performance of the baselines, see (Gushchin et al., 2023, Fig. 2). Besides, we provide Table 3 with the LPIPS metric to show that the diversity of our method increases with ε.

Method   Ours   (Bortoli et al., 2021)   (Shi et al., 2023)   (Gushchin et al., 2023)   (Daniels et al., 2021)
FID      109    93                       91.3                 6.3                       14.7 (ε = 25)

Table 4: Baselines' FID for Colored MNIST 2→3 transfer, ε = 1.

We honestly state that the FID of our approach is not good. One reason is that the default Langevin dynamics produces slightly noisy samples, and FID is known to react badly to any noise. Secondly, we emphasize that we adapt the simplest long-run EBMs with a persistent replay buffer (Nijkamp et al., 2020) for this experiment, see Appendix D.4 for the details. We leave the application of modern EBMs which can generate sharp data (Du & Mordatch, 2019; Du et al., 2021) for future research.

Figure 4: Qualitative performance of Energy-guided EOT on Colored MNIST. Panels: (b) $y \sim \hat\pi(\cdot|x)$, ε = 0.01; (c) $y \sim \hat\pi(\cdot|x)$, ε = 0.1; (d) $y \sim \hat\pi(\cdot|x)$, ε = 1.
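For completeness, here is a sketch of how the conditional diversity numbers in Table 3 can be estimated with the lpips package that wraps the official PerceptualSimilarity code (the exact protocol is given in Appendix D.4). The sample_plan handle is a placeholder for a routine drawing n samples $y \sim \hat\pi(\cdot|x)$; images are assumed to be scaled to [−1, 1], as LPIPS expects.

```python
import itertools
import torch
import lpips

def conditional_lpips_diversity(sample_plan, x_batch, n_samples=18, device="cuda"):
    """Average pairwise LPIPS (VGG) between samples of pi_hat(.|x), averaged over source images x.

    sample_plan(x, n) is a placeholder returning n images y ~ pi_hat(.|x) in [-1, 1],
    shaped (n, 3, H, W).
    """
    metric = lpips.LPIPS(net="vgg").to(device)
    scores = []
    for x in x_batch:
        ys = sample_plan(x, n_samples).to(device)
        # pairwise LPIPS over all C(n_samples, 2) pairs of conditional samples
        pair_scores = [metric(ys[i:i + 1], ys[j:j + 1]).item()
                       for i, j in itertools.combinations(range(n_samples), 2)]
        scores.append(sum(pair_scores) / len(pair_scores))
    return sum(scores) / len(scores)  # higher value = more diverse conditional plan
```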
C.2 EXTENDED HIGH-DIMENSIONAL UNPAIRED IMAGE-TO-IMAGE TRANSLATION

In this section, we provide additional qualitative results and comparisons for our considered high-dimensional I2I setup. In Figure 5, we show uncurated samples from our approach learned on the 512×512 AFHQ Cat→Dog and Wild→Dog image transfer problems. To compare our visual results with alternatives, we demonstrate the pictures generated by the (Zhao et al., 2022) and (Daniels et al., 2021) solvers, see Figure 6. The former demonstrates SOTA results, see Table 2, but has no relation to OT. The latter is the closest approach to ours. For (Daniels et al., 2021), we trained their algorithm in the same setup as ours, i.e., in the latent space of the StyleGAN and with the transport cost $c(x, y) = \frac{1}{2}\|x - G(y)\|_2^2$, see §5.3. We found that their method works only for ε = 10000, yielding unconditional generation. This is in concordance with our findings about the approach, see the discussion in §4.2, and with the insights from the original work, see (Daniels et al., 2021, §5.1).

D EXPERIMENTAL DETAILS

General details. For the first two experiments, we take advantage of a replay buffer B constructed as described in (Du & Mordatch, 2019). During training, the ULA algorithm is initialized by samples from B with probability p = 0.95 and from Gaussian noise with probability 1 − p = 0.05. For the last two image-data experiments, we also use a similar replay buffer but with p = 1, i.e., we do not update B with short-run samples.

Figure 5: Uncurated image-to-image translation by our method in the latent space of StyleGAN. (a) 512×512 AFHQ Cat→Dog; (b) 512×512 AFHQ Wild→Dog.

Figure 6: Image-to-image Cat→Dog translation by alternative methods. (a) 256×256 samples obtained with (Zhao et al., 2022); (b) 512×512 samples obtained with (Daniels et al., 2021), ε = 10000.

D.1 TOY 2D DETAILS

We parameterize the potential $f_\theta$ as an MLP with two hidden layers and LeakyReLU (negative_slope = 0.2) as the activation function. Each hidden layer has 256 neurons. The hyperparameters of Algorithm 1 are as follows: K = 100, σ0 = 1, N = 1024; see the meaning of each particular variable in the algorithm listing. The Langevin discretization steps are η = 0.05 for ε = 0.1 and η = 0.005 for ε = 0.001. These values are chosen for reasons of training stability.

Computation complexity. The experiment was conducted on a single GTX 1080 Ti and took approximately two hours for each entropy regularization parameter value.

D.2 GAUSSIAN-TO-GAUSSIAN DETAILS

For the source and target distributions, we choose P = N(0, Σ_X) and Q = N(0, Σ_Y) with Σ_X and Σ_Y chosen at random. For reproducibility, these parameters are provided in the code. The details of the baseline methods (Table 1) are given in Appendix E. For each particular parameter set (ε, D), our trained potential $f_\theta$ is an MLP with three hidden layers, 512 neurons each, and ReLU activations. The same architectural setup is chosen for Gushchin et al. (Gushchin et al., 2023) for a fair comparison. The hyperparameters of Algorithm 1 are the same for each (ε, D): K = 100, σ0 = 1, N = 1024, η = 0.1. At the inference stage, we run ULA steps for K_test = 700 iterations. The only parameter which we choose to be dependent on (ε, D) is the learning rate, since it affects the stability of the training.
The particular learning rates used in our experiments are given in Table 5. More specific training peculiarities can be found in our code.

Bures-Wasserstein UVP metric. The $\mathrm{BW}_2^2$-UVP metric (Korotin et al., 2021b, Eq. 18) is the Wasserstein-2 distance between the distributions π1 and π2 coarsened to Gaussians, normalized by the variance of the distribution π2:
$$\mathrm{BW}_2^2\text{-UVP}(\pi_1, \pi_2) \stackrel{\text{def}}{=} 100\% \cdot \frac{1}{\tfrac{1}{2}\mathrm{Var}(\pi_2)}\, W_2^2\big(N(\mu_{\pi_1}, \Sigma_{\pi_1}),\, N(\mu_{\pi_2}, \Sigma_{\pi_2})\big).$$
In our experiment, π2 is the optimal plan π*, which is known to be Gaussian, and π1 is the learned plan $\hat\pi$, whose mean and covariance are estimated from samples.

Computation complexity. Each experiment with a particular (ε, D) takes approximately 12 hours on a single GTX 1080 Ti.

D      2                          16                         64                         128
ε      0.1     1       10         0.1     1       10         0.1     1       10         0.1     1       10
lr     5e-7    4e-7    2e-7       2e-5    4e-6    1e-5       7e-5    4e-5    2e-5       2e-4    5e-5    5e-5

Table 5: Learning rates for the Gaussian-to-Gaussian experiment; ε = 0.1, 1, 10 and D = 2, 16, 64, 128.

D.3 HIGH-DIMENSIONAL UNPAIRED IMAGE-TO-IMAGE TRANSLATION DETAILS

General details. In this experiment, we learn EOT between a source distribution of images $P \in P(\mathbb{R}^{3\times 512\times 512})$ and $Q = N(0, I_{512}) \in P(\mathbb{R}^{512})$, which is the latent distribution of the pretrained StyleGAN G. We use the non-Euclidean cost $c(x, y) = \frac{1}{2}\|x - G(y)\|_2^2$. Below we describe the primary idea behind this choice. Consider the pushforward distribution $Q_{\mathrm{ambi}} \stackrel{\text{def}}{=} G\sharp Q$. In our case, it is the parametric distribution of AFHQ Dogs. Thanks to our specific cost function, the learned optimal conditional plans $\pi^{\hat f}(\cdot|x)$ between P and Q help to approximate (given that ε is sufficiently small) the standard Euclidean Optimal Transport between P and $Q_{\mathrm{ambi}}$, which is the motivating problem of several OT works (Makkuva et al., 2020; Korotin et al., 2021a). The corresponding (stochastic) mapping is given by the pushforward distributions $G\sharp\pi^{\hat f}(\cdot|x)$. In practice, we sample from $\pi^{\hat f}(\cdot|x)$ using our cost-guided MCMC and then pass the obtained samples through G. Note that our setup seems to be the first theoretically-advised attempt to leverage $W_2^2$ OT between 512×512 images.

Technical details. The AFHQ dataset is taken from the StarGAN v2 (Choi et al., 2020) github: https://github.com/clovaai/stargan-v2. The dataset includes three groups of high-quality 512×512 images: Dogs, Cats and Wilds (wildlife animals). The latter two groups are used as the source distributions P. The pretrained (on AFHQ Dogs) StyleGAN2-ADA (Karras et al., 2020) is taken from the official PyTorch implementation: https://github.com/NVlabs/stylegan2-ada-pytorch. As the potential $f_\theta$ which operates in the 512-dimensional latent space of the StyleGAN model, we choose a fully-connected MLP with ReLU activations and three hidden layers with 1024, 512 and 256 neurons, respectively. The training hyperparameters are: K = 100, η = 0.008, σ0 = 1.0, N = 128. For both Cat→Dog and Wild→Dog experiments, we train our model for 11 epochs with a learning rate of $10^{-4}$ which starts monotonically decreasing to $10^{-5}$ after the fifth epoch. At inference, we initialize the sampling procedure (in the latent space) with standard Normal noise. Then we repeat Langevin steps for $K^{\mathrm{init}}_{\mathrm{test}} = 1000$ iterations with η = 0.008. After that, additional $K^{\mathrm{refine}}_{\mathrm{test}} = 1000$ steps are performed with η = 0.0005. The obtained latent codes are then passed through the StyleGAN generator, yielding images from the AFHQ Dog domain.

Computation complexity. The training takes approximately a day on 4 A100 GPUs. The inference (as described above) takes about 20 minutes per batch on a single A100 GPU.
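A minimal sketch of this two-stage latent-space inference is given below. It assumes a trained potential f_theta and a pretrained generator callable as G(y) on a batch of 512-dimensional latents; the real StyleGAN2-ADA generator takes extra arguments (class labels, truncation) that we omit here, and the helper names are our own illustration rather than the released code.

```python
import torch

def latent_ula(f_theta, G, x, eps, K, eta, y=None):
    """ULA in the 512-dim latent space targeting
    mu^{f_theta}_x(y) ~ exp((f_theta(y) - 0.5 * ||x - G(y)||^2) / eps)."""
    if y is None:
        y = torch.randn(x.shape[0], 512, device=x.device)  # start from Q = N(0, I_512)
    for _ in range(K):
        y = y.detach().requires_grad_(True)
        cost = 0.5 * ((x - G(y)) ** 2).flatten(1).sum(dim=1)      # c(x, y) = 0.5 ||x - G(y)||^2
        energy = ((cost - f_theta(y).squeeze(-1)) / eps).sum()    # cost-guided conditional energy
        (grad,) = torch.autograd.grad(energy, y)                  # backprops through G and f_theta
        y = y - eta * grad + (2.0 * eta) ** 0.5 * torch.randn_like(y)
    return y.detach()

def translate(f_theta, G, x, eps):
    # Stage 1: 1000 Langevin iterations with eta = 0.008, then
    # Stage 2: 1000 refinement iterations with eta = 0.0005 (values from Appendix D.3).
    y = latent_ula(f_theta, G, x, eps, K=1000, eta=0.008)
    y = latent_ula(f_theta, G, x, eps, K=1000, eta=0.0005, y=y)
    with torch.no_grad():
        return G(y)  # decode translated latents into 512x512 dog images
```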
D.4 COLORED MNIST DETAILS

For generating the Colored MNIST dataset, we make use of the code provided by the authors of (Gushchin et al., 2023). The dataset consists of colored images of the digit "2" (≈7K) and colored images of the digit "3" (≈7K). The images are scaled to resolution 32×32. To solve the problem in view, we adapt the base EBM code from (Nijkamp et al., 2020): https://github.com/point0bar1/ebm-anatomy. We do nothing but embed the computation of the cost function gradient into the Langevin steps, leaving all the remaining technical details unchanged. In particular, we utilize simple CNNs with LeakyReLU (negative_slope = 0.05) activations as our learned potential $f_\theta$. We pick batch size N = 256 and initialize the persistent replay buffer at random using the Uniform$[-1, 1]^{3\times 32\times 32}$ distribution. We use the Adam optimizer with a learning rate gradually decreasing from $3\cdot 10^{-5}$ to $10^{-5}$. The reported images (Figure 4) correspond to approximately 7000 training iterations. For each entropic coefficient ε = 0.01, 0.1, 1, we run 6 experiments with the parameters given in Table 6. For training stability, we add Gaussian noise N(0, 9η) to the target samples when computing the loss estimate $\hat L$ in Algorithm 1.

ε    0.01                  0.1                   1
K    {500, 1000, 2000}     {500, 1000, 2000}     {500, 1000, 2000}
η    {0.1, 0.3}            {0.1, 0.3}            {0.2, 0.3}

Table 6: Training parameters for Colored MNIST; ε = 0.01, 0.1, 1.

At the inference stage, we initialize the MCMC chains with source data samples. The ULA steps are repeated for K_test = 2000 iterations with the same Langevin discretization step size η as the one used at the training stage. For each ε, the reported images (Figure 4) are picked for the parameter set (K, η) which we found to be the best in terms of qualitative performance. For the LPIPS calculation, we use the official code: https://github.com/richzhang/PerceptualSimilarity, where we pick the VGG backbone for calculating LPIPS features. To calculate the resulting metric, we sample 18 target images from $\pi^{\hat f}(\cdot|x)$ for each test source image x. For every pair of these 18 images, we compute LPIPS and report the average value (over the generated target images and over the source ones).

Computation complexity. It takes approximately one day on one V100 GPU to complete an image-data experiment for each set of parameters.

E DETAILS OF THE BASELINE METHODS

In this section, we discuss the details of the baseline methods with which we compare our method on the Gaussian-to-Gaussian transformation problem.

Daniels et al. (Daniels et al., 2021). We use the code from the authors' repository https://github.com/mdnls/scones-synthetic for their evaluation in the Gaussian case. We employ their configuration blob/main/config.py.

Seguy et al. (Seguy et al., 2018). We use the part of the SCONES code corresponding to learning dual OT potentials, blob/main/cpat.py, and the barycentric projection, blob/main/bproj.py, in the Gaussian case with configuration blob/main/config.py.

Chen et al. (Joint) (Chen et al., 2022). We utilize the official code from https://github.com/ghliu/SB-FBSDE with their configuration blob/main/configs/default_checkerboard_config.py for the checkerboard-to-noise toy experiment, changing the number of steps of the dynamics from 100 to 200. Since their hyperparameters are developed for their 2-dimensional experiments, we increase the number of iterations for dimensions 16, 64 and 128 to 15 000.

Chen et al. (Alt) (Chen et al., 2022).
We also take the code from the same repository as above. We base our configuration on the authors' one (blob/main/configs/default_moon_to_spiral_config.py) for the moon-to-spiral experiment. As earlier, we increase the number of steps of the dynamics up to 200. Also, we change the number of training epochs for dimensions 16, 64 and 128 to 2, 4 and 8, respectively.

De Bortoli et al. (Bortoli et al., 2021). We utilize the official code from https://github.com/JTT94/diffusion_schrodinger_bridge with their configuration blob/main/conf/dataset/2d.yaml for toy problems. We increase the number of steps of the dynamics to 200 and the number of steps of the IPF procedure for dimensions 16, 64 and 128 to 30, 40 and 60, respectively.

Vargas et al. (Vargas et al., 2021). We use the official code from https://github.com/franciscovargas/GP_Sinkhorn with hyperparameters from blob/main/notebooks/2D Toy Data/2d_examples.ipynb. We set the number of steps to 200. As earlier, we increase the number of steps of the IPF procedure for dimensions 16, 64 and 128 to 1000, 3500 and 5000, respectively.

Vargas et al. (Vargas et al., 2021). We also tested the official code from https://github.com/franciscovargas/GP_Sinkhorn with a neural network used instead of Gaussian processes, as for ENOT. We use N = 200 discretization steps, as for the other SB solvers, 5000 IPF iterations, and 512 samples from the distributions P0 and P1 in each of them. We use the Adam optimizer with lr = $10^{-4}$ for optimization.

Gushchin et al. (Gushchin et al., 2023). We use the code provided by the authors. In our experiments, we use exactly the same hyperparameters for this setup as the authors (Gushchin et al., 2023, Appendix B), except for the number of discretization steps N, which we set to 200, as for the other Schrödinger Bridge based methods.

F EXTENDED DISCUSSION OF LIMITATIONS

In general, the main limitation of our approach is the usage of MCMC. This procedure is time-consuming and requires adjusting several hyperparameters. Moreover, in practice, it may not always converge to the desired distribution, which introduces additional biases. Other practical peculiarities of our proposed algorithm arise due to its connection to the EBM learning procedure. It is known that EBMs for generative modelling can be trained in two different optimization regimes: short-run (non-convergent) training and long-run (convergent) training (Nijkamp et al., 2020). In the first regime, the learned potential does not necessarily represent the energy function of the learned distribution. Because of this, the short-run mode may not always be suitable for Energy-guided EOT, since it seems crucial for $f_\theta$ to represent the true component of the conditional energy potentials $E_{\mu^{f_\theta}_x}(y) = \frac{c(x,y) - f_\theta(y)}{\varepsilon}$. In particular, for our Colored MNIST experiment, we found the short-run regime to be unstable and utilize exclusively the long-run mode. At the same time, for the moderate-dimensional Toy 2D and Gaussian-to-Gaussian experiments, as well as for the latent-space high-dimensional I2I setup, non-convergent training was successful.
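To make the distinction between the two regimes concrete, the sketch below shows the place where they differ in an implementation-level view: how the Langevin chains are initialized before each parameter update. It is schematic; the buffer handling follows (Du & Mordatch, 2019; Nijkamp et al., 2020) only loosely, and the function and argument names are our own illustration.

```python
import torch

def init_chains(regime, buffer, batch_size, dim, sigma0=1.0, p_buffer=0.95):
    """Schematic chain initialization for the two EBM training regimes discussed above.

    buffer: a tensor of previously sampled points (the persistent replay buffer), possibly empty.
    """
    if regime == "short_run":
        # Non-convergent training: every update starts fresh chains from noise, so the
        # learned potential need not converge to the true conditional energy.
        return sigma0 * torch.randn(batch_size, dim)
    # Long-run (convergent) training: chains are resumed from the replay buffer with
    # probability p_buffer, which lets the sampler get close to mu^{f_theta}_x over training.
    y0 = sigma0 * torch.randn(batch_size, dim)
    if len(buffer) >= batch_size:
        idx = torch.randint(len(buffer), (batch_size,))
        resumed = buffer[idx]
        mask = (torch.rand(batch_size, 1) < p_buffer).float()
        y0 = mask * resumed + (1.0 - mask) * y0
    return y0
```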