Entropic Gromov-Wasserstein between Gaussian Distributions

Khang Le*1, Dung Le*2, Huy Nguyen*3, Dat Do4, Tung Pham**3, Nhat Ho**1

*The first three authors contributed equally to this work. **Nhat Ho and Tung Pham contributed equally as co-last authors. 1University of Texas at Austin, 2École Polytechnique, 3VinAI Research, 4University of Michigan, Ann Arbor. Correspondence to: Nhat Ho, Tung Pham. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann's trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small. We further derive a closed-form expression for the covariance matrix of the barycenter.

1. Introduction

The recent advances in the computation of optimal transport (Cuturi, 2013; Bonneel et al., 2015; Kolouri et al., 2016; Altschuler et al., 2017; Dvurechensky et al., 2018; Lin et al., 2019; Deshpande et al., 2019; Nguyen et al., 2021a) have led to a surge of interest in using optimal transport for applications in machine learning and statistics. These applications include generative modeling (Arjovsky et al., 2017; Tolstikhin et al., 2018; Salimans et al., 2018; Genevay et al., 2018; Liutkus et al., 2019), unsupervised learning (Ho et al., 2017; 2019), domain adaptation (Courty et al., 2016; Damodaran et al., 2018; Le et al., 2021), mini-batch computation (Fatras et al., 2020; Nguyen et al., 2021b), and other applications (Solomon et al., 2016; Nguyen et al., 2021c). In these applications, optimal transport is used to quantify the discrepancy in locations and mass between two probability measures, which must lie in the same space. When the probability measures lie in different spaces, their locations are no longer comparable, and their inner structures become the main concern. In this case, the Gromov-Wasserstein distance is adopted to measure the discrepancy between their inner structures. The main idea of the Gromov-Wasserstein distance is to consider the transportation of similarity matrices of points, where each similarity matrix is computed within a single space. However, this formulation of the Gromov-Wasserstein distance is computationally expensive, as it requires solving a quadratic programming problem. To reduce the computational cost of the Gromov-Wasserstein distance, (Peyré et al., 2016) propose to regularize the objective function of Gromov-Wasserstein with the entropy of the transportation plan, which we refer to as entropic Gromov-Wasserstein. The entropic regularization idea had been used earlier in optimal transport, where it was shown to greatly improve the computation of optimal transport (Cuturi, 2013; Altschuler et al., 2017; Dvurechensky et al., 2018; Lin et al., 2019).
The improvement in approximation of Gromov-Wasserstein distance via the entropic regularization has led to an increasing interest of using that divergence to several applications, including deep generative models (Bunne et al., 2019), computer graphics (Solomon et al., 2016; Xu et al., 2019; Chen et al., 2020), and natural language processing (Alvarez-Melis & Jaakkola, 2018; Grave et al., 2019). When measures are not probability measures and unbalanced, i.e., they may have different total masses, unbalanced version of Gromov-Wasserstein, named unbalanced Gromov-Wasserstein, via the idea of unbalanced optimal transport (Chizat et al., 2018) had been introduced in the recent work of (Séjourné et al., 2020). The entropic unbalanced Gromov-Wasserstein has been used in robust machine learning applications (Séjourné et al., 2020). Despite their practicalities, theoretical understandings of entropic (unbalanced) Gromov-Wasserstein are still nascent. The recent work of (Salmona et al., 2021) studied the closedform expression of Gromov-Wasserstein between Gaussian distributions in different dimensions. However, to the best of our knowledge, the full theoretical analysis of entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions in different dimensions has remained poorly understood. Entropic Gromov-Wasserstein between Gaussian Distributions Our contribution. In this paper, we present a comprehensive study of the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions when inner product is considered as a cost function, which we refer to as entropic (unbalanced) inner product Gromov-Wasserstein (IGW). We also study its corresponding barycenter problem among multiple Gaussian distributions. Our work complements the works of (Gelbrich, 1990), (Janati et al., 2020) and (Salmona et al., 2021) when we give an explicit explanation of the effect of the entropic regularization on the OT and the geometric structure of an optimal transport plan between Gaussians. More particular, our contribution is four-fold and can be summarized as follows: 1. Balanced Gaussian measures: We first provide a closedform expression of the entropic inner product Gromov Wasserstein between Gaussian probability measures with zero means. We demonstrate that the expression depends mainly on eigenvalues of the covariance matrices of the two measures. Furthermore, an associate optimal transportation plan is shown to be also a zero-mean Gaussian measure. Our analysis hinges upon an application of von Neuman s trace inequality (Kristof, 1969; Horn & Johnson, 1991) for singular values of matrices. 2. Unbalanced Gaussian measures: Second, by relaxing marginal constraints via Kullback-Leibler (KL) divergence, we further study the entropic unbalanced IGW between two unbalanced Gaussian measures with zero means. The main challenge comes from the two KL terms that add another level of constraint in the objective function. To overcome that challenge, we show that the objective function can be broken down into several sub-problems that can be solved explicitly through some cubic and quadratic equations. That leads to an almost closed-form formulation of an optimal Gaussian transport plan of the entropic unbalanced IGW between the unbalanced Gaussian measures. 3. 
Barycenter problem: Finally, we investigate the (entropic) Gromov-Wasserstein barycenter problems with inner product cost, which we refer to as (entropic) IGW barycenter, among zero-mean balanced Gaussian measures. We prove that the barycenter of zero-mean Gaussian distributions is also a zero-mean Gaussian distribution with a diagonal covariance matrix. Reposing on that Gaussian barycenter, we can directly obtain a closed-form expression for the barycenter problem when the regularized parameter is sufficiently small, which is also the setting people widely use in practice. 4. Revisiting entropic optimal transport between Gaussian distributions. We note in passing that based on the new proof technique introduced for computing the entropic Gromov-Wasserstein between Gaussian distributions, we revisit the entropic Wasserstein between (unbalanced) Gaus- sian distributions in Appendix F and provide a simpler proof than that in (Janati et al., 2020) to derive the closed-form expression for that problem. Paper organization. The paper is organized as follows. In Section 2, we provide backgrounds for the (unbalanced) inner product Gromov-Wasserstein and its corresponding entropic version. Next, we establish a closed-form expression of the entropic IGW between two balanced Gaussian distributions in Section 3 while proving that the entropic unbalanced IGW between unbalanced Gaussian measures also has a closed form in Section 4. Subsequently, we present our study of the (entropic) IGW barycenter problem among multiple Gaussian distributions in Section 5. In Section 6, we carry out empirical studies to investigate the behavior of algorithms solving the entropic Gromov-Wasserstein before concluding with a few discussions in Section 7. Additional proofs and auxiliary results are presented in the supplementary material. Notation. We use the following notations throughout the paper. For a non-negative definite matrix A, if we do not specify the spectral decomposition of A, then generally, λa,i or λi(A) denotes the i-th largest eigenvalues of A. For random vectors X = (X1, . . . , Xm) and Y = (Y1, . . . , Yn), let Kxy be the covariance matrix between X and Y , which means that Kxy i,j = Cov(Xi, Yj). For square matrices A and B, we write A B or B A if A B is nonnegative definite. For a positive integer n, we denote by Idn the identity matrix of size n n while [n] stands for the set {1, 2, . . . , n}. For any real number a, [a]+ indicates max{a, 0}. Lastly, we denote by M+(X) the set of all positive measures in a space X and by N(γ, Σ) the Gaussian distribution in Rd with mean γ and covariance matrix Σ where d will be specified in each case. 2. Preliminaries In this section, we first provide background for the Gromov Wasserstein distance between two probability distributions and discuss its properties when using ℓ2-norm as a cost function. Then, we present the formulation of an inner product Gromov-Wasserstein problem between two (unbalanced) Gaussian measures with a discussion about its geometric properties. 2.1. Gromov-Wasserstein distance Let µ and ν be two probability measures on two Polish spaces (X, d X ) and (Y, d Y), and let c X : X X R and c Y : Y Y R be two measurable functions. 
Then, with p 1, the p-Gromov-Wasserstein distance between µ and Entropic Gromov-Wasserstein between Gaussian Distributions ν, denoted by GWp(µ, ν), is defined as follows: inf π Π(µ,ν) Eπ π c X (X, X ) c Y(Y, Y ) p 1 where (X, Y ) and (X , Y ) are independent and identically distributed with probability distribution π that belongs to the set of joint probability distributions Π(µ, ν), in which their marginal distributions are the corresponding distributions µ and ν. Although the above problem always admits an optimal solution (Vayer, 2020), it is computationally expensive as we need to solve a quadratic optimization problem. For example, when the dimensions of X and Y are m and n, respectively, the complexity of solving the Gromov-Wasserstein minimization problem is O(m2n+n2m) (Peyré et al., 2016; Vincent-Cuaz et al., 2022). 2.2. Gromov-Wasserstein distance using ℓ2-norm Similar to the Wasserstein distance, there is no closed-form expression for the Gromov-Wasserstein distance between general distributions. Even in the important and specific cases when the input distributions are Gaussian, we still do not know if an optimal transport plan needs to be Gaussian or not (Salmona et al., 2021). However, there are some certain understandings of the Gromov-Wasserstein distance when the cost functions are Euclidean norms (Salmona et al., 2021; Vayer, 2020; Gelbrich, 1990), inf π Π(µ,ν) Eπ π X X 2 m Y Y 2 n 2 1 where m and n are Euclidean norms on Rm and Rn, respectively. Theorem 2.1. (Salmona et al., 2021) Let µ N(m0, Σ0) and ν N(m1, Σ1) where m0 Rm,m1 Rn are two vector means and Σ0 Rm m and Σ1 Rn n are the corresponding non-degenerated covariance matrices. Let P0, D0 and P1, D1 be respective diagonalizations of Σ0 and Σ1. Let us define T0 : x Rm P 0 (x m0) and T1 : y Rn P 1 (y m1). The Gromov-Wasserstein problem in equation (2) is equivalent to sup X T0#µ,T1#ν i,j Cov(X2 i , Y 2 j ) + 2 Cov(X, Y ) 2 F where X = (X1, . . . , Xm) , Y = (Y1, . . . , Yn) and . F is the Frobenius norm. This theorem shows that the ℓ2-norm Gromov-Wasserstein problem is equivalent to first shifting two input distributions to have mean zeros and then comparing covariance structures of the shifted distributions. After the shifting step, the translation invariant property is no longer needed and the problem is reduced to finding a match between the covariance structures. 2.3. Inner product Gromov-Wasserstein In this section, we will provide background on the inner product Gromov-Wasserstein between two (unbalanced) Gaussian measures with a discussion about its geometric properties and the effect of an entropic regularization term on an optimal transport plan. Although the Gromov-Wasserstein distance with ℓ2-norm cost function has been used in several practical applications, finding its closed-form expression has remained an open problem due to its dependence on co-moments of order 2 and 4 of feasible transport plans (see Theorem 2.1) (Salmona et al., 2021). In particular, they can only derive the upper and lower bounds for GW2(µ, ν) and they are even not able to show that an optimal transport plan is a Gaussian distribution. By considering the inner product cost function, our work can be seen as the first step towards better understanding and new insights on the ℓ2-norm Gromov-Wasserstein distance between Gaussian distributions. Note that proof techniques ((Salmona et al., 2021), (Janati et al., 2020)) prior to our work to study the Gromov-Wasserstein distance between Gaussian measures were quite complicated. 
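To make the dependence on second- and fourth-order co-moments in Theorem 2.1 concrete, the following sketch estimates the quantity ∑_{i,j} Cov(X_i², Y_j²) + 2‖Cov(X, Y)‖_F² from samples of one particular Gaussian coupling. This is only an illustration: the coupling construction, dimensions, and sample size below are arbitrary choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 2, 3, 200_000          # dimensions and sample size (illustrative)

# Build an arbitrary zero-mean Gaussian coupling of X in R^m and Y in R^n:
# draw a random (m+n) x (m+n) positive definite covariance and sample jointly.
A = rng.standard_normal((m + n, m + n))
Sigma_joint = A @ A.T + 0.5 * np.eye(m + n)
Z = rng.multivariate_normal(np.zeros(m + n), Sigma_joint, size=N)
X, Y = Z[:, :m], Z[:, m:]

# Second-order term: squared Frobenius norm of Cov(X, Y).
cov_xy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / (N - 1)    # m x n
second_order = 2.0 * np.sum(cov_xy ** 2)

# Fourth-order term: sum_{i,j} Cov(X_i^2, Y_j^2).
X2 = X ** 2 - (X ** 2).mean(0)
Y2 = Y ** 2 - (Y ** 2).mean(0)
fourth_order = np.sum(X2.T @ Y2) / (N - 1)

print("objective of Theorem 2.1 under this coupling:", fourth_order + second_order)
```

Because the objective mixes the covariance of (X, Y) with the covariance of (X², Y²), maximizing it over all couplings has no known closed form, which is the difficulty that motivates the inner product cost introduced next.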
Now, we reintroduce the Gromov-Wasserstein using the inner product as a cost function and its unbalanced variant. Definition 2.2 (Inner product Gromov-Wasserstein). (Salmona et al., 2021) Let X, X µ be i.i.d random multivariates in Rm while Y, Y ν are i.i.d random multivariates in Rn. Next, we assume that both (X, Y ) and (X , Y ) are jointly distributed as π. Then, the inner product Gromov Wasserstein (IGW) problem is defined as IGW(µ, ν) := min π Π(µ,ν) Eπ π n X, X Y, Y 2o . In this paper, we utilize an entropy-regularized approach by adding the relative entropy term to the above objective function As a result, we have the following definition, named entropic IGW, which is given by IGWε(µ, ν) := min π Π(µ,ν)Eπ π n X, X Y, Y 2o + εKL π µ ν , (3) where ε > 0 is a regularized parameter and KL(α||β) := R log dα dβ dα + R (dβ dα) for any positive measures α and β. Given that the entropic IGW between Gaussian distributions has a closed-form expression (see Theorem 3.1), it might be useful for some applications such as graph matching and word embedding alignment, which we leave for future development. Entropic unbalanced IGW problem: When µ and ν are unbalanced Gaussian measures which are not probability distributions, the entropic IGW formulation in equation (3) Entropic Gromov-Wasserstein between Gaussian Distributions is no longer valid. One solution to deal with this issue is by regularizing the marginal constraints via KL divergence (Chizat et al., 2018). The entropic IGW problem between unbalanced measures, which we refer to as entropic unbalanced IGW, admits the following form: UIGWε,τ(µ, ν) := min π Eπ π n X, X Y, Y 2o + τKL (πx µ) + τKL (πy ν) + εKL π µ ν , (4) where τ > 0, π M+(Rm Rn) in the minimum, πx and πy are the projections of π on Rm and Rn, respectively, and KL (α||β) := KL(α α||β β). Using the quadratic KL divergence KL (Séjourné et al., 2021) in place of the standard KL will result in UIGWε,τ being 2-homogeneous, i.e., if µ and ν are multiplied by θ 0, then the value of UIGWε,τ(µ, ν) is multiplied by θ2. Invariant property: Unlike the ℓ2-norm Gromov Wasserstein in equation (2), which is invariant to both translation and rotation (Salmona et al., 2021), our IGWε and UIGWε,τ only keeps unchanged under the latter transformation, which is shown in the following lemma with a note that the KL divergence is both rotation and translation invariant. Lemma 2.3 (Invariant to rotation). Let µ and ν be two probability measures on Rm and Rn. Let Tm : x Omx and Tn : y Ony be two functions in which Om and On are orthogonal matrices of size m m and n n, respectively. Then, we have IGWε(Tm#µ, Tn#ν) = IGWε(µ, ν), where # denotes the push-forward operator. The proof of Lemma 2.3 can be found in Appendix A.1. Mean-zero assumption: As being discussed above, the IGWε and UIGWε,τ are not invariant to translation, because the IGW depends on the means of distributions, which is apparently undesirable. To get around this cumbersome, we assume that the means of two distributions are equal to zero, which is equivalent to the shifting step as shown in Section 2.2. Then, the translation invariant property is no longer a concern, and the problem is reduced to working with the covariance structures. By putting the mean-zero assumption, we still follow the main idea of the Gromov-Wasserstein distance, which is comparing the inner structures of two input distributions. 3. 
Entropic Gromov-Wasserstein between Balanced Gaussians

We present in this section a closed-form expression of the entropic inner product Gromov-Wasserstein between two Gaussian measures, together with a formula for a corresponding optimal transport plan, in Theorem 3.1. Let µ = N(0, Σµ) and ν = N(0, Σν) be two Gaussian measures in R^m and R^n, respectively. To ease the ensuing presentation, in the entropic IGW problem we denote the covariance matrix of the joint distribution π of µ and ν by

Σπ = ( Σµ      Kµν
       Kµν^⊤   Σν ),   (5)

where Kµν is the m × n covariance matrix between X ∼ µ and Y ∼ ν. In addition, the rotation invariance of entropic IGW in Lemma 2.3 allows us to assume without loss of generality that Σµ and Σν are diagonal matrices with positive diagonal entries sorted in descending order. In particular,

Σµ = diag(λµ,1, . . . , λµ,m),   Σν = diag(λν,1, . . . , λν,n).   (6)

This step significantly simplifies the otherwise complicated form of a general covariance matrix, since the Gromov-Wasserstein cost depends only on the eigenvalues of the covariance matrices. The following result gives the closed-form expression for the entropic IGW between two balanced Gaussian distributions.

Theorem 3.1 (Entropic IGW has a closed form). Suppose without loss of generality that m ≥ n. Let µ = N(0, Σµ) and ν = N(0, Σν) be two zero-mean Gaussian measures in R^m and R^n, respectively, where Σµ and Σν are diagonal matrices given in equation (6). The entropic inner product Gromov-Wasserstein between µ and ν is then equal to

IGWε(µ, ν) = tr(Σµ²) + tr(Σν²) − 2 ∑_{k=1}^{n} [λµ,k λν,k − ε/4]_+ + (ε/2) ∑_{k=1}^{n} [log(λµ,k λν,k) − log(ε/4)]_+.

Furthermore, an optimal transportation plan of the entropic IGW problem is a zero-mean Gaussian measure π* = N(0, Σπ*), where

Σπ* = ( Σµ         K*µν
        (K*µν)^⊤   Σν ),   with   K*µν = diag( { λµ,k λν,k [1 − ε/(4 λµ,k λν,k)]_+ }^{1/2} )_{k=1}^{n}   (7)

an m × n matrix (its last m − n rows are zero).

Remark 3.2. It can be seen that the quantity IGWε(µ, ν) depends only on the eigenvalues of the covariance matrices of µ and ν. In addition, as ε → 0, IGWε(µ, ν) converges to ∑_{k=1}^{n} (λµ,k − λν,k)² + ∑_{k=n+1}^{m} λµ,k², which is the squared Euclidean distance between the covariance matrices Σµ and Σν. In other words, IGWε(µ, ν) reflects the difference in the inner structures of the two input measures. Regarding the optimal transport plan, when Σµ and Σν are not diagonal, the covariance matrix K*µν in equation (7) turns into

K*,new µν = Pµ Dµ^{1/2} [Λ*µν]^{1/2} Dν^{1/2} Pν^⊤ = Pµ K*µν Pν^⊤,

where [Λ*µν]^{1/2} = diag( { [1 − ε/(4 λµ,k λν,k)]_+ }^{1/2} )_{k=1}^{n}, and Pµ, Dµ and Pν, Dν are the orthogonal diagonalizations of Σµ (= Pµ Dµ Pµ^⊤) and Σν (= Pν Dν Pν^⊤), respectively (see Appendix A.2 for the derivation).

Proof Sketch of Theorem 3.1. The full proof of this theorem is in Appendix A.2. First of all, we note that the objective function depends only on the covariance matrix of the joint distribution and the KL term. Given a fixed covariance matrix, the entropic term in equation (3) forces an optimal transport plan for the entropic IGW problem to be a Gaussian distribution (see Lemma 3.3 below). Subsequently, Lemma 3.4 (cf. the end of this proof) is used to handle the determinant and the trace involving the covariance matrix Kµν between X and Y. The entropic IGW is then reduced to a collection of single-variable minimization problems over a fixed interval, each of which admits a closed-form solution.

Lemma 3.3 (KL divergence minimum). Let Qv be the Gaussian distribution in R^d with mean v and covariance matrix Σv. Denote by Πu,Σu the family of all probability distributions in R^d which have mean u and covariance matrix Σu, and are dominated by Qv. Then,

N(u, Σu) = arg min_{Pu ∈ Πu,Σu} KL(Pu ‖ Qv).

The proof of this lemma is deferred to Appendix A.3.

Lemma 3.4 (Eigenvalues, determinant and trace).
Let UµνΛ 1 2µνV µν be the SVD of Σ 1 2 ν , where Λ 1 2 µν,k)n k=1 with κ 1 2 µν,k is the k-th largest singular value 2 ν . Here, Λ 1 2µν is an m n matrix, while Uµν and Vµν are unitary matrices of sizes m m and n n, respectively. Then, (a) The values of κµν,1, . . . , κµν,n lie between 0 and 1. (b) The determinant of matrix Σπ could be computed as det(Σπ) = det(Σµ)det(Σν) Qn j=1 1 κµν,j . (c) We have tr K µνKµν Pn j=1 λµ,jλν,jκµν,j. The equality occurs when Uµν and Vµν are identity matrices of sizes m and n, respectively. The proof of Lemma 3.4 can be found in Appendix A.4. This lemma allows us to optimize the entropic IGW problem between Gaussian measures µ and ν over the singular values (κµν,k)n k=1 of the matrix Kµν rather than the transport plan π which is quite challenging to directly deal with. Part (a) of Lemma 3.4 shows a representation of the covariance matrix Kµν through its singular values, which is useful for calculating its determinant in part (b). Our key technique to derive the closed-form expression for the entropic IGW lies in part (c) of Lemma 3.4. In that part, we utilize the von Neumann s trace inequality which says that the sum of singular values of the product of two matrices is maximized when their SVDs share the same orthogonal bases and their corresponding singular values match that of each other by their magnitudes. Specifically, Lemma 3.5 (von Neumann s inequality). (Kristof, 1969) Let A1, . . . , An be fixed complex matrices and Z1, . . . , Zn run independently over all unitary matrices, all matrices are of size m m. Let αk1, . . . , αkm be the singular values of Ak for all k [n], all in the same sense monotonically ordered. Then, the maximum of the real part of tr [Z1A1] . . . [Zn An] is given by Pm ℓ=1 α1ℓ. . . αnℓ. In comparison with previous works such as (Janati et al., 2020) and (Salmona et al., 2021), we have a direct attack to the problem by figuring out a detailed formula for the covariance matrix and utilising the von Neumann s inequality to avoid solving a complicated system of equations of the first derivatives of the objective function. Remark 3.6. In the step of applying part (c) of Lemma 3.4, the equality of the von Neumann s inequality for tr(K µνKµν) can occur in many cases (see Appendix A.4). Therefore, the optimal transport plan mentioned in Theorem 3.1 is not unique. For instance, π = N(0, Σπ ) where is also an optimal transport plan. 4. Entropic Unbalanced Gromov-Wasserstein between Unbalanced Gaussians In this section, we consider a more general setting when the two distributions µ and ν are unbalanced Gaussian measures in Rm and Rn, respectively. Specifically, µ = mµN(0m, Σµ), ν = mνN(0n, Σν), (8) where m n; mµ, mν > 0 are their masses and Σµ, Σν are given in equation (6). Here, µ and ν do not necessarily have the same mass, i.e., mµ could be different from mν. For any positive measure α, we denote by α the normalized measure of α. Thus, µ = N(0m, Σµ) and ν = N(0n, Σν). Under the setting of entropic unbalanced IGW, we denote Entropic Gromov-Wasserstein between Gaussian Distributions the covariance matrix of a transportation plan π by where Σx and Σy are covariance matrices of distributions πx and πy, respectively. Note that the objective function of entropic unbalanced IGW in equation (4) involves two new KL divergence terms compare to that of the entropic IGW problem. Thus, before presenting the main theorem of this section, we introduce a tight lower bound for the KL divergence between two Gaussian distributions in the following lemma. 
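Before turning to that lemma, it may help to see the balanced closed form in action, since the unbalanced analysis below builds on it. The minimal sketch below implements the closed-form expression of Theorem 3.1 together with the optimal cross-covariance block of equation (7), and checks the ε → 0 limit of Remark 3.2; the helper name and the eigenvalues are illustrative choices, not from the paper.

```python
import numpy as np

def entropic_igw_balanced(lam_mu, lam_nu, eps):
    """Closed form of Theorem 3.1 for zero-mean Gaussians with diagonal
    covariances diag(lam_mu) in R^m and diag(lam_nu) in R^n, m >= n."""
    lam_mu = np.sort(np.asarray(lam_mu, dtype=float))[::-1]   # descending, as in eq. (6)
    lam_nu = np.sort(np.asarray(lam_nu, dtype=float))[::-1]
    m, n = lam_mu.size, lam_nu.size
    assert m >= n, "convention of Theorem 3.1"
    prod = lam_mu[:n] * lam_nu                                # lambda_{mu,k} * lambda_{nu,k}
    value = (np.sum(lam_mu**2) + np.sum(lam_nu**2)
             - 2.0 * np.sum(np.maximum(prod - eps / 4.0, 0.0))
             + (eps / 2.0) * np.sum(np.maximum(np.log(prod) - np.log(eps / 4.0), 0.0)))
    # Optimal cross-covariance block K*_{mu,nu} (m x n), equation (7).
    kappa = np.maximum(1.0 - eps / (4.0 * prod), 0.0)
    K = np.zeros((m, n))
    K[:n, :n] = np.diag(np.sqrt(prod * kappa))
    return value, K

lam_mu, lam_nu = [3.0, 1.5, 0.7], [2.0, 1.0]
for eps in [1e-6, 0.5, 1.0, 5.0]:
    val, _ = entropic_igw_balanced(lam_mu, lam_nu, eps)
    print(f"eps = {eps:g}: IGW_eps = {val:.4f}")

# Remark 3.2 check: as eps -> 0 the value tends to
# sum_{k<=n} (lam_mu_k - lam_nu_k)^2 + sum_{k>n} lam_mu_k^2.
limit = sum((a - b)**2 for a, b in zip(lam_mu, lam_nu)) + sum(a**2 for a in lam_mu[len(lam_nu):])
print("eps -> 0 limit:", limit)
```

For small ε the value approaches the squared Euclidean distance between the (padded) spectra, matching Remark 3.2; closed-form values of this kind play the role of the theoretical reference lines in the experiments of Section 6.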
Lemma 4.1 (Lower bound for KL divergence). Let π be given in equation (9), λx = (λx,i)m i=1 be the eigenvalues of Σx, and λy = (λy,j)n j=1 be the eigenvalues of Σy. Similarly, we define λµ for Σµ and λν for Σν. Then, we find that (a) KL(πx µ) 1 2 Pm i=1 Ψ λx,iλ 1 µ,i ; (b) KL(πy ν) 1 2 Pn j=1 Ψ λy,jλ 1 ν,j ; (c) KL(π µ ν) = KL(πx µ) + KL(πy ν) + 1 2 Pn k=1 log(1 κxy,k), where Ψ(x) = x log(x) 1, and (κ 1 2 xy,k)n k=1 are singular values of matrix Σ 1 The proof of Lemma 4.1 is in Appendix A.5. Next, we define some functions and quantities that will be used in our analysis. Given τ, ε > 0, let gε,τ,+(x, y; a, b) := x2 + y2 + ε h log(xy) log ε + (τ + ε) hx 2 i 2 h xy ε hε,τ(x; a) := x2 + (τ + ε) hx (ϕk, φk) := arg min x,y>0 gε,τ,+(x, y; λµ,kλν,k), k [n]; ϕk := arg min x>0 hε,τ(x, λµ,k), k = n + 1, . . . , m; for any x, y, a, b > 0. The detailed calculation of (ϕk)m k=1 and (φk)n k=1 can be found in Appendix B. Now, we are ready to state our main result regarding the entropic unbalanced IGW between two unbalanced Gaussian distributions. Theorem 4.2 (Entropic unbalanced IGW has a closed form). Let µ and ν be two unbalanced Gaussian distributions given in equation (8). Then, the entropic unbalanced IGW between µ and ν can be computed as follows: UIGWε,τ(µ, ν) = m2 π Υ + εKL(m2 π m2 µm2 ν) + τ n KL(m2 π m2 µ) + KL(m2 π m2 ν) o , (10) mπ := (mµmν) τ+ε 2τ+ε exp n Υ k=1 gε,τ,+(ϕk, φk; λµ,k, λν,k) k=n+1 hε,τ(ϕk; λµ,k). Furthermore, an optimal transportation plan is an unbalanced Gaussian measure π = mπ N(0, Σπ ), where in which Σ x = diag ϕk m k=1, Σ y = diag φk n k=1, and K xy = diag ψk n k=1 is an m n matrix with ψk := n 1 ε 2ϕkφk for all k [n]. Remark 4.3. Similar to IGWε(µ, ν), equation (10) shows that UIGWε(µ, ν) still depends on the eigenvalues of the covariance matrices of µ and ν via the quantity Υ , the detailed calculation of which is deferred to Appendix B. In addition, as µ and ν are assumed to be unbalanced in this case, the masses of these two measures are also included in the closed-form of UIGWε(µ, ν). Proof Sketch of Theorem 4.2. The full proof of this theorem can be found in Appendix A.6. Recall that feasible transport plans π in the problem (4) are not necessarily probability measures. Therefore, we aim to optimize over the shape π and the masses mπ of π. To do so, we separate those two factors in the objective function (4) as follows: m2 πΥ + εKL(m2 π m2 µm2 ν) + τ n KL(m2 π m2 µ) + KL(m2 π m2 ν) o , (11) Υ := E π π X, X Y, Y 2 + 2εKL( π µ ν) + 2(ε + τ) n KL( πx µ) + KL( πy ν) o . (12) Shape optimization. In a similar fashion to the proof of Theorem 3.1, we show that the sum of the expectation and entropic terms in equation (12) depend only the covariance matrix Σπ of π as below: λx 2 2 + λy 2 2 2 h log(λx,kλy,k) log ε Entropic Gromov-Wasserstein between Gaussian Distributions where λx, λy, λµ and λν are defined as in Lemma 4.1. Putting this result together with parts (a) and (b) of Lemma 4.1, the problem (12) is reduced to min λx,λy>0 k=1 gε,τ,+(λx,k, λy,k; λµ,k, λν,k) k=n+1 hε,τ(λx,k; λµ,k) o . Finally, Lemma C.4 (in Appendix C) allows us to optimize each summation term in the above problem independently but still preserving the decreasing order of eigenvalue sequences (λx,k)n k=1. Eventually, we obtain the optimal value Υ , which will be calculated in detail in Appendix B. Mass optimization. 
Based on the equation (11), the optimal mass mπ is the square root of the minimizer of the function f(x) := xΥ +εKL(x m2 µm2 ν) + τ n KL(x m2 µ) + KL(x m2 ν) o , which is obtained by solving the equation f (x) = 0. 5. (Entropic) Gromov-Wasserstein Barycenter of Gaussian Distributions In this section, we consider the problem of computing the (entropic) Gromov-Wasserstein barycenter of T Gaussian measures µ1, . . . , µT defined in spaces of various dimensions. To tackle this problem, we need to fix the desired dimension of a barycenter, e.g, d, and choose beforehand the positive weighting coefficients α1, . . . , αT such that PT ℓ=1 αℓ= 1 associated with the measures µ1, . . . , µT . Here we allow the dimension of the barycenter to be flexible, since the given Gaussian measures could be in different dimensional spaces. Then, the barycenter of T positive measures under the inner-product Gromov-Wasserstein distance is defined as follows: Definition 5.1 (Inner product Gromov-Wasserstein barycenter). Let µℓbe a positive measure in Rmℓfor any ℓ [T]. Let α1, . . . , αT be positive constants such that PT ℓ=1 αℓ= 1. The inner product Gromov-Wasserstein (IGW) barycenter of {µℓ, αℓ}T ℓ=1 is defined as the probability distribution of the random vector Y in Rd which is a solution of the following problem arg min Xℓ µℓ; dim(Y )=d; (Xℓ,Y ),(X ℓ,Y ) πℓ,y ℓ=1 αℓEπℓ,y πℓ,y n Xℓ, X ℓ Y, Y 2o , where in the above minimum, (Xℓ, Y ) and (X ℓ, Y ) are i.i.d random vectors for each ℓ= 1, . . . , T. Based on this definition, the following theorem provides a closed-form expression for the IGW barycenter of T Gaussian measures. Theorem 5.2 (Inner product Gromov-Wasserstein barycenter has a closed form). Let µℓ= N(0, Σℓ) be Gaussian distribution in Rmℓfor all ℓ [T], where Σℓ= diag λℓ,i mℓ i=1 is an mℓ mℓpositive definite matrix. Let d be a positive integer and assume that d maxℓ [T ] mℓ, the IGW barycenter of the formulation (13) admits a Gaussian solution: µ = N(0, Σµ ), where Σµ = diag λµ ,j d j=1, in which λµ ,j = PT ℓ=1 αℓλℓ,j1j mℓfor all j [d]. The proof of Theorem 5.2 can be found in Appendix A.7. Subsequently, we define the formulation of the barycenter of positive measures using the entropic IGW. Definition 5.3 (Entropic inner product Gromov-Wasserstein barycenter). Let µℓbe a positive measure in Rmℓ for any ℓ [T]. Let α1, . . . , αT be positive constants such that PT ℓ=1 αℓ= 1. The entropic inner-product Gromov Wasserstein barycenter of {µℓ, αℓ}T ℓ=1 is defined as the probability distribution of the random vector Y in Rd which is a solution of the following problem arg min Xℓ µℓ;Y Rd;Y µ; (Xℓ,Y ),(X ℓ,Y ) πℓ,y ℓ=1 αℓ n Eπℓ,y πℓ,y Xℓ, X ℓ Y, Y 2 + εKL(πℓ,y µℓ µ) o , (14) where in the above minimum, {X ℓ} and {Xℓ} are i.i.d. random vectors and Y and Y are i.i.d. random vectors. From that definition, we have the following result for the entropic IGW barycenter of T Gaussian distributions. Theorem 5.4 (Entropic inner product Gromov-Wasserstein barycenter has a closed form). Let µℓ= N(0, Σℓ) be Gaussian distribution in Rmℓfor all ℓ [T], where Σℓ= diag λℓ,i mℓ i=1 is an mℓ mℓpositive definite matrix. Let d be a positive integer such that d maxℓ [T ] mℓ. Define dℓ= min{d, mℓ}. Let ε be a positive constant satisfying ε 2λℓ,j Aj + (A2 j εBj) 1 2 , (15) and A2 j εBj for any ℓ [T] and j [d], where Aj = PT ℓ=1 αℓλℓ,j1j dℓand Bj = PT ℓ=1 αℓ1j dℓ. 
Then, the barycenter of the formulation (14) admits a Gaussian solution which has the form: µ = N(0, Σµ ), in which Σµ = diag(λµ ,j)d j=1 with ℓ=1 αℓλℓ,j1j dℓ λℓ,j Aj + (A2 j εBj) 1 2 Entropic Gromov-Wasserstein between Gaussian Distributions The proof of Theorem 5.4 is in Appendix A.8. Remark 5.5. In the above theorem, we need a set of conditions for ε that could be satisfied when ε is small, which is often chosen in practice. When ε is not small, then the readers could follow the guideline in Lemma C.5 in Appendix C to find the covariance matrix Σµ . The proof of Theorem 5.4 needs the below lemma to show the existence of a Gaussian minimizer. Lemma 5.6. Let QX = N(γx, Σx) be a Gaussian distribution in Rm and PY be a probability distribution with mean γy Rn and covariance matrix Σy. Denote by Π(QX, PY ) the family of all probability distributions in Rm+n which have marginal distributions QX and PY , and covariance matrix ΣX,Y = . Here, Σx and Σy are non- degenerate covariance matrices of size m m and n n, respectively. Then, N [γx, γy] , ΣX,Y arg min PX,Y Π(QX,PY ) KL PX,Y QX PY . The proof of Lemma 5.6 is in Appendix A.9. 6. Empirical Studies In this section, we will use the derived closed-forms to inspect the behavior of algorithms solving entropic Gromov Wasserstein problem, in particular those studied in (Peyré et al., 2016). We use the implementation of these algorithms in Python Optimal Transport library (Flamary & Courty, 2017). The implementation is available at https://github.com/lntk/egw_gaussians. Projected mirror descent for IGWε (Figure 1). The algorithm in (Peyré et al., 2016) reads C(t) = Da P (t)Db P (t+1) = min P Π(a,b) P, C(t) εH(P), in which Da, Db correspond to inner product cost matrices on supports of µN and νN, Π(a, b) is the set of discrete couplings {P RN N : P1N = a, P 1N = b}, and P, C(t) = P i,j [N] Pij C(t) ij . Here, we set m = 2, n = 3 and sample N (between 10 and 2000) data points from N(02, Σµ) and N(03, Σν), in which Σµ and Σν are diagonal matrices whose diagonal values are uniformly sampled from the interval [0, 2]. The regularization parameter ε is chosen from {0.5, 1, 5}. We plot means and one standarddeviation areas over 20 independent runs for each N. As expected, with more samples we can approximate IGWε more accurately. However, since this algorithm only converges to a stationary point, there might still be a gap between the converged value and the optimal value. 0 500 1000 1500 2000 2 empirical true 0 500 1000 1500 2000 3 empirical true 0 500 1000 1500 2000 10 empirical true # of samples Figure 1. Empirical convergence for (Peyré et al., 2016) in computing IGWε. From left to right: ε = 0.5, ε = 1 and ε = 5. The red dashed lines correspond to the theoretical optimal values from Theorem 3.1, while the blue lines and shaded regions are the means and standard variations of the objective values computed according to (Peyré et al., 2016), respectively. 0 500 1000 1500 2000 1.48 empirical true 0 500 1000 1500 2000 1.80 empirical true 0 500 1000 1500 2000 4.4 empirical true # of samples Figure 2. Empirical convergence for (Séjourné et al., 2021) in computing UIGWε,τ. From left to right: ε = 0.5, ε = 1 and ε = 5 while τ is fixed to 1. The red dashed lines correspond to the theoretical optimal values from Theorem 4.2, while the blue lines and shaded regions are the means and standard variations of the objective values computed according to (Séjourné et al., 2021), respectively. Alternating minimization for UIGWε,τ (Figure 2). 
Next, we use the closed-form for the unbalanced formulation (see Theorem 4.2) to demonstrate the convergence of a recently proposed algorithm (Séjourné et al., 2021, Algorithm 1), solving a bi-convex relaxation of the problem (4), which is tight for negative semi-definite kernels and the mass balance between two measures (Séjourné et al., 2021, Theorem 4). Thus, we will only consider two measures of the same mass in this section, and the setting is essentially the same as the previous section as follows: m = 2, n = 3 and we sample N (between 10 and 2000) data points from N(02, Σµ) and N(03, Σν), in which Σµ and Σν are diagonal matrices whose diagonal values are uniformly samples from the interval [0, 2]. The regularization parameter ε is chosen from {0.5, 1, 5} and τ is set to 1. We plot means and one standard-deviation areas over 20 independent runs for each N. The observed behavior is also similar to what we saw previously: the approximation quality for UIGWε,τ increases with more samples. Visualizing optimal transportation plans of for n = m = 1 (Figure 3). Now we take a look at the trans- Entropic Gromov-Wasserstein between Gaussian Distributions Figure 3. Empirical transportation plans between 1-dimensional Gaussians N(0, 2) and N(0, 10). From left to right: ε = 0.1, ε = 1, ε = 20, ε = 40, ε = 80, ε = 100. The first rows present plans returned by (Peyré et al., 2016) and the second row corresponds to theoretical plans in Theorem 3.1. portation plans computed by the above algorithm for different values of ε. Specifically, we create 1000-bin histograms of N(0, α) and N(0, β) (which will be µN and νN with N = 1000) in which α = 2, β = 10, and set ε {0.1, 1, 20, 40, 80, 100} (values in this set are chosen based on αβ). The algorithm is run till convergence (with tolerance 10 9) and the plans are plotted in Figure 3. It is apparent that the plans returned by the projected mirror descent algorithm are Gaussians and resemble our theoretical ones (except for the case ε = 1, but noting that when n = m = 1, if N(0, ( α γ γ β )) is an optimal plan for IGWε, so is N(0, ( α γ γ β )). Alternating minimization for the barycenter problem (Figure 4). Given T discrete measures {µ(i) N }T i=1, a set of weights {αi}T i=1, cost matrices C(i) corresponding to measure µ(i) N and a probability vector p, (Peyré et al., 2016) proposes an alternating minimization scheme to find a ddimensional barycenter µ of {µ(i) N }T i=1 with probability mass p, i.e., finding the cost matrix C on the support of this barycenter. In our case of inner product metric, C is a Gram matrix, and we can recover the supports of µ via the Cholesky decomposition C = LL where L RN N and choosing the first d columns of L. In our setting, for i [T], we choose µ(i) N = PN j=1 1 N δXj where Xj N(0, Σ(i)), C(i) kl = Xk, Xl for k, l [N] and p = 1N/N. Specifically, we consider N = 1000, T = 3, Σ(1) = 5, Σ(2) = diag(10, 1), Σ(3) = diag(2, 1, 1), α = (0.3, 0.6, 0.1) and d = 2. The regularization parameter ε is set to 0.1, which satisfies all the constraints in Theorem 5.4. After finding N support points for µ by the discussed method, we compute the sample covariance matrix ˆΣ, perform orthogonal diagonalization, and apply the corresponding transformation to the support. We then compute its sample mean and sample covariance matrix, and compare the Gaussian with these statistics with our theoretical optimal plan (see Figure 4). 
It is worth noting that any orthogonally-transformed version of the found barycenter is also a barycenter of {µ(i) N }T i=1 Figure 4. Visualizing the barycenter. The blue points correspond to one possible set of support for the computed barycenter by (Peyré et al., 2016) (see discussion). The blue contours are the rotated version of the Gaussian fitted to this support set (for comparison). The red contours represent the theoretical barycenter whose form is presented in Theorem 5.4. under the Gromov-Wasserstein setting. 7. Conclusion In this paper, we provide a comprehensive study of the entropic (unbalanced) inner product Gromov-Wasserstein (IGW) between (unbalanced) Gaussian distributions. We demonstrate that the optimal transportation plan is (unbalanced) Gaussian distribution. Based on that result and a novel application of von Neumann s trace inequality, we derive the closed-form expression for the entropic (unbalanced) IGW between these distributions. Finally, we also consider the (entropic) Gromov-Wasserstein barycenter problem of multiple Gaussian measures. We prove that the barycenter problem admits a Gaussian minimizer and obtain the closed-form expression for the covariance matrix of the barycenter when the entropic regularization parameter is small. Acknowledgements NH gratefully acknowledges support from the NSF IFML 2019844 award and research gifts by the NSF AI Institute for Foundations of Machine Learning. Altschuler, J., Niles-Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in neural information processing systems, pp. 1964 1974, 2017. (Cited on page 1.) Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1881 1890, 2018. (Cited Entropic Gromov-Wasserstein between Gaussian Distributions on page 1.) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214 223, 2017. (Cited on page 1.) Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 1(51):22 45, 2015. (Cited on page 1.) Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. Learning generative models across incomparable spaces. In International Conference on Machine Learning, pp. 851 861, 2019. (Cited on page 1.) Chen, L., Gan, Z., Cheng, Y., Li, L., Carin, L., and Liu, J. Graph optimal transport for cross-domain alignment. In International Conference on Machine Learning, pp. 1542 1553. PMLR, 2020. (Cited on page 1.) Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563 2609, 2018. (Cited on pages 1 and 4.) Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9): 1853 1865, 2016. (Cited on page 1.) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292 2300, 2013. (Cited on page 1.) Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 447 463, 2018. 
(Cited on page 1.) Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. (Cited on page 1.) Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn s algorithm. In ICML, pp. 1367 1376, 2018. (Cited on page 1.) Fatras, K., Zine, Y., Flamary, R., Gribonval, R., and Courty, N. Learning with minibatch Wasserstein: asymptotic and gradient properties. In AISTATS 2020-23nd International Conference on Artificial Intelligence and Statistics, volume 108, pp. 1 20, 2020. (Cited on page 1.) Flamary, R. and Courty, N. Pot python optimal transport library, 2017. URL https://pythonot.github. io/. (Cited on page 8.) Gelbrich, M. On a formula for the L2 wasserstein metric between measures on euclidean and hilbert spaces. Mathematische Nachrichten, 147:185 203, 1990. (Cited on pages 2 and 3.) Genevay, A., Peyre, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608 1617, 2018. (Cited on page 1.) Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein procrustes. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1880 1890, 2019. (Cited on page 1.) Ho, N., Nguyen, X., Yurochkin, M., Bui, H. H., Huynh, V., and Phung, D. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501 1509, 2017. (Cited on page 1.) Ho, N., Huynh, V., Phung, D., and Jordan, M. I. Probabilistic multilevel clustering via composite transportation distance. In AISTATS, 2019. (Cited on page 1.) Horn, R. A. and Johnson, C. R. Topics in Matrix Analysis. Cambridge University Press, 1991. doi: 10.1017/ CBO9780511840371. (Cited on pages 2, 15, 19, 33, 37, and 38.) Janati, H., Muzellec, B., Peyré, G., and Cuturi, M. Entropic optimal transport between unbalanced gaussian measures has a closed form. In Advances in Neural Information Processing Systems, volume 33, pp. 10468 10479, 2020. (Cited on pages 2, 3, 5, 12, and 32.) Kolouri, S., Zou, Y., and Rohde, G. K. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258 5267, 2016. (Cited on page 1.) Kristof, W. Generalization of a theorem by John von Neumann on the trace of certain matrix products. ETS Research Bulletin Series, 1969:i 13, 12 1969. doi: 10.1002/j.2333-8504.1969.tb00744.x. (Cited on pages 2, 5, 15, 19, 33, 37, and 38.) Le, T., Nguyen, T., Ho, N., Bui, H., and Phung, D. LAMDA: Label matching deep domain adaptation. In ICML, 2021. (Cited on page 1.) Lin, T., Ho, N., and Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pp. 3982 3991, 2019. (Cited on page 1.) Entropic Gromov-Wasserstein between Gaussian Distributions Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., and Stöter, F.-R. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pp. 4104 4113. PMLR, 2019. (Cited on page 1.) Nguyen, K., Ho, N., Pham, T., and Bui, H. 
Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021a. URL https://openreview.net/ forum?id=QYj O70ACDK. (Cited on page 1.) Nguyen, K., Nguyen, D., Nguyen, Q., Pham, T., Bui, H., Phung, D., Le, T., and Ho, N. On transportation of mini-batches: A hierarchical approach. ar Xiv preprint ar Xiv:2102.05912, 2021b. (Cited on page 1.) Nguyen, T., Pham, Q.-H., Le, T., Pham, T., Ho, N., and Hua, B.-S. Point-set distances for learning representations of 3D point clouds. In ICCV, 2021c. (Cited on page 1.) Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pp. 2664 2672, 2016. (Cited on pages 1, 3, 8, and 9.) Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018. (Cited on page 1.) Salmona, A., Delon, J., and Desolneux, A. Gromov Wasserstein distances between Gaussian distributions. Arxiv Preprint Arxiv: 2104.07970, 2021. (Cited on pages 1, 2, 3, 4, and 5.) Séjourné, T., Vialard, F.-X., and Peyré, G. The unbalanced Gromov Wasserstein distance: Conic formulation and relaxation. Arxiv Preprint Arxiv: 2009.04266, 2020. (Cited on page 1.) Solomon, J., Peyré, G., Kim, V. G., and Sra, S. Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (TOG), 35(4):72, 2016. (Cited on page 1.) Séjourné, T., Vialard, F.-X., and Peyré, G. The unbalanced gromov wasserstein distance: Conic formulation and relaxation. In Advances in Neural Information Processing Systems, 2021. (Cited on pages 4 and 8.) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. (Cited on page 1.) Vayer, T. A contribution to optimal transport on incomparable spaces, 2020. (Cited on page 3.) Vincent-Cuaz, C., Flamary, R., Corneli, M., Vayer, T., and Courty, N. Semi-relaxed gromov-wasserstein divergence with applications on graphs, 2022. (Cited on page 3.) Xu, H., Luo, D., and Carin, L. Scalable Gromov Wasserstein learning for graph partitioning and matching. In Advances in neural information processing systems, pp. 3052 3062, 2019. (Cited on page 1.) Entropic Gromov-Wasserstein between Gaussian Distributions Supplementary Materials for Entropic Gromov-Wasserstein between Gaussian Distributions In this supplement, we first provide proofs of remaining lemmas and theorems in Appendix A. Then, we present how to derive a closed-form formulation for Υ in Theorem 4.2 in Appendix B. Auxiliary results are presented in Appendix C while another proof of Lemma C.4 is in Appendix D. We highlight the issue of obtaining closed-form expression for entropic inner product Gromov-Wasserstein between non-zero means Gaussian distributions in Appendix E. Finally, we revisit the entropic optimal transport between (unbalanced) Gaussian distributions in Appendix F and provide a simpler proof than that in (Janati et al., 2020) to derive the closed-form expression for that problem. A. Proofs of remaining results This appendix is devoted to provide the proofs of lemmas and theorems presented in the paper. A.1. Proof of Lemma 2.3 Firstly, note that when π Π(Tm#µ, Tn#ν), there exists π Π(µ, ν) such that (Tm, Tn)#π = π . Therefore, we have Eπ π [ X, X Y, Y ]2 = E(Tm,Tn)#π (Tm,Tn)#π[ Tm X, Tm X Tn Y, Tn Y ]2 = Eπ π[ X, X Y, Y ]2. 
(16) Next, we will show that π is absolutely continuous w.r.t µ ν if and only if π is absolutely continuous w.r.t (Tm#µ Tn#ν). Assume that π is absolutely continuous w.r.t µ ν. Consider two arbitrary Borel set U, V such that (Tm#µ)(U) (Tn#ν)(V ) = 0, which is equivalent to µ(T 1 m (U)) ν(T 1 n (V )) = 0. Since π is dominated by µ ν, we have π(T 1 m (U), T 1 n (V )) = 0, or equivalently, π (U, V ) = [(Tm, Tn)#π](U, V ) = 0. Thus, the first part of the above statement is proved, and the second part is obtained similarly. It follows that the KL terms used below are well defined. Notice that dπ (x, y) = dπ(T 1 m (x), T 1 n (y)), d(Tm#µ)(x) = dµ(T 1 m (x)) and d(Tn#ν)(y) = dν(T 1 n (y)). Therefore, by changing of variables x 7 Tm(u) and y 7 Tn(v), we get KL(π (Tm#µ) (Tn#ν)) (17) = Z log dπ (x, y) d(Tm#µ)(x)d(Tn#ν)(y) = Z log dπ (Tm(u), Tn(v))) d(Tm#µ)(Tm(u))d(Tn#ν)(Tn(v)) |det(T m(u))det(T n(v))|dπ (Tm(u), Tn(v)) = Z log dπ(u, v) dµ(u)dν(v) = KL(π µ ν), (18) with a note that |det(T m(u)| = |det(Om)| = 1 and |det(T n(v))| = |det(On)| = 1 since Om and On are orthogonal matrices. Putting the results in equations (16) and (18) together, we obtain the conclusion. A.2. Proof of Theorem 3.1 Firstly, we will show that Eπ π n X, X Y, Y 2o = tr(Σ2 µ) + tr(Σ2 ν) 2 tr K µνKµν . (19) Entropic Gromov-Wasserstein between Gaussian Distributions Indeed, we have Eπ π n X, X Y, Y 2o = Eπ π X, X 2 + Eπ π Y, Y 2 2Eπ π X, X Y, Y . As π is a coupling of two zero-mean distributions µ and ν, π also has mean zero. Combining this result with the independence between X and X and Y and Y leads to Eπ π X, X 2 = tr(Σ2 µ) and Eπ π Y, Y 2 = tr(Σ2 ν). Meanwhile, we have Eπ π[ X, X Y, Y ] = Eπ π h m X j=1 Xi X i Yj Y j i = j=1 Eπ π[Xi Yj]Eπ[X i Y j ] = tr K µνKµν . Putting the above results together, we obtain the desired equality (19). It indicates that we can rewrite the formulation of IGWε(µ, ν) as follows: IGWε(µ, ν) = tr(Σ2 µ) + tr(Σ2 ν) + min π Π(µ,ν) n εKL(π µ ν) 2 tr K µνKµν o . (20) To solve the minimization problem (20), we firstly fix the matrix Kµν, therefore, the covariance matrix of π is also fixed due to equation (5). By Lemma 3.3, the optimal transport plan of this problem needs to be a Gaussian distribution. Thus, according to Lemma C.3 in Appendix C and part (b) of Lemma 3.4, the entropic term in the objective function reads εKL(π µ ν) = 1 2ε n tr ΣπΣ 1 µ ν (m + n) + log det(Σµ ν) i=1 log(1 κµν,i), where ([κµν,i] 1 2 )n i=1 are singular values (in descending order) of matrix Σ 1 2 ν . By applying part (c) of Lemma 3.4, the optimal value of the term tr K µνKµν is achieved at Kµν = Σ 1 2µdiag([κµν,k] 1 2 )n k=1Σ 1 2ν . Therefore, the optimization problem (20) reduces to IGWε(µ, ν) = tr(Σ2 µ) + tr(Σ2 ν) max κµν,k [0,1) k=1 λµ,kλν,kκµν,k + ε k=1 log(1 κµν,k) o . The function f(x) = ax + ε 4 log(1 x) for a > 0 determined in the interval [0, 1) attains its maximum at x = 1 ε 4a +. Thus, κ µν,k = 1 ε 4λµ,kλν,k + for all k [n], and IGWε(µ, ν) = tr(Σ2 µ) + tr(Σ2 ν) 2 h log(λµ,kλν,k) log ε Additionally, we have K µν = diag n λµ,kλν,k h 1 ε 4λµ,kλν,k As a consequence, the proof is completed. When Σµ and Σν are not diagonal: The value of IGWε(µ, ν) still remains according to Lemma 2.3 but the optimal transport plan does not. More specifically, the equality conditions mentioned in part (c) of Lemma 3.4 will be no longer valid. Instead, let Pµ, Dµ and Pν, Dν are the orthogonal diagonalizations of Σµ(= PµDµP µ ) and Σν(= PνDνP ν ), Entropic Gromov-Wasserstein between Gaussian Distributions respectively. 
By using the same approach as in Lemma 3.4, let UµνΛ 1 2µνV µν be the SVD of matrix Σ 1 1 2µν = diag([κµν,k] 1 2 )n k=1, we have Kµν = Σ 1 2ν . Therefore, tr(K µνKµν) = tr Σ 1 2µν) U µνΣµUµνΛ 1 2µν) U µνΣµUµνΛ 1 2µνV µνΣν 1 2µν) U µνPµDµP µ UµνΛ 1 2µνV µνPνDνP ν = tr [P ν Vµν(Λ 1 2µν) ][U µνPµDµ][P µ UµνΛ 1 2µν][V µνPνDν] i=1 λµ,iλν,iκµν,i. The equality occurs when Uµν = Pµ and Vµν = Pν. Hence, the covariance matrix K µν in this case turns into K ,new µν = Σ 1 2ν Pν = PµK µνP ν . A.3. Proof of Lemma 3.3 Let qv and pu be the probability density functions of distributions Qv and Pu, respectively, whereas qu be the probability density function of the Gaussian distribution Qu in Rd which has mean u and covariance matrix Σu. Then, we have KL(Pu Qv) = Z Rd pu log pu Rd pu log pu Rd pu log qu = KL(Pu Qu) + Z Rd pu log qu Rd pu log qu The equality happens when Pu = Qu. Now we prove the lower bound is constant. x pu log qv = Epu log(qv) = Epu h 1 2 log (2π)ddet(Σv) 1 2(x v) Σ 1 v (x v) i 2Epu h (x v) Σ 1 v (x v) i + const 2Epu h (x u + u v) Σ 1 v (x u + u v) i + const 2Epu h (x u) Σ 1 v (x u) + 2(x u) Σ 1 v (u v) + (u v) Σ 1 v (u v) i + const 2Epu h (x u) Σ 1 v (x u) i + const 2Epu h (x u) Σ 1 2 v (x u) i + const 2Epu h tr Σ 1 2 v (x u)(x u) Σ 1 2 v i + const 2 v + const. (21) Similarly, we have Epu log(qu) is a constant. Hence, the proof is completed. Entropic Gromov-Wasserstein between Gaussian Distributions A.4. Proof of Lemma 3.4 (a) We start with decomposing the matrix Kµν as Kµν = Σ 1 2ν . It follows from the fact Σπ is non-negative definite that Σν K µνΣ 1 µ Kµν 1 2µν) U µνUµνΛ 1 2ν VµνΛµνV µνΣ Idn VµνΛµνV µν, where Λµν = diag κµν,k n k=1 is an n n matrix. Since Vµν is an unitary matrix, all eigenvalues of matrix Λµν, which are (κµν,k)n k=1, belong to the interval [0, 1]. (b) By the definition of matrix Σπ, we have det(Σπ) = det(Σµ)det Σν K µνΣ 1 µ Kµν = det(Σµ)det(Σν)det Idn VµνΛµνV µν = det(Σµ)det(Σν) k=1 (1 κµν,k). The third equality results from the fact that all eigenvalues of matrix Idn VµνΛµνV µν are 1 κµν,k for k [n]. (c) Using the decomposition of matrix Kµν in part (a), tr K µνKµν can be rewritten as 1 2µν) U µνΣµUµνΛ 1 2ν = tr Vµν(Λ 1 2µν) U µνΣµ UµνΛ 1 2µν V µνΣν . Applying the generalization of the von Neumann s inequality (Kristof, 1969; Horn & Johnson, 1991) for singular values with a note that the equality happens when Uµν and V µν are identity matrices, we obtain i=1 λµ,iλν,iκµν,i. Hence, we obtain the conclusion of this lemma. It is worth noting that the equality for the above inequality can also occur in other cases, for example, when Uµν = Idm, Vµν = Idn or when Uµν = Idm, Vµν = Idn. A.5. Proof of Lemma 4.1 By the formula for KL divergence between Gaussians in Lemma C.3 in Appendix C and applying the von Neumann s trace inequality (Kristof, 1969; Horn & Johnson, 1991), we have KL(πx µ) = 1 tr ΣxΣ 1 µ m + log det(Σµ) λµ,i log λx,i Similarly, we find that λν,j log λy,j Entropic Gromov-Wasserstein between Gaussian Distributions For the last equality, KL(π µ ν) is equal to tr ΣπΣ 1 µ ν (m + n) log det(Σπ) n tr ΣxΣ 1 µ + tr ΣyΣ 1 ν (m + n) log det(Σx)det(Σy) det(Σµ)det(Σν) k=1 log(1 κxy,k) o = KL(πx µ) + KL(πy ν) 1 k=1 log(1 κxy,k). The first equality results from the form of the matrix Σµ ν = A.6. 
Proof of Theorem 4.2 For the sake of computing, we will firstly show that all quadratic KL terms in the objective function (4) can be transformed to their normal versions as follows: KL (πx µ) = 2m2 πKL( πx µ) + KL(m2 π m2 µ), (22) KL (πy ν) = 2m2 πKL( πy ν) + KL(m2 π m2 ν), (23) KL (π µ ν) = 2m2 πKL( π µ ν) + KL m2 π m2 µm2 ν . (24) By applying parts (ii) and (i) of Lemma C.1 in that order, we get the equation (22) KL (πx µ) = KL (mπ πx mµ µ) = m2 πKL ( πx µ) + m2 π log m2 π m2µ + (m2 µ m2 π) = 2m2 πKL( πx µ) + KL(m2 π m2 µ). The equations (23) and (24) are obtained similarly. As a result, the objective function in equation (4) can be rewritten as m2 πE π π n X, X Y, Y 2o + ε n 2m2 πKL( π µ ν) + KL m2 π m2 µm2 ν o + τ n 2m2 πKL( πx µ) + KL(m2 π m2 µ) + 2m2 πKL( πy ν) + KL(m2 π m2 ν) o =m2 πΥ + εKL(m2 π m2 µm2 ν) + τ n KL(m2 π m2 µ) + KL(m2 π m2 ν) o , (25) in which Υ is defined as Υ := E π π n X, X Y, Y 2o + 2εKL( π µ ν) + 2τ n KL( πx µ) + KL( πy ν) o . (26) Now, we will divide the rest of the proof into two parts: shape optimization and mass optimization. Shape optimization. The problem of interest now is min π Υ. We will first prove that the optimal shape π is a zero-mean Gaussian measure. Denote by (u, v) and Σπ the mean vector and covariance matrix of π where u Rm and v Rn. Let Z = X u, Z = X u and T = Y v, T = Y v. According to Lemma C.7, we have E π π n X, X Y, Y 2o =E π π n Z, Z T, T 2o + 2u Σxu 4u Σxyv + 2v Σyv + h u 2 v 2i2 = tr(Σ2 x) + tr(Σ2 y) 2 tr(K xy Kxy) + 2u Σxu 4u Σxyv + 2v Σyv + h u 2 v 2i2 . Entropic Gromov-Wasserstein between Gaussian Distributions The second equality is obtained by using the same arguments for deriving equation (19). Let us fix the covariance matrix and mean vector of π, therefore, the value of the expectation term in equation (26) is also fixed. Then, Lemma 3.3 indicates that π needs to be a Gaussian measure. while Lemma C.3 forces its mean to be zero. Consequently, the problem min π Υ is reduced to Υ := min Σπ n tr(Σ2 x) + tr(Σ2 y) 2 tr(K xy Kxy) + 2εKL( π µ ν) + 2τKL( πx µ) + 2τKL( πy ν) o (27) Using the same arguments as in Theorem 3.1, we obtain that the minimum value of tr(Σ2 x) + tr(Σ2 y) 2 tr(K xy Kxy) + 2εKL( π µ ν) is equal to λx 2 2 + λy 2 2 2 h log(λx,kλy,k) log ε Combining this result with parts (a) and (b) of Lemma 4.1 implies that the problem (27) is equivalent to Υ = min λx,λy>0 k=1 gε,τ,+(λx,k, λy,k; λµ,k, λν,k) + k=n+1 hε,τ(λx,k; λµ,k) o , with note that (λx,k)m k=1 and (λy,k)n k=1 are decreasing sequences. By Lemma C.4, each summation term of the above problem can be optimized independently but still preserving the order of (λx,k)n k=1 as follows: k=1 min λx,k,λy,k>0 gε,τ,+ λx,k, λy,k; λµ,i, λν,j + k=n+1 min λx,k>0 hε,τ λx,k, λµ,k . A detailed calculation of Υ is deferred to Appendix B. Mass optimization. Finally, based on equation (25), the optimal mass is obtained by taking the square root of the minimizer of the following problem: m2 π = arg min x>0 f(x) := Υ x + εKL(x m2 µm2 ν) + τKL(x m2 µ) + τKL(x m2 ν). By part (c) of Lemma C.6, we obtain mπ = (x ) 1 2 = (mµmν) τ+ε 2τ+ε exp n Υ 2(2τ+ε) o . Hence, the proof is completed. A.7. Proof of Theorem 5.2 Firstly, let Kℓy be the covariance matrix between Xℓand Y while v and Σy be the mean and covariance matrix of Y , respectively. By Lemma C.7, the objective function in equation (13) is rewritten as ℓ=1 αℓEπℓ,y πℓ,y n Xℓ, X ℓ Y, Y 2o = ℓ=1 αℓ n Eπℓ,y πℓ,y Xℓ, X ℓ Y v, Y v 2 + 2v Σyv + v 4o . As Σy is a positive definite matrix, we have 2v Σyv + v 4 > 0 when v = 0d. 
Therefore, the barycenter µ must have mean zero. Next, we denote by λℓ,i and λy,i the ith largest eigenvalues of Σℓand Σy, respectively. By using the same arguments for deriving equation (19), we get ℓ=1 αℓEπℓ,y πℓ,y n Xℓ, X ℓ Y, Y 2o = ℓ=1 αℓ n d X i=1 λ2 y,i 2 tr K ℓy Kℓy + j=1 λ2 ℓ,j o . According to parts (a) and (c) of Lemma 3.4, we have tr(K ℓy Kℓy) min{d,mℓ} X j=1 λy,jλℓ,j, Entropic Gromov-Wasserstein between Gaussian Distributions where the equality happens when (Xℓ)j = p λℓ,j Zj and Yj = p λy,j Zj with (Zj)j are i.i.d. standard normal random variables. Thus, the problem now is reduced to minimize ℓ=1 αℓ n d X j=1 λ2 y,j 2 min{d,mℓ} X j=1 λy,jλℓ,j o = n λ2 y,j 2λy,j ℓ=1 αℓλℓ,j1j mℓ o . This quantity has a quadratic form, so it attains its minimum at ℓ=1 αℓλℓ,j1j mℓ; j [d]. Hence, we obtain the conclusion of the theorem. A.8. Proof of Theorem 5.4 Firstly, we will show that the barycenter µ has mean zero. Let Kℓy be the covariance matrix between Xℓand Y while v and Σy be the mean and covariance matrix of Y , respectively. By Lemma C.7, the objective function in equation (14) can be rewritten as ℓ=1 αℓ n Eπℓ,y πℓ,y Xℓ, X ℓ Y, Y 2 + εKL(πℓ,y µℓ µ) o ℓ=1 αℓ n Eπℓ,y πℓ,y Xℓ, X ℓ Y v, Y v 2 + 2v Σyv + v 4o + εKL(πℓ,y µℓ µ). Note that πℓ,y and µℓ µ share the same mean and KL divergence is invariant to translation, which make the quantity KL(πℓ,y µℓ µ) become independent of the mean vector of πℓ,y. Additionally, as Σy is a positive definite matrix, we have 2v Σyv + v 4 > 0 when v = 0d. Therefore, the barycenter µ must have mean zero. Moreover, according to Lemma 5.6, there exists a Gaussian solution, says µ . Then, the objective function is reduced to ℓ=1 αℓ n d X i=1 λ2 y,i 2 tr K ℓy Kℓy + j=1 λ2 ℓ,j o + εKL(πℓ,y µℓ µ). Denote by Σℓ,y and Σℓ y the covariance matrices of πℓ,y and µℓ µ, respectively. The entropic term in the above equation is then equal to tr(Σℓ,yΣ 1 ℓ y) + log det(Σℓ y) (mℓ+ d) = ε 2 log det(Σℓ y) The problem is thus reduced to minimize j=1 λ2 y,j 2 ℓ=1 αℓtr K ℓy Kℓy + 1 ℓ=1 αℓε log det(Σℓ y) Use the same approach of Lemma 3.4, let UℓyΛ 1 2 ℓy V ℓy be the SVD of matrix Σ 1 2 y , or equivalently, 1 2 ℓy V ℓy Σ Denote by κ 1 2 ℓy,j the j-th largest singular value of Σ 1 2 y . Then, we find that log det(Σℓ y) j=1 log(1 κℓy,j). Entropic Gromov-Wasserstein between Gaussian Distributions By applying the von Neumann s trace inequality (Kristof, 1969; Horn & Johnson, 1991), we have tr(K ℓy Kℓy) = tr Σ 1 2 ℓy) U ℓyΣℓUℓyΛℓy V ℓy Σ 1 2y = tr [Vℓy(Λ 1 2 ℓy) ] [U ℓyΣℓ] [UℓyΛ 1 2 ℓy] [V ℓy Σy] j=1 λℓ,jλy,jκℓy,j, with a note that all Λℓy, Σy and Σℓare diagonal matrices. The equality occurs when Vℓy and Uℓy are identity matrices. Put the above results together, the problem in equation (28) is reduced to minimizing t=1 λℓ,tκℓy,tλy,t ε t=1 log(1 κℓy,t), or equivalently, n λ2 y,j 2λy,j ℓ=1 αℓλℓ,jκℓy,j1j dℓ o ε t=1 log(1 κℓy,t). By fixing the values of κℓy,t [0, 1] for all t [dℓ], this quantity attains its minimum at ℓ=1 αℓλℓ,jκℓy,j1j dℓ, j [d]. Therefore, the problem of entropic IGW barycenter is equivalent to finding the maximum of ℓ=1 αℓλℓ,jκℓy,j1j dℓ i2 + ε j=1 αℓlog(1 κℓy,j) ℓ=1 αℓλℓ,jκℓy,j1j dℓ i2 + ε ℓ=1 αℓlog(1 κℓy,j)1j dℓ o . We need to maximize each term, which is ℓ=1 αℓλℓ,jκℓy,j1j dℓ i2 + ε 2αℓlog(1 κℓy,j)1j dℓ, under the constraints that 0 κℓy,j 1, for all j [dℓ]. By Lemma C.5 in Appendix C, and when ε satisfying the condition (15) for j min{d, mℓ}, we have κℓy,j = 1 ε 2λℓ,j n Aj + q A2 j εBj o, where Aj = PT ℓ=1 αℓλℓ,j1j dℓand Bj = PT ℓ=1 αℓ1j dℓ. 
Hence, we reach the conclusion of the theorem. A.9. Proof of Lemma 5.6 Let QX,Y = N([γx, γy] , ΣX,Y ) and QY = N(γy, Σy) be Gaussian distributions in Rm+n and Rn, respectively. Denote by q X, q Y , q X,Y , p Y and p X,Y the probability density functions of distributions QX, QY , QX,Y , PY and PX,Y , respectively, we need to show that KL PX,Y QX PY KL QX,Y QX QY . Expanding the KL divergence, we need to prove Z p X,Y (x, y) log p X,Y (x, y) log q X(x) log p Y (y) dxdy Z q X,Y (x, y) log q X,Y (x, y) log q X(x) log q Y (y) dxdy. Entropic Gromov-Wasserstein between Gaussian Distributions Similar to the derivation for equation (21) in the proof Lemma 3.3 with a note that q X,Y , q X and q Y are Gaussian distributions, we have Z p X,Y (x, y) log q X(x)dxdy = Z q X,Y (x, y) log q X(x)dxdy; Z p X,Y (x, y) log q Y (y)dxdy = Z q X,Y (x, y) log q Y (y)dxdy; Z p X,Y (x, y) log q X,Y (x, y)dxdy = Z q X,Y (x, y) log q X,Y (x, y)dxdy, since both sides are equal to the same function of the covariance matrices of PX,Y and QX,Y , which are both equal to ΣX,Y . Hence, the inequality is equivalent to Z p X,Y (x, y) log p X,Y (x, y) log p Y (y) dxdy Z p X,Y (x, y) log q X,Y (x, y) log q Y (y) dxdy Z p X,Y (x, y) log p X,Y (x, y)/p Y (y) q X,Y (x, y)/q Y (y) dxdy 0. By using the formula for conditional distributions, we get p X,Y (x, y) = p Y (y)p X|Y (x|y); q X,Y (x, y) = q Y (y)q X|Y (x|y). Then the left hand side is equal to Z p Y (y)p X|Y (x|y) log p X|Y (x|y) q X|Y (x|y)dxdy = Z y p Y (y) h Z x p X|Y (x|y) log p X|Y (x|y) q X|Y (x|y)dx i dy y p Y (y)KL PX|Y QX|Y dy which is not less than zero, since the KL term is non-negative. The inequality becomes equality when PX|Y = QX|Y . B. On detailed calculation of Υ In this appendix, we discuss how to derive a closed-form formulation for Υ in Theorem 4.2. It is equivalent to finding the minimizers of functions gε,τ,+ and hε,τ in Lemma B.1 and Lemma B.4, respectively. Lemma B.1 (Minimizer of gε,τ,+). Let gε,τ,+(x, y; a, b) := x2 + y2 + (τ + ε) x i+ + ε h log(xy) log ε gε,τ, 1(x, y; a, b) := x2 + y2 + (τ + ε) hx gε,τ,1(x, y; a, b) := x2 + y2 + (τ + ε) hx 2) + ε h log(xy) log ε (ex, ey) := arg min x,y>0 gε,τ, 1(x, y; a, b), (bx, by) := arg min x,y>0 gε,τ,1(x, y; a, b), (x , y ) := arg min x,y>0 gε,τ,+(x, y, ; a, b). Then if exey < ε 2, then (x , y ) = (ex, ey) where solutions are in Lemma B.2, and if otherwise, then (x , y ) = (bx, by), where solutions are in Lemma B.3. Entropic Gromov-Wasserstein between Gaussian Distributions Proof of Lemma B.1. By definition of functions gε,τ,1, gε,τ, 1 and gε,τ,+, we have gε,τ,+(x, y; a, b) = gε,τ, 1(x, y; a, b)1{xy< ε 2 } + gε,τ,1(x, y; a, b)1{xy ε We observe that both functions gε,τ,1 and gε,τ, 1 are convex functions. Moreover, we have gε,τ, 1(x, y; a, b) gε,τ,1(x, y; a, b) = 2 xy ε ε h log(xy) log ε ε/2 1 log xy It means that gε,τ, 1(x, y; a, b) gε,τ,1(x, y; a, b). Next, we prove that ε 2 cannot lie between bxby and exey. Assume the contradictory, then the segment connecting (ex, ey) to (bx, by) cuts the hyperpole xy = ε 2 at a point with coordinates denoted by (x, y). Since gε,τ,1 is a convex function, we have max n gε,τ,1(bx, by; a, b), gε,τ,1(ex, ey; a, b) o > gε,τ,1(x, y; a, b). (29) Moreover, we find that gε,τ,1(x, y; a, b) gε,τ,1(bx, by; a, b) gε,τ,1(x, y; a, b) = gε,τ, 1(x, y; a, b) gε,τ, 1(ex, ey; a, b) gε,τ,1(ex, ey; a, b). It contradicts inequality (29). Thus, both bxby and exey have to be greater or smaller than ε 2 at the same time. 
Given that, we have two separate cases: Case 1: They are both greater than ε gε,τ, 1(x, y; a, b) gε,τ, 1(ex, ey; a, b) gε,τ,1(ex, ey; a, b) gε,τ,1(bx, by; a, b). It means that bx = x and by = y . Case 2: Both of them are smaller than ε 2. For any (x, y) such that xy ε 2, there exists x, y lies in the segment connecting (x, y) to (bx, by) such that x y = ε 2. Since the function gε,τ,1 is convex, we find that tgε,τ,1(bx, by; a, b) + (1 t)gε,τ,1(x, y; a, b) gε,τ,1(x, y; a, b), where t = (bx,by) (x,y) 2 (x,y) (bx,by) 2 . However, gε,τ,1(bx, by; a, b) gε,τ,1(x, y; a, b), which implies that gε,τ,1(x, y; a, b) gε,τ,1(x, y; a, b) = gε,τ, 1(x, y; a, b) gε,τ, 1(ex, ey; a, b). It means that ex = x and ey = y . As a consequence, we obtain the conclusion of the lemma. Lemma B.2 (Minimizer of function gε,τ, 1). With a, b > 0, let gε,τ, 1(x, y; a, b) = x2 + y2 + (τ + ε) x b log(xy) log(ab) subjected to x, y > 0. Let (ex, ey) = arg minx,y>0 gε,τ, 1(x, y; a, b). Then, 2(τ + ε) + (τ + ε)2 4a2 , ey = τ + ε 2(τ + ε) + (τ + ε)2 Proof of Lemma B.2. 1. Taking the derivatives of gε,τ, 1 with respect to x and y, we get x = 2x + τ + ε x , gε,τ, 1 y = 2y + τ + ε Entropic Gromov-Wasserstein between Gaussian Distributions It follows that 2gε,τ, 1 x2 = 2 + τ + ε x2 , 2gε,τ, 1 x y = 0, 2gε,τ, 1 y2 = 2 + τ + ε The Hessian matrix of function gε,τ, 1 is positive definite, so gε,τ, 1 is a convex function. 2. Solving the equations gε,τ, 1 x = 0 and gε,τ, 1 y = 0, we have 2(τ + ε) + (τ + ε)2 4a2 , y = τ + ε 2(τ + ε) + (τ + ε)2 Therefore, we reach the conclusion of the lemma. Lemma B.3 (Minimizer of function gε,τ,1). For a, b > 0, let gε,τ,1(x, y; a, b) = x2 + y2 + ε h log(xy) log ε + (τ + ε) hx subjected to x, y > 0. Define (bx, by) = arg minx,y>0 gε,τ,1(x, y; a, b). Then, the value of (bx, by) is given in the following equations. ; by = τ + ε where bz is the unique positive root of the equation τz3 + h 8τ (τ + ε)2 i z2 + h 16τ 2(τ + ε)2 1 2i z 4(τ + ε)2 1 Proof of Lemma B.3. Taking the first derivatives of gε,τ,1 with respect to x and y , we have x = 2x 2y + τ + ε y = 2y 2x + τ + ε It follows that 2gε,τ,1 x2 = 2 + τ 2x2 , 2gε,τ,1 x y = 2, 2gε,τ,1 Thus, the Hessian matrix of this function is positive definite, which implies that gε,τ,1 is a convex function. The equations of the stationary point give us 2bx 2by + τ + ε bx = 0; 2by 2bx + τ + ε Taking the sum of both equations leads to 2(bx by) + τ + ε 2(by bx) + τ + ε Taking the difference between two equations in system (30), we get (bx by) 4 + τ Entropic Gromov-Wasserstein between Gaussian Distributions bxby , we have bx by = (τ + ε)1/b 1/a 4 + z . Take the product of two equations in system (30), we have h 2(bx by) + τ + ε ih 2(by bx) + τ + ε Substitute bx by = (τ + ε) 1/b 1/a 4+z to the above equation, we obtain (τ + ε)2h 2 1 i = τz(z + 4)2, which is equivalent to the following cubic equation t(z) := τz3 + h 8τ (τ + ε)2 i z2 + h 16τ 2(τ + ε)2 1 2i z 4(τ + ε)2 1 Since it is a cubic equation, it has either three real roots or one real root. In the first case, we know that it has at least one positive root, since the highest coefficient of the equation is positive. Assume that it has other two positive roots. By applying Viete s theorem, we get 8τ (τ + ε)2 ab < 0; 16τ (τ + ε)2 1 which implies that 1 b 2 < 2 ab, it is contradiction by Cauchy-Schwarz s inequality. Thus, the equation t(z) = 0 has unique positive root which is denoted by bz. With bz = τ bxby , we have bxby = τ bz . 
Substituting it to the system of equations (30), we have bx2 + τ + ε by2 + τ + ε This implies that , by = τ + ε As a consequence, we obtain the conclusion of the lemma. Lemma B.4 (Minimizer of function hε,τ). Given a constant a > 0, the function hε,τ(x; a) = x2 + (τ + ε) hx attains its minimum at x = τ + ε 16a2 + τ + ε Proof of Lemma B.4. Taking the first and second derivatives of function hτ,ε, we have x = 2x + τ + ε x2 = 2 + τ + ε Note that the second derivative of hτ,ε is positive, therefore, the positive minimizer of this function is exactly the positive solution of equation hτ,ε x = 0, which is 16a2 + τ + ε Entropic Gromov-Wasserstein between Gaussian Distributions C. Auxiliary results In this appendix, we provide additional lemmas that are used to derive the closed-form expressions of entropic (unbalanced) IGW between (unbalanced) Gaussian distributions. Lemma C.1 (KL divergence between Gaussian measures). Let α = mαNk(0, Σα) and β = mβNk(0, Σβ) be scaled Gaussian measures while α = Nk(0, Σα) and β = Nk(0, Σβ) be their normalized versions. Then, the generalized KL divergence between α and β is KL(α β) = mαKL(α β) + KL(mα mβ), where the KL divergence between α and β is tr ΣαΣ 1 β k + log det(Σβ) Proof of Lemma C.1. Let f(x, Σα) and f(x, Σβ) be the distribution functions of Nk(0, Σα) and Nk(0, Σβ), respectively. KL(α β) = Z Rk log mαf(x, Σα) dα(x) mα + mβ Rk log f(x, Σα) mαdα(x) + log mα = mαKL(α β) + KL(mα mβ). The expression for KL(α β) results from Lemma C.3. Hence, the proof is completed. Lemma C.2 (Double integrals). For the sake of simple computations, we derive some relations between the quadratic-KL and the standard KL as follows: (i) KL (α β) = 2mαKL(α β) + (mα mβ)2; (ii) KL (tα rβ) = t2KL (α β) + t2 log t2 m2 α + (r2 t2)m2 β, t, r > 0, for any positive measures α and β. Proof of Lemma C.2. Let pα and pβ be the Radon-Nikodym derivatives of α and β with respect to the Lebesgue measure. Then, we have Part (i) KL (α β) = Z Z log pα(x)pα(x ) pβ(x)pβ(x ) pα(x)pα(x )dxdx m2 α + m2 β Z log pα(x) pα(x)dx + mα Z log pα(x ) pα(x )dx m2 α + m2 β Z log pα(x) pα(x)dx m2 α + m2 β = 2mα[KL(α β) + mα mβ] m2 α + m2 β = 2mαKL(α β) + (mα mβ)2. Entropic Gromov-Wasserstein between Gaussian Distributions KL (tα rβ) = ZZ log t2pα(x)pα(x ) r2pβ(x)pβ(x ) t2pα(x)pα(x )dxdx + r2m2 β t2m2 α = t2 ZZ log pα(x)pα(x ) pβ(x)pβ(x ) pα(x)pα(x )dxdx + log t2 + r2m2 β t2m2 α = t2 ZZ log pα(x)pα(x ) pβ(x)pβ(x ) pα(x)pα(x )dxdx + m2 β m2 α + t2 log t2 m2 α + (r2 t2)m2 β = t2KL (α β) + t2 log t2 m2 α + (r2 t2)m2 β. As a consequence, we obtain the conclusion of the lemma. Lemma C.3 (KL divergence between Gaussians). The Kullback-Leibler divergence between Gaussian measures α = N(µα, Σα) and β = N(µβ, Σβ) on Rk is given by tr(Σ 1 β Σα) + (µβ µα) Σ 1 β (µβ µα) k + log det(Σβ) As a result, when Σβ, Σα are fixed, KL(α β) achieves its minimum value when µα = µβ. Proof of Lemma C.3. Let pα(x) and pβ(x) be the probability density functions of α and β, respectively. By the formula for KL divergence, we have KL(α β) = Z Rk log pα(x) pα(x)dx, (31) log(pα(x)) = 1 2[k log(2π) + log(det(Σα)) + (x µα) Σ 1 α (x µα)], log(pβ(x)) = 1 2[k log(2π) + log(det(Σβ)) + (x µβ) Σ 1 β (x µβ)], where π denotes a constant number pi only in this lemma. Plugging this result in equation (31), we get KL(α β) = 1 log det(Σβ) Rk[(x β) Σ 1 β (x β) (x α) Σ 1 α (x α)]pα(x)dx log det(Σβ) + (µβ µα) Σ 1 β (µβ µα) + Z Rk[(x µα) (Σ 1 β Σ 1 α )(x µα)]pα(x)dx . 
Note that Z Rk[(x µα) (Σ 1 β Σ 1 α )(x µα)]pα(x)dx = Z Rk tr((x µα)(x µα) (Σ 1 β Σ 1 α ))pα(x)dx Rk[(x µα)(x µα) (Σ 1 β Σ 1 α )]pα(x)dx = tr Σα(Σ 1 β Σ 1 α ) = tr(ΣαΣ 1 β ) k. Putting the above results together, we obtain the conclusion of this lemma. Lemma C.4 (Order solutions). Let {ai}m i=1 and {bj}m j=1 be positive decreasing sequences. Let {exi}m i=1 and {eyj}n j=1 be the minimizer of min xi>0,yj>0 D {xi}, {yj}; {ai}, {bj} := j=1 y2 j +(τ + ε) n m X ai log(xi) + bj log(yj) o 2 + + ε log(xkyk) log ε Entropic Gromov-Wasserstein between Gaussian Distributions Then (exi) and (eyj) are also decreasing sequences. Proof of Lemma C.4. We have two proofs for this lemma. While the first proof will be presented here, the second one can be found in Appendix D. Let us consider all permutations (bxi) and (byj) of (exi) and (eyj), respectively. The difference between D {bxi}, {byj}; {ai}, {bj} and D {exi}, {eyj}, {ai}, {bj} lies in the term i+ + ε h log(xkyk) log ε which is actually the minimum value of 2 tr K xy Kxy + 2εKL where its solution has to match order by magnitude {exi} to {bxi} and {eyj} to {byj} of the largest eigenvalues. The part Pm i=1 xi ai + Pn j=1 yj bj is minimized by the rearrangement inequality (cf. Lemma D.2). Therefore, we obtain the conclusion of the lemma. Lemma C.5. Let {ai}s i=1 and {bi}s i=1 be positive sequences. The problem of finding the maximum value of i=1 aixi o2 + 2 log(1 xi), (32) where xi [0, 1], is equivalent to finding the set S [s] in the below problem: n AS BS 2 AS + (A2 S BS) 1 2 o2 + X i S bi log(bi) BS log AS + (A2 S BS) 1 2 , where AS = P i S ai, BS = P i S bi and A2 S BS and bi 2ai AS + (A2 S BS) 1 2 ; i S. Proof of Lemma C.5. Taking the derivative of function f(x) = (ax + b)2 + c log(1 x) in [0, 1], we have f (x) = 2a(ax + b) c 1 x. The equation f (x) = 0 has either two solutions or no solution. It means that either f(x) is monotone or it has one minimum and one maximum. When x 1 , f(x) diverges to infinity. Thus, the xmax where function f attains its maximum is closer to 1 than the xmin, where f attains its minimum. In the second case, f is monotone, then f is monotone decreasing, because f(0) = 0 and f(1 ) = . Overall, ( 0 the larger solution of equation: 2a(ax + b) = c 1 x. Hence, if {exi} is a maximizer of the objective function, then either exi = 0 or exi is the larger solution of the first derivative system of equations. We consider the case that all exi are not equal to zero. By taking the derivatives of the function in equation (32) with respect to xi, we obtain j=1 ajexj o bi 2 1 1 exi = 0, 4 ai aiexi n s X j=1 ajexj o = bi, (33) j=1 aj, exj o n s X j=1 ajexj o = Entropic Gromov-Wasserstein between Gaussian Distributions Denote A = Ps j=1 aj and B = Ps j=1 bj. Then, by solving the quadratic equation 4X2 4AX + B = 0, we have j=1 ajexj = 1 Substituting it into equation (33), we obtain 4ai(1 exi) = 2bi A + exi = 1 bi 2ai(A + Then, the minimum value of the objective function in this case equals to i=1 ai 1 2(A + i=1 bi o2 + i=1 bi log(bi) log A + p We have thus proved the stated result. Lemma C.6 (Optimizers of some functions). (a) For positive constants a and c, the function f(x) = ax+c log(1 x) for x [0, 1) attains its minimum at x = h 1 c (b) For positive constants a and c, the function f(x) = 2ax 1 2 + c log(1 x) for x [0, 1) attains its maximizer at c2 4a2 + 1 c 2a i2 . (c) For positive constants a, b, c and Υ, the function f(x) = Υx+τKL(x a)+τKL(x b)+εKL(x c) attains its minimum at x = a τ 2τ+ε b τ 2τ+ε c ε 2τ+ε exp n Υ 2τ+ε o . 
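Part (c) of Lemma C.6 is the ingredient used later to recover the optimal mass of an unbalanced plan in closed form, so it is worth a quick numerical illustration before the proof. The following minimal sketch (not part of the original text) compares the stated minimizer with a direct one-dimensional minimization; it assumes the generalized scalar KL divergence KL(x‖a) = x log(x/a) − x + a, and the constants Υ, τ, ε, a, b, c are arbitrary illustrative values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl(x, a):
    # generalized KL divergence between positive scalars (assumed convention)
    return x * np.log(x / a) - x + a

def f(x, Ups, tau, eps, a, b, c):
    # objective of Lemma C.6, part (c)
    return Ups * x + tau * kl(x, a) + tau * kl(x, b) + eps * kl(x, c)

# arbitrary illustrative constants
Ups, tau, eps, a, b, c = 1.3, 0.7, 0.5, 2.0, 3.0, 1.5

# closed-form minimizer stated in part (c)
x_closed = (a ** (tau / (2 * tau + eps)) * b ** (tau / (2 * tau + eps))
            * c ** (eps / (2 * tau + eps)) * np.exp(-Ups / (2 * tau + eps)))

# direct numerical minimization for comparison
res = minimize_scalar(f, bounds=(1e-8, 50.0), method="bounded",
                      args=(Ups, tau, eps, a, b, c))
print(x_closed, res.x)  # the two values agree up to solver tolerance
```

The agreement of the two values reflects the first-order condition Υ + τ log(x/a) + τ log(x/b) + ε log(x/c) = 0, whose unique positive root is exactly the expression in part (c).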
Proof of Lemma C.6. For part (a), we have f (x) = a c 1 x which is a decreasing function when x [0, 1). It means that the function is convex. Hence its maximum attains at the equation f (x) = 0 or at the boundary. In fact, we have f (1 c/a) = 0. Therefore, the maximizer attains at max(0, 1 c/a). For part (b), we have f (x) = ax 1 2 c 1 x, which is also a decreasing function. Hence, f is concave, it means that the maximum attains at the solution of f (x) = 0 or the boundary. Solving the equation f (x) = 0 we obtain c2 4a2 + 1 c 2a i2 , which belongs to (0, 1). For part (c), by taking the derivative of f(x), we have f (x) = Υ + τ 2 log(x) log(ab) + ε log(x) log(c) Solving the equation f (x) = 0, we obtain x = exp nτ log(a) + τ log(b) + ε log(c) Υ o = a τ 2τ+ε b τ 2τ+ε c ε 2τ+ε exp n Υ Lemma C.7. Let (X, Y ) and (X , Y ) be two i.i.d random vectors in Rm Rn and follow a probability distribution π which have a mean vector (u, v) Rm n and a covariance matrix where Σx and Σy are covariance matrices of X and Y , respectively. Denote Z = X u, Z = X u and T = Y v, T = Y v, then Eπ π h X, X Y, Y i2 = Eπ π h Z, Z T, T i2 + 2u Σxu 4u Σxyv + 2v Σyv + h u 2 v 2i2 . Entropic Gromov-Wasserstein between Gaussian Distributions Proof of Lemma C.7. Firstly, we have X, X Y, Y = Z + u, Z + u T + v, T + v = Z, Z T, T + u, Z + Z v, T + T + u 2 v 2. Eπ π u, Z Z, Z = Eπ π h m X j=1 uj Zj i = i=1,j=1 uj Eπ π Zi Z i Zj = i=1,j=1 uj Eπ π Zi Zj Eπ π[Z i] = 0 Eπ π Z, Z = Eπ π h m X i=1 Zi Z i i = 0 Eπ π Z, Z v, T = Eπ π h m X j=1 vj Tj i = j=1 vj Eπ π Zi Z i Tj = j=1 Eπ π[Zi Tj]Eπ π[Z i] = 0. Similarly, we get Eπ π v, T T, T = Eπ π T, T = Eπ π T, T u, Z = 0. As a consequence, we can deduce that Eπ π h X, X Y, Y i2 = Eπ π h Z, Z T, T i2 + Eπ π h u, Z + Z v, T + T i2 + h u 2 v 2i2 . Next, it is sufficient to show that Eπ π h u, Z + Z v, T + T i2 = 2u Σxu + 2v Σyv 4u Σxyv. Indeed, we have Eπ π h u, Z + Z i2 = Eπ π h u (Z + Z )(Z + Z ) u i = u Var(Z + Z )u = 2u Σxu Eπ π h v, T + T i2 = Eπ π h v (T + T )(T + T ) v i = v Var(T + T )v = 2v Σyv Eπ π h u, Z + Z v, T + T i = u Eπ π h (Z + Z )(T + T ) i v = u Eπ π h ZT + Z (T ) i v = 2u Σxyv. Hence, the proof is completed. D. Another proof of Lemma C.4 In this appendix, we introduce another proof of Lemma C.4 using Karamata s inequality. Firstly, we introduce an inequality which can be considered to be a generalization of Karamata s inequality in a special case. Lemma D.1 (A quasi-Karamata inequality). Let a1 a2 . . . an 0 and b1 b2 . . . bn 0 which satisfy a1 b1 a1 + a2 b1 + b2 . . . a1 + a2 + . . . + an 1 b1 + b2 + . . . + bn 1 a1 + a2 + . . . + an b1 + b2 + . . . + bn Let f be a concave and non-increasing function in [0, + ). Then, we have f(a1) + f(a2) + . . . + f(an) f(b1) + f(b2) + . . . + f(bn). Entropic Gromov-Wasserstein between Gaussian Distributions Proof of Lemma D.1. We prove the Lemma by using induction. The Lemma is obvious when n = 1. Suppose that the problem is true for n 1. For i {1, 2, . . . , n}, let ci = a1 + a2 + . . . + ai b1 b2 . . . bi. Let k (1 k n) be the smallest index such that ck = min1 i n ci. Let b 1 = b1 + ck, then, we have b 1 b2 . . . bn. We consider two cases: If k = n, we have a1 b 1 a1 + a2 b 1 + b2 . . . a1 + a2 + . . . + an = b 1 + b2 + . . . + bn Applying Karamata s inequality, we have f(a1) + f(a2) + . . . + f(an) f(b 1) + f(b2) + . . . + f(bn) f(b1) + f(b2) + . . . + f(bn) If k < n. We have a1 b 1 a1 + a2 b 1 + b2 . . . a1 + a2 + . . . + ak = b 1 + b2 + . . . + bk. Applying Karamata s inequality, we have f(a1) + f(a2) + . . . 
+ f(ak) f(b 1) + f(b2) + . . . + f(bn) f(b1) + f(b2) + . . . + f(bk) Moreover, when i > k, we have ak+1 + . . . + ai bk+1 . . . bi = ci ck 0. Applying the inductive assumption for two arrays ak+1, . . . , an and bk+1, . . . , bn, we have f(ak+1) + f(ak+2) + . . . + f(an) f(bk+1) + f(bk+2) + . . . + f(bn). which implies f(a1) + f(a2) + . . . + f(an) f(b1) + f(b2) + . . . + f(bn). As a consequence, we obtain the conclusion of the Lemma. Lemma D.2 (Extension of rearrangement inequality). Let a1 a2 . . . an 0, b1 b2 . . . bn 0 be two decreasing array. Consider α and β are two injections from the set {1, 2, . . . , k} to {1, 2 . . . , n} in the set {1, 2, . . . , n}, where k is a positive integer such that k n, we have max aα(1)bβ(1) + . . . + aα(k)bβ(k) = a1b1 + a2b2 + . . . + akbk, and the maximum value is achieved when α, β are two permutation which satisfy aα(i) = ai, bβ(i) = bi i : 1 i k. Entropic Gromov-Wasserstein between Gaussian Distributions Proof of Lemma D.2. Let I = {i1, i2, . . . , it} be the indices which are not greater than k such that α(ir) > k, r : 1 r t. Let J = {j1, j2 . . . , jt} be the indices which are not greater than k such that α(r) / J, r : 1 r k. Consider the permutation α : {1, 2, . . . , k} {1, 2, . . . , k} which is defined by ( α(j), if j / I jr if j = ir . It is obviously that aα(1)bβ(1) + . . . + aα(k)bβ(k) a α(1)bβ(1) + . . . + a α(k)bβ(k). Similarly, there exist a permutation β : {1, 2, . . . , k} {1, 2, . . . , k} such that aα(1)bβ(1) + . . . + aα(k)bβ(k) a α(1)bβ(1) + . . . + a α(k)bβ(k) a α(1)b β(1) + . . . + a α(k)b β(k). Applying the rearrangement inequality, we have a α(1)b β(1) + . . . + a α(k)b β(k) a1b1 + a2b2 + . . . + akbk, which implies aα(1)bβ(1) + . . . + aα(k)bβ(k) a1b1 + a2b2 + . . . + akbk. From this, we can conclude our proof. Given the above lemmas, we have another proof of Lemma C.4. Another Proof of Lemma C.4. Let us consider a permutation (bxi) and (byi) of ( xi) and ( yi), respectively, such that (bxi) and (byi) are decreasing sequence. The difference between D {bxi}, {byj}; {ai}, {bj} and D {exi}, {eyj}, {ai}, {bj} lies in the part i+ + ε h log(xkyk) log ε Let f(x) = 2 x ε + + ε log(x) log ε . This function has non-positive and decreasing derivative ( 0, when x ε 2 ε x 2, when x ε Thus, f is concave function in [0. ). Moreover, according to Lemma D.2, we have (bxibyi) and ( xi yi)) satisfy the condition of Lemma D.1. Thus, applying this Lemma with the function f, we have the term i+ + ε h log(xkyk) log ε is minimized when xk and yk are decreasing sequence. The part Pm i=1 xi ai + Pn j=1 yj bj is minimized by the rearrangement inequality. Therefore, we obtain the conclusion of the lemma. E. Entropic IGW between Gaussian distributions with non-zero means In this appendix, by considering the problem of solving entropic Gromov-Wasserstein between balanced Gaussians in a special case, we show the difficulty to solve this problem in the general case. Let m be a positive integer. Let µ = N(θµ, Σµ) and ν = N(1, 1) be two Gaussian measures in Rm and R, respectively, where Σµ = diag a2 1, . . . , a2 m , θµ = (b1, . . . , bm) . Entropic Gromov-Wasserstein between Gaussian Distributions To find IGWε(µ, ν), we must find the joint distribution π between µ and ν which has the covariance matrix that minimizes the following term Eπ π n X, X Y, Y 2o + εKL π µ ν . Note that, in this case, Kµν is an m 1 matrix, thus, we can denote that Kµν = (x1, . . . , xm) . Firstly, we prove the following equality Eπ π n X, X Y, Y 2o = i=1 b2 i + 2 2 i=1 (xi + bi)2. 
(34) Indeed, we have Eπ π n X, X Y, Y 2o = Eπ π X, X 2 + Eπ π Y, Y 2 2Eπ π X, X Y, Y . Note that the independence between X and X and Y and Y leads to Eπ π X, X 2 = [Eπ π X, X ]2 + Varπ X, X = [ θµ, θµ ]2 + i=1 Varπ[Xi]Varπ[X i] Similarly, we have Eπ π Y, Y 2 = 2. Meanwhile, Eπ π X, X Y, Y = Eπ π n m X Xi X i Y Y o = i=1 Eπ π{Xi Y }Eπ{X i Y } = i=1 (xi + bi)2. Putting the above results together, we obtain the desired equality (34). By Lemma 3.3, the optimal transportation plan π is a Gaussian distribution, namely, π = N(θπ , Σπ ), where θπ = (b1, . . . , bn, 1) . Thus, according to Lemma C.3, the entropic term in the objective function reads εKL(π µ ν) = 1 2ε n tr Σπ Σ 1 µ ν (m + n) + log det(Σµ ν) From the above calculation, our problem turns out to find the non-negative real numbers x1, . . . , xm which maximize the following function ψ(x1, . . . , xm) = 2 i=1 (xi + bi)2 + ε Taking the derivatives of the above function with respect to xi, i [m], we get ψ xi = 4(xi + bi) εxi/a2 i 1 Pm i=1 x2 i a2 i Entropic Gromov-Wasserstein between Gaussian Distributions To solve the system of equations ψ xi = 0 for all i [m], we denote t = ε 1 Pm i=1 x2 i a2 i . Then, t = 4(xi + bi) a2 i xi , which implies that xi = 4a2 i bi t 4a2 i . By substituting this representation of xi to the expression t = ε 1 Pm i=1 x2 i a2 i , we obtain 16a2 i b2 i (t 4a2 i )2 After some transformations, we achieve the equation of degree 2m, which is difficult to solve in close-form. F. Revisiting the entropic optimal transport between Gaussian measures In this appendix, via the technique that we use to study the entropic Gromov Wasserstein between two Gaussians, we provide a simpler proof for deriving the close-form expressions of entropic optimal transport (OT) between (unbalanced) Gaussian measures than the proof that was presented in (Janati et al., 2020). F.1. Entropic optimal transport between balanced Gaussian measures Before moving to the main theorem of this section, we review some backgrounds on the entropic optimal transport problem between balanced Gaussian measures. Entropic Optimal Transport. Let α and β be two probability measures in Rd. Then, the entropic OT between them is defined as follows: OTε(α, β) := min π Π(α,β) Eπ X Y 2 + εKL(π α β), (35) where X α and Y β are independent random vectors in Rd such that (X, Y ) π, and ε > 0 denotes the entropyregularized parameter. Now, we are ready to show the closed-form expression of entropic OT between Gaussians in the following theorem. Theorem F.1. Let µ = N(u, Σµ) and ν = N(v, Σν) be two Gaussian measures in Rd, where Σµ and Σν are positive definite matrices. Then, the entropic optimal transport between µ and ν is given by OTε(µ, ν) = u v 2 2 + tr(Σµ) + tr(Σν) tr(Dε) + dε 2 (1 log ε) + log det Dε + ε where Dε = 4Σ 2 . Moreover, the optimal transportation plan admits the form π = N (u , v ) , Σπ with K µν = 1 Remark F.2. Note that if we replace ε by 2σ2 in the above result, we obtain Theorem 1 in (Janati et al., 2020). Proof of Theorem F.1. Note that if µ and ν are respective centered transformations of µ and ν, then OTε(µ, ν) = OTε( µ, ν) + u v 2 2. Therefore, it is sufficient to solve the case when u = v = 0d. Under this assumption, we have Eπ X Y 2 = Eπ X 2 + Eπ Y 2 2Eπ X, Y = tr(Σµ) + tr(Σν) 2 tr(Kµν). Entropic Gromov-Wasserstein between Gaussian Distributions Thus, the entropic optimal transport between µ and ν can be rewritten as OTε(µ, ν) = tr(Σµ) + tr(Σν) + min π Π(µ,ν) n εKL(π µ ν) 2 tr(Kµν) o . 
(36) By fixing the covariance matrix Σπ = , Lemma 3.3 indicates that the optimal solution must be a Gaussian measure with the form π = N(02d, Σπ). Let UµνΛ 1 2µνV µν be the SVD of matrix Σ 1 1 2µν = diag(κ 1 2 µν,i)d i=1. It follows from Lemma C.1 and Lemma 3.4 part (b) that εKL(π µ ν) = ε tr(ΣπΣ 1 µ ν) 2d + log det(Σµ ν) i=1 log(1 κµν,i). Let UΛ 1 2 V be the SVD of Σ 1 2µ, where Λ 1 2 = diag(λ 1 2 µν,i)d i=1. By using von Neumann s inequality (Kristof, 1969; Horn & Johnson, 1991), we have tr(Kµν) = tr(Σ 1 2ν ) = tr(UµνΛ 1 2µ) = tr([V UµνΛ 1 2µν][V µνUΛ 1 2 ]) The equality occurs when Uµν = V and Vµν = U. Collecting the above two results, the problem in equation (36) is reduced to OTε(µ, ν) = tr(Σµ) + tr(Σν) max κµν,i (0,1) 1 2 µν,i + ε 2 log(1 κµν) i . Note that the function f(x) = 2ax 1 2 + ε 2 log(1 x) attains its maximum at x = ε+ ε2+16a2 4a 2 . Then, the optimal solution of the above problem is κ µν,i = ε2+16λµν,i 2 16λµν,i , which leads to OTε(µ, ν) = tr(Σµ) + tr(Σν) 1 ε2 + 16λµν,i + ε log ε2 + 16λµν,i) 8λµν,i = tr(Σµ) + tr(Σν) + dε 2 (1 log ε) 1 ε2 + 16λµν,i ε ε2 + 16λµν,i 8λµν,i = tr(Σµ) + tr(Σν) + dε 2 (1 log ε) 4 + 4λµν,i + ε = tr(Σµ) + tr(Σν) + dε 2 (1 log ε) tr h 4Σ + log det h 4Σ The last equality results from the fact that (λµν,i)d i=1 are eigenvalues of Σ 1 2µ. Additionally, ε2 + 16λµν,i 1 2µV diag h ε2 4λµν,i + 4 i 1 1 2µV diag λ 1 Entropic Gromov-Wasserstein between Gaussian Distributions Recall that Σ 1 2µ = UΛ 1 2 V , or equivalently U Σ 1 2ν = Λ 1 2 V Σ 1 2 µ . Then, we have 1 2µV diag h ε2 4λµν,i + 4 i 1 1 2µ(UΛ 1 2 V ) 1Σ 1 2µV diag hε2 4 + 4λµν,i i 1 As a result of SVD property, we get V diag(λµν,i)d i=1V = Σ 1 2µ. Thus, V diag ε2 4 + 4λµν,i d 1 2µ, which follows that 4 + 4λµν,i i 1 Consequently, we obtain Hence, the proof is completed. F.2. Entropic optimal transport between unbalanced Gaussian measures Prior to presenting the closed-form expression of entropic optimal transport between unbalanced Gaussian measures, let us review the formulation of entropic unbalanced optimal transport (UOT). Entropic Unbalanced Optimal Transport. When either α or β is not a probability measure, the formulation in equation (35) is no longer valid. One solution to deal with this issue is using entropic UOT, which is given by UOTε,τ(α, β) := min π M+(Rd Rd) Eπ X Y 2 + τKL(πx α) + τKL(πy β) + εKL(π α β), where X, Y are independent random vectors in Rd such that (X, Y ) π while ε, τ > 0 are regularized parameters, and πx, πy are the marginal distributions of the coupling π corresponding to α and β, respectively. Consider two unbalanced Gaussian measures in Rd: µ = mµN(u, Σµ) and ν = mνN(v, Σν), (37) where mµ, mν > 0 are their masses and Σµ, Σν are positive definite matrices. To state our main result, we need to define some quantities which are necessary for our analysis. Firstly, let us denote Σµ,η := ηΣ 1 µ + Id; Σν,η := ηΣ 1 ν + Id, where η := τ+ε 2 . Next, we define 4 Id +τ(2η)B B i 1 2 Id , where B := Σ 1 Υu ,v := (u v) Id (Σµ + Σν) h η Id +Σ 1 µ + Σ 1 ν i 1 (u v), ΥΣ := 2η n 1 log(η) + log det(A) log det(B) 1 2 log det(ΣµΣν) o ε 2 log det Id A2(B B) 1 , Υ := Υu ,v + ΥΣ + 2ηd. Entropic Gromov-Wasserstein between Gaussian Distributions Theorem F.3. Let µ and ν be two unbalanced Gaussian measures given in equation (37), the entropic UOT between µ and ν can be computed as UOTε,τ(µ, ν) = mπ Υ + εKL(mπ mµmν) + τ[KL(mπ mµ) + KL(mπ mν)]. where mπ = (mµmν) τ+ε 2τ+ε exp n Υ 2τ+ε o . 
Moreover, the optimal transport plan is also an unbalanced Gaussian measure which admits the form π = mπ N (u , v ) , Σ xy , where u = u Σµ h η Id +Σµ + Σν i 1 (u v); v = v + Σν h η Id +Σ 1 µ + Σ 1 ν i 1 (u v); 2 µ,ηA(B B) 1 2 µ,η; Σ y = ηΣ 1 2 ν,η A(B B) 1 K xy = ηΣ 1 2 µ,ηA(Id A) 1(B B) 1 Proof of Theorem F.3. We assume that π is a positive measure such that π = mππ with π is a probability measure with mean (ux, vy) and covariance matrix Here the two marginals of π are denoted by πx and πy where πx has mean ux and covariance matrix Σx while πy has mean vy and covariance matrix Σy. Let πx and πy be two marginals of π, then πx = mππx and πy = mππy. By Lemma 3.3, π needs to be Gaussian. We also have Eπ X Y 2 = mπ n ux vy 2 + tr(Σx) + tr(Σy) 2 tr(Kxy) o . According to Lemma C.1, KL(πx µ) = mπKL(πx µ) + KL(mπ mµ), KL(πy ν) = mπKL(πy ν) + KL(mπ mν), KL(π µ ν) = mπKL(π µ ν) + KL(mπ mµmν). Combining the above results, the entropic optimal transport between µ and ν reads as UOTε,τ(µ, ν) = min π M+(Rd Rd) n mπΥ + εKL(mπ mµmν) + τ[KL(mπ mµ) + KL(mπ mν)] o , where Υ := ux vy 2 + tr(Σx) + tr(Σy) 2 tr(Kxy) + εKL(π µ ν) + τ KL(πx µ) + KL(πy ν) . We divide our proof into two main steps. The first step is to minimize Υ, and the second step is to find the optimal scale mπ . Step 1: Minimization of Υ For the three KL terms, we have KL(πx µ) = 1 n tr(ΣxΣ 1 µ ) d log det(Σx) + log det(Σµ) o + 1 2(ux u) Σ 1 µ (ux u), KL(πy ν) = 1 n tr(ΣyΣ 1 ν ) d log det(Σy) + log det(Σν) o + 1 2(vy v) Σ 1 ν (vy v), KL(π µ ν) = 1 n tr(ΣxΣ 1 µ ) d log det(Σx) + log det(Σµ) o + 1 n tr(ΣyΣ 1 ν ) d log det(Σy) + log det(Σν) o 2(ux u, vy v) Σ 1 µ ν(ux u, vy v) 1 i=1 log(1 κxy,i), Entropic Gromov-Wasserstein between Gaussian Distributions 1 2 xy,i is the i-th largest singular value of Σ 1 2 y for all i [d]. Here we use the equation tr(ΣπΣ 1 µ ν) = tr(ΣxΣ 1 µ ) + tr(ΣyΣ 1 ν ) and results from part (b) of Lemma 3.4. Put the results together, we get Υ = tr(Σx) + tr(Σy) 2 tr(Kxy) + η n tr(ΣxΣ 1 µ ) log det(Σx) + tr(ΣyΣ 1 ν ) log det(Σy) o i=1 log(1 κxy,i) + ux vy 2 + 1 n τ(ux u) Σ 1 µ (ux u) + τ(vy v) Σ 1 ν (vy v) + ε(ux u, vy v) Σ 1 µ ν(ux u, vy v) o + 2ηd. The means part: We first work with the terms involving ux and vy. Let ux u = eux and vy v = evy, then (eux, evy) Σ 1 µ ν(eux, evy) = eu x Σ 1 µ eux + evyΣ 1 ν evy ux vy 2 = eux evy + u v 2 = eux 2 + evy 2 2eu x evy + 2eu x (u v) 2ev y (u v) + u v 2. Hence, sum of all terms which include ux and vy is equal to Υu,v :=eu x ηΣ 1 µ + Id eux + ev y ηΣ 1 ν + Id evy 2eu x evy + 2eu x (u v) 2ev y (u v) + u v 2 =eu x Σµ,ηeux + ev y Σν,ηevy 2eu x evy + 2eu x (u v) 2ev y (u v) + u v 2. (38) Taking the derivative with respect to eux and evy of Υu,v, and set the equation to zero we obtain Σµ,ηeux evy + (u v) = 0, Σν,ηevy eux (u v) = 0. Adding two equations we get (ηΣ 1 µ + Id)eux evy + (ηΣ 1 ν + Id)evy eux = 0, Σ 1 µ eux = Σ 1 ν evy. Replace them into the above equation, we obtain Σµ,ηeux + ΣνΣ 1 µ eux = (u v) eux = h Σµ,η + ΣνΣ 1 µ i 1 (u v) = Σµ h η Id +Σµ + Σν i 1 (u v). Similarly, we obtain evy = Σν h η Id +Σ 1 µ + Σ 1 ν i 1 (u v). The value of Υu,v at the minimizer is computed as follow eu x Σµ,ηeux eu x evy = eu x (u v) ev y Σν,ηevy eu x evy = ev y (u v). Adding them together eu x Σµ,ηeux + ev y Σν,ηevy 2eu x evy = (eux evy) (u v). Put it back to equation (38) Υu ,v = (eux evy) (u v) + u v 2 = (u v) Id (Σµ + Σν) h η Id +Σ 1 µ + Σ 1 ν i 1 (u v). 
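Before the covariance part of the proof, the means part admits a quick numerical sanity check. The sketch below (not from the paper; the covariances, means, and the values of τ and ε are arbitrary) forms the centred optimizers ũx = −Σµ(η Id + Σµ + Σν)⁻¹(u − v) and ṽy = Σν(η Id + Σµ + Σν)⁻¹(u − v) with η = (τ + ε)/2, matching u* = u − Σµ[η Id + Σµ + Σν]⁻¹(u − v) and v* = v + Σν[η Id + Σµ + Σν]⁻¹(u − v) in the statement of Theorem F.3, and verifies that they satisfy the stationarity system Σµ,η ũx − ṽy + (u − v) = 0 and Σν,η ṽy − ũx − (u − v) = 0 derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau, eps = 3, 0.8, 0.4
eta = (tau + eps) / 2.0                     # eta = (tau + eps) / 2, as in the proof

# arbitrary positive definite covariances and arbitrary means
A = rng.normal(size=(d, d)); Sigma_mu = A @ A.T + np.eye(d)
B = rng.normal(size=(d, d)); Sigma_nu = B @ B.T + np.eye(d)
u, v = rng.normal(size=d), rng.normal(size=d)

I = np.eye(d)
Sigma_mu_eta = eta * np.linalg.inv(Sigma_mu) + I    # Sigma_{mu, eta}
Sigma_nu_eta = eta * np.linalg.inv(Sigma_nu) + I    # Sigma_{nu, eta}

# centred optimal means from the "means part" of the proof
M = np.linalg.inv(eta * I + Sigma_mu + Sigma_nu)
u_tilde = -Sigma_mu @ M @ (u - v)
v_tilde = Sigma_nu @ M @ (u - v)

# residuals of the stationarity system; both should vanish numerically
r1 = Sigma_mu_eta @ u_tilde - v_tilde + (u - v)
r2 = Sigma_nu_eta @ v_tilde - u_tilde - (u - v)
print(np.max(np.abs(r1)), np.max(np.abs(r2)))
```

Both residuals vanish because adding the two stationarity equations gives Σµ⁻¹ũx = −Σν⁻¹ṽy, which is exactly the relation used above to eliminate ṽy and solve for ũx.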
Entropic Gromov-Wasserstein between Gaussian Distributions The covariance matrix part: The second part is to group all factors of Σx, Σy and κxy,i, which is ΥΣ := tr(Σx) + tr(Σy) 2 tr(Kxy) + η n tr(ΣxΣ 1 µ ) log det(Σx) + tr(ΣyΣ 1 ν ) log det(Σy) i=1 (1 κxy,i). (39) Grouping terms which have the same factors, the above quantity is rewritten as ΥΣ = tr(ΣxΣµ,η) + tr(ΣyΣν,η) 2 tr(Kxy) η h log det(Σx) + log det(Σy) i=1 log(1 κxy,i). (40) Let UxyΛxy V xy be SVD of Σ 1 2 y and denote 1 2x Uxy = X; V xyΣ 1 2y = Y. (41) Then we get XX = Σx; Y Y = Σy; Λxy = diag κ Note that if we fix Σx, Σy, Λxy, then we only need to maximize the term tr(Kxy), which is equal to tr(Λxy Y X). We could choose Uxy and Vxy such that Y X = diag(λxy,i) where λxy,i is the ith singular value of Σ 1 2x . By von Neumann s trace inequality (Kristof, 1969; Horn & Johnson, 1991), it is where the term tr(Kxy) attains its maximum. Hence, Y X is a diagonal matrix when the objective function ΥΣ attains its minimum. We also note that the form Σ 1 2x Uxy and V xyΣ 1 2y are the QR decompositions of X and Y , respectively. Hence there is no condition on X and Y , except that they are invertible. Then ΥΣ now is equal to ΥΣ = tr(XX Σµ,η) + tr(Y Y Σν,η) 2 tr(XΛxy Y ) 2η h log det(X)det(Y ) i ε i=1 log(1 κxy,i) + const. Taking the derivative with respect to X and Y (since in a small local neighbourhood, they are still invertible), we get the following system of equations 2Σµ,ηX 2Y Λ xy 2η(X 1) = 0 (42) 2Y Σν,η 2Λ xy X 2η(Y 1) = 0. (43) We multiply the first equation with X on the left side and the second equation with Y on the right side, X Σµ,ηX = (Y X) Λxy + η Id (44) Y Σν,ηY = Λ xy(Y X) + η Id . (45) The right hand side of (44) and (45) are diagonal matrices, since Λxy and Y X are diagonal matrices. It means that X Σµ,ηX is also a diagonal matrix. Denote (Y X) Λxy + η Id = Λxy,η := diag(λxy,η,i). (46) X Σµ,ηX = X Uµ,ηΛµ,ηU µ,ηX = Λxy,η h Λ 2 xy,ηX Uµ,η i Λµ,η h U µ,ηXΛ 1 1 2µ,η i = Λµ,η, where Uµ,ηΛµ,ηU µ,η and Uν,ηΛν,ηU ν,η are the spectral decomposition of Σµ,η and Σν,η, respectively. The equation has the form AΛA = Λ, with Λ is a diagonal matrix. Let AΛ 1 2 = Λ 1 2 U, then Λ 1 2 UU Λ 1 2 = AΛA = Λ. It means that Entropic Gromov-Wasserstein between Gaussian Distributions UU = Id, then U is unitary matrix. Thus we obtain A = Λ 1 2 UΛ 1 2 . Replace A by h Λ 2 xy,ηX Uµ,η i , there exists unitary matrix U such that 2 xy,ηX Uµ,η = Λ 2 xy,ηX Uµ,η = UΛ 1 1 2xy,ηUΛ 1 X = Uµ,ηΛ 1 1 2xy,η. (47) Similarly, there exists an unitary matrix V such that 1 2xy,ηV Λ 1 2 ν,η U ν,η. (48) Combine the results with a note that Uµ,η = Uµ and Uν,η = Uν, we obtain 1 2xy,ηV Λ 1 2 ν,η U ν,ηUµ,ηΛ 1 (Y X)Λ 1 xy,η = V Λ 1 2 ν,η U ν UµΛ 1 2 µ,ηU := diag(αi)d i=1. (49) Here, αi is the ith singular values of UνΛ 1 2 ν,τ,εU ν UµΛ 1 2 µ,ηUµ, where U and V could be chosen such that 2 ν,η U ν UµΛ 1 2 µ,τ,εU is a diagonal matrix diag(αi)d i=1. By the von Neumann s trace inequality (Kristof, 1969; Horn & Johnson, 1991), tr(Kxy) = tr(Λxy Y X) 1 2 xy,iλxy,η,i. The equality happens when (κxy,i), (αi) and (λxy,η,i) are decreasing sequences with V Λ 1 2 ν,η U ν UµΛ 1 2 µ,ηU is diagonal. Substitute it into the equation (46) with Λxy,η := diag(λxy,η,i)d i=1 λxy,η,i = η + αiκ 1 2 xy,iλxy,η,i λxy,η,i = η h 1 αiκ 1 2 xy,i i 1 . (50) Combine it with equations (44) and (46), we get det(X Σµ,ηX) = det(X)2det(Σµ,η) = ηd d Y 1 2 xy,i 1. log det(X) = 1 i=1 log 1 αiκ 1 2 xy,i + const. (51) Similarly, we also have log det(Y ) = 1 i=1 log 1 αiκ 1 2 xy,i + const. 
(52) Put them (44), (45), (51), (52) together, with Σ to be the optimal solution of X, Y , etc, we obtain i=1 log 1 αiκ 1 2 xy,i) ε i=1 log(1 κxy,i) + const. Entropic Gromov-Wasserstein between Gaussian Distributions Let us consider the function for α, t (0, 1) f(t) = 2η log 1 αt 1 2 ε 2 log(1 t). Taking derivative with respect to t we get f (t) = 2ηα 2(1 αt 1 2 )t 1 2 + ε 2 1 1 t = ε(1 αt 1 2 )t 1 2 2ηα(1 t) 2(1 αt 1 2 )t 1 2 (1 t) = (2η ε)αt + εt 1 2 2ηα 2(1 αt 1 2 )t 1 2 (1 t) The nominator is a quadratic function of t 1 2 , which is increasing function taking negative value with t = 0 and positive value when t = 1, since 0 < α < 1. Thus, it has only one solution ε2 + 8τηα2 ε ε2/4 + 2τηα2 ε/2 It also deduces that if α increases, the t also increases. Substituting the result into the equation (50), we obtain λxy,η,i = η 1 2 xy,i = η ε2/4+τ(2η)α2 i ε/2 τ Due to equation (49), we get αiλxy,η,i = (Y X)i,i. Combine with equation (46), we get λxy,i = λxy,η,i η αiλxy,η,i . (54) Covariance matrix: Recall from equations (47) and (48), we have X = Uµ,ηΛ 1 1 2xy,η; Y = Λ 1 2xy,ηV Λ 1 2 ν,η U ν,η. Since equation (41), V Λ 1 2 ν,η U ν UµΛ 1 2 µ,ηU is a diagonal matrix. It means that (V U ν )UνΛ 1 2 ν,η U ν UµΛ 1 2 µ,ηU µ (UµU) = (V U ν )(Σ 1 2 µ,η)(UµU ) = diag(αi)d i=1, which means that the αi are the singular values of Σ 1 2 µ,η. Then 2 µ,η = UνV diag(αi) UU µ = Uµν,ηdiag(αi)V µν,η which is a SVD of Σ 1 2 µ,η, where UµU = Vµν,η and V U ν = U µν,η. Put them into the equation (41) Kxy = XΛxy Y = UµΛ 1 1 2xy,ηΛxyΛ 1 2xy,ηV Λ 1 2 ν,η U ν = Σ 1 2 µ,η UµU Λ 1 2xy,ηΛxyΛ 1 2xy,η V U ν Σ 1 Vµν,ηdiag λxy,η,i η 2 µ,ηA(Id A) 1(B B) 1 2 ν,η . (55) We could obtain it from B = Uµν,ηdiag(αi)d i=1V µν,η B B = Vµν,ηdiag(α2 i )d i=1V µν,η τ Vµν,ηdiag q ε2/4 + τ(2η)α2 i ε/2 d 4 Id +τ(2η)B B i 1 η(Id A) 1 = Vµν,ηdiag(λxy,η,i)d i=1V µν,η η (Id A) 1 Id (B B) 1 2 = Vµν,ηdiag λxy,η,i η i=1 V µν,τ. Entropic Gromov-Wasserstein between Gaussian Distributions Multiply both side with B 1 = Vµν,ηdiag(αi)U µν,η, we obtain the formula in (55). For the Σx, multiply with X to the right side of equation (42), we have Σµ,ηXX Y Λ xy X η(X 1) X = 0 Σµ,ηΣx = (XΛxy Y ) + η Id Σx = Σ 1 µ,η(K xy + η Id). (56) Another way to derive Σx is to use equation (47) of X and equation (53) of λxy,η Σx = XX = UµΛ 1 2 µ,ηU Λxy,ηUΛ 1 2 µ,ηU µ = Σ 1 2 µ,ηUµU Λxy,ηUU µ Σ 1 2 µ,ηVµν,ηΛxy,ηV µν,ηΣ 1 2 µ,ηA(B B) 1 2 µ,η, (57) where the last equation is obtained from the equation (54) Vµν,ηdiag(λxy,i)d i=1V µν,η = η (Id A) 1 Id (Id A)(B B) 1 2 = η Id (Id A) (B B) 1 = ηA(B B) 1 Similarly for equation (43) Y Y Σν,η Y Λ xy X η Id = 0 Σy = (K xy + η Id)Σ 1 ν,η. (58) or we could derive another way to obtain Σy = ηΣ 1 2 ν,η A(B B) 1 Next we calculate the ΥΣ . Since equations (56) and (58), we have tr(ΣxΣµ,η) + tr(ΣyΣν,η) 2 tr(Kxy) = 2η. We also have i=1 (1 κxy,i) = det Id diag(κxy,i)d i=1 Vµν,ηdiag(κxy,i)d i=1V µν,η = A2(B B) 1. It means that Pd i=1 log(1 κxy,i)d i=1 = log det Id A2(B B) 1 . Put all the results into equation (40), we obtain ΥΣ = tr(ΣxΣµ,η) + tr(ΣyΣν,η) 2 tr(Kxy) η h log det(Σx) + log det(Σy) i=1 log(1 κxy,i) ΥΣ = 2η 2η n log(η) + log det(A) log det(B) 1 2 log det(Σµ) 1 2 log det(Σν) o 2 log det Id A2(B B) 1 = 2η n 1 log(η) + log det(A) log det(B) 1 2 log det(ΣµΣν) o ε 2 log det Id A2(B B) 1 . Step 2: Calculation of mπ We need to find the minimizer of the following problem: min mπ>0 mπΥ + τKL(mπ mµ) + τKL(mπ mν) + εKL(mπ mµmν). By part (c) of Lemma C.6, we obtain τ 2τ+ε ν (mµmν) ε 2τ+ε exp n Υ Hence, we have thus proved our claims.
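As a closing illustration (not part of the original text), the intermediate reduction obtained in the proof of Theorem F.1 for zero-mean Gaussians, namely OTε(µ, ν) = tr(Σµ) + tr(Σν) − max over κi ∈ (0, 1) of Σi [2 λµν,i^{1/2} κi^{1/2} + (ε/2) log(1 − κi)], is easy to evaluate numerically. The sketch below computes it once with the per-eigenvalue closed-form maximizer κi* = ((√(ε² + 16 λµν,i) − ε)/(4 λµν,i^{1/2}))², obtained from Lemma C.6 part (b), and once by maximizing each one-dimensional term with a generic solver; the covariance matrices are arbitrary illustrative inputs.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
d, eps = 3, 0.5

# arbitrary zero-mean Gaussians with positive definite covariances
A = rng.normal(size=(d, d)); Sigma_mu = A @ A.T + np.eye(d)
B = rng.normal(size=(d, d)); Sigma_nu = B @ B.T + np.eye(d)

# eigenvalues lambda_{mu nu, i} of Sigma_mu^{1/2} Sigma_nu Sigma_mu^{1/2}
w, V = np.linalg.eigh(Sigma_mu)
S = V @ np.diag(np.sqrt(w)) @ V.T           # symmetric square root of Sigma_mu
lam = np.linalg.eigvalsh(S @ Sigma_nu @ S)

def term(kappa, a):
    # one-dimensional objective 2 a sqrt(kappa) + (eps / 2) log(1 - kappa)
    return 2.0 * a * np.sqrt(kappa) + 0.5 * eps * np.log(1.0 - kappa)

a = np.sqrt(lam)

# (i) closed-form per-eigenvalue maximizer
kappa_star = ((np.sqrt(eps ** 2 + 16.0 * lam) - eps) / (4.0 * a)) ** 2
ot_closed = (np.trace(Sigma_mu) + np.trace(Sigma_nu)
             - sum(term(k, ai) for k, ai in zip(kappa_star, a)))

# (ii) generic one-dimensional maximization for comparison
best = []
for ai in a:
    res = minimize_scalar(lambda k: -term(k, ai),
                          bounds=(1e-10, 1.0 - 1e-10), method="bounded")
    best.append(-res.fun)
ot_num = np.trace(Sigma_mu) + np.trace(Sigma_nu) - sum(best)

print(ot_closed, ot_num)  # both evaluations agree up to solver tolerance
```

The per-eigenvalue decoupling mirrors the structure used throughout this appendix: once von Neumann's trace inequality aligns the singular vectors, every closed form reduces to a family of scalar optimization problems indexed by the eigenvalues.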