# Multiobjective Distribution Matching

Xiaoyuan Zhang 1 Peijie Li 2 Yingying Yu 1 Yichi Zhang 3 Han Zhao 4 Qingfu Zhang 1

Abstract. Distribution matching is a core concept in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A closely related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution. In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions. Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions.

1. Introduction

Distribution Matching (DM) is a fundamental concept in machine learning with rich applications across multiple areas, including generative modeling (Goodfellow et al., 2014; Ho and Ermon, 2016; Li et al., 2015), domain adaptation (Baktashmotlagh et al., 2016; Ganin et al., 2016; Gong et al., 2024; Tachet des Combes et al., 2020; Zhao et al., 2018), causal representation learning (Johansson et al., 2016; Shalit et al., 2017), and algorithmic fairness (Zhang et al., 2018; Zhao et al., 2019b; 2022), just to name a few. A typical DM problem is formulated as $\min_\theta D(p_\theta \| p)$, where $p$ denotes the target distribution and $p_\theta$ is the distribution to be optimized.

1 Department of Computer Science, City UHK. 2 Department of Mathematics, HKU. 3 Department of Statistics, IU. 4 Department of Computer Science, UIUC.
Correspondence to: Qingfu Zhang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Alternatively, the problem can also be expressed in its inverse form, $\min_\theta D(p \| p_\theta)$, since the divergence $D(\cdot \| \cdot)$ is not necessarily symmetric. This classical distribution matching problem has been extensively studied, with various approaches proposed for computing the divergence, such as the Wasserstein distance $W(p_\theta, p)$ (Villani et al., 2009) and f-divergence minimization (Nowozin et al., 2016). A broadly applicable but underexplored problem addressed in this paper is aligning a single distribution with multiple distributions simultaneously, formulated as:

$$\min_\theta f(\theta) = (f_1(\theta), \ldots, f_m(\theta)) = (D(p_\theta \| p_1), \ldots, D(p_\theta \| p_m)), \quad (1)$$

where each objective can also be presented in its inverse form, $D(p_i \| p_\theta)$. We refer to this problem as multiobjective distribution matching (MODM). MODM has broad applications in machine learning, including the controllable generation of intermediate distributions (e.g., images, drugs, speeches) between multiple underlying distributions, the development of models that balance multiple domain adaptations (Han and Pimentel, 2024; Jin et al., 2020; Wu et al., 2021), multi-source domain adaptation (Wen et al., 2020; Zhao et al., 2018), and group fairness with multiple sub-populations (Chen et al., 2023; Xian and Zhao, 2024; Xian et al., 2023).

This paper presents two approaches to study the MODM problem. The first approach assumes that the distributions $p_\theta, p_1, \ldots, p_m$ belong to a specific distribution family, such as the exponential family. Under this condition, we first investigate a more general case: how to generate Pareto solutions when the decision space is a dually flat manifold endowed with a Riemannian metric. Building on these general results, we characterize the Pareto set for multivariate normal (MVN) distributions under both Kullback-Leibler (KL) divergence and inverse KL divergence as illustrative examples.
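As a concrete toy instance of Equation (1) (our illustration, not part of the paper; the parameter values are arbitrary), the sketch below evaluates the MODM objective vector for a one-dimensional Gaussian $p_\theta$ against two Gaussian targets, using the closed-form KL divergence between univariate normals:

```python
import numpy as np

def kl_normal(mu_a, var_a, mu_b, var_b):
    """Closed-form KL(N(mu_a, var_a) || N(mu_b, var_b)) for 1-D Gaussians."""
    return 0.5 * (np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def modm_objectives(mu, var, targets):
    """Objective vector (D_KL(p_theta || p_1), ..., D_KL(p_theta || p_m))."""
    return np.array([kl_normal(mu, var, m, v) for (m, v) in targets])

targets = [(0.0, 1.0), (4.0, 2.0)]      # two conflicting target distributions
f = modm_objectives(2.0, 1.5, targets)  # candidate p_theta = N(2, 1.5)
print(f)  # both objectives are positive; reducing one increases the other
```

Matching one target exactly drives its objective to zero while the other stays positive, which is precisely the conflict that makes the problem multiobjective.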
Furthermore, we highlight a direct application of generating Pareto-optimal MVN distributions in the context of the multiobjective Variational Autoencoder (MOVAE) algorithm. This algorithm learns multiple latent MVN distributions and employs non-linear decoders to map them to complex real-world distributions. Experimental results demonstrate the effectiveness of the proposed method in this context. The contributions of this paper are summarized as follows:

1. We investigate the geometric structure of the multiobjective distribution matching problem with the tools of information geometry. We derive the explicit form of the Pareto set on a dually flat manifold under the canonical divergence. We then give the shape of the PS for the entire exponential family and use multivariate normal (MVN) distributions as a specific example.

2. Beyond the exponential family, we derive the form of the Pareto set under the $\alpha$-divergence $D_\alpha(\cdot \| \cdot)$. Based on our theoretical results, we design an algorithm called the multiobjective variational autoencoder (MOVAE) to generate interpolated images between multiple distributions.

3. We evaluate the performance of the proposed multiobjective distribution matching method not only on synthetic distributions but also on real-world image distributions. The proposed method can generate high-quality image distributions among multiple underlying distributions.

2. Related Work

Since this paper is related to both gradient-based multiobjective optimization (MOO) and distribution matching, we briefly discuss these two topics separately.

2.1. Gradient-based MOO

Gradient-based MOO methods have gained growing popularity with the successful application of the Multiple Gradient Descent Algorithm (MGDA) (Sener and Koltun, 2018) in deep multitask learning. These methods find the Pareto set (PS) using two strategies: identifying a diverse set of Pareto solutions, or modeling the entire PS directly.
Notable works include Pareto Multi-Task Learning (PMTL) (Lin et al., 2019), which finds solutions restricted to specific objective regions; Exact Pareto Optimization (EPO) (Mahapatra and Rajan, 2020; 2021), Weighted Chebyshev (WC)-MGDA (Momma et al., 2022), and Preference-based MGDA (PMGDA) (Zhang et al., 2024), which align objectives with preference vectors; and gradient-based hypervolume maximization (HVGrad) (Deist et al., 2020; 2021; Emmerich et al., 2007), which optimizes the hypervolume of a set of solutions to achieve both Pareto optimality and diversity. The most common approach, however, remains optimizing an aggregation function that converts a multiobjective optimization problem (MOP) into a single-objective one. Besides finding a single Pareto optimal solution or a set of Pareto solutions, another popular gradient-based MOO paradigm is Pareto set learning (Chen and Kwok, 2024; Dimitriadis et al., 2023; Navon et al., 2020; Ruchte and Grabocka, 2021), which trains a single model to predict the entire PS, typically using a hypernetwork or a neural network with a low-rank adaptation structure.

2.2. Distribution matching

Distribution matching (DM), or distribution alignment, is a core concept in machine learning with broad applications, including domain adaptation (Ganin et al., 2016; Nguyen et al., 2023; Xiao et al., 2024; Zhang et al., 2022; Zhao et al., 2018; 2019a), algorithmic fairness (Prost et al., 2019; Quadrianto and Sharmanska, 2017; Zhao and Gordon, 2022; Zhao et al., 2019b), and generative models (e.g., GANs, VAEs, diffusion models, imitation learning) (Higgins et al., 2017; Ho and Ermon, 2016; Jin et al., 2020; Yin et al., 2023). For generative models like GANs, the objective is to generate a distribution that closely aligns with the data distribution. Similarly, in generative adversarial imitation learning (GAIL) (Ho and Ermon, 2016), the goal is to learn a policy such that the state-action density function matches that of the expert.
DM also helps learn a representation that aligns two distributions and has been applied to enhance robustness and enforce constraints in domain generalization, causal discovery, and fair representation learning (Gong et al., 2024).

3. Preliminaries

3.1. Multiobjective optimization

For a MOP (Equation (1)), it is difficult to compare the quality of solutions since vector objectives do not admit a total order. To describe optimality for a MOP, we first introduce the concept of Pareto optimality.

Definition 1 (Pareto Optimal (PO), Pareto Set (PS), Pareto Front (PF)). A solution $\theta^\ast$ is PO if no other solution $\theta' \in \Theta$ dominates it; here $\theta'$ dominates $\theta^\ast$, denoted $f(\theta') \prec f(\theta^\ast)$, if $f_i(\theta') \le f_i(\theta^\ast)$ for all $i \in [m]$ with at least one strict inequality. The set of all PO solutions is called the Pareto set, and its image is the Pareto front. In addition to PO solutions, weakly PO solutions are those that cannot be strictly dominated by other solutions, i.e., no solution $\theta'$ exists such that $f_i(\theta') < f_i(\theta^\ast)$ for all $i \in [m]$.

The simplest way to find PO solutions is to use aggregation functions $g_\lambda(\cdot): \mathbb{R}^m \to \mathbb{R}$ to convert a vector optimization problem into a scalar one. Widely used aggregation functions include linear scalarization, $g_\lambda(f(\theta)) = \sum_{i=1}^m \lambda_i f_i(\theta)$, and the Tchebycheff function, $g_\lambda(f(\theta)) = \max_{i \in [m]} \lambda_i (f_i(\theta) - z_i)$, where $z$ is a reference point such that $z \preceq f(\theta)$ for all $\theta \in \Theta$. When applying aggregation functions to solve a MOP, we have the following lemma for the optimal solution of an aggregation function.

Lemma 2 (Adapted from (Miettinen, 1999), Theorem 2.6.2). If $g_\lambda(f(\theta))$ is decreasing w.r.t. $f(\theta)$, i.e., $f(\theta^{(a)}) \prec f(\theta^{(b)})$ implies $g_\lambda(f(\theta^{(a)})) \le g_\lambda(f(\theta^{(b)}))$, then the optimal solution of $g_\lambda(f(\theta))$ is a weakly PO solution. If $g_\lambda(f(\theta))$ is strictly decreasing w.r.t. $f(\theta)$, i.e., $f(\theta^{(a)}) \prec f(\theta^{(b)})$ implies $g_\lambda(f(\theta^{(a)})) < g_\lambda(f(\theta^{(b)}))$, then the optimal solution of $g_\lambda(f(\theta))$ is a PO solution.
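To make the distinction behind Lemma 2 concrete, here is a small sketch (ours; the three points on a nonconvex front are invented for illustration) comparing linear scalarization with the Tchebycheff function on a discretized objective set. With equal preferences, linear scalarization only ever selects an extreme point of a nonconvex front, while the Tchebycheff function can select the middle nondominated point:

```python
import numpy as np

def linear_scalarization(F, lam):
    # F: (n_points, m) objective values; lam: preference vector in the simplex
    return F @ lam

def tchebycheff(F, lam, z):
    # z: reference (ideal) point with z <= f(theta) elementwise
    return np.max(lam * (F - z), axis=1)

# A nonconvex front: the middle point (0.6, 0.6) lies above the line
# joining the two extremes, so no linear weighting can select it.
F = np.array([[0.0, 1.0], [0.6, 0.6], [1.0, 0.0]])
z = np.zeros(2)
lam = np.array([0.5, 0.5])

print(int(np.argmin(linear_scalarization(F, lam))))  # prints 0 (an extreme point)
print(int(np.argmin(tchebycheff(F, lam, z))))        # prints 1 (the middle point)
```

This is the classic argument for why Tchebycheff-type scalarizations can recover every weakly PO solution while linear scalarization cannot on nonconvex fronts.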
Lemma 2 guarantees that optimizing an aggregation function under a specific preference vector finds a (weakly) Pareto optimal solution. The remaining issue is whether the entire PS can be recovered by using all preference vectors spanning the probability simplex. For this question, it is known that (1) when each objective function $f_i(\theta)$ is convex, for any Pareto solution there exists a preference vector under which that solution is optimal for the linear aggregation function (Boyd, 2004)[Section 4.7]; (2) for any weakly PO solution, there exists a preference vector such that the optimum of the Tchebycheff scalarization corresponds to this solution (Choo and Atkins, 1983). Under mild conditions (c.f. (Zhang et al., 2023)[Prop. 2]), optimizing the Tchebycheff function yields the exact Pareto solution, where the optimal solution $\theta^\ast$ of the Tchebycheff function satisfies $\lambda_1(f_1(\theta^\ast) - z_1) = \ldots = \lambda_m(f_m(\theta^\ast) - z_m)$.

Another popular MOO paradigm is called Pareto Set Learning (PSL) (Lin et al., 2020; 2022), which learns a model $t_\beta(\lambda): \Delta_m \to \mathrm{PS}$ that maps a preference vector to a PO solution. Different from previous methods, PSL aims to learn the entire PS with a single model rather than finding a finite set of Pareto solutions.

3.2. Information geometry

This section introduces some basic concepts of information geometry (Amari, 2016; Ay et al., 2017; Nielsen, 2020), beginning with the concept of a Riemannian manifold.

Definition 3 (Riemannian manifold (Lee, 2012)). A Riemannian manifold is a pair $(S, g)$, where $S$ is a smooth manifold locally resembling Euclidean space, and $g: p \mapsto \langle \cdot, \cdot \rangle_p$ is a Riemannian metric that equips each tangent space $T_p(S)$ with an inner product $\langle \cdot, \cdot \rangle_p : T_p(S) \times T_p(S) \to \mathbb{R}$.

A connection on a Riemannian manifold describes how to differentiate vector fields.

Definition 4 (Connection of a smooth manifold (Lee, 2012)).
A connection $\nabla$ on a manifold $S$ is a bilinear map on the space of smooth vector fields $\Gamma(S)$, $\nabla: \Gamma(S) \times \Gamma(S) \to \Gamma(S)$, satisfying the following properties: 1. $\nabla_{fX} Z = f \nabla_X Z$; 2. $\nabla_X(fY) = f \nabla_X Y + (Xf)Y$, for vector fields $X, Y, Z \in \Gamma(S)$ and smooth functions $f \in C^\infty(S)$, where $Xf$ is the derivative of $f$ along $X$.

Given a (local) coordinate system $\vartheta = [\vartheta^i]$ on $S$, denote by $\{\partial_i := \partial / \partial \vartheta^i\}$ the coordinate vector fields. The metric $g$ can be characterized by the smooth functions $g_{ij} = \langle \partial_i, \partial_j \rangle$ (the metric components), and the connection $\nabla$ can be characterized by the smooth functions $\Gamma_{ij,k} = \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle$ (the Christoffel symbols). A connection $\nabla$ is said to be flat if there exists a coordinate system $\vartheta$ such that the corresponding Christoffel symbols of $\nabla$ are all zero, and $\vartheta$ is then said to be an affine coordinate system of the flat connection $\nabla$.

Statistical models $S = \{p_\vartheta\}_{\vartheta \in \Theta}$, that is, parametrized probability distributions (density functions), are considered smooth manifolds in information geometry. Moreover, these statistical manifolds are often endowed with the Fisher information metric $g^F$ given by

$$g^F_{ij}(\vartheta) := \mathbb{E}_\vartheta[\partial_i \ell_\vartheta \, \partial_j \ell_\vartheta],$$

and the $\alpha$-connections $\nabla^{(\alpha)}$ given by

$$\Gamma^{(\alpha)}_{ij,k}(\vartheta) := \mathbb{E}_\vartheta\Big[\Big(\partial_i \partial_j \ell_\vartheta + \frac{1-\alpha}{2}\, \partial_i \ell_\vartheta \, \partial_j \ell_\vartheta\Big)\, \partial_k \ell_\vartheta\Big],$$

where $\ell_\vartheta := \log p_\vartheta$ and $\mathbb{E}_\vartheta[f] := \int f(x)\, p_\vartheta(x)\, dx$. Next, we introduce dualistic structures on statistical manifolds, essential for analyzing information geometry.

Definition 5 (Dual connection $\nabla^\ast$). Let $(S, g)$ be a Riemannian manifold and $\nabla, \nabla^\ast$ be two connections on $S$. If $Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla^\ast_Z Y \rangle$ holds for all $X, Y, Z \in \Gamma(S)$, then we say that $\nabla$ and $\nabla^\ast$ are duals of each other with respect to $g$. Under a coordinate system, the condition for dual connections can be rewritten as $\partial_i g_{jk} = \Gamma_{ij,k} + \Gamma^\ast_{ik,j}$.

Clearly, $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual with respect to $g^F$. In particular, the pair $\nabla^{(1)}$ and $\nabla^{(-1)}$ is of special interest. We call them the exponential connection $\nabla^{(e)} := \nabla^{(1)}$ and the mixture connection $\nabla^{(m)} := \nabla^{(-1)}$, respectively. In this paper, we focus on dually flat manifolds $(S, g, \nabla, \nabla^\ast)$, where the pair of dual connections $\nabla$ and $\nabla^\ast$ on $(S, g)$ are both flat.
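As a quick numerical illustration (ours, not from the paper), the definition of the Fisher information metric above can be checked by Monte Carlo for a univariate Gaussian in the coordinates $(\mu, \sigma)$, whose closed-form Fisher metric is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score components ∂ℓ/∂µ and ∂ℓ/∂σ for ℓ = log N(x; µ, σ)
s_mu = (x - mu) / sigma**2
s_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3

# Monte Carlo estimate of g^F_ij = E[∂_i ℓ ∂_j ℓ]
G = np.array([[np.mean(s_mu * s_mu), np.mean(s_mu * s_sigma)],
              [np.mean(s_sigma * s_mu), np.mean(s_sigma * s_sigma)]])

G_exact = np.array([[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]])
print(np.round(G, 3))  # close to [[0.25, 0], [0, 0.5]] up to Monte Carlo noise
```

The empirical expectation of the score outer product reproduces the analytic metric components, which is exactly the definition $g^F_{ij} = \mathbb{E}[\partial_i \ell\, \partial_j \ell]$.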
On a dually flat manifold, we can always find a pair of dual coordinate systems $\vartheta = [\vartheta^i]$, $\eta = [\eta_j]$, such that $\vartheta$ and $\eta$ are affine coordinate systems of $\nabla$ and $\nabla^\ast$ respectively, and $\langle \partial_i, \partial^j \rangle = \delta_i^j$ for $\partial_i := \partial / \partial \vartheta^i$, $\partial^j := \partial / \partial \eta_j$. Moreover, we can also find a pair of potential functions $\psi, \varphi$ corresponding to $\vartheta, \eta$ such that $\partial_i \psi = \eta_i$, $\partial^i \varphi = \vartheta^i$, and $\psi + \varphi = \sum_i \vartheta^i \eta_i$. Under these settings, the canonical divergence of $(S, g, \nabla, \nabla^\ast)$, which is the key notion in our discussion, is defined as

$$D(p \| q) := \psi(p) + \varphi(q) - \sum_i \vartheta^i(p)\, \eta_i(q).$$

Note that the canonical divergence is independent of the choice of dual coordinate systems and potential functions (Amari and Nagaoka, 2000). Dually flat manifolds are closely related to the widely used Bregman divergences in machine learning.

Definition 6 (Bregman divergence). For a strictly convex smooth function $f$ defined on some open $\Xi \subseteq \mathbb{R}^n$, the corresponding Bregman divergence $D_f$ is defined as

$$D_f(x \| y) := f(x) - f(y) - \sum_i \partial_i f(y)\,(x_i - y_i), \qquad x, y \in \Xi,$$

where $\partial_i f$ is the $i$-th partial derivative of $f$.

Given a pair of dual coordinate systems $\vartheta, \eta$ and the corresponding potential functions $\psi, \varphi$ on a dually flat manifold $(S, g, \nabla, \nabla^\ast)$, the canonical divergence can be expressed as a Bregman divergence: $D(p \| q) = D_\psi(\vartheta(p) \| \vartheta(q)) = D_\varphi(\eta(q) \| \eta(p))$. Conversely, given a strictly convex smooth function $f$ on $\Xi \subseteq \mathbb{R}^n$, the Riemannian metric characterized by $g_{ij}(x) := \partial_i \partial_j f(x)$ and the pair of affine connections characterized by $\Gamma_{ij,k}(x) = 0$, $\Gamma^\ast_{ij,k}(x) = \partial_i \partial_j \partial_k f(x)$ form a dually flat structure on $\Xi$, with canonical divergence given by the Bregman divergence $D_f$. In the following, we present some examples of dually flat statistical manifolds.

Example 7 (Exponential family). An $n$-dimensional model $S = \{p_\vartheta\}_{\vartheta \in \Theta}$ is called an exponential family if it can be expressed in terms of functions {C, F1, . . .
, Fn} on the base space and a function $\psi$ on $\Theta$ as

$$p_\vartheta(x) = \exp\Big( C(x) + \sum_{i=1}^n \vartheta^i F_i(x) - \psi(\vartheta) \Big), \qquad \psi(\vartheta) = \log \int \exp\Big( C(x) + \sum_{i=1}^n \vartheta^i F_i(x) \Big)\, dx.$$

For an exponential family $S$, $(S, g^F, \nabla^{(e)}, \nabla^{(m)})$ is indeed a dually flat manifold. The $\nabla^{(e)}$-affine natural parameters $\vartheta = [\vartheta^i]$ and the $\nabla^{(m)}$-affine expectation parameters $\eta = [\eta_j]$ defined by $\eta_j(\vartheta) := \mathbb{E}_\vartheta[F_j]$ give a pair of dual coordinate systems. The corresponding potential functions are given by the cumulant function $\psi$ and

$$\varphi(\vartheta) = \sum_{i=1}^n \vartheta^i \eta_i(\vartheta) - \psi(\vartheta) = -H(p_\vartheta) - \mathbb{E}_\vartheta[C],$$

where $H$ is the Shannon entropy given by $H(p) := -\int p(x) \log p(x)\, dx$. The canonical divergence is then given by

$$D(p \| q) = \sum_i \big(\vartheta^i(q) - \vartheta^i(p)\big)\, \eta_i(q) + \psi(p) - \psi(q) = \int \big(\log q(x) - \log p(x)\big)\, q(x)\, dx = D_{KL}(q \| p),$$

which is indeed the dual of the Kullback-Leibler divergence.

Many important probabilistic models belong to exponential families, such as multivariate normal distributions, chi-squared distributions, gamma distributions, Poisson distributions, the multinomial distribution (with a fixed number of trials), distributions on a finite space (probability simplex), etc. In fact, for arbitrarily given density functions $p_0, p_1, \ldots, p_n$,

$$p_\vartheta(x) := \frac{p_0(x)^{1 - \sum_i \vartheta^i}\, p_1(x)^{\vartheta^1} \cdots p_n(x)^{\vartheta^n}}{\int p_0(x)^{1 - \sum_i \vartheta^i}\, p_1(x)^{\vartheta^1} \cdots p_n(x)^{\vartheta^n}\, dx}$$

gives an exponential family. Furthermore, (Banerjee et al., 2005) showed that any dually flat manifold can be realized as an exponential family (see also (Amari, 2016)).

Here we present the model of multivariate normal distributions (MVNs), which is widely used in machine learning:

$$p_\vartheta(x) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\Big( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \Big) = \exp\Big( (\vartheta_A)^\top F_A(x) + \mathrm{Tr}\big(\vartheta_B F_B(x)\big) - \psi(\vartheta) \Big)$$

is expressed as an exponential family in terms of $C(x) = 0$, $F_A(x) = x$, $F_B(x) = x x^\top$, and

$$\psi(\vartheta) = \tfrac{1}{2}\, \mu^\top \Sigma^{-1} \mu + \tfrac{1}{2} \log |2\pi\Sigma| = -\tfrac{1}{4}\, (\vartheta_A)^\top (\vartheta_B)^{-1} \vartheta_A + \tfrac{1}{2} \log \big| \pi (-\vartheta_B)^{-1} \big|,$$

with respect to the natural parameters $\vartheta = [\vartheta_A, \vartheta_B]$: $\vartheta_A = \Sigma^{-1}\mu$, $\vartheta_B = -\tfrac{1}{2}\Sigma^{-1}$, and the expectation parameters $\eta = [\eta_A, \eta_B]$: $\eta_A = \mu$, $\eta_B = \Sigma + \mu\mu^\top$.

Example 8 (Mixture family). An $n$-dimensional model $S = \{p_\vartheta\}_{\vartheta \in \Theta}$ is called a mixture family if it can be expressed in terms of functions $\{C, F_1, \ldots, F_n\}$ on the base space as $p_\vartheta(x) = C(x) + \sum_{i=1}^n \vartheta^i F_i(x)$.
For a mixture family $S$, $(S, g^F, \nabla^{(m)}, \nabla^{(e)})$ is indeed a dually flat manifold. The $\nabla^{(m)}$-affine parameters $\vartheta = [\vartheta^i]$ and the $\nabla^{(e)}$-affine parameters $\eta = [\eta_j]$ defined by $\eta_j(\vartheta) := \int F_j(x) \log p_\vartheta(x)\, dx$ give a pair of dual coordinate systems. The corresponding potential functions are given by $\psi(\vartheta) = -H(p_\vartheta)$ and $\varphi(\vartheta) = -\int C(x) \log p_\vartheta(x)\, dx$. The canonical divergence is then given by

$$D(p \| q) = \sum_{i=1}^n \vartheta^i(p) \big( \eta_i(p) - \eta_i(q) \big) + \varphi(q) - \varphi(p) = \int p(x) \big( \log p(x) - \log q(x) \big)\, dx = D_{KL}(p \| q),$$

which is exactly the Kullback-Leibler divergence. A frequently used formulation is the mixture of distributions: for arbitrarily given density functions $p_0, p_1, \ldots, p_n$,

$$p_\vartheta(x) = \Big(1 - \sum_{i=1}^n \vartheta^i\Big) p_0(x) + \sum_{i=1}^n \vartheta^i p_i(x) = p_0(x) + \sum_{i=1}^n \vartheta^i \big( p_i(x) - p_0(x) \big)$$

is expressed as a mixture family in terms of $C(x) = p_0(x)$, $F_i(x) = p_i(x) - p_0(x)$, $i \in [n]$.

4. Multiobjective distribution matching theories

In this section, we first introduce the underlying geometry of multiobjective distribution matching (MODM) based on dually flat spaces, and then derive the explicit forms of the PS and PF of MODM under some specific divergences.

4.1. Geometric structure of MODM

Theorem 9. Let $(S, g, \nabla, \nabla^\ast)$ be a dually flat manifold, $\vartheta$ and $\eta$ be a pair of dual coordinate systems, and $D(p \| q)$ be the canonical divergence. The Pareto set of the MOP

$$\min_{p \in S} \big( D(p_1 \| p), D(p_2 \| p), \ldots, D(p_m \| p) \big)$$

is given by the convex hull of $p_1, \ldots, p_m$ enclosed by $\nabla$-geodesics, that is,

$$\Big\{ p \in S : \vartheta(p) = \sum_{k=1}^m \lambda_k \vartheta(p_k),\ \lambda \in \Delta_m \Big\},$$

where $\vartheta(p)$ denotes the coordinates of $p$ under the affine coordinate system $\vartheta$, and similarly for $\eta(p)$. The Pareto set of the MOP

$$\min_{p \in S} \big( D(p \| p_1), D(p \| p_2), \ldots, D(p \| p_m) \big)$$

is given by the convex hull of $p_1, \ldots, p_m$ enclosed by $\nabla^\ast$-geodesics, that is,

$$\Big\{ p \in S : \eta(p) = \sum_{k=1}^m \lambda_k \eta(p_k),\ \lambda \in \Delta_m \Big\}.$$

The proof is postponed to Appendix A. The theorem establishes that, in non-degenerate cases ($\dim(S) = n$ and $p_1, \ldots, p_m$ are linearly independent in the corresponding coordinates), the PS is indeed an isomorphic image of the $m$-dimensional simplex.
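Theorem 9 can be sanity-checked numerically for univariate Gaussians, an exponential family with natural parameters $(\mu/\sigma^2,\, -1/(2\sigma^2))$ and expectation parameters $(\mu,\, \sigma^2 + \mu^2)$. The following sketch (our illustration; the two distributions and weights are arbitrary) builds the $\vartheta$-mixed and $\eta$-mixed candidates and verifies first-order stationarity of the corresponding weighted KL objective at the $\vartheta$-mixed point:

```python
import numpy as np

def kl_normal(mu_a, var_a, mu_b, var_b):
    return 0.5 * (np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def natural(mu, var):        # θ-coordinates of N(mu, var)
    return np.array([mu / var, -0.5 / var])

def from_natural(theta):     # back to (mu, var)
    var = -0.5 / theta[1]
    return theta[0] * var, var

def from_expectation(eta):   # η = (E[x], E[x²]) back to (mu, var)
    return eta[0], eta[1] - eta[0] ** 2

p1, p2 = (0.0, 1.0), (3.0, 4.0)   # two target Gaussians (mu, var)
lam = np.array([0.3, 0.7])

# θ-mixing: Pareto point of  min (KL(p‖p1), KL(p‖p2))
mu_t, var_t = from_natural(lam[0] * natural(*p1) + lam[1] * natural(*p2))

# η-mixing: Pareto point of  min (KL(p1‖p), KL(p2‖p))  (moment matching)
eta1 = np.array([p1[0], p1[1] + p1[0] ** 2])
eta2 = np.array([p2[0], p2[1] + p2[0] ** 2])
mu_e, var_e = from_expectation(lam[0] * eta1 + lam[1] * eta2)

# Stationarity check: the λ-weighted objective is flat at the θ-mixed point.
def weighted(mu, var):
    return lam[0] * kl_normal(mu, var, *p1) + lam[1] * kl_normal(mu, var, *p2)

eps = 1e-5
g_mu = (weighted(mu_t + eps, var_t) - weighted(mu_t - eps, var_t)) / (2 * eps)
g_var = (weighted(mu_t, var_t + eps) - weighted(mu_t, var_t - eps)) / (2 * eps)
print(round(g_mu, 6), round(g_var, 6))  # ≈ 0, 0
```

The vanishing finite-difference gradient confirms that mixing coordinates in the appropriate affine chart lands on a stationary point of the scalarized objective, as the theorem predicts.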
Next, we present the direct results for MVNs. For the MOP

$$\min_{p \in \mathrm{MVNs}} \big( D_{KL}(p \| p_1), D_{KL}(p \| p_2), \ldots, D_{KL}(p \| p_m) \big),$$

where $p_k$ is the MVN of $(\mu_k, \Sigma_k)$, the PS contains the MVNs of $(\mu, \Sigma)$ such that

$$\Sigma^{-1}\mu = \sum_{k=1}^m \lambda_k \Sigma_k^{-1} \mu_k, \qquad \Sigma^{-1} = \sum_{k=1}^m \lambda_k \Sigma_k^{-1}. \quad (2)$$

For the MOP

$$\min_{p \in \mathrm{MVNs}} \big( D_{KL}(p_1 \| p), D_{KL}(p_2 \| p), \ldots, D_{KL}(p_m \| p) \big),$$

where $p_k$ is the MVN of $(\mu_k, \Sigma_k)$, the PS contains the MVNs of $(\mu, \Sigma)$ such that

$$\mu = \sum_{k=1}^m \lambda_k \mu_k, \qquad \Sigma + \mu\mu^\top = \sum_{k=1}^m \lambda_k \big( \Sigma_k + \mu_k \mu_k^\top \big). \quad (3)$$

The corresponding PFs can then be obtained by computing the KL divergence between the MVNs on the PS and the given MVNs. Note that the KL divergence between the MVN $p_1$ of $(\mu_1, \Sigma_1)$ and the MVN $p_2$ of $(\mu_2, \Sigma_2)$ is given by

$$D_{KL}(p_1 \| p_2) = \tfrac{1}{2} \Big[ D_\psi\big(\Sigma_1^{-1} \| \Sigma_2^{-1}\big) + D_{\Sigma_2^{-1}}(\mu_1, \mu_2) \Big],$$

where

$$D_\psi\big(\Sigma_1^{-1} \| \Sigma_2^{-1}\big) := \mathrm{Tr}\big( \Sigma_1 \Sigma_2^{-1} \big) - \log \big| \Sigma_1 \Sigma_2^{-1} \big| - n$$

is a Bregman divergence between positive-definite matrices corresponding to $\psi(\Sigma^{-1}) = -\log|\Sigma^{-1}|$, and $D_{\Sigma_2^{-1}}(\mu_1, \mu_2) := (\mu_1 - \mu_2)^\top \Sigma_2^{-1} (\mu_1 - \mu_2)$ is the (squared) Mahalanobis distance between $\mu_1$ and $\mu_2$ with respect to $\Sigma_2^{-1}$. Note also that $\tfrac{1}{2} D_\psi(\Sigma_1^{-1} \| \Sigma_2^{-1})$ is exactly the KL divergence between the MVNs of covariance matrices $\Sigma_1$ and $\Sigma_2$ with the same mean, and $\tfrac{1}{2} D_{\Sigma_2^{-1}}(\mu_1, \mu_2)$ is exactly the KL divergence between the MVNs of means $\mu_1$ and $\mu_2$ with the same covariance matrix $\Sigma_2$.

Figures 1 and 2a illustrate the Pareto fronts (PFs) of MVN distribution matching with two and three objectives under the KL divergence. In Figure 1, blue dots represent gradient descent solutions with $\Sigma$ obtained via LU decomposition, while the red curve shows the theoretical PF from Theorem 9. The numerical results from gradient descent align well with the theoretical predictions. For completeness, we also provide the PS for MVNs under the Wasserstein distance.

Figure 1. The PF of a 2-objective, 3-dimensional multivariate normal (MVN) distribution matching problem under the KL divergence, where $\mu_1 = [0, 0, 0]^\top$, $\mu_2 = [1, 1, 1]^\top$, $\Sigma_1 = \mathrm{diag}([1, 1, 1])$, and $\Sigma_2 = \begin{pmatrix} 2.5 & 0.5 & 0 \\ 0.5 & 2.5 & 0 \\ 0 & 0 & 4 \end{pmatrix}$.
Discrete solutions (in blue) are optimized by gradient descent (GD), and the theoretical PF curve (in red) is predicted by Theorem 9.

Example 10 (PS under the 2-Wasserstein distance $W_2$). The Wasserstein distance between two MVN distributions has an explicit form:

$$W_2(p, p_k)^2 = \| \mu - \mu_k \|^2 + \mathrm{Tr}\big[ \big(\Sigma^{1/2} - \Sigma_k^{1/2}\big)^2 \big].$$

The gradients w.r.t. $\mu$ and $\Sigma^{1/2}$ are

$$\frac{\partial W_2^2}{\partial \mu} = 2(\mu - \mu_k), \qquad \frac{\partial W_2^2}{\partial \Sigma^{1/2}} = 2\big(\Sigma^{1/2} - \Sigma_k^{1/2}\big).$$

The PS can be formulated as

$$\mu = \sum_{k=1}^m \lambda_k \mu_k, \qquad \Sigma^{1/2} = \sum_{k=1}^m \lambda_k \Sigma_k^{1/2}. \quad (4)$$

Figure 2. Results on a 3-objective problem under the KL divergence with parameters $\mu_1 = [0, 0]^\top$, $\mu_2 = [1, 1]^\top$, $\mu_3 = [2, 2]^\top$, and $\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\Sigma_2 = \begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}$, $\Sigma_3 = \begin{pmatrix} 5 & 2 \\ 2 & 4 \end{pmatrix}$. (a) PF of the 3-objective MODM problem. (b) The optimization trajectory using the Tchebycheff aggregation function.

Equations (2) to (4) provide a principled approach for merging multiple distributions, which is also closely related to recently proposed model-merging techniques (Yang et al., 2024; Zeng et al., 2025). Our results show that, under certain divergences, the entire Pareto set can be recovered through appropriate manipulations in the parametric space, specifically by manipulating mean vectors and covariance matrices.

4.2. MODM on the preference simplex

In this section, we provide a result that exploits knowing the exact form of the PS of the MODM problem. Once the form of the PS is known, it is easy to locate a specific Pareto solution under a preference vector. The following theorem shows that optimizing over the entire probability space is equivalent to optimizing on the preference simplex.

Theorem 11. For a decreasing aggregation function, optimizing it on the PS is equivalent to optimizing it over the entire decision space $\Theta$. For example, with MVN distributions under the KL divergence, optimizing a solution in $\Theta$, i.e., $\min_{\theta \in \Theta} g_\lambda(f(\theta))$, reduces to optimizing within the smaller $m$-dimensional space $\Delta_m$.

Proof.
Assume $\theta' \in \Theta \setminus \mathrm{PS}$. Then there exists a solution $\theta^\ast \in \mathrm{PS}$ such that $f(\theta^\ast) \prec f(\theta')$, and hence $g_\lambda(f(\theta^\ast)) \le g_\lambda(f(\theta'))$ for a decreasing aggregation function, so the optimum over $\Theta$ can always be attained on the PS.

These results show that MODM with aggregation functions reduces to a single-objective optimization with a simplex constraint, while the original problem involves more parameters and requires costly semi-definite programming due to the positive-definite constraint on $\Sigma$. Additionally, under the KL divergence, $g_\lambda$ is convex with respect to $\lambda$, and the decision space, the $m$-dimensional simplex, is convex; the proof is in Appendix B. Optimizing the modified Tchebycheff function on the simplex using projected gradient descent (Algorithm 1) yields the exact Pareto solution, allowing precise control over its position. The optimization curve is shown in Figure 2b.

Algorithm 1 MODM on the Preference Simplex.
1: Initialization: the initial preference vector $\lambda \in \Delta_m$, where $\Delta_m$ is the preference simplex.
2: for epoch = 1 to $N_{\mathrm{epoch}}$ do
3: $\quad \lambda \leftarrow \lambda - \eta\, \nabla_\lambda\, g_\lambda\big( f\big( \sum_{i=1}^m \lambda_i \theta_i \big) \big)$
4: $\quad \lambda \leftarrow \mathrm{Proj}_{\Delta_m}(\lambda)$
5: end for

4.3. The Pareto set under $\alpha$-divergence

We first recall that the $\alpha$-divergence is defined as

$$D_\alpha(p \| q) = \frac{4}{1 - \alpha^2} \Big( 1 - \int p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\, dx \Big).$$

It recovers the KL divergence as $\alpha \to -1$ and the reverse KL divergence as $\alpha \to 1$, making it a special case of the f-divergence.

Theorem 12. For the MOP

$$\min_p \big( D_\alpha(p_1 \| p), D_\alpha(p_2 \| p), \ldots, D_\alpha(p_m \| p) \big)$$

over all distributions, the PS is given by

$$\Big\{ p_\lambda : p_\lambda(x) = \Big[ \frac{1}{\psi(\lambda)} \sum_{k=1}^m \lambda_k\, p_k(x)^{\frac{1-\alpha}{2}} \Big]^{\frac{2}{1-\alpha}},\ \lambda \in \Delta_m \Big\}, \quad (5)$$

where $\psi(\lambda)$ is the normalizing constant ensuring that $p_\lambda$ integrates to one. Here we use $p_\lambda, p_1, \ldots, p_m$ as shorthand notations for the density functions of the corresponding distributions on the underlying space.

Proof. Fix $\lambda \in \Delta_m$ and consider the convex combination

$$f(p) = \sum_{k=1}^m \lambda_k\, D_\alpha(p_k \| p). \quad (6)$$

The difference between $f(p)$ and $f(p_\lambda)$ can be formulated as

$$f(p) - f(p_\lambda) = \sum_{k=1}^m \lambda_k \big( D_\alpha(p_k \| p) - D_\alpha(p_k \| p_\lambda) \big). \quad (7)$$

If we set $p_\lambda$ to satisfy

$$p_\lambda(x)^{\frac{1-\alpha}{2}} = \frac{1}{\psi(\lambda)} \sum_{k=1}^m \lambda_k\, p_k(x)^{\frac{1-\alpha}{2}}, \quad (8)$$

then, substituting $\sum_k \lambda_k\, p_k^{\frac{1-\alpha}{2}} = \psi(\lambda)\, p_\lambda^{\frac{1-\alpha}{2}}$ into Equation (7), we have

$$f(p) - f(p_\lambda) = \frac{4}{1-\alpha^2}\, \psi(\lambda) \int p_\lambda(x)^{\frac{1-\alpha}{2}} \Big( p_\lambda(x)^{\frac{1+\alpha}{2}} - p(x)^{\frac{1+\alpha}{2}} \Big)\, dx = \psi(\lambda)\, D_\alpha(p_\lambda \| p) \ge 0.$$
Thus, $p_\lambda$ minimizes $f(p)$, and the minimum is given by

$$\sum_{k=1}^m \lambda_k\, D_\alpha(p_k \| p_\lambda) = \frac{4}{1-\alpha^2} \big( 1 - \psi(\lambda) \big).$$

Example 13. When $\alpha = -1$, $D_{-1} = D_{KL}$. Hence, for the MOP

$$\min_p \big( D_{KL}(p_1 \| p), D_{KL}(p_2 \| p), \ldots, D_{KL}(p_m \| p) \big)$$

over all distributions, the PS is given by

$$\Big\{ p_\lambda = \sum_{k=1}^m \lambda_k\, p_k,\ \lambda \in \Delta_m \Big\}.$$

Furthermore,

$$\sum_{k=1}^m \lambda_k\, D_{KL}(p_k \| p_\lambda) = D^{(\lambda)}_{JS}(p_1, \ldots, p_m),$$

which is the $\lambda$-skew Jensen-Shannon divergence. When $\alpha = 1$, $D_1$ is the reverse KL divergence. Hence, for the MOP

$$\min_p \big( D_{KL}(p \| p_1), D_{KL}(p \| p_2), \ldots, D_{KL}(p \| p_m) \big)$$

over all distributions, the PS is given by

$$\Big\{ p_\lambda = \exp\Big( D^{(\lambda)}_B(p_1, \ldots, p_m) + \sum_{k=1}^m \lambda_k \ln p_k \Big),\ \lambda \in \Delta_m \Big\},$$

where $D^{(\lambda)}_B(p_1, \ldots, p_m) := -\ln \int p_1^{\lambda_1} \cdots p_m^{\lambda_m}\, dx$ is the $\lambda$-skew Bhattacharyya divergence. Furthermore,

$$\sum_{k=1}^m \lambda_k\, D_{KL}(p_\lambda \| p_k) = D^{(\lambda)}_B(p_1, \ldots, p_m).$$

4.4. Multiobjective Variational Auto-Encoders (MOVAEs)

The MOVAE algorithm has two parts, as shown in Algorithm 2 in Appendix C. During training, a VAE model is trained on samples from all the distributions, with shared encoder ($\phi$) and decoder ($\psi$) parameters. After passing through the encoder network, each image is converted into a multivariate normal distribution. During inference, for any preference vector, a Pareto-optimal distribution is generated using Equation (2), Equation (3), or Equation (4). This interpolated MVN distribution is then input to the decoder network. MOVAE's core idea is that directly generating PO distributions over images is challenging, so encoders and decoders transform normal distributions into real-world image distributions; generating a PO latent distribution and then decoding it approximates the generation of a PO image distribution. Another advantage of MOVAE over VAEs is that it requires only a single model, as Pareto MVN distributions in the latent space are computed using explicit formulas, avoiding the need for extra neural models. In contrast, mixture models with a controlling ratio $\lambda$ require training a separate model for each $\lambda$.
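As an illustration of the inference step (a minimal sketch under our own assumptions: diagonal covariances, an 8-dimensional latent space, and a placeholder decoder; `interpolate_diag_mvn` is our hypothetical helper, not the paper's code), Equation (2)-style natural-parameter mixing of the per-distribution latent MVNs looks as follows:

```python
import numpy as np

def interpolate_diag_mvn(mus, vars_, lam):
    """Pareto-optimal latent MVN under KL via natural-parameter mixing.
    mus, vars_: lists of mean / diagonal-variance vectors, one per distribution."""
    prec = sum(l / v for l, v in zip(lam, vars_))                    # Σ^-1 = Σ_k λ_k Σ_k^-1
    mean = sum(l * m / v for l, m, v in zip(lam, mus, vars_)) / prec # Σ^-1 µ = Σ_k λ_k Σ_k^-1 µ_k
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
mus = [np.zeros(8), np.ones(8)]        # latent MVNs produced by the encoder (toy values)
vars_ = [np.ones(8), 0.5 * np.ones(8)]
mean, var = interpolate_diag_mvn(mus, vars_, lam=[0.5, 0.5])

z = mean + np.sqrt(var) * rng.standard_normal(8)  # sample the interpolated latent
# image = decoder(z)  # decoder(·) stands for the trained VAE decoder (not shown)
```

Sweeping `lam` over the preference simplex traces out the latent Pareto set in closed form, which is why no extra model per preference is needed.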
Therefore, the proposed method is more efficient in training and storage compared to VAEs with mixture distributions.

5. Experiments

In this section, we present the results of MOVAE. Results for various trade-off levels under the inverse KL divergence are shown in Figure 3; for other results, please refer to Appendix D. The image size is 28 × 28. Both the encoder and decoder networks have around 157K parameters. The number of training images is around 12K. The optimizer is Adam with a learning rate of 3e-5. The inference results in Figure 3 demonstrate that MOVAE is able to generate smooth interpolations between the alarm-clock distribution and the circle distribution. We use five uniform preferences from $[1, 0]$ to $[0, 1]$ as examples, though any preference within this range can serve as input to the MOVAE network. As the preference moves from $[1, 0]$ to $[0, 1]$, the interpolated images gradually come to resemble the second distribution. For an intermediate preference, MOVAE generates a blended distribution of images under this preference.

Figure 3. MOVAE results under the inverse KL divergence.

6. Conclusion, further work, and limitations

Conclusion. This paper explores the less-studied MODM problem by modeling a general multiobjective optimization problem on dually flat Riemannian manifolds. This approach provides explicit formulations for the entire exponential family under the KL and inverse KL divergences. We also derive the explicit form of the PS under the $\alpha$-divergence. The theoretical results have direct applications, including multiobjective variational autoencoders (MOVAEs). We evaluate MOVAE on the Quick Draw dataset, demonstrating its ability to generate blended images across different preference vectors.

Future work. Beyond the geometry explored here, distribution matching and information geometry share an intricate connection. Future work will address advanced topics, such as projecting distributions outside a manifold onto the manifold.

Limitations.
This paper mainly focuses on theoretical results for multiobjective distribution matching and does not discuss the applications in depth. In the future, we will discuss the relationship between multiobjective distribution matching and LLMs, trustworthy machine learning, and other topics.

Acknowledgement

This work was supported by the Research Grants Council of Hong Kong, GRF Project No. CityU 11212524.

Impact Statement

This paper presents work whose goal is to advance the field of multiobjective optimization and its applications in distribution matching. Given the scope of this research, we do not anticipate immediate ethical concerns or direct societal consequences. Therefore, we believe there are no specific ethical considerations or immediate societal impacts to be emphasized in the context of this work.

References

S.-i. Amari. Information geometry and its applications, volume 194. Springer, 2016.

S.-i. Amari and H. Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2000.

N. Ay, J. Jost, H. Vân Lê, and L. Schwachhöfer. Information geometry, volume 64. Springer, 2017.

M. Baktashmotlagh, M. Harandi, and M. Salzmann. Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research, 17(108):1-30, 2016. URL http://jmlr.org/papers/v17/15-207.html.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705-1749, Dec. 2005. ISSN 1532-4435.

S. Boyd. Convex optimization. Cambridge UP, 2004.

W. Chen and J. Kwok. Efficient Pareto manifold learning with low-rank structure. In Proceedings of the 41st International Conference on Machine Learning, pages 7015-7032, 2024.

W. Chen, Y. Klochkov, and Y. Liu. Post-hoc bias scoring is optimal for fair classification. arXiv preprint arXiv:2310.05725, 2023.

E. U. Choo and D. R. Atkins. Proper efficiency in nonconvex multicriteria programming.
Mathematics of Operations Research, 8(3):467-470, 1983.

T. M. Deist, S. C. Maree, T. Alderliesten, and P. A. Bosman. Multi-objective optimization by uncrowded hypervolume gradient ascent. 2020.

T. M. Deist, M. Grewal, F. J. Dankers, T. Alderliesten, and P. A. Bosman. Multi-objective learning to predict Pareto fronts using hypervolume maximization. arXiv preprint arXiv:2102.04523, 2021.

N. Dimitriadis, P. Frossard, and F. Fleuret. Pareto manifold learning: Tackling multiple tasks via ensembles of single-task models. In International Conference on Machine Learning, pages 8015-8052, 2023.

M. Emmerich, A. Deutz, and N. Beume. Gradient-based/evolutionary relay hybrid for computing Pareto front approximations maximizing the S-metric. In Hybrid Metaheuristics: 4th International Workshop, 2007.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1-35, 2016.

Z. Gong, B. Usman, H. Zhao, and D. I. Inouye. Towards practical non-adversarial distribution matching. In International Conference on Artificial Intelligence and Statistics, pages 4276-4284. PMLR, 2024.

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/1406.2661.

S. Han and S. D. Pimentel. Multiobjmatch: Matching with optimal tradeoffs between multiple objectives in R. arXiv preprint arXiv:2406.18819, 2024.

I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR (Poster), 3, 2017.

J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.

W. Jin, R. Barzilay, and T. Jaakkola. Multi-objective molecule generation using interpretable substructures.
In International Conference on Machine Learning, pages 4849–4859. PMLR, 2020.

F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029. PMLR, 2016.

J. M. Lee. Smooth manifolds. Springer, 2012.

Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.

X. Lin, H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong. Pareto multi-task learning. Advances in Neural Information Processing Systems, 32, 2019.

X. Lin, Z. Yang, Q. Zhang, and S. Kwong. Controllable Pareto multi-task learning. arXiv preprint arXiv:2010.06313, 2020.

X. Lin, Z. Yang, X. Zhang, and Q. Zhang. Pareto set learning for expensive multi-objective optimization. Advances in Neural Information Processing Systems, 35:19231–19247, 2022.

D. Mahapatra and V. Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. In International Conference on Machine Learning, pages 6597–6607, 2020.

D. Mahapatra and V. Rajan. Exact Pareto optimal search for multi-task learning and multi-criteria decision-making. arXiv preprint arXiv:2108.00597, 2021.

K. Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 1999.

M. Momma, C. Dong, and J. Liu. A multi-objective/multi-task learning framework induced by Pareto stationarity. In International Conference on Machine Learning, pages 15895–15907. PMLR, 2022.

A. Navon, A. Shamsian, E. Fetaya, and G. Chechik. Learning the Pareto front with hypernetworks. In International Conference on Learning Representations, 2020.

T. Nguyen, K. Do, B. Duong, and T. Nguyen. Domain generalisation via risk distribution matching, 2023. URL https://arxiv.org/abs/2310.18598.

F. Nielsen. An elementary introduction to information geometry. Entropy, 22(10):1100, 2020.

S. Nowozin, B.
Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.

F. Prost, H. Qian, Q. Chen, E. H. Chi, J. Chen, and A. Beutel. Toward a better trade-off between performance and fairness with kernel-based distribution matching, 2019. URL https://arxiv.org/abs/1910.11779.

N. Quadrianto and V. Sharmanska. Recycling privileged learning and distribution matching for fairness. Advances in Neural Information Processing Systems, 30, 2017.

M. Ruchte and J. Grabocka. Scalable Pareto front approximation for deep multi-objective learning. In IEEE International Conference on Data Mining, pages 1306–1311, 2021.

O. Sener and V. Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.

U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085. PMLR, 2017.

R. Tachet des Combes, H. Zhao, Y.-X. Wang, and G. J. Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33:19276–19289, 2020.

C. Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.

J. Wen, R. Greiner, and D. Schuurmans. Domain aggregation networks for multi-source domain adaptation. In International Conference on Machine Learning, pages 10214–10224. PMLR, 2020.

R. Wu, Y. Zhang, Z. Yang, and Z. Wang. Offline constrained multi-objective reinforcement learning via pessimistic dual value iteration. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 25439–25451. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/d5c8e1ab6fc0bfeb5f29aafa999cdb29-Paper.pdf.

R. Xian and H. Zhao.
A unified post-processing framework for group fairness in classification. arXiv preprint arXiv:2405.04025, 2024.

R. Xian, L. Yin, and H. Zhao. Fair and optimal classification via post-processing. In International Conference on Machine Learning, pages 37977–38012. PMLR, 2023.

W. Xiao, Y. Chen, Q. Shan, Y. Wang, and J. Su. Feature distribution matching by optimal transport for effective and robust coreset selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 9196–9204, 2024.

E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang, and D. Tao. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities, 2024. URL https://arxiv.org/abs/2408.07666.

T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park. One-step diffusion with distribution matching distillation, 2023. URL https://arxiv.org/abs/2311.18828.

S. Zeng, Y. He, W. You, Y. Hao, Y.-H. H. Tsai, M. Yamada, and H. Zhao. Efficient model editing with task vector bases: A theoretical framework and scalable approach, 2025. URL https://arxiv.org/abs/2502.01015.

B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.

X. Zhang, X. Lin, B. Xue, Y. Chen, and Q. Zhang. Hypervolume maximization: A geometric view of Pareto set learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

X. Zhang, X. Lin, and Q. Zhang. PMGDA: A preference-based multiple gradient descent algorithm. arXiv preprint arXiv:2402.09492, 2024.

Y. Zhang, M. Li, R. Li, K. Jia, and L. Zhang. Exact feature distribution matching for arbitrary style transfer and domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8035–8045, 2022.

H. Zhao and G. J. Gordon.
Inherent tradeoffs in learning fair representations. The Journal of Machine Learning Research, 23(1):2527–2552, 2022.

H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8568–8579, 2018.

H. Zhao, R. T. D. Combes, K. Zhang, and G. Gordon. On learning invariant representations for domain adaptation. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7523–7532. PMLR, 09–15 Jun 2019a. URL https://proceedings.mlr.press/v97/zhao19a.html.

H. Zhao, A. Coston, T. Adel, and G. J. Gordon. Conditional learning of fair representations. In International Conference on Learning Representations, 2019b.

H. Zhao, C. Dan, B. Aragam, T. S. Jaakkola, G. J. Gordon, and P. Ravikumar. Fundamental limits and tradeoffs in invariant representation learning. The Journal of Machine Learning Research, 23(1):15356–15404, 2022.

A. Proof of Theorem 9

Proof. The potential functions satisfy $\nabla \varphi(\eta) = \vartheta$ and $\nabla \psi(\vartheta) = \eta$. Hence the gradients of the two divergences are

$$\nabla_{\eta(p)} D(p_k \,\|\, p) = \nabla_{\eta(p)} \big[ \psi(\vartheta(p_k)) + \varphi(\eta(p)) - \langle \eta(p), \vartheta(p_k) \rangle \big] = \vartheta(p) - \vartheta(p_k),$$

$$\nabla_{\vartheta(p)} D(p \,\|\, p_k) = \nabla_{\vartheta(p)} \big[ \psi(\vartheta(p)) + \varphi(\eta(p_k)) - \langle \vartheta(p), \eta(p_k) \rangle \big] = \eta(p) - \eta(p_k).$$

Hence, for any $\lambda \in \Delta_m$, we have

$$\sum_{k=1}^m \lambda_k \nabla_{\eta(p)} D(p_k \,\|\, p) = 0 \iff \vartheta(p) = \sum_{k=1}^m \lambda_k \vartheta(p_k),$$

$$\sum_{k=1}^m \lambda_k \nabla_{\vartheta(p)} D(p \,\|\, p_k) = 0 \iff \eta(p) = \sum_{k=1}^m \lambda_k \eta(p_k),$$

which completes the proof.

B. Proof of Theorem 11

We show that if $f(x)$ is a convex function, then $f\left(\sum_{i=1}^m \lambda_i x_i\right)$ is also convex with respect to the weights $\lambda_i$, provided that $\lambda$ belongs to the simplex $\Delta = \{\lambda \in \mathbb{R}^m \mid \lambda_i \geq 0, \ \sum_{i=1}^m \lambda_i = 1\}$. Since $f(x)$ is convex, for any points $x_1, \ldots, x_m$ and any weights $\lambda_1, \ldots, \lambda_m$ in $\Delta$, Jensen's inequality gives $f\left(\sum_{i=1}^m \lambda_i x_i\right) \leq \sum_{i=1}^m \lambda_i f(x_i)$. To examine the convexity of $f\left(\sum_{i=1}^m \lambda_i x_i\right)$ with respect to $\lambda$, consider two distinct points $\lambda^a$ and $\lambda^b$ in the simplex and $k_1, k_2 \geq 0$ such that $k_1 + k_2 = 1$.
Then, we have:

$$f\left(\sum_{i=1}^m \big(k_1 \lambda_i^a + k_2 \lambda_i^b\big)\, x_i\right) = f\left(k_1 \sum_{i=1}^m \lambda_i^a x_i + k_2 \sum_{i=1}^m \lambda_i^b x_i\right) \leq k_1 f\left(\sum_{i=1}^m \lambda_i^a x_i\right) + k_2 f\left(\sum_{i=1}^m \lambda_i^b x_i\right).$$

This inequality demonstrates that $f\left(\sum_{i=1}^m \lambda_i x_i\right)$ is convex with respect to the decision variables $\lambda$: the convexity of $f$ over its argument translates into convexity with respect to the weights $\lambda_i$, as long as $\lambda$ remains within the simplex constraints. Further, since each $f_i$ is convex and $g_\lambda$ is a non-decreasing convex function, the composition $g_\lambda\left(f\left(\sum_{i=1}^m \lambda_i x_i\right)\right)$ is also convex.

C. Algorithms

Algorithm 2 Multiobjective VAE (MOVAE)
1: # Step 1. Offline VAE training.
2: Input: m datasets $D_1, \ldots, D_m$.
3: for epoch = 1 : $N_{\text{epoch}}$ do
4:   Sample a batch of data $\hat{D}_1, \ldots, \hat{D}_m$.
5:   for i = 1 : m do
6:     $\ell_i = \ell_{\text{BCE}} + D_{\text{KL}}(p(z \mid \theta), \mathcal{N}(0, I))$.
7:   end for
8:   $\ell = \frac{1}{m} \sum_{i=1}^m \ell_i$.
9:   for i = 1 : m do
10:     Update: $\phi_i \leftarrow \phi_i - \eta \frac{\partial \ell}{\partial \phi_i}$, $\psi_i \leftarrow \psi_i - \eta \frac{\partial \ell}{\partial \psi_i}$, where $\eta$ is the learning rate.
11:   end for
12: end for
13: # Step 2. MOVAE prediction.
14: Input: $\{\theta_i^{(1)}\}_{i=1}^K, \ldots, \{\theta_i^{(m)}\}_{i=1}^K$ to obtain m parameter pairs $(\mu_1, \Sigma_1), \ldots, (\mu_m, \Sigma_m)$ from the encoder network.
15: Given a preference vector $\lambda$, generate the Pareto optimal distribution with parameters given by Equation (2), Equation (3), or Equation (4).
16: Output: Image distribution under the Pareto optimal distribution.

D. Extra experimental results

In this section, we present extra results under the KL divergence and the Wasserstein distance in Figures 4 and 5.

Figure 4. MOVAE results under KL divergence.

Figure 5. MOVAE results under Wasserstein distance.
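Theorem 9 gives the Pareto set in closed form: under $D(p_k \| p)$ the Pareto optimal distribution mixes the natural parameters $\vartheta(p_k)$, and under $D(p \| p_k)$ it mixes the expectation parameters $\eta(p_k)$. For multivariate normals the two parameterizations are $\vartheta = (\Sigma^{-1}\mu, -\tfrac{1}{2}\Sigma^{-1})$ and $\eta = (\mu, \Sigma + \mu\mu^\top)$. The following minimal Python sketch illustrates this mixing for MVNs; the function names are ours for illustration, and it is not the paper's exact Equations (2)-(4).

```python
import numpy as np

def to_natural(mu, Sigma):
    # Natural parameters of an MVN: theta1 = Sigma^{-1} mu, theta2 = -0.5 Sigma^{-1}.
    P = np.linalg.inv(Sigma)
    return P @ mu, -0.5 * P

def from_natural(t1, t2):
    # Invert the natural-parameter map back to (mu, Sigma).
    Sigma = np.linalg.inv(-2.0 * t2)
    return Sigma @ t1, Sigma

def to_expectation(mu, Sigma):
    # Expectation parameters of an MVN: eta1 = mu, eta2 = Sigma + mu mu^T.
    return mu, Sigma + np.outer(mu, mu)

def from_expectation(e1, e2):
    # Invert the expectation-parameter map back to (mu, Sigma).
    return e1, e2 - np.outer(e1, e1)

def pareto_mvn(params, lam, mode="natural"):
    """Pareto optimal MVN for preference vector lam (Theorem 9).

    mode="natural": mix natural parameters (objectives D(p_k || p)).
    mode="expectation": mix expectation parameters (objectives D(p || p_k)).
    """
    if mode == "natural":
        nats = [to_natural(m, S) for m, S in params]
        t1 = sum(l * n[0] for l, n in zip(lam, nats))
        t2 = sum(l * n[1] for l, n in zip(lam, nats))
        return from_natural(t1, t2)
    exps = [to_expectation(m, S) for m, S in params]
    e1 = sum(l * e[0] for l, e in zip(lam, exps))
    e2 = sum(l * e[1] for l, e in zip(lam, exps))
    return from_expectation(e1, e2)
```

For a vertex preference such as $\lambda = (1, 0)$ both modes recover $p_1$ exactly, while interior preferences trace out the Pareto set between the input Gaussians; the two modes generally give different interpolation paths because the natural and expectation coordinates are dual, not equal.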