# Gromov-Wasserstein Autoencoders

Published as a conference paper at ICLR 2023

Nao Nakagawa¹, Ren Togo², Takahiro Ogawa², & Miki Haseyama²
¹ Graduate School of Information Science and Technology, Hokkaido University, Japan
² Faculty of Information Science and Technology, Hokkaido University, Japan
{nakagawa,togo,ogawa,mhaseyama}@lmd.ist.hokudai.ac.jp

ABSTRACT

Variational Autoencoder (VAE)-based generative models offer flexible representation learning by incorporating meta-priors, general premises considered beneficial for downstream tasks. However, the incorporated meta-priors often involve ad-hoc model deviations from the original likelihood architecture, causing undesirable changes in their training. In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions using the variational autoencoding scheme. Instead of likelihood-based objectives, GWAE models minimize the Gromov-Wasserstein (GW) metric between the trainable prior and the given data distribution. The GW metric measures the distance-structure-oriented discrepancy between distributions, even those supported on spaces of different dimensionality, and thus provides a direct measure between the latent and data spaces. By restricting the prior family, we can introduce meta-priors into the latent space without changing the objective. Empirical comparisons with VAE-based models show that GWAE models work with two prominent meta-priors, disentanglement and clustering, with their GW objective unchanged.

1 INTRODUCTION

One fundamental challenge in unsupervised learning is capturing the underlying low-dimensional structure of high-dimensional data, because natural data (e.g., images) lie on low-dimensional manifolds (Carlsson et al., 2008; Bengio et al., 2013). Since deep neural networks have shown their potential for non-linear mapping, representation learning has recently made substantial progress in its applications to high-dimensional and complex data (Kingma & Welling, 2014; Rezende et al., 2014; Hsu et al., 2017; Hu et al., 2017). Learning low-dimensional representations is in mounting demand because the inference of concise representations extracts the essence of data and facilitates various downstream tasks (Thomas et al., 2017; Higgins et al., 2017b; Creager et al., 2019; Locatello et al., 2019a). For obtaining such general-purpose representations, several meta-priors have been proposed (Bengio et al., 2013; Tschannen et al., 2018). Meta-priors are general premises about the world, such as disentanglement (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018; Ding et al., 2020), hierarchical factors (Vahdat & Kautz, 2020; Zhao et al., 2017; Sønderby et al., 2016), and clustering (Zhao et al., 2018; Zong et al., 2018; Asano et al., 2020).

A prominent approach to representation learning is deep generative modeling based on the variational autoencoder (VAE) (Kingma & Welling, 2014). VAE-based models adopt the variational autoencoding scheme, which introduces an inference model in addition to a generative model and thereby offers bidirectionally tractable processes between observed variables (data) and latent variables. In this scheme, the reparameterization trick (Kingma & Welling, 2014) yields the representation learning capability, since reparameterized latent codes are tractable for gradient computation.
The introduction of additional losses and constraints provides further regularization for the training process based on meta-priors. However, controlling representation learning remains a challenging task in VAE-based models owing to the deviation from the original optimization. Whereas existing VAE-based approaches modify the latent space based on the meta-prior (Kim & Mnih, 2018; Zhao et al., 2017; Zong et al., 2018), their training objectives still partly rely on the evidence lower bound (ELBO). Since the ELBO objective is grounded in variational inference, ad-hoc model modifications cause implicit and undesirable changes, e.g., posterior collapse (Dai et al., 2020) and implicit prior change (Hoffman et al., 2017) in β-VAE (Higgins et al., 2017a). Under such modifications, it is also unclear whether a latent representation retains the underlying data structure, because VAE models implicitly interpolate data points to form a latent space using the noise injected into latent codes by the reparameterization trick (Rezende & Viola, 2018a;b; Aneja et al., 2021).

As another paradigm of variational modeling, the ELBO objective has been reinterpreted from the optimal transport (OT) viewpoint (Tolstikhin et al., 2018). Tolstikhin et al. (2018) derived a family of generative models called the Wasserstein autoencoder (WAE) by applying the variational autoencoding model to high-dimensional OT problems as the couplings (see Appendix A.4 for more details). Despite the OT-based model derivation, the WAE objective is equivalent to that of InfoVAE (Zhao et al., 2019), whose objective consists of the ELBO and a mutual information term. The WAE formulation is derived from the estimation and minimization of the OT cost (Tolstikhin et al., 2018; Arjovsky et al., 2017) between the data distribution and the generative model, i.e., generative modeling via the Wasserstein metric. It furnishes a wide class of models, even when the prior support does not cover the entire variational posterior support. The OT paradigm also applies to existing representation learning approaches originally derived from re-weighting the Kullback-Leibler (KL) divergence term (Gaujac et al., 2021).

Another technique for optimizing the VAE-based ELBO objective, called implicit variational inference (IVI) (Huszár, 2017), has been actively researched. While the VAE model has an analytically tractable prior for variational inference, IVI aims at variational inference with implicit distributions, for which a sampler is available instead of a closed-form probability density function. A notable approach to IVI is density ratio estimation (Sugiyama et al., 2012), which replaces the f-divergence term in the variational objective with an adversarial discriminator that distinguishes the origin of the samples. For distribution matching, this algorithm shares theoretical grounds with generative models based on generative adversarial networks (GANs) (Goodfellow et al., 2014; Sønderby et al., 2017), which motivates the application of IVI to distribution matching for complex and high-dimensional variables, such as images. See Appendix A.6 for more discussion.

In this paper, we propose a novel representation learning methodology, the Gromov-Wasserstein Autoencoder (GWAE), based on the Gromov-Wasserstein (GW) metric (Mémoli, 2011), an OT-based metric applicable even to distributions with different dimensionalities (Mémoli, 2011; Xu et al., 2020; Nguyen et al., 2021).
Instead of the ELBO objective, we apply the GW metric objective in the variational autoencoding scheme to directly match the latent marginal (prior) and the data distribution. GWAE models obtain a latent representation that retains the distance structure of the data space and thus holds the underlying data information. The GW objective also induces the variational autoencoding to perform distribution matching between the generative and inference models, despite the OT-based derivation. Under the OT-based variational autoencoding, one can adopt the prior of a GWAE model from a rich class of trainable priors depending on the assumed meta-prior, even when the KL divergence from the prior to the encoder is infinite. Our contributions are listed below.

- We propose a novel probabilistic model family, GWAE, which matches the latent space to the given unlabeled data via the variational autoencoding scheme. GWAE models estimate and minimize the GW metric between the latent and data spaces to directly bring the latent representation closer to the data in terms of distance structure.
- We propose several families of priors in the form of implicit distributions, adaptively learned from the given dataset using stochastic gradient descent (SGD). The choice of the prior family corresponds to the meta-prior, thereby providing a more flexible modeling scheme for representation learning.
- We conduct empirical evaluations of the capability of GWAE under prominent meta-priors: disentanglement and clustering. Several experiments on the image datasets CelebA (Liu et al., 2015), MNIST (LeCun et al., 1998), and 3D Shapes (Burgess & Kim, 2018) show that GWAE models outperform VAE-based representation learning methods while their GW objective remains unchanged across meta-priors.

2 RELATED WORK

VAE-based Representation Learning. VAE (Kingma & Welling, 2014) is a prominent deep generative model for representation learning. Following its theoretical consistency and explicit handling of latent variables, many state-of-the-art representation learning methods have been proposed based on VAE with modifications (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018; Achille & Soatto, 2018; Kumar et al., 2018; Zong et al., 2018; Zhao et al., 2017; Sønderby et al., 2016; Zhao et al., 2019; Hou et al., 2019; Detlefsen & Hauberg, 2019; Ding et al., 2020). The standard VAE learns an encoder and a decoder with parameters $\phi$ and $\theta$, respectively, to obtain a low-dimensional representation in its latent variables $z$ through the bottleneck layer of the autoencoder. Using data $x \sim p_{\mathrm{data}}(x)$ supported on the data space $\mathcal{X}$, the VAE objective is the ELBO, formulated as the following optimization problem:

$$\underset{\theta,\phi}{\text{maximize}}\;\; \mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[\, \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,\pi(z)\right) \right], \tag{1}$$

where the encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ are parameterized by neural networks, and the prior $\pi(z)$ is postulated before training. The first and second terms in Eq. (1) (called the reconstruction term and the KL term, respectively) are in a trade-off relationship (Tschannen et al., 2018). This implies that learning is guided toward autoencoding by the reconstruction term while the distribution of latent variables is matched to the pre-defined prior by the KL term.
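For concreteness, the following is a minimal PyTorch sketch of the ELBO in Eq. (1), assuming a diagonal Gaussian encoder, a Bernoulli decoder, and a standard Gaussian prior; the `encoder` and `decoder` callables are hypothetical placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder):
    """Minimal sketch of Eq. (1): diagonal Gaussian encoder q_phi(z|x),
    Bernoulli decoder p_theta(x|z), and standard Gaussian prior pi(z).
    Assumes x takes values in [0, 1] (e.g., image intensities)."""
    mu, logvar = encoder(x)                                    # parameters of q_phi(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterization trick
    logits = decoder(z)                                        # parameters of p_theta(x|z)
    # Reconstruction term: E_q[log p_theta(x|z)] under a Bernoulli likelihood
    rec = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none"
    ).flatten(1).sum(-1)
    # KL term: D_KL(q_phi(z|x) || N(0, I)) in closed form
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
    return (rec - kl).mean()  # maximize this (i.e., minimize its negative)
```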
Implicit Variational Inference. IVI solves the variational inference problem using implicit distributions (Huszár, 2017). A major approach to IVI is density ratio estimation (Sugiyama et al., 2012), in which the ratio between probability density functions is estimated using a discriminator instead of their closed-form expressions. Since IVI-based and GAN-based models share the density ratio estimation mechanism for distribution matching (Sønderby et al., 2017), the combination of VAEs and GANs has been actively studied, especially from the viewpoint of matching implicit distributions. The successful results achieved by GAN-based models on high-dimensional data, such as natural images, have propelled active application and research of IVI in unsupervised learning (Larsen et al., 2016; Makhzani, 2018).

Optimal Transport. The OT cost is used as a measure of the discrepancy between distributions supported on high-dimensional spaces and can be minimized using SGD (Arjovsky et al., 2017; Tolstikhin et al., 2018; Gaujac et al., 2021). It provides the Wasserstein metric for the discrepancy between distributions. For a constant $\xi \geq 1$, the $\xi$-Wasserstein metric between distributions $r$ and $s$ is defined as

$$W_\xi(r, s) = \left( \inf_{\gamma \in \mathcal{P}(r(x),\, s(x'))} \mathbb{E}_{\gamma(x, x')}\!\left[ d^{\xi}(x, x') \right] \right)^{1/\xi}, \tag{2}$$

where $x$ denotes the random variable on which the distributions $r$ and $s$ are defined, and $\mathcal{P}(r(x), s(x'))$ denotes the set of all couplings whose $x$-marginal is $r(x)$ and whose $x'$-marginal is $s(x')$. Owing to the difficulty of computing the exact infimum in Eq. (2) for high-dimensional, large-scale data, several approaches minimize an estimated $\xi$-Wasserstein metric using neural networks and SGD (Tolstikhin et al., 2018; Arjovsky et al., 2017). The form in Eq. (2) is the primal form of the Wasserstein metric, in contrast to its dual form for the case $\xi = 1$ (Arjovsky et al., 2017). The two prominent approaches to OT for high-dimensional, complex, large-scale data are: (i) minimizing the primal form using a probabilistic autoencoder (Tolstikhin et al., 2018), and (ii) adversarially optimizing the dual form using a generator-critic pair (Arjovsky et al., 2017).

Wasserstein Autoencoder (WAE). WAE (Tolstikhin et al., 2018) is a family of generative models whose autoencoder estimates and minimizes the primal form of the Wasserstein metric between the generative model $p_\theta(x)$ and the data distribution $p_{\mathrm{data}}(x)$ using SGD in the variational autoencoding setting, i.e., the VAE model architecture (Kingma & Welling, 2014). This primal-based formulation induces a representation learning methodology from the OT viewpoint because the WAE objective is equivalent to that of InfoVAE (Zhao et al., 2019), which learns the variational autoencoding model while retaining the mutual information of the probabilistic encoder.

Kantorovich-Rubinstein Duality. Wasserstein GAN models (Arjovsky et al., 2017) adopt an objective based on the 1-Wasserstein metric between the generative model $p_\theta(x)$ and the data distribution $p_{\mathrm{data}}(x)$. This objective is estimated using the Kantorovich-Rubinstein duality (Villani, 2009; Arjovsky et al., 2017), which holds for the 1-Wasserstein metric as

$$W_1(r, s) = \sup_{f:\, \text{1-Lipschitz}} \mathbb{E}_{r(x)}\left[f(x)\right] - \mathbb{E}_{s(x)}\left[f(x)\right]. \tag{3}$$

To estimate the function $f$ using SGD, a 1-Lipschitz neural network called a critic is introduced, analogous to the discriminator in GAN-based models. The training process with mini-batches is conducted adversarially, i.e., by alternately updating the critic parameters and the generative parameters. During this process, the critic maximizes the objective in Eq. (3) to approach the supremum, whereas the generative model minimizes the objective for the distribution matching $p_\theta(x) \approx p_{\mathrm{data}}(x)$.
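As a hedged sketch of how Eq. (3) is estimated in practice, the mini-batch critic gap below assumes a critic network whose 1-Lipschitz constraint is enforced elsewhere (e.g., by spectral normalization or a gradient penalty); `critic`, `generator`, and `latent_dim` are placeholder names.

```python
import torch

def kantorovich_rubinstein_gap(critic, x_real, x_fake):
    """Mini-batch estimate of E_r[f(x)] - E_s[f(x)] from Eq. (3).

    The critic is assumed to be (approximately) 1-Lipschitz, e.g. via
    spectral normalization or an added gradient penalty."""
    return critic(x_real).mean() - critic(x_fake).mean()

# Alternating updates (sketch): the critic ascends the gap, the generator descends it.
# for x_real in loader:
#     x_fake = generator(torch.randn(x_real.size(0), latent_dim))
#     critic_loss = -kantorovich_rubinstein_gap(critic, x_real, x_fake.detach())
#     generator_loss = kantorovich_rubinstein_gap(critic, x_real, x_fake)
```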
3 PROPOSED METHOD

Our GWAE models minimize the OT cost between the data and latent spaces, based on generative modeling in the variational autoencoding scheme. GWAE models learn representations by matching the distance structure between the latent and data spaces instead of maximizing the likelihood.

3.1 OPTIMAL TRANSPORT BETWEEN SPACES

Although the OT problem induces a metric between probability distributions, its application is limited to distributions sharing one sample space. The GW metric (Mémoli, 2011) measures the discrepancy between metric measure spaces using the OT of distance distributions. A metric measure space consists of a sample space, a metric, and a probability measure. Given a pair of different metric spaces, i.e., sample spaces with their metrics, the GW metric measures the discrepancy between probability distributions supported on these spaces. In terms of the GW metric, two distributions are considered equal if there is an isometric mapping between their supports (Sturm, 2012; Sejourne et al., 2021). For a constant $\rho \geq 1$, the $\rho$-GW metric between a probability distribution $r(x)$ supported on a metric space $(\mathcal{X}, d_{\mathcal{X}})$ and a probability distribution $s(z)$ supported on $(\mathcal{Z}, d_{\mathcal{Z}})$ is given by

$$GW_\rho(r, s) := \left( \inf_{\gamma \in \mathcal{P}(r(x),\, s(z))} \mathbb{E}_{\gamma(x, z)}\mathbb{E}_{\gamma(x', z')}\!\left[ \left| d_{\mathcal{X}}(x, x') - d_{\mathcal{Z}}(z, z') \right|^{\rho} \right] \right)^{1/\rho}, \tag{4}$$

where $\mathcal{P}(r(x), s(z))$ denotes the set of all couplings with $r(x)$ as the $x$-marginal and $s(z)$ as the $z$-marginal, and $d_{\mathcal{X}}$ and $d_{\mathcal{Z}}$ are the metrics of the spaces $\mathcal{X}$ and $\mathcal{Z}$, respectively.
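For intuition, the empirical GW discrepancy of Eq. (4) between two small point clouds of different dimensionality can be computed with the POT package (Flamary et al., 2021), which the paper also uses for its empirical GW baseline in Section 4.2. In this sketch the sample sizes and dimensions are arbitrary, and `square_loss` corresponds to the $\rho = 2$ case without the final $1/\rho$ power.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # samples from a "data" space of dimension 64
Z = rng.normal(size=(200, 8))    # samples from a "latent" space of dimension 8

# Pairwise distance matrices d_X and d_Z within each space (Euclidean).
C1 = ot.dist(X, X, metric="euclidean")
C2 = ot.dist(Z, Z, metric="euclidean")

# Uniform empirical weights on both sample sets.
p = ot.unif(X.shape[0])
q = ot.unif(Z.shape[0])

# gromov_wasserstein2 returns the GW discrepancy value; the optimal coupling
# itself is available via ot.gromov.gromov_wasserstein.
gw_value = ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")
print(gw_value)
```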
3.2 APPLICATION TO REPRESENTATION LEARNING: GROMOV-WASSERSTEIN AUTOENCODER

In this work, we propose a novel GWAE modeling methodology based on the GW metric for distance structure modeling in the variational autoencoding formulation. The objectives of generative models typically aim at distribution matching in the data space, e.g., via the likelihood (Kingma & Welling, 2014) or the Jensen-Shannon divergence (Goodfellow et al., 2014). The GWAE objective differs from these approaches and aims to directly match the latent and data distributions based on their distance structure.

3.2.1 MODEL SETTINGS: VARIATIONAL AUTOENCODING

Given an $N$-sized set of data points $\{x_i\}_{i=1}^N$ supported on a data space $\mathcal{X}$, representation learning aims to build a latent space $\mathcal{Z}$ and obtain mappings between the two spaces. For numerical computation, we postulate that the spaces $\mathcal{X}$ and $\mathcal{Z}$ have tractable metrics $d_{\mathcal{X}}$ and $d_{\mathcal{Z}}$, respectively, such as the Euclidean distance (see Appendix B.1 for details), and let $M, L \in \mathbb{N} \setminus \{0\}$, $\mathcal{X} \subseteq \mathbb{R}^M$, and $\mathcal{Z} \subseteq \mathbb{R}^L$. We consider the bottleneck case $M \gg L$, as in existing representation learning methods (Kingma & Welling, 2014; Higgins et al., 2017a; Kim & Mnih, 2018), because the data space $\mathcal{X}$ is typically an $L$-dimensional manifold (Carlsson et al., 2008; Bengio et al., 2013).

We construct a model with a trainable latent prior $\pi_\theta(z)$ that approaches the data distribution $p_{\mathrm{data}}(x)$ in terms of distance structure. Following the standard VAE (Kingma & Welling, 2014), we consider a generative model $p_\theta(x, z)$ with parameters $\theta$ and an inference model $q_\phi(x, z)$ with parameters $\phi$. The generative process consists of the prior $\pi_\theta(z)$ and a decoder $p_\theta(x|z)$ parameterized with neural networks. Since the inverted generative process $p_\theta(z|x) = \pi_\theta(z)\, p_\theta(x|z) / p_\theta(x)$ is intractable in this scheme, an encoder $q_\phi(z|x) \approx p_\theta(z|x)$ is instead established using neural networks for parameterization. Thus, the generative model $p_\theta(x, z)$ and the inference model $q_\phi(x, z)$ are defined as

$$p_\theta(x, z) = \pi_\theta(z)\, p_\theta(x|z), \qquad q_\phi(x, z) = p_{\mathrm{data}}(x)\, q_\phi(z|x). \tag{5}$$

The empirical distribution $\hat{p}_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$ is used to estimate $p_{\mathrm{data}}(x)$. A Dirac decoder and a diagonal Gaussian encoder are used to alleviate deviations from the data manifold, as in Tolstikhin et al. (2018) (see Appendix B.1 for these details and formulations).

3.2.2 OPTIMAL TRANSPORT OBJECTIVE

Here, we focus on the latent space $\mathcal{Z}$ and transfer the underlying data structure to it. This highlights the main difference between GWAE and existing generative approaches. The training objective of GWAE is the GW metric between the metric measure spaces $(\mathcal{X}, d_{\mathcal{X}}, p_{\mathrm{data}}(x))$ and $(\mathcal{Z}, d_{\mathcal{Z}}, \pi_\theta(z))$:

$$\underset{\theta}{\text{minimize}}\;\; GW_\rho(p_{\mathrm{data}}(x), \pi_\theta(z))^{\rho}, \tag{6}$$

where $\rho \geq 1$ is a constant; we adopt $\rho = 1$ to alleviate the effect of outlier samples distant from the isometry, for training stability. Computing the exact GW value is difficult owing to the high dimensionality of both $x$ and $z$. Hence, we estimate and minimize the GW metric using the variational autoencoding scheme, which captures the latent factors of complex data in a stable manner. We recast the GW objective into a main GW estimator $\mathcal{L}_{GW}$ with three regularizations: a reconstruction loss $\mathcal{L}_W$, a joint dual loss $\mathcal{L}_D$, and an entropy regularization $\mathcal{R}_H$.

Estimated GW metric $\mathcal{L}_{GW}$. We use the generative model $p_\theta(x, z)$ as the coupling in Eq. (6), similarly to the WAE (Tolstikhin et al., 2018) methodology. The main loss $\mathcal{L}_{GW}$ estimates the GW metric as

$$\underset{\theta}{\text{minimize}}\;\; \mathcal{L}_{GW} := \mathbb{E}_{p_\theta(x, z)}\mathbb{E}_{p_\theta(x', z')}\!\left[ \left| d_{\mathcal{X}}(x, x') - C\, d_{\mathcal{Z}}(z, z') \right|^{\rho} \right], \tag{7}$$

$$\text{subject to}\;\; p_{\mathrm{data}}(x) = p_\theta(x), \tag{8}$$

where $C$ is a trainable scale constant that cancels out the scale degree of freedom, and $p_\theta(x)$ denotes the marginal $p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x, z)\, dz$.
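To make the estimator concrete, the following is a minimal PyTorch sketch (not the authors' released implementation) of a mini-batch version of Eq. (7) with $\rho = 1$; `x` and `z` are assumed to be flattened samples drawn jointly from $p_\theta(x, z)$, and `log_c` parameterizes the trainable scale $C$.

```python
import torch

def gw_loss(x, z, log_c, rho=1):
    """Mini-batch estimator of L_GW in Eq. (7) with trainable scale C = exp(log_c).

    x: (B, M) samples decoded from the generative model p_theta(x, z)
    z: (B, L) the latent codes those samples were decoded from
    """
    d_x = torch.cdist(x, x, p=2)   # pairwise d_X(x, x') within the batch
    d_z = torch.cdist(z, z, p=2)   # pairwise d_Z(z, z')
    c = log_c.exp()                # keep the scale constant positive
    # Average over all sample pairs approximates the double expectation.
    return (d_x - c * d_z).abs().pow(rho).mean()

# Usage sketch: z ~ prior, x = decoder(z), log_c = torch.nn.Parameter(torch.zeros(()))
```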
WAE-based $\mathcal{X}$-marginal condition $\mathcal{L}_W$. To obtain a numerical solution with stable training, Tolstikhin et al. (2018) relax the $\mathcal{X}$-matching condition of Eq. (8) into $\xi$-Wasserstein minimization ($\xi \geq 1$) using the variational autoencoding coupling. The WAE methodology (Tolstikhin et al., 2018) uses the inference model $q_\phi(x, z)$ to formulate the $\xi$-Wasserstein minimization as the reconstruction loss $\mathcal{L}_W$ with a $\mathcal{Z}$-matching condition:

$$\underset{\theta,\phi}{\text{minimize}}\;\; \mathcal{L}_W := \mathbb{E}_{q_\phi(x, z)}\mathbb{E}_{p_\theta(\tilde{x}|z)}\!\left[ d_{\mathcal{X}}(x, \tilde{x}) \right], \tag{9}$$

$$\text{subject to}\;\; q_\phi(z) = \pi_\theta(z), \tag{10}$$

where $d_{\mathcal{X}}$ is a distance function based on the $L^\xi$ metric. We adopt the setting $\xi = 2$ to retain the conventional Gaussian reconstruction loss.

Merged sufficient condition $\mathcal{L}_D$. We merge the marginal coupling conditions of Eq. (8) and Eq. (10) into the joint $\mathcal{X} \times \mathcal{Z}$-matching sufficient condition $p_\theta(x, z) = q_\phi(x, z)$ to attain bidirectional inference while preserving the stability of autoencoding. Since such joint distribution matching can also be relaxed into the minimization of $W_1(q_\phi(x, z), p_\theta(x, z))$, this condition is satisfied by minimizing the Kantorovich-Rubinstein dual form introduced by Arjovsky et al. (2017), as in Eq. (3). Practically, a 1-Lipschitz neural network (critic) $f_\psi$ estimates the supremum of Eq. (3), and the main model minimizes this estimated supremum:

$$\underset{\theta,\phi}{\text{minimize}}\;\; \underset{\psi}{\text{maximize}}\;\; \mathcal{L}_D := \mathbb{E}_{q_\phi(x, z)}\left[ f_\psi(x, z) \right] - \mathbb{E}_{p_\theta(x, z)}\left[ f_\psi(x, z) \right], \tag{11}$$

where $\psi$ denotes the critic parameters. To satisfy the 1-Lipschitz constraint, the critic $f_\psi$ is implemented with techniques such as spectral normalization (Miyato et al., 2018) and the gradient penalty (Gulrajani et al., 2017) (see Appendix B.3 for the details of the gradient penalty loss).

Entropy regularization $\mathcal{R}_H$. We further introduce the entropy regularization $\mathcal{R}_H$ using the inference entropy to avoid degenerate solutions in which the encoder $q_\phi(z|x)$ becomes Dirac and deterministic for all data points. In such degenerate solutions, the latent representation simply becomes a look-up table, because such a point-to-point encoder maps the set of data points into a set of latent code points with measure zero (Hoffman et al., 2017; Dai et al., 2018), causing overfitting to the empirical data distribution. An effective way to avoid this is a regularization with the inference entropy $H_q$ of the latent variables $z$ conditioned on the data $x$:

$$\mathcal{R}_H := H_q(z|x) = \mathbb{E}_{q_\phi(x, z)}\left[ -\log q_\phi(z|x) \right]. \tag{12}$$

Since the conditional entropy $H_q(z|x)$ diverges to negative infinity in the degenerate solutions, the regularization term $\mathcal{R}_H$ facilitates the probabilistic learning of GWAE models.

Stochastic Training with a Single Estimated Objective. Applying the method of Lagrange multipliers to the aforementioned constraints, we recast the GW metric of Eq. (6) into a single objective $\mathcal{L}$ with multipliers $\lambda_W$, $\lambda_D$, and $\lambda_H$:

$$\underset{\theta,\phi}{\text{minimize}}\;\; \underset{\psi}{\text{maximize}}\;\; \mathcal{L} := \mathcal{L}_{GW} + \lambda_W \mathcal{L}_W + \lambda_D \mathcal{L}_D - \lambda_H \mathcal{R}_H. \tag{13}$$

One efficient way to optimize this objective is mini-batch gradient descent with alternating steps (Goodfellow et al., 2014; Arjovsky et al., 2017), which can be conducted in automatic differentiation packages such as PyTorch (Paszke et al., 2019). One step of mini-batch descent is the minimization of the total objective $\mathcal{L}$ in Eq. (13), and the other step is the maximization of the critic objective $\mathcal{L}_D$ in Eq. (11). By alternately repeating these steps, the critic estimates the Wasserstein metric via the expected potential difference $\mathcal{L}_D$ (Arjovsky et al., 2017). Although the objective in Eq. (13) involves three auxiliary regularizations, including an adversarial term, the GWAE model can be efficiently optimized because the adversarial mechanism and the variational autoencoding scheme share the goal of the distribution matching $p_\theta(x, z) \approx q_\phi(x, z)$ (see Appendix C.5 for more details).
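A schematic sketch of one alternating round under the assumptions above: the critic ascends $\mathcal{L}_D$ of Eq. (11) while the main model descends the total objective of Eq. (13). The `critic`, optimizers, and loss-term arguments are placeholders for whatever implementation provides them; this is not the paper's released training loop.

```python
import torch

def critic_step(critic, critic_opt, x_q, z_q, x_p, z_p):
    """Update the critic parameters psi to maximize L_D in Eq. (11).

    (x_q, z_q) ~ q_phi(x, z) and (x_p, z_p) ~ p_theta(x, z); both should be
    detached from the encoder/decoder graph for this step."""
    gap = critic(x_q, z_q).mean() - critic(x_p, z_p).mean()  # L_D
    loss = -gap                                              # ascend L_D by descending -L_D
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return gap.item()

def model_step(model_opt, l_gw, l_w, l_d, r_h, lam_w=1.0, lam_d=1.0, lam_h=0.1):
    """Update (theta, phi) to minimize the total objective of Eq. (13)."""
    total = l_gw + lam_w * l_w + lam_d * l_d - lam_h * r_h
    model_opt.zero_grad()
    total.backward()
    model_opt.step()
    return total.item()
```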
3.2.3 PRIOR BY SAMPLING

GWAE models apply to cases in which the prior $\pi_\theta(z)$ takes the form of an implicit distribution with a sampler. An implicit distribution $\pi_\theta(z)$ provides a sampler $z \sim \pi_\theta(z)$ while a closed-form expression of its probability density function is not available. The adversarial algorithm of GWAE handles such cases and enables a wide class of priors to provide meta-prior-based inductive biases for unsupervised representation learning, e.g., for disentanglement (Locatello et al., 2019b; 2020). Note that the GW objective in Eq. (6) becomes a constant function if the prior is non-trainable.

Neural Prior (NP). A straightforward way to build a differentiable sampler for a trainable prior is using a neural network to convert noise. The prior of the latent variables $z$ is defined via sampling with a neural network $g_\theta: \mathbb{R}^L \to \mathbb{R}^L$ with parameters $\theta$ (see Appendix B.2 for its formulation). Notably, the neural network $g_\theta$ need not be invertible, unlike in Normalizing Flow (Rezende & Mohamed, 2015), since the prior is defined as an implicit distribution that does not require a push-forward density.

Factorized Neural Prior (FNP). For disentanglement, we can constitute a factorized prior using an element-wise independent neural network $g_\theta = \{g^{(i)}_\theta\}_{i=1}^L$ (see Appendix B.2 for its formulation). Such factorized priors can be easily implemented using 1-dimensional grouped convolutions (Krizhevsky et al., 2012).

Gaussian Mixture Prior (GMP). For a clustering structure, we construct a class of Gaussian mixture priors. Given that the prior contains $K$ components, the $k$-th component is parameterized using a weight $w_k$, a mean $m_k \in \mathbb{R}^L$, and a square-root covariance $M_k \in \mathbb{R}^{L \times L}$ as

$$\pi_\theta(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(z \,\middle|\, m_k, M_k M_k^{\mathsf{T}}\right), \tag{14}$$

where the weights $\{w_k\}_{k=1}^K$ are normalized such that $\sum_{k=1}^K w_k = 1$. To sample from a prior of this class, one randomly chooses a component $k$ from the $K$-way categorical distribution with probabilities $(w_1, w_2, \ldots, w_K)$ and draws a sample $z$ as

$$z = m_k + M_k \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_L), \tag{15}$$

where $0$ and $I_n$ denote the zero vector and the $n \times n$ identity matrix, respectively. In this class of priors, the set of trainable parameters consists of $\{(w_k, m_k, M_k)\}_{k=1}^K$. Note that this parameterization can be easily implemented in differentiable programming frameworks because $M_k M_k^{\mathsf{T}}$ is positive semidefinite for any $M_k \in \mathbb{R}^{L \times L}$.
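A minimal sketch of such a trainable Gaussian mixture prior sampler following Eqs. (14)-(15); the class and attribute names are illustrative assumptions rather than the paper's implementation, and the mixture weights are parameterized through a softmax to keep them normalized.

```python
import torch
import torch.nn as nn

class GaussianMixturePrior(nn.Module):
    """Trainable GMP sketch: pi_theta(z) = sum_k w_k N(z | m_k, M_k M_k^T)."""

    def __init__(self, n_components: int, latent_dim: int):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(n_components))                   # softmax -> weights w_k
        self.m = nn.Parameter(torch.randn(n_components, latent_dim))             # means m_k
        self.M = nn.Parameter(torch.eye(latent_dim).repeat(n_components, 1, 1))  # square-root covariances M_k

    def sample(self, n: int) -> torch.Tensor:
        w = torch.softmax(self.logit_w, dim=0)
        k = torch.multinomial(w, n, replacement=True)   # choose components (first step of Eq. (15))
        eps = torch.randn(n, self.m.size(1), 1)         # eps ~ N(0, I_L)
        return self.m[k] + (self.M[k] @ eps).squeeze(-1)  # z = m_k + M_k eps (Eq. (15))
```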
4 EXPERIMENTS

We investigated the capability of GWAE models for learning representations based on meta-priors.¹ We evaluated GWAEs under two principal meta-priors: disentanglement and clustering. To validate the effectiveness of GWAE on the different tasks for each meta-prior, we conducted each experiment in the corresponding experimental setting. We further studied autoencoding and generation to inspect the general capability of the models.

¹ In the tables of the quantitative evaluations, ↑ and ↓ indicate scores in which higher and lower values are better, respectively.

[Figure 1: two panels; (a) line plot of the Gromov-Wasserstein value over the number of epochs trained, with curves "Estimated GW" and "Empirical GW"; (b) 2D histogram over ∆z (horizontal) and ∆x (vertical).]
(a) The estimation of the GW metric in each epoch. (b) The isometry in GWAE.
Figure 1: The estimation and minimization of the GW metric. This trial of training is conducted in GWAE (NP, λD=1, λW=1, λH=1) using the MNIST (LeCun et al., 1998) dataset. (a) The curves show the GW values estimated by the loss term $\mathcal{L}_{GW}$ (solid, blue) and the empirical GW computed by the POT package (Flamary et al., 2021) (dashed, orange). The values are computed using the validation set. (b) The axes $\Delta_x = d_{\mathcal{X}}(x, x')$ (vertical) and $\Delta_z = d_{\mathcal{Z}}(z, z')$ (horizontal) respectively denote the differences in the data and latent spaces between generated samples $(x, z), (x', z') \sim p_\theta(x, z)$. The histogram contains 10,000 generated sample pairs.

4.1 EXPERIMENTAL SETTINGS

We compared the GWAE models with existing representation learning methods (see Appendix A for the details of the compared methods). For the experimental results in this section, we used four visual datasets: CelebA (Liu et al., 2015), MNIST (LeCun et al., 1998), 3D Shapes (Burgess & Kim, 2018), and Omniglot (Lake et al., 2015) (see Appendix C.1 for dataset details). For the quantitative evaluations, we selected hyperparameters from $\lambda_W \in [10^0, 10^1]$, $\lambda_D \in [10^0, 10^1]$, and $\lambda_H \in [10^{-4}, 10^0]$ based on their performance on the validation set. For fair comparisons, we trained networks with a consistent architecture from scratch for all the methods (see Appendix C.2 for architecture details).

4.2 GROMOV-WASSERSTEIN ESTIMATION AND MINIMIZATION

We validated the estimation and minimization of the GW metric in Fig. 1. First, to validate the estimation of the GW metric, we compared the GW metric estimated in GWAE with the empirical GW value computed by a conventional method in Fig. 1a. Against the GWAE models estimating the GW metric as in Eq. (7), the empirical GW values are computed by the standard OT framework POT (Flamary et al., 2021). Although the estimated $\mathcal{L}_{GW}$ is slightly higher than the empirical values, the two curves behave very similarly during the entire training process. This result supports that the GWAE model successfully estimated the GW values and yielded gradients that drive the distribution matching between the data and latent spaces. Second, to validate the minimization of the GW metric, we show the histogram of the pairwise differences of generated samples in the data and latent spaces in Fig. 1b. Isometry of the generated samples is attained if the generative coupling $p_\theta(x, z)$ attains the infimum in Eq. (4). The histogram shows that the generative model $p_\theta(x, z)$ acquired a nearly isometric latent embedding, suggesting that the GW metric was successfully minimized even though the objective in Eq. (13) contains three regularization loss terms (refer to Appendix C.8 for ablation studies and Appendix C.4 for comparisons). These two experimental results support that the GWAE models successfully estimated and optimized the GW objective.

4.3 LEARNING REPRESENTATIONS BASED ON META-PRIORS

Disentanglement. We investigated the disentanglement of representations obtained using GWAE models and compared them with conventional VAE-based disentanglement methods. Since element-wise independence in the latent space is postulated as the meta-prior for disentangled representation learning, we used the FNP class for the prior $\pi_\theta(z)$. Considering practical applications with unknown ground-truth factors, we set a relatively large latent size $L$ to avoid a shortage of dimensionality. The qualitative and quantitative results are shown in Fig. 2 and Table 1, respectively.

[Figure 2: scatter plots of the learned latent spaces.]
(a) VAE (Kingma & Welling, 2014). (b) FactorVAE (Kim & Mnih, 2018) (γ=10). (c) GWAE (FNP, λW=1, λD=10, λH=0.3).
Figure 2: Comparison of the learned latent spaces in 3D Shapes (Burgess & Kim, 2018) with L = 16. The vertical and horizontal axes in the scatter plots respectively represent the two of the 16 (= L) latent variables with the highest and the second-highest informativeness (Do & Tran, 2020) w.r.t. the object hue factor. Note that a single factor value varies along only one axis in a disentangled representation.

Table 1: Quantitative comparison of disentanglement. The reported scores were calculated in 3D Shapes (Burgess & Kim, 2018) with the latent size L = 16. Since the latent size L is larger than the number of ground-truth factors, the hyperparameter tuning was based on the validation-set DCI-C (Eastwood & Williams, 2018) values. To deal with the probabilistic scores (Zaidi et al., 2021), we report the ranges over five measurements. The details of the scores are provided in Appendix C.3.

| Model | DCI-C ↑ | DCI-D ↑ | DCI-I ↑ |
| --- | --- | --- | --- |
| VAE (Kingma & Welling, 2014) | 0.7734 ± 0.0004 | 0.6831 ± 0.0002 | 0.9914 ± 0.0003 |
| β-VAE (Higgins et al., 2017a) | 0.8245 ± 0.0002 | 0.7328 ± 0.0002 | 0.9796 ± 0.0002 |
| WAE (Tolstikhin et al., 2018) | 0.8288 ± 0.0004 | 0.7544 ± 0.0004 | 0.9959 ± 0.0001 |
| β-TCVAE (Chen et al., 2018) | 0.8347 ± 0.0003 | 0.7085 ± 0.0002 | 0.9880 ± 0.0002 |
| FactorVAE (Kim & Mnih, 2018) | 0.7963 ± 0.0004 | 0.7390 ± 0.0004 | 0.9961 ± 0.0002 |
| DIP-VAE-I (Kumar et al., 2018) | 0.8609 ± 0.0003 | 0.6984 ± 0.0003 | 0.9961 ± 0.0001 |
| DIP-VAE-II (Kumar et al., 2018) | 0.8236 ± 0.0001 | 0.7498 ± 0.0003 | 0.9957 ± 0.0002 |
| GWAE (FNP) | 0.9080 ± 0.0002 | 0.7024 ± 0.0002 | 0.9966 ± 0.0002 |

* The ranges are denoted by (mean) ± (standard error of the mean).
These results support the ability of GWAE to learn a disentangled representation of complex data. The scatter plots in Fig. 2 suggest that the GWAE model successfully extracted one underlying factor of variation (object hue) precisely along one axis, whereas the standard VAE (Kingma & Welling, 2014) formed several clusters for each value, and FactorVAE (Kim & Mnih, 2018) obtained the factor in quadrants.

Clustering Structure. We empirically evaluated the capability of capturing clusters using MNIST (LeCun et al., 1998). We compared the GWAE model using GMP with other VAE-based methods in terms of out-of-distribution (OoD) detection performance in Fig. 3. We used MNIST images as in-distribution (ID) samples for training and Omniglot (Lake et al., 2015) images as unseen OoD samples. The quantitative results show that the GWAE model successfully extracted the clustering structure, empirically implying the applicability of multimodal priors.

4.4 AUTOENCODING MODEL

We additionally studied the autoencoding and generation performance of GWAE models in Table 2 (see Appendix C.7 for qualitative evaluations). Although the distribution matching $p_\theta(x) \approx p_{\mathrm{data}}(x)$ is a collateral condition of Eq. (7), the quantitative results show that the GWAE model also compares favorably with existing autoencoding models in terms of generative capacity. This result suggests a substantial capture of the underlying low-dimensional distribution in GWAE models, which can lead to applications to other types of meta-priors.

[Figure 3: ROC curves (false positive rate vs. true positive rate) of the OoD detection.]
(a) VAE (Kingma & Welling, 2014), AUC=0.9957. (b) DAGMM (Zong et al., 2018), AUC=0.9654. (c) GWAE (GMP), AUC=1.0000.
Figure 3: The ROC curves of the OoD detection in MNIST (LeCun et al., 1998) against Omniglot (Lake et al., 2015). We trained these models using MNIST as ID samples and used Omniglot as OoD samples. We upsampled Omniglot to 10,000 samples for data balancing. For the anomaly detection using the latent codes $z$, we applied the negative log-likelihood energy $-\log \pi(z)$ for VAE and DAGMM, and used the estimated Kantorovich potential $\mathbb{E}_{p_\theta(\tilde{x}|z)}\left[ f_\psi(\tilde{x}, z) \right]$ for GWAE (see Appendix C.10 for more latent space details).
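As an illustration of how such per-sample OoD scores translate into the reported ROC/AUC values, the snippet below computes the curve with scikit-learn; the Gaussian score arrays are placeholders standing in for the actual $-\log \pi(z)$ energies (VAE, DAGMM) or estimated Kantorovich potentials (GWAE).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder anomaly scores: higher = "more out-of-distribution".
scores_id = np.random.default_rng(0).normal(0.0, 1.0, size=10000)   # ID samples (e.g., MNIST)
scores_ood = np.random.default_rng(1).normal(3.0, 1.0, size=10000)  # OoD samples (e.g., Omniglot, upsampled)

y_true = np.concatenate([np.zeros_like(scores_id), np.ones_like(scores_ood)])  # 1 = OoD
y_score = np.concatenate([scores_id, scores_ood])

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))
```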
Table 2: Quantitative comparisons of generation and reconstruction. The FID scores (Heusel et al., 2017) evaluate a random sample set from the generative model $p_\theta(x)$ (without using dataset images) against the entire test set, both consisting of an equal number of 19,962 samples. The PSNR scores measure the reconstruction $q_\phi(z)p_\theta(x|z)$ using test images (see Appendix C.3 for details). All reported values were computed in CelebA (Liu et al., 2015) with a latent size of L = 64. For all the methods, we applied early stopping (patience=10) and hyperparameter tuning using the validation set. The bold and underlined values respectively denote the best and the second-best performance in each score.

| Category | Model | FID ↓ | PSNR [dB] ↑ |
| --- | --- | --- | --- |
| Baseline | VAE (Kingma & Welling, 2014) | 130.9 | 19.96 |
| KL re-weighting | β-VAE (Higgins et al., 2017a) | 92.6 | 22.71 |
| | GECO (Rezende & Viola, 2018a) | 162.1 | 21.19 |
| | σ-VAE (Rybkin et al., 2021) | 53.13 | 20.03 |
| Hierarchical factors | LadderVAE (Sønderby et al., 2016) | 255.6 | 12.35 |
| | VLadderAE (Zhao et al., 2017) | 147.1 | 19.76 |
| OT-based models | WAE (Tolstikhin et al., 2018) | 55 | 22.70 |
| | WVI (Ambrogioni et al., 2018) | 295.0 | 14.45 |
| | SWAE (Kolouri et al., 2019) | 102.2 | 21.85 |
| | RAE (Xu et al., 2020) | 52.20 | 21.34 |
| Trainable priors | VampPrior (Tomczak & Welling, 2018) | 243.8 | 16.23 |
| | 2-Stage VAE (Dai & Wipf, 2019) | 34 | 16.15 |
| IVI-based models | VAE-GAN (Larsen et al., 2016) | 111.8 | 19.51 |
| | AVB (Mescheder et al., 2017) | 93.0 | 22.60 |
| | ALI (Dumoulin et al., 2017) | 171.8 | 12.26 |
| Ours | GWAE (NP) | 45.3 | 22.82 |

* The values are cited from the original papers annotated after the model names.

5 CONCLUSION

In this work, we have introduced a novel representation learning method that performs distance-distribution matching between the given unlabeled data and the latent space. Our GWAE model family transfers the distance structure from the data space into the latent space from the OT viewpoint, replacing the ELBO objective of variational inference with the GW metric. The GW objective provides a direct measure between the latent and data distributions. Qualitative and quantitative evaluations empirically show the performance of GWAE models in terms of representation learning. Further applications to various other types of meta-priors, such as spherical representations and non-Euclidean embedding spaces, remain open for future work.

REPRODUCIBILITY STATEMENT

We describe the implementation details in Section 4, Appendix B, and Appendix C. The dataset details are provided in Appendix C.1. To ensure reproducibility, our code is available online at https://github.com/ganmodokix/gwae and is provided as the supplementary material.

ACKNOWLEDGMENTS

This work was partly supported by AMED Grant Number JP21zf0127004 and JSPS KAKENHI Grant Number JP21H03456.

REFERENCES

Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis & Machine Intelligence, 40(12):2897 2905, 2018. doi: 10.1109/TPAMI.2017.2784440. 3, 16

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 19, 2018. URL https://openreview.net/forum?id=HyxQzBceg. 16

Luca Ambrogioni, Umut Güçlü, Yağmur Güçlütürk, Max Hinne, Marcel A. J. van Gerven, and Eric Maris. Wasserstein variational inference. In Proceedings of Neural Information Processing Systems (NIPS), pp. 2473 2482, 2018. URL https://papers.nips.cc/paper/2018/hash/2c89109d42178de8a367c0228f169bf8-Abstract.html. 9

Jyoti Aneja, Alex Schwing, Jan Kautz, and Arash Vahdat. A contrastive learning approach for training variational autoencoder priors. In Proceedings of Neural Information Processing Systems (NeurIPS), pp. 480 493, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/0496604c1d80f66fbeb963c12e570a26-Abstract.html. 2

Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 17, 2017. URL https://openreview.net/forum?id=Hk4_qw5xe. 20

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In Proceedings of the International Conference on Machine Learning (ICML), pp. 214 223, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html. 2, 3, 5, 6, 20 Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=Hyx-jy BFPr. 1 Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798 1828, 2013. doi: 10.1109/TPAMI.2013.50. 1, 4 Leo Breiman. Random forests. Machine Learning, 45(1):5 32, 2001. doi: 10.1023/A:1010933404324. Chris Burgess and Hyunjik Kim. 3D Shapes Dataset. https://github.com/deepmind/ 3d-shapes/, 2018. Accessed May 13, 2022. 2, 7, 8, 22 Gunnar Carlsson, Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision (IJCV), 76(1):1 12, 2008. doi: 10.1007/s11263-007-0056-x. 1, 4 Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Proceedings of Neural Information Processing Systems (NIPS), pp. 2610 2620, 2018. URL https://proceedings.neurips.cc/paper/ 2018/hash/1ee3dfcd8a0645a25a35977997223d22-Abstract.html. 1, 3, 8, 21 Published as a conference paper at ICLR 2023 Casey Chu, Kentaro Minami, and Kenji Fukumizu. Smoothness and stability in GANs. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 15, 2020. URL https://openreview.net/forum?id=HJe Oek HKwr. 21 Elliot Creager, David Madras, Joern-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1436 1445, 2019. URL https://proceedings.mlr.press/v97/creager19a.html. 1 Bin Dai and David Wipf. Diagnosing and enhancing vae models. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 12, 2019. URL https://openreview. net/forum?id=B1e0X3C9t Q. 9, 18, 27, 28, 35 Bin Dai, Yu Wang, John Aston, Gang Hua, and David Wipf. Connections with robust PCA and the role of emergent sparsity in variational autoencoder models. Journal of Machine Learning Research, 19(41):1 42, 2018. URL http://jmlr.org/papers/v19/17-704.html. 5 Bin Dai, Ziyu Wang, and David Wipf. The usual suspects? Reassessing blame for VAE posterior collapse. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2313 2322, 2020. URL https://proceedings.mlr.press/v119/dai20c.html. 1 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248 255, 2009. doi: 10.1109/CVPR.2009.5206848. 26 Nicki Skafte Detlefsen and Søren Hauberg. Explicit disentanglement of appearance and perspective in generative models. In Proceedings of Neural Information Processing Systems (Neur IPS), pp. 1018 1028, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 3493894fa4ea036cfc6433c3e2ee63b0-Abstract.html. 3 Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, and Zhuowen Tu. Guided variational autoencoder for disentanglement learning. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7920 7929, 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/ html/Ding_Guided_Variational_Autoencoder_for_Disentanglement_ Learning_CVPR_2020_paper.html. 1, 3 Kien Do and Truyen Tran. Theory and evaluation metrics for learning disentangled representations. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 30, 2020. URL https://openreview.net/forum?id=HJg K0h4Ywr. 8 Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 18, 2017. URL https://openreview.net/forum?id=BJt NZAFgg. 20 Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 18, 2017. URL https://openreview. net/forum?id=B1El R4cgg. 9, 20, 28, 35 Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 15, 2018. URL https://openreview.net/forum?id=By-7dz-AZ. 8, 24, 26 Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78): 1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html. 7, 37 Published as a conference paper at ICLR 2023 Benoit Gaujac, Ilya Feige, and David Barber. Learning disentangled representations with the wasserstein autoencoder. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Part III, pp. 69 84, 2021. doi: 10.1007/978-3-030-86523-8_5. 2, 3, 19 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of Neural Information Processing Systems (NIPS), pp. 2672 2680, 2014. URL https://papers.nips.cc/ paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html. 2, 4, 6, 19 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In Proceedings of Neural Information Processing Systems (NIPS), pp. 5769 5779, 2017. URL https://papers.nips.cc/paper/2017/hash/ 892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html. 5, 21 Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). ar Xiv: 1606.08415, 2016. 23, Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of Neural Information Processing Systems (NIPS), pp. 6629 6640, 2017. URL https://papers.nips.cc/paper/2017/hash/ 8a1d694707eb0fefe65871369074926d-Abstract.html. 9, 24, 26, 29 Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 
β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 22, 2017a. URL https://openreview.net/forum?id= Sy2fz U9gl. 1, 3, 4, 8, 9, 16, 21, 28, 32, 33, 34 Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1480 1490, 2017b. URL http://proceedings.mlr.press/v70/ higgins17a.html. 1 Matt Hoffman, Carlos Riquelme, and Matthew Johnson. The beta VAE s implicit prior. In Proceedings of Neural Information Processing Systems (NIPS) Workshop on Bayesian Deep Learning, pp. 1 5, 2017. URL https://research.google/pubs/pub47350/. 1, 5 Xianxu Hou, Ke Sun, Linlin Shen, and Guoping Qiu. Improving variational autoencoder with deep feature consistent and generative adversarial training. Neurocomputing, 341:183 194, 2019. doi: 10.1016/j.neucom.2019.03.013. 3 Wei-Ning Hsu, Yu Zhang, and James Glass. Learning latent representations for speech generation and transformation. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1273 1277, 2017. doi: 10.21437/Interspeech.2017-349. 1 Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1587 1596, 2017. URL https://proceedings.mlr.press/v70/hu17e.html. 1 Ferenc Huszár. Variational inference using implicit distributions. ar Xiv: 1702.08235, 2017. 2, 3, 19 Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2649 2658, 2018. URL http://proceedings. mlr.press/v80/kim18b.html. 1, 3, 4, 8, 17, 21, 27 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 13, 2015. URL https://openreview.net/forum?id=8gm Wwj Fy Lj. 23 Published as a conference paper at ICLR 2023 Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 13, 2014. URL https: //openreview.net/forum?id=33X9fd2-9Fy Zd. 1, 2, 3, 4, 8, 9, 16, 20, 21, 27, 29, 30, Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced wasserstein auto-encoders. In Proceedings of International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=H1xa Jn05FQ. 9 Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009. Master s thesis, Technical Report, University of Toronto. 22, 28, 31, 34 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Neural Information Processing Systems (NIPS), pp. 1097 1105, 2012. URL https://papers.nips.cc/paper/2012/hash/ c399862d3b9d6b76c8436e924a68c45b-Abstract.html. 6 Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 16, 2018. URL https://openreview.net/ forum?id=H1k G7GZAW. 3, 8 Brenden M. 
Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332 1338, 2015. doi: 10.1126/ science.aab3050. 7, 8, 9, 22 Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1558 1566, 2016. URL http://proceedings. mlr.press/v48/larsen16.html. 3, 9, 20, 21, 28, 35 Yann Le Cun, Léeon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. doi: 10.1109/5.726791. 2, 7, 8, 9, 22, 27, 28, 30, 33, 37 Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3730 3738, 2015. doi: 10.1109/ICCV.2015.425. 2, 7, 9, 22, 28, 29, 31, 32, 35, 36 Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. On the fairness of disentangled representations. In Proceedings of Neural Information Processing Systems (Neur IPS), pp. 14584 14597, 2019a. URL https://proceedings.neurips.cc/paper/2019/hash/ 1b486d7a5189ebe8d8c46afc64b0d1b4-Abstract.html. 1 Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4114 4124, 2019b. URL http://proceedings.mlr.press/v97/locatello19a.html. 6 Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. A sober look at the unsupervised learning of disentangled representations and their evaluation. Journal of Machine Learning Research, 21(209):1 62, 2020. URL http: //jmlr.org/papers/v21/19-976.html. 6 Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the Workshop on Deep Learning for Audio, Speech, and Language Processing, ICML (WDLASL), pp. 3 9, 2013. URL http://robotics.stanford. edu/~amaas/papers/relu_hybrid_icml2013_final.pdf. 25 Alireza Makhzani. Implicit autoencoders. ar Xiv: 1805.09804, 2018. 3 Published as a conference paper at ICLR 2023 Facundo Mémoli. Gromov-wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(1):417 487, 2011. doi: 10.1007/s10208-011-9093-5. 2, 4 Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2391 2400, 2017. URL http://proceedings. mlr.press/v70/mescheder17a.html. 9, 20, 32, 33, 34 Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR), pp. 1 26, 2018. URL https://openreview.net/forum?id= B1QRgzi T-. 5, 21, 25 Khai Nguyen, Son Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Improving relational regularized autoencoders with spherical sliced fused gromov wasserstein. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 11, 2021. URL https://openreview. 
net/forum?id=Di QD7FWL233. 2 Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Proceedings of Neural Information Processing Systems (Neur IPS), pp. 8024 8035, 2019. URL https://papers.nips.cc/paper/2019/hash/ bdbca288fee7f92f2bfa9f7012727740-Abstract.html. 6, 22 Danilo J. Rezende and Fabio Viola. Generalized elbo with constrained optimization, geco. In Proceedings of Neural Information Processing Systems (NIPS) Workshop on Bayesian Deep Learning, pp. 1 11, 2018a. URL http://bayesiandeeplearning.org/2018/papers/33.pdf. 2, Danilo J. Rezende and Fabio Viola. Taming vaes. ar Xiv: 1810.00597, 2018b. 2, 29 Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1530 1538, 2015. URL http://proceedings.mlr.press/v37/rezende15.html. 6 Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1278 1286, 2014. URL https://proceedings.mlr. press/v32/rezende14.html. 1 Oleh Rybkin, Kostas Daniilidis, and Sergey Levine. Simple and effective vae training with calibrated decoders. In Proceedings of the International Conference on Machine Learning (ICML), pp. 9179 9189, 2021. URL http://proceedings.mlr.press/v139/rybkin21a.html. 9 Thibault Sejourne, Francois-Xavier Vialard, and Gabriel Peyré. The unbalanced gromov wasserstein distance: Conic formulation and relaxation. In Proceedings of Neural Information Processing Systems (Neur IPS), volume 34, pp. 8766 8779, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 4990974d150d0de5e6e15a1454fe6b0f-Abstract.html. 4 Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Proceedings of Neural Information Processing Systems (NIPS), pp. 3745 3753, 2016. URL https://papers.nips.cc/paper/2016/hash/ 6ae07dcb33ec3b7c814df797cbda0f87-Abstract.html. 1, 3, 9, 18 Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. In Proceedings of International Conference on Learning Representations (ICLR), pp. 1 17, 2017. URL https://openreview.net/forum?id= S1RP6GLle. 2, 3 Published as a conference paper at ICLR 2023 Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. ar Xiv: 1208.0434, 2012. 4 Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012. doi: 10.1017/CBO9781139035613. 2, 3, 17, 19, 20 Valentin Thomas, Emmanuel Bengio, William Fedus, Jules Pondard, Philippe Beaudoin, Hugo Larochelle, Joelle Pineau, Doina Precup, and Yoshua Bengio. Disentangling the independently controllable factors of variation by interacting with the world. In Proceedings of Neural Information Processing Systems (NIPS) Workshop, Learning Disentangled Representations: from Perception to Control, pp. 1 9, 2017. URL https://acsweb.ucsd.edu/~wfedus/pdf/ICF_NIPS_ 2017_workshop.pdf. 1 Naftali Tishby, Fernando C. 
Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368 377, 1999. URL https://www.cs.huji.ac.il/labs/learning/Papers/ allerton.pdf. 16 Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 16, 2018. URL https://openreview.net/forum?id=Hk L7n1-0b. 2, 3, 4, 5, 8, 9, 18, 19, 27, 29 Jakub Tomczak and Max Welling. Vae with a vampprior. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1214 1223, 2018. URL https://proceedings.mlr.press/v84/tomczak18a.html. 9, 18 Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. In Proceedings of Neural Information Processing Systems (NIPS) Workshop on Bayesian Deep Learning, pp. 1 25, 2018. URL https://www.mins.ee.ethz.ch/pubs/ p/autoenc2018. 1, 3, 28 Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Proceedings of Neural Information Processing Systems (Neur IPS), volume 33, pp. 19667 19679, 2020. URL https://proceedings.neurips.cc/paper/2020/file/ e3b21256183cf7c2c7a66be163579d37-Paper.pdf. 1 Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579 2605, 2008. URL http://jmlr.org/papers/v9/ vandermaaten08a.html. 30, 38 Cédric Villiani. Optimal Transport: Old and New. Springer Berlin, 2009. doi: 10.1007/ 978-3-540-71050-9. 3 Hongteng Xu, Dixin Luo, Ricardo Henao, Svati Shah, and Lawrence Carin. Learning autoencoders with relational regularization. In Proceedings of the International Conference on Machine Learning (ICML), pp. 10576 10586, 2020. URL https://proceedings.mlr.press/v119/ xu20e.html. 2, 9, 19 Julian Zaidi, Jonathan Boilard, Ghyslain Gagnon, and Marc-André Carbonneau. Measuring disentanglement: A review of metrics. ar Xiv: 2012.09276, 2021. 8, 26 Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann Le Cun. Adversarially regularized autoencoders. In Proceedings of the International Conference on Machine Learning (ICML), pp. 5902 5911, 2018. URL https://proceedings.mlr.press/v80/zhao18b.html. 1 Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from generative models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4091 4099, 2017. URL https://proceedings.mlr.press/v70/zhao17c.html. 1, 3, 9, 18 Shengjia Zhao, Jiaming Song, and Stefano Ermon. Info VAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5885 5892, 2019. doi: 10.1609/aaai.v33i01.33015885. 2, 3, 17, 19 Published as a conference paper at ICLR 2023 Bo Zong, Qi Song, Martin Renquang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1 19, 2018. URL https://openreview.net/forum?id=BJJLHbb0-. 1, 3, 9, 30, 38 A DETAILS OF RELATED WORK For self-containment, we describe VAE-based representation learning methods. As with Section 3, x and z denote data and latent variables, respectively, and the data x are M-dimensional and the latent variables z are L-dimensional. 
Unless otherwise noted, each VAE-based model consists of a generative model pθ(x, z) with parameters θ, an inference model qϕ(x, z) with parameters ϕ, and a pre-defined (non-trainable) prior π(z), as in the standard VAE model architecture.

A.1 VAE-BASED MODELS WITH ELBO EXTENSION

Utilizing the latent variables of VAE-based models is a prominent approach to representation learning. Several models with extended ELBO-based objectives aim to overcome the shortcomings of the original VAE model, such as posterior collapse. VAE-based models are mainly grounded on the ELBO objective, where we denote the ELBO for a data point x as

$$\mathrm{ELBO}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \pi(z)), \qquad (16)$$

which is the per-sample form of the expected objective of the original VAE (Kingma & Welling, 2014) in Eq. (1).

A.1.1 β-VAE

β-VAE (Higgins et al., 2017a) is a VAE-based model for learning disentangled representations by re-weighting the KL term of the ELBO. Given a KKT multiplier β > 0, the β-VAE objective is expressed as

$$\underset{\theta, \phi}{\mathrm{maximize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)}\Big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \pi(z))\Big]. \qquad (17)$$

The KKT multiplier β works as the weight of the regularization that imposes a factorized prior (e.g., the standard Gaussian N(0, I_L)) on the latent variables. This re-weighting induces the capability of disentanglement in the case of β > 1; however, a large value of β causes posterior collapse, in which the latent variables forget the information of the input data. From the Information Bottleneck (IB) (Tishby et al., 1999) point of view, the β-VAE objective is re-interpreted as the following optimization problem (Alemi et al., 2018; Achille & Soatto, 2018):

$$\underset{\theta, \phi}{\mathrm{maximize}}\; I_\phi(z; y) \qquad (18)$$
$$\mathrm{subject\ to}\; I_\phi(z; x) \leq I_c, \qquad (19)$$

where I_c is a bottleneck capacity, y is a task to be estimated, and I_ϕ(·;·) denotes the mutual information on the inference model. Introducing the Lagrange multiplier β, the IB problem is given as

$$\underset{\theta, \phi}{\mathrm{maximize}}\; I_\phi(z; y) - \beta I_\phi(z; x). \qquad (20)$$

Alemi et al. (2018) have given the lower bound of this IB objective as

$$I_\phi(z; y) - \beta I_\phi(z; x) \geq \underbrace{\mathbb{E}_{p_{\mathrm{data}}(y) q_\phi(z|y)}[\log p_\theta(y|z)] + H(y)}_{\text{lower bound of } I_\phi(z; y)} - \beta \underbrace{\mathbb{E}_{p_{\mathrm{data}}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \pi(z))\big]}_{\text{upper bound of } I_\phi(z; x)}, \qquad (21)$$

where the task entropy H(y) is independent of the parameters θ and ϕ. The autoencoding task y = x gives an objective equivalent to that of the original VAE. This IB-based formulation of the β-VAE objective implies that a larger value of the multiplier β guides the training process to minimize the mutual information I_ϕ(z; x), making the encoder forget the input data, i.e., causing posterior collapse.
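As a concrete reference for Eq. (17), the following PyTorch sketch computes a per-batch β-VAE loss for a Gaussian encoder with a standard Gaussian prior and a Bernoulli decoder; the `encoder`/`decoder` callables and the Bernoulli likelihood are assumptions for illustration, not the exact models used in this paper.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, encoder, decoder, beta=4.0):
    """Negative beta-VAE objective (Eq. 17) averaged over one mini-batch.

    Assumes encoder(x) returns (mu, logvar) of q_phi(z|x) and
    decoder(z) returns Bernoulli logits of p_theta(x|z).
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)          # reparameterization trick
    logits = decoder(z)
    # one-sample Monte Carlo estimate of E_{q(z|x)}[log p(x|z)]
    rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum") / x.size(0)
    # closed-form KL(q_phi(z|x) || N(0, I))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return -(rec - beta * kl)                      # minimize the negative objective
```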
A.1.2 FACTORVAE

FactorVAE (Kim & Mnih, 2018) is a state-of-the-art disentanglement method that minimizes the Total Correlation (TC) of the aggregated posterior qϕ(z) = E_{pdata(x)}[qϕ(z|x)] in addition to the original ELBO objective. The TC is expressed as the KL divergence between a distribution and its factorized counterpart; in the FactorVAE case, the TC of the aggregated posterior is the KL divergence from the factorized aggregated posterior q̄ϕ(z) = ∏_{i=1}^{L} qϕ(z_i) to the aggregated posterior qϕ(z). The training objective of FactorVAE is the weighted sum of the ELBO and the TC term,

$$\underset{\theta, \phi}{\mathrm{maximize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)}[\mathrm{ELBO}(x; \theta, \phi)] - \gamma\, \mathrm{TC}(q_\phi(z)), \qquad (22)$$

where the TC of the latent variables z is defined and estimated as

$$\mathrm{TC}(q_\phi(z)) = D_{\mathrm{KL}}(q_\phi(z) \,\|\, \bar{q}_\phi(z)) \qquad (23)$$
$$\approx \mathbb{E}_{q_\phi(z)}\left[\log \frac{\mathrm{Disc}(z)}{1 - \mathrm{Disc}(z)}\right]. \qquad (24)$$

In Eq. (24), Disc(z) denotes a discriminator used to estimate the TC term by density ratio estimation (Sugiyama et al., 2012),

$$\mathrm{Disc} = \underset{f: \mathcal{Z} \to [0,1]}{\arg\max}\; \mathbb{E}_{q_\phi(z)}[\log f(z)] + \mathbb{E}_{\bar{q}_\phi(z)}[\log(1 - f(z))]. \qquad (25)$$

Practically, the discriminator is estimated with SGD in parallel with the main model, using samples from q̄ϕ(z) obtained by permuting the latent codes along the batch dimension independently for each latent variable.

A.1.3 INFOVAE

InfoVAE (Zhao et al., 2019) is an extension of the VAE that prevents posterior collapse by retaining the data information in the latent variables. The InfoVAE objective is the sum of the ELBO and the inference-model mutual information I_ϕ in Eq. (19). To this end, the following maximization problem is solved via SGD:

$$\underset{\theta, \phi}{\mathrm{maximize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)}[\mathrm{ELBO}(x; \theta, \phi)] + I_\phi(x; z) \qquad (26)$$
$$= \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z) \,\|\, \pi(z)). \qquad (27)$$

The main difference between the VAE and InfoVAE objectives is the use of the regularization term D_KL(qϕ(z) ‖ π(z)) instead of the original VAE regularization D_KL(qϕ(z|x) ‖ π(z)). The original KL term becomes zero if all the data points are encoded into the standard Gaussian N(0, I_L), causing posterior collapse. The InfoVAE KL term D_KL(qϕ(z) ‖ π(z)) alleviates this problem by adopting the aggregated posterior qϕ(z) for optimization instead of the encoder qϕ(z|x). The authors of InfoVAE (Zhao et al., 2019) further provide a model family in which the KL term is replaced with other divergences. They introduce an alternative divergence D(qϕ(z), π(z)) and its weight λ to conduct representation learning with the following training objective:

$$\underset{\theta, \phi}{\mathrm{maximize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)}[\mathrm{ELBO}(x; \theta, \phi)] + I_\phi(x; z) \qquad (28)$$
$$= \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \lambda D(q_\phi(z), \pi(z)). \qquad (29)$$

In the original InfoVAE paper (Zhao et al., 2019), the authors reported that the Maximum Mean Discrepancy (MMD) is the best choice for the divergence D. The MMD divergence MMD(qϕ(z), π(z)) is defined as

$$\mathrm{MMD}(q_\phi(z), \pi(z)) = \mathbb{E}_{q_\phi(z)}\mathbb{E}_{q_\phi(z')}[k(z, z')] + \mathbb{E}_{\pi(z)}\mathbb{E}_{\pi(z')}[k(z, z')] - 2\,\mathbb{E}_{q_\phi(z)}\mathbb{E}_{\pi(z')}[k(z, z')], \qquad (30)$$

where k(·,·) is any universal kernel, such as the radial basis function kernel

$$k(z, z') = \exp\!\big(-\|z - z'\|_2^2 / \sigma^2\big) \qquad (31)$$

for a constant σ > 0.
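As a reference for Eqs. (30)–(31), the sketch below computes a plug-in estimate of the MMD between a mini-batch of aggregated-posterior samples and a mini-batch of prior samples with an RBF kernel; the bandwidth σ and the biased plug-in estimator are illustrative assumptions.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # k(z, z') = exp(-||z - z'||_2^2 / sigma^2), Eq. (31)
    sq_dist = torch.cdist(a, b, p=2).pow(2)
    return torch.exp(-sq_dist / sigma ** 2)

def mmd(z_q, z_p, sigma=1.0):
    """Plug-in estimate of MMD(q_phi(z), pi(z)) in Eq. (30).

    z_q: samples from the aggregated posterior, shape (B, L)
    z_p: samples from the prior,                shape (B, L)
    """
    k_qq = rbf_kernel(z_q, z_q, sigma).mean()
    k_pp = rbf_kernel(z_p, z_p, sigma).mean()
    k_qp = rbf_kernel(z_q, z_p, sigma).mean()
    return k_qq + k_pp - 2.0 * k_qp
```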
A.2 VAE-BASED METHODS BASED ON HIERARCHICAL FACTORS

Several VAE-based methods postulate the existence of hierarchical factors as their meta-prior to learn representations at different levels of abstraction (Sønderby et al., 2016; Zhao et al., 2017). These methods change their network architecture to utilize the feature hierarchy often captured in the hidden layers of deep neural networks.

A.2.1 LADDER VARIATIONAL AUTOENCODER (LADDERVAE)

Ladder Variational Autoencoder (LadderVAE) (Sønderby et al., 2016) introduces hierarchical latent variables to the VAE model. Whereas the objective is still the ELBO, the LadderVAE model structure has hierarchical latent variables. The generative process is modeled as a Markov chain over several latent variable groups, and the inference model consists of deterministic feature encoders and the decoders shared with the generative model. In the original paper (Sønderby et al., 2016), the authors claim that LadderVAE models provide tighter log-likelihood lower bounds than the standard VAE.

A.2.2 VARIATIONAL LADDER AUTOENCODER (VLADDERAE)

Variational Ladder Autoencoder (VLadderAE) (Zhao et al., 2017) is a VAE-based model for hierarchical factors. Instead of hierarchical models based on Markov chains, VLadderAE introduces the hierarchical structure into the network architecture parameterizing the generative and inference models. Since it constrains the feature hierarchy through the process of feature extraction, VLadderAE also performs disentanglement; e.g., the latent variables from different hidden convolutional layers capture textural or global features of visual data.

A.3 VAE-BASED METHODS INVOLVING PRIOR LEARNING

The standard VAE model has a pre-defined prior, which may cause a discrepancy between the underlying data structure and the postulated prior (Dai & Wipf, 2019). Several methods overcome this problem by involving the prior itself in the training process.

A.3.1 VAMPPRIOR

VampPrior (Tomczak & Welling, 2018) is a type of prior consisting of a mixture of the encoder distributions at several pseudo-inputs. The pseudo-inputs are introduced as trainable parameters, which are fed into the encoder to build a mixture prior. Thus, VAE models with VampPriors have trainable priors while retaining the main training procedure using the reparameterization trick to apply SGD.

A.3.2 2-STAGE VAE

2-Stage VAE (Dai & Wipf, 2019) is a generative model with two probabilistic autoencoders. The process of 2-Stage VAE consists of two steps: (i) training a standard VAE using the given dataset as the input, and (ii) training another VAE using the latent variables of the previous VAE as the input. The 2-Stage VAE model attempts to overcome the discrepancy between the pre-defined prior and the learned latent representation by introducing the second VAE in stage (ii), which effectively trains a prior for the VAE of stage (i).

A.4 WASSERSTEIN AUTOENCODER (WAE)

WAE (Tolstikhin et al., 2018) is a family of generative models whose autoencoder estimates and minimizes the primal form of the Wasserstein metric between the generative model pθ(x) and the data distribution pdata(x) using SGD with the following objective:

$$\underset{\theta, \phi}{\mathrm{minimize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)} \mathbb{E}_{p_\theta(x'|z)}[d(x, x')] + \lambda D(q_\phi(z), \pi(z)), \qquad (32)$$

where λ is a Lagrange multiplier, the generative model is defined as a latent variable model pθ(x, z) = π(z)pθ(x|z) postulating the prior π(z) on the latent variables, and the conditional distribution qϕ(z|x) is a probabilistic encoder optimized in place of all couplings supported on X × X. The WAE objective is indeed equivalent to that of InfoVAE (Zhao et al., 2019) in Eq. (27), which provides an OT-based perspective on VAE-based models. Following InfoVAE (Zhao et al., 2019), we adopt the MMD for the divergence D, which is denoted by WAE-MMD in the original WAE paper (Tolstikhin et al., 2018). Although the WAE-based approaches rewrite VAE-based objectives with the Wasserstein metric, these metrics are between x-marginal distributions and do not directly involve the latent space Z. To learn representations z, the Wasserstein-based objective must be further modified (Gaujac et al., 2021).
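To make Eq. (32) concrete, the following sketch assembles one mini-batch of a WAE-MMD-style objective: a squared-error reconstruction cost plus a λ-weighted MMD between encoded latents and prior samples. The Gaussian reparameterized encoder, the squared-error cost d, the RBF kernel, and the `prior_sampler` interface are assumptions of the sketch, not the exact choices of the original WAE implementation.

```python
import torch

def wae_mmd_loss(x, encoder, decoder, prior_sampler, lam=100.0, sigma=1.0):
    """One mini-batch of a WAE-MMD objective in the spirit of Eq. (32)."""
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # z ~ q_phi(z|x)
    x_rec = decoder(z)
    rec = (x - x_rec).pow(2).flatten(1).sum(dim=1).mean()     # E[d(x, x')]
    z_prior = prior_sampler(z.size(0))                        # z ~ pi(z), shape (B, L)

    def k(a, b):  # RBF kernel, Eqs. (30)-(31)
        return torch.exp(-torch.cdist(a, b, p=2).pow(2) / sigma ** 2)

    mmd = k(z, z).mean() + k(z_prior, z_prior).mean() - 2 * k(z, z_prior).mean()
    return rec + lam * mmd
```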
A.5 RELATIONAL REGULARIZED AUTOENCODER (RAE)

Relational Regularized Autoencoder (RAE) (Xu et al., 2020) is a variational autoencoding generative model with a regularization loss based on the fused Gromov-Wasserstein (FGW) metric. RAE introduces the FGW metric between the aggregated posterior and the latent prior as the regularization divergence to fortify the WAE constraint πθ(z) = qϕ(z) introduced by Tolstikhin et al. (2018) for generative modeling. The FGW regularization is introduced with a weight hyperparameter β ∈ [0, 1] and given as

$$\underset{\theta, \phi}{\mathrm{minimize}}\; \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)} \mathbb{E}_{p_\theta(x'|z)}[d(x, x')] + \lambda D_{FGW}(q_\phi(z), \pi_\theta(z); \beta), \qquad (33)$$

where D_FGW denotes the FGW metric, an upper bound of the weighted sum of the Wasserstein and Gromov-Wasserstein metrics:

$$D_{FGW}(q_\phi(z), \pi_\theta(z); \beta) = \inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} \Big\{ (1-\beta)\, \mathbb{E}_{\gamma(z, z')}[d_{\mathcal{Z}}(z, z')] + \beta\, \mathbb{E}_{\gamma(z_1, z'_1)\gamma(z_2, z'_2)}\big[\big|d_{\mathcal{Z}}(z_1, z_2) - d_{\mathcal{Z}}(z'_1, z'_2)\big|\big] \Big\} \qquad (34)$$
$$\geq (1-\beta) \underbrace{\inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} \mathbb{E}_{\gamma(z, z')}[d_{\mathcal{Z}}(z, z')]}_{\text{Wasserstein term for direct comparison}} + \beta \underbrace{\inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} \mathbb{E}_{\gamma(z_1, z'_1)\gamma(z_2, z'_2)}\big[\big|d_{\mathcal{Z}}(z_1, z_2) - d_{\mathcal{Z}}(z'_1, z'_2)\big|\big]}_{\text{Gromov-Wasserstein term for relational comparison}}, \qquad (35)$$

where P(qϕ(z), πθ(z)) is the set of all couplings whose marginals are qϕ(z) and πθ(z). The discrepancy between the prior πθ(z) and the aggregated posterior qϕ(z) causes a degradation of generative performance, since the decoding process pdata(x)qϕ(z|x)pθ(x|z) and the generation process πθ(z)pθ(x|z) are then modeled in different regions of the latent space. This formulation enables learning a prior distribution while flexibly assuming structures of the data; the prior πθ(z) is modeled as a Gaussian mixture model in the original settings of Xu et al. (2020). RAE aims at matching distributions on the latent space Z, which have an identical dimensionality but may differ in terms of distance structure.

A.6 IVI METHODS

Beyond analytically tractable distributions, implicit distributions are applied to variational inference. An implicit distribution only requires a sampling method, which extends the variety of modeling and applications in variational inference and VAE-based models.

A.6.1 DENSITY RATIO ESTIMATION BY ADVERSARIAL DISCRIMINATORS

The density ratio estimation technique (Sugiyama et al., 2012) is essential to the mechanism of GANs (Goodfellow et al., 2014) and IVI methods (Huszár, 2017). It is conducted via an optimal discriminator f* between distributions r(x) and s(x) as

$$D_{\mathrm{KL}}(r(x) \,\|\, s(x)) = \mathbb{E}_{r(x)}\left[\log \frac{f^*(x)}{1 - f^*(x)}\right] = \mathbb{E}_{r(x)}\big[\log f^*(x) - \log(1 - f^*(x))\big], \qquad (36)$$

where

$$f^* = \underset{f: \mathcal{X} \to (0,1)}{\arg\max}\; \mathbb{E}_{r(x)}[\log f(x)] + \mathbb{E}_{s(x)}[\log(1 - f(x))]. \qquad (37)$$

The discriminator is estimated by maximizing Eq. (37) with a neural network f ≈ f*. The training of discriminators often suffers from instability and mode collapse owing to the alternating parameter updates based on Eq. (36) and Eq. (37) (Arjovsky & Bottou, 2017; Arjovsky et al., 2017). One approach to tackle this problem is imposing Lipschitz continuity on the discriminator based on the Kantorovich-Rubinstein duality (Arjovsky et al., 2017).

A.6.2 ADVERSARIAL VARIATIONAL BAYES (AVB)

Adversarial Variational Bayes (AVB) (Mescheder et al., 2017) is an ELBO optimization method using an adversarial training process instead of the analytical KL term. Recall that the KL term in Eq. (1) is defined by the expected density ratio as

$$D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \pi(z)) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{\pi(z)}\right]. \qquad (38)$$

Adopting the density ratio trick (Sugiyama et al., 2012), the analytical KL term can be replaced with the optimal discriminator, which takes a data point x and its encoder sample z ∼ qϕ(z|x) and outputs the density ratio qϕ(z|x)/π(z). It enables implicit distributions in the prior while retaining the ELBO objective of variational inference.
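The density-ratio trick of Eqs. (36)–(37) can be sketched as follows: a small discriminator is trained with the binary cross-entropy so that its logit approximates log r(x)/s(x), whose expectation under r(x) estimates the KL divergence, as exploited by AVB for the KL term. The network width and the optimizer step shown here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """f: X -> (0, 1) of Eq. (37); the raw output is the logit of f."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)  # logit(f(x)) = log f(x) - log(1 - f(x))

def discriminator_step(disc, opt, x_r, x_s):
    """Maximize E_r[log f] + E_s[log(1 - f)] (Eq. 37) by minimizing BCE."""
    logits_r, logits_s = disc(x_r), disc(x_s)
    loss = F.binary_cross_entropy_with_logits(logits_r, torch.ones_like(logits_r)) \
         + F.binary_cross_entropy_with_logits(logits_s, torch.zeros_like(logits_s))
    opt.zero_grad(); loss.backward(); opt.step()

def kl_estimate(disc, x_r):
    # D_KL(r || s) ~= E_r[log f - log(1 - f)] = E_r[logit(f)], Eq. (36)
    return disc(x_r).mean()
```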
A.6.3 ADVERSARIALLY LEARNED INFERENCE (ALI) / BIDIRECTIONAL GENERATIVE ADVERSARIAL NETWORKS (BIGAN)

Adversarially Learned Inference (ALI) (Dumoulin et al., 2017) and Bidirectional Generative Adversarial Networks (BiGAN) (Donahue et al., 2017) are models introducing distribution matching between the generative model and the inference model as implicit distributions. These models have been proposed in different papers (Dumoulin et al., 2017; Donahue et al., 2017); however, they share an equivalent methodology. One can draw samples from the generative model π(z)pθ(x|z) by decoding prior samples, and from the inference model pdata(x)qϕ(z|x) by encoding data points. The ALI/BiGAN models introduce a discriminator to estimate the Jensen-Shannon divergence between the generative model pθ(x, z) and the inference model qϕ(x, z). The matching between the encoder and the decoder also learns latent representations through the bidirectional mappings.

A.6.4 VAE-GAN

VAE-GAN (Larsen et al., 2016) is a hybrid model based on the VAE and GANs. The VAE-GAN models introduce a discriminator for generative modeling w.r.t. the data x and utilize the hidden layers of the discriminator to model the decoder likelihood pθ(x|z) along the manifolds supporting the data. It brings outstanding data generation performance to the VAE framework by measuring the similarity of data with a GAN-like network architecture.

B DETAILS OF PROPOSED METHOD

B.1 MODELING DETAILS

The decoder pθ(x|z) is modeled with a neural network Dθ: Z → R^M and its parameters θ as

$$p_\theta(x|z) = \delta(x - D_\theta(z)). \qquad (39)$$

Following the standard VAE settings (Kingma & Welling, 2014), the encoder qϕ(z|x) is defined as a diagonal Gaussian parameterized by neural networks µϕ: X → R^L and σ²ϕ: X → R^L₊ with parameters ϕ as

$$q_\phi(z|x) = \mathcal{N}\big(z \,\big|\, \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))\big). \qquad (40)$$

For the distance functions d_X and d_Z in Eq. (7) and Eq. (9), we used the L2 distance defined as

$$d_{\mathcal{X}}(x, x') = \tfrac{1}{2}\|x - x'\|_2, \qquad (41)$$
$$d_{\mathcal{Z}}(z, z') = \tfrac{1}{2}\|z - z'\|_2. \qquad (42)$$

As another choice, we also utilized the adversarially learned metric (Larsen et al., 2016) in Eq. (9). In the adversarially learned metric, the distance is measured in the feature space formed by the hidden outputs of the critic fψ. Let hψ(x) denote the critic hidden outputs when the critic takes x as its input. We can then define a distance d̃ based on the adversarially learned metric as

$$\tilde{d}(x, x') = \sqrt{d_{\mathcal{X}}(x, x')^2 + \tfrac{1}{2}\|h_\psi(x) - h_\psi(x')\|_2^2}. \qquad (43)$$

Since the critic network fψ(x, z) has a Y-shaped architecture (see Appendix C.2) and concatenates x-based features and z-based features in one of its hidden layers to take a pair (x, z) as the input, we use the x-side branch as hψ(x).

B.2 PRIOR DETAILS

Neural Prior (NP). Formally, the NP πθ(z) with a neural network gθ is defined as

$$\pi_\theta(z) = \int \pi(\epsilon)\, \delta\big(z - g_\theta(\epsilon)\big)\, d\epsilon, \qquad (44)$$

where

$$\pi(\epsilon) = \mathcal{N}(\epsilon \,|\, 0, I_L). \qquad (45)$$

We can implement this class of prior by sampling noises ϵ and setting z = gθ(ϵ), avoiding the calculation of the integral.

Factorized Neural Prior (FNP). For disentanglement in the variational autoencoding settings, element-wise independence is often imposed on the latent variables z. Following the standard VAE settings (Kingma & Welling, 2014), we postulate Z = R^L, where the latent variables z ∈ Z are expressed as an L-dimensional vector z = [z₁, z₂, . . . , z_L]ᵀ. As with the NP, the FNP class of prior is defined as

$$\pi_\theta(z) = \prod_{i=1}^{L} \pi^{(i)}_\theta(z_i), \qquad (46)$$

where

$$\pi^{(i)}_\theta(z_i) = \int \pi(\epsilon^{(i)})\, \delta\big(z_i - g^{(i)}_\theta(\epsilon^{(i)})\big)\, d\epsilon^{(i)}, \quad (i = 1, 2, \ldots, L) \qquad (47)$$
$$\pi(\epsilon^{(i)}) = \mathcal{N}(\epsilon^{(i)} \,|\, 0, 1). \quad (i = 1, 2, \ldots, L) \qquad (48)$$

This prior can be implemented with L disjoint neural networks or with 1-dimensional grouped convolutions. The difference between the NP and the FNP is element-wise independence, in which the prior πθ(z) is factorized into distributions over each latent variable. Factorized priors enable disentanglement by obtaining a representation comprising independent factors of variation (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018).
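The two trainable prior families can be sketched as samplers that push standard Gaussian noise through a network, following Eqs. (44)–(48): an unconstrained MLP for the NP, and L disjoint per-dimension networks, realized with 1-dimensional grouped convolutions, for the FNP. The hidden widths below are illustrative, and the batch-normalization layer used in the actual samplers (Appendix C.2) is omitted for brevity.

```python
import torch
import torch.nn as nn

class NeuralPrior(nn.Module):
    """z = g_theta(eps), eps ~ N(0, I_L): the pushforward prior of Eqs. (44)-(45)."""
    def __init__(self, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, latent_dim))

    def forward(self, n_samples):
        eps = torch.randn(n_samples, self.net[0].in_features)
        return self.net(eps)

class FactorizedNeuralPrior(nn.Module):
    """L disjoint scalar samplers z_i = g_theta^(i)(eps_i), Eqs. (46)-(48),
    realized with grouped 1-dimensional convolutions (groups=L)."""
    def __init__(self, latent_dim, hidden_per_dim=16):
        super().__init__()
        h = hidden_per_dim * latent_dim
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, h, kernel_size=1, groups=latent_dim), nn.SiLU(),
            nn.Conv1d(h, h, kernel_size=1, groups=latent_dim), nn.SiLU(),
            nn.Conv1d(h, latent_dim, kernel_size=1, groups=latent_dim))

    def forward(self, n_samples):
        eps = torch.randn(n_samples, self.latent_dim, 1)   # length-1 sequences
        return self.net(eps).squeeze(-1)
```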
B.3 GRADIENT PENALTY

In the case of the gradient penalty (Gulrajani et al., 2017), the maximization in Eq. (11) is further modified as

$$\underset{\psi}{\mathrm{maximize}}\; \mathcal{L}_D - \lambda_{GP}\, \mathbb{E}_{q_\phi(x, z)} \mathbb{E}_{p_\theta(x', z')} \mathbb{E}_{\epsilon \sim U(0, 1)}\Big[\big(\|\nabla_{(\tilde{x}, \tilde{z})} f_\psi(\tilde{x}, \tilde{z})\|_2 - 1\big)^2\Big], \qquad (49)$$

where λ_GP > 0 is a constant, and x̃ = ϵx + (1 − ϵ)x′ and z̃ = ϵz + (1 − ϵ)z′ are samples interpolated with the random uniform noise ϵ. We adopt λ_GP = 10 in all the experiments reported in this paper. Introducing the gradient penalty together with other techniques such as spectral normalization (Miyato et al., 2018) is effective and often essential for adversarial learning in general (Chu et al., 2020; Miyato et al., 2018).

C EXPERIMENTAL DETAILS

For the reported experimental results, we used a single NVIDIA GeForce RTX 2080 Ti GPU, and a single run of the entire GWAE training process until convergence takes about eight hours.

C.1 DATASET DETAILS

For the reported experiments in Section 4, we used the following datasets:

MNIST (LeCun et al., 1998). The MNIST dataset contains 70,000 handwritten digit images of 10 classes, comprising 60,000 training images and 10,000 test images. We used the original test set and randomly split the original training set into 54,000 training images and 6,000 validation images. We used the class information as approximate factors of variation in the form of 10-dimensional dummy variables. This dataset is available online (http://yann.lecun.com/exdb/mnist/) in its original format or via the torchvision package (https://github.com/pytorch/vision) in the PyTorch (Paszke et al., 2019) tensor format. The MNIST dataset is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license (https://creativecommons.org/licenses/by-sa/3.0/).

CelebA (Liu et al., 2015). The CelebA dataset contains 202,599 aligned face images with 40 binary attributes. We cropped 144×144 pixels at the center of the 178×218-sized aligned images in the original dataset to omit excessive backgrounds. We used the train/validation/test partitions provided by the original authors. We used the binary attributes as approximate factors of variation in the form of 40-dimensional vectors. As stated on the dataset website (https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), the CelebA dataset is available for non-commercial research purposes only.

3D Shapes (Burgess & Kim, 2018). The 3D Shapes dataset contains 480,000 synthetic images with six ground-truth factors of variation. The images in this dataset contain a single-colored 3D object, a single-colored wall of a rectangular room, and a single-colored floor. These images are procedurally generated from the independent factors of variation floor colour, wall colour, object colour, scale, shape, and orientation (Burgess & Kim, 2018). We randomly split the entire dataset into 384,000/48,000/48,000 images for the train/validation/test sets, respectively. Since the factor shape is a categorical variable with four classes, we converted it into four dummy variables to obtain quantitative factors of variation in the form of 9-dimensional vectors. The repository of this dataset (https://github.com/deepmind/3d-shapes) is licensed under Apache License 2.0 (http://www.apache.org/licenses/).
Omniglot (Lake et al., 2015). The Omniglot dataset contains 1,623 images of hand-written characters from 50 different alphabets written by 20 different people. The images are 105×105-sized and binary-valued. We used this dataset as OoD samples against MNIST in the evaluations of OoD detection utilizing the cluster structure. The repository of this dataset (https://github.com/brendenlake/omniglot) is licensed under the MIT License (https://opensource.org/licenses/MIT).

CIFAR-10 (Krizhevsky & Hinton, 2009). The CIFAR-10 dataset contains 60,000 images of 10 classes, comprising 50,000 training images and 10,000 test images. The images are 32×32 color images of 10 natural image classes, such as airplane and cat. This dataset is provided online (https://www.cs.toronto.edu/~kriz/cifar.html) without any specific license.

In all the datasets above, we used the images in the raster (bitmap) representation and resized them to 64×64 pixels with three channels, where each image is a 3×64×64-sized tensor and M = 12,288. For gray-scale (one-channel) images such as those in MNIST, we repeated each image three times along the channel dimension to unify the sizes to 3×64×64 elements.

C.2 ARCHITECTURE DETAILS

The neural-network architectures in GWAE and the compared methods are built with convolutions and deconvolutions (transposed convolutions) under the same settings, as shown in Tables 3 and 4. In all the experiments on GWAE, we applied the gradient penalty and spectral normalization in the critic networks to impose 1-Lipschitz continuity on the critic fψ, as shown in Table 7. In the neural samplers of the GWAE models, we used the fully-connected architecture in Table 5 for the NP and the grouped-convolutional architecture in Table 6 for the FNP; i.e., fully-connected layers for unconstrained priors (NP) and 1-dimensional grouped convolution layers (operating on sequences of length 1 with L channels) for factorized priors (FNP). For the optimizers of GWAE, we used RMSProp (https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) with a learning rate of 10⁻⁴ for the main autoencoder network and RMSProp with a learning rate of 5×10⁻⁵ for the critic network. For all the compared methods except GWAE, we used the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 10⁻⁴. In the experiments, we used an equal batch size of 64 for all evaluated models. The batch size is relatively small because the computational cost of GWAE for each batch is quadratic in the batch size B, and the GW estimation runs in O(NB) time per epoch using N/B batches.

Table 3: Model architecture for the encoders in the GWAE models and the compared models. For the 64×64 RGB images used in the experiments, the input size is set to (Channels, Height, Width) = (3, 64, 64). FC and Conv denote fully-connected (linear) layers and convolutional layers, respectively.
  Inverse Sigmoid: σ⁻¹(x) = log(x / (1 − x))
  Conv, 3×64×64 → 32×32×32 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  Conv, 32×32×32 → 64×16×16 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  Conv, 64×16×16 → 128×8×8 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  Conv, 128×8×8 → 256×4×4 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256×4×4 → 256 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256 → L for µ and L for σ² (bias=True)
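For reference, the encoder architecture of Table 3 translates directly into PyTorch as sketched below; having the final head output log σ² (rather than σ² directly) and clamping inside the inverse sigmoid are implementation assumptions of the sketch.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Gaussian encoder q_phi(z|x) following the layer list in Table 3."""
    def __init__(self, latent_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),     # 3x64x64 -> 32x32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),    # -> 64x16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),   # -> 128x8x8
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.SiLU(),  # -> 256x4x4
        )
        self.fc = nn.Sequential(nn.Linear(256 * 4 * 4, 256), nn.SiLU())
        self.head = nn.Linear(256, 2 * latent_dim)  # L values for mu, L for (log-)variance

    def forward(self, x, eps=1e-6):
        x = x.clamp(eps, 1 - eps)
        x = torch.log(x / (1 - x))                  # inverse sigmoid input transform
        h = self.fc(self.conv(x).flatten(1))
        mu, logvar = self.head(h).chunk(2, dim=-1)
        return mu, logvar
```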
In the case that a batch normalization layer is introduced at the encoder outputs, qϕ(z|x) = N(z | µ̃(x), diag(σ̃²(x))), the mean and variance are computed w.r.t. the aggregated posterior qϕ(z) rather than as the element-wise sample mean and variance of the L-dimensional output values. The normalized parameters (µ̃(x), σ̃²(x)) of the original parameters (µ(x), σ²(x)) are given as

$$\tilde{\mu}(x) = \frac{\mu(x) - \mathbb{E}_{q_\phi(z)}[z]}{\sqrt{\mathbb{V}_{q_\phi(z)}[z]}}, \qquad (50)$$
$$\tilde{\sigma}^2(x) = \frac{\sigma^2(x)}{\mathbb{V}_{q_\phi(z)}[z]}, \qquad (51)$$

where the division is conducted element-wise, and V denotes the variance. The mean E_{qϕ(z)}[z] and variance V_{qϕ(z)}[z] are approximated using unbiased estimators built from mini-batch samples. Given a mini-batch index set B ⊆ {1, 2, . . . , N}, the unbiased estimates are expressed using the law of total variance as

$$\mathbb{E}_{q_\phi(z)}[z] \approx \frac{1}{\#\mathcal{B}} \sum_{i \in \mathcal{B}} \mu(x_i) =: \hat{\mu}, \qquad (52)$$
$$\mathbb{V}_{q_\phi(z)}[z] \approx \frac{1}{\#\mathcal{B}} \sum_{i \in \mathcal{B}} \sigma^2(x_i) + \frac{1}{\#\mathcal{B} - 1} \sum_{i \in \mathcal{B}} \big(\mu(x_i) - \hat{\mu}\big)^2. \qquad (53)$$
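A sketch of this aggregated-posterior normalization (Eqs. (50)–(53)) for one mini-batch of encoder outputs; using the mini-batch statistics directly, without running averages, is an assumption of the sketch.

```python
import torch

def normalize_posterior(mu, sigma_sq):
    """Normalize encoder outputs w.r.t. the aggregated posterior q_phi(z).

    mu, sigma_sq: shape (B, L), parameters of q_phi(z|x_i) over a mini-batch.
    Returns the normalized parameters of Eqs. (50)-(51).
    """
    B = mu.size(0)
    mean_z = mu.mean(dim=0)                                          # Eq. (52)
    # law of total variance, Eq. (53): mean of sigma^2 plus unbiased variance of the means
    var_z = sigma_sq.mean(dim=0) + (mu - mean_z).pow(2).sum(dim=0) / (B - 1)
    mu_tilde = (mu - mean_z) / var_z.sqrt()                          # Eq. (50)
    sigma_sq_tilde = sigma_sq / var_z                                # Eq. (51)
    return mu_tilde, sigma_sq_tilde
```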
Table 4: Model architecture for the decoders in the GWAE models and the compared models. The image shape is the same as in Table 3. FC and DeConv denote fully-connected layers and deconvolutional layers, respectively.
  FC, L → 256 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256 → 256×4×4 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  DeConv, 256×4×4 → 128×8×8 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  DeConv, 128×8×8 → 64×16×16 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  DeConv, 64×16×16 → 32×32×32 (kernel_size=4, stride=2, padding=1)
  SiLU activation (Hendrycks & Gimpel, 2016)
  DeConv, 32×32×32 → 3×64×64 (kernel_size=4, stride=2, padding=1)
  Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)

Table 5: Model architecture for the samplers in the GWAE models with NP. FC denotes a fully-connected layer.
  FC, L → 256 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256 → 256 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256 → 256 (bias=True)
  SiLU activation (Hendrycks & Gimpel, 2016)
  FC, 256 → L (bias=True)
  Batch Normalization with affine=False

Table 6: Model architecture for the samplers in the GWAE models with FNP. GroupConv denotes 1-dimensional grouped convolutional layers.
  GroupConv, L → 256 (bias=True, groups=L)
  SiLU activation (Hendrycks & Gimpel, 2016)
  GroupConv, 256 → 256 (bias=True, groups=L)
  SiLU activation (Hendrycks & Gimpel, 2016)
  GroupConv, 256 → 256 (bias=True, groups=L)
  SiLU activation (Hendrycks & Gimpel, 2016)
  GroupConv, 256 → L (bias=True, groups=L)
  Batch Normalization with affine=False

Table 7: Model architecture for the critics in the GWAE models. We concatenated the outputs of the x-side and z-side branches and multiplied the concatenated outputs by 0.5 as the input to the stem network for the sake of the gradient norm, resulting in a Y-shaped network. We applied spectral normalization (Miyato et al., 2018) to all the layers in the critic networks and used the LeakyReLU (Maas et al., 2013) activation to retain the 1-Lipschitz continuity. FC and Conv denote fully-connected layers and convolutional layers, respectively.
  x-side branch:
    Conv, 3×64×64 → 8×32×32 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    Conv, 8×32×32 → 16×16×16 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    Conv, 16×16×16 → 32×8×8 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    Conv, 32×8×8 → 64×4×4 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    Conv, 64×4×4 → 128×2×2 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    Conv, 128×2×2 → 256×1×1 (kernel_size=4, stride=2, padding=1)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    FC, 256 → 64 (bias=True)
  z-side branch:
    FC, L → 256 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    FC, 256 → 256 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    FC, 256 → 64 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
  Stem network:
    FC, 64+64 → 256 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    FC, 256 → 256 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2
    FC, 256 → 1 (bias=True)
    LeakyReLU activation (Maas et al., 2013) with negative slope 0.2

C.3 QUANTITATIVE EVALUATION DETAILS

For the quantitative evaluations, we used the DCI scores (Eastwood & Williams, 2018) for disentanglement, the FID score (Heusel et al., 2017) for image generation, and the PSNR score for image reconstruction.

C.3.1 DCI SCORES

The DCI scores (Eastwood & Williams, 2018) measure a representation in terms of disentangled representation learning. In the DCI scores, disentanglement is measured from three aspects: (i) each representation variable represents a single factor of variation, (ii) each factor of variation is expressed by a single representation variable, and (iii) the representation is informative w.r.t. the original data. The correspondence of variables and factors is computed by estimating the ground-truth factors from the representation using a random forest (Breiman, 2001). DCI Disentanglement (DCI-D) measures (i), the factor singleness for each variable. DCI Completeness (DCI-C) measures (ii), the variable singleness for each factor. DCI Informativeness (DCI-I) measures (iii), whether the representation is informative for estimating the ground-truth factors. These metrics are computed via the variable importances (e.g., importances based on the Gini impurity (Breiman, 2001)) of the random forest regressor that estimates the ground-truth factors from the representation variables. Using the L-dimensional representation variables z, the V-dimensional factors y, and the importance R_ik of the i-th variable z_i for the k-th factor y_k, the DCI-D and DCI-C scores for each variable and each factor are defined as

$$\text{DCI-D}_i = 1 + \sum_{k=1}^{V} p_{ik} \log_V p_{ik}, \quad (i = 1, 2, \ldots, L) \qquad (54)$$
$$\text{where } p_{ik} = \frac{R_{ik}}{\sum_{j=1}^{V} R_{ij}}, \qquad (55)$$
$$\text{DCI-C}_k = 1 + \sum_{i=1}^{L} q_{ik} \log_L q_{ik}, \quad (k = 1, 2, \ldots, V) \qquad (56)$$
$$\text{where } q_{ik} = \frac{R_{ik}}{\sum_{j=1}^{L} R_{jk}}. \qquad (57)$$

The DCI-D score for the entire variable set is given by the weighted sum $\sum_{i=1}^{L} \rho_i\, \text{DCI-D}_i$, where the weight is the relative importance $\rho_i = \big(\sum_{k=1}^{V} R_{ik}\big) / \big(\sum_{i=1}^{L} \sum_{k=1}^{V} R_{ik}\big)$. The DCI-C score for the entire factor set is given by the average $\frac{1}{V} \sum_{k=1}^{V} \text{DCI-C}_k$. The DCI-D and DCI-C metrics take values within the range [0, 1], where higher values indicate better performance.
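Given the L×V importance matrix R from the random-forest regressors, the per-variable and per-factor scores of Eqs. (54)–(57) and their aggregation reduce to the entropy computations sketched below; the small constant `eps` guarding the logarithms is an implementation assumption.

```python
import numpy as np

def dci_scores(R, eps=1e-12):
    """DCI-D and DCI-C from an importance matrix R of shape (L, V).

    R[i, k]: importance of latent variable z_i for ground-truth factor y_k.
    """
    L, V = R.shape
    p = R / (R.sum(axis=1, keepdims=True) + eps)                 # Eq. (55), rows sum to 1
    q = R / (R.sum(axis=0, keepdims=True) + eps)                 # Eq. (57), columns sum to 1
    d_i = 1.0 + (p * np.log(p + eps) / np.log(V)).sum(axis=1)    # Eq. (54), log base V
    c_k = 1.0 + (q * np.log(q + eps) / np.log(L)).sum(axis=0)    # Eq. (56), log base L
    rho = R.sum(axis=1) / (R.sum() + eps)                        # relative importances
    dci_d = float((rho * d_i).sum())                             # weighted disentanglement
    dci_c = float(c_k.mean())                                    # average completeness
    return dci_d, dci_c
```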
For DCI-I, we used the normalized definition of Zaidi et al. (2021), because the normalized DCI-I values lie within the range [0, 1] with higher values meaning better informativeness, whereas the DCI-I score DCI-I_Original is the estimation mean squared error in the original definition. The DCI-I definition that we used is expressed as

$$\text{DCI-I} = 1 - 6\, \text{DCI-I}_{\text{Original}}. \qquad (58)$$

Following the original paper (Eastwood & Williams, 2018), we set the number of random trees to 10 and decided the tree depth with cross-validation.

C.3.2 FRÉCHET INCEPTION DISTANCE (FID)

The Fréchet Inception Distance (FID) (Heusel et al., 2017) is a score for evaluating the quality of the images produced by generative models. The FID score is defined as the squared 2-Wasserstein metric between the features of the real images, with mean and covariance (µ_r, Σ_r), and those of the generated images, with mean and covariance (µ_g, Σ_g). Assuming that the features are normally distributed in the feature space, the FID score is expressed as

$$\text{FID} = W_2^2\big(\mathcal{N}(\mu_r, \Sigma_r), \mathcal{N}(\mu_g, \Sigma_g)\big) \qquad (59)$$
$$= \|\mu_r - \mu_g\|_2^2 + \mathrm{tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{\frac{1}{2}}\big). \qquad (60)$$

Since the Wasserstein metric measures the discrepancy between distributions, lower FID values indicate better generation performance. Following the original FID paper (Heusel et al., 2017), we used the features obtained from the final pooling layer of an Inception-v3 network pre-trained on the ImageNet dataset (Deng et al., 2009).

C.3.3 PEAK SIGNAL-TO-NOISE RATIO (PSNR)

For measuring image reconstruction, we used the Peak Signal-to-Noise Ratio (PSNR), defined as

$$\text{PSNR} = 20 \log_{10}(\text{MAX}) - 10 \log_{10}(\text{MSE}), \qquad (61)$$

where MAX denotes the maximum possible pixel value and MSE denotes the mean squared error. In all the experiments conducted in Section 4, MAX is set to MAX = 1 because the input images are scaled within the range [0, 1].
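Once the Inception-v3 pooling features of the real and generated images have been collected, Eq. (60) is a closed-form function of their Gaussian statistics; the sketch below assumes precomputed feature matrices and relies on scipy's matrix square root.

```python
import numpy as np
from scipy import linalg

def fid_from_features(feat_real, feat_gen):
    """FID (Eq. 60) from Inception feature matrices of shape (N, D)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # (Sigma_r Sigma_g)^(1/2)
    if np.iscomplexobj(covmean):                               # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```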
C.4 ISOMETRY COMPARISON

Regarding the evaluations in Section 4.2, we further conducted comparisons on isometry in Fig. 4. The results show that the GWAE models provide more isometric autoencoders than the other VAE-based representation learning methods. The existing VAE-based methods did not attain isometry as tight as GWAE, which supports the view that the GW metric works as a different objective class from the ELBO. This implies that the GW metric loss substantially affects the training procedure of learning representations.

[Figure 4 histogram panels: (a) VAE (Kingma & Welling, 2014); (b) FactorVAE (Kim & Mnih, 2018); (c) 2-Stage VAE (Dai & Wipf, 2019); (d) WAE (Tolstikhin et al., 2018); and GWAE.]
Figure 4: Histograms of the differences in MNIST (LeCun et al., 1998). Each histogram consists of 10,000 samples of (∆x, ∆z), where ∆x (vertical) and ∆z (horizontal) respectively denote the differences ∆x = d_X(x, x′) and ∆z = d_Z(z, z′) of two generative samples (x, z), (x′, z′) ∼ pθ(x, z). In all reported results, including FactorVAE (Kim & Mnih, 2018) (γ=3), WAE (Tolstikhin et al., 2018), and GWAE (NP, λ_D=1, λ_W=1, λ_H=1), the latent dimension L was set to L = 16, and the priors were set to the standard Gaussian.

Table 8: The effect of prior family selection in the GWAE model. The same settings as in Table 1 are applied to all the reported models.
  Model        DCI-C    DCI-D    DCI-I
  GWAE (NP)    0.3966   0.3113   0.9403
  GWAE (FNP)   0.9080   0.7024   0.9966
  GWAE (GMP)   0.4247   0.4373   0.9655

C.5 TRAINING PROCESS

We present the training process of GWAE in Fig. 5. Although the objective may seem complex because it is composed of four different losses, the training process successfully converged, and the values of the terms L_GW, L_W, and L_D jointly descended during most of the training. Although the term R_H increased, its values did not diverge, preventing degenerate solutions. These results imply that the three losses L_GW, L_W, and L_D did not conflict during the training process even for complicated data, balancing these three terms against R_H, as in the trade-off between reconstruction and regularization in β-VAE (Higgins et al., 2017a; Tschannen et al., 2018).

C.6 PRIOR FAMILY SELECTION

We show the effect of prior family selection regarding one meta-prior, disentanglement, in Table 8. While the GWAE models with the NP and GMP retain the informativeness of the FNP, these two priors did not disentangle the latent factors as well as the FNP. Although the NP covers a more general family of priors, these results suggest that choosing a prior family suited to the postulated meta-prior greatly facilitates learning representations.

C.7 QUALITATIVE EVALUATIONS OF GENERATION AND RECONSTRUCTION

We show the images reconstructed by GWAE and state-of-the-art variational autoencoding methods in Fig. 6. The shown images are the first ten samples of the test split of the CelebA (Liu et al., 2015) dataset under the latent size L = 64. Compared with the other methods, the reconstructions of the GWAE model tend to retain edges (see the bottom rows of Fig. 6), while the VAE-based models generate smooth, blurry images due to the noise injected into the latent space to perform probabilistic modeling and manifold learning. We also show the reconstruction results of MNIST (LeCun et al., 1998) in Fig. 7 and CIFAR-10 (Krizhevsky & Hinton, 2009) in Fig. 8. These results support that the GWAE models consistently perform autoencoding also on a simpler dataset (MNIST). On a more complex dataset (CIFAR-10), the GWAE model attained the best evaluation in generation despite its weaker reconstruction, suggesting that the GWAE model captured the abstract structure of the data rather than reconstructing the given images. This difference highlights the difference in their objectives, i.e., the GW objective aims at distribution matching in the latent space, while the β-VAE (Higgins et al., 2017a) objective with β < 1 puts weight on reconstruction. We further study the images generated by GWAE and state-of-the-art VAE-based generative models in Fig. 9. These qualitative results show that the GWAE generation successfully obtains a diverse set of images compared with those of the state-of-the-art autoencoding generative models. The ALI model (Dumoulin et al., 2017) (Fig. 9 (a)) also generates various images by the distribution matching of bidirectional models, but the generated images have wavy contours, failing to compose images with a consistent appearance owing to the lack of an autoencoding process. Although the VAE-GAN model (Larsen et al., 2016) (Fig. 9 (b)) adequately yields organized images with smooth textures, the azimuth of these images is less diverse, i.e., the great majority of the images are facing forward or looking slightly sideways. The images generated by 2-Stage VAE (Dai & Wipf, 2019) (Fig. 9 (c)) have diverse azimuth, color, and background; however, these images tend to incline toward the majority attributes, e.g., not wearing eyeglasses or sunglasses.
The GWAE model (Fig. 9 (d)) successfully generates facial images with various skin colors, diversified backgrounds, and assorted facial expressions (e.g., wearing a mustache). These results imply that the GWAE models also function as generative models, while having been built as a representation learning method, owing to the collateral condition pθ(x, z) = qϕ(x, z) in Eq. (11) and the generative modeling pθ(x) = pdata(x) as its necessary condition.

Table 9: The ablation study on generation and reconstruction in CelebA (Liu et al., 2015). The same settings as in Table 2 are applied in these experiments.
  Model                        FID (Heusel et al., 2017)   PSNR [dB]
  GWAE (NP)                    45.3                        22.82
  GWAE (NP) w/o L_W            233.7                       9.80
  GWAE (NP) w/o L_D            403.8                       18.63
  GWAE (NP) w/ MMD L_D         158.4                       22.61
  GWAE (NP) w/ Z-only critic   102.4                       22.41
  GWAE (NP) w/o R_H            179.6                       21.57
  GWAE (NP, ρ = ξ)             123.5                       16.03

C.8 ABLATION STUDY

We conducted an ablation study of the losses and regularizations introduced in Eq. (13). Table 9 shows the results of the ablation study of the three sub-constraints L_W, L_D, and R_H. The ablations yielded performance degradation of GWAE, especially for L_W. These results suggest the necessity of each regularization term and reveal their roles in representation learning.

Ablation of L_W. The ablation of the term L_W brought low-quality reconstruction, which suggests that L_W works as the autoencoding constraint, as can be seen from the reconstruction loss contained in L_W. It also reduced generation capability as well as reconstruction, suggesting that generative modeling via autoencoding is inherited from the variational autoencoding architecture of VAEs (Kingma & Welling, 2014).

Ablation of L_D. Without the term L_D, the GWAE models suffer from a lack of distribution matching in data generation, while data reconstruction remains successful. This phenomenon could be caused by the discrepancy between the encoded latent distribution qϕ(z) and the prior πθ(z). Similar results are also obtained when ablating the merged sufficient condition (see Eq. (11)) for the regularization L_D, where L_D is instead defined as the MMD loss between the prior πθ(z) and the encoded latent distribution qϕ(z), as in the WAE-MMD model (Tolstikhin et al., 2018). This choice of L_D on the low-dimensional space Z appears to be a replacement for the Kantorovich potential adversarially learned in the high-dimensional joint space X × Z; however, dropping the merged sufficient condition seems to have caused the crucial reduction of generation performance, as in the full ablation of L_D. These results imply that the term L_D with adversarial learning regularizes the generative model pθ(x, z) to match the inference model qϕ(x, z).

Ablation of R_H. Removing R_H slightly increased the reconstruction error but deteriorated the generation quality. To confirm this behavior, we also show the samples generated by the GWAE model without the regularization R_H in Fig. 10 and its reconstructions in Fig. 11. These qualitative results show that the decoder without R_H successfully reconstructs the images from the inference model qϕ(z) but generates corrupted images from the prior πθ(z). This suggests the hole problem (Rezende & Viola, 2018b) of the degenerate solution, where each data point is mapped to a single latent point covering only a zero-measure area of the latent space, and the latent space is almost everywhere not covered by the inference model qϕ(z).
Thus, the entropy regularization R_H seems to have worked to retain the probabilistic mappings in the encoder qϕ(z|x) and to avoid this phenomenon. In addition, to ablate the setting ρ = 1, we also experimented with the setting ρ = ξ, which appears intuitively natural but causes an unstable training process due to outlier samples in L_GW. The GWAE model with ρ = ξ suffered from performance degradation both in generation and reconstruction, suggesting that our setting ρ = 1 (rather than ξ) affects the learning process of the entire model.

C.9 THE META-PRIOR EFFECT ON GW MINIMIZATION AND ESTIMATION

For a further inspection of Section 4.2, we also studied the GW minimization and estimation using the FNP in Fig. 12. Compared with the NP case in Fig. 1, GWAE with the FNP presents less stable and more biased estimation and minimization. The learning curve of L_GW in Fig. 12a is largely biased in the first 40 epochs and then converges at approximately 3.2, a higher value than that of Fig. 1a (lower than 2). The isometry histogram also suggests a degradation of GW minimization with the FNP. In Fig. 12b, more samples fall in off-diagonal areas, showing that the isometry is less tight than that of Fig. 1b. These results are presumably due to the mismatch of the disentanglement meta-prior on MNIST (LeCun et al., 1998), because one of the major generative factors of MNIST images is the kind of digit, a categorical variable typically learned as one-hot variables, in contrast to the factorization imposed by the FNP.

C.10 PRIORS IN CLUSTERING STRUCTURE

For a more detailed investigation of the capture of clustering structure studied in Fig. 3, we further study the latent spaces of the VAE (Kingma & Welling, 2014), DAGMM (Zong et al., 2018), and GWAE with the GMP. The t-SNE visualizations (van der Maaten & Hinton, 2008) of the latent spaces are shown in Fig. 13, which suggests that the GWAE model with the GMP clearly captured the clustering structure in its latent space. The prior of the VAE (Kingma & Welling, 2014) is defined as the standard Gaussian N(0, I_L), which does not consist of multiple clusters. The learned prior of DAGMM contains multiple clusters; however, adjacent clusters overlap to some extent. In the learned prior of GWAE, we can observe clear clusters that are densely concentrated and well separated from each other. These results support the quantitative OoD results in Fig. 3, in which the GWAE model outperforms the other two models, with and without explicit clustering modeling, respectively.

[Figure 5 training-curve panels: (a) L_GW, CelebA; (b) L_W, CelebA; (c) L_D, CelebA; (d) R_H, CelebA; (e) L_GW, CIFAR-10; (f) L_W, CIFAR-10; (g) L_D, CIFAR-10; (h) R_H, CIFAR-10.]
Figure 5: The training process of a GWAE model. The model is trained using the NP and λ_D = λ_W = λ_H = 1. Plots (a)-(d) are training curves during one trial of training using CelebA (Liu et al., 2015), and (e)-(h) use CIFAR-10 (Krizhevsky & Hinton, 2009). In each plot, the horizontal axis represents the number of epochs elapsed, and the vertical axis shows the loss value. The blue and orange curves represent the training and validation losses, respectively.
(b) β-VAE (Higgins et al., 2017a) (β=0.1). (c) AVB (Mescheder et al., 2017) (β=1). (d) WAE (λ=100). (e) GWAE (NP, λ_D=1, λ_W=10, λ_H=0.0001).
Figure 6: Reconstructed images in CelebA (Liu et al., 2015). The images denote original data samples (top rows), reconstructed images (middle rows), and zoomed reconstructions (bottom rows). Each column corresponds to one data instance in the test set.

(a) VAE. FID: 16.8, PSNR: 23.66 dB. (b) β-VAE (Higgins et al., 2017a) (β=0.1). FID: 15.5, PSNR: 25.45 dB. (c) AVB (Mescheder et al., 2017) (β=1). FID: 39.2, PSNR: 24.31 dB. (d) WAE (λ=100). FID: 16.9, PSNR: 25.28 dB. (e) GWAE (NP, λ_D=1, λ_W=10, λ_H=0.0001). FID: 14.4, PSNR: 26.11 dB.
Figure 7: Reconstructed images in MNIST (LeCun et al., 1998). The images denote original data samples (top rows) and reconstructed images (bottom rows). Each column corresponds to one data instance in the test set.

(a) VAE. FID: 111.3, PSNR: 19.84 dB. (b) β-VAE (Higgins et al., 2017a) (β=0.1). FID: 84.5, PSNR: 22.48 dB. (c) AVB (Mescheder et al., 2017) (β=1). FID: 109.9, PSNR: 21.14 dB. (d) WAE (λ=100). FID: 87.3, PSNR: 22.45 dB. (e) GWAE (NP, λ_D=1, λ_W=10, λ_H=0.0001). FID: 59.9, PSNR: 17.64 dB.
Figure 8: Reconstructed images in CIFAR-10 (Krizhevsky & Hinton, 2009). The images denote original data samples (top rows) and reconstructed images (bottom rows). Each column corresponds to one data instance in the test set.

(a) ALI (Dumoulin et al., 2017). (b) VAE-GAN (Larsen et al., 2016) (γ=1). (c) 2-Stage VAE (Dai & Wipf, 2019). (d) GWAE (NP, λ_D=1, λ_W=10, λ_H=0.0001).
Figure 9: Generated images in CelebA (Liu et al., 2015). We show 100 images sampled from the generative model pθ(x) without cherry-picking.

Figure 10: Generated images in CelebA (Liu et al., 2015) using the GWAE model without the regularization term R_H.

Figure 11: Reconstructed images in CelebA (Liu et al., 2015) using the GWAE model without the regularization term R_H. Each column corresponds to one test data instance. The rows denote original (top) and reconstructed (bottom) images.

(a) The estimation of the GW metric using the FNP. (b) The isometry in GWAE with the FNP.
Figure 12: The estimation and minimization of the GW metric. This trial of training is conducted in GWAE (FNP, λ_D=1, λ_W=1, λ_H=1) using the MNIST (LeCun et al., 1998) dataset, with the same settings as Fig. 1 except for the FNP. (a) The curves show the GW values estimated by the loss term L_GW (solid, blue) and the empirical GW computed with the POT package (Flamary et al., 2021) (dashed, orange). The values are computed using the validation set. (b) The axes ∆x = d_X(x, x′) (vertical) and ∆z = d_Z(z, z′) (horizontal) respectively denote the differences in the data and latent spaces between generated samples (x, z), (x′, z′) ∼ pθ(x, z). The histogram contains 10,000 generated sample pairs.

(a) VAE (Kingma & Welling, 2014). (b) DAGMM (Zong et al., 2018). (c) GWAE (GMP).
Figure 13: The t-SNE visualizations (van der Maaten & Hinton, 2008) of latent space samples z ∼ πθ(z) for the OoD detection in Fig. 3. For each model, the left plot presents the sampled points of the t-SNE embeddings, and the right one presents the kernel density estimation (KDE) of these embeddings. The sample size is 1,024 for each reported model.
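For reference, the empirical GW values reported alongside Fig. 12 (and Fig. 1) can be computed with the POT package (Flamary et al., 2021) roughly as sketched below, using intra-space Euclidean distance matrices of a mini-batch of generated pairs under uniform marginals; the squared-loss formulation and the batch-wise evaluation are assumptions of this sketch rather than the paper's exact evaluation protocol.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

def empirical_gw(x, z):
    """Empirical Gromov-Wasserstein discrepancy between a data batch and a latent batch.

    x: array of shape (B, M), flattened data samples
    z: array of shape (B, L), the corresponding latent samples
    """
    B = x.shape[0]
    C1 = ot.dist(x, x, metric="euclidean")   # intra-data distances d_X
    C2 = ot.dist(z, z, metric="euclidean")   # intra-latent distances d_Z
    p = q = np.full(B, 1.0 / B)              # uniform empirical marginals
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")
```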