# Learning Autoencoders with Relational Regularization

Hongteng Xu*¹², Dixin Luo*², Ricardo Henao², Svati Shah², Lawrence Carin²

**Abstract.** A new algorithmic framework is proposed for learning autoencoders of data distributions. We minimize the discrepancy between the model and target distributions, with a relational regularization on the learnable latent prior. This regularization penalizes the fused Gromov-Wasserstein (FGW) distance between the latent prior and its corresponding posterior, allowing one to flexibly learn a structured prior distribution associated with the generative model. Moreover, it helps co-training of multiple autoencoders even if they have heterogeneous architectures and incomparable latent spaces. We implement the framework with two scalable algorithms, making it applicable to both probabilistic and deterministic autoencoders. Our relational regularized autoencoder (RAE) outperforms existing methods, e.g., the variational autoencoder, the Wasserstein autoencoder, and their variants, on generating images. Additionally, our relational co-training strategy for autoencoders achieves encouraging results in both synthetic and real-world multi-view learning tasks. The code is at https://github.com/HongtengXu/Relational-AutoEncoders.

*Equal contribution. ¹Infinia ML Inc., Durham, NC, USA. ²Duke University, Durham, NC, USA. Correspondence to: Hongteng Xu, Dixin Luo. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## 1. Introduction

Autoencoders have been used widely in many challenging machine learning tasks for generative modeling, e.g., image (Kingma & Welling, 2013; Tolstikhin et al., 2018) and sentence (Bowman et al., 2016; Wang et al., 2019) generation. Typically, an autoencoder assumes that the data in the sample space $\mathcal{X}$ may be mapped to a low-dimensional manifold, which can be represented in a latent space $\mathcal{Z}$. The autoencoder fits the unknown data distribution $p_x$ via a latent-variable model denoted $p_G$, specified by a prior distribution $p_z$ on the latent code $z \in \mathcal{Z}$ and a generative model $G: \mathcal{Z} \to \mathcal{X}$ mapping the latent code to the data $x \in \mathcal{X}$. Learning seeks to minimize the discrepancy between $p_x$ and $p_G$.

According to the choice of discrepancy, different autoencoders can be derived. For example, the variational autoencoder (VAE) (Kingma & Welling, 2013) applies the KL divergence as the discrepancy and learns a probabilistic autoencoder by maximizing the evidence lower bound (ELBO). The Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) minimizes a relaxed form of the Wasserstein distance between $p_x$ and $p_G$, and learns a deterministic autoencoder. In general, the objective function approximating the discrepancy consists of a reconstruction loss on observed data and a regularizer penalizing the difference between the prior distribution $p_z$ and the posterior derived from encoded data, i.e., $q_{z|x}$.

Although existing autoencoders have achieved success in many generative tasks, they often suffer from the following two problems.

**Regularizer misspecification.** Typical autoencoders, like the VAE and WAE, fix $p_z$ as a normal distribution, which often leads to over-regularization. Moreover, applying such an unstructured prior increases the difficulty of conditional generation tasks.
To avoid oversimplified priors, the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and the VAE with VampPrior (Tomczak & Welling, 2018) characterize their priors as learnable mixture models. However, without side information (Wang et al., 2019), jointly learning the autoencoder and the prior carries a high risk of under-regularization, and is sensitive to the setting of hyperparameters (e.g., the number of mixture components and the initialization of the prior).

**Co-training of heterogeneous autoencoders.** Solving a single task often relies on data from different domains (i.e., multi-view data). For example, predicting the mortality of a patient may require both her clinical record and genetic information. In such a situation, we may need to learn multiple autoencoders to extract latent variables as features from the different views. Traditional multi-view learning strategies either assume that the co-trained autoencoders share the same latent distribution (Wang et al., 2015; Ye et al., 2016), or assume that there exists an explicit transform between the different latent spaces (Wang et al., 2016). These assumptions are questionable in practice, as the corresponding autoencoders can have heterogeneous architectures and incomparable latent spaces. How to co-train such heterogeneous autoencoders remains an open problem.

To overcome these problems, we propose a new relational regularized autoencoder (RAE). As illustrated in Figure 1(a), we formulate the prior $p_z$ as a Gaussian mixture model. Differing from existing methods, however, we leverage the Gromov-Wasserstein (GW) distance (Mémoli, 2011) to regularize the structural difference between the prior and the posterior in a relational manner, i.e., comparing the distances between samples from the prior with those between samples from the posterior, and restricting their difference. This relational regularizer allows us to implement the discrepancy between $p_z$ and $q_{z|x}$ as the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a). Besides imposing structural constraints on the prior distribution within a single autoencoder, for multiple autoencoders with different latent spaces (e.g., the 2D and 3D latent spaces shown in Figure 1(b)) we can train them jointly by applying the relational regularizer to their posterior distributions.

The proposed relational regularizer is applicable to both probabilistic and deterministic autoencoders, corresponding to approximating the FGW distance as a hierarchical FGW and a sliced FGW, respectively. We demonstrate the rationality of these two approximations and analyze their computational complexity. Experimental results show that i) learning RAEs helps achieve structured prior distributions and suppresses the under-regularization problem, outperforming related approaches in image-generation tasks; and ii) the proposed relational co-training strategy is beneficial for learning heterogeneous autoencoders, with potential for multi-view learning tasks.

## 2. Relational Regularized Autoencoders

### 2.1. Learning mixture models as structured prior

Following prior work with autoencoders (Tolstikhin et al., 2018; Kolouri et al., 2018), we fit the model distribution $p_G$ by minimizing its Wasserstein distance to the data distribution $p_x$, i.e., $\min_G D_w(p_x, p_G)$.
According to Theorem 1 in (Tolstikhin et al., 2018), we can relax the Wasserstein distance and formulate the learning problem as

$$\min_{G,Q}\ \underbrace{\mathbb{E}_{p_x}\mathbb{E}_{q_{z|x;Q}}[d(x, G(z))]}_{\text{reconstruction loss}} + \gamma\, \underbrace{D\big(\mathbb{E}_{p_x}[q_{z|x;Q}],\ p_z\big)}_{\text{distance(posterior, prior)}},$$

where $G: \mathcal{Z} \to \mathcal{X}$ is the target generative model (decoder); $q_{z|x;Q}$ is the posterior of $z$ given $x$, parameterized by an encoder $Q: \mathcal{X} \to \mathcal{Z}$; $d$ is the distance between samples; and $D$ is an arbitrary discrepancy between distributions. Accordingly, $q_{z;Q} = \mathbb{E}_{p_x}[q_{z|x;Q}]$ is the marginal distribution derived from the posterior. The parameter $\gamma$ trades off the reconstruction loss against the regularizer.

*Figure 1. (a) Learning a single autoencoder with relational regularization. (b) Relational co-training of autoencoders with incomparable latent spaces.*

Instead of fixing $p_z$ as a normal distribution, we seek to learn a structured prior associated with the autoencoder:

$$\min_{G,Q,p_z \in \mathcal{P}}\ \mathbb{E}_{p_x}\mathbb{E}_{q_{z|x;Q}}[d(x, G(z))] + \gamma D(q_{z;Q}, p_z), \tag{2}$$

where $\mathcal{P}$ is the set of valid prior distributions, often assumed to be a set of (Gaussian) mixture models (Dilokthanakul et al., 2016; Tomczak & Welling, 2018). Learning the structured prior allows one to explore the clustering structure of the data and to achieve conditional generation (i.e., sampling latent variables from a single component of the prior and generating samples accordingly).

### 2.2. Relational regularization via Gromov-Wasserstein

Jointly learning the prior and the autoencoder may lead to under-regularization in the training phase: it is easy to fit $p_z$ to $q_{z;Q}$ without harming the reconstruction loss. Solving this problem requires introducing structural constraints when comparing the two distributions, motivating a relational regularized autoencoder (RAE). In particular, besides commonly-used regularizers like the KL divergence (Dilokthanakul et al., 2016) and the Wasserstein distance (Titouan et al., 2019), which compare the distributions directly, we consider a relational regularizer based on the Gromov-Wasserstein (GW) distance (Mémoli, 2011) in our learning problem:

$$\min_{G,Q,p_z \in \mathcal{P}}\ \mathbb{E}_{p_x}\mathbb{E}_{q_{z|x;Q}}[d(x, G(z))] + \gamma\Big(\underbrace{(1-\beta)\, D(q_{z;Q}, p_z)}_{\text{direct comparison}} + \underbrace{\beta\, D_{gw}(q_{z;Q}, p_z)}_{\text{relational comparison}}\Big), \tag{3}$$

where $\beta \in [0, 1]$ controls the trade-off between the two regularizers, and $D_{gw}$ is the GW distance defined as follows.

**Definition 2.1.** Let $(\mathcal{X}, d_x, p_x)$ and $(\mathcal{Y}, d_y, p_y)$ be two metric measure spaces, where $(\mathcal{X}, d_x)$ is a compact metric space and $p_x$ is a probability measure on $\mathcal{X}$ (with $(\mathcal{Y}, d_y, p_y)$ defined in the same way). The Gromov-Wasserstein distance $D_{gw}(p_x, p_y)$ is defined as

$$D_{gw}(p_x, p_y) = \inf_{\pi \in \Pi(p_x, p_y)} \int_{\mathcal{X}\times\mathcal{Y}}\int_{\mathcal{X}\times\mathcal{Y}} r_{x,y,x',y'}\, d\pi(x, y)\, d\pi(x', y') = \inf_{\pi \in \Pi(p_x, p_y)} \mathbb{E}_{(x,y),(x',y') \sim \pi\times\pi}\big[r_{x,y,x',y'}\big],$$

where $r_{x,y,x',y'} = |d_x(x, x') - d_y(y, y')|^2$, and $\Pi(p_x, p_y)$ is the set of all probability measures on $\mathcal{X}\times\mathcal{Y}$ with $p_x$ and $p_y$ as marginals.

The term $r_{x,y,x',y'}$ defines a relational loss, comparing the difference between pairs of samples from the two distributions. Accordingly, the GW distance corresponds to the minimum expectation of the relational loss. The optimal joint distribution $\pi$ achieving the GW distance is called the optimal transport between the two distributions. The term $D_{gw}(q_{z;Q}, p_z)$ in (3) penalizes the structural difference between the two distributions, mutually enhancing the clustering structure of the prior and that of the posterior.
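To make the learnable prior $p_z \in \mathcal{P}$ in (2)-(3) concrete, the sketch below shows one way to parameterize a GMM prior with $K$ equally weighted, diagonal-covariance components in PyTorch. It is a minimal illustration under our own naming (the module `GMMPrior` and its fields are not from the authors' released code):

```python
import torch
import torch.nn as nn

class GMMPrior(nn.Module):
    """Learnable prior p_z = (1/K) * sum_k N(mu_k, diag(sigma_k^2))."""
    def __init__(self, num_components: int, latent_dim: int):
        super().__init__()
        self.K = num_components
        self.mu = nn.Parameter(torch.randn(num_components, latent_dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_components, latent_dim))

    def sample(self, n: int) -> torch.Tensor:
        # Draw component indices uniformly, then reparameterize within the component.
        k = torch.randint(0, self.K, (n,), device=self.mu.device)
        eps = torch.randn_like(self.mu[k])
        return self.mu[k] + eps * self.log_sigma[k].exp()

prior = GMMPrior(num_components=10, latent_dim=8)
z = prior.sample(100)   # samples usable inside the regularizer D(q_{z;Q}, p_z)
print(z.shape)          # torch.Size([100, 8])
```

Because the component parameters are ordinary `nn.Parameter`s, the prior can be updated jointly with the encoder and decoder by the same optimizer.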
We prefer the GW distance for implementing the relational regularizer because it combines easily with existing regularizers, allowing the design of scalable learning algorithms. In particular, when the direct regularizer is the Wasserstein distance (Titouan et al., 2019), i.e., $D = D_w$, we can combine it with $D_{gw}$ and derive a new regularizer as follows:

$$\begin{aligned}
(1-\beta) D_w(p_x, p_y) + \beta D_{gw}(p_x, p_y)
&= (1-\beta) \inf_{\pi \in \Pi(p_x,p_y)} \int_{\mathcal{X}\times\mathcal{Y}} c_{x,y}\, d\pi(x, y) + \beta \inf_{\pi \in \Pi(p_x,p_y)} \int_{\mathcal{X}\times\mathcal{Y}}\int_{\mathcal{X}\times\mathcal{Y}} r_{x,y,x',y'}\, d\pi(x, y)\, d\pi(x', y') \\
&\leq \inf_{\pi \in \Pi(p_x,p_y)} \Big\{ \underbrace{(1-\beta) \int_{\mathcal{X}\times\mathcal{Y}} c_{x,y}\, d\pi(x, y)}_{\text{Wasserstein term}} + \underbrace{\beta \int_{\mathcal{X}\times\mathcal{Y}}\int_{\mathcal{X}\times\mathcal{Y}} r_{x,y,x',y'}\, d\pi(x, y)\, d\pi(x', y')}_{\text{Gromov-Wasserstein term}} \Big\} = D_{fgw}(p_x, p_y; \beta),
\end{aligned} \tag{4}$$

where $c: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ is a direct loss function between the two spaces. The new regularizer enforces a shared optimal transport for the Wasserstein and Gromov-Wasserstein terms, corresponding to the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a) between the distributions.

The rationale for this combination has two aspects. First, the optimal transport indicates the correspondence between the two spaces (Mémoli, 2011; Xu et al., 2019b). In the following section, we show that this optimal transport maps encoded data to the clusters defined by the prior; enforcing a shared optimal transport helps ensure the consistency of the clustering structure. Second, as shown in (4), $D_{fgw}(p_x, p_y; \beta) \geq (1-\beta) D_w(p_x, p_y) + \beta D_{gw}(p_x, p_y)$, so replacing the regularizers in (3) with the FGW regularizer minimizes an upper bound of the original objective function, which is useful from the viewpoint of optimization. Therefore, we learn an autoencoder with relational regularization by solving

$$\min_{G,Q,p_z \in \mathcal{P}}\ \mathbb{E}_{p_x}\mathbb{E}_{q_{z|x;Q}}[d(x, G(z))] + \gamma D_{fgw}(q_{z;Q}, p_z; \beta), \tag{5}$$

where the prior $p_z$ is parameterized as a Gaussian mixture model (GMM) with $K$ components $\{\mathcal{N}(\mu_k, \Sigma_k)\}_{k=1}^K$, each with probability $1/K$. The autoencoder can be either probabilistic or deterministic, leading to different learning algorithms.

## 3. Learning algorithms

### 3.1. Probabilistic autoencoder with hierarchical FGW

When the autoencoder is probabilistic, for each sample $x$ the encoder $Q$ outputs the mean and the logarithmic variance of the posterior $q_{z|x;Q}$. Accordingly, the marginal distribution $q_{z;Q}$ becomes a GMM as well, with the number of components equal to the batch size, and the regularizer corresponds to the FGW distance between two GMMs. Inspired by the hierarchical Wasserstein distance (Chen et al., 2018; Yurochkin et al., 2019; Lee et al., 2019), we leverage the structure of the GMMs and propose a hierarchical FGW distance to replace the original regularizer. In particular, given two GMMs, we define the hierarchical FGW distance between them as follows.

**Definition 3.1 (Hierarchical FGW).** Let $p = \sum_{k=1}^K a_k p_k$ and $q = \sum_{n=1}^N b_n q_n$ be two GMMs, where $\{p_k\}_{k=1}^K$ and $\{q_n\}_{n=1}^N$ are $M$-dimensional Gaussian distributions and $a = [a_k]$, $b = [b_n]$ are the distributions over the Gaussian components. For $\beta \in [0, 1]$, the hierarchical fused Gromov-Wasserstein distance between the two GMMs is

$$D_{hfgw}(p, q; \beta) = \min_{T = [t_{kn}] \in \Pi(a,b)}\ (1-\beta)\sum_{k,n} D_w(p_k, q_n)\, t_{kn} + \beta \sum_{k,k',n,n'} \big|D_w(p_k, p_{k'}) - D_w(q_n, q_{n'})\big|^2\, t_{kn} t_{k'n'} = D_{fgw}(a, b; \beta). \tag{6}$$

As shown in (6), the hierarchical FGW corresponds to an FGW distance between the distributions of the Gaussian components, whose ground distance is the Wasserstein distance between the Gaussian components. Figures 2(a) and 2(b) further illustrate the difference between the FGW and our hierarchical FGW.
For the two GMMs, instead of computing the optimal transport between them in the sample space, the hierarchical FGW builds an optimal transport between their Gaussian components. Additionally, we have

**Proposition 3.2.** $D_{hfgw}(p, q; \beta) = D_{fgw}(p, q; \beta)$ when $\Sigma = \mathbf{0}$ for all the Gaussian components.

*Figure 2. Illustrations of FGW, hierarchical FGW, and sliced FGW. The black solid arrows represent the distance between samples, while the black dotted arrows represent the relational loss between sample pairs. In (a, c), the red and blue arrows represent the Euclidean distance between samples; in (b), they represent the Wasserstein distance between Gaussian components.*

Replacing the FGW with the hierarchical FGW, we convert an optimization problem over a continuous distribution (the $\pi$ in (4)) into a much simpler optimization problem over a discrete distribution (the $T$ in (6)). Rewriting (6) in matrix form, we compute the hierarchical FGW distance by solving the following non-convex optimization problem:

$$D_{hfgw}(p, q; \beta) = \min_{T \in \Pi(a,b)} \big\langle \mathbf{D} - 2\beta\, \mathbf{D}_p T \mathbf{D}_q^\top,\ T \big\rangle, \tag{7}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product between matrices, $\Pi(a, b) = \{T \geq 0 \,|\, T\mathbf{1}_N = a,\ T^\top\mathbf{1}_K = b\}$, and $\mathbf{1}_N$ is an $N$-dimensional all-one vector. The optimal transport matrix $T$ is a joint distribution of the Gaussian components of the two GMMs. Here $\mathbf{D}_p = [D_w(p_k, p_{k'})] \in \mathbb{R}^{K\times K}$ and $\mathbf{D}_q = [D_w(q_n, q_{n'})] \in \mathbb{R}^{N\times N}$, whose elements are the Wasserstein distances between Gaussian components, and

$$\mathbf{D} = (1-\beta)\,\mathbf{D}_{pq} + \frac{\beta}{K}(\mathbf{D}_p \circ \mathbf{D}_p)\,\mathbf{1}_K\mathbf{1}_N^\top + \frac{\beta}{N}\,\mathbf{1}_K\mathbf{1}_N^\top(\mathbf{D}_q \circ \mathbf{D}_q), \tag{8}$$

where $\mathbf{D}_{pq} = [D_w(p_k, q_n)] \in \mathbb{R}^{K\times N}$ and $\circ$ represents the Hadamard product. The Wasserstein distance between Gaussian distributions has a closed form:

**Definition 3.3.** Let $p = \mathcal{N}(u_p, \Sigma_p)$ and $q = \mathcal{N}(u_q, \Sigma_q)$ be two $N$-dimensional Gaussian distributions, where $u$ and $\Sigma$ represent the mean and the covariance matrix, respectively. The Wasserstein distance $D_w(p, q)$ is

$$D_w(p, q) = \|u_p - u_q\|_2^2 + \mathrm{trace}\Big(\Sigma_p + \Sigma_q - 2\big(\Sigma_q^{\frac12}\Sigma_p\Sigma_q^{\frac12}\big)^{\frac12}\Big). \tag{9}$$

When the covariance matrices are diagonal, i.e., $\Sigma = \mathrm{diag}(\sigma^2)$ with $\sigma = [\sigma_n] \in \mathbb{R}^N$ the standard deviation, (9) can be rewritten as

$$D_w(p, q) = \|u_p - u_q\|_2^2 + \|\sigma_p - \sigma_q\|_2^2. \tag{10}$$

We solve (7) via the proximal gradient method in (Xu et al., 2019b), with further details in the Supplementary Material. The hierarchical FGW is a good substitute for the original FGW, imposing structural constraints while being more efficient computationally. Plugging the hierarchical FGW and its computation into (5), we apply Algorithm 1 to learn the proposed RAE. Note that, taking advantage of the Envelope Theorem (Afriat, 1971), we treat the optimal transport matrix as constant when applying backpropagation, reducing the computational complexity significantly.

**Algorithm 1: Learning RAE with hierarchical FGW**

*Input:* samples in $\mathcal{X}$. *Output:* the autoencoder $\{G, Q\}$ and the prior with $K$ Gaussian components, $p_z = \frac{1}{K}\sum_k \mathcal{N}(z; \mu_k, \mathrm{diag}(\sigma_k))$.

For each epoch and each batch of samples $\{x_n\}_{n=1}^N \subset \mathcal{X}$:
1. $\mu_n, \log(\sigma_n^2) = Q(x_n)$ for $n = 1, \dots, N$.
2. Reparameterize $z_n = \mu_n + \epsilon \odot \sigma_n$, where $\epsilon \sim \mathcal{N}(0, I)$.
3. Set $q_{z;Q} = \frac{1}{N}\sum_n \mathcal{N}(z; \mu_n, \mathrm{diag}(\sigma_n))$.
4. Calculate $\mathbf{D}_p, \mathbf{D}_q$ via (10) and $\mathbf{D}$ via (8).
5. Obtain the optimal transport $T$ by solving (7).
6. $L_{\text{reconstruction}} = \sum_n d(x_n, G(z_n))$.
7. $D_{hfgw}(q_{z;Q}, p_z; \beta) = \langle \mathbf{D} - 2\beta\,\mathbf{D}_p T \mathbf{D}_q^\top,\ T \rangle$.
8. Update $G, Q, p_z$ with Adam on $L_{\text{reconstruction}} + \gamma D_{hfgw}$.
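As an illustration of (7)-(10), the sketch below builds the cost matrices of the hierarchical FGW for two diagonal-covariance GMMs and evaluates the objective for a given transport plan $T$. The proximal-gradient solver of (Xu et al., 2019b) that produces $T$ is omitted, and all function names are ours, not the authors' released API:

```python
import torch

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """Pairwise closed-form Wasserstein distance between diagonal Gaussians, Eq. (10):
    ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2, for all component pairs."""
    return torch.cdist(mu1, mu2) ** 2 + torch.cdist(sigma1, sigma2) ** 2

def hierarchical_fgw_cost(mu_p, sigma_p, mu_q, sigma_q, beta):
    """Cost matrices of Eqs. (7)-(8); mu_p/sigma_p: (K, M), mu_q/sigma_q: (N, M)."""
    K, N = mu_p.size(0), mu_q.size(0)
    Dp  = gaussian_w2(mu_p, sigma_p, mu_p, sigma_p)      # K x K
    Dq  = gaussian_w2(mu_q, sigma_q, mu_q, sigma_q)      # N x N
    Dpq = gaussian_w2(mu_p, sigma_p, mu_q, sigma_q)      # K x N
    # Eq. (8), with uniform weights 1/K and 1/N on the components.
    D = (1 - beta) * Dpq \
        + (beta / K) * (Dp * Dp).sum(dim=1, keepdim=True) \
        + (beta / N) * (Dq * Dq).sum(dim=0, keepdim=True)
    return D, Dp, Dq

def hierarchical_fgw_value(D, Dp, Dq, T, beta):
    """<D - 2*beta*Dp T Dq^T, T> from Eq. (7), for a transport plan T of shape (K, N)."""
    return ((D - 2 * beta * Dp @ T @ Dq.T) * T).sum()
```

Detaching `T` before calling `hierarchical_fgw_value` mirrors the Envelope-Theorem trick of treating the optimal transport matrix as a constant during backpropagation.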
The optimal transport matrix maps the components of $q_{z;Q}$ to those of $p_z$. Because the components of $q_{z;Q}$ correspond to samples and the components of $p_z$ correspond to clusters, this matrix indicates the clustering structure of the samples.

### 3.2. Deterministic autoencoder with sliced FGW

When the autoencoder is deterministic, its encoder outputs the latent codes corresponding to observed samples. These latent codes can be viewed as samples of $q_{z;Q}$. For the prior $p_z$, we can also generate samples with the help of the reparameterization trick. In such a situation, we estimate the FGW distance in (5) based on the samples of the two distributions. For two arbitrary metric measure spaces $(\mathcal{X}, d_x, p_x)$ and $(\mathcal{Y}, d_y, p_y)$, the empirical FGW between their samples $\{x_i\}_{i=1}^N$ and $\{y_j\}_{j=1}^N$ is

$$\hat{D}_{fgw}(p_x, p_y; \beta) = \min_{T \in \Pi(\frac{1}{N}\mathbf{1}_N, \frac{1}{N}\mathbf{1}_N)}\ (1-\beta)\sum_{i,j=1}^N d(x_i, y_j)\, t_{ij} + \beta \sum_{i,i',j,j'} \big|d(x_i, x_{i'}) - d(y_j, y_{j'})\big|^2\, t_{ij} t_{i'j'}. \tag{11}$$

We can rewrite this empirical FGW in matrix form as in (7), and solve it by the proximal gradient method discussed above. When the samples are in a 1D space and the metric is the Euclidean distance, however, according to the sliced GW distance in (Titouan et al., 2019) and the sliced Wasserstein distance in (Kolouri et al., 2016), the optimal transport matrix corresponds to a permutation matrix and $\hat{D}_{fgw}(p_x, p_y; \beta)$ can be rewritten as

$$\hat{D}_{fgw}(p_x, p_y; \beta) = \min_{\sigma \in \mathcal{P}_N}\ (1-\beta)\sum_{i=1}^N (x_i - y_{\sigma(i)})^2 + \beta \sum_{i,j=1}^N \big((x_i - x_j)^2 - (y_{\sigma(i)} - y_{\sigma(j)})^2\big)^2, \tag{12}$$

where $\mathcal{P}_N$ is the set of all permutations of $\{1, \dots, N\}$. Without loss of generality, we assume the 1D samples are sorted, i.e., $x_1 \leq \dots \leq x_N$ and $y_1 \leq \dots \leq y_N$, and demonstrate that the solution of (12) is characterized by the following theorem.

**Theorem 3.4.** For $x, y \in \{x = [x_i], y = [y_j] \in \mathbb{R}^N \times \mathbb{R}^N \,|\, x_1 \leq \dots \leq x_N,\ y_1 \leq \dots \leq y_N\}$, denote their zero-mean translations as $x'$ and $y'$, respectively. The solution of (12) satisfies: 1) when $\big(\sum_i x'_i y'_i + \frac{1-\beta}{8\beta}\big)^2 \geq \big(\sum_i x'_i y'_{N+1-i} + \frac{1-\beta}{8\beta}\big)^2$, the solution is the identity permutation $\sigma(i) = i$; 2) otherwise, the solution is the anti-identity permutation $\sigma(i) = N + 1 - i$.

The proof of Theorem 3.4 is provided in the Supplementary Material. Consequently, for samples in a 1D space, we can calculate the empirical FGW distance by permuting the samples. To leverage this property for high-dimensional samples, we propose the following sliced FGW distance.

**Definition 3.5 (Sliced FGW).** Let $\mathbb{S}^{M-1} = \{\theta \in \mathbb{R}^M \,|\, \|\theta\|_2 = 1\}$ be the unit hypersphere in $\mathbb{R}^M$ and $u_{\mathbb{S}^{M-1}}$ the uniform measure on $\mathbb{S}^{M-1}$. For each $\theta$, denote the projection onto $\theta$ as $R_\theta$, where $R_\theta(x) = \langle x, \theta \rangle$. For $(\mathcal{X}, d_x, p_x)$ and $(\mathcal{Y}, d_y, p_y)$, we define their sliced fused Gromov-Wasserstein distance as

$$D_{sfgw}(p_x, p_y; \beta) = \mathbb{E}_{\theta \sim u_{\mathbb{S}^{M-1}}}\big[D_{fgw}(R_{\theta\#}p_x, R_{\theta\#}p_y; \beta)\big],$$

where $R_{\theta\#}p$ represents the distribution after the projection, and $D_{fgw}(R_{\theta\#}p_x, R_{\theta\#}p_y; \beta)$ is the FGW distance between $(R_\theta(\mathcal{X}), d_{R_\theta(x)}, R_{\theta\#}p_x)$ and $(R_\theta(\mathcal{Y}), d_{R_\theta(y)}, R_{\theta\#}p_y)$.

According to this definition, the sliced FGW projects the original metric measure spaces into 1D spaces and calculates the FGW distance between these spaces; it corresponds to the expectation of the FGW distances under different projections. We can approximate the sliced FGW distance based on samples of the distributions as well. In particular, given $\{x_i\}_{i=1}^N$ from $\mathcal{X}$, $\{y_i\}_{i=1}^N$ from $\mathcal{Y}$, and $L$ projections $\{R_{\theta_l}\}_{l=1}^L$, the empirical sliced FGW is

$$\hat{D}_{sfgw}(p_x, p_y; \beta) = \frac{1}{L}\sum_{l=1}^L \hat{D}_{fgw}(R_{\theta_l\#}p_x, R_{\theta_l\#}p_y; \beta) = \frac{1}{L}\sum_{l=1}^L \min_{\sigma \in \mathcal{P}_N}\ (1-\beta)\sum_{i=1}^N (x_{i,\theta_l} - y_{\sigma(i),\theta_l})^2 + \beta \sum_{i,j=1}^N \big((x_{i,\theta_l} - x_{j,\theta_l})^2 - (y_{\sigma(i),\theta_l} - y_{\sigma(j),\theta_l})^2\big)^2, \tag{13}$$

where $x_{i,\theta_l} = R_{\theta_l}(x_i)$ represents the projected sample. Figure 2(c) further illustrates the principle of the sliced FGW distance.
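A minimal sketch of the empirical sliced FGW in (12)-(13): project both sample sets onto random directions, sort, and keep the better of the identity and anti-identity matchings, which amounts to applying the case distinction of Theorem 3.4 by directly comparing the two objective values. All function names are ours, and the plain sums of (12) are used as written (prior samples `z_p` could come, e.g., from the `GMMPrior` sketch above):

```python
import torch

def fgw_1d(x, y, beta):
    """1D FGW objective of Eq. (12) for a fixed matching x_i <-> y_i."""
    w_term = ((x - y) ** 2).sum()
    dx = x.unsqueeze(0) - x.unsqueeze(1)
    dy = y.unsqueeze(0) - y.unsqueeze(1)
    gw_term = ((dx ** 2 - dy ** 2) ** 2).sum()
    return (1 - beta) * w_term + beta * gw_term

def sliced_fgw(z_q, z_p, beta=0.1, num_projections=50):
    """Monte-Carlo estimate of Eq. (13) for latent codes z_q, z_p of shape (N, M)."""
    _, M = z_q.shape
    total = 0.0
    for _ in range(num_projections):
        theta = torch.randn(M)
        theta = theta / theta.norm()            # random direction on the unit sphere
        x, _ = torch.sort(z_q @ theta)          # sorted 1D projections
        y, _ = torch.sort(z_p @ theta)
        y_anti = torch.flip(y, dims=[0])        # anti-identity matching
        total = total + torch.minimum(fgw_1d(x, y, beta), fgw_1d(x, y_anti, beta))
    return total / num_projections
```

Since sorting and the two candidate matchings are differentiable in the projected values, this estimate can be added directly to the reconstruction loss and backpropagated.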
Replacing the empirical FGW with the empirical sliced FGW, we learn the relational regularized autoencoder via Algorithm 2.

**Algorithm 2: Learning RAE with sliced FGW**

*Input:* samples in $\mathcal{X}$. *Output:* the autoencoder $\{G, Q\}$ and the prior with $K$ Gaussian components, $p_z = \frac{1}{K}\sum_k \mathcal{N}(z; \mu_k, \mathrm{diag}(\sigma_k))$.

For each epoch and each batch of samples $\{x_n\}_{n=1}^N \subset \mathcal{X}$:
1. For $n = 1, \dots, N$: draw samples of $q_{z;Q}$ as $z_n = Q(x_n)$, and samples of $p_z$ as $z'_n = \mu_k + \epsilon \odot \sigma_k$ with $k \sim \mathrm{Categorical}(K)$ and $\epsilon \sim \mathcal{N}(0, I)$.
2. For $l = 1, \dots, L$: create a random projection $\theta_l \in \mathbb{S}^{M-1}$; compute $z_{n,\theta_l} = R_{\theta_l}(z_n)$ and $z'_{n,\theta_l} = R_{\theta_l}(z'_n)$ for $n = 1, \dots, N$; sort $\{z_{n,\theta_l}\}_{n=1}^N$ and $\{z'_{n,\theta_l}\}_{n=1}^N$, respectively; calculate $\hat{D}_{fgw}(R_{\theta_l\#}q_{z;Q}, R_{\theta_l\#}p_z; \beta)$ based on the sorted samples and Theorem 3.4.
3. $L_{\text{reconstruction}} = \sum_n d(x_n, G(z_n))$.
4. Calculate $\hat{D}_{sfgw}(q_{z;Q}, p_z; \beta)$ via (13).
5. Update $G, Q, p_z$ with Adam on $L_{\text{reconstruction}} + \gamma \hat{D}_{sfgw}$.

### 3.3. Comparisons on computational complexity

Compared with calculating the empirical FGW distance directly, our hierarchical FGW and sliced FGW have much lower computational complexity. Following the notation of the previous two subsections, we denote the batch size as $N$, the number of Gaussian components in the prior as $K$, and the dimension of the latent code as $M$. If we apply the proximal gradient method in (Xu et al., 2019b) to calculate the empirical FGW directly, the computational complexity is $\mathcal{O}(JN^3)$, where $J$ is the number of Sinkhorn iterations used in the algorithm. For our hierarchical FGW, we apply the proximal gradient method to a problem of much smaller size (i.e., solving (7)) because $K \ll N$ in general; accordingly, the computational complexity becomes $\mathcal{O}(JN^2K)$. For our sliced FGW, we apply $L$ random projections to project the latent codes to 1D spaces, whose complexity is $\mathcal{O}(LMN)$. For each pair of projected sample sets, we sort them with $\mathcal{O}(N\log N)$ operations and compute (12) with $\mathcal{O}(N^2)$ operations. Overall, the computational complexity of our sliced FGW is $\mathcal{O}(LN(M + \log N + N))$. Because $J$ and $L$ are of the same order in general, the computational complexity of the sliced FGW is comparable to that of the hierarchical FGW.

## 4. Relational Co-Training of Autoencoders

Besides learning a single autoencoder, we can apply our relational regularization to learn multiple autoencoders. As shown in Figure 1(b), when learning two autoencoders we can penalize the GW distance between their posterior distributions, and accordingly the learning problem becomes

$$\min_{\{G_s, Q_s\}_{s=1}^2}\ \sum_{s=1}^2 \Big( \mathbb{E}_{p_{x_s}}\mathbb{E}_{q_{z_s|x_s;Q_s}}[d(x_s, G_s(z_s))] + \gamma(1-\tau)\, D(q_{z_s;Q_s}, p_{z_s}) \Big) + 2\gamma\tau\, D_{gw}(q_{z_1;Q_1}, q_{z_2;Q_2}). \tag{14}$$

Here the regularizer $D$ quantifies the discrepancy between the marginalized posterior and the prior; the prior distributions can be predefined or learnable parameters; $\tau \in [0, 1]$ trades off $D$ against the relational regularizer $D_{gw}$; and $\gamma$ controls the overall significance of these two kinds of regularizers. When learning probabilistic autoencoders, we set $D$ to the hierarchical Wasserstein distance between GMMs (Chen et al., 2018) and approximate the relational regularizer by a hierarchical GW distance, equivalent to the hierarchical FGW with $\beta = 1$. When learning deterministic autoencoders, we set $D$ to the sliced Wasserstein distance used in (Kolouri et al., 2018) and approximate the relational regularizer via the sliced GW (Titouan et al., 2019), i.e., the sliced FGW with $\beta = 1$. The main advantage of the proposed relational regularization is that it is applicable to co-training heterogeneous autoencoders; a schematic sketch of one training step for this joint objective is given below.
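The following sketch of one co-training step for (14) uses our own placeholder names rather than the authors' implementation: `enc*`/`dec*` are the two encoders and decoders, `direct_reg(z, prior)` estimates $D(q_{z;Q}, p_z)$ (e.g., a sliced or hierarchical Wasserstein distance), and `relational_reg(z1, z2)` estimates $D_{gw}$ between the two posteriors, which may live in latent spaces of different dimensions:

```python
import torch.nn.functional as F

def cotrain_step(x1, x2, enc1, dec1, enc2, dec2, prior1, prior2,
                 direct_reg, relational_reg, gamma=1.0, tau=0.5):
    """One relational co-training step, cf. Eq. (14)."""
    z1, z2 = enc1(x1), enc2(x2)                  # latent codes, possibly of different dims
    recon = F.mse_loss(dec1(z1), x1) + F.mse_loss(dec2(z2), x2)
    direct = direct_reg(z1, prior1) + direct_reg(z2, prior2)
    relational = relational_reg(z1, z2)          # relational (GW-type) comparison
    return recon + gamma * (1 - tau) * direct + 2 * gamma * tau * relational
```

Note that `relational_reg` compares only pairwise distance structures within each batch, so the two batches `x1` and `x2` need not contain paired samples.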
As shown in (14), the data used to train the autoencoders can come from different domains and have different data distributions. To fully capture the information in each domain, the autoencoders sometimes have heterogeneous architectures, and the corresponding latent codes lie in incomparable spaces, e.g., of different dimensions. Taking the GW distance as the relational regularizer, we impose a constraint on the posterior distributions defined in the different latent spaces, encouraging structural similarity between them. This regularizer helps avoid over-regularization because it does not enforce a shared latent distribution across domains. Moreover, the proposed regularizer is imposed on the posterior distributions; in other words, it does not require samples from different domains to be paired.

According to the analysis above, our relational co-training strategy has potential for multi-view learning, especially in scenarios with unpaired samples. In particular, given data from different domains, we first learn their latent codes by solving (14). We then concatenate the latent codes from the different domains and use this concatenation as the feature representation for downstream learning tasks.

## 5. Related Work

**Gromov-Wasserstein distance.** The GW distance has been used as a metric for shape registration (Mémoli, 2009; 2011), vocabulary set alignment (Alvarez-Melis & Jaakkola, 2018), and graph matching (Chowdhury & Mémoli, 2018; Vayer et al., 2018b; Xu et al., 2019b). The work in (Peyré et al., 2016) proposes an entropy-regularized GW distance and calculates it based on Sinkhorn iterations (Cuturi, 2013). Following this direction, the work in (Xu et al., 2019b) replaces the entropy regularizer with a Bregman proximal term. The work in (Xu, 2019) proposes an ADMM-based method to calculate the GW distance. To further reduce the computational complexity, the recursive GW distance (Xu et al., 2019a) and the sliced GW distance (Titouan et al., 2019) have been proposed. For generative models, the work in (Bunne et al., 2019) leverages the GW distance to learn coupled adversarial generative networks. However, none of the existing autoencoders consider using the GW distance as their regularizer.

*Table 1. Comparisons of different autoencoders.*

| Method | $Q: \mathcal{X} \to \mathcal{Z}$ | $p_z$ | $D(q_{z;Q}, p_z)$ |
|---|---|---|---|
| VAE | Probabilistic | $\mathcal{N}(z; 0, I)$ | KL |
| WAE | Deterministic | $\mathcal{N}(z; 0, I)$ | MMD |
| SWAE | Deterministic | $\mathcal{N}(z; 0, I)$ | $D_w$ |
| GMVAE | Probabilistic | $\frac{1}{K}\sum_k \mathcal{N}(z; u_k, \Sigma_k)$ | KL |
| VampPrior | Probabilistic | $\frac{1}{K}\sum_k \mathcal{N}(z; Q(x_k))$ | KL |
| Our RAE | Probabilistic | $\frac{1}{K}\sum_k \mathcal{N}(z; u_k, \Sigma_k)$ | $D_{hfgw}$ |
| Our RAE | Deterministic | $\frac{1}{K}\sum_k \mathcal{N}(z; u_k, \Sigma_k)$ | $\hat{D}_{sfgw}$ |

**Autoencoders.** The principle of the autoencoder is to minimize the discrepancy between the data and model distributions. Common choices of the discrepancy include the KL divergence (Kingma & Welling, 2013; Dilokthanakul et al., 2016; Tomczak & Welling, 2018; Takahashi et al., 2019) and the Wasserstein distance (Tolstikhin et al., 2018; Kolouri et al., 2018), which lead to different learning algorithms. Our relational regularized autoencoder can be viewed as a new member of the Wasserstein autoencoder family. Compared with the MMD and the GAN loss used in the WAE (Tolstikhin et al., 2018), and the sliced Wasserstein distance used in (Kolouri et al., 2018), our FGW-based regularizer imposes relational constraints and allows the learning of an autoencoder with a structured prior distribution.
**Co-training methods.** For data in different domains, a commonly-used co-training strategy maps them to a shared latent space and encourages similarity between their latent codes. This strategy suppresses the risk of overfitting for each model and enhances their generalization power, achieving encouraging performance in multi-view learning (Kumar & Daumé, 2011; Chen & Denoyer, 2017; Sindhwani et al., 2005). However, it assumes that the latent codes follow the same distribution, which may lead to over-regularization. Additionally, it often requires well-aligned data, i.e., that the samples in different domains are paired. Our relational co-training strategy provides a potential way to relax these restrictions in practical applications.

## 6. Experiments

### 6.1. Image generation

We test our relational regularized autoencoder (RAE) on image-generation tasks and compare it with the following alternatives: the variational autoencoder (VAE) (Kingma & Welling, 2013), the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018), the sliced Wasserstein autoencoder (SWAE) (Kolouri et al., 2018), the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016), and the VampPrior (Tomczak & Welling, 2018). Table 1 lists the main differences between our RAE and these baselines.

*Table 2. Comparisons on learning image generators.*

| Encoder $Q: \mathcal{X} \to \mathcal{Z}$ | Method | MNIST Rec. loss | MNIST FID | CelebA Rec. loss | CelebA FID |
|---|---|---|---|---|---|
| Probabilistic | VAE | 16.60 | 156.11 | 96.36 | 59.99 |
| Probabilistic | GMVAE | 16.76 | 60.88 | 108.13 | 353.17 |
| Probabilistic | VampPrior | 22.41 | 127.81 | - | - |
| Probabilistic | RAE | 14.14 | 41.99 | 63.21 | 52.20 |
| Deterministic | WAE | 9.97 | 52.78 | 63.83 | 52.07 |
| Deterministic | SWAE | 11.10 | 35.63 | 87.02 | 88.91 |
| Deterministic | RAE | 10.37 | 49.39 | 64.49 | 51.45 |

*Table 3. Runtime per epoch (seconds) when training various models.*

| Dataset | WAE | SWAE | P-RAE | D-RAE |
|---|---|---|---|---|
| MNIST | 25.6 | 24.7 | 26.9 | 24.8 |
| CelebA | 602.2 | 553.4 | 618.5 | 569.7 |

We test the methods on the MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015) datasets. For fairness, all the autoencoders have the same DCGAN-style architecture (Radford et al., 2015) and are learned with the same hyperparameters: the learning rate is 0.001; the optimizer is Adam (Kingma & Ba, 2014) with $\beta_1 = 0.5$ and $\beta_2 = 0.999$; the number of epochs is 50; the batch size is 100; the weight of the regularizer $\gamma$ is 1; and the dimension of the latent code is 8 for MNIST and 64 for CelebA. For the autoencoders with structured priors, we set the number of Gaussian components to 10 and initialize their prior distributions at random. For the proposed RAE, the hyperparameter $\beta$ is set to 0.1, which empirically makes the Wasserstein term and the GW term in our FGW distance have the same magnitude. The probabilistic RAE calculates the hierarchical FGW with the proximal gradient method using 20 iterations, and the deterministic RAE calculates the sliced FGW with 50 random projections. All the autoencoders use the Euclidean distance as the distance between samples, so the reconstruction loss is the mean-square error (MSE). We implement all the autoencoders in PyTorch and train them on a single NVIDIA GTX 1080 Ti GPU. More implementation details, e.g., the architectures of the autoencoders, are provided in the Supplementary Material.
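For reference, the reported training configuration can be mirrored by a setup along the following lines; the names below (`config`, `make_optimizer`) are hypothetical and not taken from the released code:

```python
import torch

# Hypothetical configuration mirroring the reported hyperparameters (MNIST values).
config = dict(lr=1e-3, betas=(0.5, 0.999), epochs=50, batch_size=100,
              gamma=1.0, beta=0.1, latent_dim=8,       # latent_dim = 64 for CelebA
              num_components=10, num_projections=50)

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer with the paper's reported settings."""
    return torch.optim.Adam(model.parameters(),
                            lr=config['lr'], betas=config['betas'])
```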
For each dataset, we compare the proposed RAE with the baselines on (i) the reconstruction loss on testing samples and (ii) the Fréchet Inception Distance (FID) between 10,000 testing samples and 10,000 randomly generated samples; the results are listed in Table 2. Among probabilistic autoencoders, our RAE consistently achieves the best performance on both testing reconstruction loss and FID score. Among deterministic autoencoders, our RAE is at least comparable to the considered alternatives on these measurements. Figure 3 compares the autoencoders on the convergence of the reconstruction loss. The convergence of our RAE is almost the same as that of state-of-the-art methods, which further verifies its feasibility.

*Figure 3. Comparisons for various methods on their convergence ((b) is an enlarged view of (a); (c) CelebA).*

For the autoencoders learning GMMs as their priors, we further compare them on conditional generation tasks, i.e., generating samples conditioned on specific Gaussian components. Figures 4 and 5 visualize the generation results for the various methods. For the MNIST dataset, the GMVAE, our probabilistic RAE, and our deterministic RAE achieve the desired generation results: the images conditioned on different Gaussian components correspond to different digits and writing styles. The VampPrior, however, suffers from severe mode collapse: the images conditioned on different Gaussian components are similar to each other and cover limited modes, most of them being '0', '2', '3', and '8'. As shown in Table 1, the VampPrior parameterizes each Gaussian component of the prior by passing a landmark $x_k$ through the encoder. Because the landmarks are in the sample space, this implicit model requires more parameters, making it sensitive to initialization and prone to overfitting. Figures 3(a) and 3(b) verify our claim: the testing loss of the VampPrior is unstable and does not converge well during training.

*Figure 4. Comparisons on conditional digit generation ((a) VampPrior, (c) probabilistic RAE, (d) deterministic RAE).*

For the CelebA dataset, the GMVAE fails to learn a GMM-based prior. As shown in Figure 5(a), the GMVAE trains a single Gaussian distribution while ignoring the remaining components; as a result, only one Gaussian component can generate face images. Our probabilistic and deterministic RAEs, by contrast, learn their GMM-based priors successfully. In particular, all the components of our probabilistic RAE can generate face images, but the components are indistinguishable. Our deterministic RAE achieves the best performance in this conditional generation task: different components generate semantically-meaningful images with interpretable diversity. For each component, we add tags in the figure to highlight its semantic meaning. Visual comparisons of the various autoencoders on their reconstructed and generated samples are shown in the Supplementary Material.

### 6.2. Multi-view learning via co-training autoencoders

We test our relational co-training strategy on four multi-view learning datasets (Li et al., 2015; https://github.com/yeqinglee/mvdata):

- **Caltech101-7** is a subset of the Caltech-101 dataset (Fei-Fei et al., 2004) with 1,474 images in 7 classes. Each image is represented by 48-dimensional Gabor features and 40-dimensional wavelet moments.
- **Caltech101-20** is a subset of Caltech-101 with 2,386 images in 20 classes. The features are the same as for Caltech101-7.
- **Handwritten** is a dataset of 2,000 images corresponding to 10 digits. Each image has 240-dimensional pixel-based features and 76-dimensional Fourier coefficients.
- **Cathgen** is a real-world dataset of 8,000 patients. For each patient, we seek to leverage 44-dimensional clinical features and 67-dimensional genetic
features to predict the occurrence of myocardial infarction, i.e., a binary classification task.

For each dataset, we use 80% of the data for training, 10% for validation, and the remaining 10% for testing. We test various multi-view learning methods. For each method, we first learn two autoencoders for the data in the different views in an unsupervised way, then concatenate the latent codes of the autoencoders as features and train a classifier based on softmax regression. When learning the autoencoders, our relational co-training method solves (14) with $\gamma = 1$ and $\tau = 0.5$; the influence of $\tau$ on the learning results is shown in the Supplementary Material. For simplicity, we set the prior distributions in (14) to normal distributions. The autoencoders are probabilistic, with MLP encoders and decoders, and each autoencoder has 20-dimensional latent codes; more implementation details are provided in the Supplementary Material. We set $D$ to the hierarchical Wasserstein distance and the relational regularizer to the hierarchical GW distance. In addition to the proposed method, denoted AEs+GW, we consider the following baselines: (i) learning two variational autoencoders independently (Independent AEs); (ii) learning two variational autoencoders jointly with a least-squares co-regularization (Sindhwani et al., 2005) (AEs+CoReg); (iii) learning latent representations via canonical correlation analysis (CCA) (Vía et al., 2007); (iv) learning two autoencoders jointly with a CCA-based regularization (AEs+CCA) (Wang et al., 2015); and (v) learning two autoencoders by replacing the $D_{gw}$ in (14) with a Wasserstein regularizer (AEs+W). AEs+CoReg penalizes the Euclidean distance between the latent codes from different views, which requires paired samples. The remaining methods penalize the discrepancy between the distributions of the latent codes, and are therefore applicable to unpaired samples.

*Figure 5. Comparisons on conditional face generation ((b) probabilistic RAE, (c) deterministic RAE).*

*Table 4. Comparisons on classification accuracy (%).*

| Method | Data type | Caltech101-7 | Caltech101-20 | Handwritten | Cathgen |
|---|---|---|---|---|---|
| Independent AEs | Unpaired | 56.92 ± 1.67 | 33.07 ± 2.09 | 52.09 ± 5.82 | 64.36 ± 1.93 |
| AEs+CoReg (Sindhwani et al., 2005) | Paired | 76.58 ± 1.38 | 60.25 ± 1.66 | 56.20 ± 5.25 | 66.79 ± 1.30 |
| CCA (Vía et al., 2007) | Paired | 78.33 ± 1.88 | 52.27 ± 2.32 | 66.28 ± 5.02 | 65.28 ± 2.17 |
| AEs+CCA (Wang et al., 2015) | Paired | 80.24 ± 1.22 | 62.37 ± 1.35 | 69.72 ± 4.64 | 66.89 ± 1.57 |
| AEs+W | Unpaired | 83.07 ± 1.69 | 69.58 ± 2.03 | 71.21 ± 5.55 | 66.06 ± 1.68 |
| AEs+GW (Ours) | Unpaired | 84.29 ± 1.74 | 69.39 ± 2.01 | 72.36 ± 3.82 | 66.99 ± 1.77 |

The classification accuracy in Table 4 demonstrates the effectiveness of our relational co-training strategy, as the proposed method outperforms the baselines consistently across the different datasets.

## 7. Conclusions

A new framework has been proposed for learning autoencoders with relational regularization. Leveraging the GW distance, this framework allows the learning of structured prior distributions associated with the autoencoders and prevents the model from under-regularization. Besides learning a single autoencoder, the proposed relational regularizer is beneficial for co-training heterogeneous autoencoders. In the future, we plan to make the relational regularizer applicable to co-training more than two autoencoders and to further reduce its computational complexity.

## Acknowledgements

This research was supported in part by DARPA, DOE, NIH, ONR and NSF. We thank Dr. Hongyuan Zha for helpful discussions.

## References
Afriat, S. Theory of maxima and the method of Lagrange. SIAM Journal on Applied Mathematics, 20(3):343-357, 1971.
Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1881-1890, 2018.
Bowman, S., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10-21, 2016.
Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. Learning generative models across incomparable spaces. In International Conference on Machine Learning, pp. 851-861, 2019.
Chen, M. and Denoyer, L. Multi-view generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 175-188. Springer, 2017.
Chen, Y., Georgiou, T. T., and Tannenbaum, A. Optimal transport for Gaussian mixture models. IEEE Access, 7:6269-6278, 2018.
Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.
Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292-2300, 2013.
Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, pp. 178-178. IEEE, 2004.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Kolouri, S., Zou, Y., and Rohde, G. K. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258-5267, 2016.
Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.
Kumar, A. and Daumé, H. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393-400, 2011.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Lee, J., Dabagia, M., Dyer, E., and Rozell, C. Hierarchical optimal transport for multimodal distribution alignment. In Advances in Neural Information Processing Systems, pp. 13453-13463, 2019.
Li, Y., Nie, F., Huang, H., and Huang, J. Large-scale multi-view spectral clustering via bipartite graph. In AAAI, 2015.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, 2015.
Mémoli, F. Spectral Gromov-Wasserstein distances for shape matching. In ICCV Workshops, pp. 256-263, 2009.
Mémoli, F. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417-487, 2011.
Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pp. 2664-2672, 2016.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Sindhwani, V., Niyogi, P., and Belkin, M. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, volume 2005, pp. 74-79, 2005.
Takahashi, H., Iwata, T., Yamanaka, Y., Yamada, M., and Yagi, S. Variational autoencoder with implicit optimal priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5066-5073, 2019.
Titouan, V., Flamary, R., Courty, N., Tavenard, R., and Chapel, L. Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems, pp. 14726-14736, 2019.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations (ICLR 2018). OpenReview.net, 2018.
Tomczak, J. and Welling, M. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214-1223, 2018.
Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018a.
Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data. arXiv preprint arXiv:1805.09114, 2018b.
Vía, J., Santamaría, I., and Pérez, J. A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks, 20(1):139-152, 2007.
Wang, S., Ding, Z., and Fu, Y. Coupled marginalized autoencoders for cross-domain multi-view learning. In IJCAI, pp. 2125-2131, 2016.
Wang, W., Arora, R., Livescu, K., and Bilmes, J. On deep multi-view representation learning. In International Conference on Machine Learning, pp. 1083-1092, 2015.
Wang, W., Gan, Z., Xu, H., Zhang, R., Wang, G., Shen, D., Chen, C., and Carin, L. Topic-guided variational auto-encoder for text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 166-177, 2019.
Xu, H. Gromov-Wasserstein factorization models for graph clustering. arXiv preprint arXiv:1911.08530, 2019.
Xu, H., Luo, D., and Carin, L. Scalable Gromov-Wasserstein learning for graph partitioning and matching. arXiv preprint arXiv:1905.07645, 2019a.
Xu, H., Luo, D., Zha, H., and Duke, L. C. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, pp. 6932-6941, 2019b.
Ye, T., Wang, T., McGuinness, K., Guo, Y., and Gurrin, C. Learning multiple views with orthogonal denoising autoencoders. In International Conference on Multimedia Modeling, pp. 313-324. Springer, 2016.
Yurochkin, M., Claici, S., Chien, E., Mirzazadeh, F., and Solomon, J. Hierarchical optimal transport for document representation. arXiv preprint arXiv:1906.10827, 2019.