Published as a conference paper at ICLR 2021

GENERALIZED MULTIMODAL ELBO

Thomas M. Sutter, Imant Daunhawer, Julia E. Vogt
Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
{thomas.sutter,imant.daunhawer,julia.vogt}@inf.ethz.ch

ABSTRACT

Multiple data types naturally co-occur when describing real-world phenomena, and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.

1 INTRODUCTION

The availability of multiple data types provides a rich source of information and holds promise for learning representations that generalize well across multiple modalities (Baltrušaitis et al., 2018). Multimodal data naturally grants additional self-supervision in the form of shared information connecting the different data types. Further, the understanding of different modalities and the interplay between data types are non-trivial research questions and long-standing goals in machine learning research. While fully-supervised approaches have been applied successfully (Karpathy & Fei-Fei, 2015; Tsai et al., 2019; Pham et al., 2019; Schoenauer-Sebag et al., 2019), the labeling of multiple data types remains time-consuming and expensive. Therefore, models are needed that efficiently learn from multiple data types in a self-supervised fashion.

Self-supervised, generative models are suitable for learning the joint distribution of multiple data types without supervision. We focus on VAEs (Kingma & Welling, 2014; Rezende et al., 2014), which are able to jointly infer representations and generate new observations. Despite their success on unimodal datasets, there are additional challenges associated with multimodal data (Suzuki et al., 2016; Vedantam et al., 2018). In particular, multimodal generative models need to represent both modality-specific and shared factors and generate semantically coherent samples across modalities. Semantically coherent samples are connected by the information which is shared between data types (Shi et al., 2019). These requirements are not inherent to the objective of unimodal VAEs, the evidence lower bound (ELBO). Hence, adaptions to the original formulation are required to cater to and benefit from multiple data types. Furthermore, to handle missing modalities, there is a scalability issue in terms of the number of modalities: naively, it requires 2^M different encoders to handle all combinations of M data types. Thus, we restrict our search for an improved multimodal ELBO to the class of scalable multimodal VAEs. Among the class of scalable multimodal VAEs, there are two dominant strains of models, based on either the multimodal variational autoencoder (MVAE, Wu & Goodman, 2018) or the Mixture-of-Experts multimodal variational autoencoder (MMVAE, Shi et al., 2019).
However, we show that these approaches differ merely in their choice of joint posterior approximation functions. We draw a theoretical connection between these models, showing that they can be subsumed under the class of abstract mean functions for modeling the joint posterior. This insight has practical implications, because the choice of mean function directly influences the properties of a model (Nielsen, 2019). The MVAE uses a geometric mean, which enables learning a sharp posterior, resulting in a good approximation of the joint distribution. On the other hand, the MMVAE applies an arithmetic mean, which allows better learning of the unimodal and pairwise conditional distributions. We generalize these approaches and introduce the Mixture-of-Products-of-Experts VAE that combines the benefits of both methods without considerable trade-offs.

In summary, we derive a generalized multimodal ELBO formulation that connects and generalizes two previous approaches. The proposed method, termed MoPoE-VAE, models the joint posterior approximation as a Mixture-of-Products-of-Experts, which encompasses the MVAE (Product-of-Experts) and MMVAE (Mixture-of-Experts) as special cases (Section 3). In contrast to previous models, the proposed model approximates the joint posterior for all subsets of modalities, an advantage that we validate empirically in Section 4, where our model achieves state-of-the-art results.

2 RELATED WORK

This work extends and generalizes existing work on self-supervised multimodal generative models that are scalable in the number of modalities. Scalable means that a single model approximates the joint distribution over all modalities (including all marginal and conditional distributions) instead of requiring individual models for every subset of modalities (e.g., Huang et al., 2018; Tian & Engel, 2019; Hsu & Glass, 2018). The latter approach requires a prohibitive number of models, exponential in the number of modalities.

Multimodal VAEs. Among multimodal generative models, multimodal VAEs (Suzuki et al., 2016; Vedantam et al., 2018; Kurle et al., 2019; Tsai et al., 2019; Wu & Goodman, 2018; Shi et al., 2019; 2020; Sutter et al., 2020) have recently been the dominant approach. Multimodal VAEs are not only suitable to learn a joint distribution over multiple modalities, but also enable joint inference given a subset of modalities. However, to approximate the joint posterior for all subsets of modalities efficiently, it is necessary to introduce additional assumptions on the form of the joint posterior. To overcome the issue of scalability, previous work relies on either the product (Kurle et al., 2019; Wu & Goodman, 2018) or the mixture (Shi et al., 2019; 2020) of unimodal posteriors. While both approaches have their merits, there are also disadvantages associated with them. We unite these approaches in a generalized formulation, a mixture-of-products joint posterior, that encapsulates both approaches and combines their benefits without significant trade-offs.

Multimodal posteriors. The MVAE (Wu & Goodman, 2018) assumes that the joint posterior is a product of unimodal posteriors, a Product-of-Experts (PoE; Hinton, 2002). The PoE has the benefit of aggregating information across any subset of unimodal posteriors and therefore provides an efficient way of dealing with missing modalities for specific types of unimodal posteriors (e.g., Gaussians).
However, to handle missing modalities, the MVAE relies on an additional sub-sampling of unimodal log-likelihoods, which no longer guarantees a valid lower bound on the joint log-likelihood (Wu & Goodman, 2019). Previous work provides empirical results that exhibit the shortcomings of the MVAE, attributing them to a precision miscalibration of experts (Shi et al., 2019) or to the averaging over inseparable individual beliefs (Kurle et al., 2019). Our results suggest that the PoE works well in practice if it is also applied on all subsets of modalities, which naturally leads to the proposed Mixture-of-Products-of-Experts (MoPoE) generalization, which yields a valid lower bound on the joint log-likelihood. On the other hand, the MMVAE (Shi et al., 2019) assumes that the joint posterior is a mixture of unimodal posteriors, a Mixture-of-Experts (MoE). The MMVAE is suitable for the approximation of unimodal posteriors and for translation between pairs of modalities; however, it cannot take advantage of multiple modalities being present, because it only takes the unimodal posteriors into account during training. In contrast, the proposed MoPoE-VAE computes the joint posterior for all subsets of modalities and therefore enables efficient many-to-many translations. Extensions of the MVAE and MMVAE (Kurle et al., 2019; Daunhawer et al., 2020; Shi et al., 2020; Sutter et al., 2020) have introduced additional loss terms; however, these are also applicable to and can be added on top of the proposed model.

Table 1: Properties of previous scalable multimodal VAEs and our proposed model. Note that to deal with missing modalities, the MVAE requires sub-sampling of unimodal ELBOs, which yields an invalid bound on the joint log-likelihood (Wu & Goodman, 2019).

Model            | Posterior form | Aggregate modalities | Multi-modal posterior | Missing modalities
MVAE             | PoE            | yes                  | no                    | (yes)
MMVAE            | MoE            | no                   | yes                   | yes
MoPoE-VAE (ours) | MoPoE          | yes                  | yes                   | yes

Table 1 summarizes the properties of previous multimodal VAEs and highlights the benefits of the proposed model: the ability to aggregate multiple modalities, to learn a multi-modal posterior (in the statistical sense), and to efficiently handle missing modalities at test time.

3.1 PRELIMINARIES

We consider a dataset {X^(i)}_{i=1}^N of N i.i.d. samples, each of which is a set of M modalities X^(i) = {x_j^(i)}_{j=1}^M. We assume that the data is generated by some random process involving a joint hidden random variable z such that inter-modality dependencies are unknown. The marginal log-likelihood can be decomposed into a sum over marginal log-likelihoods of individual sets, log p_θ({X^(i)}_{i=1}^N) = Σ_{i=1}^N log p_θ(X^(i)), which can be written as

log p_θ(X^(i)) = D_KL(q_φ(z|X^(i)) || p_θ(z|X^(i))) + L(θ, φ; X^(i)),   (1)
with L(θ, φ; X^(i)) := E_{q_φ(z|X^(i))}[log p_θ(X^(i)|z)] − D_KL(q_φ(z|X^(i)) || p_θ(z)).   (2)

L(θ, φ; X^(i)) is called the evidence lower bound (ELBO) on the marginal log-likelihood of the i-th set. It forms a tractable objective to approximate the joint data distribution log p_θ(X^(i)). q_φ(z|X^(i)) is the posterior approximation distribution with learnable parameters φ. From the non-negativity of the KL-divergence, it follows that log p_θ(X^(i)) ≥ L(θ, φ; X^(i)). If the posterior approximation q_φ(z|X^(i)) is identical to the true posterior distribution p_θ(z|X^(i)), the bound holds with equality. Hence, maximizing the ELBO in Equation (2) minimizes the otherwise intractable KL-divergence between approximate and true posterior distribution:

arg min_φ D_KL(q_φ(z|X^(i)) || p_θ(z|X^(i))).   (3)
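For the common choice of a diagonal Gaussian posterior approximation and a standard normal prior, the two terms of Equation (2) can be computed with a closed-form KL-divergence and a Monte-Carlo estimate of the reconstruction term. The following is a minimal NumPy sketch of this computation, not taken from the paper's released code; the decoder log-likelihood is a placeholder callable, and the function and parameter names are illustrative only.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(x, mu, logvar, decoder_loglik, n_samples=1, seed=0):
    """Monte-Carlo estimate of Equation (2): E_q[log p(x|z)] - KL(q(z|x) || p(z)).

    `decoder_loglik(x, z)` stands in for log p_theta(x|z); any likelihood model can be
    plugged in here. `mu` and `logvar` parameterize the Gaussian posterior q(z|x).
    """
    rng = np.random.default_rng(seed)
    std = np.exp(0.5 * logvar)
    rec = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(np.shape(mu))
        z = mu + std * eps                      # reparameterized sample z ~ q(z|x)
        rec += decoder_loglik(x, z) / n_samples
    return rec - kl_to_standard_normal(mu, logvar)
```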
Adaptations to the ELBO formulation in Equation (2) include an additional hyperparameter β which weights the KL-divergence relative to the log-likelihood (Higgins et al., 2017). To improve readability, we will omit the superscript (i) in the remaining part of this work.

3.2 APPROXIMATING p_θ(z|X) IN CASE OF MISSING DATA TYPES

For a dataset of M modalities, there are 2^M different subsets contained in the powerset P(X). If, for a particular observation, we only have access to a subset of data types X_k ∈ P(X), the approximation of p_θ(X_k) would result in a different ELBO formulation L(θ, φ_k; X_k), where the true posterior p_θ(z|X_k) of subset X_k is approximated. Instead, we are interested in the true posterior p_θ(z|X) of all data types X, even when only a subset X_k, i.e. q_{φ_k}(z|X_k), is available. The desired ELBO for the available subset X_k is given by

L_k(θ, φ_k; X) := E_{q_{φ_k}(z|X_k)}[log p_θ(X|z)] − D_KL(q_{φ_k}(z|X_k) || p_θ(z)).   (4)

The subtle but important difference between L_k(θ, φ_k; X) and L(θ, φ_k; X_k) is that the former still yields a valid lower bound on log p_θ(X), whereas the latter forms a lower bound on log p_θ(X_k), which is no longer a valid bound on the desired log p_θ(X).

Different from previous work, we argue for an optimization over the powerset P(X), i.e., the joint optimization of all ELBOs L_k(θ, φ_k; X) defined by the posterior subset approximations q_{φ_k}(z|X_k). Since maximizing the ELBO in Equation (2) is equivalent to minimizing the KL-divergence in Equation (3), the joint optimization over the powerset P(X) is equal to the minimization of the following convex combination of KL-divergences over the powerset P(X):¹

Σ_{X_k ∈ P(X)} D_KL(q_φ(z|X_k) || p_θ(z|X))   (5)

Hence, we propose to optimize Equation (4) for all subsets X_k.

¹We omit the subscript k for the parameterization of the posterior approximations when it is clear from context that only X_k is available, and write φ instead.

Lemma 1. The sum of KL-divergences in Equation (5) describes the joint probability log p_θ(X) as follows:

log p_θ(X) = (1/2^M) Σ_{X_k ∈ P(X)} D_KL(q_φ(z|X_k) || p_θ(z|X)) + (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ]

Following Lemma 1 (see Appendix A.1 for the proof) and the non-negativity of the KL-divergence, we see that the convex combination of expectations over the powerset P(X) is an ELBO on the joint probability log p_θ(X). Since this would require 2^M different inference networks in a naive implementation, we use a more efficient approach utilizing abstract mean functions.

3.3 SCALABLE INFERENCE USING ABSTRACT MEAN FUNCTIONS

To create a model that is scalable in the number of modalities, i.e., a model that breaks the need for 2^M different networks, previous works define the joint posterior approximation q_φ(z|X) as a mean function of the unimodal variational posteriors. The PoE and MoE can be subsumed under the concept of abstract means (Nielsen, 2019). Abstract means unify multiple mean functions M_f for a given function f (Niculescu & Persson, 2005):

M_f(p_1, ..., p_P) = f^{-1}( (1/P) Σ_{i=1}^P f(p_i) ),

where P is the number of elements and the function f needs to be injective in order for f^{-1} to exist. f(p) = ap + b results in the arithmetic mean, f(p) = log p in the geometric mean. The choice of mean function directly influences the properties of the learned model, as we will recapitulate with regard to multimodal VAEs in the following. The MVAE (Wu & Goodman, 2018) employs the PoE, which is a geometric mean of unimodal posteriors. Aggregation through the PoE results in a sharp posterior approximation (Hinton, 2002), but struggles in optimizing the individual experts, as mentioned by the authors (Wu & Goodman, 2018, p. 3). In contrast, the MMVAE (Shi et al., 2019) uses the MoE, which is an arithmetic mean of unimodal posteriors. As such, the MMVAE optimizes individual experts well, but is not able to learn a distribution that is sharper than any of its experts. Thus, the choice of mean function directly influences the properties of the resulting model. The MoE optimizes for conditional distributions based on the unimodal posterior approximations, while the PoE optimizes for the approximation of the joint probability distribution. For scalable, abstract-mean-based models, the set of parameters φ_k for the posterior approximation of a subset q_φ(z|X_k) is determined by the unimodal posterior approximations q_{φ_j}(z|x_j) as φ_k = {φ_j | ∀ j ∈ {1, ..., M} : x_j ∈ X_k}.
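For diagonal Gaussian experts, both mean functions can be written in a few lines. The sketch below is illustrative and not taken from the released code; it assumes equally weighted experts stacked along the first array axis. The PoE produces another Gaussian whose precision is the sum of the experts' precisions, so the fused posterior can only become sharper, whereas sampling from the MoE amounts to picking one expert at random, so the mixture is never sharper than its components.

```python
import numpy as np

def poe(mus, logvars, eps=1e-8):
    """Product of equally weighted diagonal Gaussian experts (PoE).

    mus, logvars: arrays of shape [K, D] for K experts over a D-dimensional latent space.
    The product of Gaussians is Gaussian with precision equal to the sum of the experts'
    precisions; returns the mean and log-variance of the resulting Gaussian.
    """
    precisions = 1.0 / (np.exp(logvars) + eps)           # [K, D]
    var = 1.0 / np.sum(precisions, axis=0)               # fused variance shrinks with more experts
    mu = var * np.sum(precisions * mus, axis=0)          # precision-weighted mean
    return mu, np.log(var)

def moe_sample(mus, logvars, seed=0):
    """Draw one sample from the uniform mixture of diagonal Gaussian experts (MoE)."""
    rng = np.random.default_rng(seed)
    k = rng.integers(len(mus))                            # pick one expert uniformly at random
    return mus[k] + np.exp(0.5 * logvars[k]) * rng.standard_normal(np.shape(mus[k]))
```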
3.4 GENERALIZED MULTIMODAL ELBO

In the following, we first introduce the new ELBO L_MoPoE(θ, φ; X) and then prove that its objective minimizes the convex combination of KL-divergences in Equation (5).

Definition 1.
1. Let the posterior approximation of subset X_k be q_φ(z|X_k) = PoE({q_{φ_j}(z|x_j) | x_j ∈ X_k}) ∝ Π_{x_j ∈ X_k} q_{φ_j}(z|x_j).
2. Let the joint posterior be q_φ(z|X) = (1/2^M) Σ_{X_k ∈ P(X)} q_φ(z|X_k).

The objective L_MoPoE(θ, φ; X) for learning a joint distribution of multiple data types X is defined as

L_MoPoE(θ, φ; X) := E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/2^M) Σ_{X_k ∈ P(X)} q_φ(z|X_k) || p_θ(z) ).   (6)

From Definition 1, Lemma 2 directly follows.

Lemma 2. L_MoPoE(θ, φ; X) is a multimodal ELBO, that is log p_θ(X) ≥ L_MoPoE(θ, φ; X).

Since q_φ(z|X) is defined as a mixture distribution (i.e., a probability distribution), it directly follows that L_MoPoE(θ, φ; X) is a valid ELBO on log p_θ(X), because a variational distribution can be chosen arbitrarily as long as it is a valid probability distribution. For a proof of Lemma 2, see Appendix A.2.

Lemma 3. Maximizing L_MoPoE(θ, φ; X) minimizes the convex combination of KL-divergences of the powerset P(X) given in Equation (5).

For a proof of Lemma 3, see Appendix A.3. Definition 1 does not put any restrictions on the choice of posterior approximations q_φ(z|X_k). As we are interested in scalable, multimodal models, we focus on methods which comply with this restriction and choose the PoE for the posterior approximations of the subsets X_k ∈ P(X). Other, non-scalable posterior fusion methods are possible within this framework.

3.5 THE GENERAL FRAMEWORK

Definition 1 can be interpreted as a hierarchical distribution: first, the unimodal posterior approximations q_{φ_j}(z|x_j), x_j ∈ X_k, of a subset are combined using a PoE; second, the subset approximations q_φ(z|X_k), X_k ∈ P(X), are combined using a MoE. This allows us to combine the strengths of both the MoE and the PoE while circumventing their weaknesses (see Section 2). For Gaussian posterior approximations, as is common in VAEs, the PoE can be calculated in closed form, which makes it a computationally efficient solution.
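The hierarchical construction of Definition 1 can be sketched in a few lines for diagonal Gaussian unimodal posteriors. The code below is a schematic illustration, not the released implementation: it iterates over all non-empty subsets (how the empty subset is handled is an implementation detail not fixed by the text), weights all mixture components equally, and redefines the `poe` helper from the Section 3.3 sketch so that the block is self-contained; names and signatures are assumptions.

```python
import numpy as np
from itertools import combinations

def poe(mus, logvars, eps=1e-8):
    """Closed-form product of equally weighted diagonal Gaussian experts (as in Section 3.3)."""
    precisions = 1.0 / (np.exp(logvars) + eps)
    var = 1.0 / np.sum(precisions, axis=0)
    return var * np.sum(precisions * mus, axis=0), np.log(var)

def nonempty_subsets(modalities):
    """All non-empty subsets of the given modality names."""
    out = []
    for r in range(1, len(modalities) + 1):
        out.extend(combinations(modalities, r))
    return out

def mopoe_posterior(unimodal):
    """Mixture-of-Products-of-Experts posterior (Definition 1) for diagonal Gaussians.

    `unimodal` maps a modality name to the (mu, logvar) of q_{phi_j}(z|x_j).
    Returns one (mu, logvar) pair per subset; the joint posterior q(z|X) is the
    uniform mixture over these subset posteriors.
    """
    components = []
    for subset in nonempty_subsets(list(unimodal)):
        mus = np.stack([unimodal[m][0] for m in subset])
        logvars = np.stack([unimodal[m][1] for m in subset])
        components.append(poe(mus, logvars))              # PoE within each subset
    return components

def mopoe_sample(components, seed=0):
    """Draw z from the uniform mixture over subset posteriors (MoE across subsets)."""
    rng = np.random.default_rng(seed)
    mu, logvar = components[rng.integers(len(components))]
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(np.shape(mu))
```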
In the following, we derive the objectives optimized by the MVAE and MMVAE as special cases of L_MoPoE(θ, φ; X). The MVAE only takes into account the full subset, i.e., the PoE of all data types. Trivially, this is a MoE with only a single component:

L_PoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL(q_φ(z|X) || p_θ(z))   (7)
with q_φ(z|X) ∝ Π_{j=1}^M q_{φ_j}(z|x_j) = PoE({q_{φ_j}(z|x_j)}_{j=1}^M) = Σ_{k=1}^{1} PoE({q_{φ_j}(z|x_j)}_{j=1}^M)   (8)

This is equivalent to the MoPoE-VAE of a single subset X_k, which is the full set X. As the PoE of a single expert is just the expert itself, the MMVAE model (Shi et al., 2019) is the special case of L_MoPoE(θ, φ; X) which only takes into account the M unimodal subsets:

L_MoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) || p_θ(z) )   (9)
with q_φ(z|X) = (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) = (1/M) Σ_{j=1}^M PoE(q_{φ_j}(z|x_j))   (10)

L_MoE(θ, φ; X) is equivalent to a MoPoE-VAE of the M unimodal posterior approximations q_{φ_j}(z|x_j) for j = 1, ..., M. Therefore, the proposed MoPoE-VAE is a generalized formulation of the MVAE and MMVAE, which accounts for all subsets of modalities. The identified special cases offer a new perspective on the strengths and weaknesses of prior work: previous models focus on a specific subset of posteriors, which might lead to a decreased performance on the remaining subsets. In particular, the MVAE should perform best when all modalities are present, whereas the MMVAE should be most suitable when only a single modality is observed. We validate this observation empirically in Section 4.
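Continuing the sketch from Section 3.4, the two special cases simply correspond to restricting which subsets enter the mixture; the modality names below are hypothetical and merely mirror the MNIST-SVHN-Text experiment, and `nonempty_subsets` refers to the sketch above.

```python
# Special cases of the MoPoE posterior expressed as choices of subsets (illustrative only).
modalities = ["mnist", "svhn", "text"]

mopoe_subsets = nonempty_subsets(modalities)        # all 2^M - 1 non-empty subsets (MoPoE-VAE)
mvae_subsets = [tuple(modalities)]                  # only the full set: PoE of all experts (Eq. 7-8)
mmvae_subsets = [(m,) for m in modalities]          # only the M unimodal subsets (Eq. 9-10)
```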
4 EXPERIMENTS & RESULTS

We evaluate the proposed method on three different datasets and compare it to state-of-the-art methods. We introduce a new dataset called PolyMNIST with 5 simplified modalities. Additionally, we evaluate all models on the trimodal matching-digits dataset MNIST-SVHN-Text and the challenging bimodal CelebA dataset with images and text. The latter two were introduced in Sutter et al. (2020). We evaluate the models according to three different metrics. We assess the quality of the learned latent representation using a linear classifier. The coherence of generated samples is evaluated using pre-trained classifiers. The approximation of the joint data distribution is measured using test set log-likelihoods. The datasets and the evaluation of experiments are described in detail in Appendix B.

4.1 MNIST-SVHN-TEXT

Figure 1: Joint coherence vs. log-likelihoods for MNIST-SVHN-Text.

Based on the MNIST-SVHN dataset (Shi et al., 2019), this trimodal dataset with an additional text modality forces a model to adapt to multiple data types. It involves data types of various difficulties: whereas MNIST (LeCun & Cortes, 2010) and text are clean modalities, SVHN (Netzer et al., 2011) is comprised of noisy images. Tables 2 and 3 show the superior performance of the proposed method compared to state-of-the-art methods regarding the ability to learn meaningful latent representations and generate coherent samples. MVAE reaches superior performance for the generation of the SVHN modality, while MoPoE-VAE overall achieves the best coherence results. Table 4 shows the results for the test log-likelihoods. The proposed MoPoE-VAE is the only method that is able to reach state-of-the-art coherence, latent classification accuracies, as well as test log-likelihoods for all combinations of inputs. This can be seen in Figure 1, which illustrates the trade-off between test log-likelihoods and joint coherence for every model. Every point encodes the joint coherence and joint log-likelihood for a different β-value.² The goal is to have high coherence and log-likelihoods (i.e., the top right corner). Note that lower β-values typically correspond to models with a higher log-likelihood but lower coherence. Overall, the MoPoE-VAE achieves a superior trade-off compared to the baselines. As expected from our theoretical analysis (Section 3.5), the MVAE achieves good joint log-likelihoods, whereas the MMVAE reaches high joint coherence.

²The β-hyperparameter controls the weight of the KL-divergence in Equation (6). We evaluate the models using β ∈ {0.5, 1.0, 2.5, 5.0, 10.0, 20.0}.

Table 2: Linear classification accuracy of latent representations for MNIST-SVHN-Text. We evaluate all subsets of modalities X_k, where the abbreviations of subsets are as follows: M: MNIST; S: SVHN; T: Text; M,S: MNIST and SVHN; M,T: MNIST and Text; S,T: SVHN and Text; M,S,T: all. We report the means and standard deviations over 5 runs.

MODEL  | M         | S         | T         | M,S       | M,T       | S,T       | M,S,T
MVAE   | 0.90±0.01 | 0.44±0.01 | 0.85±0.10 | 0.89±0.01 | 0.97±0.02 | 0.81±0.09 | 0.96±0.02
MMVAE  | 0.95±0.01 | 0.79±0.05 | 0.99±0.01 | 0.87±0.03 | 0.93±0.03 | 0.84±0.04 | 0.86±0.03
MOPOE  | 0.95±0.01 | 0.80±0.03 | 0.99±0.01 | 0.97±0.01 | 0.98±0.01 | 0.99±0.01 | 0.98±0.01

Table 3: Generation coherence for MNIST-SVHN-Text. For conditional generation, the letter before the arrow denotes the generated modality and the letters after the arrow denote the input subset X_k. We report the mean values over 5 runs. Standard deviations are included in Appendix C.3.

MODEL  | JOINT | M←S  | M←T  | M←S,T | S←M  | S←T  | S←M,T | T←M  | T←S  | T←M,S
MVAE   | 0.12  | 0.24 | 0.20 | 0.32  | 0.43 | 0.30 | 0.75  | 0.28 | 0.17 | 0.29
MMVAE  | 0.28  | 0.75 | 0.99 | 0.87  | 0.31 | 0.30 | 0.30  | 0.96 | 0.76 | 0.84
MOPOE  | 0.31  | 0.74 | 0.99 | 0.94  | 0.36 | 0.34 | 0.37  | 0.96 | 0.76 | 0.93

4.2 POLYMNIST

Figure 2: Ten samples from the PolyMNIST dataset. Each column depicts one tuple that consists of five different modalities.

The PolyMNIST dataset consists of sets of MNIST digits where each set {x_j}_{j=1}^M consists of 5 images with the same digit label but different backgrounds and different styles of handwriting. An example of one such tuple is shown in Figure 2. Thus, each modality represents a shuffled set of MNIST digits overlaid on top of (random crops from) 5 different background images, which are modality-specific. In total there are 60,000 tuples of training examples and 10,000 tuples of test examples, and we make sure that no MNIST digit is used in both the training and the test set.³ The PolyMNIST dataset allows us to investigate how well different methods perform given more than two modalities. Since individual images can be difficult to classify correctly (even for a human observer), one would expect multimodal models to aggregate information across multiple modalities. Further, this dataset facilitates the comparison of different models, because it removes the need for modality-specific architectures and hyperparameters. As such, for a fair comparison, we use the same architectures and hyperparameter values across all methods. We expect to see that both the MMVAE and our proposed method are able to aggregate the redundant digit information across different modalities, whereas the MVAE should not be able to benefit from an increasing number of modalities, because it does not aggregate unimodal posteriors. Further, we hypothesize that the MVAE will achieve the best generative performance when all modalities are present, but that it will struggle with an increasing number of missing modalities. The proposed MoPoE-VAE should perform well given any subset of modalities.

PolyMNIST results. Figure 3 compares the results across different methods. The performance in terms of three different metrics is shown as a function of the number of input modalities; for instance, the log-likelihood of all generated modalities given one input modality (averaged over all possible single input modalities).
As expected, both the MMVAE and MoPoE-VAE benefit from more input modalities, whereas the performance of the MVAE stays flat across all metrics. In the limit of all 5 input modalities, the log-likelihood of MoPoE-VAE is on par with MVAE, but the proposed method is clearly superior in terms of both latent classification as well as conditional coherence across any subset of modalities. Analogously, in the limit of a single input modality, MoPoE-VAE matches the performance of MMVAE. Only in terms of the joint coherence (see figure legend) does the MMVAE perform better, suggesting that a more flexible prior might be needed for the MoPoE-VAE. Therefore, the PolyMNIST experiment illustrates that the proposed method does not only theoretically encompass the other two methods, but that it is superior for most subsets of modalities and even matches the performance in special cases that favor previous methods.

³Details on how the dataset was generated are included in Appendix D.

Table 4: Test set log-likelihoods on MNIST-SVHN-Text. We report the test set log-likelihoods of the joint generative model conditioned on the variational posterior of subsets of modalities q_φ(z|X_k). (x_M: MNIST; x_S: SVHN; x_T: Text; X = (x_M, x_S, x_T)).

MODEL  | X         | X|x_M     | X|x_S     | X|x_T     | X|x_M,x_S | X|x_M,x_T | X|x_S,x_T
MVAE   | -1790±3.3 | -2090±3.8 | -1895±0.2 | -2133±6.9 | -1825±2.6 | -2050±2.6 | -1855±0.3
MMVAE  | -1941±5.7 | -1987±1.5 | -1857±12  | -2018±1.6 | -1912±7.3 | -2002±1.2 | -1925±7.7
MOPOE  | -1819±5.7 | -1991±2.9 | -1858±6.2 | -2024±2.6 | -1822±5.0 | -1987±3.1 | -1850±5.8

Figure 3: Performance on PolyMNIST as a function of the number of input modalities, averaged over all subsets of the respective size. Performance is measured in terms of three different metrics (larger is better) and markers denote the means (error bands denote standard deviations) over five runs. Left: Linear classification accuracy of digits given the latent representation computed from the respective subset. Center: Coherence of conditionally generated samples (excluding the input modality). Right: Log-likelihood of all generated modalities. Not shown: The joint coherence is 3.6 (±1.5), 20.0 (±1.9), and 12.1 (±1.6) percent for MVAE, MMVAE, and MoPoE respectively.

4.3 BIMODAL CELEBA

In this dataset, images displaying faces (Liu et al., 2015) are equipped with additional text describing the faces using the labeled attributes. Any negatively labeled attribute is completely missing in the string, which makes the text modality more challenging. Compared to previous experiments, we additionally use modality-specific latent spaces, which were found to improve the generative quality of a model (Hsu & Glass, 2018; Sutter et al., 2020; Daunhawer et al., 2020).⁴ Figure 4 displays qualitative results of images which are generated given text. Table 5 shows the classification results for the coherence of generated samples as well as the classification of latent representations. We see that the proposed model is able to match the baselines on this challenging dataset, which favors the baselines because it consists of only two modalities. Figure 4 shows that attributes like gender or smiling are learned well, as they manifest in generated samples and can be identified from the latent representation. Subtle and rare attributes are more difficult to generate consistently; evaluations specific to the different labels are provided in Appendix E.3.

5 CONCLUSION

In this work, we propose a new multimodal ELBO formulation.
Our contribution is threefold: First, the proposed MoPoE-VAE generalizes prior works (MVAE, MMVAE) and combines their benefits. Second, we analyze the strengths and weaknesses of previous works and relate them directly to their objective and choice of posterior approximation function. Finally, in extensive experiments we empirically show the advantages compared to state-of-the-art models and even match their performance on tasks that favor previous work. In future work, we would like to evaluate previous extensions to multimodal VAEs. Additionally, we will explore different types and combinations of abstract mean functions and investigate their effects on the model and its performance, as well as their theoretical properties (e.g., tightness) compared to existing methods.

⁴For more details, see Appendix E.

Table 5: Classification and coherence results on the bimodal CelebA experiment. For latent representations and conditionally generated samples, we report the mean average precision over all attributes (I: Image; T: Text; Joint: I and T).

MODEL  | LATENT REPRESENTATION (I / T / JOINT) | GENERATION (I→T / T→I)
MVAE   | 0.30 / 0.31 / 0.32                    | 0.26 / 0.33
MMVAE  | 0.35 / 0.38 / 0.35                    | 0.14 / 0.41
MOPOE  | 0.40 / 0.39 / 0.39                    | 0.15 / 0.43

Figure 4: Qualitative results for bimodal CelebA. The images are conditionally generated by MoPoE-VAE using the text on top of each column.

ACKNOWLEDGMENTS

We would like to thank Ričards Marcinkevičs for helpful discussions and for proposing the name PolyMNIST. ID is supported by the SNSF grant #200021_188466.

REFERENCES

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423-443, 2018.

Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Yuri Burda, Roger B Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1509.00519.

Thomas M Cover and Joy A Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.

Imant Daunhawer, Thomas M Sutter, Ricards Marcinkevics, and Julia E Vogt. Self-supervised disentanglement of modality-specific and shared factors improves multimodal generative models. In German Conference on Pattern Recognition. Springer, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Wei-Ning Hsu and James Glass. Disentangling by partitioning: A representation learning framework for multimodal sensory data. 2018. URL http://arxiv.org/abs/1805.11264.
Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172-189, 2018.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114.

Richard Kurle, Stephan Günnemann, and Patrick van der Smagt. Multi-source neural variational inference. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 4114-4121. AAAI Press, 2019. doi: 10.1609/aaai.v33i01.33014114. URL https://doi.org/10.1609/aaai.v33i01.33014114.

Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In The IEEE International Conference on Computer Vision (ICCV), 2015.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814, 2010.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

C Niculescu and L Persson. Convex Functions and Their Applications: A Contemporary Approach. 2005.

Frank Nielsen. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy, 2019. ISSN 10994300. doi: 10.3390/e21050485.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6892-6899, 2019.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278-1286, 2014. URL https://arxiv.org/abs/1401.4082.

Alice Schoenauer-Sebag, Louise Heinrich, Marc Schoenauer, Michele Sebag, Lani F Wu, and Steve J Altschuler. Multi-domain adversarial learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://arxiv.org/abs/1903.09239.

Yuge Shi, N Siddharth, Brooks Paige, and Philip Torr. Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Information Processing Systems, pp. 15692-15703, 2019.

Yuge Shi, Brooks Paige, Philip H S Torr, and N Siddharth. Relating by contrasting: A data-efficient framework for multimodal generative models. 2020. URL https://arxiv.org/abs/2007.01179.
Thomas M Sutter, Imant Daunhawer, and Julia E Vogt. Multimodal generative learning utilizing Jensen-Shannon divergence. 2020. URL https://arxiv.org/abs/2006.08242.

Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. pp. 1-12, 2016. URL http://arxiv.org/abs/1611.01891.

Yingtao Tian and Jesse Engel. Latent translation: Crossing modalities by bridging generative models. 2019. URL https://arxiv.org/abs/1902.08261.

Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. Learning factorized multimodal representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rygqqsA9KX.

Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HkCsm6lRb.

Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montreal, Canada, pp. 5580-5590, 2018. URL http://arxiv.org/abs/1802.05335.

Mike Wu and Noah Goodman. Multimodal generative models for compositional representation learning, 2019.

A.1 PROOF OF LEMMA 1

Σ_{X_k ∈ P(X)} D_KL(q_φ(z|X_k) || p_θ(z|X)), i.e., the sum of KL-divergences in Equation (5), can be used for describing the joint probability log p_θ(X).

Proof. We show that the convex combination of KL-divergences can be directly related to the joint probability log p_θ(X):

Σ_{X_k ∈ P(X)} D_KL(q_φ(z|X_k) || p_θ(z|X)) = Σ_{X_k ∈ P(X)} ( E_{q_φ(z|X_k)}[ log q_φ(z|X_k) − log (p_θ(X|z) p_θ(z)) ] + log p_θ(X) ),   (11)

which can be reformulated as an expression of the joint probability log p_θ(X):

log p_θ(X) = (1/2^M) Σ_{X_k ∈ P(X)} D_KL(q_φ(z|X_k) || p_θ(z|X)) + (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ],

where the first sum is the expression from Equation (5). From the non-negativity of the KL-divergence, we derive the lower bound on the joint probability log p_θ(X):

log p_θ(X) ≥ (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ].   (12)

A.2 PROOF OF LEMMA 2

L_MoPoE(θ, φ; X) is a multimodal ELBO, that is log p_θ(X) ≥ L_MoPoE(θ, φ; X).

Proof. The sums using index k run over all 2^M subsets in the powerset P(X); we use k only for better readability.

log p(X) = D_KL(q_φ(z|X) || p(z|X)) + E_{q_φ(z|X)}[ log ( p(z, X) / q_φ(z|X) ) ]   (13)
= D_KL( (1/2^M) Σ_k q_φ(z|X_k) || p(z|X) ) + E_{q_φ(z|X)}[ log ( p(z, X) / ((1/2^M) Σ_k q_φ(z|X_k)) ) ]   (14)
≥ E_{q_φ(z|X)}[ log ( p(z, X) / ((1/2^M) Σ_k q_φ(z|X_k)) ) ]   (15)
= E_{q_φ(z|X)}[ log p_θ(X|z) ] − D_KL( (1/2^M) Σ_{X_k ∈ P(X)} q_φ(z|X_k) || p_θ(z) )   (16)
= L_MoPoE(θ, φ; X)   (17)

A.3 PROOF OF LEMMA 3

Maximizing L_MoPoE(θ, φ; X) minimizes the convex combination of KL-divergences of the powerset P(X) given in Equation (5).

Proof. Lemma 1 shows that the sum of KL-divergences in Equation (5) is able to describe the joint probability and can be used to form a valid ELBO. Equation (12) is the convex combination of ELBOs given a subset's posterior approximation q_φ(z|X_k). Utilizing Jensen's inequality, it follows:

(1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ] ≤ E_{(1/2^M) Σ_k q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q_φ(z|X_k)) ) ],   (18)

where the sums on the right-hand side also iterate over all X_k ∈ P(X).
From Equation (18), we see that the proposed L_MoPoE(θ, φ; X) is not only a valid lower bound on the joint log-probability log p_θ(X), but also a tighter one than the convex combination of ELBOs:

log p_θ(X) ≥ E_{(1/2^M) Σ_k q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q_φ(z|X_k)) ) ]   (19)
= L_MoPoE(θ, φ; X)   (20)
≥ (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ]   (21)

As Equation (19) can be directly derived from Equation (5), maximizing the proposed objective results in minimizing the convex combination of KL-divergences. In the following, we derive the inequality in Equation (18) in more detail:

(1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X_k) ) ]
= (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log (p_θ(X|z) p_θ(z)) − log q_φ(z|X_k) ]   (23)
= (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log (p_θ(X|z) p_θ(z)) ] − (1/2^M) Σ_{X_k ∈ P(X)} E_{q_φ(z|X_k)}[ log q_φ(z|X_k) ]   (24)
≤ E_{(1/2^M) Σ_k q_φ(z|X_k)}[ log (p_θ(X|z) p_θ(z)) ] − E_{(1/2^M) Σ_k q_φ(z|X_k)}[ log ( (1/2^M) Σ_k q_φ(z|X_k) ) ]   (25)
= E_{(1/2^M) Σ_k q_φ(z|X_k)}[ log ( p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q_φ(z|X_k)) ) ]

In the minuend of Equation (24), the ordering of expectation and sum can be exchanged due to the linearity of the expectation, so it equals the expectation under the mixture distribution. In the subtrahend of Equation (24), the sum of expectations under the posterior approximations of subsets can be reformulated into the expectation of a mixture distribution using Jensen's inequality. Due to the convexity of the function f(t) = t log t (Cover & Thomas, 2006, p. 29), the expectation of the mixture distribution is a lower bound to the sum over the expectations of the posterior approximations, as the mixture distribution can be seen as a convex combination of posterior approximations of subsets of modalities q_φ(z|X_k). Hence, the inequality from Equation (24) to Equation (25) follows, as we decrease the subtrahend in Equation (25).

B EVALUATION OF EXPERIMENTS

For the experiments, we evaluate all models regarding three different metrics: the classification accuracy (or average precision for CelebA) of the latent representation, the coherence of generated samples, and the test set log-likelihoods. The latent representations are evaluated using a logistic regression classifier from scikit-learn (Pedregosa et al., 2011). The classifier is trained using 500 samples from the training set, which are encoded using the trained models. The evaluation is done on the full test set and the reported numbers are the average performances over all batches in the test set. The generation coherence is evaluated using classifiers that were trained beforehand and share the network architectures of the unimodal encoders. For every data type, we train a neural network classifier in a supervised way. The architecture of the classifier is identical to the encoder except for the last layer. For joint coherence, all generated samples are evaluated by the classifiers, and if all modalities are classified as having the same label, they are considered coherent. The coherence accuracy is the ratio of coherent samples divided by the number of generated samples. For conditional generation, the conditionally generated samples have to be coherent with the input samples. The test set log-likelihoods are evaluated using 15 importance samples for all models and the reported numbers are the averages over all test set batches. If not stated differently, the reported numbers in Section 4 are the means and standard deviations of 5 runs with different random seeds. All evaluated models use the same architectures and numbers of parameters. The likelihoods of the different modalities are weighted relative to each other according to the size of the modality for all experiments. The most dominant modality is set to 1.0. The remaining ones are scaled up by the ratio of their data dimensions. For example, in the MNIST-SVHN-Text experiment, SVHN is set to 1.0 and MNIST to 3.92, which is the ratio of their data dimensions.
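The coherence metrics described above can be summarized in a few lines. The sketch below is a schematic illustration rather than the evaluation code released with the paper: `classifiers` maps each modality to a hypothetical predict function of the pre-trained, modality-specific classifier, `generated` maps each modality to the corresponding generated samples, and the per-modality aggregation for conditional coherence is one possible choice that mirrors the per-modality columns of the tables in Section 4.

```python
import numpy as np

def joint_coherence(generated, classifiers):
    """Fraction of generated tuples for which all modality-specific classifiers agree on the label."""
    preds = np.stack([classifiers[m](generated[m]) for m in sorted(generated)])   # [M, N]
    return float(np.mean(np.all(preds == preds[0], axis=0)))

def conditional_coherence(generated, input_labels, classifiers):
    """Per-modality fraction of conditionally generated samples whose predicted label
    matches the label of the conditioning input samples."""
    labels = np.asarray(input_labels)
    return {m: float(np.mean(classifiers[m](generated[m]) == labels)) for m in generated}
```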
For all unimodal posterior approximations, we assume Gaussian distributions N(z; µ, σ²I_n), where n is the number of latent space dimensions. In all experiments, the mixture components are equally weighted with 1/(#components).

B.1 COMPARISON TO PREVIOUS WORKS

Shi et al. (2019) ultimately use a different ELBO objective that includes importance samples, L_IWAE (Burda et al., 2016). We compare all models without the use of importance samples, as these could easily be introduced to all objectives and are not directly related to the focus of this work, which is the choice of joint posterior approximation. Sutter et al. (2020) utilize the Jensen-Shannon divergence as a regularizer instead of the KL-divergence. This results in the use of a dynamic prior and shows promising results. Besides the dynamic prior, they also model the joint posterior approximation using a MoE. Again, we do not include models utilizing a dynamic prior, as this could be introduced to all formulations and is not the focus of this work.

B.1.1 EQUIVALENCE TO ELBO FORMULATION IN SHI ET AL. (2019)

For clarity, we show here the equivalence of the formulation in Equation (10) to the formulation in (Shi et al., 2019, p. 5).

L_MoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) || p_θ(z) )
= E_{q_φ(z|X)}[log p_θ(X|z)] − E_{(1/M) Σ_{j=1}^M q_{φ_j}(z|x_j)}[ log ( (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) / p_θ(z) ) ]
= E_{q_φ(z|X)}[log p_θ(X|z)] − E_{q_φ(z|X)}[ log ( q_φ(z|X) / p_θ(z) ) ]
= E_{q_φ(z|X)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X) ) ]   (32)
= (1/M) Σ_{j=1}^M E_{q_{φ_j}(z|x_j)}[ log ( p_θ(X|z) p_θ(z) / q_φ(z|X) ) ],   (33)

where Equations (32) and (33) are equivalent to the first equation on page 5 in Shi et al. (2019). The different formulation on the second line of the first equation on page 5 comes from their use of importance samples.

Table 6: Generation coherence for MNIST-SVHN-Text. Each subtable reports the coherence of one generated modality conditioned on the subsets given in the column headers (M: MNIST; S: SVHN; T: Text; combinations separated by commas denote the corresponding subsets); the last subtable reports the joint coherence. We report the mean value and standard deviation of 5 runs.

Generated modality: M
MODEL  | S         | T         | S,T
MVAE   | 0.24±0.01 | 0.20±0.05 | 0.32±0.03
MMVAE  | 0.75±0.06 | 0.99±0.01 | 0.87±0.03
MOPOE  | 0.74±0.04 | 0.99±0.01 | 0.94±0.01

Generated modality: S
MODEL  | M         | T         | M,T
MVAE   | 0.43±0.02 | 0.30±0.08 | 0.75±0.04
MMVAE  | 0.31±0.03 | 0.30±0.04 | 0.30±0.03
MOPOE  | 0.36±0.07 | 0.34±0.06 | 0.37±0.06

Generated modality: T
MODEL  | M         | S         | M,S
MVAE   | 0.28±0.06 | 0.17±0.02 | 0.29±0.06
MMVAE  | 0.96±0.01 | 0.76±0.04 | 0.84±0.02
MOPOE  | 0.96±0.01 | 0.76±0.03 | 0.93±0.01

MODEL  | JOINT COHERENCE
MVAE   | 0.12±0.02
MMVAE  | 0.28±0.01
MOPOE  | 0.31±0.03

C MNIST-SVHN-TEXT

C.1 DATASET

The dataset MNIST-SVHN-Text was introduced and described by Sutter et al. (2020). As in the bimodal experiment of Shi et al. (2019), we create 20 triples per set, resulting in a many-to-many mapping.

C.2 EXPERIMENTAL SETUP

The latent space dimension is set to 20 for all modalities, models and runs. The results in Tables 2 to 4 are generated with β = 5.0. We train all models for 150 epochs. We use the same architectures as in Sutter et al. (2020). For the MNIST encoder and decoder, we use fully-connected layers; for the SVHN and text encoders and decoders, feed-forward convolutional layers. For all layers, we use ReLU activation functions (Nair & Hinton, 2010).
The detailed architectures can also be looked up in the released code. We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001.

C.3 ADDITIONAL RESULTS

In Table 6, we show the coherence results including the standard deviations of the 5 runs, which were removed from the main part due to space restrictions. Additionally, we perform the analysis of coherence in relation to log-likelihood for conditional generation as well, similar to the example using random generation in Section 4.1. The combination of coherence and log-likelihoods shows the ability of a model to learn the data distribution as well as to generate coherent samples. Every point refers to a different β value. We evaluated the models for β = [0.5, 1.0, 2.5, 5.0, 10.0, 20.0]. The points in the figures are the mean values of 5 different runs, with the lines being the standard deviations in both directions, coherence and log-likelihoods. Figure 7 displays a qualitative comparison between the three models using 100 randomly generated samples. The generated samples correspond to the numbers in Section 4.1. MVAE is able to best approximate the joint distribution in terms of sample quality at the price of limited coherence, while MMVAE shows higher coherence but limited sample quality. MoPoE approximates MVAE's sample quality with state-of-the-art coherence.

Figure 5: Coherence and log-likelihoods for MNIST-SVHN-Text. The three figures show the evaluation for the conditional generation of a single modality given the other two in relation to the joint log-likelihood given these two modalities, e.g., in the first row we generate SVHN samples conditioned on MNIST and Text. The points in the figures are the mean values of 5 different runs, with the lines being the standard deviations in both directions, coherence and log-likelihoods.

Figure 6: Coherence and log-likelihoods for MNIST-SVHN-Text. The three rows of figures show the evaluation for the conditional generation of two modalities given the remaining one in relation to the joint log-likelihood given this single modality, e.g., in the first row we generate SVHN and Text samples conditioned on MNIST. The points in the figures are the mean values of 5 different runs, with the lines being the standard deviations in both directions, coherence and log-likelihoods.

Figure 7: Qualitative comparison of randomly generated MNIST-SVHN-Text samples ((a) MVAE: MNIST; (b) MVAE: SVHN; (c) MVAE: Text; (d) MMVAE: MNIST; (e) MMVAE: SVHN; (f) MMVAE: Text; (g) MoPoE: MNIST; (h) MoPoE: SVHN; (i) MoPoE: Text).

In addition to the theoretical proof of Lemma 3 that Definition 1 minimizes the convex combination of ELBOs, we compare the performance of a model trained using the objective in Equation (5) to the proposed method MoPoE-VAE. It can be seen that MoPoE-VAE achieves results competitive with the model which optimizes Equation (5) directly. This shows empirically that the proposed method is indeed minimizing the convex combination of ELBOs in Equation (5). Optimizing Equation (5) directly amounts to optimizing the ELBO of every possible subset separately. Hence, Equation (5) is computationally much more expensive to optimize.

Table 7: Comparison of objectives: Equation (5) and Definition 1. We report the test set log-likelihoods of the joint generative model conditioned on the variational posterior of subsets of modalities q_φ(z|X_k). (x_M: MNIST; x_S: SVHN; x_T: Text; X = (x_M, x_S, x_T)).
For both objectives, we use β = 2.5.

MODEL   | X          | X|x_M      | X|x_S      | X|x_T      | X|x_M,x_S  | X|x_M,x_T  | X|x_S,x_T
EQ. (5) | -1810      | -1993      | -1831      | -2039      | -1811      | -2000      | -1839
MOPOE   | -1815±12.4 | -1990±4.4  | -1858±13.2 | -2024±1.2  | -1819±13.4 | -1986±2.5  | -1848±11.5

D POLYMNIST

D.1 DATASET

For the creation of the PolyMNIST dataset, we fuse each MNIST image with a random crop of size 28x28 from the background image of the respective modality. In particular, we binarize the MNIST image and invert the colors of the random crop at those locations where the binarized MNIST digit is visible (a code sketch of this step is given at the end of this appendix). We use the following background images:

1. John Burkardt. Licensed under GNU LGPL. https://people.sc.fsu.edu/~jburkardt/data/jpg/fractal_tree.jpg [Online; retrieved 27.09.2020]
2. Edvard Munch. The Scream. Public domain. https://upload.wikimedia.org/wikipedia/commons/f/f4/The_Scream.jpg [Online; retrieved 27.09.2020]
3. The Waterloo Image Repository. Lena. Copyright belongs to the author. http://links.uwaterloo.ca/Repository/TIF/lena3.tif [Online; retrieved 27.09.2020]
4. John Burkardt. Licensed under GNU LGPL. https://people.sc.fsu.edu/~jburkardt/data/jpg/star_field.jpg [Online; retrieved 27.09.2020]
5. John Burkardt. Licensed under GNU LGPL. https://people.sc.fsu.edu/~jburkardt/data/jpg/shingles.jpg [Online; retrieved 27.09.2020]

D.2 EXPERIMENTAL SETUP

The latent space dimension is set to 512 for all modalities, models and runs. All results are based on β = 2.5, which was found to be a reasonable setting for all models. We use the same architectures for all methods and train all models for 300 epochs. We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001. The architecture is based on straightforward convolutional neural networks (without bells and whistles); for details, we refer to the released code.

D.3 QUALITATIVE RESULTS

In Figures 8 to 11, we show qualitative results comparing the different methods.

Figure 8: Reconstructions across all modalities for all models. In every pair of rows, we show one row of test images followed by one row of respective reconstructions.

Figure 9: Ten unconditionally generated images from the respective five modalities for each model. Column-wise, we use the same latent codes, sampled from the prior. Note that, row-wise, the digits should not be ordered.

Figure 10: Conditionally generated images of the first modality given the respective test example from the second modality shown in the first row. Column-wise, we take different samples from the approximate posterior, which should result in stylistic variations for generated outputs, but which should ideally not change the digit labels.

Figure 11: Conditionally generated images of the first modality given the four test examples from the remaining modalities shown in the first four rows. Column-wise, we take different samples from the approximate posterior, which should result in stylistic variations for generated outputs, but which should ideally not change the digit labels. Compared to the results from Figure 10, the MoPoE-VAE generates more coherent samples when conditioned on four instead of one input modality.
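The fusion step described in Appendix D.1 can be sketched as follows. Parameter names, the binarization threshold, and the assumption that pixel values lie in [0, 1] are illustrative choices and not taken from the released dataset-generation code.

```python
import numpy as np

def fuse_digit_with_background(digit, background, seed=0, threshold=0.5):
    """Overlay one 28x28 grayscale MNIST digit (values in [0, 1]) on a random 28x28 crop
    of a modality-specific background image (H x W x 3, values in [0, 1]): the digit is
    binarized and the crop's colors are inverted wherever the binarized digit is visible.
    """
    rng = np.random.default_rng(seed)
    h, w = background.shape[:2]
    top = rng.integers(h - 28 + 1)
    left = rng.integers(w - 28 + 1)
    crop = background[top:top + 28, left:left + 28].copy()
    mask = digit > threshold                       # binarized digit
    crop[mask] = 1.0 - crop[mask]                  # invert background colors under the digit
    return crop
```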
E BIMODAL CELEBA

E.1 DATASET

The bimodal version of CelebA was introduced by Sutter et al. (2020). The text modality consists of strings which concatenate the attributes that are present in a face. If an attribute is not present, it is also missing from the string, which makes the text a more difficult modality. Example strings can be seen at the top of Figure 4.

E.1.1 MODALITY-SPECIFIC LATENT SPACES

Modality-specific latent spaces have empirically been shown to be useful (Bouchacourt et al., 2018; Hsu & Glass, 2018; Daunhawer et al., 2020; Sutter et al., 2020), especially for the generative quality of samples. As CelebA is a visually challenging dataset, we adopt this idea and the ELBO formulation changes accordingly. For details, we refer the reader to the aforementioned papers. The latent space is divided into a shared space q_{φ_c}(c|X) and modality-specific spaces q_{φ_{s_j}}(s_j|x_j) for every modality x_j. This allows every modality to encode information which is specific to this modality in a separate latent space:

L̃(θ, φ; X) = Σ_{j=1}^M E_{q_{φ_c}(c|X)}[ E_{q_{φ_{s_j}}(s_j|x_j)}[ log p_θ(x_j|s_j, c) ] ] − Σ_{j=1}^M D_KL(q_{φ_{s_j}}(s_j|x_j) || p_θ(s_j)) − D_KL( (1/2^M) Σ_{X_k ∈ P(X)} q_{φ_c}(c|X_k) || p_θ(c) ),   (34)

where q_{φ_c}(c|X) = (1/2^M) Σ_{X_k ∈ P(X)} q_{φ_c}(c|X_k) models the shared information and q_{φ_{s_j}}(s_j|x_j) the modality-specific information for every modality. All posterior approximations, shared and modality-specific, are again assumed to be Gaussian distributed, see Appendix B.

E.2 EXPERIMENTAL SETUP

The latent spaces are set to 32 dimensions for the shared space as well as for the modality-specific spaces, resulting in 64 dimensions per modality in total. We set β = 2.5 for all runs and models. All models are trained for 200 epochs. Again, we use the same architectures as in Sutter et al. (2020): the encoders and decoders of both image and text use residual blocks (He et al., 2016). We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.0005. The architectures can also be looked up in the released code. The classification of samples and representations is evaluated using average precision due to the imbalanced nature of the distribution of labels.

E.3 ADDITIONAL RESULTS

We show the attribute-specific evaluations in Figures 12 and 13, where the representations and generated samples are evaluated with respect to individual attributes. The evaluations are performed for all subsets of modalities. We see the differences in average precision between attributes in the coherence of samples as well as in the latent representations. The correlation between learned representation and coherence of samples gives further evidence on the importance of a good representation also for the multimodal setting and its task of conditional generation. Figure 14 displays qualitative results of randomly generated samples. We can see the high-quality samples the proposed model is able to generate, which cover a wide variety of attributes. In the images, minor artefacts can be seen. This suggests that there is still room for improvement with a more rigorous hyperparameter search.

Figure 12: Coherence of generated bimodal CelebA samples. For every subplot, image and text are generated conditionally given the modality or subset of modalities indicated in the subcaption. We see that different attributes are not learned equally well.

Figure 13: Learned latent representations for the bimodal CelebA dataset.

Figure 14: Qualitative results of randomly generated CelebA samples ((a) MoPoE: Image; (b) MoPoE: Text).