# Multi-Source Neural Variational Inference

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Richard Kurle (Department of Informatics, Technical University of Munich; Data:Lab, Volkswagen Group, 80805 Munich, Germany), richard.kurle@tum.de
Stephan Günnemann (Department of Informatics, Technical University of Munich), guennemann@in.tum.de
Patrick van der Smagt (Data:Lab, Volkswagen Group, 80805 Munich, Germany)

## Abstract

Learning from multiple sources of information is an important problem in machine-learning research. The key challenges are learning representations and formulating inference methods that take into account the complementarity and redundancy of various information sources. In this paper we formulate a variational-autoencoder-based multi-source learning framework in which each encoder is conditioned on a different information source. This allows us to relate the sources via the shared latent variables by computing divergence measures between the individual sources' posterior approximations. We explore a variety of options to learn these encoders and to integrate the beliefs they compute into a consistent posterior approximation. We visualise learned beliefs on a toy dataset and evaluate our methods for learning shared representations and structured output prediction, showing trade-offs of learning separate encoders for each information source. Furthermore, we demonstrate how conflict detection and redundancy can increase the robustness of inference in a multi-source setting.

## 1 Introduction

An essential feature of most living organisms is the ability to process, relate, and integrate information coming from a vast number of sensors, and eventually from memories and predictions (Stein and Meredith 1993). While integrating information from complementary sources enables a coherent and unified description of the environment, redundant sources are beneficial for reducing uncertainty and ambiguity. Furthermore, when sources provide conflicting information, it can be inferred that some sources must be unreliable. Replicating this feature is an important goal of multimodal machine learning (Baltrušaitis, Ahuja, and Morency 2017).

Learning joint representations of multiple modalities has been attempted using various methods, including neural networks (Ngiam et al. 2011), probabilistic graphical models (Srivastava and Salakhutdinov 2014), and canonical correlation analysis (Andrew et al. 2013). These methods focus on learning joint representations and multimodal sensor fusion. However, it is challenging to relate information extracted from different modalities. In this work, we aim at learning probabilistic representations that can be related to each other by statistical divergence measures as well as translated from one modality to another. We make no assumptions about the nature of the data (i.e. multimodal or multi-view) and therefore adopt a more general problem formulation, namely learning from multiple information sources.

Probabilistic graphical models are a common choice to address the difficulties of learning from multiple sources, modelling relationships between information sources (i.e., observed random variables) via unobserved random variables. Inferring the hidden variables is usually only tractable for simple linear models. For nonlinear models, one has to resort to approximate Bayesian methods.
The variational autoencoder (VAE) (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) is one such method, combining neural networks and variational inference for latent-variable models (LVM). We build on the VAE framework, jointly learning the generative and inference models from multiple information sources. In contrast to the VAE, we encapsulate the individual inference models into separate modules. As a result, we obtain multiple posterior approximations, each informed by a different source. These posteriors represent beliefs over the same latent variables of the LVM, conditioned on the information available in the respective source. Modelling beliefs individually, yet coupled through the generative model, enables computing meaningful quantities such as measures of surprise, redundancy, or conflict between beliefs. Exploiting these measures can in turn increase the robustness of the inference models. Furthermore, we explore different methods to integrate arbitrary subsets of these beliefs, approximating the posterior for the respective subset of observations. In essence, we modularise neural variational inference such that information sources and their associated encoders can be flexibly interchanged and combined after training.

## 2 Background

**Neural variational inference.** Consider a dataset $X = \{x^{(n)}\}_{n=1}^{N}$ of $N$ i.i.d. samples of some random variable $x$ and the following generative model:

$$p_\theta(x^{(n)}) = \int p_\theta(x^{(n)} \mid z^{(n)})\, p(z^{(n)})\, \mathrm{d}z^{(n)},$$

where $\theta$ are the parameters of a neural network defining the conditional distribution between the latent and observable random variables $z$ and $x$, respectively. The variational autoencoder (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014) is an approximate inference method that enables learning the parameters of this model by optimising an evidence lower bound (ELBO) on the log marginal likelihood. A second neural network with parameters $\phi$ defines the parameters of an approximation $q_\phi(z \mid x)$ of the posterior distribution. Since the computational cost of inference is shared across data points by using a recognition model, some authors refer to this form of inference as amortised or neural variational inference (Gershman and Goodman 2014; Mnih and Gregor 2014).

The importance weighted autoencoder (IWAE) (Burda, Grosse, and Salakhutdinov 2015) generalises the VAE by using a multi-sample importance-weighting estimate of the log-likelihood. The IWAE ELBO is given as

$$\ln p_\theta(x^{(n)}) \geq \mathbb{E}_{z^{(n)}_{1:K} \sim q_\phi(z^{(n)} \mid x^{(n)})}\Big[\ln \frac{1}{K} \sum_{k=1}^{K} w^{(n)}_k\Big],$$

where $K$ is the number of importance samples and $w^{(n)}_k$ are the importance weights:

$$w^{(n)}_k = \frac{p_\theta(x^{(n)} \mid z^{(n)}_k)\, p(z^{(n)}_k)}{q_\phi(z^{(n)}_k \mid x^{(n)})}.$$

Besides achieving a tighter lower bound, the IWAE was motivated by the observation that a multi-sample estimate does not require all samples from the variational distribution to have high posterior probability. This enables training a generative model using samples from a variational distribution with higher uncertainty. Importantly, this distribution need not be the posterior of all observations in the generative model. It can be a good-enough proposal distribution, i.e. the belief from a partially informed source.
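To make the estimator concrete, the following is a minimal numpy sketch of the single-datapoint IWAE bound; the `encode` and `decode` callables, the diagonal-Gaussian parameterisation, and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def diag_gauss_logpdf(z, mean, std):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(std)
                  - 0.5 * ((z - mean) / std) ** 2, axis=-1)

def iwae_bound(x, encode, decode, K=16, z_dim=2, rng=np.random.default_rng()):
    """Monte Carlo estimate of ln(1/K sum_k w_k) for one datapoint, with
    w_k = p(x | z_k) p(z_k) / q(z_k | x)."""
    q_mean, q_std = encode(x)                              # parameters of q_phi(z | x)
    z = q_mean + q_std * rng.standard_normal((K, z_dim))   # K importance samples
    log_w = (decode(x, z)                                  # log p_theta(x | z_k), shape (K,)
             + diag_gauss_logpdf(z, 0.0, 1.0)              # log p(z_k), standard-normal prior
             - diag_gauss_logpdf(z, q_mean, q_std))        # log q_phi(z_k | x)
    return logsumexp(log_w) - np.log(K)                    # log-mean-exp of the weights
```

Setting K = 1 recovers the usual single-sample VAE ELBO estimate.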
## 3 Multi-Source Neural Variational Inference

We are interested in datasets consisting of tuples $\{x^{(n)} = (x^{(n)}_1, \dots, x^{(n)}_M)\}_{n=1}^{N}$; we use $m \in \{1, \dots, M\}$ to denote the index of the source. Each observation $x^{(n)}_m \in \mathbb{R}^{D_m}$ may be embedded in a different space but is assumed to be generated from the same latent state $z^{(n)}$. Therefore, each $x^{(n)}_m$ corresponds to a different, potentially limited source of information about the underlying state $z^{(n)}$. From now on we refer to $x_m$ in the generative model as observations and to the same $x_m$ in the inference model as information sources.

We model each observation $x_m$ in the generative model with a distinct set of parameters $\theta_m$, although some parameters could be shared. The likelihood function is given as

$$p_\theta(x^{(n)} \mid z^{(n)}) = \prod_{m=1}^{M} p_{\theta_m}\big(x^{(n)}_m \mid z^{(n)}\big).$$

For inference, the VAE conditions on all observable data $x^{(n)}$. However, one can condition (amortise) the approximate posterior distribution on any set of information sources. In this paper we limit ourselves to $x^{(n)}_S$ with $S \subseteq \{1, \dots, M\}$. An approximate posterior distribution $q_{\phi_S}(z^{(n)} \mid x^{(n)}_S)$ may then be interpreted as the belief of the respective information sources about the latent variables underlying the generative process. In contrast to the VAE, we want to calculate the beliefs from different information sources individually, compare them, and eventually integrate them. In the following, we address each of these desiderata.

### 3.1 Learning individual beliefs

In order to learn individual inference models as in Fig. 1a, we propose an average of M ELBOs, one for each information source and its respective inference model. The resulting objective is itself an ELBO on the log marginal likelihood and is referred to as $\mathcal{L}^{(\mathrm{ind})}$:

$$\mathcal{L}^{(\mathrm{ind})} = \sum_{m=1}^{M} \pi_m\, \mathbb{E}_{z^{(n)}_{1:K} \sim q_{\phi_m}(z^{(n)} \mid x^{(n)}_m)}\Big[\ln \frac{1}{K} \sum_{k=1}^{K} w^{(n)}_{m,k}\Big], \qquad w^{(n)}_{m,k} = \frac{p_\theta\big(x^{(n)} \mid z^{(n)}_k\big)\, p\big(z^{(n)}_k\big)}{q_{\phi_m}\big(z^{(n)}_k \mid x^{(n)}_m\big)}.$$

The indices $n$, $m$ and $k$ refer to the data sample, the information source, and the importance sample, respectively. The factors $\pi_m$ are the weights of the ELBOs, satisfying $0 \leq \pi_m \leq 1$ and $\sum_{m=1}^{M} \pi_m = 1$. Although the $\pi_m$ could be inferred, we set $\pi_m = 1/M$ for all $m$. This ensures that all parameters $\phi_m$ are optimised individually to their best possible extent instead of down-weighting less informative sources.

Since we are dealing with partially informed encoders $q_{\phi_m}(z^{(n)} \mid x^{(n)}_m)$ instead of $q_\phi(z^{(n)} \mid x^{(n)})$, the beliefs can be more uncertain than the posterior of all observations $x$. This in turn degrades the generative model, as it requires samples from the posterior distribution. We found that the generative model becomes biased towards generating averaged samples rather than samples from a diverse, multimodal distribution. This issue arises in VAE-based objectives, irrespective of the complexity of the variational family, because each Monte Carlo sample of latent variables must predict all observations. To account for this, we use importance-sampling estimates of the log-likelihood (see Sec. 2). The importance weighting and sampling-importance-resampling can be seen as feedback from the observations, allowing the true posterior to be approximated even with poorly informed beliefs.
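A minimal sketch of this objective for one datapoint, reusing `diag_gauss_logpdf` from the Sec. 2 sketch; the per-source `encoders` and the joint log-likelihood `log_joint_lik` are illustrative assumptions. The key point is that the numerator of $w^{(n)}_{m,k}$ scores each sample against all observations, while the proposal in the denominator is informed by a single source.

```python
import numpy as np
from scipy.special import logsumexp

def l_ind_bound(x_sources, encoders, log_joint_lik, K=16, z_dim=2,
                rng=np.random.default_rng()):
    """Average of M IWAE-style bounds, one per information source (pi_m = 1/M).
    Relies on diag_gauss_logpdf from the Sec. 2 sketch."""
    bounds = []
    for x_m, encode in zip(x_sources, encoders):
        mean, std = encode(x_m)                            # q_{phi_m}(z | x_m)
        z = mean + std * rng.standard_normal((K, z_dim))   # z_k ~ q_{phi_m}
        log_w = (log_joint_lik(x_sources, z)               # log p_theta(x | z_k): ALL sources
                 + diag_gauss_logpdf(z, 0.0, 1.0)          # log p(z_k)
                 - diag_gauss_logpdf(z, mean, std))        # log q_{phi_m}(z_k | x_m)
        bounds.append(logsumexp(log_w) - np.log(K))
    return float(np.mean(bounds))
```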
Figure 1: Graphical models of the inference models: (a) individual inferences, (b) mixture-of-experts inference, (c) product-of-experts inference. White circles denote hidden random variables, grey-shaded circles observed random variables, and diamonds deterministic variables. N is the number of i.i.d. samples in the dataset. To better distinguish the mixture- or product-of-experts models from an IWAE with hard-wired integration in a neural-network layer, we explicitly draw the deterministic variables λ1, ..., λM, denoting the parameters of the variational distributions.

### 3.2 Comparing beliefs

Encapsulating individual inferences has an appealing advantage over an uninterpretable, deterministic combination within a neural network: having obtained multiple beliefs w.r.t. the same latent variables, each informed by a distinct source, we can calculate meaningful quantities to relate the sources. Examples are measures of redundancy, surprise, or conflict; here we focus on the latter. Detecting conflict between beliefs is crucial to avoid false inferences and thus increase the robustness of the model. Conflicting beliefs may stem from conflicting data or from unreliable (inference) models. The former is a form of data anomaly, e.g. due to a failing sensor. An unreliable model, on the other hand, may result from model misspecification or optimisation problems, i.e. due to the approximation or amortisation gap, respectively (Cremer, Li, and Duvenaud 2018). Distinguishing between the two causes of conflict is challenging, however, and requires evaluating the observed data under the likelihood functions.

Previous work has used the ratio of two KL divergences as a criterion to detect a conflict between a subjective prior and the data (Bousquet 2008). The numerator is the KL divergence between the posterior and the subjective prior, and the denominator is the KL divergence between the posterior and a non-informative reference prior. The two KL divergences measure the information gain of the posterior induced by the evidence w.r.t. the subjective prior and the non-informative prior, respectively. The decision criterion for conflict is a ratio greater than 1. We propose a similar ratio, replacing the subjective prior with $q_{\phi_m}$ and taking the prior as reference:

$$c(m \,\|\, m') = \frac{D_{\mathrm{KL}}\big(q_{\phi_{m'}}(z \mid x_{m'}) \,\big\|\, q_{\phi_m}(z \mid x_m)\big)}{D_{\mathrm{KL}}\big(q_{\phi_{m'}}(z \mid x_{m'}) \,\big\|\, p(z)\big)}. \qquad (2)$$

This measure has the property that it yields high values if the belief of source m is significantly more certain than that of m'. This is desirable for sources with redundant information. For complementary information sources, other conflict measures, e.g. the measure defined in (Dahl, Gåsemyr, and Natvig 2007), may be more appropriate.
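For the diagonal-Gaussian beliefs used throughout the paper, both KL terms in Eq. (2) have closed forms; below is a small sketch (the function names are ours, and the standard-normal prior matches the paper's setup).

```python
import numpy as np

def kl_diag_gauss(m0, s0, m1, s1):
    """KL( N(m0, diag(s0^2)) || N(m1, diag(s1^2)) ) in closed form."""
    return float(np.sum(np.log(s1 / s0)
                        + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5))

def conflict(mean_mp, std_mp, mean_m, std_m):
    """Eq. (2): c(m || m') compares the belief of source m' against that of
    source m, normalised by the information gain w.r.t. the N(0, I) prior.
    A ratio greater than 1 signals conflict."""
    num = kl_diag_gauss(mean_mp, std_mp, mean_m, std_m)
    den = kl_diag_gauss(mean_mp, std_mp,
                        np.zeros_like(mean_mp), np.ones_like(std_mp))
    return num / den
```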
### 3.3 Integrating beliefs

So far, we have shown how to learn separate beliefs from different sources and how to relate them. However, we have not yet integrated the information from these sources. This can be seen by noticing that the gap between $\mathcal{L}^{(\mathrm{ind})}$ and the log marginal likelihood is significantly larger than for an IWAE with an inflexible, hard-wired combination (see the supplementary material of our accompanying technical report (Kurle, Günnemann, and Smagt 2018)). Here we propose two methods to integrate the beliefs $q_{\phi_m}(z \mid x_m)$ into an integrated belief $q_\phi(z \mid x)$.

**Disjunctive integration: mixture of experts.** One approach to combining individual beliefs is to treat them as alternatives, which is justified if some (but not all) sources or their respective models are unreliable or in conflict (Khaleghi et al. 2013). We propose a mixture-of-experts (MoE) distribution, where each component is the belief informed by a different source. The corresponding graphical model for inference is shown in Fig. 1b. As in Sec. 3.1, the variational parameters are each predicted from one source individually, without communication between them. The difference is that each $q_{\phi_m}(z \mid x_m)$ is considered a mixture component, such that the whole mixture distribution approximates the true posterior. Instead of learning individual beliefs $q_{\phi_m}(z \mid x_m)$ by optimising $\mathcal{L}^{(\mathrm{ind})}$ and integrating them subsequently into a combined $q_\phi(z \mid x)$, we can design an objective function for learning the MoE posterior directly. We refer to the corresponding ELBO as $\mathcal{L}^{(\mathrm{MoE})}$. It differs from $\mathcal{L}^{(\mathrm{ind})}$ only in the denominator of the importance weights, using the mixture distribution with component weights $\pi_m$:

$$w^{(n)}_{m,k} = \frac{p_\theta\big(x^{(n)} \mid z^{(n)}_k\big)\, p\big(z^{(n)}_k\big)}{\sum_{m'=1}^{M} \pi_{m'}\, q_{\phi_{m'}}\big(z^{(n)}_k \mid x^{(n)}_{m'}\big)}.$$

**Conjunctive integration: product of experts.** Another option for combining beliefs are conjunctive methods, treating each belief as a constraint. These are applicable in the case of equally reliable and independent evidences (Khaleghi et al. 2013). This can be seen by inspecting the mathematical form of the posterior distribution of all observations. Applying Bayes' rule twice reveals that the true posterior of a graphical model with conditionally independent observations can be decomposed as a product of experts (PoE) (Hinton 2002):

$$p(z \mid x) = \frac{\prod_{m=1}^{M} p(x_m)}{p(x)} \cdot \frac{\prod_{m=1}^{M} p(z \mid x_m)}{p(z)^{M-1}}. \qquad (3)$$

We propose to approximate Eq. (3) by replacing the true posteriors of single observations $p(z \mid x_m)$ by the variational distributions $q_{\phi_m}(z \mid x_m)$, obtaining the inference model shown in Fig. 1c. In order to make the PoE distribution computable, we further assume that the variational distributions and the prior are conjugate distributions in the exponential family. Probability distributions in the exponential family have the well-known property that their product is also in the exponential family. Hence, we can calculate the normalisation constant in Eq. (3) from the natural parameters. In this work, we focus on the popular case of normal distributions. For the derivation of the natural parameters and the normalisation constant, we refer to the supplementary material of our technical report (Kurle, Günnemann, and Smagt 2018).

Analogous to the MoE case above, we can design an objective to learn the PoE distribution directly, rather than integrating individual beliefs. We refer to the corresponding ELBO as $\mathcal{L}^{(\mathrm{PoE})}$:

$$\mathcal{L}^{(\mathrm{PoE})} = \mathbb{E}_{z^{(n)}_{1:K} \sim q_\phi(z^{(n)} \mid x^{(n)})}\Big[\ln \frac{1}{K} \sum_{k=1}^{K} w^{(n)}_k\Big], \qquad (4)$$

where the $w^{(n)}_k$ are the standard importance weights as in the IWAE and $q_\phi(z^{(n)} \mid x^{(n)})$ is the PoE inference distribution. However, the natural parameters of the individual normal distributions are not uniquely identifiable from the natural parameters of the integrated normal distribution. Thus, optimising $\mathcal{L}^{(\mathrm{PoE})}$ leads to inseparable individual beliefs. To account for this, we propose a hybrid of the individual and integrated inference distributions:

$$\mathcal{L}^{(\mathrm{hybrid})} = \lambda_1 \mathcal{L}^{(\mathrm{ind})} + \lambda_2 \mathcal{L}^{(\mathrm{PoE})}, \qquad (5)$$

where we choose $\lambda_1 = \lambda_2 = \tfrac{1}{2}$ in practice for simplicity. In Sec. 5 we evaluate the proposed integration methods both as learning objectives and for integrating the beliefs obtained by optimising $\mathcal{L}^{(\mathrm{ind})}$ or $\mathcal{L}^{(\mathrm{hybrid})}$. Note again, however, that $\mathcal{L}^{(\mathrm{PoE})}$ and $\mathcal{L}^{(\mathrm{hybrid})}$ assume conditionally independent observations and equally reliable sources. In contrast, $\mathcal{L}^{(\mathrm{ind})}$ makes no assumptions about the structure of the generative model. This allows for any choice of appropriate integration method after learning.
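For Gaussian beliefs, the PoE in Eq. (3) can be computed in natural parameters: precisions add, and, under one direct reading of Eq. (3) with a standard-normal prior, M-1 copies of the prior's precision are subtracted (the paper defers the exact derivation to its supplementary material). The sketch below is our own minimal version, not the paper's code; it also shows the disjunctive alternative of sampling from the MoE.

```python
import numpy as np

def poe_gaussians(means, stds):
    """Conjunctive integration of M diagonal-Gaussian beliefs per Eq. (3),
    assuming a N(0, I) prior. `means` and `stds` have shape (M, z_dim).
    The combined precision must stay positive, which holds whenever the
    beliefs are at least as certain as the prior."""
    means, stds = np.asarray(means), np.asarray(stds)
    prec = 1.0 / stds ** 2                              # per-source precisions
    M = prec.shape[0]
    prec_poe = prec.sum(axis=0) - (M - 1) * 1.0         # divide out prior (precision 1) M-1 times
    mean_poe = (prec * means).sum(axis=0) / prec_poe    # precision-weighted mean (prior mean 0)
    return mean_poe, 1.0 / np.sqrt(prec_poe)

def moe_sample(means, stds, pis, rng=np.random.default_rng()):
    """Disjunctive integration: one draw from the mixture of experts."""
    m = rng.choice(len(pis), p=pis)                     # pick a component/source
    return means[m] + stds[m] * rng.standard_normal(np.shape(means[m]))
```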
## 4 Related Work

Canonical correlation analysis (CCA) (Hotelling 1936) is an early attempt to examine the relationship between two sets of variables. CCA and its nonlinear variants (Shon et al. 2005; Andrew et al. 2013; Feng, Li, and Wang 2015) propose projections of pairs of features such that the transformed representations are maximally correlated. CCA variants have been widely used for learning from multiple information sources (Hardoon, Szedmak, and Shawe-Taylor 2004; Rasiwasia et al. 2010). These methods have in common with ours that they learn a common representational space for multimodal data. Furthermore, a connection between linear CCA and probabilistic graphical models has been shown (Bach and Jordan 2005).

Dempster-Shafer theory (Dempster 1967; Shafer 1976) is a widely used framework for the integration of uncertain information. Similar to our PoE integration method, Dempster's rule of combination takes the pointwise product of belief functions and normalises subsequently. Due to apparently counterintuitive results obtained when dealing with conflicting information (Zadeh 1986), the research community proposed various measures to detect conflicting belief functions as well as alternative integration methods. These include disjunctive integration methods (Jiang et al. 2016; Denœux 2008; Deng 2015; Murphy 2000), similar to our MoE integration method.

A closely related line of research is that of multimodal autoencoders (Ngiam et al. 2011) and multimodal deep Boltzmann machines (DBMs) (Srivastava and Salakhutdinov 2014). Multimodal autoencoders use a shared representation for the inputs and reconstructions of different modalities. Since multimodal autoencoders learn only deterministic functions, the interpretability of the representations is limited. Multimodal DBMs, on the other hand, learn multimodal generative models with a joint representation between the modalities. However, DBMs have only been shown to work with binary latent variables and are notoriously hard to train.

More recently, variational autoencoders were applied to multimodal learning (Suzuki, Nakayama, and Matsuo 2016). Their objective function maximises the ELBO using an encoder with hard-wired sources and additional KL-divergence loss terms to train individual encoders. The difference to our methods is that we maximise an ELBO for which we require only M individual encoders. We may then integrate the beliefs of arbitrary subsets of information sources after training. In contrast, the method in (Suzuki, Nakayama, and Matsuo 2016) would require a separate encoder for each possible combination of sources. Similarly, (Vedantam et al. 2017) first trains a generative model with multiple observations, using a fully informed encoder. In a second training stage, they freeze the generative-model parameters and proceed by optimising the parameters of inference models that are informed by a single source. Since the topology of the latent space is fixed in the second stage, finding good weights for the inference models may be complicated. Concurrently to this work, (Wu and Goodman 2018) proposed a method for weakly-supervised learning from multimodal data, which is very similar to our hybrid method discussed in Sec. 3.3. Their method is based on the VAE, whereas we find it crucial to optimise the importance-sampling-based ELBO to prevent the generative models from generating averaged conditional samples (see Sec. 3.1).

## 5 Experiments

We visualise learned beliefs on a 2D toy problem, evaluate our methods for structured prediction, and demonstrate how our framework can increase the robustness of inference. Model and algorithm hyperparameters are summarised in the supplementary material of our technical report (Kurle, Günnemann, and Smagt 2018).

### 5.1 Learning beliefs from complementary information sources

We begin our experiments with a toy dataset with complementary sources. As the generative process, we consider a mixture of bivariate normal distributions with 8 mixture components. The means of the mixture components are located on the unit circle at equidistant angles, and the standard deviations are 0.1.
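The following short snippet (ours, for illustration) generates this dataset and splits each 2D sample into the two one-dimensional complementary sources described next.

```python
import numpy as np

def sample_toy_sources(n, rng=np.random.default_rng()):
    """8-component bivariate Gaussian mixture: means on the unit circle at
    equidistant angles, per-dimension standard deviation 0.1. Each source
    perceives only one coordinate of the data."""
    angles = 2.0 * np.pi * rng.integers(0, 8, size=n) / 8.0   # uniform component choice
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    x = means + 0.1 * rng.standard_normal((n, 2))
    return x[:, :1], x[:, 1:]                                 # sources x1 and x2
```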
To simulate complementary sources, we allow each source to perceive only one dimension of the data. As in all our experiments, we assume a zero-centred normal prior with unit variance and $z \in \mathbb{R}^2$. We optimise $\mathcal{L}^{(\mathrm{ind})}$ with two inference models $q_{\phi_1}(z \mid x_1)$, $q_{\phi_2}(z \mid x_2)$ and two separate likelihood functions $p_{\theta_1}(x_1 \mid z)$, $p_{\theta_2}(x_2 \mid z)$.

Fig. 2a (right) shows the beliefs of both information sources for 8 test data points. These test points are the means of the 8 mixture components of the observable data, rotated by 2°. The small rotation is only for visualisation purposes, since each source perceives only one axis and would therefore produce indistinguishable beliefs for data points with identical values on the perceived axis. We visualise the two beliefs corresponding to the same data point with identical colours. The height and width of the ellipses correspond to the standard deviations of the beliefs. Fig. 2a (left) shows random samples in the observation space, generated from 10 random latent samples $z \sim q_{\phi_m}(z \mid x_m)$ for each belief. The generated samples are colour-coded in correspondence to the figure on the right. The 8 circles in the background visualise the true data distribution with 1 and 2 standard deviations. The two types of markers distinguish the information sources $x_1$ and $x_2$ used for inference. As can be seen, the beliefs reflect the ambiguity that results from perceiving a single dimension $x_m$. (The true posterior of a single source has two modes for most data points; the uni-modal Gaussian proposal distribution learns to cover both modes.)

Next we integrate the two beliefs using Eq. (3). The resulting integrated belief and the data generated from random latent samples of this belief are shown in Fig. 2b (right) and Fig. 2b (left), respectively. We can see that the integration resolves the ambiguity. In the supplementary material of our accompanying technical report (Kurle, Günnemann, and Smagt 2018), we plot samples from the individual and integrated beliefs, before and after a sampling-importance-resampling procedure.

Figure 2: Approximate posterior distributions and samples from the predicted likelihood function with and without integration of beliefs. (a) Individual beliefs and their predictions. Left: 8 coloured circles are centred at the 8 test inputs from the mixture-of-Gaussians toy dataset; the radii indicate 1 and 2 standard deviations of the normal distributions; the two types of markers represent generated data from random samples of one of the information sources (data axis 0 or 1). Right: corresponding individual beliefs; ellipses show 1 standard deviation of the individual approximate posterior distributions. (b) Integrated belief and its predictions.

### 5.2 Learning and inference of shared representations for structured prediction

Models trained with $\mathcal{L}^{(\mathrm{ind})}$ or $\mathcal{L}^{(\mathrm{hybrid})}$ can be used to predict structured data of any modality, conditioned on any available information source. Equivalently, we may impute missing data if it is modelled explicitly as an information source:

$$p(x_m \mid x_{m'}) = \mathbb{E}_{z \sim q_{\phi_{m'}}(z \mid x_{m'})}\big[p_{\theta_m}(x_m \mid z)\big]. \qquad (6)$$
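A minimal Monte Carlo sketch of Eq. (6) (the encoder and decoder callables are illustrative assumptions): sample latents from the belief of the observed source m' and decode them with the likelihood model of the missing modality m.

```python
import numpy as np

def predict_missing(x_src, encode_src, decode_tgt, n_samples=100,
                    rng=np.random.default_rng()):
    """Estimate p(x_m | x_m') by decoding latent samples drawn from the
    belief q_{phi_m'}(z | x_m') with the target decoder p_{theta_m}."""
    mean, std = encode_src(x_src)                        # belief of the observed source
    z = mean + std * rng.standard_normal((n_samples, np.size(mean)))
    return decode_tgt(z)                                 # e.g. per-sample Bernoulli means
```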
**MNIST variants.** We created 3 variants of MNIST (LeCun et al. 1998), in which we simulate multiple information sources as follows:

- MNIST-TB: $x_1$ perceives the top half and $x_2$ the bottom half of the image.
- MNIST-QU: 4 information sources that each perceive one quarter of the image.
- MNIST-NO: 4 information sources with independent bit-flip noise ($p = 0.05$). We use these 4 sources to amortise inference; in the generative model, we use the standard, noise-free digits as observable variables.

First, we assess how well individual beliefs can be integrated after learning, and whether beliefs can be used individually when they are learned as integrated inference distributions. On all MNIST variants, we train 5 different models by optimising the objectives $\mathcal{L}^{(\mathrm{ind})}$, $\mathcal{L}^{(\mathrm{MoE})}$, $\mathcal{L}^{(\mathrm{PoE})}$, and $\mathcal{L}^{(\mathrm{hybrid})}$ with $K = 16$, as well as $\mathcal{L}^{(\mathrm{hybrid})}$ with $K = 1$. All other hyperparameters are identical. We then evaluate each model under the 3 objectives $\mathcal{L}^{(\mathrm{ind})}$, $\mathcal{L}^{(\mathrm{MoE})}$ and $\mathcal{L}^{(\mathrm{PoE})}$. For comparison, we also train a standard IWAE with hard-wired sources on MNIST and on MNIST-NO with a single noisy source. The ELBOs on the test set are estimated using $K = 16$ importance samples. The obtained estimates are summarised in Tab. 1.

The results confirm that learning the PoE inference model directly leads to inseparable individual beliefs. As expected, learning individual inference models and integrating them subsequently as a PoE comes with a trade-off for $\mathcal{L}^{(\mathrm{PoE})}$, mostly due to the low entropy of the integrated distribution. On the other hand, optimising the model with $\mathcal{L}^{(\mathrm{hybrid})}$ achieves good results for both individual and integrated beliefs. On MNIST-NO, integrating the beliefs of redundant sources yields an improvement of 2.74 nats over the standard IWAE with a single source.

Table 1: Negative evidence lower bounds on variants of randomly binarised MNIST (lower is better). Columns indicate the training objective; rows indicate the evaluation objective.

MNIST-TB
| Evaluated with | L(ind) | L(MoE) | L(PoE) | L(hybrid) | L(hybrid) (K=1) | IWAE |
| --- | --- | --- | --- | --- | --- | --- |
| L(ind) | 102.20 | 102.40 | 265.59 | 104.03 | 108.97 | - |
| L(MoE) | 101.51 | 101.82 | 264.48 | 103.37 | 108.30 | - |
| L(PoE) | 94.38 | 94.39 | 87.59 | 90.07 | 90.81 | 88.79 |

MNIST-QU
| Evaluated with | L(ind) | L(MoE) | L(PoE) | L(hybrid) | L(hybrid) (K=1) | IWAE |
| --- | --- | --- | --- | --- | --- | --- |
| L(ind) | 120.46 | 120.37 | 447.67 | 129.63 | 140.61 | - |
| L(MoE) | 119.10 | 119.98 | 446.02 | 128.16 | 139.19 | - |
| L(PoE) | 108.07 | 107.85 | 87.67 | 89.20 | 90.17 | 88.79 |

MNIST-NO
| Evaluated with | L(ind) | L(MoE) | L(PoE) | L(hybrid) | L(hybrid) (K=1) | IWAE |
| --- | --- | --- | --- | --- | --- | --- |
| L(ind) | 94.81 | 94.86 | 101.20 | 96.27 | 95.31 | - |
| L(MoE) | 93.98 | 94.03 | 100.36 | 95.58 | 94.55 | - |
| L(PoE) | 94.52 | 94.65 | 92.27 | 92.21 | 94.49 | 94.95 |

Next, we evaluate our method for conditional (structured) prediction using Eq. (6). Fig. 3a shows the means of the likelihood functions, with latent variables drawn from individual and integrated beliefs. To demonstrate conditional image generation from labels, we add a third encoder that perceives class labels. Fig. 3b shows the means of the likelihood functions, inferred from labels.

We also compare our method to the missing-data-imputation procedure described in (Rezende, Mohamed, and Wierstra 2014) for MNIST-TB and MNIST-QU. We run the Markov chain for all samples in the test set for 150 steps each and calculate the log-likelihood of the imputed data at every step. The results, averaged over the dataset, are compared to our multimodal data-generation method in Fig. 4. For large portions of missing data, as in MNIST-TB, the Markov chain often fails to converge to the marginal distribution. But even for MNIST-QU, with only a quarter of the image missing, our method outperforms the Markov-chain procedure by a large margin. Please consult the supplementary material for a visualisation of the stepwise generations during the inference procedure.

**Caltech-UCSD Birds 200.** Caltech-UCSD Birds 200 (Welinder et al. 2010) is a dataset with 6033 images of birds at 128×128 resolution, split into 3000 train and 3033 test images. As a second source, we use the segmentation masks provided by (Yang, Safar, and Yang 2014).
On this dataset we assess whether learning with multiple modalities can be advantageous in scenarios where we are interested in only one particular modality. We therefore evaluate the ELBO for a single source and a single target observation, i.e. encoding images and decoding segmentation masks. We compare models that learned with multiple modalities using $\mathcal{L}^{(\mathrm{ind})}$ and $\mathcal{L}^{(\mathrm{hybrid})}$ with models that learned from a single modality. Additionally, we evaluate the segmentation accuracy using Eq. (6). The accuracy is estimated with 100 samples drawn from the belief informed by image data. The results are summarised in Tab. 2. We distinguish between objectives that involve both modalities in the generative model and objectives where we learn only the generative model for the modality of interest (segmentation), denoted with an asterisk. Models that have to learn the generative models for both images and segmentations show worse ELBOs and accuracy when evaluated on one modality. In contrast, the accuracy is slightly increased when we learn the generative model of segmentations only, but use both sources for inference.

Figure 3: Predicted images, where latent variables are inferred from the variational distributions of different sources. (a) Row 1: original images; rows 2-4: belief informed by the top half of the image; rows 5-7: informed by the bottom half; rows 8-10: integrated belief. (b) Predictions from 10 random samples of the latent variables, inferred from one-hot class labels. Sources with partial information generate diverse samples; e.g. in Fig. 3a, the lower half of digit 3 randomly generates digits 5 and 3, and the upper half generates digits 3 and 9. In contrast, the integration resolves the ambiguities.

Figure 4: Missing data imputation with the Monte Carlo procedure described in (Rezende, Mohamed, and Wierstra 2014) and with our method. (a) MNIST-TB, where the bottom half is missing. (b) MNIST-QU, where the bottom-right quarter is missing. For the Markov-chain procedure, the initially missing data is drawn randomly from Ber(0.5) and imputed from the previous random generation in subsequent steps. MSNVI was trained with L(ind). For MNIST-QU, we used the PoE belief of the three observed quarters. The plots show the log-likelihood at every step of the Markov chain, marginalised over the dataset. Higher is better.

Table 2: Negative ELBOs and segmentation accuracy on Caltech-UCSD Birds 200. The IWAE was trained with a single source and target observation. Models trained with L(ind) and L(hybrid) use all sources and targets; L(ind)* and L(hybrid)* use all sources for inference but learn the generative model of a single modality.

| | L(ind) | L(ind)* | L(hybrid) | L(hybrid)* | IWAE |
| --- | --- | --- | --- | --- | --- |
| img-to-seg | 5326 | 3264 | 5924 | 3337 | 3228 |
| img-to-img | -26179 | -26663 | -29285 | -29668 | -30415 |
| accuracy | 0.808 | 0.870 | 0.810 | 0.872 | 0.855 |
We also refer the reader to the supplementary material of our technical report (Kurle, Günnemann, and Smagt 2018), where we visualise conditionally generated images, showing that learning with the importance-sampling estimate of the ELBO is crucial for generating diverse samples from partially informed sources.

### 5.3 Robustness via conflict detection and redundancy

In this experiment we demonstrate how a shared latent representation can increase robustness by exploiting sensor redundancy and the ability to detect conflicting data. We created a synthetic dataset of perspective images of a pendulum with different views of the same scene. The pendulum rotates about the z-axis and is centred at the origin. We simulate three cameras with 32×32-pixel resolution as information sources for inference and apply independent noise with standard deviation 0.1 to all sources. Each sensor is directed towards the origin (the centre of rotation) from a different viewpoint: sensor 0 is aligned with the z-axis, and sensors 1 and 2 are rotated by 45° about the x- and y-axis, respectively. The distance of all sensors from the origin is twice the radius of the pendulum rotation. For the generative model we use the x- and y-coordinates of the pendulum rather than reconstructing the images. The model was trained with $\mathcal{L}^{(\mathrm{ind})}$.

In Fig. 5, we plot the mean and standard deviation of the predicted x- and y-coordinates, where latent variables are inferred from a single source as well as from the PoE posteriors of different subsets. As expected, integrating the beliefs from redundant sensors reduces the predictive uncertainty. Additionally, we visualise the three images used as information sources above these plots. Next, we simulate an anomaly in the form of a defective sensor 0, which outputs random noise after 2 rotations of the pendulum. This has a detrimental effect on the integrated beliefs in which sensor 0 takes part. We also plot the conflict measure of Eq. (2). As can be seen, the conflict measure for sensor 0 increases significantly when sensor 0 fails. In this case, one should integrate only the two remaining sensors with low conflict conjunctively.

Figure 5: Predictions (x- and y-coordinates) of the pendulum position and the conflict measure. For the predictions, latent variables are inferred from the images of 3 sensors with different views (top row) as well as from their integrated beliefs (bottom middle and right). The figures show predictions (of the static model) for different angles of the pendulum, performing 3 rotations. After 2 rotations, failure of sensor 0 is simulated by outputting noise only. Lines show the mean and shaded areas show 1 and 2 standard deviations, estimated using 500 random samples of latent variables. Bottom left: the conflict measure of Eq. (2) for different angles of the pendulum.
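As a closing sketch, the conflict measure and the PoE from the earlier snippets can be combined to gate the conjunctive integration, dropping sources whose beliefs conflict with the others. This is our own illustrative heuristic, not a procedure specified in the paper.

```python
import numpy as np

def robust_integrate(means, stds, threshold=1.0):
    """Score each source by its average conflict c(m || m') against the other
    sources (using `conflict` from the Sec. 3.2 sketch), drop sources above
    the threshold, and integrate the rest with `poe_gaussians` (Sec. 3.3)."""
    M = len(means)
    scores = [np.mean([conflict(means[mp], stds[mp], means[m], stds[m])
                       for mp in range(M) if mp != m]) for m in range(M)]
    keep = [m for m in range(M) if scores[m] < threshold]
    keep = keep or [int(np.argmin(scores))]            # always keep at least one source
    return poe_gaussians([means[m] for m in keep], [stds[m] for m in keep])
```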
## 6 Summary and future research directions

We extended neural variational inference to scenarios where multiple information sources are available. We proposed an objective function to learn individual inference models jointly with a shared generative model. We defined an exemplary measure (of conflict) to compare the beliefs from distinct inference models and their respective information sources. Furthermore, we proposed a disjunctive and a conjunctive integration method to combine arbitrary subsets of beliefs. We compared the proposed objective functions experimentally, highlighting the advantages and drawbacks of each. Naive integration as a PoE ($\mathcal{L}^{(\mathrm{PoE})}$) leads to inseparable individual beliefs, while optimising the sources only individually ($\mathcal{L}^{(\mathrm{ind})}$) worsens the integration of the sources. A hybrid of the two objectives ($\mathcal{L}^{(\mathrm{hybrid})}$), on the other hand, achieves a good trade-off between both desiderata. Moreover, we showed how our method can be applied to structured output prediction and demonstrated the benefits of exploiting the comparability of beliefs to increase robustness.

This work offers several future research directions. As an initial step, we considered only static data and a simple latent-variable model; however, we have made no assumptions about the type of information source. Interesting research directions are extensions to sequence models, hierarchical models, and different forms of information sources such as external memory. Another important research direction is the combination of disjunctive and conjunctive integration methods, taking into account the conflict between sources.

## Acknowledgements

We would like to thank Botond Cseke for valuable suggestions and discussions.

## References

Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, III-1247–III-1255. JMLR.org.

Bach, F., and Jordan, M. 2005. A probabilistic interpretation of canonical correlation analysis.

Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.

Bousquet, N. 2008. Diagnostics of prior-data agreement in applied Bayesian analysis. Journal of Applied Statistics 35(9):1011–1029.

Burda, Y.; Grosse, R. B.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. CoRR abs/1509.00519.

Cremer, C.; Li, X.; and Duvenaud, D. K. 2018. Inference suboptimality in variational autoencoders. CoRR abs/1801.03558.

Dahl, F. A.; Gåsemyr, J.; and Natvig, B. 2007. A robust conflict measure of inconsistencies in Bayesian hierarchical models. Scandinavian Journal of Statistics 34(4):816–828.

Dempster, A. P. 1967. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics 38(2):325–339.

Deng, Y. 2015. Generalized evidence theory. Applied Intelligence 43(3):530–543.

Denœux, T. 2008. Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence. Artificial Intelligence 172(2):234–264.

Feng, F.; Li, R.; and Wang, X. 2015. Deep correspondence restricted Boltzmann machine for cross-modal retrieval. Neurocomputing 154:50–60.

Gershman, S., and Goodman, N. D. 2014. Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, CogSci 2014, Quebec City, Canada, July 23-26, 2014.

Hardoon, D. R.; Szedmak, S. R.; and Shawe-Taylor, J. R. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12):2639–2664.

Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800.

Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321–377.

Jiang, W.; Xie, C.; Zhuang, M.; Shou, Y.; and Tang, Y. 2016. Sensor data fusion with Z-numbers and its application in fault diagnosis. Sensors 16(9).

Khaleghi, B.; Khamis, A.; Karray, F.; and Razavi, S. 2013. Multisensor data fusion: A review of the state-of-the-art. Information Fusion 14(1):28–44.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. CoRR abs/1312.6114.

Kurle, R.; Günnemann, S.; and Smagt, P. v. d. 2018. Multi-source neural variational inference. arXiv e-prints abs/1811.04451.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 1791–1799.

Murphy, C. K. 2000. Combining belief functions when evidence conflicts. Decision Support Systems 29(1):1–9.
Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Getoor, L., and Scheffer, T., eds., ICML, 689–696. Omnipress.

Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G. R.; Levy, R.; and Vasconcelos, N. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, MM'10, 251–260. New York, NY, USA: ACM.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), 1278–1286.

Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton: Princeton University Press.

Shon, A. P.; Grochow, K.; Hertzmann, A.; and Rao, R. P. N. 2005. Learning shared latent structure for image synthesis and robotic imitation. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS'05, 1233–1240. Cambridge, MA, USA: MIT Press.

Srivastava, N., and Salakhutdinov, R. 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15:2949–2980.

Stein, B. E., and Meredith, M. A. 1993. The Merging of the Senses. Cambridge, MA, USA: The MIT Press.

Suzuki, M.; Nakayama, K.; and Matsuo, Y. 2016. Joint multimodal learning with deep generative models.

Vedantam, R.; Fischer, I.; Huang, J.; and Murphy, K. 2017. Generative models of visually grounded imagination. CoRR abs/1705.10762.

Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.

Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. CoRR abs/1802.05335.

Yang, J.; Safar, S.; and Yang, M.-H. 2014. Max-margin Boltzmann machines for object segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 320–327.

Zadeh, L. A. 1986. A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination. AI Magazine 7(2):85–90.