# Spread Divergence

Mingtian Zhang¹, Peter Hayes¹, Tom Bird¹, Raza Habib¹, David Barber¹

¹ Department of Computer Science, University College London, UK. Correspondence to: Mingtian Zhang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

For distributions P and Q with different supports or undefined densities, the divergence D(P||Q) may not exist. We define a Spread Divergence D̃(P||Q) on modified P and Q and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).

1. Introduction

For distributions P and Q, a divergence D(P||Q) is a measure of their difference (Dragomir, 2005) provided it satisfies the properties¹

D(P||Q) ≥ 0 and D(P||Q) = 0 ⇔ P = Q. (1)

For absolutely continuous distributions² P, Q with probability density functions p(x), q(x), the f-divergence is

Df(P||Q) ≡ Df(p||q) = E_{q(x)}[f(p(x)/q(x))], (2)

where f is a convex function with f(1) = 0. When p and q have the same support³,

Df(p||q) = 0 ⇔ p = q a.e. ⇔ P = Q, (3)

see (Csiszár, 1967; Ali & Silvey, 1966).

¹ Two distributions being equal, P = Q, can be interpreted to mean that the cdfs (cumulative distribution functions) of the two distributions match.
² A distribution is absolutely continuous if its cdf is absolutely continuous. In this case, it has a density function. See also Tao (2011, p. 172).
³ We use a.e. to represent "almost everywhere"; see Durrett (2019, p. 18) for a definition.

A well-known f-divergence is the Kullback-Leibler (KL) divergence

KL(p||q) = E_{p(x)}[log(p(x)/q(x))]. (4)

The KL divergence plays a key role in fitting a model q(x) to data. For training data x_1, . . .
, x_N generated iid from a distribution p(x), a sample-based approximation of eq(4) is

KL(p||q) ≈ −(1/N) Σ_{n=1}^N log q(x_n) + const. = −(1/N) L(θ) + const., (5)

where L(θ) ≡ Σ_{n=1}^N log q(x_n) is the data log likelihood and θ are the parameters of the model q(x). Hence, fitting a model q to data using maximum likelihood can be viewed as minimizing the KL divergence between the model and the data generating process. Using an f-divergence Df(P||Q) for model training therefore requires (i) P and Q to have valid probability densities p, q and (ii) p and q to have common support. However, these requirements are not satisfied in some important machine learning applications, in particular in implicit generative models, as we discuss below.

1.1. Implicit Generative Models

Implicit generative models are of considerable recent interest in machine learning. These take the form of a latent variable model q(x) = ∫ q(x|z)q(z)dz, but with a deterministic output distribution q(x|z) = δ(x − g_θ(z)). We can define the generalised density⁴ for the model Q as

q(x) = ∫ δ(x − g_θ(z)) q(z)dz, (6)

where z represents the value of the latent variable Z, x the value of the observation variable X, and δ(x) is the Dirac delta function. In the common setting where the latent dimension is lower than the observation dimension, dim(Z) < dim(X),

⁴ We use "generalised density" to include implicit generative models. This includes also the special case p(x) = δ(x − µ), where p(x) represents a (Dirac) delta distribution with support {µ}. In general we cannot define an f-divergence based on generalised densities, but can do so if the distribution is absolutely continuous; see section(2.1.2) for details.

Figure 1. An implicit generative model with data dimension dim(X) = 3 and latent dimension dim(Z) = 2. The model can only generate data (dots) on a lower dimensional manifold in X space.
The likelihood of generating a data point off the manifold is zero, meaning that the likelihood will in general be a non-continuous function of the parameters defining the manifold.

the model q(x) only has limited support, see figure(1). Strictly speaking, this does not define a density over the real space R^dim(X), since X is not absolutely continuous. Nevertheless, this does define a distribution Q. To generate a sample from Q, one first samples z from q(z) and then passes this through the generator function g_θ(z). In this case, gradient based maximum likelihood learning is problematic since L(θ) is typically not a continuous function of θ⁵. Furthermore, the Expectation Maximisation algorithm (Dempster et al., 1977) is not available for models of the form in eq(6), since EM assumes that log q(x|z) is well defined, which is not the case when q(x|z) = δ(x − g_θ(z)) (Bermond & Cardoso, 1999). Equally, in eq(4) the ratio p(x)/q(x) may represent a division by zero; the KL divergence between the model and the data generating process is thus ill-defined. It is natural to consider transforming two distributions P and Q to have common support by using a mixture model P̃ = αP + (1 − α)N, Q̃ = αQ + (1 − α)N, where N is an absolutely continuous noise distribution with full support. However, as we explain in appendix(E), this approach is not useful in the context of training implicit generative models, since Q̃ does not have a density which can be numerically evaluated.

1.1.1. MODEL NOISE IS NOT ENOUGH

A common approach to enable maximum likelihood to be used to train implicit generative models is to simply add noise to the model so that it has full support (and a valid density), see for example (Wu et al., 2016). However, this approach does not guarantee a consistent estimator. To see this, consider the simple implicit generative model Q with

⁵ For a point x_n that is not on the model manifold, q(x_n) = 0.
As we adjust θ such that x_n moves onto the manifold, q(x_n) will typically increase to a finite non-zero value, meaning that the (log) likelihood is discontinuous in θ.

generalised density

q(x) = ∫ δ(x − zθ_q) N(z|0, 1) dz, (7)

where the latent Z is univariate and dim(X) > 1. Here the vector θ_q defines a one-dimensional line in X-space. For D-dimensional X, adding independent Gaussian noise with mean zero and isotropic covariance σ²I_D to X results in the noised distribution with density

q̃(x) = N(x|0_D, Σ), Σ ≡ θ_q θ_q^T + σ²I_D. (8)

For observed training data x_1, . . . , x_N the log likelihood under this model is

(1/N) L(θ_q) = −(1/2N) Σ_{n=1}^N x_n^T Σ⁻¹ x_n − (1/2) log det Σ + const. (9)

We assume that the training data x_n are iid sampled from the distribution P with generalised density

p(x) = ∫ δ(x − zθ_p) N(z|0, 1) dz. (10)

Hence P and Q are from the same parametric family but with different parameters. By the law of large numbers, in the large N limit, the log likelihood eq(9) tends to

−(1/2) θ_p^T Σ⁻¹ θ_p − (1/2) log det Σ + const, (11)

which has an optimum when⁶ (see appendix(A))

θ_q = √(1 − σ²/θ_p²) θ_p. (12)

Thus adding noise to the model Q and training using maximum likelihood does not form a consistent estimator; it has an optimum at θ_q ≠ θ_p, resulting in an incorrect estimate of the data generating process. In appendix(A) we explain why annealing the noise σ² towards zero during numerical optimisation will not directly heal this problem. Another well-known failure case of trying to learn an implicit generative model by adding noise only to the model is deterministic ICA, which we discuss at length in section(4.1). For this reason, alternative (non-likelihood, non-KL) approaches to measure the difference between distributions are commonly used in training implicit generative models (see for example Mohamed & Lakshminarayanan (2016)), such as Maximum Mean Discrepancy (Gretton et al., 2012) and the Wasserstein distance (Arjovsky et al., 2017; Peyré et al., 2019).
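The inconsistency of the "noise the model only" approach can be checked numerically. The sketch below (with illustrative parameter values, not taken from the paper) evaluates the population objective of eq(11) along scalings of the true direction and compares the maximiser with the closed-form optimum of eq(12):

```python
import numpy as np

theta_p = np.array([2.0, 1.0, 0.5])   # true parameter (illustrative values)
sigma2 = 0.25                          # variance of the noise added to the model
D = len(theta_p)

def pop_loglik(theta_q):
    # population expected log likelihood of the noised model, eq(11), up to constants
    Sigma = np.outer(theta_q, theta_q) + sigma2 * np.eye(D)
    return (-0.5 * theta_p @ np.linalg.inv(Sigma) @ theta_p
            - 0.5 * np.linalg.slogdet(Sigma)[1])

# scan scalings s of the true direction; the maximiser is not s = 1
scales = np.linspace(0.5, 1.5, 2001)
best = scales[np.argmax([pop_loglik(s * theta_p) for s in scales])]

# closed-form optimum, eq(12): shrunk towards zero by the model noise
predicted = np.sqrt(1.0 - sigma2 / (theta_p @ theta_p))
# best ≈ predicted ≈ 0.976 < 1, i.e. a biased estimate of theta_p
```

The scan and the closed form agree: maximum likelihood with a noised model systematically shrinks the recovered parameter away from θ_p.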
In the next section we introduce the Spread Divergence, which defines a valid divergence even when the supports of the distributions do not match or the distributions do not have a valid density. As we will demonstrate, the Spread Divergence allows one to use maximum likelihood based approaches to train implicit generative models, whilst resulting in a consistent estimator.

⁶ θ_p² is shorthand for the squared length θ_p^T θ_p = ||θ_p||²₂.

2. Spread Divergence

For distributions Q and P with generalised densities q(x) and p(x), we first need to define q̃(y) and p̃(y) that (i) are valid probability density functions and (ii) have the same support. In contrast to simply noising Q, we define noisy densities for both distributions:

p̃(y) = ∫ p(y|x)p(x)dx,  q̃(y) = ∫ p(y|x)q(x)dx. (13)

The noise p(y|x) must spread P and Q such that p̃(y) and q̃(y) satisfy the above two requirements. The choice of p(y|x) must also ensure that D(p̃||q̃) = 0 ⇔ P = Q. If we can define the noise appropriately, this allows us to define the Spread Divergence

D̃(P||Q) ≡ D(p̃||q̃), (14)

which satisfies the divergence requirements D̃(P||Q) ≥ 0 and D̃(P||Q) = 0 ⇔ P = Q. In the following we discuss appropriate choices for the noise distribution p(y|x). We focus on stationary spread noise p(y|x) = k(y − x), since this is simple to implement by adding independent noise to a variable. Non-stationary spread distributions can be easily constructed using a mixture of stationary noise distributions, or through Mercer kernels; these are left for future study. The case of discrete X is discussed in appendix(B).

2.1. Stationary Spread Divergence

For a random variable X we define a new stationary spread random variable Y by adding to X random noise K. In order to ensure that Y has a valid probability density function and the required support, we assume that K is absolutely continuous with density k(x) > 0 for all x ∈ R. We then define the random variable Y = X + K.
In the context of eq(13) this corresponds to using noise of the form p(y|x) = k(y − x).

2.1.1. ABSOLUTELY CONTINUOUS DISTRIBUTIONS

Consider two absolutely continuous distributions P and Q with densities p(x) and q(x). We define p̃ and q̃ as convolutions:

p̃(y) = ∫ k(y − x)p(x)dx = (k ∗ p)(y), (15)
q̃(y) = ∫ k(y − x)q(x)dx = (k ∗ q)(y). (16)

Since k > 0, p̃ and q̃ have common support R. Since all densities p(x) are absolutely integrable, the Fourier transforms F{p} and F{q} exist. Assuming F{k} also exists, we can use the convolution theorem to write

F{p̃} = F{k} F{p},  F{q̃} = F{k} F{q}. (17)

Let⁷ F{k} ≠ 0. Then

F{k} F{p} = F{k} F{q} ⟹ F{p} = F{q}. (18)

Using this we can show that the stationary Spread Divergence is valid. A sketch of the proof is as follows:

D̃(p||q) = 0 ⇔ D(p̃||q̃) = 0 (19)
⇔ p̃ = q̃ a.e. (20)
⇔ F{k} F{p} = F{k} F{q} (21)
⇔ F{p} = F{q} (22)
⇔ p = q a.e. ⇔ P = Q, (23)

where we used the invertibility of the Fourier transform. Hence we can define a valid Spread Divergence provided (i) k(x) is a positive probability density function with support R and (ii) F{k} ≠ 0. As an example, consider Gaussian additive spread noise k(x) = N(x|0, σ²). This has Fourier transform

F{k}(ω) = e^{−σ²ω²/2} > 0. (24)

Similarly, for Laplace noise k(x) = (1/2b) e^{−|x|/b},

F{k}(ω) = 1/(1 + b²ω²) > 0. (25)

In both cases k > 0 and F{k} > 0. Additive Gaussian and Laplace noise can therefore be used to define a valid Spread Divergence. That is, if the divergence between the spreaded distributions is zero, then the original distributions are equal.

2.1.2. GENERAL STATIONARY CASE

The previous additive noise setting assumed that X is absolutely continuous. In appendix(C) we show how to extend this to all distributions, including implicit generative models. We show that adding absolutely continuous noise K to X defines an absolutely continuous random variable X + K, even if X is itself not absolutely continuous (which is the case for implicit generative models).
Applying the same additive noise process to both P and Q then results in absolutely continuous distributions that have densities p̃ and q̃ and common support. We further show that, provided the characteristic function of the additive random noise variable K is non-zero⁸, the noise K can be used to define a valid Spread Divergence between any two distributions, with the property that D̃f(P||Q) = 0 ⇔ P = Q. This non-zero requirement on the characteristic function is analogous to the characteristic condition on kernels that ensures the Maximum Mean Discrepancy satisfies MMD(P, Q) = 0 ⇔ P = Q, see (Sriperumbudur et al., 2011; 2012; Gretton et al., 2012).

⁷ In fact the more general proof in appendix(C) shows that the weaker condition that F{k} is non-zero outside a countable set suffices.

Figure 2. (a) Delta distributions p(x) = δ(x − µ_p), q(x) = δ(x − µ_q), where µ_p = 0, µ_q = 1. (b) Spreaded distributions p̃(y) = ∫ p(y|x)p(x)dx, q̃(y) = ∫ p(y|x)q(x)dx, where p(y|x) = N(y|x, σ² = 0.5). (c) The spread KL divergence as a function of µ_q.

As an illustration, consider the extreme case of two delta distributions P and Q with generalised densities

p(x) = δ(x − µ_p),  q(x) = δ(x − µ_q). (26)

In this case KL(p||q) is not well defined. For stationary Gaussian noise p(y|x) = N(y|x, σ²), the spreaded distributions are

p̃(y) = ∫ δ(x − µ_p) N(y|x, σ²) dx = N(y|µ_p, σ²),
q̃(y) = ∫ δ(x − µ_q) N(y|x, σ²) dx = N(y|µ_q, σ²).

For noise variance σ² = 0.5 this gives

KL(p̃||q̃) = ||µ_p − µ_q||²₂. (27)

Hence KL(p̃||q̃) = 0 ⇔ P = Q. It is also worth noting that the spread KL divergence in this case is equal to the squared 2-Wasserstein distance (Peyré et al., 2019; Gelbrich, 1990). Treating µ_q as a variable, figure(2) illustrates the spread KL divergence converging to 0 as µ_q tends to µ_p = 0.
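The delta-distribution example can be checked numerically. The sketch below uses the values µ_p = 0, µ_q = 1, σ² = 0.5 from figure(2); the closed form is KL(p̃||q̃) = (µ_p − µ_q)²/(2σ²), which equals the squared distance for this choice of σ², and the Monte Carlo estimate is only approximate:

```python
import numpy as np

def spread_kl_deltas(mu_p, mu_q, sigma2):
    # KL between the Gaussian-spreaded deltas N(mu_p, sigma2) and N(mu_q, sigma2)
    return (mu_p - mu_q) ** 2 / (2.0 * sigma2)

mu_p, mu_q, sigma2 = 0.0, 1.0, 0.5
exact = spread_kl_deltas(mu_p, mu_q, sigma2)   # = 1.0 = ||mu_p - mu_q||^2

# Monte Carlo check: E_{y ~ p~}[log p~(y) - log q~(y)]
rng = np.random.default_rng(0)
y = mu_p + np.sqrt(sigma2) * rng.standard_normal(200_000)
mc = ((-(y - mu_p) ** 2 + (y - mu_q) ** 2) / (2.0 * sigma2)).mean()
# mc ≈ 1.0, and spread_kl_deltas(mu, mu, sigma2) == 0 for any mu
```

The estimate agrees with the closed form, and the divergence vanishes exactly when the two deltas coincide.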
This treatment of generalised densities allows us to define a divergence for implicit generative models and, by extension, an associated training algorithm, as we describe below.

2.2. Spread Maximum Likelihood Estimation

In section(1.1.1) we noted that, in the context of fitting an implicit generative model Q to training data, simply adding noise to the model distribution Q and using maximum likelihood does not result in a consistent estimator. In the Spread Divergence case, for data x_1, . . . , x_N we define the empirical (generalised) density as

p̂(x) = (1/N) Σ_{n=1}^N δ(x − x_n). (28)

⁸ See appendix(C) for a weaker condition.

For spread noise p(y|x) we then spread both the model Q and the empirical density using

q̃(y) = ∫ p(y|x)q(x)dx,  p̃(y) = ∫ p(y|x)p̂(x)dx. (29)

We then define the spread log likelihood

L̃ ≡ ∫ p̃(y) log q̃(y)dy. (30)

We consider that the data x_1, . . . , x_N is generated from the same underlying parametric implicit distribution as the model Q,

m(x; θ_q) = ∫ δ(x − g_{θ_q}(z)) q(z)dz, (31)

but with different parameters θ_p. Then as N → ∞ (using the law of large numbers) the spread log likelihood becomes

lim_{N→∞} L̃ = ∫ m̃(y; θ_p) log m̃(y; θ_q)dy, (32)

where

m̃(y; θ) = ∫ p(y|x) m(x; θ)dx. (33)

Up to an additive constant this is −KL(m̃(y; θ_p)||m̃(y; θ_q)). The spread log likelihood eq(32) therefore has a maximum when the spread KL divergence has a minimum. This occurs when the two spread distributions match, m̃(y; θ_p) = m̃(y; θ_q). Furthermore, by the property of the Spread Divergence, this means that the spread log likelihood has a maximum when the distributions themselves match, m(x; θ_p) = m(x; θ_q), which occurs when θ_q = θ_p. Hence, the spread log likelihood defines a consistent estimator. In practice we typically cannot carry out the integral in eq(30) exactly, and resort to a sample approximation, sampling L noisy versions y_{n,l}, l = 1, . . . , L of each datapoint x_n to give

L̃ ≈ (1/(NL)) Σ_{n=1}^N Σ_{l=1}^L log q̃(y_{n,l}).
(34)

The Maximum Likelihood Estimator (MLE) is a cherished approach due to its consistency (convergence to the true parameters in the large data limit) and asymptotic efficiency (it achieves the Cramér-Rao lower bound on the variance of any unbiased estimator); see Casella & Berger (2002) for an introduction. An interesting question for future study is whether these properties also carry over to the spread MLE. In appendix(F), we demonstrate that the spread MLE (for a certain family of spread noise) has weaker sufficient conditions than the MLE for both consistency and asymptotic efficiency. Furthermore, a sufficient condition for the existence of the MLE is that the likelihood function is continuous over a compact parameter space Θ. We provide an example in appendix(F.1) where the maximum likelihood estimator may not exist (since it violates the compactness requirement), but the spread maximum likelihood estimator still exists.

3. Maximising Discriminatory Power

Intuitively, spreading out distributions makes them more similar. More formally, by the data processing inequality (see appendix(D)), using spread noise always decreases the f-divergence: Df(p̃||q̃) ≤ Df(p||q) (when Df(p||q) is well defined). If we use the spread MLE to train a model, too much noise may make the spreaded empirical and spreaded model distributions so similar that it becomes difficult to numerically distinguish them. It is therefore useful to learn spread noise p_ψ(y|x) (parameterised by ψ) that maximally discriminates between the distributions, max_ψ D(p̃||q̃). In general we need to constrain the spread noise to ensure that the divergence remains bounded. The learned noise will discourage overlap between the two spreaded distributions. We discuss below two complementary approaches to adjust p_ψ(y|x). The first adjusts the covariance for Gaussian p(y|x) and the second uses a mean transformation.
In principle, both approaches can be combined and easily generalised to other noise distributions, such as the Laplace distribution. In section(4.2), we empirically investigate the benefit of these approaches when scaling the application of the Spread Divergence to complex high dimensional problems.

3.1. Learning the Gaussian Noise Covariance

In learning Gaussian stationary spread noise p(y|x) = N(y|x, Σ), the number of parameters in the covariance matrix Σ scales quadratically with the data dimension D. We therefore define Σ = σ²I + UU^T, where σ² > 0 is fixed (to ensure a bounded Spread Divergence) and U is a learnable D × R matrix with R ≪ D.

Figure 3. Left: The lower dotted line denotes Gaussian distributed data p(x) with support only along the linear subspace defined by the origin a and direction A. The upper dotted line denotes Gaussian distributed data q(x) with support different from p(x). Optimally, to maximise the Spread Divergence between the two distributions, for fixed noise entropy, we should add noise that preferentially spreads out along the directions defined by p and q, as denoted by the ellipses.

As a simple example that can be computed exactly, we consider implicit generative models with generalised densities p and q that generate data in separated linear subspaces,

p(x) = ∫ δ(x − a − Az) p(z)dz, (35)
q(x) = ∫ δ(x − b − Bz) p(z)dz, (36)

with p(z) = N(z|0, I_Z). The spreaded densities are then

p̃(y) = N(y|a, AA^T + Σ),  q̃(y) = N(y|b, BB^T + Σ).

As Σ tends to zero, KL(p̃||q̃) tends to infinity. We therefore constrain Σ = σ²I + uu^T, where σ² is fixed and u^T u = 1. Also, for calculational simplicity, we assume A = B. The Spread Divergence KL(p̃||q̃) is then maximised for the noise direction u pointing orthogonal to the vector (AA^T + σ²I)⁻¹(b − a). The noise thus concentrates along directions defined by p and q, see figure(3).

3.2. Learning a Mean Transformation

Consider spread noise p(y|x) = k(y − f(x)) for injective⁹ f and stationary k.
Then we define

p̃(y) = ∫ k(y − f(x)) p_x(x)dx. (37)

Using a change of variables,

p̃(y) = ∫ k(y − z) p_z(z)dz, (38)
p_z(z) = p_x(f⁻¹(z))/J(x = f⁻¹(z)), (39)

where J is the absolute Jacobian determinant of f. This is a valid Spread Divergence, since eq(38) is in the form of standard stationary spread noise, but on a transformed variable z. Each injective f gives a different noise p(y|x), and hence we can search for the best noise implicitly by learning f.

⁹ Since the co-domain of f is determined by its range, injective implies invertible in this case.

4. Applications

As an application of the spread MLE, we use it to train implicit generative models with generalised density

p_θ(x) = ∫ δ(x − g_θ(z)) p(z)dz, (40)

where θ are the parameters of the generator g. We show that, despite the likelihood gradient not being available, we can nevertheless successfully train such models using slightly modified versions of standard likelihood based training approaches, such as variational algorithms (Barber, 2012). In particular we discuss learning a low dimensional linear ICA model and a high dimensional deep generative model.

4.1. Implicit Linear Models: Deterministic ICA

ICA (Independent Components Analysis) corresponds to the model p(x, z) = p(x|z) ∏_i p(z_i), where the independent components z_i follow a non-Gaussian distribution. For Gaussian noise ICA, an observation x is assumed to be generated by the process p(x|z) = ∏_j N(x_j | g_j(z), γ²), where g(z) mixes the independent latent process z. In linear ICA, g_j(z) = a_j^T z, where a_j is the jth column of the mixing matrix A. For zero observation noise γ² = 0 this corresponds to a linear implicit generative model. For training data x_1, . . . , x_N a standard maximum likelihood approach to learning A is to maximise

E_{p̂(x)}[log p(x)], (41)

where p(x) is the marginal of the joint p(x, z) and we define the empirical distribution

p̂(x) = (1/N) Σ_{n=1}^N δ(x − x_n).
(42)

Since this is a latent variable model, it is natural to apply the EM algorithm (Dempster et al., 1977) to learn A. However, for zero, or even small, observation noise γ², it is well known that EM is ineffective (Bermond & Cardoso, 1999; Winther & Petersen, 2007). To see this, consider dim(X) = dim(Z) (where dim(X) and dim(Z) are the dimensions of the data and latents respectively) and invertible A. At iteration k the EM algorithm has an estimate A_k of the mixing matrix. It is straightforward to show that the M-step updates A_k to

A_{k+1} = E[xz^T] E[zz^T]⁻¹, (43)

where, for zero observation noise (γ = 0),

E[xz^T] = (1/N) Σ_n x_n (A_k⁻¹ x_n)^T = Ŝ A_k^{−T}, (44)
E[zz^T] = A_k⁻¹ Ŝ A_k^{−T}, (45)

and Ŝ ≡ (1/N) Σ_n x_n x_n^T is the moment matrix of the data. Thus,

A_{k+1} = Ŝ A_k^{−T} (A_k⁻¹ Ŝ A_k^{−T})⁻¹ = A_k,

and the algorithm "freezes". Similarly, for low noise γ ≪ 1, progress critically slows down. Thus trying to train a deterministic ICA model by simply adding noise to the model will not help directly (see also section(1.1.1)). To deal with small noise and the limiting case of a deterministic model (γ = 0), we consider Gaussian spread noise p(y|x) = N(y|x, σ²I_X) to give

p̃(y, z) = ∫ p(y|x) p(x, z)dx (46)
= ∏_j N(y_j | g_j(z), γ² + σ²) ∏_i p(z_i). (47)

Using spread noise, the empirical distribution is replaced by the spreaded empirical distribution

p̃(y) = (1/N) Σ_n N(y|x_n, σ²I_X). (48)

We then learn the model parameters by maximising the spread log likelihood (see section(2.2))

E_{p̃(y)}[log p̃(y)]. (49)

Since this is of the form of a latent variable model, we can use an EM algorithm to maximise eq(49). The M-step to maximise the spread log likelihood has the same form as eq(43) but with modified statistics

E[yz^T] = (1/N) Σ_n ∫ p̃(y, z|n) y z^T dz dy, (50)
E[zz^T] = (1/N) Σ_n ∫ p̃(y, z|n) z z^T dz dy, (51)

where

p̃(y, z|n) ≡ N(y|x_n, σ²I_X) p̃(z|y), (52)
p̃(z|y) = (1/Z_q(y)) N(z|µ(y), Σ) ∏_i p(z_i), (53)
Z_q(y) = ∫ N(z|µ(y), Σ) ∏_i p(z_i) dz. (54)

Here Z_q(y) is a normaliser and

Σ = (γ² + σ²)(A^T A)⁻¹,  µ(y) = (A^T A)⁻¹ A^T y. (55)

(a) Error versus observation noise γ.
(b) Error versus number of training points.

Figure 4. Relative error |A_ij^est − A_ij^true|/|A_ij^true| versus observation noise (a) and number of training points (b). (a) For dim(X) = 20 dimensional observations and dim(Z) = 10 dimensional latent variables, we generate N = 20000 data points from the model x = Az + γN(0, I_X), for independent zero mean unit variance Laplace z. The elements of A are uniform random ±1. We use S_y = 1, S_z = 1000 samples and 2000 EM iterations to estimate A. The error is averaged over all i, j and 10 experiments. We also plot standard errors around the mean relative error. In blue we show the error in learning A using the standard EM algorithm. As γ → 0, the error blows up as the EM algorithm "freezes". In orange we plot the error for spread noise EM; no slowing down occurs as the observation noise γ decreases. In (b), apart from small N, the spread EM algorithm error is lower than for the standard EM algorithm. Here dim(Z) = 5, dim(X) = 10, S_y = 1, S_z = 1000, γ = 0.2, with 500 EM updates used. Results are averaged over 50 runs of randomly drawn A.

Since the posterior p̃(z|y) peaks around N(z|µ(y), Σ), we rewrite eq(50) as

(1/N) Σ_n ∫ N(y|x_n, σ²I_X) N(z|µ(y), Σ) (∏_i p(z_i)/Z_q(y)) y z^T dz dy, (56)

and similarly for E[zz^T]. Writing the expectations with respect to N(z|µ(y), Σ) allows for a simple but effective importance sampling approximation focused on regions of high probability. We implement this update by drawing S_y samples from N(y|x_n, σ²I_X) and, for each y sample, we draw S_z samples from N(z|µ(y), Σ). This scheme has the advantage over more standard variational approaches, see for example (Winther & Petersen, 2007), that we obtain a consistent estimator of the M-step update for A. We show results for a toy experiment in figure(4), learning the mixing matrix A in a deterministic non-square setting. Note that standard algorithms such as FastICA (Hyvärinen, 1999) fail in this case. The spread noise is set to σ = max(0.001, 2.5 √(mean(AA^T))).
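The freezing of the standard zero-noise M-step, eqs (43)-(45), can be verified directly. This is a minimal sketch; the dimensions, seed and sample size are arbitrary choices, not the experimental settings above:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 4, 5000
A_true = rng.uniform(-1.0, 1.0, (D, D))
X = A_true @ rng.laplace(size=(D, N))   # zero-noise square linear ICA data

S_hat = X @ X.T / N                     # moment matrix S^ of the data
A_k = rng.uniform(-1.0, 1.0, (D, D))    # arbitrary invertible estimate

# zero-noise M-step statistics, eqs (44)-(45)
A_k_inv = np.linalg.inv(A_k)
Exz = S_hat @ A_k_inv.T                 # E[x z^T] = S^ A_k^{-T}
Ezz = A_k_inv @ S_hat @ A_k_inv.T       # E[z z^T] = A_k^{-1} S^ A_k^{-T}

A_next = Exz @ np.linalg.inv(Ezz)       # eq(43)
# A_next == A_k (up to numerical precision): the update freezes
```

Whatever invertible A_k the algorithm starts from, the zero-noise update returns it unchanged, which is exactly the pathology the spread noise removes.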
This modified EM algorithm thus learns a good approximation of A, with no critical slowing down.

4.2. Non-linear Implicit Models: δ-VAE

A deep implicit generative model has generalised density

p_θ(x) = ∫ δ(x − g_θ(z)) p(z)dz, (57)

where g_θ is a deep neural network, see for example (Goodfellow et al., 2014). As discussed, for dim(Z) < dim(X) we cannot use standard maximum likelihood approaches to train this model. To address this, we use the spread MLE approach of section(2.2). For training data x_1, . . . , x_N the empirical distribution is

p̂(x) = (1/N) Σ_{n=1}^N δ(x − x_n). (58)

For Gaussian spread noise p(y|x) = N(y|x, σ²I_X), the spreaded empirical distribution is

p̃(y) = (1/N) Σ_{n=1}^N N(y|x_n, σ²I_X), (59)

and the spreaded model is

p̃_θ(y) = ∫ N(y|g_θ(z), σ²I_X) p(z)dz (60)
= ∫ p̃_θ(y|z) p(z)dz. (61)

We then maximise the spread log likelihood

∫ p̃(y) log p̃_θ(y)dy. (62)

Typically, the integral over y is intractable, in which case we resort to a sampling estimate

(1/(NS)) Σ_{n=1}^N Σ_{s=1}^S log p̃_θ(y_n^s), (63)

Figure 5. Samples from a deep implicit generative model trained using the δ-VAE with (a) Laplace spread noise with fixed covariance, (b) Gaussian spread noise with fixed covariance and (c) Gaussian spread noise with learned covariance. We first train a standard VAE for one epoch as initialization for all models, and keep the latent code z ~ N(z|0, I_Z) fixed when sampling from these models thereafter, so we can more easily compare the sample quality. See also figure(8) for further samples.

where y_n^s is a sample from p(y_n|x_n) = N(y_n|x_n, σ²I_X). For non-linear g, the distribution p̃_θ(y) is usually intractable and we therefore use the variational lower bound

log p̃_θ(y) ≥ ∫ q_φ(z|y)(−log q_φ(z|y) + log(p̃_θ(y|z) p(z)))dz.
(64)

This approach is a straightforward extension of the standard variational autoencoder (VAE) method (Kingma & Welling, 2013), and in appendix(G) we derive a lower variance objective and detail the learning procedure. We dub this model and the associated Spread Divergence training the δ-VAE. For learning the spread noise we use the approach outlined in section(3). To learn the mean transformation function, we use an invertible residual network (Behrmann et al., 2018) f_ψ: R^D → R^D, where f_ψ = (f_ψ^1 ∘ · · · ∘ f_ψ^T) denotes a ResNet with blocks f_ψ^t = I(·) + g_ψ^t(·). Then f_ψ is invertible if the Lipschitz constants satisfy Lip(g_ψ^t) < 1 for all t ∈ {1, . . . , T}. Note that when using the Spread Divergence for training we only need samples from p̃(y), which can be obtained from eq(37) by first sampling x from p_x(x) and then y from p(y|x) = k(y − f(x)); this does not require computing the Jacobian or the inverse f_ψ⁻¹.

MNIST Experiment: We trained a δ-VAE on MNIST (LeCun et al., 2010) with (i) fixed Laplace spread noise, as in eq(25), (ii) fixed Gaussian spread noise, as in eq(24), and (iii) Gaussian noise with learned covariance, as in section(3.1), with rank R = 20; g_θ(·) is a neural network that contains 3 feed-forward layers. See appendix(H) for further details. Figures 5(a,b,c) show samples from p_θ(x) for these models. MNIST is a relatively easy problem in the sense that it is hard to distinguish between the quality of the fixed and

Figure 6. Samples from a deep implicit generative model trained using the δ-VAE with (a) fixed and (b) learned spread with the mean transformation method. See also figure(9) for more samples. We use a similar sampling strategy as in the MNIST experiment to facilitate sample comparison between the different models; see appendix(I).
We speculate that Laplace noise appears to improve image sharpness since this noise focuses attention on discriminating between points close to the data manifold (since the Laplace distribution is leptokurtic and has a higher probability of generating points close to the data manifold than the Gaussian distribution). Celeb A Experiment: We trained a δ-VAE on the Celeb A dataset (Liu et al., 2015) with (i) fixed and (ii) learned spread using the mean transformation method as discussed in section(3.2). We compare to results from a standard VAE with fixed Gaussian noise p(x|z) = N (x gθ(z), 0.5IX) (Tolstikhin et al., 2017), where gθ( ) is a neural network contains 4 convolution layers. For (i) the fixed Spread Divergence uses Gaussian noise N (y x, 0.25IX). For (ii) we use Gaussian noise with a learned injective function in the form of a Res Net: fψ( ) = I( ) + gψ( ) - see appendix(I) for details. Figure 6 shows samples from our δ-VAE for (i) and (ii) (with gθ(z) initialised to the fixed-noise setting). It is notable how the sharpness of the image samples substantially increases when learning the spread noise. Table 1 shows Frechet Inception Distance (FID) (Heusel et al., 2017) score comparisons between different baseline algorithms for implicit generative model training10. The δ-VAE significantly improves on the standard VAE result. Learning the mean transformation improves on the fixed-noise δ-VAE. Indeed the mean transformation δ-VAE results are comparable to popular GAN and WAE models (Gulrajani et al., 2017; Berthelot et al., 2017; Arjovsky et al., 2017; Kodali et al., 2017; Mao et al., 2017; Fedus et al., 2017; Tolstikhin et al., 2017). Whilst the δ-VAE results are not state-of-the-art, we believe 10FID scores were computed using github.com/ bioinf-jku/TTUR based on 10000 samples. 
| Encoder-Decoder Models | FID | GAN Models | FID |
|---|---|---|---|
| VAE | 63.0 | WGAN GP | 30.0 |
| δ-VAE with fixed spread | 52.7 | BEGAN | 38.9 |
| δ-VAE with learned spread | 46.5 | WGAN | 41.3 |
| WAE-MMD | 55.0 | DRAGAN | 42.3 |
| WAE-GAN | 42.0 | LSGAN | 53.9 |
| | | NS GAN | 55.0 |
| | | MM GAN | 65.6 |

Table 1. CelebA FID Scores. The δ-VAE results are the average over 5 independent measurements. The scores of the GAN models are based on a large-scale hyperparameter search and take the best FID obtained (Lucic et al., 2018). The results of the VAE model and both WAE-based models are from (Tolstikhin et al., 2017).

it is the first time that implicit models have been trained using a principled maximum likelihood based approach. By increasing the complexity of the generative model g_θ and the injective function f_ψ, or using better choices of noise, the results may become more competitive with state-of-the-art GAN models¹¹.

5. Related Work

Instance noise: The instance noise trick to stabilize GAN training (Roth et al., 2017; Sønderby et al., 2016) is a special case of the Spread Divergence using fixed Gaussian noise. Whilst other similar tricks, e.g. (Furmston & Barber, 2009), have been proposed previously, we believe it is important to state the more general utility of the spread noise approach.

δ-VAE versus WAE: The Wasserstein Auto-Encoder (Tolstikhin et al., 2017) is another implicit generative model that uses an encoder-decoder architecture. The major difference to our work is that the δ-VAE is based on the KL divergence, which corresponds to MLE, whereas the WAE uses the Wasserstein distance.

δ-VAE versus denoising VAE: The denoising VAE (Im et al., 2017) uses a VAE with noise added to the data only. In contrast, for our δ-VAE, spread MLE adds noise to both the data and the model. Therefore, the denoising VAE cannot recover the true data distribution, whereas in principle the δ-VAE with spread MLE can.

MMD GAN with kernel learning: Learning a kernel to increase discrimination is also used in MMD GAN (Li et al., 2017).
Similar to ours, the kernel in MMD GAN is constructed as k̃ = k ∘ fψ, where k is a fixed kernel and fψ is a neural network. To ensure that the MMD distance satisfies M_{k∘fψ}(p, q) = 0 ⇔ p = q, fψ is required to be injective (Gretton et al., 2012). However, in this framework, fψ(x) usually maps x to a lower dimensional space. This is crucial for MMD because the amount of data required to produce a reliable estimator grows with the data dimension (Ramdas et al., 2015) and the computational cost of MMD scales quadratically with the amount of data. Whilst using a lower-dimensional mapping makes MMD more practical, it also makes it difficult to construct an injective function fψ. For this reason, heuristics such as the auto-encoder regularizer (Li et al., 2017) are considered. In contrast, for the δ-VAE with spread MLE, the cost of estimating the divergence is linear in the number of data points. Therefore, there is no need for fψ to be a lower-dimensional mapping; guaranteeing that fψ is injective is straightforward for the δ-VAE.

Flow-based generative models: Invertible flow-based functions (Rezende & Mohamed, 2015) have been used to boost the representation power of generative models. Our use of injective functions is quite distinct from the use of flow-based functions to boost generative model capacity. In our case, the injective function fψ does not change the model - it only changes the divergence. For this reason, the Spread Divergence doesn't require the log determinant of the Jacobian, which is required in (Rezende & Mohamed, 2015; Behrmann et al., 2018), meaning that more general invertible functions can be used to boost the discriminatory power of a Spread Divergence.

11We also tried learning the image generator using Laplace spread noise. However, the colour of the sampled images becomes overly intense and we leave it to future work to address this.
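The composed-kernel construction k̃ = k ∘ fψ discussed above can be sketched as follows. This is a minimal illustration (not the MMD GAN implementation): an unbiased estimator of MMD² with an RBF kernel evaluated on features fψ(x), where fψ is here a fixed linear map standing in for the learned network.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_psi(x):
    # Stand-in feature map; in MMD GAN this is a learned neural network
    # that must be injective for the MMD to remain a valid divergence.
    A = np.array([[1.0, 0.5], [0.0, 1.0]])
    return x @ A.T

def rbf(x, y, bandwidth=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) over all pairs
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(x, y):
    """Unbiased MMD^2 estimator with the composed kernel k(f_psi(.), f_psi(.))."""
    fx, fy = f_psi(x), f_psi(y)
    kxx, kyy, kxy = rbf(fx, fx), rbf(fy, fy), rbf(fx, fy)
    n, m = len(x), len(y)
    # Drop diagonal terms for the unbiased within-sample averages.
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

p = rng.normal(size=(200, 2))
q_same = rng.normal(size=(200, 2))          # same distribution as p
q_shift = rng.normal(size=(200, 2)) + 3.0   # clearly different distribution

assert mmd2(p, q_shift) > mmd2(p, q_same)
```

Note the O(n²) pairwise kernel evaluations inside `mmd2`, which is the quadratic cost in the number of data points referred to above; the spread MLE objective avoids this by being a sum over individual data points.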
6. Conclusion

We described how to define a divergence even when two distributions don't have valid probability density functions or don't have the same support. We showed that defining divergences this way enables us to train implicit generative models using standard likelihood based approaches. In principle, we can learn the true data generating implicit generative model by using any valid Spread Divergence. In practice, however, the quality of the learned model can depend strongly on the choice of spread noise. We therefore investigated learning spread noise to maximally discriminate between two distributions. We found that the resulting training approach is numerically stable and that it can significantly improve the quality of the learned model. There are several directions for further investigation: (i) the broader family of spread noise and its properties, including statistical efficiency; (ii) optimal noise selection for different tasks; (iii) the connections between the Spread Divergence and other distance (or divergence) measures, e.g. MMD and the Wasserstein distance.

ACKNOWLEDGMENTS

We would like to thank a reviewer for improving our consideration of non-absolutely continuous distributions and Li Zhu for useful discussions.

References

Ali, S. and Silvey, S. A General Class of Coefficients of Divergence of one Distribution from Another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv:1701.07875, 2017.

Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Behrmann, J., Duvenaud, D., and Jacobsen, J. Invertible Residual Networks. arXiv:1811.00995, 2018.

Bermond, O. and Cardoso, J. Approximate Likelihood for Noisy Mixtures. In Proc. ICA '99, pp. 325–330, 1999.

Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv:1703.10717, 2017.

Casella, G. and Berger, R. Statistical Inference, volume 2.
Duxbury Pacific Grove, CA, 2002.

Cramér, H. Mathematical Methods of Statistics, volume 9. Princeton University Press, 1999.

Csiszár, I. Information-type Measures of Difference of Probability Distributions and Indirect Observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

Csiszár, I. A Class of Measures of Informativity of Observation Channels. Periodica Mathematica Hungarica, 2(1-4):191–213, 1972.

Dempster, A., Laird, N., and Rubin, D. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.

Dragomir, S. Some General Divergence Measures for Probability Distributions. Acta Mathematica Hungarica, 109(4):331–345, Nov 2005. ISSN 1588-2632. doi: 10.1007/s10474-005-0251-6.

Durrett, R. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.

Fedus, W., Rosca, M., Lakshminarayanan, B., Dai, A., Mohamed, S., and Goodfellow, I. Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence at Every Step. arXiv:1710.08446, 2017.

Ferguson, T. An Inconsistent Maximum Likelihood Estimate. Journal of the American Statistical Association, 77(380):831–834, 1982.

Furmston, T. and Barber, D. Solving Deterministic Policy (PO)MDPs using Expectation-Maximisation and Antifreeze. In First International Workshop on Learning and Data Mining for Robotics (LEMIR), pp. 56–70, 2009. In conjunction with ECML/PKDD-2009.

Gelbrich, M. On a Formula for the L2 Wasserstein Metric Between Measures on Euclidean and Hilbert Spaces. Mathematische Nachrichten, 147(1):185–203, 1990.

Gerchinovitz, S., Ménard, P., and Stoltz, G. Fano's Inequality for Random Variables. arXiv:1702.05985, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Hyvärinen, A. Fast and Robust Fixed-point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, 10(3):626–634, May 1999. ISSN 1045-9227. doi: 10.1109/72.761722.

Im, D., Ahn, S., Memisevic, R., and Bengio, Y. Denoising Criterion for Variational Auto-Encoding Framework. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.

Kallenberg, O. Foundations of Modern Probability. Springer Science & Business Media, 2006.

Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.

Kingma, D. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML], 2013.

Kodali, N., Abernethy, J., Hays, J., and Kira, Z. On Convergence and Stability of GANs. arXiv:1705.07215, 2017.

LeCun, Y., Cortes, C., and Burges, C. MNIST Handwritten Digit Database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.

Lehmann, E. Elements of Large-sample Theory. Springer Science & Business Media, 2004.

Li, C., Chang, W., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Advances in Neural Information Processing Systems, pp. 2203–2213, 2017.

Liu, Z., Luo, P., Wang, X., and Tang, X.
Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs Created Equal? A Large-Scale Study. In Advances in Neural Information Processing Systems, pp. 700–709, 2018.

Mao, X., Li, Q., Xie, H., Lau, R., Wang, Z., and Smolley, S. P. Least Squares Generative Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. arXiv:1802.05957, 2018.

Mohamed, S. and Lakshminarayanan, B. Learning in Implicit Generative Models. arXiv:1610.03483, 2016.

Peyré, G., Cuturi, M., et al. Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Ramdas, A., Reddi, S., Póczos, B., Singh, A., and Wasserman, L. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Rezende, D. and Mohamed, S. Variational Inference with Normalizing Flows. arXiv:1505.05770, 2015.

Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Stabilizing Training of Generative Adversarial Networks through Regularization. In Advances in Neural Information Processing Systems, pp. 2018–2028, 2017.

Sønderby, C., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP Inference for Image Super-Resolution. arXiv:1610.04490, 2016.

Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. Universality, Characteristic Kernels and RKHS Embedding of Measures. Journal of Machine Learning Research, 12:2389–2410, July 2011. ISSN 1532-4435.

Sriperumbudur, B., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

Tao, T. An Introduction to Measure Theory. 2011.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B.
Wasserstein Auto-Encoders. arXiv:1711.01558, 2017.

Wald, A. Note on the Consistency of the Maximum Likelihood Estimate. The Annals of Mathematical Statistics, 20(4):595–601, 1949.

Winther, O. and Petersen, K. Bayesian Independent Component Analysis: Variational Methods and Non-negative Decompositions. Digital Signal Processing, 17(5):858–872, 2007. ISSN 1051-2004. Special Issue on Bayesian Source Separation.

Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the Quantitative Analysis of Decoder-based Generative Models. arXiv:1611.04273, 2016.

Zhang, M., Bird, T., Habib, R., Xu, T., and Barber, D. Variational f-Divergence Minimization. arXiv:1907.11891, 2018.