Journal of Machine Learning Research 21 (2020) 1-28. Submitted 7/19; Revised 4/20; Published 8/20.

Cramer-Wold Auto-Encoder

Szymon Knop szymonknop@gmail.com
Przemysław Spurek przemyslaw.spurek@uj.edu.pl
Jacek Tabor jacek.tabor@uj.edu.pl
Igor Podolak igor.podolak@uj.edu.pl
Marcin Mazur marcin.mazur@uj.edu.pl
Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland

Stanisław Jastrzębski staszek.jastrzebski@gmail.com
Center of Data Science / Department of Radiology, New York University, New York, United States

Editor: John Cunningham

Abstract

The computation of a distance to the true distribution is a key component of most state-of-the-art generative models. Inspired by prior work on the Sliced-Wasserstein Auto-Encoder (SWAE) and the Wasserstein Auto-Encoder with MMD-based penalty (WAE-MMD), we propose a new generative model: the Cramer-Wold Auto-Encoder (CWAE). A fundamental component of CWAE is a characteristic kernel, whose construction is one of the goals of this paper, from here on referred to as the Cramer-Wold kernel. Its main distinguishing feature is a closed form for the kernel product of radial Gaussians. Consequently, the CWAE model has a closed-form expression for the distance between the posterior and the normal prior, which simplifies the optimization procedure by removing the need to sample in order to compute the loss function. At the same time, CWAE performance often improves upon WAE-MMD and SWAE on standard benchmarks.

Keywords: Auto-Encoder, Generative model, Wasserstein Auto-Encoder, Cramer-Wold Theorem, Deep neural network

1. Introduction

One of the crucial aspects in the construction of generative models is devising efficient methods for computing and minimizing the distance to the true data distribution. In the Variational Auto-Encoder (VAE), the distance to the true distribution is measured using the KL divergence under the latent variable model and minimized using variational inference.
An improvement was brought by the introduction of the Wasserstein metric (Tolstikhin et al., 2017) in the construction of the WAE-GAN and WAE-MMD models, which relaxed the need for variational methods. WAE-GAN requires a separate optimization problem to be solved to approximate the divergence measure used, while in WAE-MMD the discriminator has a closed form obtained from a characteristic kernel¹.

1. A kernel is characteristic if its embedding is injective on distributions, see e.g. Muandet et al. (2017).

©2020 Szymon Knop, Przemysław Spurek, Jacek Tabor, Igor Podolak, Marcin Mazur, Stanisław Jastrzębski. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-560.html.

Figure 1: The Cramer-Wold distance between two sets is obtained as the mean squared L2 distance of their smoothed projections over all one-dimensional lines. The figure shows an exemplary (one of many) projection.

Most recently, Kolouri et al. (2018) introduced the Sliced-Wasserstein Auto-Encoder (SWAE), which simplifies the distance computation even further. The main innovation of SWAE was the introduction of the Sliced-Wasserstein distance, a fast-to-estimate metric for comparing two distributions, based on the mean Wasserstein distance of one-dimensional projections. However, even in SWAE there is no closed-form analytic formula that would enable computing the distance of a sample from the standard normal distribution. Consequently, in SWAE two types of sampling are needed: (i) sampling from the prior distribution and (ii) sampling from one-dimensional projections.

The main contribution of this paper is the introduction of the Cramer-Wold distance between distributions, which is based on the MMD distance and a new Cramer-Wold kernel. The Cramer-Wold kernel is characteristic, i.e. the embedding is injective, and admits a closed form in a certain case (see Eq. (12)).
Thanks to the closed-form formula, it can be computed efficiently. We use it to construct the Cramer-Wold Auto-Encoder (CWAE) model, in which the cost function has a closed analytic formula. We demonstrate on standard benchmarks that CWAE is faster to optimise, more stable (no sampling is needed during the learning process), and retains, or even improves, performance compared to both WAE-MMD and SWAE.

The Cramer-Wold kernel can be used as a measure between a sample and a mixture of radial Gaussian distributions. Śmieja et al. (2019) present a semi-supervised generative model, SeGMA, which is able to learn a joint probability distribution of data and their classes. It is implemented in a typical auto-encoder framework but uses a mixture of Gaussians as the target distribution in the latent space. In such a situation, the classical Wasserstein distance is difficult to use since it requires sampling from both (target and real) distributions. SeGMA works efficiently due to the use of the Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus eliminates the need for sampling.

This paper is arranged as follows. In Sections 3 and 4 we introduce and theoretically analyze the Cramer-Wold distance, with the formal definition of the Cramer-Wold kernel in Section 5. Readers interested mainly in the construction of CWAE may proceed directly to Section 6. Section 7 contains experiments. Finally, we conclude in Section 9.

2. Motivation and related work

One of the ways to look at modern generative models (see, e.g. Tolstikhin et al. (2017)) is to note that each of them tends to minimise a certain divergence measure between the true, but unknown, data distribution P_X and the model distribution P_D, defined as a possibly random transportation, via a given map D, of a fixed distribution P_Z acting on the latent space Z, into X.
The most well known are the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences, which correspond to the Variational Auto-Encoder VAE (Kingma et al., 2014) and the Generative Adversarial Network GAN (Goodfellow et al., 2014) models, respectively (although in GAN a saddle-point objective occurs and hence adversarial training is required). However, these measures are often hard to use in a learning process due to computational problems, including complexity, vanishing gradients, etc.

In recent years new approaches, involving the optimal transport (OT) setting (Villani, 2008), appeared in generative modeling. They were based on the use of the Wasserstein or, more generally, optimal transport distance as a measure of divergence between distributions. Beside the classical Wasserstein GAN (Arjovsky et al., 2017) model, we can mention here the Wasserstein Auto-Encoder WAE (Tolstikhin et al., 2017) as well as the Sliced-Wasserstein Auto-Encoder SWAE (Kolouri et al., 2018) as models which were the inspiration and reference points for our work. In the following two paragraphs we briefly recall the main concepts behind these ideas.

Wasserstein Auto-Encoder (WAE). Tolstikhin et al. (2017) introduce an auto-encoder based generative model with a deterministic decoder D and a, possibly random, encoder E, which is based on minimizing the Wasserstein distance d_W(P_X, P_D) between the data and the model distributions. Recall that d_W(\mu, \nu) is given by the formula

d_W^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x_1 - x_2\|_2^2 \, d\gamma(x_1, x_2),

where \Gamma(\mu, \nu) is the set of joint probability measures having \mu and \nu as marginals. By Theorem 1 of Tolstikhin et al.
(2017) this leads to the WAE objective function expressed as a sum of two terms: (i) an expected cost of the difference between the data distribution P_X and another distribution on the data space, obtained by a self-transportation of P_X via an appropriately understood superposition of E and D, and (ii) a tuned divergence d_Z between a prior distribution P_Z and another distribution on Z, obtained by transporting P_X into Z via E. In consequence, assuming a deterministic encoder, the authors introduce two generative models, depending on the specific divergence measure used: WAE-GAN, involving the JS-divergence as d_Z (learned by adversarial training), and WAE-MMD, using as d_Z the maximum mean discrepancy MMD_k with a suitably chosen characteristic kernel function k.

Sliced-Wasserstein Auto-Encoder (SWAE). Another contribution that involves the optimal transport setting in generative modeling is the work of Kolouri et al. (2018). It differs from WAE in the choice of the divergence measure d_Z. It is based on a slicing method and the fact that the Wasserstein distance between one-dimensional distributions can be easily expressed as

d_W^2(\mu, \nu) = \int_0^1 \big(P_\mu^{-1}(t) - P_\nu^{-1}(t)\big)^2 \, dt,

where P_\mu^{-1} and P_\nu^{-1} denote the quantile functions of \mu and \nu, respectively. The authors use as d_Z the mean value of d_W^2 taken over all one-dimensional projections of the appropriate distributions on the latent space Z (see the next section for more details). This idea directly motivated our Cramer-Wold distance.

3. Cramer-Wold distance

Motivated by the prevalent use of the normal distribution as the prior in modern generative models, we investigate whether it is possible to simplify and speed up the optimization of such models. As a first step, we introduce the Cramer-Wold distance, which has a simple analytical formula for measuring the normality of high-dimensional samples.
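The sliced-Wasserstein idea recalled above (the average of one-dimensional Wasserstein distances over random projections) can be sketched in a few lines of NumPy. This is an illustrative Monte Carlo estimator for equal-size samples, not the paper's implementation; the function name and the number of projections are our own choices. For equal-size one-dimensional samples, the squared Wasserstein-2 distance reduces to the mean squared difference of the sorted values.

```python
import numpy as np

def sliced_wasserstein_sq(X, Y, n_proj=500, rng=None):
    """Monte Carlo estimate of the squared sliced-Wasserstein distance
    between two equal-size samples X, Y of shape (n, D)."""
    rng = np.random.default_rng(rng)
    D = X.shape[1]
    v = rng.normal(size=(n_proj, D))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform directions on the sphere
    px, py = X @ v.T, Y @ v.T                      # 1D projections, shape (n, n_proj)
    px.sort(axis=0)
    py.sort(axis=0)                                # 1D optimal transport = sorting
    return float(np.mean((px - py) ** 2))
```

Note that even in this simple sketch the integral over the sphere must be estimated by sampling projections, which is exactly the step the Cramer-Wold distance removes.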
On a high level, our proposed approach uses the traditional L2 distance between kernel-based density estimates, computed across multiple one-dimensional projections of the true data and of the output distribution of the model. We base our construction on the following two popular tricks of the trade: slice-based decomposition and smoothing of distributions.

Slice-based decomposition of a distribution. Following Kolouri et al. (2018); Deshpande et al. (2018), the initial concept is to leverage the Cramer-Wold Theorem (Cramér and Wold, 1936) and the Radon Transform (Deans, 1983) to reduce the computation of a distance between two distributions to one-dimensional calculations. For v in the unit sphere S_D \subset R^D, the projection of a set X \subset R^D onto the space spanned by v is given by v^T X, whereas the projection of N(m, \alpha I) is N(v^T m, \alpha). The Cramer-Wold theorem states that a multivariate distribution is uniquely identified by all its one-dimensional projections. Hence, to obtain the key component of the SWAE model, i.e. the sliced-Wasserstein distance between two samples X, Y \subset R^D, we compute the mean Wasserstein distance over all one-dimensional projections²:

d_{sw}^2(X, Y) = \int_{S_D} d_W^2(v^T X, v^T Y) \, d\sigma_D(v),   (1)

where S_D denotes the unit sphere in R^D and \sigma_D is the normalised surface measure on S_D. This approach is effective since the one-dimensional Wasserstein distance between samples has a closed form; therefore, to estimate Eq. (1), one only has to sample over the one-dimensional projections.

Smoothing distributions. Using the slice-based decomposition requires defining a distance measure between two sets of samples in a one-dimensional space. To this end, we use an approach commonly applied in statistics to compare samples or distributions, which is to first smooth the (sample) distribution with a Gaussian kernel. For a sample R = (r_i)_{i=1..n} \subset R
by its smoothing with the Gaussian kernel N(0, \gamma) we understand (1/n) \sum_i N(r_i, \gamma), where by N(m, S) we denote the one-dimensional normal density with mean m and variance S. This produces a distribution with a regular density and is commonly used in kernel density estimation. If R comes from a normal distribution with standard deviation close to one, the asymptotically optimal choice \gamma = (4/(3n))^{2/5} is given by Silverman's rule of thumb (see Silverman (1986)). Theoretically, one can choose an arbitrary fixed \gamma; however, we use an approach similar to the Bowman-Foster normality test (Bowman and Foster, 1993)³. For a continuous density f, its smoothing sm_\gamma(f) is given by the convolution with N(0, \gamma), and in the special case of Gaussians we have sm_\gamma(N(m, S)) = N(m, S + \gamma).

2. Observe that in a space H with scalar product ⟨·,·⟩, each one-dimensional projection is given by x → ⟨x, v⟩ for some v ∈ H. Consequently, this projection is proportional to x → ⟨x, v/‖v‖⟩, which is a 1D projection with respect to an element of the unit sphere.

Cramer-Wold distance. We are now ready to introduce the Cramer-Wold distance. For the convenience of the reader, we first formulate the distance between two samples, and then between a sample and a distribution. For a formal definition of the distance between distributions see the paragraph Generalised Cramer-Wold kernel in Section 6. In a nutshell, we propose to compute the squared distance between two samples as the mean squared L2 distance between their smoothed projections over all one-dimensional subspaces. By the squared L2 distance between functions f, g : R → R we mean ‖f − g‖_2^2 = \int |f(x) − g(x)|^2 dx. A key feature of this distance is that it admits a closed form in the case of the normal distribution. The following procedure fully defines the Cramer-Wold distance between two samples X = (x_i)_{i=1..n}, Y = (y_j)_{j=1..k} \subset R^D (for an illustration of Steps 1 and 2 see Figure 1):

1.
given v in the unit sphere S_D \subset R^D, consider the projections v^T X = (v^T x_i)_{i=1..n} and v^T Y = (v^T y_j)_{j=1..k};

2. compute the squared L2 distance between the densities sm_\gamma(v^T X) and sm_\gamma(v^T Y): ‖sm_\gamma(v^T X) − sm_\gamma(v^T Y)‖_2^2;

3. to obtain the squared Cramer-Wold distance, average (integrate) the above formula over all possible v ∈ S_D.

The key theoretical outcome of this paper is that the computation of the Cramer-Wold distance can be reduced to a closed-form solution. Consequently, to compute the distance between two samples there is no need to find the optimal transport as in WAE, nor is it necessary to sample over the projections as in SWAE.

Theorem 1 Let X = (x_i)_{i=1..n}, Y = (y_j)_{j=1..n} \subset R^D be given⁴. We formally define the squared Cramer-Wold distance by the formula

d_{cw}^2(X, Y) := \int_{S_D} \|sm_\gamma(v^T X) - sm_\gamma(v^T Y)\|_2^2 \, d\sigma_D(v).   (2)

Then

d_{cw}^2(X, Y) = \frac{1}{2n^2\sqrt{\pi\gamma}} \Big[ \sum_{i,i'} \phi_D\Big(\frac{\|x_i - x_{i'}\|^2}{4\gamma}\Big) + \sum_{j,j'} \phi_D\Big(\frac{\|y_j - y_{j'}\|^2}{4\gamma}\Big) - 2\sum_{i,j} \phi_D\Big(\frac{\|x_i - y_j\|^2}{4\gamma}\Big) \Big],   (3)

where \phi_D(s) = {}_1F_1(\tfrac{1}{2}; \tfrac{D}{2}; -s) and {}_1F_1 is Kummer's confluent hypergeometric function (see, e.g., Barnard et al. (1998)). Moreover, \phi_D(s) has the following asymptotic formula, valid for D ≥ 20:

\phi_D(s) \approx \big(1 + \tfrac{4s}{2D-3}\big)^{-1/2}.   (4)

3. The choice of the optimal value of the \gamma parameter is still a challenging problem. In this paper we use Silverman's rule of thumb since it works very well in practical applications, although other rules are also possible.

4. For clarity of presentation we provide here the formula for the case of samples of equal size.

To prove Theorem 1 we need the following crucial technical proposition.

Proposition 2 Let z ∈ R^D and \gamma > 0 be given. Then

\int_{S_D} N(v^T z, \gamma)(0) \, d\sigma_D(v) = \frac{1}{\sqrt{2\pi\gamma}} \, \phi_D\Big(\frac{\|z\|^2}{2\gamma}\Big).   (5)

Proof By applying an orthonormal change of coordinates we may assume, without loss of generality, that z = (z_1, 0, \ldots, 0), and then v^T z = z_1 v_1 for v = (v_1, \ldots, v_D). Consequently, we get \int_{S_D} N(v^T z, \gamma)(0) \, d\sigma_D(v) = \int_{S_D} N(z_1 v_1, \gamma)(0) \, d\sigma_D(v).
Making use of the formula for slice integration of functions on spheres (Axler et al., 1992, Corollary A.6) we get

\int_{S_D} f \, d\sigma_D = \frac{V_{D-1}}{V_D} \int_{-1}^{1} (1-x^2)^{(D-3)/2} \int_{S_{D-1}} f\big(x, \sqrt{1-x^2}\,\zeta\big) \, d\sigma_{D-1}(\zeta) \, dx,

where V_K denotes the surface volume of the sphere S_K \subset R^K. Applying the above equality to the function f(v_1, \ldots, v_D) = N(z_1 v_1, \gamma)(0), and putting s = z_1^2/(2\gamma) = \|z\|^2/(2\gamma), we get that the LHS of (5) simplifies to

\frac{V_{D-1}}{V_D \sqrt{2\pi\gamma}} \int_{-1}^{1} (1-x^2)^{(D-3)/2} e^{-s x^2} \, dx,

which completes the proof, since V_K = \frac{2\pi^{K/2}}{\Gamma(K/2)} and \int_{-1}^{1} e^{-s x^2} (1-x^2)^{(D-3)/2} \, dx = \sqrt{\pi}\, \frac{\Gamma(\frac{D-1}{2})}{\Gamma(\frac{D}{2})}\, {}_1F_1(\tfrac{1}{2}; \tfrac{D}{2}; -s).

Proof [Proof of Theorem 1] Directly from the definition of smoothing we obtain

d_{cw}^2(X, Y) = \int_{S_D} \Big\| \tfrac{1}{n}\sum_i N(v^T x_i, \gamma) - \tfrac{1}{n}\sum_j N(v^T y_j, \gamma) \Big\|_2^2 \, d\sigma_D(v).   (6)

Applying now the one-dimensional formula for the L2 scalar product of two Gaussians,

⟨N(r_1, \gamma_1), N(r_2, \gamma_2)⟩_2 = N(r_1 - r_2, \gamma_1 + \gamma_2)(0),

and the equality ‖f − g‖_2^2 = ⟨f, f⟩_2 + ⟨g, g⟩_2 − 2⟨f, g⟩_2 (where ⟨f, g⟩_2 = \int f(x) g(x) \, dx), we simplify the squared L2 norm in the integrand of the RHS of (6) to

\Big\| \tfrac{1}{n}\sum_i N(v^T x_i, \gamma) - \tfrac{1}{n}\sum_j N(v^T y_j, \gamma) \Big\|_2^2 = \tfrac{1}{n^2} \Big[ \sum_{i,i'} N(v^T(x_i - x_{i'}), 2\gamma)(0) + \sum_{j,j'} N(v^T(y_j - y_{j'}), 2\gamma)(0) - 2\sum_{i,j} N(v^T(x_i - y_j), 2\gamma)(0) \Big].

Applying Proposition 2 directly, we obtain formula (3). The proof of the asymptotic formula for \phi_D is provided in the next section.

Therefore, to estimate the distance of a given sample X to some prior distribution f, one can follow the common approach and take the distance between X and a sample drawn from f. As the main theoretical result of the paper we consider the following theorem, which states that in the case of a standard Gaussian multivariate prior we can completely remove the need for sampling (we omit the proof since it is similar to that of Theorem 1).

Theorem 3 Let X = (x_i)_{i=1..n} \subset R^D be a given sample. We formally define

d_{cw}^2(X, N(0, I)) := \int_{S_D} \|sm_\gamma(v^T X) - sm_\gamma(N(0, 1))\|_2^2 \, d\sigma_D(v).
Then

d_{cw}^2(X, N(0, I)) = \frac{1}{2n^2\sqrt{\pi}} \Big[ \frac{1}{\sqrt{\gamma}} \sum_{i,j} \phi_D\Big(\frac{\|x_i - x_j\|^2}{4\gamma}\Big) + \frac{n^2}{\sqrt{1+\gamma}} - \frac{2n}{\sqrt{\gamma + 1/2}} \sum_i \phi_D\Big(\frac{\|x_i\|^2}{2 + 4\gamma}\Big) \Big].

See Figure 2 for a comparison between the Cramer-Wold, Wasserstein MMD, and Sliced-Wasserstein distances for different data dimensions and sample sizes. In the experiment, we use two samples from the Gaussian distributions N([0, \ldots, 0]^T, I) and N([\alpha, 0, \ldots, 0]^T, I), where the parameter \alpha varies in the range [0, 6]. Note that the Cramer-Wold distance is the lowest one irrespective of data dimension and sample size, and does not change much.

Figure 2: Comparison between the Cramer-Wold, Wasserstein MMD, and Sliced-Wasserstein distances for different dimensions (rows from top to bottom: 10, 50, 100, 200) and sample sizes (columns from left to right: 100, 200, 500, 1000). In the experiment, we use two samples from the Gaussians N([0, \ldots, 0]^T, I) and N([\alpha, 0, \ldots, 0]^T, I), where the parameter \alpha of the mean shift is changed in the range [0, 6].

4. Computation of \phi_D

As shown in the previous section, the key element of the Cramer-Wold distance is the function \phi_D(s) = {}_1F_1(\tfrac{1}{2}; \tfrac{D}{2}; -s) for s ≥ 0. Consequently, in this section we focus on the derivation of its basic properties. We provide its approximate asymptotic formula, valid for dimensions D ≥ 20, and then consider the special case of D = 2 (see Figure 3), where we provide the exact formula. To do so, let us first recall (see Abramowitz and Stegun (1964, Chapter 13)) that Kummer's confluent hypergeometric function {}_1F_1 (denoted also by M) has the integral representation

{}_1F_1(a; b; z) = \frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)} \int_0^1 e^{zu} u^{a-1} (1-u)^{b-a-1} \, du,

valid for a, b > 0 such that b > a. Since we assume that the latent space is at least of dimension D ≥ 2, it follows that

\phi_D(s) = \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{D-1}{2})} \int_0^1 e^{-su} u^{-1/2} (1-u)^{(D-3)/2} \, du.

By making the substitution u = x^2, du = 2x \, dx, we consequently get

\phi_D(s) = \frac{2\Gamma(\frac{D}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{D-1}{2})} \int_0^1 e^{-sx^2} (1-x^2)^{(D-3)/2} \, dx = \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{D-1}{2})} \int_{-1}^{1} e^{-sx^2} (1-x^2)^{(D-3)/2} \, dx.
(7)

Proposition 4 For large⁵ D we have

\phi_D(s) \approx \big(1 + \tfrac{4s}{2D-3}\big)^{-1/2} \quad \text{for all } s ≥ 0.   (8)

Proof We have to estimate the asymptotics of (7), i.e. of \phi_D(s) = \frac{\Gamma(\frac{D}{2})}{\sqrt{\pi}\,\Gamma(\frac{D-1}{2})} \int_{-1}^{1} e^{-sx^2}(1-x^2)^{(D-3)/2} \, dx. Since for large D and all x ∈ [−1, 1] we have

(1-x^2)^{(D-3)/2} e^{-sx^2} \approx (1-x^2)^{(D-3)/2} (1-x^2)^{s} = (1-x^2)^{s + (D-3)/2},

we obtain

\phi_D(s) \approx \frac{\Gamma(\frac{D}{2})}{\sqrt{\pi}\,\Gamma(\frac{D-1}{2})} \int_{-1}^{1} (1-x^2)^{s + (D-3)/2} \, dx = \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-1}{2})} \cdot \frac{\Gamma(s + \frac{D-1}{2})}{\Gamma(s + \frac{D}{2})}.

To simplify the above, we apply formula (1) from Tricomi and Erdélyi (1951),

\frac{\Gamma(z+\alpha)}{\Gamma(z+\beta)} = z^{\alpha-\beta}\Big(1 + \frac{(\alpha-\beta)(\alpha+\beta-1)}{2z} + O(|z|^{-2})\Big),

with \alpha, \beta fixed so that \alpha + \beta = 1 (so that only the error term of order O(|z|^{-2}) remains), to both Gamma ratios (with z = \frac{D}{2} - \frac{3}{4} and z = s + \frac{D}{2} - \frac{3}{4}, respectively). Summarizing,

\phi_D(s) \approx \Big(\frac{\frac{D}{2} - \frac{3}{4}}{s + \frac{D}{2} - \frac{3}{4}}\Big)^{1/2} = \Big(1 + \frac{4s}{2D-3}\Big)^{-1/2}.

5. In practice we can take D ≥ 20.

The above formula is valid for dimensions higher than 20. For lower dimensions we recommend using direct iterative formulas for the \phi_D function, which can be obtained using the error function erf and the modified Bessel functions of the first kind I_0 and I_1. As an example we consider here the special case D = 2, since it is used later in the paper, for illustrative reasons, in the latent space for the MNIST data-set. Since we have the equality (Gradshteyn and Ryzhik, 2015, (8.406.3) and (9.215.3))

\phi_2(s) = {}_1F_1(\tfrac{1}{2}; 1; -s) = e^{-s/2} I_0(\tfrac{s}{2}),

to practically implement \phi_2 we apply the approximation of I_0 from Abramowitz and Stegun (1964, p. 378), given in the following remark.

Figure 3: Comparison of the \phi_D value (red line) with the approximation given by (8) (green line) for dimensions D = 2, 5, 20. Observe that for D = 20 the functions practically coincide.

Remark 5 Let s ≥ 0 be arbitrary and let t = s/7.5. Then

\phi_2(s) \approx e^{-s/2} (1 + 3.51562 t^2 + 3.08994 t^4 + 1.20675 t^6 + 0.26597 t^8 + 0.03608 t^{10} + 0.00458 t^{12}) \quad \text{for } s ∈ [0, 7.5],

\phi_2(s) \approx \sqrt{2/s}\, (0.398942 + 0.013286 t^{-1} + 0.002253 t^{-2} - 0.001576 t^{-3} + 0.00916 t^{-4} - 0.020577 t^{-5} + 0.026355 t^{-6} - 0.016476 t^{-7} + 0.003924 t^{-8}) \quad \text{for } s ≥ 7.5.

5.
Cramer-Wold kernel

In this section we first formally define the Cramer-Wold metric for arbitrary measures, and then show that it is given by a characteristic kernel which has a closed form for spherical Gaussians. For more information on kernels and kernel embeddings of distributions we refer the reader to Muandet et al. (2017).

Let us first introduce the general definition of the Cramer-Wold (cw) metric. To do so, we generalise the notion of smoothing to arbitrary measures \mu by the formula

sm_\gamma(\mu) = \mu * N(0, \gamma I),

where * denotes the convolution operator for two measures, and we identify the normal density N(0, \gamma I) with the measure it induces. It is well known that the resulting measure has the density

x \mapsto \int N(x, \gamma I)(y) \, d\mu(y).

Clearly sm_\gamma(N(0, \alpha I)) = N(0, (\alpha + \gamma) I). Moreover, by applying the characteristic function one obtains that if the smoothings of two measures coincide, then the measures coincide too:

sm_\gamma(\mu) = sm_\gamma(\nu) \implies \mu = \nu.   (10)

We also need to define the transport of a density by the projection x \mapsto v^T x, where v is chosen from the unit sphere S_D. The definition is formulated so that if X is a random vector with density f, then f_v is the density of the random variable X_v := v^T X. Then

f_v(r) = \int_{\{y : v^T y = r\}} f(z) \, d\ell_{D-1}(z),

where \ell_{D-1} denotes the (D−1)-dimensional Lebesgue measure on the hyperplane {y : v^T y = r}. In general, if \mu is a measure on R^D, then \mu_v is the measure defined on R by the formula \mu_v(A) = \mu(\{x : v^T x ∈ A\}). Since a random vector X with density N(a, \gamma I) yields a random variable X_v with density N(v^T a, \gamma), we may directly conclude that N(a, \gamma I)_v = N(v^T a, \gamma).

Figure 4: Results of the VAE, WAE-MMD, SWAE, and CWAE models trained on the CelebA data-set using the WAE architecture from Tolstikhin et al. (2017). Columns show test interpolation, test reconstruction, and a random sample. In test reconstructions, odd rows correspond to the real test points.
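The two operations just introduced, smoothing by convolution with N(0, γI) and transport by a one-dimensional projection, can be sanity-checked by sampling. The short sketch below is ours (all constants are arbitrary illustrative choices): adding N(0, γI) noise to draws from N(0, αI) realises sm_γ(N(0, αI)) = N(0, (α+γ)I), and projecting draws from N(a, γI) onto a unit vector v gives a variable with mean v^T a and variance γ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 5, 200_000
alpha, gamma = 0.5, 0.3

# Smoothing = convolution with N(0, gamma*I): add independent Gaussian noise.
z = rng.normal(scale=np.sqrt(alpha), size=(n, D))
smoothed = z + rng.normal(scale=np.sqrt(gamma), size=(n, D))
# Per-coordinate variance should be alpha + gamma.
assert abs(smoothed.var() - (alpha + gamma)) < 1e-2

# Projection of N(a, gamma*I) by a unit vector v is N(v^T a, gamma).
a = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
v = rng.normal(size=D)
v /= np.linalg.norm(v)
x = a + rng.normal(scale=np.sqrt(gamma), size=(n, D))
proj = x @ v
assert abs(proj.mean() - v @ a) < 1e-2
assert abs(proj.var() - gamma) < 1e-2
```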
It is also worth noting that, since the projection of a Gaussian is a Gaussian, the smoothing and projection operators commute, i.e. sm_\gamma(\mu_v) = (sm_\gamma \mu)_v.

Given a fixed \gamma > 0, the two notions above allow us to formally define the cw-distance of two measures \mu and \nu by the formula

d_{cw}^2(\mu, \nu) = \int_{S_D} \|sm_\gamma(\mu_v) - sm_\gamma(\nu_v)\|_{L_2}^2 \, d\sigma_D(v).   (11)

Observe that this implies that the cw-distance is given by the kernel function

k(\mu, \nu) = \int_{S_D} ⟨sm_\gamma(\mu_v), sm_\gamma(\nu_v)⟩_{L_2} \, d\sigma_D(v).

Let us now prove that the function d_{cw} defined by equation (11) is a metric (i.e. that the kernel is characteristic).

Theorem 6 The function d_{cw} is a metric.

Proof Since d_{cw} comes from a scalar product, we only need to show that if the distance of two measures is zero, then the measures coincide. So let \mu, \nu be measures such that d_{cw}(\mu, \nu) = 0. This implies that sm_\gamma(\mu_v) = sm_\gamma(\nu_v) for all v ∈ S_D. By (10) this implies that \mu_v = \nu_v. Since this holds for all v ∈ S_D, by the Cramer-Wold Theorem we obtain that \mu = \nu.

We can summarize the above by saying that the Cramer-Wold kernel is a characteristic kernel which, by the definition and (5), has a closed form for the scalar product of two radial Gaussians, given by

⟨N(x, \alpha I), N(y, \beta I)⟩_{cw} = \frac{1}{\sqrt{2\pi(\alpha + \beta + 2\gamma)}} \, \phi_D\Big(\frac{\|x - y\|^2}{2(\alpha + \beta + 2\gamma)}\Big).   (12)

Remark 7 Observe that, apart from the Gaussian kernel, it is the only kernel which has a closed form for spherical Gaussians. This is important since RBF (Gaussian) kernels cannot be successfully applied in auto-encoder based generative models (we discuss this in the next section). The reason is that the derivative of a Gaussian vanishes quickly with distance, which leads to difficulties in training, as shown in (Tolstikhin et al., 2017, Section 4, WAE-GAN and WAE-MMD specifics).

6. Cramer-Wold Auto-Encoder (CWAE)

In this section we construct an auto-encoder based on the Cramer-Wold distance. We start by introducing notation.

Auto-encoder. Let X = (x_i)_{i=1..n} \subset R^N be a given data-set.
The basic aim of an auto-encoder (AE) is to transport the data to a typically, but not necessarily, lower-dimensional latent space Z = R^D while minimizing the reconstruction error. Hence, we search for an encoder E : R^N → Z and a decoder D : Z → R^N that minimise the mean squared error MSE(X; E, D) between X and its reconstructions D(E x_i).

Auto-encoder based generative model. CWAE, similarly to WAE, is an auto-encoder model with a modified cost function which forces the model to be generative, i.e. ensures that the data transported to the latent space come from the prior distribution (typically Gaussian). This statement is formalized by the following important remark, see also Tolstikhin et al. (2017).

Remark 8 Let X be an N-dimensional random vector from which our data-set was drawn, and let Y be a random vector with distribution f on the latent space Z. Suppose that we have constructed functions E : R^N → Z and D : Z → R^N (representing the encoder and decoder pair) such that⁶

1. D(Ex) = x for x ∈ image(X),

2. the random vector EX has the distribution f.

Then by point 1 we obtain that D(EX) = X, and therefore DY has the same distribution as D(EX) = X. This means that to produce samples from X we can instead produce samples from Y and map them by the decoder D. Since an estimator of the image of the random vector X is given by its sample X, we conclude that a generative model is correct if it has a small reconstruction error and resembles the prior distribution in the latent space. Thus, to construct a generative auto-encoder model (with Gaussian prior), we add to its cost function a measure of the distance of a given sample from the normal distribution.

6. We recall that for a function (or, in particular, a random vector) X : Ω → R^D, by image(X) we denote the set consisting of all possible values X can attain, i.e. {X(ω) : ω ∈ Ω}.

Figure 5: Comparison between CWAE and WAE-MMD with the CW kernel on the Fashion-MNIST data-set.

CWAE cost function.
Once the crucial ingredient of CWAE is ready, we can describe its cost function. To ensure that the data transported to the latent space Z are distributed according to the standard normal density, we can add to the cost function the Cramer-Wold distance of the encoded sample from the standard multivariate normal density:

cost(X; E, D) = MSE(X; E, D) + \lambda \, d_{cw}^2(EX, N(0, I)).

Figure 6: Synthetic data in the latent space and the distance-from-prior cost: the CWAE model on the left, WAE-MMD on the right. The horizontal axis shows the share of z ~ N(0, 1) in the uniform data. The blue curves represent the standard model (without logarithm), while the orange ones denote the variant with the logarithm.

Since the use of the special functions involved in the formula for the Cramer-Wold distance might be cumbersome, in all experiments (except the illustrative 2D case) we apply the asymptotic form (8) of the function \phi_D:

2\sqrt{\pi}\, d_{cw}^2(X, N(0, I)) \approx \frac{1}{n^2} \sum_{i,j} \Big(\gamma_n + \frac{\|x_i - x_j\|^2}{2D-3}\Big)^{-1/2} + (1 + \gamma_n)^{-1/2} - \frac{2}{n} \sum_i \Big(\gamma_n + \frac{1}{2} + \frac{\|x_i\|^2}{2D-3}\Big)^{-1/2},

where \gamma_n = (\frac{4}{3n})^{2/5} is chosen using Silverman's rule of thumb (Silverman, 1986). Such a solution can be understood as the use of WAE-MMD with a Cramer-Wold kernel.

In the CWAE model, we additionally apply a logarithm:

cost(X; E, D) = MSE(X; E, D) + \lambda \log d_{cw}^2(EX, N(0, I)).

The CWAE cost thus differs from the WAE model cost by the use of the logarithm. We observed that using a logarithm to scale the second term increases training speed, as shown in Figure 5. During the first few iterations it is typical for the variation of the errors to be high. In the case of CWAE, the d_{cw} cost is around 10 times larger than the d_k cost of WAE. The logarithm tones it down substantially, increasing the stability of learning; this is not needed in WAE. The network finds a smoother way to increase the normality of the latent space, thus speeding up the training process. At the same time, it is probable that at the beginning of training the distribution of example projections in the latent space is more uniform.
Then, with training progression, it tends to become closer to a normal distribution (assuming a normal prior). A synthetic-data experiment showing this phenomenon is given in Figure 6. The logarithmic cost drops off much quicker, pulling the model towards faster minimization.

Figure 7: FID score comparison between WAE, SWAE and CWAE with respect to batch size. We repeated the experiment five times; confidence intervals represent the standard deviation.

On the other hand, a modification of the WAE-MMD cost from + d_k^2(\cdot, \cdot) (see Eq. (13)) to + \log d_k^2(\cdot, \cdot) (as in Eq. (14)) results in a steeper and more irregular descent. The WAE-MMD cost is closer to zero, and may sometimes even be negative, as noted by Tolstikhin (2018): "...penalty used in WAE-MMD is not precisely the population MMD, but a sample based U-statistic... if the population MMD is zero, it necessarily needs to take negative values from time to time." Therefore the log version is not suitable for WAE-MMD, which coincides with our experiments.

The use of the Cramer-Wold distance and of a logarithm in the cost function allows us to construct more stable models. More precisely, the cost function is less sensitive to changes of training parameters like batch size and learning rate, see Figure 7. As a consequence, in practical applications the CWAE model is easier to train.

Comparison with WAE and SWAE models. Finally, let us briefly recapitulate the differences between the introduced CWAE model, the WAE variants (Tolstikhin et al., 2017), and SWAE (Kolouri et al., 2018). Firstly, from the theoretical point of view, both the SWAE and CWAE models use similar distances d_{sw} and d_{cw}, obtained by integration over all one-dimensional projections (compare Eqs. (1) and (2)). On the other hand, SWAE uses Wasserstein distances under the integral, while CWAE uses, under the integral, the L2 distances between regularizations.
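To make the closed form concrete, the asymptotic CWAE penalty, i.e. the approximation of d_{cw}^2(X, N(0, I)) obtained by plugging (8) into the exact formula, with Silverman's γ_n, can be sketched in NumPy as follows. The function name is ours; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def cw_normality_sq(X):
    """Asymptotic closed form of d^2_cw(X, N(0, I)) (accurate for D >= 20):
    2*sqrt(pi)*d^2 ~ (1/n^2) sum_ij (g + |xi-xj|^2/(2D-3))^{-1/2}
                     + (1+g)^{-1/2}
                     - (2/n) sum_i (g + 1/2 + |xi|^2/(2D-3))^{-1/2}.
    No sampling is required to evaluate it."""
    n, D = X.shape
    g = (4.0 / (3.0 * n)) ** 0.4                 # Silverman's rule of thumb, gamma_n
    k = 2.0 * D - 3.0
    x2 = np.sum(X ** 2, axis=1)
    # Pairwise squared distances via the Gram matrix (clipped at 0 for safety).
    sq = np.maximum(x2[:, None] + x2[None, :] - 2.0 * X @ X.T, 0.0)
    t1 = np.sum((g + sq / k) ** -0.5) / n ** 2
    t2 = (1.0 + g) ** -0.5
    t3 = (2.0 / n) * np.sum((g + 0.5 + x2 / k) ** -0.5)
    return (t1 + t2 - t3) / (2.0 * np.sqrt(np.pi))
```

In a training loop this value (or its logarithm, as in the CWAE cost) would be added to the reconstruction error; note that, unlike a sampled MMD penalty, it is a deterministic function of the encoded batch.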
Additionally, the integral in the d_{sw} formula is estimated with a finite sum, while for d_{cw} we obtain an analytically derived and quite accurate approximate formula. From a computational point of view, it is important that, in contrast to WAE-MMD and SWAE, the CWAE model does not require sampling from the normal distribution (as is the case in WAE-MMD) or over slices (as in SWAE) to evaluate its cost function. In this sense, CWAE uses a closed-formula cost function. In contrast to WAE-GAN, our objective does not require a separately trained neural network to approximate the optimal transport function, thus avoiding the pitfalls of adversarial training.

Comparison with WAE-MMD models. We now compare the proposed CWAE model to WAE-MMD. In particular, we show that CWAE can be seen as a combination of the sliced approach with the MMD-based models. The WAE-MMD model uses approximations, while CWAE uses a closed form, which has an impact on training. It results in a more level descent of the distance weight, with even negative values in the case of the WAE-MMD estimator, see Figure 6.

Since both WAE and CWAE use kernels to discriminate between a sample and the normal density distribution, to compare the models we first describe the WAE model. The WAE cost function for a given characteristic kernel k and sample X = (x_i)_{i=1..n} \subset R^D (in the D-dimensional latent space) is given by

WAE cost = MSE + \lambda \, d_k^2(X, Y),   (13)

where Y = (y_i)_{i=1..n} is a sample from the standard normal density N(0, I), and d_k^2(X, Y) denotes the kernel-based distance between the probability distributions representing X and Y, that is \frac{1}{n}\sum_i \delta_{x_i} and \frac{1}{n}\sum_i \delta_{y_i}, where \delta_z denotes the Dirac measure at z ∈ R^D. The inverse multiquadric (IMQ) kernel

k(x, y) = \frac{C}{C + \|x - y\|_2^2}

is chosen as the default, where in the experiments in Tolstikhin et al. (2017) the value C = 2D\sigma^2 was used, with \sigma a hyper-parameter denoting the standard deviation of the normal density distribution.
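For comparison, a sketch of the sample-based WAE-MMD penalty with the IMQ kernel might look as follows. For brevity this is a biased V-statistic (the original WAE implementation uses an unbiased U-statistic that drops the diagonal terms, which is precisely why it can dip below zero); all names are ours.

```python
import numpy as np

def imq_mmd_sq(X, Y, C):
    """Biased (V-statistic) sample estimate of the squared MMD with the
    IMQ kernel k(x, y) = C / (C + ||x - y||^2)."""
    def gram(A, B):
        sq = (np.sum(A ** 2, axis=1)[:, None]
              + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return C / (C + np.maximum(sq, 0.0))
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())

# Unlike CWAE, WAE-MMD must draw a fresh sample Y ~ N(0, I) at every evaluation.
rng = np.random.default_rng(0)
n, D, sigma2 = 256, 64, 1.0
X = rng.normal(size=(n, D))
Y = rng.normal(size=(n, D))
penalty = imq_mmd_sq(X, Y, C=2 * D * sigma2)
```

The need to resample Y is the source of the extra estimation noise that CWAE's closed form removes.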
Therefore, the model has hyper-parameters λ and σ, which were chosen to be λ = 10, σ² = 1 for MNIST and λ = 100, σ² = 2 for CELEB A. On the other hand, the CWAE cost function for a sample X = (x_i)_{i=1..n} ⊂ R^D (in the D-dimensional latent) is given by

CWAE cost = MSE + λ log d_cw²(X, N(0, I)),   (14)

where the distance between the sample and the standard normal distribution is taken with respect to the Cramer-Wold kernel with a regularizing hyper-parameter γ given by Silverman's rule of thumb (the motivation for this choice of hyper-parameters is explained in Section 3). We stress the following important differences:

- Due to the properties of the Cramer-Wold kernel, we are able to replace the sample estimate of d_k²(X, N(0, I)) used in WAE-MMD with the exact formula for d_cw²(X, N(0, I)).
- CWAE, as compared to WAE, is less sensitive to the choice of parameters:
  1. The regularization hyper-parameter is given by Silverman's rule of thumb and depends on the sample size (contrary to WAE-MMD, where the hyper-parameters are chosen by hand and in general do not depend on the sample size).
  2. In our preliminary experiments we observed that frequently (as in the case of the log-likelihood) taking the logarithm of the non-negative factors of the cost function, which we aim to minimise to zero, improves learning. Motivated by this observation and by the analysis of the CWAE cost function, the CWAE cost uses the logarithm of the Cramer-Wold distance to balance the MSE and divergence terms. It turned out that in most cases it is enough to set λ = 1 in Eq. (14).

Furthermore, we show (see Figure 7) that CWAE is less sensitive with respect to the batch size. For every batch size and model we performed a grid search over λ ∈ {1, 10, 100} and learning rate values in {0.01, 0.001, 0.0001}. For every model, we selected the configuration with the lowest FID score and repeated the experiment five times.
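A minimal sketch of the structure of the CWAE objective in Eq. (14). This is an illustration, not the authors' implementation: for simplicity it substitutes the exact closed form of d_cw² (which involves the Kummer function ₁F₁) with the large-D asymptotic form of the Cramer-Wold kernel derived in the "Generalised Cramer-Wold kernel" paragraph, and uses Silverman's rule with σ = 1 for γ; the function name `cwae_cost` is ours.

```python
import numpy as np

def cw_kernel_asymptotic(sq_dist, alpha, beta, gamma, D):
    """Large-D asymptotic Cramer-Wold (d = 1) kernel of two radial Gaussians:
    k(N(x, aI), N(y, bI)) ~ (2*pi)^(-1/2) * (a + b + 2*gamma + ||x-y||^2/D)^(-1/2)."""
    return (2 * np.pi) ** -0.5 * (alpha + beta + 2 * gamma + sq_dist / D) ** -0.5

def cwae_cost(X, X_rec, Z):
    """Sketch of Eq. (14): MSE + log d_cw^2(Z, N(0, I)) with lambda = 1."""
    n, D = Z.shape
    gamma = (4.0 / (3.0 * n)) ** 0.4   # Silverman's rule with sigma = 1
    # pairwise term: smoothed atoms vs. smoothed atoms (alpha = beta = 0)
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    term_zz = cw_kernel_asymptotic(d2, 0.0, 0.0, gamma, D).mean()
    # cross term: smoothed atoms vs. the prior N(0, I) (beta = 1)
    term_zn = cw_kernel_asymptotic(np.sum(Z ** 2, axis=1), 0.0, 1.0, gamma, D).mean()
    # prior vs. prior (alpha = beta = 1)
    term_nn = cw_kernel_asymptotic(0.0, 1.0, 1.0, gamma, D)
    d2_cw = term_zz - 2.0 * term_zn + term_nn   # squared MMD-style distance
    mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    return mse + np.log(d2_cw)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 784))   # hypothetical data batch
Z = rng.normal(size=(128, 64))    # encoded batch, here ideally N(0, I)
print(cwae_cost(X, X, Z))         # normal latent: small distance term
print(cwae_cost(X, X, Z + 3.0))   # shifted latent: larger cost
```

Note that no sample from the prior is drawn anywhere: the cross and prior-prior terms are evaluated analytically, which is the point of the closed-form construction.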
As we can see, CWAE seems to be insensitive to this parameter. Summarizing, the CWAE model, contrary to WAE-MMD, is less sensitive to the choice of parameters. Moreover, since there is no noise in the learning process introduced by the random choice of the sample Y from N(0, I), learning should be more stable. As a consequence, CWAE generally learns faster than WAE-MMD and has a smaller standard deviation of the cost function during the learning process. Detailed results of the experiments on the CELEB A data-set are presented in Figure 8. Moreover, for a better comparison, we verified what the learning process looks like with the original WAE-MMD architecture from Tolstikhin et al. (2017), see Figure 8.

Generalised Cramer-Wold kernel. In this paragraph we show that, asymptotically with respect to the dimension D, the Cauchy kernel used in WAE-MMD can in fact be seen as a sliced kernel, where two-dimensional subspaces are used as slices. To do so we need a probability measure on the d-dimensional linear subspaces of R^D, see Mattila (1999). One can construct it either directly via the definition of the Grassmannian, or describe it through orthonormal bases by integration over orthonormal matrices (Aubert and Lam, 2003; Braun, 2006). We define the d-dimensional sliced Cramer-Wold kernel by the formula

k_d(μ, ν) = ∫_{G(d,D)} ⟨sm_γ(μ_v), sm_γ(ν_v)⟩_{L2} dγ_{d,D}(v),

where γ_{d,D} denotes the respective Radon probability measure on the Grassmannian G(d, D). Equivalently, we can integrate over orthonormal sequences in R^D of length d:

O_d(R^D) = {(v_1, . . . , v_d) ∈ (R^D)^d : ‖v_i‖ = 1, v_i ⊥ v_j for i ≠ j}.

We denote by θ_d the normalised measure on O_d that is invariant with respect to orthonormal transformations. Observe that for d = 1 we obtain the normalised integration over the sphere. Then k_d can be equivalently defined as

k_d(μ, ν) = ∫_{O_d} ⟨sm_γ(μ_v), sm_γ(ν_v)⟩_{L2} dθ_d(v).
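The measure θ_d can be sampled numerically in two ways, both of which are useful for sanity checks of the formulas below: exactly, by orthonormalising a Gaussian matrix, or approximately (for large D) by the i.i.d. Gaussian draws N(0, I/D) discussed in the next paragraph. A small sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 512, 2

# Exact draw from theta_d: QR-orthonormalise a Gaussian matrix; fixing the
# signs via diag(R) makes the resulting d-frame invariantly distributed.
V = rng.normal(size=(D, d))
Q, R = np.linalg.qr(V)
Q = Q * np.sign(np.diag(R))   # columns of Q: an orthonormal d-frame

# Approximation used in the text: v_i drawn i.i.d. from N(0, I/D) are
# already almost orthonormal when D is large.
W = rng.normal(scale=1.0 / np.sqrt(D), size=(D, d))
print(np.linalg.norm(W[:, 0]))   # close to 1
print(W[:, 0] @ W[:, 1])         # close to 0
```

For D = 512 the norm deviates from 1 by only a few percent, which is why the Gaussian substitution in the proof of Theorem 9 is accurate in high dimensions.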
Let us first observe that for Gaussian densities the formula for k_d can be slightly simplified:

k_d(N(x, αI), N(y, βI)) = ∫_{O_d} N(v^T(x − y), (α + β + 2γ) I_d)(0) dθ_d(v) = ∫_{O_d} Π_{i=1..d} N(v_i^T(x − y), α + β + 2γ)(0) dθ_d(v).

Now, if we define

Φ_D^d(s, h) = ∫_{O_d} N(v^T s e_1, h I_d)(0) dθ_d(v),

where e_1 ∈ R^D is the first unit base vector, we obtain that the kernel product reduces to the computation of the scalar function Φ_D^d:

k_d(N(x, αI), N(y, βI)) = Φ_D^d(‖x − y‖, α + β + 2γ).

The crucial observation needed to proceed further is that the measure space (O_d(R^D), θ_d) can be approximated by (R^D, N(0, I/D))^d. This follows from the fact that if v_1, . . . , v_d are drawn from the density N(0, I/D), then for sufficiently large D we have ‖v_i‖ ≈ 1 and ⟨v_i, v_j⟩ ≈ 0 for i ≠ j.

Theorem 9 We have Φ_D^d(s, h) ≈ (2π)^{−d/2} (h + s²/D)^{−d/2}.

Proof By the observation stated before the theorem, we have

Φ_D^d(s, h) = ∫_{O_d} Π_{i=1..d} N(v_i^T s e_1, h)(0) dθ_d(v) ≈ ∫_{(R^D)^d} Π_{i=1..d} N(v_i^T s e_1, h)(0) N(0, I/D)(v_i) dv_1 . . . dv_d = [ ∫_{R^D} N(v^T s e_1, h)(0) N(0, I/D)(v) dv ]^d.

It thus suffices to compute each factor of the above formula. Denoting by N_k the k-dimensional normal density, and observing that only the first coordinate t = ⟨v, e_1⟩ of v enters the integrand, with t distributed as N_1(0, 1/D), we get

∫_{R^D} N_1(s⟨v, e_1⟩, h)(0) N_D(0, I/D)(v) dv = ∫_R N_1(0, h)(st) N_1(0, 1/D)(t) dt = (2π)^{−1/2} (h + s²/D)^{−1/2},

which yields the assertion of the theorem.

As a direct consequence, we obtain the following asymptotic formula (for large dimension D) for the generalised Cramer-Wold kernel of two spherical Gaussians:

k_d(N(x, αI), N(y, βI)) ≈ (2π)^{−d/2} (α + β + 2γ + ‖x − y‖²/D)^{−d/2}.

Observe that for d = 2 we obtain the standard inverse multiquadratic kernel.

7. Experiments

In this section we empirically validate the proposed CWAE⁷ model on standard benchmarks for generative models: CELEB A, CIFAR-10, MNIST, and Fashion MNIST. We compare the proposed CWAE model with WAE-MMD (Tolstikhin et al., 2017) and SWAE (Kolouri et al., 2018).
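The asymptotic formula of Theorem 9 can be checked numerically. The sketch below (our own, for verification only) Monte Carlo estimates Φ_D^d(s, h) under the Gaussian approximation of θ_d — only the first coordinate of each slice vector enters, and it is distributed as N(0, 1/D) — and compares it with the closed form.

```python
import numpy as np

def phi_mc(s, h, d, D, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Phi_D^d(s, h) under the Gaussian
    approximation of theta_d: t_i = <v_i, e_1> ~ N(0, 1/D) i.i.d."""
    rng = np.random.default_rng(seed)
    t = rng.normal(scale=1.0 / np.sqrt(D), size=(n_samples, d))
    # one-dimensional normal density N(s*t, h) evaluated at 0
    dens = np.exp(-(s * t) ** 2 / (2.0 * h)) / np.sqrt(2.0 * np.pi * h)
    return np.prod(dens, axis=1).mean()

def phi_closed(s, h, d, D):
    """Asymptotic formula of Theorem 9."""
    return (2.0 * np.pi) ** (-d / 2.0) * (h + s ** 2 / D) ** (-d / 2.0)

print(phi_mc(3.0, 1.0, d=2, D=100), phi_closed(3.0, 1.0, 2, 100))
```

The two values agree closely: under the Gaussian substitution the one-dimensional integral in the proof is in fact exact, so the only discrepancy is Monte Carlo noise.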
As we shall see, our results match, or even exceed, those of WAE-MMD and SWAE, while using a closed-form cost function (see the previous sections for a more detailed discussion). The rest of this section is structured as follows. In Subsection 7.2 we report the results of the standard qualitative tests, as well as visual investigations of the latent space. In Subsection 7.3 we turn our attention to quantitative tests using the Fréchet Inception Distance (Heusel et al., 2017) and other metrics. In Section 8 we provide a proof of concept for an application of the Cramer-Wold distance in the framework introduced by Deshpande et al. (2018).

7. The code is available at https://github.com/gmum/cwae.

7.1. Experimentation setup

In the experiments we use two basic architecture types. Experiments on MNIST and Fashion MNIST use a feed-forward network for both the encoder and the decoder, and an 8-neuron latent layer, all using ReLU activations. For the CIFAR-10 and CELEB A data-sets we use convolution-deconvolution architectures. Please refer to Section 7.5 for full details.

Table 1: Comparison of different architectures on the MNIST, Fashion-MNIST, CIFAR-10 and CELEB A data-sets. All models' outputs except AE are similarly close to the normal distribution. CWAE achieves the best FID score (lower is better). All hyper-parameters were found using a grid search (see Section 7.5).

Data-set        Method    Learning rate   λ       Skewness   Kurtosis (norm.)   Rec. error   FID score
MNIST           AE        0.001           -       1197.24    878.07             11.19        52.74
                VAE       0.001           -       0.43       0.77               18.79        40.47
                SWAE      0.001           1.0     6.01       10.72              10.99        29.76
                WAE-MMD   0.0005          1.0     11.70      8.34               11.14        27.65
                CWAE      0.001           1.0     12.21      35.88              11.25        23.63
FASHION MNIST   AE        0.001           -       140.21     85.58              9.87         81.98
                VAE       0.001           -       0.20       4.86               15.41        64.98
                SWAE      0.001           100.0   1.15       18.14              10.56        54.48
                WAE-MMD   0.001           100.0   2.82       4.33               10.01        58.79
                CWAE      0.001           10.0    5.11       65.96              10.36        49.95
CIFAR-10        AE        0.001           -       2.5 × 10⁵  1.7 × 10⁴          24.67        269.09
                VAE       0.001           -       35.81      3.67               63.77        172.39
                SWAE      0.001           1.0     517.32     121.17             25.42        141.91
                WAE-MMD   0.001           1.0     1105.73    2097.14            25.04        129.37
                CWAE      0.001           1.0     176.60     1796.66            25.93        120.02
CELEB A         AE        0.001           -       4.6 × 10⁹  2.6 × 10⁸          86.41        353.50
                VAE       0.001           -       43.72      171.66             110.87       60.85
                SWAE      0.0001          100.0   141.17     222.02             85.97        53.85
                WAE-MMD   0.0005          100.0   162.67     604.09             86.38        51.51
                CWAE      0.0005          5.0     130.08     542.42             86.89        49.69

Figure 8: CELEB A-trained CWAE, WAE, and SWAE models with FID score, kurtosis and skewness, as well as CW-, WAE-, and SW-distances, on the original WAE-MMD architecture from Tolstikhin et al. (2017). All values are averages over 5 models trained for each architecture. Confidence intervals represent the standard deviation. Optimal kurtosis is marked with a dashed line.

7.2. Qualitative tests

The quality of a generative model is typically evaluated by examining generated samples or by interpolating between samples in the latent space. We present such a comparison between CWAE and WAE-MMD in Figure 4. We follow the same procedure as in Tolstikhin et al. (2017). In particular, we use the same base neural architecture for both CWAE and WAE-MMD.
For each model we consider (i) interpolation between two random examples from the test set (leftmost in Figure 4), (ii) reconstruction of a random example from the test set (middle column in Figure 4), and finally (iii) a sample reconstructed from a random point sampled from the prior distribution (right column in Figure 4). The experiment shows that there are no perceptual differences between the CWAE and WAE-MMD generative distributions.

In the next experiment we qualitatively assess the normality of the latent space. This allows us to verify that CWAE does not compromise on the normality of its latent distribution, which is part of the cost function for all the models except AE. We compare CWAE⁸ with VAE, WAE and SWAE on the MNIST data using a two-dimensional latent space and a two-dimensional Gaussian prior distribution. Results are reported in Figure 9. As is readily visible, the latent distribution of CWAE is as close, or perhaps even closer, to the normal distribution than those of the other models. To summarize, both in terms of perceptual quality and of satisfying the normality objective, CWAE matches WAE-MMD. The next subsection provides more quantitative studies.

Figure 9: The latent distribution of CWAE is close to the normal distribution. Each subfigure presents points sampled from the two-dimensional latent spaces of VAE, WAE, SWAE, and CWAE (left to right), all trained on the MNIST data-set.

7.3. Quantitative tests

In order to quantitatively compare CWAE with other models, in the first experiment we follow the experimental setting and use the same architecture as in Tolstikhin et al. (2017). In particular, we employ the Fréchet Inception Distance (FID) (Heusel et al., 2017). In agreement with the qualitative studies, we observe the FID of CWAE to be similar to or slightly better than that of WAE-MMD. We highlight that CWAE achieves an FID score of 49.69 on CELEB A.
8. Since Eq. (4) is valid for dimensions D ≥ 20, to implement CWAE in a 2-dimensional latent space we apply the identity ₁F₁(1/2, 1, s) = e^{s/2} I₀(s/2) jointly with the approximate formula (Abramowitz and Stegun, 1964, p. 378) for the modified Bessel function of the first kind I₀.

Table 2: Comparison between the classical cost functions of WAE and SWAE and versions with a logarithm (method names with a -LOG suffix).

Data-set        Method     Learning rate   λ       Skewness    Kurtosis (norm.)   Rec. error   FID score
MNIST           SWAE       0.001           1.0     6.01        10.72              10.99        29.76
                SWAE-LOG   0.0005          1.0     2.36        12.20              11.42        24.89
                WAE        0.0005          1.0     11.70       8.34               11.14        27.65
                WAE-LOG    0.001           1.0     18.22       61.04              13.17        36.08
                CWAE       0.001           1.0     12.21       35.88              11.25        23.63
FASHION MNIST   SWAE       0.001           100.0   1.15        18.14              10.56        54.48
                SWAE-LOG   0.001           10.0    11.01       0.88               14.11        55.17
                WAE        0.001           100.0   2.82        4.33               10.01        58.79
                WAE-LOG    0.005           1.0     53.37       66.01              16.14        99.51
                CWAE       0.001           10.0    5.11        65.96              10.36        49.95
CIFAR-10        SWAE       0.001           1.0     517.32      121.17             25.42        141.91
                SWAE-LOG   0.0005          1.0     157.14      234.52             26.25        119.89
                WAE        0.001           1.0     1105.73     2097.14            25.04        129.37
                WAE-LOG    0.001           1.0     1.1 × 10⁸   4.9 × 10⁵          28.25        136.25
                CWAE       0.001           1.0     176.60      1796.66            25.93        120.02
CELEB A         SWAE       0.0001          100.0   141.17      222.02             85.97        53.85
                SWAE-LOG   0.0005          10.0    132.54      465.39             85.82        53.46
                WAE        0.0005          100.0   162.67      604.09             86.38        51.51
                WAE-LOG    0.0001          1.0     514.43      2154.39            82.53        58.10
                CWAE       0.0005          5.0     130.08      542.42             86.89        49.69

CWAE's FID score of 49.69 on CELEB A compares to 51.51 and 53.85 achieved by WAE-MMD and SWAE, respectively; see Figure 8 and Table 1. Next, motivated by Remark 8, we propose a novel method for quantitative assessment of the models, based on their comparison to the standard normal distribution in the latent space. To achieve this, we decided to use one of the most popular families of statistical normality tests, i.e. the Mardia tests (Henze, 2002).
Mardia's normality tests are based on verifying whether the skewness b_{1,D}(·) and kurtosis b_{2,D}(·) of a sample X = (x_i)_{i=1..n} ⊂ R^D,

b_{1,D}(X) = (1/n²) Σ_{j,k} (x_j^T x_k)³ and b_{2,D}(X) = (1/n) Σ_j ‖x_j‖⁴,

are close to those of the standard normal density. The expected Mardia skewness and kurtosis for the standard multivariate normal distribution are 0 and D(D + 2), respectively. To enable easier comparison, in the experiments we also consider the value of the normalised Mardia kurtosis, b_{2,D}(X) − D(D + 2), which equals zero for the standard normal density. Results are presented in Figure 8 and Table 1. In Figure 8 we report, for the CELEB A data-set, the values of the FID score and of Mardia's skewness and kurtosis during the learning process of WAE, SWAE and CWAE (measured on the validation data-set).

The WAE, SWAE and CWAE models obtain the best reconstruction error, comparable to AE. The VAE model exhibits a slightly worse reconstruction error, but its kurtosis and skewness values indicate that its output is closer to the normal distribution. As expected, the output of AE is far from the normal distribution; its kurtosis and skewness grow during learning. This arguably less standard evaluation, which we hope will find its way to being adopted by the community, serves as yet another piece of evidence that CWAE has strong generative capabilities, which at least match the performance of WAE-MMD. Moreover, we observe that the VAE model's output distribution is closest to the normal distribution, at the expense of the reconstruction error, which is reflected in the somewhat blurred reconstructions typically associated with that approach.

At the end of this subsection we compare our method with the classical approaches, WAE-MMD and SWAE, with a modified cost function. More precisely, similarly to CWAE, we use a logarithm in the cost functions of WAE and SWAE, see Table 2. As mentioned, adding a logarithm to WAE-MMD does not work, since the penalty used in WAE-MMD is not precisely the population MMD, but a sample-based U-statistic.
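The Mardia statistics above are straightforward to compute; a minimal NumPy sketch (the function name `mardia_stats` is ours):

```python
import numpy as np

def mardia_stats(X):
    """Mardia's multivariate skewness b_{1,D} and kurtosis b_{2,D} of a
    sample X in R^D, plus the normalised kurtosis b_{2,D} - D*(D + 2),
    which is zero in expectation for the standard normal distribution."""
    n, D = X.shape
    G = X @ X.T                                  # Gram matrix of inner products
    b1 = np.sum(G ** 3) / n ** 2                 # (1/n^2) sum_{j,k} (x_j^T x_k)^3
    b2 = np.mean(np.sum(X ** 2, axis=1) ** 2)    # (1/n) sum_j ||x_j||^4
    return b1, b2, b2 - D * (D + 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 8))      # sample from N(0, I) in D = 8
b1, b2, b2_norm = mardia_stats(Z)
print(b1, b2_norm)                  # both close to 0 for a normal sample
```

For D = 8 the expected raw kurtosis is D(D + 2) = 80, so for a normal sample the normalised value fluctuates around zero, which is the behaviour tracked in Figure 8.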
As a consequence, the cost function can be negative from time to time, and therefore the log version is not suitable for WAE-MMD. On the other hand, the logarithm improves the learning process in the case of CWAE and SWAE.

7.4. Comparison of the learning speed

We expect our closed-form formula to lead to a speedup in training time. Indeed, we found that for batch sizes up to 1024 CWAE is faster (in terms of time needed per batch) than the other models. More precisely, CWAE is approximately 2× faster for batch sizes up to 256. Figure 10 gives a comparison of the mean learning time for the most frequently used batch sizes. The time spent on processing a batch is smaller for CWAE over the practical range of batch sizes [32, 512]. For batch sizes larger than 1024, CWAE is slower due to its quadratic complexity with respect to the batch size. However, we note that batch sizes larger than 512 are relatively rarely used in practice for training auto-encoders.

7.5. Architecture

For the WAE, SWAE and CWAE models and each data-set we performed a grid search over the parameter λ ∈ {1, 5, 10, 100} and learning rate values from {0.01, 0.005, 0.001, 0.0005, 0.0001}. For the VAE and AE models, we examined only the different learning rates. All models were trained on 128-element mini-batches. For every model, we report results for the configuration that achieved the lowest FID score.

MNIST/Fashion-MNIST (28 × 28 images): an encoder-decoder feed-forward architecture:
- encoder: three feed-forward ReLU layers, 200 neurons each,
- latent: 8-dimensional,
- decoder: three feed-forward ReLU layers, 200 neurons each.

CIFAR-10 data-set (32 × 32 images with three color channels): a convolution-deconvolution network:

Figure 10: Comparison of the mean batch learning time in seconds (times are in log-scale) for the different algorithms, all for the same architecture introduced in Tolstikhin et al.
(2017), and all requiring a similar number of epochs to train the full model. The times may differ for computer architectures with more/less memory on the GPU card (the minimal value in our experiments was about 6 × 10⁻³ s).

- four convolution layers with 2 × 2 filters, the second one with 2 × 2 strides, the others non-strided (3, 32, 32, and 32 channels), with ReLU activation,
- a dense layer of 128 ReLU neurons,
- latent: 64-dimensional,
- two dense ReLU layers with 128 and 8192 neurons,
- two transposed-convolution layers with 2 × 2 filters (32 and 32 channels) and ReLU activation,
- a transposed-convolution layer with a 3 × 3 filter and 2 × 2 strides (32 channels) and ReLU activation,
- a transposed-convolution layer with a 2 × 2 filter (3 channels) and sigmoid activation.

CELEB A (with images centered and cropped to 64 × 64 with 3 color channels): in order to have a direct comparison to the WAE-MMD model on CELEB A, an architecture identical to the one utilised for the WAE-MMD model in Tolstikhin et al. (2017) was used (the WAE-GAN architecture is, naturally, different):
- four convolution layers with 5 × 5 filters, each followed by batch normalization (consecutively 128, 256, 512, and 1024 channels) and ReLU activation,
- latent: 64-dimensional, dense 1024-neuron layer,
- three transposed-convolution layers with 5 × 5 filters, each followed by batch normalization with ReLU activation (consecutively 512, 256, and 128 channels),
- a transposed-convolution layer with a 5 × 5 filter and 3 channels, with clipped output values.

The last layer returns the reconstructed image. The results for all the above architectures are given in Table 1. All networks were trained with the Adam optimiser (Kingma and Ba, 2014) with hyper-parameters: learning rate = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. MNIST and CIFAR-10 models were trained for 500 epochs; similarly to Tolstikhin et al. (2017), CELEB A models were trained for 55 epochs with the same optimiser parameters.
8. Cramer-Wold Generator (CWG)

To overcome saddle-point problems, Deshpande et al. (2018) and Li et al. (2017) introduced an alternative formulation based on a single objective, which is a measure of discrepancy between distributions. The model introduced in Deshpande et al. (2018) is the Sliced Wasserstein Generator (SWG). It is not strictly a GAN model, since no adversarial training is used; hence the name generator. It uses the Sliced-Wasserstein distance, estimated with a finite sum as in Kolouri et al. (2018), to express the dissimilarity between P_X and P_Z. We want to show here that an analogous model based on the Cramer-Wold rather than the Wasserstein distance is also possible. Accordingly, we shall call it the Cramer-Wold Generator (CWG). The requirements of a generator lead to the following cost function⁹:

CWG cost = d_cw²(X, D(Z)),

where X = (x_i)_{i=1,...,n} ⊂ R^N is a data sample and D(Z) = (D(z_i))_{i=1,...,n} ⊂ R^N is a sample from N(0, I) mapped into the data space by the decoder. We implemented the CWG model and trained it on the MNIST and Fashion MNIST data-sets using the original SWG architecture from Deshpande et al. (2018). Further investigation into the CWG model is justified by the results of our qualitative and quantitative tests comparing the CWG and SWG models (see Figures 11 and 12).

9. Conclusions

In this paper we present a new kernel based on the Cramer-Wold distance, which gives way to a new auto-encoder based generative model, CWAE. It matches, and in some cases improves on, the results of WAE-MMD, while using a cost function given by a simple closed analytic formula.

9. It should be noted that in this case, to calculate d_cw² (by applying Formula (3)), we should use γ = σ̂(4/(3n))^{2/5}, where σ̂ denotes the standard deviation of the joined X and D(Z) samples. This follows from Silverman's rule. In the CWAE model it was reasonable to take σ̂ = 1, since we used d_cw² on the latent, where the encoded sample was trained to be standard Gaussian, but now this assumption cannot be maintained.
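The bandwidth choice described in footnote 9 can be sketched as follows (a minimal illustration following the footnote's formula as stated; the function name `cwg_gamma` is ours):

```python
import numpy as np

def cwg_gamma(X, DZ):
    """Regularising gamma for the CWG cost, per footnote 9: Silverman's
    rule gamma = sigma_hat * (4 / (3 n))^(2/5), where sigma_hat is the
    standard deviation of the joined X and D(Z) samples.  Unlike in
    CWAE, sigma_hat = 1 cannot be assumed in the data space."""
    joined = np.concatenate([X, DZ], axis=0)
    n = X.shape[0]
    sigma_hat = joined.std()
    return sigma_hat * (4.0 / (3.0 * n)) ** 0.4

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # hypothetical data sample
DZ = rng.normal(size=(500, 10))     # hypothetical decoded prior sample
print(cwg_gamma(X, DZ))
```

The resulting γ is then plugged into the Cramer-Wold distance d_cw²(X, D(Z)) used as the CWG objective.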
Figure 11: Comparison of the SWG and CWG models by FID score on the MNIST (left-hand side) and Fashion MNIST (right-hand side) data-sets, using the original SWG architecture from Deshpande et al. (2018).

Figure 12: Randomly sampled images from the SWG and CWG models trained on MNIST and Fashion MNIST, using the original SWG architecture from Deshpande et al. (2018).

We hope this result will encourage future work on developing simpler-to-optimise analogues of strong neural models. The use of the introduced Cramer-Wold kernel and of the corresponding metric between samples and distributions is crucial to the construction of CWAE. This metric can be effectively computed for Gaussian mixtures. The use of the Cramer-Wold kernel allows us to construct methods that are less sensitive to changes in training parameters and that minimise the cost function faster. All this was shown in the CWAE generative auto-encoder model, which proved to be fast, more stable, and less sensitive to changes of training parameters than other methods such as WAE or SWAE. Finally, the proposed Cramer-Wold Generator model shows that future work could explore the use of this metric in other settings, particularly in adversarial models.

10. Acknowledgements

The work of P. Spurek was supported by the National Centre of Science (Poland) Grant No. 2019/33/B/ST6/00894. The work of J. Tabor was supported by the National Centre of Science (Poland) Grant No. 2017/25/B/ST6/01271. I. Podolak carried out this work within the research project "Bio-inspired artificial neural network" (grant no. POIR.04.04.00-00-14DE/18-00) within the Team-Net program of the Foundation for Polish Science, co-financed by the European Union under the European Regional Development Fund.

References

M.
Abramowitz and I.A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55 of National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, Washington, D.C., 1964.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 214-223, International Convention Centre, Sydney, Australia, 2017. PMLR.

S. Aubert and C.S. Lam. Invariant integration over the unitary group. J. of Mathematical Physics, 44(12):6112-6131, 2003.

S. Axler, P. Bourdon, and R. Wade. Harmonic Function Theory, volume 137 of Graduate Texts in Mathematics. Springer, New York, 1992.

R.W. Barnard, G. Dahlquist, K. Pearce, L. Reichel, and K.C. Richards. Gram polynomials and the Kummer function. J. of Approximation Theory, 94(1):128-143, 1998.

A.W. Bowman and P.J. Foster. Adaptive smoothing and density-based tests of multivariate normality. J. of the American Statistical Association, 88(422):529-537, 1993.

D. Braun. Invariant integration over the orthogonal group. J. of Physics A: Mathematical and General, 39(47):14581, 2006.

H. Cramér and H. Wold. Some theorems on distribution functions. J. London Mathematical Society, 11(4):290-294, 1936.

S.R. Deans. The Radon Transform and Some of Its Applications. Wiley, New York, 1983.

I. Deshpande, Z. Zhang, and A.G. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483-3491, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

I.S. Gradshteyn and I.M. Ryzhik. Table of Integrals, Series, and Products. Elsevier/Academic Press, Amsterdam, 2015.

N. Henze.
Invariant tests for multivariate normality: a critical review. Statistical Papers, 43(4):467-506, 2002.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv:1706.08500, 2017.

D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

D.P. Kingma, S. Mohamed, D.J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581-3589, 2014.

S. Kolouri, Ch.E. Martin, and G.K. Rohde. Sliced-Wasserstein autoencoder: an embarrassingly simple generative model. arXiv:1804.01947, 2018.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203-2213, 2017.

P. Mattila. Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability, volume 44. Cambridge University Press, 1999.

K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.

B.W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1986.

M. Śmieja, M. Wołczyk, J. Tabor, and B.C. Geiger. SeGMA: Semi-supervised Gaussian mixture auto-encoder. arXiv:1906.09333, 2019.

I. Tolstikhin. WAE. https://github.com/tolstikhin/wae, 2018.

I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv:1711.01558, 2017.

F. Tricomi and A. Erdélyi. The asymptotic expansion of a ratio of gamma functions. Pacific J. of Mathematics, 1:133-142, 1951.

C. Villani. Optimal Transport: Old and New. Springer, Berlin, 2008.