Published as a conference paper at ICLR 2020

MIXED-CURVATURE VARIATIONAL AUTOENCODERS

Ondrej Skopek,¹ Octavian-Eugen Ganea¹,² & Gary Bécigneul¹,²
¹ Department of Computer Science, ETH Zürich
² Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
oskopek@oskopek.com, oct@mit.edu, gary.becigneul@inf.ethz.ch

ABSTRACT

Euclidean geometry has historically been the typical workhorse for machine learning applications due to its power and simplicity. However, it has recently been shown that geometric spaces with constant non-zero curvature improve representations and performance on a variety of data types and downstream tasks. Consequently, generative models like Variational Autoencoders (VAEs) have been successfully generalized to elliptical and hyperbolic latent spaces. While these approaches work well on data with particular kinds of biases, e.g. tree-like data for a hyperbolic VAE, there exists no generic approach unifying and leveraging all three models. We develop a Mixed-curvature Variational Autoencoder, an efficient way to train a VAE whose latent space is a product of constant curvature Riemannian manifolds, where the per-component curvature is fixed or learnable. This generalizes the Euclidean VAE to curved latent spaces and recovers it when the curvatures of all latent space components go to 0.

1 INTRODUCTION

Generative models, a growing area of unsupervised learning, aim to model the data distribution p(x) over data points x from a space X, which is usually a high-dimensional Euclidean space R^n. This has desirable benefits, such as a naturally defined inner product, vector addition, and a closed-form distance function. Yet, many types of data have a strongly non-Euclidean latent structure (Bronstein et al., 2017), e.g. the set of human-interpretable images. They are usually thought to live on a natural image manifold (Zhu et al., 2016), a continuous lower-dimensional subset of the space in which they are represented. By moving along the manifold, one can continuously change the content and appearance of interpretable images.

As noted in Nickel & Kiela (2017), changing the geometry of the underlying latent space enables better representations of specific data types compared to Euclidean spaces of any dimension, e.g. tree structures and scale-free networks. Motivated by these observations, a range of recent methods learn representations in different spaces of constant curvature: spherical or elliptical (Batmanghelich et al., 2016), hyperbolic (Nickel & Kiela, 2017; Sala et al., 2018; Tifrea et al., 2019), and even products of these spaces (Gu et al., 2019; Bachmann et al., 2020). Using a combination of different constant curvature spaces, Gu et al. (2019) aim to match the underlying geometry of the data better. However, an open question remains: how to choose the dimensionality and curvature of each of the partial spaces?

A popular approach to generative modeling is the Variational Autoencoder (Kingma & Welling, 2014). VAEs provide a way to sidestep the intractability of marginalizing a joint probability model of the input and latent space p(x, z), while allowing for a prior p(z) on the latent space. Recently, variants of the VAE have been introduced for spherical (Davidson et al., 2018; Xu & Durrett, 2018) and hyperbolic (Mathieu et al., 2019; Nagano et al., 2019) latent spaces.
Our approach, the Mixed-curvature Variational Autoencoder, is a generalization of VAEs to products of constant curvature spaces.¹ It has the advantage of a better reduction in dimensionality, while maintaining efficient optimization. The resulting latent space is a non-constantly curved manifold that is more flexible than a single constant curvature manifold.

¹ Code is available on GitHub at https://github.com/oskopek/mvae.

Our contributions are the following: (i) we develop a principled framework for manipulating representations and modeling probability distributions in products of constant curvature spaces that smoothly transitions across curvatures of different signs, (ii) we generalize Variational Autoencoders to learn latent representations on products of constant curvature spaces with generalized Gaussian-like priors, and (iii) empirically, our models outperform current benchmarks on a synthetic tree dataset (Mathieu et al., 2019) and on image reconstruction on the MNIST (LeCun, 1998), Omniglot (Lake et al., 2015), and CIFAR (Krizhevsky, 2009) datasets for some latent space dimensions.

2 GEOMETRY AND PROBABILITY IN RIEMANNIAN MANIFOLDS

To define constant curvature spaces, we first need the notion of the sectional curvature K(τ_x) of a two-dimensional plane τ_x spanned by two linearly independent vectors in the tangent space at a point x ∈ M (Berger, 2012). Since we deal with constant curvature spaces, where all sectional curvatures are equal, we denote a manifold's curvature by K. Instead of the curvature K, we sometimes use the generalized notion of a radius: R = 1/√|K|.

There are three different types of manifolds M we can define with respect to the sign of the curvature: a positively curved space, a flat space, and a negatively curved space. Common realizations of these manifolds are the hypersphere S_K, the Euclidean space E, and the hyperboloid H_K:

    S^n_K = {x ∈ R^(n+1) : ⟨x, x⟩_2 = 1/K},  for K > 0,
    E^n   = R^n,                              for K = 0,
    H^n_K = {x ∈ R^(n+1) : ⟨x, x⟩_L = 1/K},  for K < 0,

where ⟨·,·⟩_2 is the standard Euclidean inner product, and ⟨·,·⟩_L is the Lorentz inner product, ⟨x, y⟩_L = -x_1 y_1 + Σ_{i=2}^{n+1} x_i y_i for x, y ∈ R^(n+1).

We will need to define the exponential map, logarithmic map, and parallel transport in all spaces we consider. The exponential map in Euclidean space is defined as exp_x(v) = x + v, for all x ∈ E^n and v ∈ T_x E^n. Its inverse, the logarithmic map, is log_x(y) = y - x, for all x, y ∈ E^n. Parallel transport in Euclidean space is simply the identity, PT_{x→y}(v) = v, for all x, y ∈ E^n and v ∈ T_x E^n. An overview of these operations in the hyperboloid H^n_K and the hypersphere S^n_K can be found in Table 1. For more details, refer to Petersen et al. (2006), Cannon et al. (1997), or Appendix A.

2.1 STEREOGRAPHICALLY PROJECTED SPACES

The above spaces are enough to cover any possible value of the curvature, and they define all the necessary operations we will need to train VAEs in them. However, both the hypersphere and the hyperboloid have an unsuitable property, namely the non-convergence of the norm of points as the curvature goes to 0. Both spaces grow as K → 0 and become locally "flatter", but to do that, their points have to move away from the origin of the coordinate space 0 to satisfy the manifold's definition. A good example of a point that diverges is the origin of the hyperboloid (or a pole of the hypersphere) µ_0 = (1/√|K|, 0, ..., 0)^T. In general, we can see that ||µ_0||_2^2 = 1/|K| → ∞ as K → 0.
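To make the formulas above concrete, the following minimal NumPy sketch evaluates the Lorentz inner product and the induced geodesic distances on the hypersphere and hyperboloid (cf. Table 1). The function names and the clipping for numerical safety are ours, not taken from the paper's code.

```python
import numpy as np

def lorentz_inner(x, y):
    # <x, y>_L = -x_1 y_1 + sum_{i >= 2} x_i y_i
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def distance(x, y, K):
    """Geodesic distance on S^n_K (K > 0) or H^n_K (K < 0)."""
    R = 1.0 / np.sqrt(abs(K))
    if K > 0:   # hypersphere: cos(d / R) = K <x, y>_2
        return R * np.arccos(np.clip(K * np.dot(x, y), -1.0, 1.0))
    else:       # hyperboloid: cosh(d / R) = K <x, y>_L
        return R * np.arccosh(np.clip(K * lorentz_inner(x, y), 1.0, None))

# Example: two points on the hyperboloid H^2_{-1} (K = -1, so <x, x>_L = -1).
K = -1.0
x = np.array([1.0, 0.0, 0.0])        # the origin mu_0
v = np.array([0.0, 0.3, 0.4])        # tangent vector at mu_0 (first coordinate 0)
n = np.linalg.norm(v)
y = np.cosh(n) * x + np.sinh(n) * v / n   # exponential map at mu_0 for K = -1
print(distance(x, y, K))             # ~0.5 = ||v||, as expected for the exp map
```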
Additionally, the distances and the metric tensors of these spaces do not converge to their Euclidean variants as K → 0, hence the spaces themselves do not converge to R^d. This makes both of these spaces unsuitable for trying to learn sign-agnostic curvatures. Luckily, there exist well-defined non-Euclidean spaces that inherit most properties from the hyperboloid and the hypersphere, yet do not have these drawbacks: namely, the Poincaré ball and the projected sphere, respectively. We obtain them by applying stereographic conformal projections, meaning that angles are preserved by this transformation. Since the distance functions on the hyperboloid and the hypersphere depend only on the radius and on angles between points, the projected spaces equipped with the induced distances are isometric to the original models.

We first need to define the projection function ρ_K. For a point (ξ; x^T)^T ∈ R^(n+1) and curvature K ∈ R, where ξ ∈ R and x, y ∈ R^n,

    ρ_K((ξ; x^T)^T) = x / (1 + √|K| ξ),
    ρ_K^{-1}(y) = ( (1/√|K|) (1 - K||y||_2^2)/(1 + K||y||_2^2) ; 2y^T/(1 + K||y||_2^2) )^T.

The formulas correspond to the classical stereographic projections defined for these models (Lee, 1997). Note that both of these projections map the point µ_0 = (1/√|K|, 0, ..., 0)^T in the original space to µ_0 = 0 in the projected space, and back.

Since the stereographic projection is conformal, the metric tensors of both projected spaces are conformal. In this case, the metric tensors of both spaces are the same, except for the sign of K: g^{D_K}_x = g^{P_K}_x = (λ^K_x)^2 g^E for all x in the respective manifold (Ganea et al., 2018a), and g^E_y = I for all y ∈ E. The conformal factor λ^K_x is defined as λ^K_x = 2/(1 + K||x||_2^2). Among other things, this form of the metric tensor has the consequence that we unfortunately cannot define a single unified inner product in all tangent spaces at all points. The inner product at x ∈ M has the form ⟨u, v⟩_x = (λ^K_x)^2 ⟨u, v⟩_2 for all u, v ∈ T_x M.

We can now define the two models corresponding to K > 0 and K < 0. The curvature of the projected manifold is the same as that of the original manifold. An n-dimensional projected hypersphere (K > 0) is defined as the set D^n_K = ρ_K(S^n_K \ {-µ_0}) = R^n, where -µ_0 = (-1/√|K|, 0, ..., 0)^T ∈ S^n_K, along with the induced distance function. The n-dimensional Poincaré ball P^n_K (also called the Poincaré disk when n = 2) for a given curvature K < 0 is defined as P^n_K = ρ_K(H^n_K) = {x ∈ R^n : ⟨x, x⟩_2 < -1/K}, with the induced distance function.

2.2 GYROVECTOR SPACES

An important analogy to vector spaces (vector addition and scalar multiplication) in non-Euclidean geometry is the notion of gyrovector spaces (Ungar, 2008). Both of the above spaces D_K and P_K (jointly denoted as M_K) share the same structure, hence they also share the following definition of addition. The Möbius addition ⊕_K of x, y ∈ M_K (for both signs of K) is defined as

    x ⊕_K y = [ (1 - 2K⟨x, y⟩_2 - K||y||_2^2) x + (1 + K||x||_2^2) y ] / [ 1 - 2K⟨x, y⟩_2 + K^2 ||x||_2^2 ||y||_2^2 ].

We can, therefore, define gyrospace distances for both of the above spaces, which are alternative curvature-aware distance functions:

    d_Dgyr(x, y) = (2/√K) tan^{-1}( √K ||-x ⊕_K y||_2 ),
    d_Pgyr(x, y) = (2/√(-K)) tanh^{-1}( √(-K) ||-x ⊕_K y||_2 ).

These two distances are equivalent to their non-gyrospace variants, d_M(x, y) = d_Mgyr(x, y), as is shown in Theorem A.4 and its hypersphere equivalent. Additionally, Theorem A.5 shows that d_Mgyr(x, y) → 2||x - y||_2 as K → 0, which means that the distance functions converge to the Euclidean distance function (up to a constant factor of 2) as K → 0. We can notice that most statements and operations in constant curvature spaces have a dual statement or operation in the corresponding space with the opposite curvature sign.
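As an illustration of the gyrovector operations, the following NumPy sketch implements Möbius addition and the curvature-aware gyrospace distance, and numerically checks that the distance approaches 2·||x - y||_2 as K → 0 (Theorem A.5). It is a sketch under our own naming, not the reference implementation.

```python
import numpy as np

def mobius_add(x, y, K):
    """Moebius addition x (+)_K y on the Poincare ball (K < 0) or projected sphere (K > 0)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * K * xy - K * y2) * x + (1 + K * x2) * y
    den = 1 - 2 * K * xy + K**2 * x2 * y2
    return num / den

def gyro_distance(x, y, K):
    """Gyrospace distance; tan_K^{-1} is atanh for K < 0 and atan for K > 0."""
    diff = np.linalg.norm(mobius_add(-x, y, K))
    s = np.sqrt(abs(K))
    return 2 / s * (np.arctanh(s * diff) if K < 0 else np.arctan(s * diff))

x = np.array([0.1, -0.2])
y = np.array([0.3, 0.25])
for K in (-1.0, -1e-3, -1e-6):
    print(K, gyro_distance(x, y, K))      # approaches 2 * ||x - y||_2 (Theorem A.5)
print(2 * np.linalg.norm(x - y))
```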
The notion of duality comes up very often, and in our case it is based on Euler's formula e^{ix} = cos(x) + i sin(x) and the notion of the principal square root √K of the curvature. This provides a connection between trigonometric, hyperbolic trigonometric, and exponential functions. Thus, we can convert all the hyperbolic formulas above to their spherical equivalents and vice-versa. Since Ganea et al. (2018a) and Tifrea et al. (2019) used the same gyrovector spaces to define an exponential map, its inverse logarithmic map, and parallel transport in the Poincaré ball, we can reuse them for the projected hypersphere by applying the transformations above, as they share the same formalism. For parallel transport, we additionally need the notion of gyration (Ungar, 2008):

    gyr[x, y]v = ⊖_K(x ⊕_K y) ⊕_K (x ⊕_K (y ⊕_K v)),

where ⊖_K x = -x denotes the Möbius inverse. Parallel transport in both the projected hypersphere and the Poincaré ball is then PT^K_{x→y}(v) = (λ^K_x / λ^K_y) gyr[y, -x]v, for all x, y ∈ M^n_K and v ∈ T_x M^n_K.

Using a curvature-aware definition of scalar products,

    ⟨x, y⟩_K = ⟨x, y⟩_2 if K ≥ 0,  and  ⟨x, y⟩_L if K < 0,

and of trigonometric functions,

    sin_K = sin if K > 0, sinh if K < 0;   cos_K = cos if K > 0, cosh if K < 0;   tan_K = tan if K > 0, tanh if K < 0,

we can summarize all the necessary operations in all manifolds compactly in Table 1 and Table 2.

Table 1: Summary of operations in S_K and H_K.
  Distance:            d(x, y) = (1/√|K|) cos_K^{-1}( K ⟨x, y⟩_K )
  Exponential map:     exp^K_x(v) = cos_K( √|K| ||v||_K ) x + sin_K( √|K| ||v||_K ) v / ( √|K| ||v||_K )
  Logarithmic map:     log^K_x(y) = [ cos_K^{-1}( K ⟨x, y⟩_K ) / sin_K( cos_K^{-1}( K ⟨x, y⟩_K ) ) ] ( y - K ⟨x, y⟩_K x )
  Parallel transport:  PT^K_{x→y}(v) = v - [ K ⟨y, v⟩_K / (1 + K ⟨x, y⟩_K) ] (x + y)

Table 2: Summary of operations in the projected spaces D_K and P_K.
  Distance:            d(x, y) = (1/√|K|) cos_K^{-1}( 1 - 2K ||x - y||_2^2 / [ (1 + K||x||_2^2)(1 + K||y||_2^2) ] )
  Gyrospace distance:  d_gyr(x, y) = (2/√|K|) tan_K^{-1}( √|K| ||-x ⊕_K y||_2 )
  Exponential map:     exp^K_x(v) = x ⊕_K ( tan_K( √|K| λ^K_x ||v||_2 / 2 ) v / ( √|K| ||v||_2 ) )
  Logarithmic map:     log^K_x(y) = ( 2 / (√|K| λ^K_x) ) tan_K^{-1}( √|K| ||-x ⊕_K y||_2 ) (-x ⊕_K y) / ||-x ⊕_K y||_2
  Parallel transport:  PT^K_{x→y}(v) = (λ^K_x / λ^K_y) gyr[y, -x]v

2.3 PRODUCTS OF SPACES

Previously, our space consisted of only one manifold of varying dimensionality and fixed curvature. Like Gu et al. (2019), we propose learning latent representations in products of constant curvature spaces, contrary to existing VAE approaches which are limited to a single Riemannian manifold. Our latent space M consists of several component spaces, M = ∏_{i=1}^k M^{n_i}_{K_i}, where n_i is the dimensionality of the i-th component, K_i is its curvature, and M ∈ {E, S, D, H, P} is the model choice. Even though all components have constant curvature, the resulting manifold M has non-constant curvature. Its distance function decomposes based on its definition, d^2_M(x, y) = Σ_{i=1}^k d^2_{M^{n_i}_{K_i}}(x^(i), y^(i)), where x^(i) denotes the part of the latent representation of x that belongs to the component M^{n_i}_{K_i}. All other operations we defined on our manifolds are applied component-wise: we again decompose the representation into parts x^(i), apply the operation on each part, x'^(i) = f^{(n_i)}_{K_i}(x^(i)), and concatenate the resulting parts back, x' = ⨆_{i=1}^k x'^(i) (see the sketch below).

The signature of the product space, i.e. its parametrization, has several degrees of freedom per component: (i) the model M, (ii) the dimensionality n_i, and (iii) the curvature K_i. We need to select all of the above for every component in our product space. To simplify notation, we use a shorthand for repeated components: (M^{n_i}_{K_i})^j = ∏_{l=1}^j M^{n_i}_{K_i}.
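The decomposition of the product-space distance can be sketched as follows. The signature format and helper names are hypothetical, and the per-component distance is passed in as any of the functions from Tables 1 and 2 (left abstract here).

```python
import numpy as np

# A hypothetical signature: three components with dimensions and curvatures, e.g. H^2 x E^2 x S^2.
signature = [(2, -1.0), (2, 0.0), (2, 0.7)]

def split(z, signature):
    """Split a concatenated latent vector into its per-component parts z^(i)."""
    parts, i = [], 0
    for dim, K in signature:
        parts.append(z[i:i + dim])
        i += dim
    return parts

def product_distance(x, y, signature, component_dist):
    """d_M^2(x, y) = sum_i d^2_{M_i}(x^(i), y^(i))."""
    d2 = 0.0
    for (dim, K), a, b in zip(signature, split(x, signature), split(y, signature)):
        d2 += component_dist(a, b, K) ** 2
    return np.sqrt(d2)

# With Euclidean components only, this reduces to the usual Euclidean distance:
euclid = lambda a, b, K: np.linalg.norm(a - b)
x, y = np.random.randn(6), np.random.randn(6)
print(product_distance(x, y, [(2, 0.0)] * 3, euclid), np.linalg.norm(x - y))
```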
In Euclidean space, the notation is redundant: for n_1, ..., n_k ∈ Z such that Σ_{i=1}^k n_i = n, it holds that E^n = ∏_{i=1}^k E^{n_i}. However, the equality does not hold for the other considered manifolds, due to the additional constraints posed on the points in the definitions of the individual models of curved spaces.

2.4 PROBABILITY DISTRIBUTIONS ON RIEMANNIAN MANIFOLDS

To be able to train Variational Autoencoders, we need to choose a probability distribution p as a prior and a corresponding posterior distribution family q. Both of these distributions have to be differentiable with respect to their parametrization, they need to have a differentiable Kullback-Leibler (KL) divergence, and they need to be reparametrizable (Kingma & Welling, 2014). For distributions where the KL divergence does not have a closed-form solution independent of z, or where this integral is too hard to compute, we can estimate it using Monte Carlo estimation:

    D_KL(q || p) ≈ (1/L) Σ_{l=1}^L [ log q(z^(l)) - log p(z^(l)) ],   which for L = 1 reduces to log q(z^(1)) - log p(z^(1)),

where z^(l) ~ q for all l = 1, ..., L.

The Euclidean VAE uses a natural choice for a prior on its latent representations: the Gaussian distribution (Kingma & Welling, 2014). Apart from satisfying the requirements for a VAE prior and posterior distribution, the Gaussian distribution has additional properties, like being the maximum entropy distribution for a given variance (Jaynes, 1957). There exist several fundamentally different approaches to generalizing the Normal distribution to Riemannian manifolds. We discuss the following three generalizations based on the way they are constructed (Mathieu et al., 2019).

Wrapping. This approach leverages the fact that all manifolds define a tangent vector space at every point. We simply sample from a Gaussian distribution in the tangent space at µ_0 with mean 0, and use parallel transport and the exponential map to map the sampled point onto the manifold. The PDF can be obtained using the multivariate chain rule if we can compute the determinant of the Jacobian of the parallel transport and the exponential map. This is very computationally effective, at the expense of losing some theoretical properties.

Restriction. The Restricted Normal approach is conceptually the opposite: instead of expanding a point to a dimensionally larger point, we restrict a point of the ambient space sampled from a Gaussian to the manifold. The consequence is that the distributions constructed this way are based on the flat Euclidean distance. An example of this is the von Mises-Fisher (vMF) distribution (Davidson et al., 2018). A downside of this approach is that vMF only has a single scalar covariance parameter κ, while other approaches can parametrize the covariance in different dimensions separately.

Maximizing entropy. Assuming a known mean and covariance matrix, we want to maximize the entropy of the distribution (Pennec, 2006). This approach is usually called the Riemannian Normal distribution. Mathieu et al. (2019) derive it for the Poincaré ball, and Hauberg (2018) derives the Spherical Normal distribution on the hypersphere. Maximum entropy distributions resemble the properties of the Gaussian distribution the closest, but it is usually very hard to sample from such distributions, compute their normalization constants, and even derive their specific form. Since the gain in VAE performance from this construction of Normal distributions over wrapping is only marginal, as reported by Mathieu et al. (2019), we have chosen to focus on Wrapped Normal distributions.
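The Monte Carlo KL estimate above only needs a reparametrized sampler and log-densities for q and p, as in this short PyTorch sketch; the closed-form Gaussian KL is used only as a sanity check.

```python
import torch
from torch.distributions import Normal, kl_divergence

def mc_kl(q, p, n_samples=1):
    """Monte Carlo estimate of KL(q || p): average of log q(z) - log p(z) over z ~ q.
    Works for any distribution exposing rsample/log_prob, e.g. a Wrapped Normal."""
    z = q.rsample((n_samples,))
    return (q.log_prob(z) - p.log_prob(z)).mean(dim=0)

q = Normal(torch.tensor(0.5), torch.tensor(1.2))
p = Normal(torch.tensor(0.0), torch.tensor(1.0))
print(mc_kl(q, p, n_samples=10000))   # close to the analytic value below
print(kl_divergence(q, p))
```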
To summarize, Wrapped Normal distributions are very computationally efficient to sample from and also efficient for computing the log-probability of a sample, as detailed by Nagano et al. (2019). The Riemannian Normal distributions (based directly on the geodesic distance in the manifold) could also be used; however, they are more computationally expensive to sample from, because the only methods available are based on rejection sampling (Mathieu et al., 2019).

2.4.1 WRAPPED NORMAL DISTRIBUTION

First of all, we need to define an origin point on the manifold, which we will denote as µ_0 ∈ M_K. What this point corresponds to is manifold-specific: in the hyperboloid and hypersphere it corresponds to the point µ_0 = (1/√|K|, 0, ..., 0)^T, and in the Poincaré ball, projected sphere, and Euclidean space it is simply µ_0 = 0, the origin of the coordinate system. Sampling from the distribution WN(µ, Σ) has been described in detail by Nagano et al. (2019) and Mathieu et al. (2019), and we have extended all the necessary operations and procedures to arbitrary curvature K. Sampling then corresponds to

    v ~ N(0, Σ) ∈ T_{µ_0}M_K,   u = PT^K_{µ_0→µ}(v) ∈ T_µ M_K,   z = exp^K_µ(u) ∈ M_K.

The log-probability of samples can be computed by the reverse procedure:

    u = log^K_µ(z) ∈ T_µ M_K,   v = PT^K_{µ→µ_0}(u) ∈ T_{µ_0}M_K,
    log WN(z; µ, Σ) = log N(v; 0, Σ) - log det( ∂f(v)/∂v ),

where f = exp^K_µ ∘ PT^K_{µ_0→µ}. The distribution can be applied to all manifolds that we have introduced. The only differences are the specific forms of the operations and of the log-determinant in the PDF. The specific forms of the log-PDF for the four spaces H, S, D, and P are derived in Appendix B. All variants of this distribution are reparametrizable and differentiable, and the KL divergence can be computed using Monte Carlo estimation. As a consequence of the convergence theorems for the distance function and operations of the Poincaré ball, A.5 (analogously for the projected hypersphere), A.17, A.18, and A.20, we see that the Wrapped Normal distribution converges to the Gaussian distribution as K → 0.

3 VARIATIONAL AUTOENCODERS IN PRODUCT SPACES

To be able to learn latent representations in Riemannian manifolds instead of in R^d as above, we only need to change the parametrization of the mean and covariance in the VAE forward pass, and the choice of prior and posterior distributions. The prior and posterior have to be chosen depending on the chosen manifold and are essentially treated as hyperparameters of our VAE. Since we have defined the Wrapped Normal family of distributions for all spaces, we can use WN(µ, σ^2 I) as the posterior family, and WN(µ_0, I) as the prior distribution. The forms of the distributions depend on the chosen space type. The mean is parametrized using the exponential map exp^K_{µ_0} as an activation function. Hence, all the model's parameters live in Euclidean space and can be optimized directly. In experiments, we sometimes use vMF(µ, κ) for the hypersphere S^n_K (or a backprojected variant of vMF for D^n_K) with the associated hyperspherical uniform distribution U(S^n_K) as a prior (Davidson et al., 2018), or the Riemannian Normal distribution RN(µ, σ^2) with the associated prior RN(µ_0, 1) for the Poincaré ball P^n_K (Mathieu et al., 2019).
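The sampling and log-density procedures of Section 2.4.1 can be organized as in the following structural PyTorch sketch. The exponential and logarithmic maps, parallel transport, and the log-determinant correction are manifold-specific (Appendix B) and are passed in as callables here, so this is an assumption-laden skeleton rather than the authors' implementation; with trivial Euclidean maps it reduces to an ordinary Gaussian.

```python
import torch
from torch.distributions import Normal

class WrappedNormal:
    """Structural sketch of WN(mu, Sigma): sample in T_{mu_0}M, parallel-transport to
    T_mu M, push onto the manifold with exp_mu; log_prob reverses this and subtracts
    the log-determinant of f = exp_mu o PT_{mu_0 -> mu}."""

    def __init__(self, mu, sigma, mu0, exp, log, pt, logdet_f):
        self.mu, self.sigma, self.mu0 = mu, sigma, mu0
        self.exp, self.log, self.pt, self.logdet_f = exp, log, pt, logdet_f

    def rsample(self):
        v = Normal(torch.zeros_like(self.sigma), self.sigma).rsample()  # in T_{mu_0}M
        u = self.pt(self.mu0, self.mu, v)                               # to T_mu M
        return self.exp(self.mu, u)                                     # onto the manifold

    def log_prob(self, z):
        u = self.log(self.mu, z)
        v = self.pt(self.mu, self.mu0, u)
        base = Normal(torch.zeros_like(self.sigma), self.sigma).log_prob(v).sum(-1)
        return base - self.logdet_f(self.mu, u)

# Euclidean sanity check: with identity maps, WN reduces to N(mu, sigma^2 I).
E = dict(exp=lambda mu, u: mu + u, log=lambda mu, z: z - mu,
         pt=lambda a, b, v: v, logdet_f=lambda mu, u: torch.zeros(()))
wn = WrappedNormal(torch.zeros(2), torch.ones(2), torch.zeros(2), **E)
z = wn.rsample()
print(wn.log_prob(z), Normal(torch.zeros(2), torch.ones(2)).log_prob(z).sum())
```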
3.1 LEARNING CURVATURE

We have already seen how to learn VAEs in products of spaces of constant curvature. However, we can also change the curvature constant of each space during training. The individual spaces will still have constant curvature at each point; we just allow the constant to change between training steps. To differentiate between these training procedures, we call them fixed curvature and learnable curvature VAEs, respectively.

The motivation behind changing the curvature of non-Euclidean constant curvature spaces might not be clear, since it is apparent from the definition of the distance function in the hypersphere and hyperboloid, d(x, y) = R θ_{x,y}, that distances between two points that stay at the same angle only get rescaled when changing the radius of the space. The same applies to the Poincaré ball and the projected spherical space. However, the decoder does not depend only on pairwise distances, but rather on the specific positions of points in the space. It can be conjectured that the KL term of the ELBO is indeed only rescaled when we change the curvature; the reconstruction process, however, is influenced in non-trivial ways. Since that is hard to quantify and prove, we devise a series of practical experiments to show that overall model performance is enhanced when learning curvature.

Fixed curvature VAEs. In fixed curvature VAEs, all component latent spaces have a fixed curvature that is selected a priori and kept fixed for the whole duration of the training procedure, as well as during evaluation. For Euclidean components it is 0; for positively or negatively curved spaces, any positive or negative number can be chosen, respectively. For stability reasons, we select curvature values (in absolute value) from the range [0.25, 1.0], which corresponds to radii in [1.0, 2.0]. The exact curvature value does not have a significant impact on performance when training a fixed curvature VAE, as motivated by the distance rescaling remark above. In the following, we refer to fixed curvature components with a constant subscript, e.g. H^n_1.

Learnable curvature VAEs. In all our manifolds, we can differentiate the ELBO with respect to the curvature K. This enables us to treat K as a parameter of the model and learn it using gradient-based optimization, exactly like we learn the encoder/decoder maps in a VAE. Learning curvature directly is badly conditioned: we are trying to learn one scalar parameter that influences the resulting decoder, and hence the ELBO, quite heavily. Empirically, we have found that Stochastic Gradient Descent works well for optimizing the radius of a component. We constrain the radius to be strictly positive in all non-Euclidean spaces by applying a ReLU activation function before we use it in any operations.
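One possible way to parametrize a learnable per-component radius is sketched below, under our own design choices: the curvature sign is fixed here (unlike in the universal variant described next), the small epsilon offset is ours, and only the ReLU constraint on the radius and the separate SGD optimizer follow what is reported above.

```python
import torch

class CurvatureComponent(torch.nn.Module):
    """Learnable radius R > 0 for one latent component; K = sign / R^2, i.e. R = 1/sqrt(|K|).
    A sketch: the ReLU keeps the radius positive, the eps offset is our addition."""
    def __init__(self, sign, init_radius=1.0, eps=1e-6):
        super().__init__()
        self.sign = sign                                  # +1 spherical, -1 hyperbolic
        self._radius = torch.nn.Parameter(torch.tensor(float(init_radius)))
        self.eps = eps

    @property
    def radius(self):
        return torch.relu(self._radius) + self.eps        # strictly positive radius

    @property
    def curvature(self):
        return self.sign / self.radius ** 2               # K = +/- 1 / R^2

comp = CurvatureComponent(sign=-1)                        # a hyperbolic component
opt = torch.optim.SGD([comp._radius], lr=1e-4)            # curvature gets its own SGD optimizer
print(comp.curvature)                                     # approximately -1.0 at R = 1
```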
Universal curvature VAEs. However, we must still a priori select the partitioning of our latent space: the number of components and, for each of them, the dimension and at least the sign of the curvature (signature estimation). The simplest approach would be to try all possibilities and compare the results on a specific dataset. This procedure would most likely be optimal, but it does not scale well. To eliminate this, we propose an approximate method: we partition our space into 2-dimensional components (if the number of dimensions is odd, one component will have 3 dimensions). We initialize all of them as Euclidean components and train for half the number of maximal epochs we are allowed. Then, we split the components into 3 approximately equal-sized groups and make one group into hyperbolic components, one into spherical components, and the last remains Euclidean. We do this by changing the curvature of a component by a very small ε. We then train just the encoder/decoder maps for a few epochs to stabilize the representations after changing the curvatures. Finally, we allow learning the curvatures of all non-Euclidean components and train for the rest of the allowed epochs. The method is not completely general, as it never uses components bigger than dimension 2 (or 3), but the approximation has empirically performed satisfactorily. We also do not constrain the curvature of the components to a specific sign in the last stage of training. Therefore, components may change their type of space from a positively curved to a negatively curved one, or vice-versa. Because of the divergence of points as K → 0 in the hyperboloid and hypersphere, the universal curvature VAE assumes the positively curved space is D and the negatively curved space is P. In all experiments, this universal approach is denoted as U^n.

4 EXPERIMENTS

For our experiments, we use four datasets: (i) Branching diffusion process (Mathieu et al., 2019, BDP), a synthetic tree-like dataset with injected noise; (ii) dynamically-binarized MNIST digits (LeCun, 1998), binarized similarly to Burda et al. (2016) and Salakhutdinov & Murray (2008): the training set is binarized dynamically with a per-sample threshold u ~ U[0, 1], i.e. bin(x) = 1[x > u] ∈ {0, 1}^D for x ∈ [0, 1]^D, while the evaluation set uses a fixed binarization (x > 0.5); (iii) dynamically-binarized Omniglot characters (Lake et al., 2015), downsampled to 28x28 pixels; and (iv) CIFAR-10 (Krizhevsky, 2009).

All models on all datasets are trained with early stopping on the training ELBO with a lookahead of 50 epochs and a warmup of 100 epochs (Bowman et al., 2016). All BDP models are trained for 1000 epochs, MNIST and Omniglot models are trained for 300 epochs, and CIFAR models for 200 epochs. We compare models with a given latent space dimension using the marginal log-likelihood estimated with importance sampling (Burda et al., 2016) with 500 samples, except for CIFAR, which uses 50 due to memory constraints. In all tables, we denote it as LL. We run all experiments at least 3 times to get an estimate of the variance under different initial values.

In all the BDP, MNIST, and Omniglot experiments below, we use a simple feed-forward encoder and decoder architecture consisting of a single dense layer with 400 neurons and an element-wise ReLU activation. Since all the VAE parameters {θ, φ} live in Euclidean manifolds, we can use standard gradient-based optimization methods. Specifically, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-3 and standard settings for β_1 = 0.9, β_2 = 0.999, and ε = 10^-8. For the CIFAR encoder map, we use a simple convolutional neural network with three convolutional layers with 64, 128, and 512 channels, respectively. For the decoder map, we first use a dense layer of dimension 2048, and then three consecutive transposed convolutional layers with 256, 64, and 3 channels. All layers are followed by a ReLU activation function, except for the last one. All convolutions have 4x4 kernels with stride 2 and padding of size 1.
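A PyTorch sketch matching the stated CIFAR architecture follows; the flattening, the reshape of the 2048-unit dense output to 128x4x4, and the two linear latent heads are our assumptions, since they are not spelled out above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),     # 32x32 -> 16x16
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),   # -> 8x8
            nn.Conv2d(128, 512, 4, 2, 1), nn.ReLU(),  # -> 4x4
        )
        # Latent heads (our assumption): the mean is later mapped through exp_{mu_0}.
        self.mu = nn.Linear(512 * 4 * 4, latent_dim)
        self.log_sigma = nn.Linear(512 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.mu(h), self.log_sigma(h)

class Decoder(nn.Module):
    def __init__(self, latent_dim=6):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 2048)                  # dense layer of dimension 2048
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 256, 4, 2, 1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(),   # -> 16x16
            nn.ConvTranspose2d(64, 3, 4, 2, 1),                # -> 32x32, no ReLU on the last layer
        )

    def forward(self, z):
        h = torch.relu(self.fc(z)).view(-1, 128, 4, 4)         # assumed reshape of the 2048 units
        return self.deconv(h)

x = torch.randn(2, 3, 32, 32)
mu, log_sigma = Encoder()(x)
print(Decoder()(mu).shape)   # torch.Size([2, 3, 32, 32])
```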
The first 10 epochs for all models are trained with a fixed curvature starting at 0 and increasing in absolute value each epoch. This corresponds to a burn-in period, similarly to Nickel & Kiela (2017). For learnable curvature approaches, we then use Stochastic Gradient Descent with learning rate 10^-4 and let the optimizer adjust the value freely; for fixed curvature approaches, it stays at the last burn-in value. All our models use the Wrapped Normal distribution, or equivalently the Gaussian in Euclidean components, unless specified otherwise. All fixed curvature components are denoted with an M_1 or M_-1 subscript; learnable curvature components do not have a subscript. The observation models for the reconstruction loss term were Bernoulli distributions for MNIST and Omniglot, and standard Gaussian distributions for BDP and CIFAR.

As baselines, we train VAEs with spaces that have a fixed constant curvature, i.e. that assume a single Riemannian manifold (potentially a product of them) as their latent space. It is apparent that our models with a single component, like S^n_1, correspond to Davidson et al. (2018) and Xu & Durrett (2018), H^n_1 is equivalent to the Hyperbolic VAE of Nagano et al. (2019), P^n_c corresponds to the P_c-VAE of Mathieu et al. (2019), and E^n is equivalent to the Euclidean VAE. In the following, we present a selection of all the obtained results. For more information see Appendix E. Bold numbers represent values that are particularly interesting. Since the Riemannian Normal and the von Mises-Fisher distribution only have a spherical covariance matrix, i.e. a single scalar variance parameter per component, we evaluate all our approaches with a spherical covariance parametrization as well.

Table 3: Summary of results (mean and standard deviation of LL), latent space dimension 6, spherical covariance, on the BDP (left) and MNIST (right) datasets.

BDP:
  S^6_1 -55.81±0.35; D^6_1 -55.78±0.07; E^6 -56.28±0.56; (H^2_1)^3 -56.08±0.52; H^6_1 -56.18±0.32; (P^2_1)^3 -55.98±0.62; P^6_1 -56.74±0.55; (RN P^2_1)^3 -54.99±0.12
  (S^2)^3 -56.05±0.21; (D^2)^3 -56.06±0.36; (H^2)^3 -55.80±0.32; (P^2)^3 -56.29±0.05; (RN P^2)^3 -56.25±0.56
  D^2 E^2 P^2 -55.87±0.22; E^2 H^2 S^2 -55.92±0.42; E^2 H^2 (vMF S^2) -55.82±0.43; E^2 H^2_1 (vMF S^2_1) -55.77±0.51
  (U^2)^3 -55.56±0.15; U^6 -55.84±0.38

MNIST:
  S^6_1 -96.71±0.17; vMF S^6_1 -97.03±0.14; D^6_1 -98.21±0.23; E^6 -97.16±0.15; H^6_1 -97.10±0.44; (P^2_1)^3 -97.56±0.04; (RN P^2_1)^3 -92.54±0.19
  (S^2)^3 -96.46±0.12; S^6 -96.72±0.15; vMF S^6 -96.72±0.18; D^6 -97.72±0.15; (H^2)^3 -97.37±0.13; (RN P^2)^3 -94.16±0.68
  D^2 E^2 P^2 -97.48±0.18; D^2 E^2 (RN P^2) -96.43±0.47; D^2_1 E^2 (RN P^2_1) -96.18±0.21; E^2 H^2 S^2 -96.80±0.20; E^2 H^2_1 S^2_1 -96.76±0.09; E^2 H^2 (vMF S^2) -96.56±0.27
  (U^2)^3 -97.12±0.04; U^6 -97.26±0.16

Binary diffusion process. For the BDP dataset and latent dimension 6 (Table 3), we observe that all VAEs that only use the von Mises-Fisher distribution perform worse than a Wrapped Normal. However, when a vMF spherical component was paired with other component types, it performed better than if a Wrapped Normal spherical component was used instead. Riemannian Normal VAEs did very well on their own: the fixed Poincaré VAE (RN P^2_1)^3 obtains the best score. It did not fare as well when we tried to learn curvature with it, however.
An interesting observation is that all single-component VAEs M^6 performed worse than product VAEs (M^2)^3 when curvature was learned, across all component types. Our universal curvature VAE (U^2)^3 managed to get better results than all other approaches except for the Riemannian Normal baseline, but it is within the margin of error of some other models. It also outperformed its single-component variant U^6. However, we did not find that it converged to specific curvature values, only that they were in the approximate range of (-0.1, +0.1).

Table 4: Summary of selected models (mean and standard deviation of LL), latent space dimension 6 (left) and 12 (right), diagonal covariance, on the MNIST dataset.

Dimension 6:
  S^6_1 -96.51±0.09; D^6_1 -97.89±0.10; E^6 -96.88±0.16; H^6_1 -97.38±0.73; P^6_1 -97.33±0.15
  S^6 -96.44±0.20; D^6 -97.53±0.22; (H^2)^3 -96.86±0.31; H^6 -96.90±0.26; P^6 -97.26±0.16
  D^2 E^2 P^2 -97.37±0.14; D^2_1 E^2 P^2_1 -97.29±0.16; E^2 H^2 S^2 -96.71±0.19; E^2 H^2_1 S^2_1 -96.66±0.27
  (U^2)^3 -97.06±0.13; U^6 -96.90±0.10

Dimension 12:
  (S^2_1)^6 -79.92±0.21; (D^2_1)^6 -80.53±0.10; E^12 -79.51±0.09; (H^2_1)^6 -80.54±0.23; H^12_1 -79.37±0.14; (P^2_1)^6 -80.39±0.07
  S^12 -79.99±0.27; D^12 -80.37±0.16; H^12 -79.77±0.10; (P^2)^6 -80.31±0.08
  (D^2_1)^2 (E^2)^2 (P^2_1)^2 -80.14±0.11; D^4_1 E^4 P^4_1 -80.14±0.20; (E^2)^2 (H^2)^2 (S^2)^2 -79.59±0.25; E^4 H^4 S^4 -79.69±0.14
  (U^2)^6 -79.61±0.06; U^12 -80.01±0.30

Dynamically-binarized MNIST reconstruction. On MNIST (Table 3) with spherical covariance, we noticed that vMF again under-performed the Wrapped Normal, except when it was part of a product like E^2 H^2 (vMF S^2). When paired with another Euclidean and a Riemannian Normal Poincaré disk component, it performed well, but that might be because the RN P_1 component achieved the best results across the board on MNIST. It achieved good results even compared to diagonal covariance VAEs on 6-dimensional MNIST. Several approaches are better than the Euclidean baseline. That applies mainly to the above-mentioned Riemannian Normal Poincaré ball components, but also to S^6, both with Wrapped Normal and vMF, as well as to most product space VAEs with different curvatures (third section of the table). Our (U^2)^3 performed similarly to the Euclidean baseline VAE.

With a diagonal covariance parametrization (Table 4), we observe similar trends. With a latent dimension of 6, the Riemannian Normal Poincaré ball VAE is still the best performer. The Euclidean baseline VAE achieved better results than its spherical covariance counterpart. Overall, the best result is achieved by the single-component spherical model with learnable curvature, S^6. Interestingly, all single-component VAEs performed better than their (M^2)^3 counterparts, except for the H^6 hyperboloid, but only by a tiny margin. Products of different component types also achieve good results. Noteworthy is that their fixed curvature variants seem to perform marginally better than the learnable curvature ones. Our universal VAEs perform at around the Euclidean baseline VAE performance. Interestingly, all of them end up with negative curvatures -0.3 < K < 0.

Secondly, we run our models with a latent space dimension of 12. We immediately notice that not many models can beat the Euclidean VAE baseline E^12 consistently, but several are within the margin of error, notably the product VAEs of H, S, and E, the fixed and learnable H^12, and our universal VAE (U^2)^6. Interestingly, products of small components perform better when curvature is fixed, whereas single big component VAEs are better when curvature is learned.

Dynamically-binarized Omniglot reconstruction. For a latent space of dimension 6 (Table 5), the best of the baseline models is the Poincaré VAE of Mathieu et al. (2019).
Our models that come Published as a conference paper at ICLR 2020 Table 5: Summary of selected models (mean and standard deviation), latent space dimension 6, diagonal covariance, on the Omniglot (left) and CIFAR-10 dataset (right), respectively. S6 1 136.69 0.94 D6 1 137.42 1.20 E6 136.05 0.29 H6 1 137.09 0.06 P6 1 135.86 0.20 (S2)3 136.14 0.27 S6 136.20 0.44 (D2)3 136.13 0.17 D6 136.30 0.08 (H2)3 136.17 0.09 H6 136.24 0.32 (P2)3 136.09 0.07 P6 136.05 0.44 D2 E2 P2 135.89 0.40 E2 H2 S2 135.93 0.48 (U2)3 136.21 0.07 U6 136.04 0.17 E6 1896.19 2.54 H6 1 1888.23 2.12 P6 1 1893.27 0.61 D6 1893.85 0.36 S6 1889.76 1.62 P6 1891.40 2.14 D2 E2 P2 1899.90 4.60 E2 H2 S2 1895.46 0.92 (U2)3 1895.09 4.27 very close to the average estimated marginal log-likelihood, and are definitely within the margin of error, are mainly (S2)3, D2 E2 P2, and U6. However, with the variance of performance across different runs, we cannot draw a clear conclusion. In general, hyperbolic VAEs seem to be doing a bit better on this dataset than spherical VAEs, which is also confirmed by the fact that almost all universal curvature models finished with negative curvature components. CIFAR-10 reconstruction For a latent space of dimension 6, we can observe that almost all non Euclidean models perform better than the euclidean baseline E6. Especially well-performing is the fixed hyperboloid H6 1, and the learnable hypersphere S6. Curvatures for all learnable models on this dataset converge to values in the approximate range of ( 0.15, +0.15). Summary In conclusion, a very good model seems to be the Riemannian Normal Poincar e ball VAE RN Pn. However, it has practical limitations due to a rejection sampling algorithm and an unstable implementation. On the contrary, von Mises-Fischer spherical VAEs have almost consistently performed worse than their Wrapped Normal equivalents. Overall, Wrapped Normal VAEs in all constant curvature manifolds seem to perform well at modeling the latent space. A key takeaway is that our universal curvature models Un and (U2) n/2 seem to generally outperform their corresponding Euclidean VAE baselines in lower-dimensional latent spaces and, with minor losses, manage to keep most of the competitive performance as the dimensionality goes up, contrary to VAEs with other non-Euclidean components. 5 CONCLUSION By transforming the latent space and associated prior distributions onto Riemannian manifolds of constant curvature, it has previously been shown that we can learn representations on curved space. Generalizing on the above ideas, we have extended the theory of learning VAEs to products of constant curvature spaces. To do that, we have derived the necessary operations in several models of constant curvature spaces, extended existing probability distribution families to these manifolds, and generalized VAEs to latent spaces that are products of smaller component spaces, with learnable curvature. On various datasets, we show that our approach is competitive and additionally has the property that it generalizes the Euclidean variational autoencoder if the curvatures of all components go to 0, we recover the VAE of Kingma & Welling (2014). Published as a conference paper at ICLR 2020 ACKNOWLEDGMENTS We would like to thank Andreas Bloch for help in verifying some of the formulas for constant curvature spaces and for many insightful discussions; Prof. Thomas Hofmann and the Data Analytics Lab, the Leonhard cluster, and ETH Z urich for GPU access. Work was done while all authors were at ETH Z urich. 
Ondrej Skopek (oskopek@google.com) is now at Google. Octavian-Eugen Ganea (oct@mit.edu) is now at the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Gary Bécigneul (gary.becigneul@inf.ethz.ch) is funded by the Max Planck ETH Center for Learning Systems.

REFERENCES

Hervé Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433-459, 2010. doi: 10.1002/wics.101.

Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the curvature of deep generative models. In International Conference on Learning Representations, 2018.

Gregor Bachmann, Gary Bécigneul, and Octavian-Eugen Ganea. Constant Curvature Graph Convolutional Networks, 2020. URL https://openreview.net/forum?id=BJg73xHtvr.

Kayhan Batmanghelich, Ardavan Saeedi, Karthik Narasimhan, and Sam Gershman. Nonparametric Spherical Topic Modeling with Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 537-542. Association for Computational Linguistics, 2016. doi: 10.18653/v1/P16-2087.

Gary Bécigneul and Octavian-Eugen Ganea. Riemannian Adaptive Optimization Methods. In International Conference on Learning Representations, 2019.

Marcel Berger. A panoramic view of Riemannian geometry. Springer Science & Business Media, 2012.

János Bolyai. Appendix, Scientiam Spatii absolute veram exhibens: a veritate aut falsitate Axiomatis XI. Euclidei (a priori haud unquam decidenda) independentem; adjecta ad casum falsitatis, quadratura circuli geometrica. Auctore Johanne Bolyai de eadem, Geometrarum in Exercitu Caesareo Regio Austriaco Castrensium Capitaneo. Coll. Ref., 1832.

Silvère Bonnabel. Stochastic Gradient Descent on Riemannian Manifolds. IEEE Transactions on Automatic Control, 58:2217-2229, 2013.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10-21. Association for Computational Linguistics, 2016. doi: 10.18653/v1/K16-1002.

M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, July 2017. ISSN 1053-5888. doi: 10.1109/MSP.2017.2693418.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, 2016.

James W. Cannon, William J. Floyd, Richard Kenyon, Walter R. Parry, et al. Hyperbolic geometry. Flavors of geometry, 31:59-115, 1997.

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, 2014.

Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical Variational Auto-Encoders. In UAI, 2018.

Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. Embedding Text in Hyperbolic Spaces. arXiv preprint arXiv:1806.04313, 2018.

Robert L. Foote. A Unified Pythagorean Theorem in Euclidean, Spherical, and Hyperbolic Geometries. Mathematics Magazine, 90(1):59-69, 2017. ISSN 0025570X, 19300980.
Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic Neural Networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 5345-5355. Curran Associates, Inc., 2018a.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic Entailment Cones for Learning Hierarchical Embeddings. arXiv preprint arXiv:1804.01882, 2018b.

Mevlana C. Gemici, Danilo Rezende, and Shakir Mohamed. Normalizing Flows on Riemannian Manifolds. arXiv preprint arXiv:1611.02304, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. Learning Mixed-Curvature Representations in Product Spaces. In International Conference on Learning Representations, 2019.

Søren Hauberg. Directional Statistics with the Spherical Normal Distribution. In 2018 21st International Conference on Information Fusion (FUSION), pp. 704-711. IEEE, 2018.

Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.

Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun, et al. Adversarially regularized autoencoders. arXiv preprint arXiv:1706.04223, 2017.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2015.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, 2014.

Max Kochurov, Sergey Kozlukov, Rasul Karimov, and Viktor Yanush. Geoopt: Adaptive Riemannian optimization in PyTorch. https://github.com/geoopt/geoopt, 2019.

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015. ISSN 0036-8075. doi: 10.1126/science.aab3050.

Johann H. Lambert. Observations trigonométriques. Mém. Acad. Sci. Berlin, 24:327-354, 1770.

Marc T. Law, Jake Snell, and Richard S. Zemel. Lorentzian distance learning, 2019.

Yann LeCun. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.

J. M. Lee. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics. Springer New York, 1997. ISBN 9780387983226.

Hongbo Li, David Hestenes, and Alyn Rockwood. A Universal Model for Conformal Geometries of Euclidean, Spherical and Double-Hyperbolic Spaces, pp. 77-104. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001. ISBN 978-3-662-04621-0. doi: 10.1007/978-3-662-04621-0_4.

Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, and Yee Whye Teh.
Hierarchical Representations with Poincaré Variational Auto-Encoders. arXiv preprint arXiv:1901.06033, 2019.

Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, 2017.

Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, and Masanori Koyama. A differentiable gaussian-like distribution on hyperbolic space for gradient-based learning. arXiv preprint arXiv:1902.02992, 2019.

Maximilian Nickel and Douwe Kiela. Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry. In Proceedings of the 35th International Conference on Machine Learning, ICML '18, 2018.

Maximillian Nickel and Douwe Kiela. Poincaré Embeddings for Learning Hierarchical Representations. In Advances in Neural Information Processing Systems 30, pp. 6338-6347. Curran Associates, Inc., 2017.

Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. Adversarially regularized graph autoencoder for graph embedding. arXiv preprint arXiv:1802.04407, 2018.

Xavier Pennec. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127, Jul 2006. ISSN 1573-7683. doi: 10.1007/s10851-006-6228-4.

Peter Petersen, S. Axler, and K. A. Ribet. Riemannian Geometry, volume 171. Springer, 2006.

John Ratcliffe. Foundations of Hyperbolic Manifolds, volume 149. Springer Science & Business Media, 2006.

Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. arXiv preprint arXiv:1505.05770, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv preprint arXiv:1401.4082, 2014.

Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4460-4469, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

Ruslan Salakhutdinov and Iain Murray. On the Quantitative Analysis of Deep Belief Networks. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 872-879, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390266.

Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412-422. Springer, 2018.

John Parr Snyder. Map projections: A working manual, volume 1395. US Government Printing Office, 1987.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré Glove: Hyperbolic Word Embeddings. In International Conference on Learning Representations, 2019.

Abraham Albert Ungar. A Gyrovector Space Approach to Hyperbolic Geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1-194, 2008.

Benjamin Wilson and Matthias Leimeister. Gradient descent in hyperbolic space. arXiv preprint arXiv:1805.08207, 2018.

Richard C. Wilson and Edwin R. Hancock. Spherical embedding and classification. In Edwin R. Hancock, Richard C. Wilson, Terry Windeatt, Ilkay Ulusoy, and Francisco Escolano (eds.), Structural, Syntactic, and Statistical Pattern Recognition, pp. 589-599, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-14980-1.
Jiacheng Xu and Greg Durrett. Spherical Latent Spaces for Stable Variational Autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4503-4513. Association for Computational Linguistics, 2018.

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative Visual Manipulation on the Natural Image Manifold. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

A GEOMETRICAL DETAILS

A.1 RIEMANNIAN GEOMETRY

An elementary notion in Riemannian geometry is that of a real, smooth manifold M ⊆ R^n, which is a collection of real vectors x that is locally similar to a linear space and lives in the ambient space R^n. At each point x ∈ M of the manifold, a real vector space of the same dimensionality as M is defined, called the tangent space at the point x: T_x M. Intuitively, the tangent space contains all the directions and speeds at which one can pass through x. Given a matrix representation G(x) ∈ R^(n×n) of the Riemannian metric tensor g(x), we can define a scalar product on the tangent space, ⟨·,·⟩_x : T_x M × T_x M → R, where ⟨a, b⟩_x = g(x)(a, b) = a^T G(x) b for any a, b ∈ T_x M. A Riemannian manifold is then the tuple (M, g). The scalar product induces a norm on the tangent space T_x M: ||a||_x = √⟨a, a⟩_x for all a ∈ T_x M (Petersen et al., 2006).

Although it seems like the manifold only defines a local geometry, it induces global quantities by integrating the local contributions. The metric tensor induces a local infinitesimal volume element on each tangent space T_x M, and hence a measure is induced as well: dM(x) = √|G(x)| dx, where dx is the Lebesgue measure. The length of a curve γ : t ↦ γ(t) ∈ M, t ∈ [0, 1], is given by L(γ) = ∫_0^1 √⟨(d/dt)γ(t), (d/dt)γ(t)⟩_{γ(t)} dt. Straight lines are generalized to constant speed curves giving the shortest path between pairs of points x, y ∈ M, so-called geodesics, for which it holds that γ* = argmin_γ L(γ), such that γ(0) = x, γ(1) = y, and ||(d/dt)γ(t)||_{γ(t)} = 1. Global distances are thus induced on M by d_M(x, y) = inf_γ L(γ). Using this metric, we can go on to define a metric space (M, d_M).

Moving from a point x ∈ M in a given direction v ∈ T_x M with constant velocity is formalized by the exponential map, exp_x : T_x M → M. There exists a unique unit speed geodesic γ such that γ(0) = x and (dγ(t)/dt)|_{t=0} = v, where v ∈ T_x M. The corresponding exponential map is then defined as exp_x(v) = γ(1). The logarithmic map is its inverse, log_x = exp_x^{-1} : M → T_x M. For geodesically complete manifolds, i.e. manifolds in which there exists a length-minimizing geodesic between every x, y ∈ M, such as the Lorentz model, the hypersphere, and many others, exp_x is well-defined on the full tangent space T_x M. To connect vectors in tangent spaces, we use parallel transport PT_{x→y} : T_x M → T_y M, which is an isomorphism between the two tangent spaces, so that the transported vectors stay parallel to the connection. It corresponds to moving tangent vectors along geodesics and defines a canonical way to connect tangent spaces.

A.2 BRIEF COMPARISON OF CONSTANT CURVATURE SPACE MODELS

We have seen five different models of constant curvature spaces, each of which has advantages and disadvantages when applied to learning latent representations with VAEs.
A big advantage of the hyperboloid and hypersphere is that optimization in these spaces does not suffer from as many numerical instabilities as it does in the respective projected spaces. On the other hand, we have seen that as K → 0, the norms of points go to infinity. As we will see in the experiments, this is not a problem when optimizing curvature within these spaces in practice, except if we are trying to cross the boundary at K = 0 and go from a hyperboloid to a sphere, or vice versa. Intuitively, the points are just positioned very differently in the ambient space of H_{-ε} and S_ε, for a small ε > 0.

Since points in the n-dimensional projected hypersphere and Poincaré ball models can be represented using a real vector of length n, we can visualize points in these manifolds directly for n = 2 or even n = 3. On the other hand, optimizing a function over these models is not very well-conditioned. In the case of the Poincaré ball, a significant number of points lie close to the boundary of the ball (i.e. with a squared norm of almost -1/K), which causes numerical instabilities even when using 64-bit float precision in computations. A similar problem occurs in the projected hypersphere with points that are far away from the origin 0 (i.e. points that are close to the South pole on the backprojected sphere). Unintuitively, all points that are far away from the origin are actually very close to each other with respect to the induced distance function, and very far away from each other in terms of the Euclidean distance.

Both distance conversion theorems (A.5 and its projected hypersphere counterpart) rely on the points being fixed when changing curvature. If they somehow depend on the curvature, the convergence theorem does not hold. We conjecture that if points stay close to the boundary in P, or far away from 0 in D, as K → 0, this is exactly the reason for the numerical instabilities (apart from the standard numerical problem of representing large numbers in floating-point notation). Because of the above reasons, we do some of our experiments with the projected spaces and others with the hyperboloid and hypersphere, and aim to compare the performance of these empirically as well.

A.3 EUCLIDEAN GEOMETRY

A.3.1 EUCLIDEAN SPACE

Distance function. The distance function in E^n is d_E(x, y) = ||x - y||_2. By expanding the squared norm (the law of cosines), we can derive that

    ||x - y||_2^2 = ⟨x - y, x - y⟩_2 = ||x||_2^2 - 2⟨x, y⟩_2 + ||y||_2^2 = ||x||_2^2 + ||y||_2^2 - 2||x||_2 ||y||_2 cos(θ_{x,y}).

Exponential map. The exponential map in E^n is exp_x(v) = x + v. The fact that the resulting points belong to the space is trivial. Deriving the inverse function, i.e. the logarithmic map, is also trivial: log_x(y) = y - x.

Parallel transport. We do not need parallel transport in Euclidean space, as we can directly sample from a Normal distribution. In other words, we can just define parallel transport to be the identity function.

A.4 HYPERBOLIC GEOMETRY

A.4.1 HYPERBOLOID

Do note that all the theorems for the hyperboloid are essentially trivial corollaries of their equivalents in the hypersphere (and vice-versa) (Section A.5.1). Notable differences include the fact that R^2 = -1/K, not R^2 = 1/K, and that all the operations use the hyperbolic trigonometric functions sinh, cosh, and tanh, instead of their Euclidean counterparts. Also, we often leverage the hyperbolic Pythagorean theorem, in the form cosh^2(α) - sinh^2(α) = 1.
Projections. Due to the definition of the space as a retraction from the ambient space, we can project a generic vector in the ambient space onto the hyperboloid using the shortest Euclidean distance, by normalization:

    proj_{H^n_K}(x) = R x / ||x||_L,   where ||x||_L = √|⟨x, x⟩_L|.

Secondly, the n + 1 coordinates of a point on the hyperboloid are co-dependent; they satisfy the relation ⟨x, x⟩_L = 1/K. This implies that if we are given a vector x~ = (x_2, ..., x_{n+1}) with n coordinates, we can compute the missing coordinate to place it onto the hyperboloid:

    x_1 = √( ||x~||_2^2 + R^2 ).

This is useful, for example, when orthogonally projecting points from T_{µ_0}H^n_K onto the manifold.

Distance function. The distance function in H^n_K is

    d^K_H(x, y) = R θ_{x,y} = R cosh^{-1}( -⟨x, y⟩_L / R^2 ) = (1/√(-K)) cosh^{-1}( K ⟨x, y⟩_L ).

Remark A.1 (About the divergence of points in H^n_K). Since the points on the hyperboloid x ∈ H^n_K are norm-constrained to ⟨x, x⟩_L = 1/K, all the points on the hyperboloid diverge as K goes to 0 from below: lim_{K→0^-} ⟨x, x⟩_L = -∞. This confirms the intuition that the hyperboloid grows "flatter", but to do that, it has to move away from the origin of the coordinate space 0. A good example of a point that diverges is the origin of the hyperboloid µ^K_0 = (1/√|K|, 0, ..., 0)^T = (R, 0, ..., 0)^T. This makes the model unsuitable for trying to learn sign-agnostic curvatures, similarly to the hypersphere.

Exponential map. The exponential map in H^n_K is

    exp^K_x(v) = cosh( ||v||_L / R ) x + sinh( ||v||_L / R ) R v / ||v||_L,

and in the case of x := µ_0 = (R, 0, ..., 0)^T:

    exp^K_{µ_0}(v) = ( R cosh( ||v~||_2 / R ) ; R sinh( ||v~||_2 / R ) v~^T / ||v~||_2 )^T,

where v = (0; v~^T)^T and ||v||_L = ||v||_2 = ||v~||_2.

Theorem A.2 (Logarithmic map in H^n_K). For all x, y ∈ H^n_K, the logarithmic map in H^n_K maps y to a tangent vector at x:

    log^K_x(y) = [ cosh^{-1}(α) / √(α^2 - 1) ] (y - αx),   where α = K ⟨x, y⟩_L.

Proof. We show the detailed derivation of the logarithmic map as the inverse of the exponential map, log_x(y) = exp_x^{-1}(y), adapted from Nagano et al. (2019). As mentioned previously,

    y = exp^K_x(v) = cosh( ||v||_L / R ) x + sinh( ||v||_L / R ) R v / ||v||_L.

Solving for v, we obtain

    v = [ ||v||_L / ( R sinh( ||v||_L / R ) ) ] ( y - cosh( ||v||_L / R ) x ).

However, we still need to rewrite ||v||_L in evaluatable terms:

    0 = ⟨x, v⟩_L = [ ||v||_L / ( R sinh( ||v||_L / R ) ) ] ( ⟨x, y⟩_L - cosh( ||v||_L / R ) ⟨x, x⟩_L ),   with ⟨x, x⟩_L = -R^2,

which implies cosh( ||v||_L / R ) = -⟨x, y⟩_L / R^2, and therefore

    ||v||_L = R cosh^{-1}( -⟨x, y⟩_L / R^2 ) = (1/√(-K)) cosh^{-1}( K ⟨x, y⟩_L ) = d^K_H(x, y).

Plugging the result back into the expression for v, we obtain

    v = [ cosh^{-1}(α) / sinh( cosh^{-1}(α) ) ] (y - αx) = [ cosh^{-1}(α) / √(α^2 - 1) ] (y - αx),   where α = -⟨x, y⟩_L / R^2 = K ⟨x, y⟩_L,

and the last equality assumes α > 1. This holds because for all points x, y ∈ H^n_K we have ⟨x, y⟩_L ≤ -R^2, with equality if and only if x = y, by the reversed Cauchy-Schwarz inequality for the Lorentz inner product (Ratcliffe, 2006, Theorem 3.1.6). Hence, the only problematic case is x = y, for which the result is clearly v = 0.

Parallel transport. Using the generic formula for parallel transport along geodesics, for x, y ∈ M and v ∈ T_x M,

    PT^K_{x→y}(v) = v - [ ⟨log^K_x(y), v⟩_x / d_M(x, y)^2 ] ( log^K_x(y) + log^K_y(x) ),   (1)

and the logarithmic map formula from Theorem A.2, we derive parallel transport in H^n_K:

    PT^K_{x→y}(v) = v + [ ⟨y, v⟩_L / ( R^2 - ⟨x, y⟩_L ) ] (x + y).

A special form of parallel transport exists for when the source point is µ_0 = (R, 0, ..., 0)^T:

    PT^K_{µ_0→y}(v) = v + [ ⟨y, v⟩_2 / ( R^2 + R y_1 ) ] ( y_1 + R, y_2, ..., y_{n+1} )^T.
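A small NumPy check of the formulas above: the exponential map keeps points on H^n_K, and the logarithmic map of Theorem A.2 inverts it. The function names and the chosen curvature are ours.

```python
import numpy as np

K = -0.8
R = 1.0 / np.sqrt(abs(K))

def lorentz(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(x, v):
    n = np.sqrt(lorentz(v, v))                 # ||v||_L for tangent vectors
    return np.cosh(n / R) * x + np.sinh(n / R) * R * v / n

def log_map(x, y):
    alpha = K * lorentz(x, y)
    return np.arccosh(alpha) / np.sqrt(alpha**2 - 1) * (y - alpha * x)

mu0 = np.array([R, 0.0, 0.0])                  # origin of H^2_K
v = np.array([0.0, 0.4, -0.2])                 # tangent vector at mu0 (first coordinate 0)
y = exp_map(mu0, v)
print(lorentz(y, y), 1 / K)                    # the point stays on the manifold
print(log_map(mu0, y), v)                      # log inverts exp (Theorem A.2)
```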
A.4.2 POINCARÉ BALL

Note that all the theorems for the Poincaré ball are essentially trivial corollaries of their equivalents for the projected hypersphere (and vice versa) (Section A.5.2). Notable differences include the fact that $R^2 = -1/K$, not $R^2 = 1/K$, and that all operations use the hyperbolic trigonometric functions sinh, cosh, and tanh instead of their Euclidean counterparts. Also, we often leverage the hyperbolic Pythagorean theorem, in the form $\cosh^2(\alpha) - \sinh^2(\alpha) = 1$.

Stereographic projection

Theorem A.3 (Stereographic backprojected points of $\mathbb{P}^n_K$ belong to $\mathbb{H}^n_K$). For all $y \in \mathbb{P}^n_K$, it holds that $\left\langle\rho^{-1}_K(y), \rho^{-1}_K(y)\right\rangle_L = 1/K$.

Proof. Writing out
$$\rho^{-1}_K(y) = \left(\frac{1}{\sqrt{|K|}}\,\frac{1 - K\|y\|_2^2}{K\|y\|_2^2 + 1};\ \frac{2y^T}{K\|y\|_2^2 + 1}\right)^T,$$
we compute
$$\left\langle\rho^{-1}_K(y), \rho^{-1}_K(y)\right\rangle_L = -\frac{1}{|K|}\,\frac{(K\|y\|_2^2 - 1)^2}{(K\|y\|_2^2 + 1)^2} + \frac{4\|y\|_2^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{(K\|y\|_2^2 - 1)^2 + 4K\|y\|_2^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{K^2\|y\|_2^4 + 2K\|y\|_2^2 + 1}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{(K\|y\|_2^2 + 1)^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K},$$
where we used $-1/|K| = 1/K$ for $K < 0$.

Distance function The distance function in $\mathbb{P}^n_K$ is (derived from the hyperboloid distance function using the stereographic projection $\rho_K$):
$$d_P(x, y) = d_H(\rho^{-1}_K(x), \rho^{-1}_K(y)) = \frac{1}{\sqrt{-K}}\cosh^{-1}\left(1 - \frac{2K\|x - y\|_2^2}{(1 + K\|x\|_2^2)(1 + K\|y\|_2^2)}\right) = \frac{1}{\sqrt{-K}}\cosh^{-1}\left(1 + \frac{2R^2\|x - y\|_2^2}{(R^2 - \|x\|_2^2)(R^2 - \|y\|_2^2)}\right).$$

Theorem A.4 (Distance equivalence in $\mathbb{P}^n_K$). For all $K < 0$ and for all pairs of points $x, y \in \mathbb{P}^n_K$, the Poincaré distance between them equals the gyrospace distance, $d_P(x, y) = d_{Pgyr}(x, y)$.

Proof. Proven using Mathematica (File: distance limits.ws); the proof involves heavy algebra.

Theorem A.5 (Gyrospace distance converges to Euclidean in $\mathbb{P}^n_K$). For any fixed pair of points $x, y \in \mathbb{P}^n_K$, the Poincaré gyrospace distance between them converges to the Euclidean distance (up to a constant) as $K \to 0^-$: $\lim_{K\to 0^-} d_{Pgyr}(x, y) = 2\|x - y\|_2$.

Proof.
$$\lim_{K\to 0^-} d_{Pgyr}(x, y) = 2\lim_{K\to 0^-}\frac{\tanh^{-1}\left(\sqrt{-K}\,\|(-x)\oplus_K y\|_2\right)}{\sqrt{-K}} = 2\lim_{K\to 0^-}\frac{\tanh^{-1}\left(\sqrt{-K}\,\|y - x\|_2\right)}{\sqrt{-K}} = 2\|y - x\|_2,$$
where the second equality holds because of the theorem on limits of composed functions applied to $g(K) = \|(-x)\oplus_K y\|_2$, for which $\lim_{K\to 0^-} g(K) = \|y - x\|_2$ due to Theorem A.14. Additionally, for the last equality we need the fact that $\lim_{t\to 0}\tanh^{-1}(a\,t)/t = a$.

Theorem A.6 (Distance converges to Euclidean as $K \to 0^-$ in $\mathbb{P}^n_K$). For any fixed pair of points $x, y \in \mathbb{P}^n_K$, the Poincaré distance between them converges to the Euclidean distance (up to a constant) as $K \to 0^-$: $\lim_{K\to 0^-} d_P(x, y) = 2\|x - y\|_2$.

Proof. Theorems A.4 and A.5.

Exponential map As derived and proven in Ganea et al. (2018a), the exponential map in $\mathbb{P}^n_K$ and its inverse are
$$\exp^K_x(v) = x \oplus_K \left(\tanh\left(\sqrt{-K}\,\frac{\lambda^K_x\|v\|_2}{2}\right)\frac{v}{\sqrt{-K}\,\|v\|_2}\right), \qquad \log^K_x(y) = \frac{2}{\sqrt{-K}\,\lambda^K_x}\tanh^{-1}\left(\sqrt{-K}\,\|(-x)\oplus_K y\|_2\right)\frac{(-x)\oplus_K y}{\|(-x)\oplus_K y\|_2}.$$
In the case of $x := \mu_0 = (0, \ldots, 0)^T$ they simplify to:
$$\exp^K_{\mu_0}(v) = \tanh\left(\sqrt{-K}\,\|v\|_2\right)\frac{v}{\sqrt{-K}\,\|v\|_2}, \qquad \log^K_{\mu_0}(y) = \tanh^{-1}\left(\sqrt{-K}\,\|y\|_2\right)\frac{y}{\sqrt{-K}\,\|y\|_2}.$$

Parallel transport Kochurov et al. (2019) and Ganea et al. (2018a) have also derived and implemented the parallel transport operation for the Poincaré ball:
$$\mathrm{PT}^K_{x\to y}(v) = \frac{\lambda^K_x}{\lambda^K_y}\,\mathrm{gyr}[y, -x]v, \qquad \mathrm{PT}^K_{\mu_0\to y}(v) = \frac{2}{\lambda^K_y}\,v, \qquad \mathrm{PT}^K_{x\to\mu_0}(v) = \frac{\lambda^K_x}{2}\,v,$$
where $\mathrm{gyr}[x, y]v = -(x \oplus_K y) \oplus_K (x \oplus_K (y \oplus_K v))$ is the gyration operation (Ungar, 2008, Definition 1.11). Unfortunately, on the Poincaré ball, $\langle\cdot,\cdot\rangle_x$ has a form that changes with respect to $x$, unlike in the hyperboloid.

A.5 SPHERICAL GEOMETRY

A.5.1 HYPERSPHERE

All the theorems for the hypersphere are essentially trivial corollaries of their equivalents for the hyperboloid (Section A.4.1).
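As a small illustration of the gyrospace quantities used above, here is a NumPy sketch of Möbius addition (its explicit form is recalled in Theorem A.14 of Section A.6) and of the gyrospace distance $d_{Pgyr}$ as we read it from the formulas in this subsection; the helper names are ours. It also illustrates the $K \to 0^-$ limit of Theorem A.5 numerically:

```python
import numpy as np

def mobius_add(x, y, K):
    # Moebius addition (Theorem A.14):
    # x (+)_K y = ((1 - 2K<x,y> - K||y||^2) x + (1 + K||x||^2) y)
    #             / (1 - 2K<x,y> + K^2 ||x||^2 ||y||^2)
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * K * xy - K * y2) * x + (1 + K * x2) * y
    den = 1 - 2 * K * xy + K ** 2 * x2 * y2
    return num / den

def poincare_gyro_distance(x, y, K):
    # d_Pgyr(x, y) = 2 / sqrt(-K) * artanh(sqrt(-K) * ||(-x) (+)_K y||),  K < 0
    diff = np.linalg.norm(mobius_add(-x, y, K))
    return 2.0 / np.sqrt(-K) * np.arctanh(np.sqrt(-K) * diff)

x = np.array([0.1, -0.2])
y = np.array([0.3, 0.05])
for K in (-1.0, -0.1, -1e-4):
    print(K, poincare_gyro_distance(x, y, K))
# As K -> 0^- the value approaches 2 * ||x - y||_2 (Theorem A.5).
print(2 * np.linalg.norm(x - y))
```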
Notable differences include the fact that $R^2 = 1/K$, not $R^2 = -1/K$, and that all operations use the Euclidean trigonometric functions sin, cos, and tan instead of their hyperbolic counterparts. Also, we often leverage the Pythagorean theorem, in the form $\sin^2(\alpha) + \cos^2(\alpha) = 1$.

Projections Due to the definition of the space as a retraction from the ambient space, we can project a generic vector in the ambient space onto the hypersphere along the shortest Euclidean distance by normalization:
$$\mathrm{proj}_{\mathbb{S}^n_K}(x) = R\,\frac{x}{\|x\|_2}.$$
Secondly, the $n+1$ coordinates of a point on the sphere are co-dependent; they satisfy the relation $\langle x, x\rangle_2 = 1/K$. This implies that if we are given a vector with $n$ coordinates $\tilde{x} = (x_2, \ldots, x_{n+1})$, we can compute the missing coordinate to place it onto the sphere:
$$x_1 = \sqrt{R^2 - \|\tilde{x}\|_2^2}.$$
This is useful for example when orthogonally projecting points from $T_{\mu_0}\mathbb{S}^n_K$ onto the manifold.

Distance function The distance function in $\mathbb{S}^n_K$ is
$$d^K_S(x, y) = R\,\theta_{x,y} = \frac{1}{\sqrt{K}}\cos^{-1}\left(K\langle x, y\rangle_2\right).$$

Remark A.7 (About the divergence of points in $\mathbb{S}^n_K$). Since the points on the hypersphere $x \in \mathbb{S}^n_K$ are norm-constrained to $\langle x, x\rangle_2 = 1/K$, all the points on the sphere go to infinity as $K$ goes to 0 from above: $\lim_{K\to 0^+}\langle x, x\rangle_2 = +\infty$. This confirms the intuition that the sphere grows "flatter", but to do that, it has to move away from the origin of the coordinate space $\mathbf{0}$. A good example of a point that diverges is the north pole of the sphere $\mu_0^K = (1/\sqrt{K}, 0, \ldots, 0)^T = (R, 0, \ldots, 0)^T$. That makes this model unsuitable for trying to learn sign-agnostic curvatures, similarly to the hyperboloid.

Exponential map The exponential map in $\mathbb{S}^n_K$ is
$$\exp^K_x(v) = \cos\left(\frac{\|v\|_2}{R}\right)x + R\sin\left(\frac{\|v\|_2}{R}\right)\frac{v}{\|v\|_2}.$$

Theorem A.8 (Logarithmic map in $\mathbb{S}^n_K$). For all $x, y \in \mathbb{S}^n_K$, the logarithmic map in $\mathbb{S}^n_K$ maps $y$ to a tangent vector at $x$:
$$\log^K_x(y) = \frac{\cos^{-1}(\alpha)}{\sqrt{1 - \alpha^2}}(y - \alpha x), \quad \text{where } \alpha = K\langle x, y\rangle_2.$$

Proof. Analogous to the proof of Theorem A.2. As mentioned previously,
$$y = \exp^K_x(v) = \cos\left(\frac{\|v\|_2}{R}\right)x + R\sin\left(\frac{\|v\|_2}{R}\right)\frac{v}{\|v\|_2}.$$
Solving for $v$, we obtain
$$v = \frac{\|v\|_2}{R\sin\left(\frac{\|v\|_2}{R}\right)}\left(y - \cos\left(\frac{\|v\|_2}{R}\right)x\right).$$
However, we still need to rewrite $\|v\|_2$ in evaluatable terms. Since $v \in T_x\mathbb{S}^n_K$,
$$0 = \langle x, v\rangle_2 = \frac{\|v\|_2}{R\sin\left(\frac{\|v\|_2}{R}\right)}\Big(\langle x, y\rangle_2 - \cos\left(\frac{\|v\|_2}{R}\right)\underbrace{\langle x, x\rangle_2}_{R^2}\Big),$$
hence $\cos\left(\frac{\|v\|_2}{R}\right) = \frac{\langle x, y\rangle_2}{R^2}$, and therefore
$$\|v\|_2 = R\cos^{-1}\left(\frac{\langle x, y\rangle_2}{R^2}\right) = \frac{1}{\sqrt{K}}\cos^{-1}\left(K\langle x, y\rangle_2\right) = d^K_S(x, y).$$
Plugging the result back into the first equation, we obtain
$$v = \frac{\cos^{-1}(\alpha)}{\sin(\cos^{-1}(\alpha))}\left(y - \cos(\cos^{-1}(\alpha))x\right) = \frac{\cos^{-1}(\alpha)}{\sqrt{1 - \alpha^2}}(y - \alpha x),$$
where $\alpha = \frac{1}{R^2}\langle x, y\rangle_2 = K\langle x, y\rangle_2$, and the last equality assumes $|\alpha| < 1$. This assumption holds, since for all points $x, y \in \mathbb{S}^n_K$ it holds that $|\langle x, y\rangle_2| \le R^2$, with $\langle x, y\rangle_2 = R^2$ if and only if $x = y$, due to Cauchy-Schwarz (Ratcliffe, 2006, Theorem 3.1.6). Hence, the only problematic cases are $x = y$, where it is clear that the result is $v = 0$, and the antipodal point $y = -x$, where the logarithmic map is not uniquely defined.

Parallel transport Using the generic formula for parallel transport in manifolds (Equation 1) for $x, y \in \mathbb{S}^n_K$ and $v \in T_x\mathbb{S}^n_K$, and the spherical logarithmic map formula from Theorem A.8, we derive parallel transport in $\mathbb{S}^n_K$:
$$\mathrm{PT}^K_{x\to y}(v) = v - \frac{\langle y, v\rangle_2}{R^2 + \langle x, y\rangle_2}(x + y) = v - \frac{K\langle y, v\rangle_2}{1 + K\langle x, y\rangle_2}(x + y).$$
A special form of parallel transport exists for when the source point is $\mu_0 = (R, 0, \ldots, 0)^T$:
$$\mathrm{PT}^K_{\mu_0\to y}(v) = v - \frac{\langle y, v\rangle_2}{R^2 + Ry_1}\,(y_1 + R,\ y_2,\ \ldots,\ y_{n+1})^T.$$

A.5.2 PROJECTED HYPERSPHERE

Note that all the theorems for the projected hypersphere are essentially trivial corollaries of their equivalents for the Poincaré ball (and vice versa) (Section A.4.2).
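Analogously to the hyperboloid sketch earlier, here is a minimal NumPy illustration of the spherical exponential and logarithmic maps (Theorem A.8); the names are ours and this is only a sketch of the stated formulas:

```python
import numpy as np

def sphere_exp(x, v, K):
    # exp_x^K(v) = cos(||v||_2 / R) x + R sin(||v||_2 / R) v / ||v||_2,  K > 0
    R = 1.0 / np.sqrt(K)
    vnorm = np.linalg.norm(v)
    if vnorm < 1e-12:
        return x
    return np.cos(vnorm / R) * x + R * np.sin(vnorm / R) * v / vnorm

def sphere_log(x, y, K):
    # log_x^K(y) = arccos(alpha) / sqrt(1 - alpha^2) * (y - alpha x),
    # with alpha = K * <x, y>_2  (Theorem A.8)
    alpha = K * np.dot(x, y)
    return np.arccos(alpha) / np.sqrt(1 - alpha ** 2) * (y - alpha * x)

K = 2.0
R = 1.0 / np.sqrt(K)
mu0 = np.array([R, 0.0, 0.0])                    # "north pole" of S^2_K
v = np.array([0.0, 0.2, 0.1])                    # tangent vector at mu_0
z = sphere_exp(mu0, v, K)
print(np.isclose(np.dot(z, z), 1.0 / K))         # z stays on the sphere
print(np.allclose(sphere_log(mu0, z, K), v))     # round trip recovers v
```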
Notable differences include the fact that $R^2 = 1/K$, not $R^2 = -1/K$, and that all operations use the Euclidean trigonometric functions sin, cos, and tan instead of their hyperbolic counterparts. Also, we often leverage the Pythagorean theorem, in the form $\sin^2(\alpha) + \cos^2(\alpha) = 1$.

Stereographic projection

Remark A.9 (Homeomorphism between $\mathbb{S}^n_K$ and $\mathbb{R}^n$). We notice that $\rho_K$ is not a homeomorphism between the $n$-dimensional sphere and $\mathbb{R}^n$, as it is not defined at the south pole $-\mu_0 = (-R; \mathbf{0}^T)^T$. If we additionally compactified the plane by adding a point at infinity and set it equal to the image of the south pole, $\rho_K$ would become a homeomorphism.

Theorem A.10 (Stereographic backprojected points of $\mathbb{D}^n_K$ belong to $\mathbb{S}^n_K$). For all $y \in \mathbb{D}^n_K$, it holds that $\left\langle\rho^{-1}_K(y), \rho^{-1}_K(y)\right\rangle_2 = 1/K$.

Proof. Writing out
$$\rho^{-1}_K(y) = \left(\frac{1}{\sqrt{|K|}}\,\frac{1 - K\|y\|_2^2}{K\|y\|_2^2 + 1};\ \frac{2y^T}{K\|y\|_2^2 + 1}\right)^T,$$
we compute
$$\left\langle\rho^{-1}_K(y), \rho^{-1}_K(y)\right\rangle_2 = \frac{1}{|K|}\,\frac{(K\|y\|_2^2 - 1)^2}{(K\|y\|_2^2 + 1)^2} + \frac{4\|y\|_2^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{(K\|y\|_2^2 - 1)^2 + 4K\|y\|_2^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{K^2\|y\|_2^4 + 2K\|y\|_2^2 + 1}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K}\,\frac{(K\|y\|_2^2 + 1)^2}{(K\|y\|_2^2 + 1)^2} = \frac{1}{K},$$
where we used $1/|K| = 1/K$ for $K > 0$.

Distance function The distance function in $\mathbb{D}^n_K$ is (derived from the spherical distance function using the stereographic projection $\rho_K$):
$$d_D(x, y) = d_S(\rho^{-1}_K(x), \rho^{-1}_K(y)) = \frac{1}{\sqrt{K}}\cos^{-1}\left(1 - \frac{2K\|x - y\|_2^2}{(1 + K\|x\|_2^2)(1 + K\|y\|_2^2)}\right) = \frac{1}{\sqrt{K}}\cos^{-1}\left(1 - \frac{2R^2\|x - y\|_2^2}{(R^2 + \|x\|_2^2)(R^2 + \|y\|_2^2)}\right).$$

Theorem A.11 (Distance equivalence in $\mathbb{D}^n_K$). For all $K > 0$ and for all pairs of points $x, y \in \mathbb{D}^n_K$, the spherical projected distance between them equals the gyrospace distance, $d_D(x, y) = d_{Dgyr}(x, y)$.

Proof. Proven using Mathematica (File: distance limits.ws); the proof involves heavy algebra.

Theorem A.12 (Gyrospace distance converges to Euclidean in $\mathbb{D}^n_K$). For any fixed pair of points $x, y \in \mathbb{D}^n_K$, the spherical projected gyrospace distance between them converges to the Euclidean distance (up to a constant) as $K \to 0^+$: $\lim_{K\to 0^+} d_{Dgyr}(x, y) = 2\|x - y\|_2$.

Proof.
$$\lim_{K\to 0^+} d_{Dgyr}(x, y) = 2\lim_{K\to 0^+}\frac{\tan^{-1}\left(\sqrt{K}\,\|(-x)\oplus_K y\|_2\right)}{\sqrt{K}} = 2\lim_{K\to 0^+}\frac{\tan^{-1}\left(\sqrt{K}\,\|y - x\|_2\right)}{\sqrt{K}} = 2\|y - x\|_2,$$
where the second equality holds because of the theorem on limits of composed functions applied to $g(K) = \|(-x)\oplus_K y\|_2$, for which $\lim_{K\to 0^+} g(K) = \|y - x\|_2$ due to Theorem A.14. Additionally, for the last equality we need the fact that $\lim_{t\to 0}\tan^{-1}(a\,t)/t = a$.

Theorem A.13 (Distance converges to Euclidean as $K \to 0^+$ in $\mathbb{D}^n_K$). For any fixed pair of points $x, y \in \mathbb{D}^n_K$, the spherical projected distance between them converges to the Euclidean distance (up to a constant) as $K \to 0^+$: $\lim_{K\to 0^+} d_D(x, y) = 2\|x - y\|_2$.

Proof. Theorems A.11 and A.12.

Exponential map Analogously to the derivation of the exponential map in $\mathbb{P}^n_K$ in Ganea et al. (2018a, Sections 2.3 and 2.4), we can derive Möbius scalar multiplication in $\mathbb{D}^n_K$:
$$r \otimes_K x = \frac{1}{i\sqrt{K}}\tanh\left(r\tanh^{-1}\left(i\sqrt{K}\,\|x\|_2\right)\right)\frac{x}{\|x\|_2} = \frac{1}{i\sqrt{K}}\tanh\left(r\,i\tan^{-1}\left(\sqrt{K}\,\|x\|_2\right)\right)\frac{x}{\|x\|_2} = \frac{1}{\sqrt{K}}\tan\left(r\tan^{-1}\left(\sqrt{K}\,\|x\|_2\right)\right)\frac{x}{\|x\|_2},$$
where we use the facts that $\tanh^{-1}(ix) = i\tan^{-1}(x)$ and $\tanh(ix) = i\tan(x)$. We can easily see that $1 \otimes_K x = x$. Hence, the geodesic has the form $\gamma_{x\to y}(t) = x \oplus_K t \otimes_K ((-x) \oplus_K y)$, and therefore the exponential map in $\mathbb{D}^n_K$ is
$$\exp^K_x(v) = x \oplus_K \left(\tan\left(\sqrt{K}\,\frac{\lambda^K_x\|v\|_2}{2}\right)\frac{v}{\sqrt{K}\,\|v\|_2}\right).$$
The inverse formula can also be computed:
$$\log^K_x(y) = \frac{2}{\sqrt{K}\,\lambda^K_x}\tan^{-1}\left(\sqrt{K}\,\|(-x)\oplus_K y\|_2\right)\frac{(-x)\oplus_K y}{\|(-x)\oplus_K y\|_2}.$$
In the case of $x := \mu_0 = (0, \ldots, 0)^T$ they simplify to:
$$\exp^K_{\mu_0}(v) = \tan\left(\sqrt{K}\,\|v\|_2\right)\frac{v}{\sqrt{K}\,\|v\|_2}, \qquad \log^K_{\mu_0}(y) = \tan^{-1}\left(\sqrt{K}\,\|y\|_2\right)\frac{y}{\sqrt{K}\,\|y\|_2}.$$
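The following NumPy sketch numerically checks Theorems A.3 and A.10, using the coordinate form of $\rho^{-1}_K$ as we reconstructed it from their (partially garbled) statements; both the helper names and that exact coordinate reading are our assumptions:

```python
import numpy as np

def inv_stereographic(y, K):
    # rho_K^{-1}(y): map a point of D^n_K (K > 0) or P^n_K (K < 0) back to the
    # ambient-space model S^n_K or H^n_K (coordinates as read from A.3 / A.10):
    #   first coordinate:  (1 - K ||y||^2) / (sqrt(|K|) (1 + K ||y||^2))
    #   remaining coords:  2 y / (1 + K ||y||^2)
    y2 = np.dot(y, y)
    den = 1.0 + K * y2
    x0 = (1.0 - K * y2) / (np.sqrt(abs(K)) * den)
    return np.concatenate(([x0], 2.0 * y / den))

def curved_inner(x, K):
    # <x, x>_2 for K > 0 (sphere), <x, x>_L for K < 0 (hyperboloid)
    s = np.dot(x[1:], x[1:])
    return x[0] ** 2 + s if K > 0 else -x[0] ** 2 + s

for K in (0.7, -0.7):
    y = np.array([0.25, -0.1])
    x = inv_stereographic(y, K)
    print(np.isclose(curved_inner(x, K), 1.0 / K))   # True: x lies on S^2_K / H^2_K
```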
Parallel transport Similarly to the Poincaré ball, we can derive the parallel transport operation for the projected sphere:
$$\mathrm{PT}^K_{x\to y}(v) = \frac{\lambda^K_x}{\lambda^K_y}\,\mathrm{gyr}[y, -x]v, \qquad \mathrm{PT}^K_{\mu_0\to y}(v) = \frac{2}{\lambda^K_y}\,v, \qquad \mathrm{PT}^K_{x\to\mu_0}(v) = \frac{\lambda^K_x}{2}\,v,$$
where $\mathrm{gyr}[x, y]v = -(x \oplus_K y) \oplus_K (x \oplus_K (y \oplus_K v))$ is the gyration operation (Ungar, 2008, Definition 1.11). Unfortunately, on the projected sphere, $\langle\cdot,\cdot\rangle_x$ has a form that changes with respect to $x$, similarly to the Poincaré ball and unlike in the hypersphere.

A.6 MISCELLANEOUS PROPERTIES

Theorem A.14 (Möbius addition converges to Euclidean vector addition). $\lim_{K\to 0}(x \oplus_K y) = x + y$.

Note: This theorem works from both sides, hence it applies to the Poincaré ball as well as to the projected spherical space. Observe that the Möbius addition has the same form for both spaces.

Proof.
$$\lim_{K\to 0}(x \oplus_K y) = \lim_{K\to 0}\left[\frac{(1 - 2K\langle x, y\rangle_2 - K\|y\|_2^2)x + (1 + K\|x\|_2^2)y}{1 - 2K\langle x, y\rangle_2 + K^2\|x\|_2^2\|y\|_2^2}\right] = x + y.$$

Theorem A.15 ($\rho^{-1}_K$ is the inverse stereographic projection). For all $(\xi; x^T)^T \in \mathbb{M}^n_K$ with $\xi \in \mathbb{R}$, it holds that $\rho^{-1}_K(\rho_K((\xi; x^T)^T)) = (\xi; x^T)^T$, where $\mathbb{M} \in \{\mathbb{S}, \mathbb{H}\}$.

Proof.
$$\rho^{-1}_K(\rho_K((\xi; x^T)^T)) = \rho^{-1}_K\left(\frac{x}{1 + \sqrt{|K|}\,\xi}\right),$$
and the argument of $\rho^{-1}_K$ satisfies
$$K\left\|\frac{x}{1 + \sqrt{|K|}\,\xi}\right\|_2^2 = \frac{K\|x\|_2^2}{(1 + \sqrt{|K|}\,\xi)^2} = \frac{(1 - \sqrt{|K|}\,\xi)(1 + \sqrt{|K|}\,\xi)}{(1 + \sqrt{|K|}\,\xi)^2} = \frac{1 - \sqrt{|K|}\,\xi}{1 + \sqrt{|K|}\,\xi},$$
where we observe that $K\|x\|_2^2 = 1 - |K|\xi^2$, because $(\xi; x^T)^T \in \mathbb{M}^n_K$. Therefore
$$\rho^{-1}_K\left(\frac{x}{1 + \sqrt{|K|}\,\xi}\right) = \left(\frac{1}{\sqrt{|K|}}\,\frac{1 - \frac{1 - \sqrt{|K|}\,\xi}{1 + \sqrt{|K|}\,\xi}}{1 + \frac{1 - \sqrt{|K|}\,\xi}{1 + \sqrt{|K|}\,\xi}};\ \frac{\frac{2x^T}{1 + \sqrt{|K|}\,\xi}}{1 + \frac{1 - \sqrt{|K|}\,\xi}{1 + \sqrt{|K|}\,\xi}}\right)^T = \left(\frac{1}{\sqrt{|K|}}\,\frac{2\sqrt{|K|}\,\xi}{2};\ \frac{2x^T}{2}\right)^T = (\xi; x^T)^T.$$

Lemma A.16 ($\lambda^K_x$ converges to 2 as $K \to 0$). For all $x$ in $\mathbb{P}^n_K$ or $\mathbb{D}^n_K$, it holds that $\lim_{K\to 0}\lambda^K_x = 2$.

Proof. $\lim_{K\to 0}\lambda^K_x = \lim_{K\to 0}\frac{2}{1 + K\|x\|_2^2} = 2$.

Theorem A.17 ($\exp^K_x(v)$ converges to $x + v$ as $K \to 0$). For all $x$ in the Poincaré ball $\mathbb{P}^n_K$ or the projected sphere $\mathbb{D}^n_K$ and $v \in T_x\mathcal{M}$, it holds that $\lim_{K\to 0}\exp^K_x(v) = \exp_x(v) = x + v$, hence the exponential map converges to its Euclidean variant.

Proof. For the positive case $K > 0$,
$$\lim_{K\to 0^+}\exp^K_x(v) = \lim_{K\to 0^+}\left[x \oplus_K \left(\tan\left(\sqrt{K}\,\frac{\lambda^K_x\|v\|_2}{2}\right)\frac{v}{\sqrt{K}\,\|v\|_2}\right)\right] = x + \frac{v}{\|v\|_2}\lim_{K\to 0^+}\frac{\tan\left(\sqrt{K}\,\frac{\lambda^K_x\|v\|_2}{2}\right)}{\sqrt{K}} = x + \frac{v}{\|v\|_2}\,\|v\|_2 = x + v,$$
due to several applications of the theorem on limits of composed functions, Theorem A.14, Lemma A.16, and the fact that $\lim_{\alpha\to 0^+}\tan(\sqrt{\alpha}\,a)/\sqrt{\alpha} = a$. The negative case $K < 0$ is analogous.

Theorem A.18 ($\log^K_x(y)$ converges to $y - x$ as $K \to 0$). For all $x, y$ in the Poincaré ball $\mathbb{P}^n_K$ or the projected sphere $\mathbb{D}^n_K$, it holds that $\lim_{K\to 0}\log^K_x(y) = \log_x(y) = y - x$, hence the logarithmic map converges to its Euclidean variant.

Proof. Firstly, $z := (-x) \oplus_K y \xrightarrow{K\to 0} y - x$, due to Theorem A.14. For the positive case $K > 0$,
$$\lim_{K\to 0^+}\log^K_x(y) = \lim_{K\to 0^+}\frac{2}{\sqrt{K}\,\lambda^K_x}\tan^{-1}\left(\sqrt{K}\,\|z\|_2\right)\frac{z}{\|z\|_2} = \lim_{K\to 0^+}\frac{2}{\lambda^K_x}\cdot\lim_{K\to 0^+}\frac{\tan^{-1}\left(\sqrt{K}\,\|z\|_2\right)}{\sqrt{K}\,\|z\|_2}\cdot\lim_{K\to 0^+}z = 1 \cdot 1 \cdot (y - x) = y - x,$$
due to several applications of the theorem on limits of composed functions, the product rule for limits, Lemma A.16, and the fact that $\lim_{\alpha\to 0^+}\tan^{-1}(\sqrt{\alpha}\,a)/\sqrt{\alpha} = a$. The negative case $K < 0$ is analogous.

Lemma A.19 ($\mathrm{gyr}[x, y]v$ converges to $v$ as $K \to 0$). For all $x, y$ in the Poincaré ball $\mathbb{P}^n_K$ or the projected sphere $\mathbb{D}^n_K$ and $v \in T_x\mathcal{M}$, it holds that $\lim_{K\to 0}\mathrm{gyr}[x, y]v = v$, hence gyration converges to the identity function.

Proof.
$$\lim_{K\to 0}\mathrm{gyr}[x, y]v = \lim_{K\to 0}\left(-(x \oplus_K y) \oplus_K (x \oplus_K (y \oplus_K v))\right) = -(x + y) + (x + (y + v)) = -x - y + x + y + v = v,$$
due to Theorem A.14 and the theorem on limits of composed functions.

Theorem A.20 ($\mathrm{PT}^K_{x\to y}(v)$ converges to $v$ as $K \to 0$). For all $x, y$ in the Poincaré ball $\mathbb{P}^n_K$ or the projected sphere $\mathbb{D}^n_K$ and $v \in T_x\mathcal{M}$, it holds that $\lim_{K\to 0}\mathrm{PT}^K_{x\to y}(v) = v$.

Proof.
$$\lim_{K\to 0}\mathrm{PT}^K_{x\to y}(v) = \lim_{K\to 0}\frac{\lambda^K_x}{\lambda^K_y}\,\mathrm{gyr}[y, -x]v = \underbrace{\lim_{K\to 0}\frac{\lambda^K_x}{\lambda^K_y}}_{=\,1}\cdot\underbrace{\lim_{K\to 0}\mathrm{gyr}[y, -x]v}_{=\,v} = v,$$
due to the product rule for limits, Lemma A.16, and Lemma A.19.
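To illustrate the limiting behavior collected in this section, here is a NumPy sketch of gyration and the gyrovector parallel transport defined above, evaluated for shrinking |K|; the helper names are ours, and the printout illustrates Theorem A.20 (and, implicitly, Theorem A.14 and Lemma A.19):

```python
import numpy as np

def mobius_add(x, y, K):
    # Moebius addition from Theorem A.14; valid for K < 0 and K > 0.
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * K * xy - K * y2) * x + (1 + K * x2) * y
    den = 1 - 2 * K * xy + K ** 2 * x2 * y2
    return num / den

def gyration(x, y, v, K):
    # gyr[x, y]v = -(x (+)_K y) (+)_K (x (+)_K (y (+)_K v))
    return mobius_add(-mobius_add(x, y, K),
                      mobius_add(x, mobius_add(y, v, K), K), K)

def lam(x, K):
    # conformal factor lambda_x^K = 2 / (1 + K ||x||^2)  (cf. Lemma A.16)
    return 2.0 / (1.0 + K * np.dot(x, x))

def parallel_transport(x, y, v, K):
    # PT_{x -> y}^K(v) = (lambda_x^K / lambda_y^K) gyr[y, -x] v
    return lam(x, K) / lam(y, K) * gyration(y, -x, v, K)

x, y, v = np.array([0.1, -0.2]), np.array([0.3, 0.05]), np.array([0.4, 0.2])
for K in (-1.0, -1e-2, -1e-5, 1e-5, 1e-2):
    print(K, parallel_transport(x, y, v, K))
# As |K| -> 0 the transported vector approaches v itself (Theorem A.20).
```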
B PROBABILITY DETAILS

B.1 WRAPPED NORMAL DISTRIBUTIONS

Theorem B.1 (Probability density function of $\mathcal{WN}(z; \mu, \Sigma)$ in $\mathbb{H}^n_K$).
$$\log\mathcal{WN}(z; \mu, \Sigma) = \log\mathcal{N}(v; 0, \Sigma) - (n-1)\log\left(R\,\frac{\sinh\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L}\right),$$
where $u = \log^K_\mu(z)$, $v = \mathrm{PT}^K_{\mu\to\mu_0}(u)$, and $R = 1/\sqrt{-K}$.

Proof. This was shown for the case $K = -1$ by Nagano et al. (2019). The difference is that we do not assume a unitary radius $R = 1 = 1/\sqrt{-K}$. Hence, our transformation function has the form $f = \exp^K_\mu \circ \mathrm{PT}^K_{\mu_0\to\mu}$, and $f^{-1} = \mathrm{PT}^K_{\mu\to\mu_0} \circ \log^K_\mu$.

The derivative of parallel transport $\mathrm{PT}^K_{x\to y}(v)$ for any $x, y \in \mathbb{H}^n_K$ and $v \in T_x\mathbb{H}^n_K$ is a map $\mathrm{d}\,\mathrm{PT}^K_{x\to y}(v): T_v(T_x\mathbb{H}^n_K) \to T_{\mathrm{PT}^K_{x\to y}(v)}(T_y\mathbb{H}^n_K)$. Using an orthonormal basis (with respect to the Lorentz product) $\{\xi_1, \ldots, \xi_n\}$, we can compute the determinant by computing the change with respect to each basis vector:
$$\mathrm{d}\,\mathrm{PT}^K_{x\to y}(\xi) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\mathrm{PT}^K_{x\to y}(v + \epsilon\xi) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[(v + \epsilon\xi) + \frac{\langle y, v + \epsilon\xi\rangle_L}{R^2 - \langle x, y\rangle_L}(x + y)\right] = \xi + \frac{\langle y, \xi\rangle_L}{R^2 - \langle x, y\rangle_L}(x + y) = \mathrm{PT}^K_{x\to y}(\xi).$$
Since parallel transport preserves norms and vectors in the orthonormal basis have norm 1, the change is $\|\mathrm{d}\,\mathrm{PT}^K_{x\to y}(\xi)\|_L = \|\mathrm{PT}^K_{x\to y}(\xi)\|_L = 1$.

For computing the determinant of the exponential map Jacobian, we choose the orthonormal basis $\{\xi_1 = u/\|u\|_L, \xi_2, \ldots, \xi_n\}$, where we complete the basis based on the first vector. We again look at the change with respect to each basis vector. For the basis vector $\xi_1$:
$$\mathrm{d}\exp^K_x(\xi_1) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\exp^K_x\left(u + \epsilon\frac{u}{\|u\|_L}\right) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[\cosh\left(\frac{|\|u\|_L + \epsilon|}{R}\right)x + R\sinh\left(\frac{|\|u\|_L + \epsilon|}{R}\right)\frac{u}{\|u\|_L}\right] = \frac{\sinh\left(\frac{\|u\|_L}{R}\right)}{R}x + \cosh\left(\frac{\|u\|_L}{R}\right)\frac{u}{\|u\|_L},$$
where the second equality is due to $\left\|u + \epsilon\frac{u}{\|u\|_L}\right\|_L = \left|1 + \frac{\epsilon}{\|u\|_L}\right|\,\|u\|_L = |\|u\|_L + \epsilon|$.

For every other basis vector $\xi_k$ with $k > 1$:
$$\mathrm{d}\exp^K_x(\xi_k) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\exp^K_x(u + \epsilon\xi_k) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[\cosh\left(\frac{\|u + \epsilon\xi_k\|_L}{R}\right)x + R\sinh\left(\frac{\|u + \epsilon\xi_k\|_L}{R}\right)\frac{u + \epsilon\xi_k}{\|u + \epsilon\xi_k\|_L}\right] = \frac{R^2\|u\|_L^2\sinh\left(\frac{\|u\|_L}{R}\right)}{R\left(\|u\|_L^2\right)^{3/2}}\xi_k = R\,\frac{\sinh\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L}\xi_k,$$
where the computation uses the fact that $\|u + \epsilon\xi_k\|_L^2 = \|u\|_L^2 + \epsilon^2\|\xi_k\|_L^2 + 2\epsilon\langle u, \xi_k\rangle_L = \|u\|_L^2 + \epsilon^2$, which relies on the basis being orthogonal and $u$ being parallel to $\xi_1 = u/\|u\|_L$, hence orthogonal to all the other basis vectors.

Because the basis is orthonormal, the determinant is a product of the norms of the computed change for each basis vector. Therefore, $\det\left(\frac{\partial\,\mathrm{PT}^K_{x\to y}(v)}{\partial v}\right) = 1$. Additionally, the following two properties hold:
$$\left\|\mathrm{d}\exp^K_x(\xi_1)\right\|_L^2 = \sinh^2\left(\frac{\|u\|_L}{R}\right)\frac{\|x\|_L^2}{R^2} + \cosh^2\left(\frac{\|u\|_L}{R}\right)\frac{\|u\|_L^2}{\|u\|_L^2} = -\sinh^2\left(\frac{\|u\|_L}{R}\right) + \cosh^2\left(\frac{\|u\|_L}{R}\right) = 1,$$
$$\left\|\mathrm{d}\exp^K_x(\xi_k)\right\|_L^2 = R^2\,\frac{\sinh^2\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L^2}\|\xi_k\|_L^2 = R^2\,\frac{\sinh^2\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L^2}.$$
Therefore, we obtain
$$\det\left(\frac{\partial\exp^K_x(u)}{\partial u}\right) = \left(R\,\frac{\sinh\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L}\right)^{n-1}, \qquad \det\left(\frac{\partial f(v)}{\partial v}\right) = \det\left(\frac{\partial\exp^K_\mu(u)}{\partial u}\right)\det\left(\frac{\partial\,\mathrm{PT}^K_{\mu_0\to\mu}(v)}{\partial v}\right) = \left(R\,\frac{\sinh\left(\frac{\|u\|_L}{R}\right)}{\|u\|_L}\right)^{n-1},$$
which, together with the change-of-variables formula, yields the claim.

Theorem B.2 (Probability density function of $\mathcal{WN}(z; \mu, \Sigma)$ in $\mathbb{S}^n_K$).
$$\log\mathcal{WN}(z; \mu, \Sigma) = \log\mathcal{N}(v; 0, \Sigma) - (n-1)\log\left(R\,\frac{\sin\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2}\right),$$
where $u = \log^K_\mu(z)$, $v = \mathrm{PT}^K_{\mu\to\mu_0}(u)$, and $R = 1/\sqrt{K}$.

Proof. The theorem is very similar to Theorem B.1. The difference is that the manifold changes from $\mathbb{H}^n_K$ to $\mathbb{S}^n_K$, hence $K > 0$. Our transformation function has the form $f = \exp^K_\mu \circ \mathrm{PT}^K_{\mu_0\to\mu}$, and $f^{-1} = \mathrm{PT}^K_{\mu\to\mu_0} \circ \log^K_\mu$. The derivative of parallel transport $\mathrm{PT}^K_{x\to y}(v)$ for any $x, y \in \mathbb{S}^n_K$ and $v \in T_x\mathbb{S}^n_K$ is a map $\mathrm{d}\,\mathrm{PT}^K_{x\to y}(v): T_v(T_x\mathbb{S}^n_K) \to T_{\mathrm{PT}^K_{x\to y}(v)}(T_y\mathbb{S}^n_K)$. Using an orthonormal basis (with respect to the Euclidean inner product) $\{\xi_1, \ldots, \xi_n\}$, we can compute the determinant by computing the change with respect to each basis vector:
$$\mathrm{d}\,\mathrm{PT}^K_{x\to y}(\xi) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\mathrm{PT}^K_{x\to y}(v + \epsilon\xi) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[(v + \epsilon\xi) - \frac{\langle y, v + \epsilon\xi\rangle_2}{R^2 + \langle x, y\rangle_2}(x + y)\right] = \xi - \frac{\langle y, \xi\rangle_2}{R^2 + \langle x, y\rangle_2}(x + y) = \mathrm{PT}^K_{x\to y}(\xi).$$
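The sampling-and-scoring procedure implied by Theorem B.1 can be sketched compactly. The snippet below assumes a spherical covariance $\Sigma = \sigma^2 I$ and uses the $\mu_0$-based exponential map and parallel transport from Section A.4.1; all helper names are ours and this is only an illustration, not the released mvae implementation:

```python
import numpy as np

# Sketch: sampling from and scoring the Wrapped Normal WN(mu, sigma^2 I) on the
# hyperboloid H^n_K (K < 0), following Theorem B.1.

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def hyperboloid_exp(x, v, K):
    R = 1.0 / np.sqrt(-K)
    vn = np.sqrt(lorentz_inner(v, v))
    return np.cosh(vn / R) * x + R * np.sinh(vn / R) * v / vn

def pt_mu0_to(y, v, K):
    # PT_{mu_0 -> y}^K(v) = v + <y, v>_2 / (R^2 + R y_1) * (y + mu_0)
    R = 1.0 / np.sqrt(-K)
    mu0 = np.concatenate(([R], np.zeros(len(y) - 1)))
    return v + np.dot(y, v) / (R ** 2 + R * y[0]) * (y + mu0)

def wrapped_normal_sample_and_logprob(mu, sigma, K, rng):
    R = 1.0 / np.sqrt(-K)
    n = len(mu) - 1
    v_tilde = rng.normal(0.0, sigma, size=n)     # v ~ N(0, sigma^2 I) in T_{mu_0}
    v = np.concatenate(([0.0], v_tilde))
    u = pt_mu0_to(mu, v, K)                      # u = PT_{mu_0 -> mu}(v)
    z = hyperboloid_exp(mu, u, K)                # z = exp_mu^K(u)
    # log WN(z) = log N(v; 0, sigma^2 I) - (n - 1) log(R sinh(||u||_L/R) / ||u||_L)
    un = np.sqrt(lorentz_inner(u, u))
    log_normal = -0.5 * (np.sum((v_tilde / sigma) ** 2)
                         + n * np.log(2 * np.pi * sigma ** 2))
    log_det = (n - 1) * np.log(R * np.sinh(un / R) / un)
    return z, log_normal - log_det

K, sigma = -1.5, 0.7
R = 1.0 / np.sqrt(-K)
mu = np.array([np.sqrt(0.3 ** 2 + 0.1 ** 2 + R ** 2), 0.3, 0.1])   # point of H^2_K
z, logp = wrapped_normal_sample_and_logprob(mu, sigma, K, np.random.default_rng(0))
print(np.isclose(lorentz_inner(z, z), 1.0 / K), logp)              # z stays on H^2_K
```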
Since parallel transport preserves norms and vectors in the orthonormal basis have norm 1, the change is $\|\mathrm{d}\,\mathrm{PT}^K_{x\to y}(\xi)\|_2 = \|\mathrm{PT}^K_{x\to y}(\xi)\|_2 = 1$.

For computing the determinant of the exponential map Jacobian, we choose the orthonormal basis $\{\xi_1 = u/\|u\|_2, \xi_2, \ldots, \xi_n\}$, where we complete the basis based on the first vector. We again look at the change with respect to each basis vector. For the basis vector $\xi_1$:
$$\mathrm{d}\exp^K_x(\xi_1) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\exp^K_x\left(u + \epsilon\frac{u}{\|u\|_2}\right) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[\cos\left(\frac{|\|u\|_2 + \epsilon|}{R}\right)x + R\sin\left(\frac{|\|u\|_2 + \epsilon|}{R}\right)\frac{u}{\|u\|_2}\right] = -\frac{\sin\left(\frac{\|u\|_2}{R}\right)}{R}x + \cos\left(\frac{\|u\|_2}{R}\right)\frac{u}{\|u\|_2},$$
where the second equality is due to $\left\|u + \epsilon\frac{u}{\|u\|_2}\right\|_2 = \left|1 + \frac{\epsilon}{\|u\|_2}\right|\,\|u\|_2 = |\|u\|_2 + \epsilon|$. For every other basis vector $\xi_k$ with $k > 1$:
$$\mathrm{d}\exp^K_x(\xi_k) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\exp^K_x(u + \epsilon\xi_k) = \left.\frac{\partial}{\partial\epsilon}\right|_{\epsilon=0}\left[\cos\left(\frac{\|u + \epsilon\xi_k\|_2}{R}\right)x + R\sin\left(\frac{\|u + \epsilon\xi_k\|_2}{R}\right)\frac{u + \epsilon\xi_k}{\|u + \epsilon\xi_k\|_2}\right] = \frac{R^2\|u\|_2^2\sin\left(\frac{\|u\|_2}{R}\right)}{R\left(\|u\|_2^2\right)^{3/2}}\xi_k = R\,\frac{\sin\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2}\xi_k,$$
where the computation uses the fact that $\|u + \epsilon\xi_k\|_2^2 = \|u\|_2^2 + \epsilon^2\|\xi_k\|_2^2 + 2\epsilon\langle u, \xi_k\rangle_2 = \|u\|_2^2 + \epsilon^2$, which relies on the basis being orthogonal and $u$ being parallel to $\xi_1 = u/\|u\|_2$, hence orthogonal to all the other basis vectors.

Because the basis is orthonormal, the determinant is a product of the norms of the computed change for each basis vector. Therefore, $\det\left(\frac{\partial\,\mathrm{PT}^K_{x\to y}(v)}{\partial v}\right) = 1$. Additionally, the following two properties hold:
$$\left\|\mathrm{d}\exp^K_x(\xi_1)\right\|_2^2 = \sin^2\left(\frac{\|u\|_2}{R}\right)\frac{\|x\|_2^2}{R^2} + \cos^2\left(\frac{\|u\|_2}{R}\right)\frac{\|u\|_2^2}{\|u\|_2^2} = \sin^2\left(\frac{\|u\|_2}{R}\right) + \cos^2\left(\frac{\|u\|_2}{R}\right) = 1,$$
$$\left\|\mathrm{d}\exp^K_x(\xi_k)\right\|_2^2 = R^2\,\frac{\sin^2\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2^2}\|\xi_k\|_2^2 = R^2\,\frac{\sin^2\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2^2}.$$
Therefore, we obtain
$$\det\left(\frac{\partial\exp^K_x(u)}{\partial u}\right) = \left(R\,\frac{\sin\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2}\right)^{n-1}, \qquad \det\left(\frac{\partial f(v)}{\partial v}\right) = \det\left(\frac{\partial\exp^K_\mu(u)}{\partial u}\right)\det\left(\frac{\partial\,\mathrm{PT}^K_{\mu_0\to\mu}(v)}{\partial v}\right) = \left(R\,\frac{\sin\left(\frac{\|u\|_2}{R}\right)}{\|u\|_2}\right)^{n-1},$$
which, together with the change-of-variables formula, yields the claim.

Theorem B.3 (Probability density function of $\mathcal{WN}(z; \mu, \Sigma)$ in $\mathbb{P}^n_K$).
$$\log\mathcal{WN}_{\mathbb{P}^n_K}(z; \mu, \Sigma) = \log\mathcal{WN}_{\mathbb{H}^n_K}(\rho^{-1}_K(z); \rho^{-1}_K(\mu), \Sigma).$$

Proof. Follows from Theorems B.1 and A.3. Also proven by Mathieu et al. (2019) in a slightly different form for a scalar scale parameter $\mathcal{WN}(z; \mu, \sigma^2 I)$: given $\log\mathcal{N}(z; \mu, \sigma^2 I) = -\frac{d_E(\mu, z)^2}{2\sigma^2} - \frac{n}{2}\log\left(2\pi\sigma^2\right)$, it holds that
$$\log\mathcal{WN}(z; \mu, \sigma^2 I) = -\frac{d^K_P(\mu, z)^2}{2\sigma^2} - \frac{n}{2}\log\left(2\pi\sigma^2\right) + (n-1)\log\left(\frac{\sqrt{-K}\,d^K_P(\mu, z)}{\sinh\left(\sqrt{-K}\,d^K_P(\mu, z)\right)}\right).$$

Theorem B.4 (Probability density function of $\mathcal{WN}(z; \mu, \Sigma)$ in $\mathbb{D}^n_K$).
$$\log\mathcal{WN}_{\mathbb{D}^n_K}(z; \mu, \Sigma) = \log\mathcal{WN}_{\mathbb{S}^n_K}(\rho^{-1}_K(z); \rho^{-1}_K(\mu), \Sigma).$$

Proof. Follows from Theorem B.2 and Theorem A.10 (Theorem A.3 adapted from $\mathbb{P}$ to $\mathbb{D}$).

C RELATED WORK

Universal models of geometry Duality between spaces of constant curvature was first noticed by Lambert (1770), and later gave rise to various theorems that have the same or similar forms in all three geometries, like the law of sines (Bolyai, 1832)
$$\frac{\sin A}{p_K(a)} = \frac{\sin B}{p_K(b)} = \frac{\sin C}{p_K(c)},$$
where $p_K(r) = 2\pi\sin_K(r)$ denotes the circumference of a circle of radius $r$ in a space of constant curvature $K$, and
$$\sin_K(x) = x - \frac{Kx^3}{3!} + \frac{K^2x^5}{5!} - \cdots = \sum_{i=0}^{\infty}\frac{(-1)^i K^i x^{2i+1}}{(2i+1)!}.$$
Other unified formulas for the law of cosines, and recently a unified Pythagorean theorem, have also been proposed (Foote, 2017):
$$A(c) = A(a) + A(b) - \frac{K}{2\pi}A(a)A(b),$$
where $A(r)$ is the area of a circle of radius $r$ in a space of constant curvature $K$. Unfortunately, in this formulation $A(r)$ still depends on the sign of $K$ with respect to the choice of trigonometric functions in its definition.

There also exist approaches defining a universal geometry of constant curvature spaces. Li et al. (2001, Chapter 4) define a unified model of all three geometries using the null cone (light cone) of a Minkowski space. The term Minkowski space comes from special relativity and is usually denoted as $\mathbb{R}^{1,n}$, similar to the ambient space of what we defined as $\mathbb{H}^n$, with the Lorentz scalar product $\langle\cdot,\cdot\rangle_L$.
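As a small aside on the curvature-unified trigonometry above, the following NumPy sketch evaluates sin_K through its closed forms, which agree with the stated power series (the function names are ours):

```python
import numpy as np

def sin_k(x, K):
    # sin_K(x) = sum_{i >= 0} (-1)^i K^i x^(2i+1) / (2i+1)!
    # Closed forms: sin(sqrt(K) x)/sqrt(K) for K > 0, x for K = 0,
    #               sinh(sqrt(-K) x)/sqrt(-K) for K < 0.
    if K > 0:
        return np.sin(np.sqrt(K) * x) / np.sqrt(K)
    if K < 0:
        return np.sinh(np.sqrt(-K) * x) / np.sqrt(-K)
    return x

def circle_circumference(r, K):
    # p_K(r) = 2 * pi * sin_K(r): circumference of a circle of radius r
    # in a space of constant curvature K.
    return 2.0 * np.pi * sin_k(r, K)

for K in (1.0, 0.0, -1.0):
    print(K, circle_circumference(1.0, K))
# K > 0 gives a shorter circumference than 2*pi*r, K < 0 a longer one.
```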
The hyperboloid $\mathbb{H}^n$ corresponds to the positive (upper, future) null cone of $\mathbb{R}^{1,n}$. All the other models can be defined in this space using the appropriate stereographic projections and by pulling back the metric onto the specific sub-manifold. Unfortunately, we found the formalism not directly useful for our application, apart from providing a very interesting theoretical connection among the models.

Concurrent VAE approaches The variational autoencoder was originally proposed in Kingma & Welling (2014) and concurrently in Rezende et al. (2014). One of the most common improvements on the VAE in practice is the choice of the encoder and decoder maps, ranging from linear parametrizations of the posterior to feed-forward neural networks, convolutional neural networks, etc. For different data domains, extensions like the Graph VAE (Simonovsky & Komodakis, 2018), which uses graph convolutional neural networks for the encoder and decoder, were proposed. The basic VAE framework has mostly been improved upon by using autoregressive flows (Chen et al., 2014) or small changes to the ELBO loss function (Matthey et al., 2017; Burda et al., 2016). An important work in this area is the β-VAE, which adds a simple scalar multiplicative constant to the KL divergence term in the ELBO and has been shown to improve both sample quality and (if β > 1) disentanglement of different dimensions in the latent representation. For more information on disentanglement, see Locatello et al. (2018).

Geometric deep learning One of the emerging trends in deep learning has been to leverage non-Euclidean geometry to learn representations, originally emerging from knowledge-base and graph representation learning (Bronstein et al., 2017). Recently, several approaches to learning representations in Euclidean spaces have been generalized to non-Euclidean spaces (Dhingra et al., 2018; Ganea et al., 2018b; Nickel & Kiela, 2017). Since then, this research direction has grown immensely and accumulated more approaches, mostly for hyperbolic spaces, like Ganea et al. (2018a); Nickel & Kiela (2018); Tifrea et al. (2019); Law et al. (2019). Similarly, spherical spaces have also been leveraged for learning non-Euclidean representations (Batmanghelich et al., 2016; Wilson & Hancock, 2010). To be able to learn representations in these spaces, new Riemannian optimization methods were required as well (Wilson & Leimeister, 2018; Bonnabel, 2013; Bécigneul & Ganea, 2019).

The generalization to products of constant curvature Riemannian manifolds is only natural and has been proposed by Gu et al. (2019). They evaluated their approach by directly optimizing a distance-based loss function using Riemannian optimization in products of spaces on graph reconstruction and word analogy tasks, in both cases reaping the benefits of non-Euclidean geometry, especially when learning lower-dimensional representations. Further use of product spaces with constant curvature components to train Graph Convolutional Networks was done concurrently with this work by Bachmann et al. (2020).

Geometry in VAEs One of the first attempts at leveraging geometry in VAEs was Arvanitidis et al. (2018). They examine how a Euclidean VAE benefits both in sample quality and latent representation distribution quality when employing a non-Euclidean Riemannian metric in the latent space using kernel transformations. Hence, a potential improvement area of VAEs is the choice of the posterior family and prior distribution.
However, the Gaussian (Normal) distribution works very well in practice, as it is the maximum entropy probability distribution for a given mean and variance, and it imposes no constraints on higher-order moments (skewness, kurtosis, etc.) of the distribution.

Recently, non-Euclidean geometry has been used in learning variational autoencoders as well. Generalizing Normal distributions to these spaces is in general non-trivial. Two similar approaches, Davidson et al. (2018) and Xu & Durrett (2018), used the von Mises-Fisher distribution on the unit hypersphere to generalize VAEs to spherical spaces. The von Mises-Fisher distribution is again a maximum entropy probability distribution on the unit hypersphere, but it only has a spherical covariance parameter, which makes it less general than a Gaussian distribution. Conversely, two approaches, Mathieu et al. (2019) and Nagano et al. (2019), have generalized VAEs to hyperbolic spaces, namely the Poincaré ball and the hyperboloid, respectively. They both adopt a non-maximum-entropy probability distribution called the Wrapped Normal. Additionally, Mathieu et al. (2019) also derive the Riemannian Normal, which is a maximum entropy distribution on the Poincaré disk, but in practice it performs similarly to the Wrapped Normal, especially in higher dimensions.

Our approach generalizes the aforementioned geometric VAE work by employing a product-of-spaces approach similar to Gu et al. (2019) and unifying the different approaches into a single framework for all spaces of constant curvature.

D EXTENDED FUTURE WORK

Even though we have shown that one can approximate the true posterior very well with Normal-like distributions in Riemannian manifolds of constant curvature, there remain several promising directions of exploration.

First of all, an interesting extension of this work would be to try mixed-curvature VAEs on graph data, e.g. link prediction on social networks, as some of our models might be well suited for sparse and structured data. Another very beneficial extension would be to investigate why the obtained results have a relatively large variance across runs and to try to reduce it. However, this is a problem that affects the Euclidean VAE as well, even if not as severely.

Secondly, we have empirically noticed that it seems to be significantly harder to optimize our models in spherical spaces: they seem more prone to divergence. In discussions, other researchers have also observed similar behavior, but a more thorough investigation is not available at the moment. We have side-stepped some optimization problems by introducing products of spaces; previously, it has been reported that both spherical and hyperbolic VAEs generally do not scale well to dimensions greater than 20 or 40. For those cases, we could successfully optimize a subdivided space $(\mathbb{S}^2)^{36}$ instead of one big manifold $\mathbb{S}^{72}$. However, that also does not seem to be a conclusive rule. Especially in higher dimensions, we have noticed that our VAEs $(\mathbb{S}^2)^{36}$ with learnable curvature and $\mathbb{D}^{72}_1$ with fixed curvature seem to consistently diverge. In a few cases, $\mathbb{S}^{72}$ with fixed curvature and even the product $(\mathbb{E}^2)^{12} \times (\mathbb{H}^2)^{12} \times (\mathbb{S}^2)^{12}$ with learnable curvature seemed to diverge quite often as well.

The most promising future direction seems to be the use of Normalizing Flows for variational inference, as presented by Rezende & Mohamed (2015) and Gemici et al. (2016). More recently, this was also combined with autoregressive flows in Huang et al. (2018).
Using normalizing flows, one should be able to achieve the desired level of complexity of the latent distribution in a VAE, which should, similarly to our work, help approximate the true posterior of the data better. The advantage of normalizing flows is the flexibility of the modeled distributions, at the expense of a higher computational cost.

Another interesting extension would be to adapt the defined geometrical models to allow for training generative adversarial networks (GANs) (Goodfellow et al., 2014) in products of constant curvature spaces and benefit from the better sharpness and quality of samples that GANs provide. Finally, one could synthesize the above to obtain adversarially trained autoencoders in Riemannian manifolds, similarly to Pan et al. (2018); Kim et al. (2017); Makhzani et al. (2015), and aim to achieve good sample quality and a well-formed latent space at the same time.

E EXTENDED RESULTS

Figure 1: Learned curvature across epochs (with standard deviation) with latent space dimension of 6, diagonal covariance parametrization, on the MNIST dataset.

Table 6: Summary of results (mean and standard deviation) with latent space dimension of 6, spherical covariance parametrization, on the BDP dataset.

Model  LL  ELBO  BCE  KL
(S^2_1)^3  55.89±0.36  56.72±0.40  51.01±0.31  5.72±0.09
S^6_1  55.81±0.35  56.57±0.44  51.16±0.78  5.41±0.42
(vMF S^2_1)^3  57.87±1.52  58.64±1.63  53.96±2.16  4.68±0.53
vMF S^6_1  58.78±0.83  60.74±2.29  56.03±2.64  4.71±0.48
(D^2_1)^3  56.01±0.24  56.67±0.31  51.02±0.40  5.65±0.10
D^6_1  55.78±0.07  56.38±0.06  50.85±0.20  5.53±0.24
(E^2)^3  56.34±0.45  56.94±0.50  51.32±0.55  5.62±0.19
E^6  56.28±0.56  56.99±0.59  51.58±0.69  5.41±0.29
(H^2_1)^3  56.08±0.52  56.80±0.54  50.94±0.38  5.86±0.25
H^6_1  56.18±0.32  57.10±0.21  51.48±0.47  5.62±0.31
(P^2_1)^3  55.98±0.62  56.49±0.62  50.96±0.61  5.52±0.31
P^6_1  56.74±0.55  57.61±0.74  52.01±0.71  5.60±0.24
(RN P^2_1)^3  54.99±0.12  55.90±0.13  52.42±0.71  3.48±0.60
(S^2)^3  56.05±0.21  56.69±0.36  51.07±0.21  5.61±0.22
(vMF S^2)^3  57.56±0.88  57.80±0.89  52.68±1.62  5.12±0.84
S^6  56.06±0.51  56.65±0.64  50.93±0.38  5.72±0.40
vMF S^6  58.21±0.92  59.87±1.51  54.99±1.79  4.88±0.39
(D^2)^3  56.06±0.36  56.69±0.54  50.95±0.40  5.74±0.17
D^6  56.10±0.25  56.69±0.17  50.90±0.19  5.79±0.03
(H^2)^3  55.80±0.32  56.72±0.16  51.14±0.39  5.58±0.28
H^6  56.03±0.21  56.82±0.20  50.99±0.16  5.83±0.27
(P^2)^3  56.29±0.05  57.11±0.22  51.41±0.19  5.69±0.30
P^6  56.40±0.31  57.13±0.25  51.17±0.33  5.96±0.27
(RN P^2)^3  56.25±0.56  57.26±0.45  53.16±1.07  4.11±0.64
D^2 × E^2 × P^2  55.87±0.22  56.35±0.22  50.67±0.57  5.69±0.43
D^2_1 × E^2 × P^2_1  56.06±0.41  56.86±0.65  51.23±0.67  5.64±0.11
D^2 × E^2 × (RN P^2)  56.35±0.82  57.06±0.78  51.89±0.71  5.17±0.12
D^2_1 × E^2 × (RN P^2_1)  56.17±0.43  56.75±0.56  51.80±0.80  4.95±0.32
E^2 × H^2 × S^2  55.92±0.42  56.54±0.45  51.13±0.74  5.41±0.40
E^2 × H^2_1 × S^2_1  56.04±0.57  56.71±0.77  51.09±0.86  5.62±0.12
E^2 × H^2 × (vMF S^2)  55.82±0.43  56.32±0.47  51.10±0.67  5.21±0.20
E^2 × H^2_1 × (vMF S^2_1)  55.77±0.51  56.34±0.65  51.33±0.57  5.01±0.17
(U^2)^3  55.56±0.15  56.05±0.32  50.68±0.23  5.37±0.10
U^6  55.84±0.38  56.46±0.41  50.66±0.38  5.81±0.18

Table 7: Summary of results (mean and standard deviation) with latent space dimension of 6, spherical covariance parametrization, on the MNIST dataset.
Model  LL  ELBO  BCE  KL
(S^2_1)^3  96.77±0.26  101.66±0.32  87.04±0.49  14.62±0.18
(vMF S^2_1)^3  97.72±0.22  102.98±0.15  87.77±0.18  15.21±0.07
S^6_1  96.71±0.17  101.55±0.30  86.90±0.30  14.65±0.10
vMF S^6_1  97.03±0.14  102.12±0.26  87.42±0.28  14.69±0.03
(D^2_1)^3  97.84±0.10  102.75±0.22  88.43±0.12  14.33±0.13
D^6_1  98.21±0.23  103.02±0.14  88.44±0.05  14.58±0.11
(E^2)^3  97.04±0.14  101.44±0.18  86.77±0.22  14.67±0.22
E^6  97.16±0.15  101.67±0.14  87.17±0.26  14.50±0.20
(H^2_1)^3  97.31±0.09  102.20±0.29  87.81±0.23  14.39±0.13
H^6_1  97.10±0.44  101.89±0.33  87.32±0.22  14.56±0.20
(P^2_1)^3  97.56±0.04  102.33±0.22  87.93±0.32  14.40±0.10
(RN P^2_1)^3  92.54±0.19  97.19±0.21  88.42±0.20  8.76±0.04
P^6_1  97.80±0.05  102.60±0.04  88.14±0.08  14.46±0.07
(S^2)^3  96.46±0.12  101.30±0.17  86.79±0.25  14.51±0.09
(vMF S^2)^3  97.62±0.30  102.72±0.37  87.48±0.37  15.24±0.03
S^6  96.72±0.15  101.39±0.16  86.69±0.15  14.70±0.13
vMF S^6  96.72±0.18  101.55±0.21  86.82±0.23  14.73±0.02
(D^2)^3  97.68±0.24  102.51±0.44  88.11±0.34  14.41±0.11
D^6  97.72±0.15  102.31±0.16  87.70±0.22  14.61±0.06
(H^2)^3  97.37±0.13  102.07±0.24  87.56±0.30  14.51±0.11
H^6  97.47±0.16  102.18±0.20  87.64±0.23  14.53±0.07
(P^2)^3  97.62±0.05  102.34±0.16  87.92±0.16  14.43±0.06
(RN P^2)^3  94.16±0.68  98.65±0.66  89.27±0.79  9.38±0.15
P^6  97.71±0.24  102.55±0.21  88.24±0.23  14.32±0.04
D^2 × E^2 × P^2  97.48±0.18  102.22±0.29  87.85±0.17  14.37±0.13
D^2_1 × E^2 × P^2_1  97.58±0.13  102.23±0.15  87.75±0.15  14.49±0.15
D^2 × E^2 × (RN P^2)  96.43±0.47  101.31±0.51  88.82±0.50  12.50±0.03
D^2_1 × E^2 × (RN P^2_1)  96.18±0.21  100.91±0.31  88.58±0.47  12.33±0.19
E^2 × H^2 × S^2  96.80±0.20  101.60±0.33  87.13±0.19  14.47±0.17
E^2 × H^2_1 × S^2_1  96.76±0.09  101.48±0.13  86.99±0.17  14.49±0.05
E^2 × H^2 × (vMF S^2)  96.56±0.27  101.49±0.28  86.58±0.36  14.91±0.14
E^2 × H^2_1 × (vMF S^2_1)  96.76±0.39  101.82±0.13  87.08±0.06  14.74±0.13
(U^2)^3  97.12±0.04  101.68±0.06  87.13±0.14  14.55±0.16
U^6  97.26±0.16  102.05±0.18  87.54±0.21  14.51±0.11

Table 8: Summary of results (mean and standard deviation) with latent space dimension of 6, diagonal covariance parametrization, on the MNIST dataset.

Model  LL  ELBO  BCE  KL
(S^2_1)^3  96.57±0.04  101.34±0.12  86.88±0.17  14.45±0.10
S^6_1  96.51±0.09  101.29±0.18  86.71±0.20  14.58±0.13
(D^2_1)^3  97.81±0.14  102.58±0.23  88.31±0.25  14.27±0.02
D^6_1  97.89±0.10  102.65±0.10  88.39±0.16  14.26±0.08
(E^2)^3  96.94±0.34  101.34±0.41  86.89±0.36  14.44±0.11
E^6  96.88±0.16  101.36±0.08  86.90±0.14  14.46±0.07
(H^2_1)^3  97.19±0.32  102.06±0.28  87.63±0.37  14.42±0.10
H^6_1  97.38±0.73  102.22±0.95  87.75±0.59  14.47±0.37
(P^2_1)^3  97.57±0.12  102.22±0.18  87.83±0.30  14.39±0.13
P^6_1  97.33±0.15  102.02±0.35  87.71±0.36  14.31±0.04
(S^2)^3  96.78±0.35  101.43±0.24  86.93±0.28  14.50±0.05
S^6  96.44±0.20  101.18±0.36  86.74±0.38  14.44±0.05
(D^2)^3  97.61±0.19  102.37±0.26  87.96±0.21  14.41±0.06
D^6  97.53±0.22  102.31±0.38  87.97±0.37  14.34±0.08
(H^2)^3  96.86±0.31  101.61±0.30  87.13±0.30  14.48±0.08
H^6  96.90±0.26  101.48±0.35  87.18±0.48  14.30±0.15
(P^2)^3  97.52±0.02  102.30±0.07  88.11±0.07  14.19±0.12
P^6  97.26±0.16  102.00±0.17  87.58±0.16  14.42±0.08
D^2 × E^2 × P^2  97.37±0.14  102.12±0.19  87.78±0.23  14.34±0.12
D^2_1 × E^2 × P^2_1  97.29±0.16  101.86±0.16  87.54±0.17  14.32±0.04
E^2 × H^2 × S^2  96.71±0.19  101.34±0.16  86.91±0.17  14.43±0.06
E^2 × H^2_1 × S^2_1  96.66±0.27  101.46±0.44  87.02±0.38  14.44±0.08
(U^2)^3  97.06±0.13  101.66±0.19  87.22±0.12  14.44±0.07
U^6  96.90±0.10  101.68±0.07  87.27±0.11  14.42±0.12

Figure 2: Qualitative comparison of reconstruction quality of randomly selected runs of a selection of well-performing models on MNIST test set digits. (Recovered panel labels: (a) Original, (d) E^2 × H^2 × S^2, (e) (E^2)^12 × (H^2)^12 × (S^2)^12.)
Table 9: Summary of results (mean and standard deviation) with latent space dimension of 12, diagonal covariance parametrization, on the MNIST dataset.

Model  LL  ELBO  BCE  KL
(S^2_1)^6  79.92±0.21  84.88±0.14  62.83±0.21  22.06±0.07
S^12_1  80.72±0.34  85.73±0.36  63.86±0.32  21.87±0.04
(D^2_1)^6  80.53±0.10  85.59±0.08  63.62±0.12  21.97±0.16
D^12_1  80.81±0.12  86.40±0.17  64.42±0.19  21.98±0.06
(E^2)^6  79.51±0.10  83.91±0.12  61.84±0.06  22.07±0.13
E^12  79.51±0.09  83.95±0.06  61.66±0.10  22.29±0.04
(H^2_1)^6  80.54±0.23  86.05±0.52  63.78±0.26  22.27±0.26
H^12_1  79.37±0.14  84.76±0.08  62.32±0.05  22.44±0.10
(P^2_1)^6  80.39±0.07  85.46±0.15  63.48±0.22  21.98±0.17
P^12_1  80.88±0.20  85.87±0.45  63.66±0.59  22.21±0.17
(S^2)^6  79.95±0.14  84.90±0.25  62.83±0.34  22.07±0.17
S^12  79.99±0.27  84.78±0.26  62.89±0.29  21.89±0.18
(D^2)^6  80.40±0.09  85.38±0.08  63.49±0.12  21.89±0.18
D^12  80.37±0.16  85.26±0.19  63.24±0.15  22.02±0.13
(H^2)^6  80.13±0.08  85.22±0.24  63.32±0.34  21.90±0.10
H^12  79.77±0.10  84.58±0.15  62.49±0.10  22.09±0.20
(P^2)^6  80.31±0.08  85.35±0.10  63.57±0.17  21.79±0.07
P^12  80.66±0.09  85.55±0.03  63.55±0.17  22.00±0.14
(D^2)^2 × (E^2)^2 × (P^2)^2  80.30±0.31  85.22±0.40  63.52±0.48  21.70±0.11
(D^2_1)^2 × (E^2)^2 × (P^2_1)^2  80.14±0.11  85.00±0.08  62.99±0.16  22.01±0.24
D^4 × E^4 × P^4  80.17±0.11  84.95±0.27  62.87±0.39  22.08±0.18
D^4_1 × E^4 × P^4_1  80.14±0.20  84.99±0.26  63.06±0.26  21.92±0.08
(E^2)^2 × (H^2)^2 × (S^2)^2  79.59±0.25  84.43±0.20  62.68±0.20  21.75±0.20
(E^2)^2 × (H^2_1)^2 × (S^2_1)^2  79.87±0.45  84.82±0.61  62.66±0.42  22.17±0.20
E^4 × H^4 × S^4  79.69±0.14  84.45±0.12  62.64±0.28  21.81±0.21
E^4 × H^4_1 × S^4_1  79.77±0.09  84.75±0.03  62.68±0.25  22.07±0.24
(U^2)^6  79.61±0.06  84.13±0.04  61.92±0.22  22.21±0.23
U^12  80.01±0.30  84.86±0.51  62.90±0.63  21.96±0.16

Table 10: Summary of results (mean and standard deviation) with latent space dimension of 72, diagonal covariance parametrization, on the MNIST dataset.

Model  LL  ELBO  BCE  KL
(S^2_1)^36  78.43±0.44  84.99±0.49  56.88±0.28  28.11±0.56
(D^2_1)^36  76.03±0.17  83.04±0.25  54.35±0.15  28.69±0.17
(E^2)^36  74.53±0.06  80.05±0.10  50.91±0.17  29.15±0.07
E^72  74.42±0.06  80.09±0.12  51.45±0.30  28.63±0.20
(H^2_1)^36  77.92±0.32  84.76±0.55  56.85±0.60  27.91±0.42
H^72_1  77.30±0.12  86.98±0.09  58.04±0.29  28.94±0.25
(P^2_1)^36  76.11±0.08  82.63±0.19  53.89±0.36  28.74±0.30
P^72_1  77.50±0.05  84.53±0.13  55.80±0.20  28.73±0.18
S^72  75.24±0.01  81.39±0.14  53.03±0.27  28.36±0.16
(D^2)^36  75.66±0.06  81.94±0.09  53.32±0.16  28.61±0.11
D^72  77.11±2.21  83.94±2.81  54.94±2.55  29.00±1.31
(H^2)^36  77.87±0.02  83.95±0.02  55.71±0.35  28.24±0.36
H^72  75.03±0.11  81.23±0.14  52.63±0.10  28.61±0.11
(P^2)^36  75.77±0.12  82.07±0.02  53.65±0.38  28.43±0.39
P^72  75.71±0.08  81.95±0.09  53.29±0.14  28.67±0.05
(D^2)^12 × (E^2)^12 × (P^2)^12  77.40±0.55  83.35±0.41  53.90±0.40  29.45±0.12
(D^2_1)^12 × (E^2)^12 × (P^2_1)^12  75.36±0.23  81.53±0.42  53.02±0.39  28.51±0.45
D^24 × E^24 × P^24  75.11±0.05  80.99±0.07  52.48±0.19  28.52±0.16
(E^2)^12 × (H^2_1)^12 × (S^2_1)^12  77.53±0.34  83.95±0.40  55.54±0.43  28.42±0.08
E^24 × H^24 × S^24  75.04±0.16  81.17±0.18  52.61±0.32  28.55±0.38
(U^2)^36  74.64±0.08  80.52±0.10  52.04±0.10  28.48±0.07
U^72  75.46±0.09  81.76±0.09  53.27±0.18  28.49±0.18

Figure 3: Samples from various models of a grid search around 0 of a single component's latent space on MNIST test digits. (Recovered panel labels: (a) first component of (E^2)^3, (b) second component of (E^2)^3, (c) first (negative) component of (U^2)^3, (d) second (negative) component of (U^2)^3, (e) P^2 of E^2 × P^2 × D^2, (f) D^2 of E^2 × P^2 × D^2.)
Figure 4: Illustrative latent space visualization of a randomly selected run of the models E^2 × H^2 × S^2, E^6, H^6, and S^6 on MNIST with spherical covariance. E^2 is visualized directly, S^2 is visualized using a Lambert azimuthal equal-area projection (Snyder, 1987, Chapter 24), and H^2 is transformed to the Poincaré ball model using a stereographic conformal projection. All other latent space sizes were first projected using the respective transformation (to the Poincaré ball, Lambert projection) if applicable, and then projected to R^2 using Principal Component Analysis (Abdi & Williams, 2010, PCA) and visualized directly. (Recovered panel labels: (a) H^2 of E^2 × H^2 × S^2, (c) E^2 of E^2 × H^2 × S^2, (e) S^2 of E^2 × H^2 × S^2.)

Table 11: Summary of results (mean and standard deviation) with latent space dimension of 6, diagonal covariance parametrization, on the Omniglot dataset.

Model  LL  ELBO  BCE  KL
(S^2_1)^3  136.80±1.31  141.68±1.52  131.73±5.65  9.95±4.33
S^6_1  136.69±0.94  141.46±0.92  129.52±0.74  11.94±0.19
(D^2_1)^3  136.21±0.12  140.44±0.17  128.93±0.14  11.51±0.04
D^6_1  137.42±1.20  141.95±1.94  130.70±2.18  11.25±0.26
(E^2)^3  136.08±0.21  140.46±0.24  128.85±0.34  11.62±0.14
E^6  136.05±0.29  140.50±0.35  128.95±0.41  11.55±0.14
(H^2_1)^3  137.14±0.13  141.87±0.16  130.18±0.21  11.69±0.10
H^6_1  137.09±0.06  142.22±0.19  130.37±0.21  11.85±0.12
(P^2_1)^3  136.16±0.20  140.63±0.32  129.29±0.34  11.34±0.03
P^6_1  135.86±0.20  140.36±0.19  128.92±0.23  11.44±0.16
(S^2)^3  136.14±0.27  140.68±0.32  128.98±0.27  11.70±0.13
S^6  136.20±0.44  140.76±0.45  129.10±0.37  11.66±0.13
(D^2)^3  136.13±0.17  140.59±0.15  129.10±0.20  11.49±0.12
D^6  136.30±0.08  140.74±0.14  129.35±0.16  11.39±0.05
(H^2)^3  136.17±0.09  140.65±0.17  129.26±0.07  11.39±0.16
H^6  136.24±0.32  140.92±0.33  129.48±0.27  11.45±0.12
(P^2)^3  136.09±0.07  140.41±0.08  129.04±0.05  11.37±0.08
P^6  136.05±0.44  140.42±0.47  129.04±0.53  11.38±0.07
D^2 × E^2 × P^2  135.89±0.40  140.28±0.42  128.75±0.40  11.53±0.04
D^2_1 × E^2 × P^2_1  136.01±0.31  140.52±0.35  129.02±0.27  11.50±0.11
E^2 × H^2 × S^2  135.93±0.48  140.51±0.53  128.85±0.48  11.66±0.14
E^2 × H^2_1 × S^2_1  136.34±0.41  141.02±0.46  129.24±0.47  11.78±0.10
(U^2)^3  136.21±0.07  140.65±0.30  129.14±0.34  11.52±0.15
U^6  136.04±0.17  140.43±0.14  129.07±0.27  11.36±0.13

Table 12: Summary of results (mean and standard deviation) with latent space dimension of 72, diagonal covariance parametrization, on the Omniglot dataset.
Model  LL  ELBO  BCE  KL
(S^2_1)^36  112.33±0.14  118.94±0.14  91.04±0.37  27.90±0.23
(D^2_1)^36  108.66±0.24  116.06±0.18  85.95±0.16  30.11±0.04
(E^2)^36  105.96±0.33  112.41±0.35  79.80±0.72  32.61±0.41
E^72  105.89±0.16  112.40±0.17  79.52±0.19  32.89±0.20
(H^2_1)^36  112.22±0.11  119.06±0.15  91.30±0.47  27.76±0.35
H^72_1  111.19±0.42  120.49±0.35  91.11±0.73  29.38±0.40
(P^2_1)^36  109.05±0.09  115.99±0.10  85.81±0.42  30.18±0.34
P^72_1  111.24±0.28  118.36±0.24  89.53±0.38  28.84±0.18
S^72  109.39±0.32  116.42±0.32  87.22±0.58  29.20±0.28
(D^2)^36  108.89±0.36  115.65±0.45  85.29±0.74  30.37±0.30
D^72  108.81±0.08  115.71±0.09  85.68±0.10  30.03±0.09
(H^2)^36  112.21±0.28  118.74±0.30  91.03±0.76  27.71±0.47
H^72  108.62±0.40  115.54±0.30  85.18±0.62  30.37±0.34
(P^2)^36  108.78±0.66  115.54±0.70  85.16±1.38  30.38±0.69
P^72  109.66±0.61  116.50±0.68  87.09±1.43  29.42±0.75
(D^2)^12 × (E^2)^12 × (P^2)^12  107.02±1.56  115.62±1.76  88.52±8.24  27.10±6.48
(D^2_1)^12 × (E^2)^12 × (P^2_1)^12  108.06±0.47  114.92±0.39  83.95±0.58  30.97±0.22
(E^2)^12 × (H^2)^12 × (S^2)^12  114.85±0.38  120.98±0.15  95.12±0.49  25.86±0.40
(E^2)^12 × (H^2_1)^12 × (S^2_1)^12  110.28±0.37  116.90±0.42  87.71±0.82  29.19±0.41
(U^2)^36  105.98±0.05  112.70±0.19  79.85±0.80  32.85±0.61
U^72  106.58±0.12  113.68±0.11  81.53±0.34  32.15±0.36

Table 13: Summary of results (mean and standard deviation) with latent space dimension of 6, diagonal covariance parametrization, on the CIFAR dataset. All nan standard deviation values below indicate that the repeated experiment was not stable enough to produce a meaningful estimate of spread.

Model  LL  ELBO  BCE  KL
S^6_1  1893.16±nan  1901.32±nan  1885.97±nan  15.35±nan
D^6_1  1891.69±nan  1897.77±nan  1884.12±nan  13.64±nan
E^6  1896.19±2.54  1905.75±3.19  1889.97±2.88  15.78±0.32
H^6_1  1888.23±2.12  1896.56±2.93  1882.05±2.65  14.51±0.34
P^6_1  1893.27±0.61  1902.67±0.74  1887.44±0.83  15.23±0.16
D^6  1893.85±0.36  1902.67±0.69  1887.37±0.74  15.30±0.08
S^6  1889.76±1.62  1897.31±1.71  1882.55±1.48  14.76±0.24
P^6  1891.40±2.14  1899.68±2.74  1884.58±2.56  15.10±0.18
D^2 × E^2 × P^2  1899.90±4.60  1904.63±1.46  1889.13±1.38  15.50±0.08
E^2 × H^2 × S^2  1895.46±0.92  1897.57±0.94  1882.84±0.70  14.73±0.24
(U^2)^3  1895.09±4.27  1904.46±5.21  1888.89±4.71  15.57±0.51