# Consistency Regularization for Variational Auto-Encoders

Samarth Sinha (Vector Institute, University of Toronto) and Adji B. Dieng (Google Brain, Princeton University)

Variational auto-encoders (vaes) are a powerful approach to unsupervised learning. They enable scalable approximate posterior inference in latent-variable models using variational inference (vi). A vae posits a variational family parameterized by a deep neural network called an encoder that takes data as input. This encoder is shared across all the observations, which amortizes the cost of inference. However, the encoder of a vae has the undesirable property that it maps a given observation and a semantics-preserving transformation of it to different latent representations. This "inconsistency" of the encoder lowers the quality of the learned representations, especially for downstream tasks, and also negatively affects generalization. In this paper, we propose a regularization method to enforce consistency in vaes. The idea is to minimize the Kullback-Leibler (kl) divergence between the variational distribution when conditioning on the observation and the variational distribution when conditioning on a random semantics-preserving transformation of this observation. This regularization is applicable to any vae. In our experiments we apply it to four different vae variants on several benchmark datasets and find that it not only improves the quality of the learned representations but also leads to better generalization. In particular, when applied to the nouveau variational auto-encoder (nvae), our regularization method yields state-of-the-art performance on mnist, cifar-10, and celeba. We also applied our method to 3D data and found it learns representations of superior quality as measured by accuracy on a downstream classification task. Finally, we show our method can even outperform the triplet loss, an advanced and popular contrastive-learning-based method for representation learning.

Code for this work can be found at https://github.com/sinhasam/CRVAE

1 Introduction

Variational auto-encoders (vaes) have significantly impacted research on unsupervised learning. They have been used in several areas, including density estimation (Kingma & Welling, 2013; Rezende et al., 2014), image generation (Gregor et al., 2015), text generation (Bowman et al., 2015; Fang et al., 2019), music generation (Roberts et al., 2018), topic modeling (Miao et al., 2016; Dieng et al., 2019), and recommendation systems (Liang et al., 2018). Vaes have also been used for different representation learning problems such as semi-supervised learning (Kingma et al., 2014), anomaly detection (An & Cho, 2015; Zimmerer et al., 2018), language modeling (Bowman et al., 2015), active learning (Sinha et al., 2019), continual learning (Achille et al., 2018), and motion prediction of agents (Walker et al., 2016). This widespread application of vae representations makes it critical that we focus on improving them.

Vaes extend deterministic auto-encoders to probabilistic generative modeling. The encoder of a vae parameterizes an approximate posterior distribution over the latent variables of a generative model. The encoder is shared between all observations, which amortizes the cost of posterior inference. Once fitted, the encoder of a vae can be used to obtain low-dimensional representations of data (e.g., for downstream tasks).
Figure 1: Illustration of the inconsistency problem in vaes and how cr-vaes address this problem. The red dots correspond to the representations of a few images from mnist. The blue dots correspond to the representations of the transformed images. The transformations used here are rotations, translations, and scaling; they are semantics-preserving. The arrows connect the representations of each pair of an image and its transformation. The shorter the arrow, the better. (a): The vae maps the two sets of images to different areas in the latent space. (b): Even when trained with the original dataset augmented with the transformed images, the vae still maps the two sets of images to different parts of the latent space. (c): The cr-vae maps an image and its transformation to nearby areas in the latent space.

The quality of these representations is therefore very important to a successful application of vaes. Researchers have looked at ways to improve the quality of the latent representations of vaes, often tackling the so-called latent variable collapse problem, in which the approximate posterior distribution induced by the encoder collapses to the prior over the latent variables (Bowman et al., 2015; Kim et al., 2018; Dieng et al., 2018; He et al., 2019; Fu et al., 2019).

In this paper, we focus on a different problem pertaining to the latent representations of vaes for image data. The encoder of a fitted vae tends to map an image and a semantics-preserving transformation of that image to different parts of the latent space. This "inconsistency" of the encoder affects the quality of the learned representations and generalization.

We propose a method to enforce consistency in vaes. The idea is simple and consists in maximizing the likelihood of the images while minimizing the Kullback-Leibler (kl) divergence between the approximate posterior distribution induced by the encoder when conditioning on the image, on the one hand, and on its transformation, on the other hand. This regularization technique can be applied to any vae variant to improve the quality of the learned representations and boost generalization performance. We call a vae with this form of regularization a consistency-regularized variational auto-encoder (cr-vae).

Figure 1 illustrates the inconsistency problem of vaes and how cr-vaes address this problem on mnist. The red dots are representations of a few images and the blue dots are the representations of their transformations. We applied semantics-preserving transformations: rotation, translation, and scaling. The vae maps each image and its transformation to different parts of the latent space, as evidenced by the long arrows connecting each pair (a). Even when we add the transformed images to the data and fit the vae, the inconsistency problem still occurs (b). The cr-vae does not suffer from the inconsistency problem; it maps each image and its transformation to nearby areas in the latent space, as evidenced by the short arrows connecting each pair (c).

In our experiments (see Section 4), we apply the proposed technique to four vae variants: the original vae (Kingma & Welling, 2013), the importance-weighted auto-encoder (iwae) (Burda et al., 2015), the β-vae (Higgins et al., 2017), and the nouveau variational auto-encoder (nvae) (Vahdat & Kautz, 2020). We found, on four different benchmark datasets, that cr-vaes always yield better representations and generalize better than their base vaes.
In particular, consistency-regularized nouveau variational auto-encoders (cr-nvaes) yield state-of-the-art performance on mnist and cifar-10. We also applied cr-vaes to 3D data, where these conclusions still hold.

2 Consistency Regularization for Variational Auto-Encoders

We consider a latent-variable model pθ(x, z) = pθ(x|z) p(z), where x denotes an observation and z is its associated latent variable. The marginal p(z) is a prior over the latent variable and pθ(x|z) is an exponential family distribution whose natural parameter is a function of z parameterized by θ, e.g. through a neural network. Our goal is to learn the parameters θ and a posterior distribution over the latent variables. The approach of vaes is to maximize the evidence lower bound (elbo), a lower bound on the log marginal likelihood of the data,

    Lvae = elbo = Eqφ(z|x)[log pθ(x, z) − log qφ(z|x)],    (1)

where qφ(z|x) is an approximate posterior distribution over the latent variables. The idea of a vae is to let the parameters of the distribution qφ(z|x) be given by the output of a neural network, with parameters φ, that takes x as input. The parameters θ and φ are then jointly optimized by maximizing a Monte Carlo approximation of the elbo using the reparameterization trick (Kingma & Welling, 2013).

Consider a semantics-preserving transformation t(x̃|x) of data x (e.g. rotation or translation for images). A good representation learning algorithm should provide similar latent representations for x and x̃. This is not the case for the vae that maximizes Equation 1 and its variants. Once fit to data, the encoder of a vae is unable to yield similar latent representations for a data point x and its transformation x̃ (see Figure 1). This is because there is nothing in Equation 1 that forces this desideratum.

We now propose a regularization method that ensures consistency of the encoder of a vae. We call a vae with such a regularization a cr-vae. The proposed regularization is applicable to many variants of the vae, such as the iwae (Burda et al., 2015), the β-vae (Higgins et al., 2017), and the nvae (Vahdat & Kautz, 2020). In what follows, we use the standard vae, the one that maximizes Equation 1, as the base vae to illustrate the method.

Consider an image x. Denote by t(x̃|x) the random process by which we generate x̃, a semantics-preserving transformation of x. We draw x̃ from t(x̃|x) as follows:

    x̃ ∼ t(x̃|x)  ⟺  ϵ ∼ p(ϵ) and x̃ = g(x, ϵ).    (2)

Here g(x, ϵ) is a semantics-preserving transformation of the image x, e.g. translation with random length ϵ drawn from p(ϵ) = U[−δ, δ] for some threshold δ. A cr-vae then maximizes

    Lcr-vae(x) = Lvae(x) + Et(x̃|x)[Lvae(x̃)] − λ R(x, φ),    (3)

where the regularization term R(x, φ) is

    R(x, φ) = Et(x̃|x)[kl(qφ(z|x̃) || qφ(z|x))].    (4)

Maximizing the objective in Equation 3 maximizes the likelihood of the data and their augmentations while enforcing consistency through R(x, φ). Minimizing R(x, φ), which only affects the encoder (with parameters φ), forces each observation and the corresponding augmentations to lie close to each other in the latent space. The hyperparameter λ ≥ 0 controls the strength of this constraint. The objective in Equation 3 is intractable, but we can easily approximate it using Monte Carlo with the reparameterization trick. In particular, we approximate the regularization term with one sample from t(x̃|x) and make the dependence on this sample explicit using the notation R(x, x̃, φ). Algorithm 1 illustrates this in greater detail.

Algorithm 1: Consistency Regularization for Variational Autoencoders

    input: Data x, consistency regularization strength λ, latent space dimensionality K
    Initialize parameters θ, φ
    for iteration t = 1, 2, . . . do
        Draw a minibatch of observations {x_n}, n = 1, . . . , B
        for n = 1, . . . , B do
            Transform the data: ϵ_n ∼ p(ϵ_n) and x̃_n = T(x_n, ϵ_n)
            Get the variational mean and variance for the data:
                µ_n = W · NN(x_n; φ) + a and σ_n = softplus(Q · NN(x_n; φ) + b)
            Get S samples from the variational distribution when conditioning on x_n:
                η^(s) ∼ N(0, I) and z_n^(s) = µ_n + η^(s) ⊙ σ_n for s = 1, . . . , S
            Get the variational mean and variance for the transformed data:
                µ̃_n = W · NN(x̃_n; φ) + a and σ̃_n = softplus(Q · NN(x̃_n; φ) + b)
            Get S samples from the variational distribution when conditioning on x̃_n:
                η̃^(s) ∼ N(0, I) and z̃_n^(s) = µ̃_n + η̃^(s) ⊙ σ̃_n for s = 1, . . . , S
        end
        Compute Lvae(x) = (1/B) Σ_{n=1}^B (1/S) Σ_{s=1}^S [log pθ(x_n, z_n^(s)) − log qφ(z_n^(s)|x_n)]
        Compute Lvae(x̃) = (1/B) Σ_{n=1}^B (1/S) Σ_{s=1}^S [log pθ(x̃_n, z̃_n^(s)) − log qφ(z̃_n^(s)|x̃_n)]
        Compute the kl consistency regularizer, averaged over the minibatch:
            R(x, x̃, φ) = (1/B) Σ_{n=1}^B (1/2) Σ_{k=1}^K [ (σ̃_nk² + (µ̃_nk − µ_nk)²)/σ_nk² − 1 + 2 log(σ_nk/σ̃_nk) ]
        Compute the final loss: Lcr-vae(x) = Lvae(x) + Lvae(x̃) − λ R(x, x̃, φ)
        Backpropagate through L(x, θ, φ) = Lcr-vae(x) and take a gradient step for θ and φ
    end
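For concreteness, the following is a minimal PyTorch sketch of one training step corresponding to Algorithm 1, assuming a Gaussian encoder that returns a mean and standard deviation, a Bernoulli decoder, and a single Monte Carlo sample (S = 1). The `encoder`, `decoder`, and `transform` callables and the specific likelihood are illustrative assumptions, not the exact implementation used in the experiments.

```python
import math

import torch
import torch.nn.functional as F


def gaussian_kl(mu_t, sigma_t, mu, sigma):
    # KL( N(mu_t, sigma_t^2) || N(mu, sigma^2) ) for diagonal Gaussians:
    # the consistency regularizer R(x, x_tilde, phi) of Algorithm 1.
    return 0.5 * (
        (sigma_t ** 2 + (mu_t - mu) ** 2) / sigma ** 2
        - 1.0
        + 2.0 * torch.log(sigma / sigma_t)
    ).sum(dim=-1)


def elbo(decoder, x, z, mu, sigma):
    # Single-sample estimate of E_q[log p(x, z) - log q(z|x)] with a standard
    # normal prior and a Bernoulli likelihood (an illustrative modeling choice).
    logits = decoder(z)
    log_px_z = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none"
    ).flatten(1).sum(-1)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
    log_qz_x = -0.5 * (
        ((z - mu) / sigma) ** 2 + 2 * torch.log(sigma) + math.log(2 * math.pi)
    ).sum(-1)
    return log_px_z + log_pz - log_qz_x


def crvae_step(encoder, decoder, optimizer, x, transform, lam=0.1):
    """One CR-VAE update: ELBO on x plus ELBO on x_tilde, minus lam * R(x, x_tilde, phi)."""
    x_t = transform(x)                                  # x_tilde ~ t(. | x)
    mu, sigma = encoder(x)                              # variational parameters for x
    mu_t, sigma_t = encoder(x_t)                        # variational parameters for x_tilde
    z = mu + sigma * torch.randn_like(sigma)            # reparameterized sample for x
    z_t = mu_t + sigma_t * torch.randn_like(sigma_t)    # reparameterized sample for x_tilde
    loss = -(
        elbo(decoder, x, z, mu, sigma)
        + elbo(decoder, x_t, z_t, mu_t, sigma_t)
        - lam * gaussian_kl(mu_t, sigma_t, mu, sigma)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the consistency term depends only on the encoder outputs, so its gradient affects only φ, matching Equation 4.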
Although we illustrate consistency regularization using the vae that maximizes the elbo, Lvae(·) in Equation 3 can be replaced with any vae objective.

3 Related Work

Applying consistency regularization to vaes, as we do in this paper, has not been previously explored. Consistency regularization is a widely used technique for semi-supervised learning (Bachman et al., 2014; Sajjadi et al., 2016; Laine & Aila, 2016; Miyato et al., 2018; Xie et al., 2019). The core idea behind consistency regularization for semi-supervised learning is to force classifiers to learn representations that are insensitive to semantics-preserving changes to images, so as to improve classification of unlabeled images. Examples of semantics-preserving changes used in the literature include rotation, zoom, translation, crop, or adversarial attacks. Consistency is often enforced by minimizing the L2 distance between a classifier's logit output for an image and the logit output for its semantics-preserving transformation (Sajjadi et al., 2016; Laine & Aila, 2016), or by minimizing the kl divergence between the classifier's label distribution induced by the image and that induced by its transformation (Miyato et al., 2018; Xie et al., 2019). More recently, consistency regularization has been applied to generative adversarial networks (gans) (Goodfellow et al., 2014). Indeed, Wei et al. (2018) and Zhang et al. (2020) show that applying consistency regularization to the discriminator of a gan, which is also a classifier, can substantially improve its performance.

The idea we develop in this paper differs from the works above in two ways. First, it applies consistency regularization to vaes for image data. Second, it leverages consistency regularization not in the label or logit space, as done in the works mentioned above, but in the latent space.

Although different, consistency regularization for vaes relates to works that study ways to constrain the sensitivity of encoders to various perturbations. For example, denoising auto-encoders (daes) and their variants (Vincent et al., 2008, 2010) corrupt an image x into x̃, typically using Gaussian noise, and then minimize the distance between the reconstruction of x̃ and the un-corrupted image x. The motivation is to learn representations that are insensitive to the added noise.
Our work differs in that we do not constrain the decoder to recover the original image from the corrupted image; rather, we constrain the encoder to recover the latent representation of the original image from the corrupted image via a kl divergence minimization constraint.

Contractive auto-encoders (caes) (Rifai et al., 2011) share a similar goal with cr-vaes. A cae is an auto-encoder whose encoder is constrained by minimizing the norm of the Jacobian of the output of the encoder with respect to the input image. This norm constraint on the Jacobian forces the representations learned by the encoder to be insensitive to changes in the input. Our work differs in several main ways. First, cr-vaes are not deterministic auto-encoders, contrary to caes. We can easily sample from a cr-vae, as from any vae, which is not the case for a cae. Second, a cae does not apply transformations to the input image, which limits the invariances it can learn to those exhibited in the training set. Finally, caes use the Jacobian to impose a consistency constraint, which is not as easy to compute as the kl divergence we use on the variational distribution induced by the encoder.

Table 1: cr-vaes learn better representations than their base vaes on all three benchmark datasets. Although fitting the base vae with augmentations does improve the representations, adding the consistency regularization further improves the quality of these learned representations. The value of β for the β-vae is inside the parentheses.

| Method | mnist MI | mnist AU | omniglot MI | omniglot AU | celeba MI | celeba AU |
|---|---|---|---|---|---|---|
| vae | 124.5 ± 1.1 | 36 ± 0.8 | 105.4 ± 1.2 | 50 ± 0.0 | 33.8 ± 0.2 | 32 ± 0.9 |
| vae + Aug | 125.9 ± 0.2 | 42 ± 0.5 | 105.9 ± 0.7 | 50 ± 0.0 | 34.1 ± 0.8 | 33 ± 0.9 |
| cr-vae | 126.3 ± 0.9 | 47 ± 0.5 | 107.8 ± 1.1 | 50 ± 0.0 | 34.9 ± 0.5 | 33 ± 1.2 |
| iwae | 127.1 ± 0.7 | 39 ± 0.5 | 110.3 ± 1.1 | 50 ± 0.0 | 36.9 ± 0.5 | 36 ± 1.6 |
| iwae + Aug | 129.0 ± 0.9 | 45 ± 0.8 | 112.9 ± 0.7 | 50 ± 0.0 | 37.0 ± 0.2 | 36 ± 1.2 |
| cr-iwae | 129.7 ± 1.0 | 50 ± 0.0 | 115.3 ± 0.8 | 50 ± 0.0 | 38.4 ± 0.5 | 36 ± 1.9 |
| β-vae (0.5) | 284.3 ± 1.1 | 50 ± 0.0 | 143.4 ± 1.0 | 50 ± 0.0 | 75.8 ± 0.5 | 49 ± 0.5 |
| β-vae (0.5) + Aug | 289.3 ± 1.0 | 50 ± 0.0 | 159.6 ± 1.3 | 50 ± 0.0 | 75.7 ± 0.3 | 49 ± 0.0 |
| β-cr-vae (0.5) | 291.9 ± 0.7 | 50 ± 0.0 | 169.5 ± 0.5 | 50 ± 0.0 | 77.1 ± 0.1 | 50 ± 0.0 |
| β-vae (10) | 6.3 ± 0.6 | 8 ± 1.7 | 1.4 ± 0.2 | 4 ± 0.9 | 3.6 ± 0.3 | 7 ± 0.8 |
| β-vae (10) + Aug | 6.5 ± 0.5 | 9 ± 1.1 | 1.6 ± 0.2 | 4 ± 0.5 | 3.7 ± 0.1 | 7 ± 0.0 |
| β-cr-vae (10) | 6.9 ± 0.6 | 10 ± 0.5 | 1.6 ± 0.1 | 4 ± 0.5 | 3.7 ± 0.4 | 9 ± 0.9 |

4 Empirical Study

In this section we show that a cr-vae improves the learned representations of its base vae and positively affects generalization performance. We also show that the proposed regularization method is amenable to different vae variants by applying it not only to the original vae but also to the iwae, the β-vae, and the nvae. We showcase the importance of the kl regularization term by conducting an ablation study. We found that regularizing with data augmentation alone improves performance, but that accounting for the kl term (λ > 0) further improves the quality of the learned representations and generalization.

We conduct three sets of experiments. In the first experiment, we apply the regularization method proposed in this paper to standard vaes: the original vae, the iwae, and the β-vae. We use mnist, omniglot, and celeba as datasets for this experiment. For celeba, we choose the 32x32 resolution. Our results show that adding consistency regularization always improves upon the base vae, both in terms of the quality of the learned representations and generalization.
We conduct an ablation study and also report the performance of the different vae variants above when they are fitted with the original data and their augmentations. The results from this ablation highlight the importance of setting λ > 0.

In the second set of experiments we apply our method to a large-scale vae, the latest nvae (Vahdat & Kautz, 2020). We use mnist, cifar-10, and celeba as datasets for this experiment. We increase the resolution of celeba to 64x64 for this experiment. We reach the same conclusions as for the first set of experiments: cr-vaes improve the learned representations and generalization of their base vaes. In this particular setting, the cr-nvae achieves state-of-the-art generalization performance on both mnist and cifar-10. This state-of-the-art performance could not be reached simply by training the nvae with augmentations, as our results show.

Finally, in a third set of experiments, we apply our regularization technique to a 3D point-cloud dataset called ShapeNet (Chang et al., 2015). We adapt a high-performing auto-encoding method called FoldingNet (Yang et al., 2018) to its vae counterpart and apply the method described in this paper to that vae variant on the ShapeNet dataset. We found that adding consistency regularization yields better learned representations.

We next describe in greater detail the setup for each of these experiments and the results showcasing the usefulness of the regularization method we propose in this paper.

Table 2: cr-vaes learn representations that yield higher accuracy on downstream classification than their base vaes. These results correspond to the accuracy of a linear classifier fitted on the training set; we fed this classifier the representations learned by each method. On both mnist and cifar-10, cr-vaes yield higher accuracy.

| Method | mnist | cifar-10 |
|---|---|---|
| vae | 98.5 | 32.6 |
| vae + Aug | 98.9 | 40.1 |
| cr-vae | 99.4 | 44.7 |
| iwae | 98.6 | 35.8 |
| iwae + Aug | 99.9 | 37.1 |
| cr-iwae | 99.9 | 44.8 |
| β-vae (0.5) | 97.6 | 27.0 |
| β-vae (0.5) + Aug | 98.7 | 27.6 |
| β-cr-vae (0.5) | 98.9 | 30.0 |
| β-vae (10) | 99.4 | 36.5 |
| β-vae (10) + Aug | 99.6 | 42.1 |
| β-cr-vae (10) | 99.6 | 46.1 |

Table 3: cr-vaes generalize better than their base vaes in almost all cases; they achieve lower negative log-likelihoods. Although training the base vaes with the augmented data improves generalization, adding the consistency regularization term further improves generalization performance.

| Method | mnist | omniglot | celeba |
|---|---|---|---|
| vae | 83.7 ± 0.3 | 128.2 ± 0.8 | 66.1 ± 0.2 |
| vae + Aug | 82.8 ± 0.4 | 125.7 ± 0.2 | 66.0 ± 0.2 |
| cr-vae | 81.2 ± 0.2 | 124.1 ± 0.1 | 65.9 ± 0.2 |
| iwae | 81.7 ± 0.3 | 127.5 ± 0.5 | 65.3 ± 0.1 |
| iwae + Aug | 80.4 ± 0.2 | 125.0 ± 0.6 | 65.3 ± 0.1 |
| cr-iwae | 79.7 ± 0.3 | 123.6 ± 0.5 | 65.0 ± 0.2 |
| β-vae (0.5) | 92.6 ± 0.3 | 137.1 ± 0.2 | 68.7 ± 0.2 |
| β-vae (0.5) + Aug | 90.0 ± 0.5 | 134.6 ± 0.5 | 68.8 ± 0.2 |
| β-cr-vae (0.5) | 85.7 ± 0.6 | 132.5 ± 0.3 | 68.2 ± 0.1 |
| β-vae (10) | 126.1 ± 1.8 | 157.5 ± 1.1 | 92.7 ± 0.5 |
| β-vae (10) + Aug | 127.1 ± 1.0 | 157.3 ± 0.5 | 92.7 ± 0.3 |
| β-cr-vae (10) | 126.2 ± 0.5 | 157.6 ± 0.6 | 92.6 ± 0.1 |

4.1 Application to standard vaes on benchmark datasets

We apply consistency regularization, as described in this paper, to the original vae, the iwae, and the β-vae. We now describe the setup and results for this experiment.

Datasets. We study three benchmark datasets, which we briefly describe below. We first consider mnist, a handwritten digit recognition dataset with 60,000 images in the training set and 10,000 images in the test set (LeCun, 1998). We form a validation set of 10,000 images randomly sampled from the training set. We also consider omniglot, a handwritten alphabet recognition dataset (Lake et al., 2011).
This dataset is composed of 19,280 images. We use 16,280 randomly sampled images for training, 1,000 for validation, and the remaining 2,000 samples for testing. Finally, we consider celeba, a dataset of faces consisting of 162,770 images for training, 19,867 images for validation, and 19,962 images for testing (Liu et al., 2018). We set the resolution to 32x32 for this experiment.

Transformations t(x̃|x). We consider three transformation variants for image data t(x̃|x). The first randomly translates an image by [−2, 2] pixels in any direction. The second transformation randomly rotates an image uniformly by [−15, 15] degrees clockwise. Finally, the third transformation randomly scales an image by a factor uniformly sampled from [0.9, 1.1].

Table 4: The regularization strength λ affects both generalization performance and the quality of the learned representations. Many values of λ perform better than the base vae. However, a large enough value of λ, e.g. λ = 1, can lead to worse performance than the base vae because for large values of λ the regularization term takes over the data term in the objective function.

| Method | λ | MI | AU | NLL |
|---|---|---|---|---|
| vae | – | 124.5 | 36 | 83.7 |
| cr-vae | 0.001 | 125.0 | 38 | 83.5 |
| cr-vae | 0.01 | 125.9 | 41 | 82.4 |
| cr-vae | 0.1 | 126.3 | 47 | 81.2 |
| cr-vae | 1 | 124.3 | 47 | 83.9 |

Table 5: The choice of augmentation affects both generalization performance and the quality of the learned representations. Jointly using all augmentations works best.

| Augmentation | MI | AU | NLL |
|---|---|---|---|
| Rotations only | 125.8 | 45 | 82.1 |
| Translations only | 126.1 | 45 | 81.9 |
| Scaling only | 125.1 | 42 | 82.7 |
| All | 126.3 | 47 | 81.2 |

Evaluation metrics. The regularization method we propose in this paper is mainly aimed at improving the learned representations of vaes. To assess these representations we use three metrics: mutual information, number of active latent units, and accuracy on a downstream classification task. We also evaluate the effect of the proposed method on generalization to unseen data; for that we report negative log-likelihood. We define each of these metrics next.

Mutual information (MI). The first quality metric is the mutual information I(z; x) between the observations and the latents under the joint distribution induced by the encoder,

    I(z; x) = Epd(x)[kl(qφ(z|x) || p(z)) − kl(qφ(z) || p(z))],    (5)

where pd(x) is the empirical data distribution and qφ(z) is the aggregated posterior, the marginal over z induced by the joint distribution defined by pd(x) and qφ(z|x). The mutual information is intractable but we can approximate it with Monte Carlo. Higher mutual information corresponds to more interpretable latent variables.

Number of active latent units (AU). The second quality metric we consider is the number of active latent units (AU). It is defined in Burda et al. (2015) and measures the "activity" of a dimension of the latent variables z. A latent dimension u is "active" if

    Covx(Eu∼qφ(u|x)[u]) > δ,    (6)

where δ is a threshold defined by the user. For our experiments we set δ = 0.01. The higher the number of active latent units, the better the learned representations.

Accuracy on downstream classification. This metric is calculated by fitting a given vae, taking the learned representations for each data point in the test set, and computing the accuracy of a classifier, fitted on the training set, at predicting the labels of the images in that same test set. This metric is only applicable to labelled datasets.

Negative log-likelihood. We use the negative held-out log-likelihood to assess generalization. Consider an unseen data point x∗; its held-out log-likelihood under the fitted model is

    log pθ(x∗) = log Eqφ(z|x∗)[pθ(x∗, z) / qφ(z|x∗)].    (7)

This is intractable and we approximate it using Monte Carlo,

    log pθ(x∗) ≈ log [ (1/S) Σ_{s=1}^S pθ(x∗, z^(s)) / qφ(z^(s)|x∗) ],    (8)

where z^(1), . . . , z^(S) ∼ qφ(z|x∗).
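To make the evaluation concrete, here is a minimal PyTorch sketch of the active-units count (Equation 6) and the importance-sampling estimate of the held-out log-likelihood (Equation 8). It assumes a Gaussian encoder returning a mean and standard deviation and a helper `log_joint(x, z)` that evaluates log pθ(x, z); these names are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
import math

import torch


@torch.no_grad()
def active_units(encoder, data_loader, delta=0.01):
    # A latent dimension is "active" if the variance, across the data, of its
    # posterior mean exceeds delta (Equation 6; Burda et al., 2015).
    means = []
    for x in data_loader:
        mu, _ = encoder(x)          # E_{q(z|x)}[z] = mu for a Gaussian encoder
        means.append(mu)
    means = torch.cat(means, dim=0)                  # shape (N, K)
    return int((means.var(dim=0) > delta).sum())


@torch.no_grad()
def held_out_log_likelihood(encoder, log_joint, x, num_samples=500):
    # Importance-sampling estimate of log p(x*) with q(z|x*) as the proposal (Equation 8).
    mu, sigma = encoder(x)                                   # shapes (B, K)
    eps = torch.randn(num_samples, *mu.shape)                # (S, B, K)
    z = mu + sigma * eps                                     # reparameterized samples
    log_q = -0.5 * (eps ** 2 + 2 * torch.log(sigma) + math.log(2 * math.pi)).sum(-1)
    log_p = log_joint(x, z)                                  # log p_theta(x*, z), shape (S, B)
    log_w = log_p - log_q                                    # log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(num_samples)   # per-example log p(x*)
```

Averaging the negative of these per-example estimates over the test set gives the reported negative log-likelihood.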
Settings. The vaes are built on the same architecture as Tolstikhin et al. (2017). The networks are trained with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10⁻⁴ for 100 epochs with a batch size of 64. We set the dimensionality of the latent variables to 50; therefore the maximum number of active latent units is 50. We found λ = 0.1 to be best according to cross-validation on held-out log-likelihood, exploring the range [10⁻⁴, 1.0]. In an ablation study we also explore λ = 0, i.e. training with augmentations but without the consistency term. For the β-vae we set λ = 0.1 · β and study both β = 0.5 and β = 10, two regimes under which the β-vae performs qualitatively very differently (Higgins et al., 2017). All experiments were done on a GPU cluster consisting of Nvidia P100 and RTX GPUs. Training took approximately one day for most experiments.

Results. Table 1 shows that, on all three benchmark datasets and for all the vae variants we studied, consistency regularization as developed in this paper always improves the quality of the learned representations, as measured by mutual information and the number of active latent units. These results are confirmed by the numbers shown in Table 2, where cr-vaes always lead to better accuracy on downstream classification. We proposed consistency regularization as a way to improve the quality of the learned representations. Incidentally, Table 3 also shows that it can improve generalization as measured by negative log-likelihood.

Ablation Study. We now look at the impact of each factor that goes into the regularization method we introduced in this paper using mnist. We test the impact of the regularization strength λ and the impact of the choice of augmentation on all metrics. Table 4 and Table 5 show the results. Table 4 shows that even a small consistency regularization (a small λ value) results in improvement over the base vae, but that a large enough λ value can hurt performance. Table 5 shows that rotations and translations are more important than scaling, but the combination of all three augmentations works best for cr-vaes.

Comparison to Contrastive Learning. We look at how cr-vaes compare against a popular and advanced contrastive-learning-based technique, the triplet loss (Schroff et al., 2015), using mnist. Table 6 shows that the cr-vae outperforms the triplet loss on both generalization performance and quality of learned representations. Table 6 also confirms existing literature showing that simply applying augmentations can outperform complex contrastive-learning-based methods such as the triplet loss (Kostrikov et al., 2020; Sinha & Garg, 2021).

Table 6: The cr-vae outperforms a popular and advanced contrastive learning technique, the triplet loss, on both generalization performance and quality of learned representations.

| Method | MI | AU | NLL |
|---|---|---|---|
| vae | 124.5 | 36 | 83.7 |
| vae + augmentations | 125.9 | 42 | 82.8 |
| vae + triplet loss | 124.9 | 39 | 83.1 |
| cr-vae | 126.3 | 47 | 81.2 |

4.2 Application to the large-scale nvae on benchmark datasets

Along with standard vae variants, we also experiment with a large-scale state-of-the-art vae, the nvae (Vahdat & Kautz, 2020).
Similar to before, we simply add consistency regularization using the image-based augmentation techniques to the nvae model and experiment on three benchmark datasets: mnist (LeCun, 1998), cifar-10 (Krizhevsky et al., 2009), and celeba (Liu et al., 2018). The results for large-scale generative modeling are tabulated in Table 7 and Table 8, where we see that using cr-nvae we are able to learn representations that yield better accuracy on downstream classification and set new state-of-the-art values on each of the datasets, improving upon the baseline log-likelihood values. This shows the ability of consistency regularization to work at scale on challenging generative modeling tasks.

Table 7: The cr-nvae learns better representations than the base nvae, as measured by accuracy on downstream classification, on both mnist and cifar-10. We reach the same conclusion when looking at the number of active units as an indicator of the quality of the learned latent representations: cr-nvae recovers 226 active units whereas nvae recovers 211.

| Method | mnist | cifar-10 |
|---|---|---|
| nvae | 99.9 | 57.9 |
| nvae + Aug | 99.9 | 66.4 |
| cr-nvae | 99.9 | 71.4 |

Table 8: Large-scale experiments with nvaes with and without consistency regularization on three benchmark datasets: dynamically binarized mnist, cifar-10, and celeba. We report generalization using negative log-likelihood on mnist and bits per dim on cifar-10 and celeba. On all datasets, consistency regularization improves generalization performance. In particular, cr-nvae achieves state-of-the-art performance on mnist and cifar-10.

| Method | mnist (28 × 28) | cifar-10 (32 × 32) | celeba (64 × 64) |
|---|---|---|---|
| nvae | 78.19 | 2.91 | 2.03 |
| nvae + Aug | 77.53 | 2.70 | 1.96 |
| cr-nvae | 76.93 | 2.51 | 1.86 |

4.3 Application to the FoldingNet on 3D point-cloud data

Along with image data, we additionally experiment with 3D point-cloud data using a FoldingNet (Yang et al., 2018) and the ShapeNet dataset (Chang et al., 2015), which consists of 55 distinct object classes. FoldingNet trains a deep auto-encoder to learn unsupervised representations from point-cloud data. To add consistency regularization, we first convert the auto-encoder into a vae by adding the kl term from the elbo to the baseline FoldingNet. We then add the consistency regularization kl term to the latent space of FoldingNet. For the ShapeNet point-cloud data, we perform data augmentation using a scheme similar to the previous experiments: we randomly translate, rotate, and add jitter to the (x, y, z) coordinates of the point cloud, following the scheme detailed in FoldingNet (Yang et al., 2018).

Table 9: The FoldingNet yields higher accuracy when paired with consistency regularization on the ShapeNet dataset. The results shown here correspond to a FoldingNet trained with augmented data, the same data used to apply consistency regularization. As can be seen from these results, enforcing consistency through the kl divergence, as we do in this paper, leads to representations that perform well on downstream classification. Here the classifier used is a linear SVM. We also report the mean reconstruction error through the Chamfer distance, where the same conclusion holds.

| Method | Accuracy | Reconstruction Loss |
|---|---|---|
| FoldingNet (Aug) | 82.5% | 0.0355 |
| CR-FoldingNet | 84.6% | 0.0327 |

Figure 2: Interpolation between two samples each of a lamp, an airplane, and a table using a CR-FoldingNet trained on the ShapeNet dataset. The CR-FoldingNet is able to learn an interpretable latent space.

We train both the FoldingNet converted into a vae and the CR-FoldingNet with these augmentations.
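For illustration, here is a minimal sketch of the kind of point-cloud augmentation t(x̃|x) described above: a random global translation, a random rotation, and per-point jitter. The specific ranges, the choice of rotation axis, and the function name are assumptions made for the sketch rather than the exact settings of the FoldingNet scheme.

```python
import math

import torch


def augment_point_cloud(points, max_shift=0.1, max_angle_deg=15.0, jitter_std=0.01):
    # points: (N, 3) tensor of (x, y, z) coordinates for a single object.
    # Random global translation of the whole object.
    shift = (torch.rand(1, 3) * 2 - 1) * max_shift
    # Random rotation about the y (up) axis.
    angle = math.radians((2 * torch.rand(()).item() - 1) * max_angle_deg)
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, 0.0, s],
                        [0.0, 1.0, 0.0],
                        [-s, 0.0, c]], dtype=points.dtype)
    # Small per-point Gaussian jitter on the coordinates.
    jitter = torch.randn_like(points) * jitter_std
    return points @ rot.T + shift + jitter
```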
To train the CR-FoldingNet, we additionally apply the consistency regularization term as proposed in Equation 3. The results on the validation set for reconstruction (as measured by the Chamfer distance) and accuracy are shown in Table 9. We also visualize the point-cloud reconstructions and interpolations between three different object classes using a CR-FoldingNet in Figure 2. We perform four interpolation steps for each of the objects to highlight the interpretable learned latent space. Additionally, we perform the same interpolation on the baseline FoldingNet model; we show these interpolations in the appendix.

5 Conclusion

We proposed a simple regularization technique to constrain the encoders of vaes to learn similar latent representations for an image and a semantics-preserving transformation of that image. The idea consists in maximizing the likelihood of the pair of images while minimizing the kl divergence between the variational distribution induced by the encoder when conditioning on the image, on the one hand, and on its transformation, on the other hand. We applied this technique to several vae variants on several datasets, including a 3D dataset. We found it always leads to better learned representations and better generalization to unseen data. In particular, when applied to the nvae, the regularization technique we developed in this paper yields state-of-the-art results on mnist and cifar-10.

Broader Impact

In this paper, we propose a simple method that performs a kl-based consistency regularization scheme using data augmentation for vaes. The broader impact of the study includes practical applications such as graphics and computer vision. The method we propose improves the learned representations of vaes and, as a by-product, also improves their generalization to unseen data. In this regard, any implications of vaes also apply to this work. For example, the generative model fit by a vae may be used to generate artificial data such as images, text, and 3D objects. Biases may arise as a result of poor data selection. Furthermore, text generated from generative systems may amplify harmful speech contained in the data. However, the method we propose can also improve the performance of vaes when used in certain practical domains, as we discussed in the introduction of the paper.

6 Acknowledgements

We thank Kevin Murphy, Ben Poole, and Augustus Odena for their comments on this work.

References

Achille, A., Eccles, T., Matthey, L., Burgess, C. P., Watters, N., Lerchner, A., and Higgins, I. Life-long disentangled representation learning with cross-domain latent homologies. arXiv preprint arXiv:1808.06508, 2018.

An, J. and Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1), 2015.

Bachman, P., Alsharif, O., and Precup, D. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pp. 3365–3373, 2014.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

Dieng, A. B., Kim, Y., Rush, A. M., and Blei, D. M.
Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863, 2018.

Dieng, A. B., Ruiz, F. J., and Blei, D. M. Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907, 2019.

Fang, L., Li, C., Gao, J., Dong, W., and Chen, C. Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527, 2019.

Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., and Carin, L. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

Hadjeres, G., Nielsen, F., and Pachet, F. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. IEEE, 2017.

He, J., Spokoyny, D., Neubig, G., and Berg-Kirkpatrick, T. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Jun, H., Child, R., Chen, M., Schulman, J., Ramesh, A., Radford, A., and Sutskever, I. Distribution augmentation for generative modeling. In International Conference on Machine Learning, pp. 5006–5019. PMLR, 2020.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-supervised learning with deep generative models. arXiv preprint arXiv:1406.5298, 2014.

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Liang, D., Krishnan, R. G., Hoffman, M. D., and Jebara, T. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, pp. 689–698, 2018.

Liu, Z., Luo, P., Wang, X., and Tang, X. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August 15, 2018.

Miao, Y., Yu, L., and Blunsom, P. Neural variational inference for text processing. In International Conference on Machine Learning, pp. 1727–1736, 2016.

Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M.
Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Osada, G., Ahsan, B., Bora, R. P., and Nishide, T. Regularization with latent space virtual adversarial training. In European Conference on Computer Vision, pp. 565–581. Springer, 2020.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. 2011.

Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck, D. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428, 2018.

Sajjadi, M., Javanmardi, M., and Tasdizen, T. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, 2016.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Sinha, S. and Garg, A. S4RL: Surprisingly simple self-supervision for offline reinforcement learning. arXiv preprint arXiv:2103.06326, 2021.

Sinha, S., Ebrahimi, S., and Darrell, T. Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5972–5981, 2019.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Vahdat, A. and Kautz, J. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898, 2020.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

Walker, J., Doersch, C., Gupta, A., and Hebert, M. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851. Springer, 2016.

Wei, X., Gong, B., Liu, Z., Lu, W., and Wang, L. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. arXiv preprint arXiv:1803.01541, 2018.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.

Yang, Y., Feng, C., Shen, Y., and Tian, D. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215, 2018.

Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency regularization for generative adversarial networks. 2020.

Zimmerer, D., Kohl, S. A., Petersen, J., Isensee, F., and Maier-Hein, K. H. Context-encoding variational autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1812.05941, 2018.