# Monte Carlo Variational Auto-Encoders

Achille Thin¹ Nikita Kotelevskii² Alain Durmus³ Maxim Panov² Eric Moulines¹⁴ Arnaud Doucet⁵

**Abstract.** Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBOs, and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been suggested many times in the literature to use more sophisticated algorithms such as Annealed Importance Sampling (AIS) and its Sequential Importance Sampling (SIS) extensions, the potential benefits brought by these advanced techniques have never been realized for VAE: the AIS estimate cannot be easily differentiated, while SIS requires the specification of carefully chosen backward Markov kernels. In this paper, we address both issues and demonstrate the performance of the resulting Monte Carlo VAEs on a variety of applications.

*\*Equal contribution. ¹CMAP, Ecole Polytechnique, Université Paris-Saclay, France. ²CDISE, Skolkovo Institute of Science and Technology, Moscow, Russia. ³Ecole Nationale Supérieure Paris-Saclay, France. ⁴HDI Lab, HSE University, Moscow, Russia. ⁵University of Oxford. Correspondence to: Achille Thin.*

## 1. Introduction

Variational Auto-Encoders (VAE), introduced by (Kingma & Welling, 2013), are a very popular class of methods in unsupervised learning and generative modelling. These methods aim at finding a parameter $\theta$ maximizing the marginal likelihood $p_\theta(x) = \int p_\theta(x, z)\,\mathrm{d}z$, where $x \in \mathbb{R}^N$ is the observation and $z \in \mathbb{R}^d$ is the latent variable. They rely on the introduction of an additional parameter $\phi$ and a family of variational distributions $q_\phi(z|x)$. The joint parameters $\{\theta, \phi\}$ are then inferred through the optimization of the
*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

Evidence Lower Bound (ELBO), defined as

$$\mathcal{L}(\theta, \phi) = \int \log\Big(\frac{p_\theta(x, z)}{q_\phi(z|x)}\Big)\, q_\phi(z|x)\,\mathrm{d}z = \log p_\theta(x) - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) \le \log p_\theta(x)\,.$$

The design of expressive variational families has been the topic of many works and is a core ingredient in the efficiency of VAE (Rezende & Mohamed, 2015; Kingma et al., 2016). Another line of research consists in using positive unbiased estimators $\hat{p}_\theta(x)$ of the likelihood $p_\theta(x)$ under $q_\phi$, i.e. such that $\mathbb{E}_{q_\phi}[\hat{p}_\theta(x)] = p_\theta(x)$. Indeed, as noted in (Mnih & Rezende, 2016), it follows from Jensen's inequality that

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi}\big[\log \hat{p}_\theta(x)\big] \le \log p_\theta(x)\,. \quad (1)$$

A Taylor expansion shows that $\mathcal{L}(\theta, \phi) \approx \log p_\theta(x) - \operatorname{Var}_{q_\phi}[\hat{p}_\theta(x)] / \big(2\, p_\theta(x)^2\big)$; see e.g. (Maddison et al., 2017; Domke & Sheldon, 2018) for formal results. Hence the ELBO becomes tighter as the variance of the estimator decreases. A common method to obtain an unbiased estimate is built on importance sampling, i.e. $\hat{p}_\theta(x) = n^{-1} \sum_{i=1}^n p_\theta(x, z^i)/q_\phi(z^i|x)$ for $z^i$ i.i.d. from $q_\phi(\cdot|x)$. In particular, combined with (1), we obtain the popular Importance Weighted Auto-Encoder (IWAE) proposed by (Burda et al., 2015). However, it is expected that the relative variance of this importance-sampling-based estimator typically increases with the dimension of the latent variable $z$. To circumvent this issue, we suggest in this paper to consider other estimates of the evidence which have shown great success in the Monte Carlo literature. In particular, Annealed Importance Sampling (AIS) (Neal, 2001; Wu et al., 2016) and its Sequential Importance Sampling (SIS) extensions (Del Moral et al., 2006) define state-of-the-art estimators of the evidence. These algorithms rely on an extended target distribution for which an efficient importance distribution can be defined using non-homogeneous Markov kernels. It has been suggested in various contributions that AIS could be useful to train VAE (Salimans et al., 2015; Wu et al., 2016; Maddison et al., 2017; Wu et al., 2020).
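To make the construction above concrete, the importance-sampling evidence estimate and the induced bound (1) can be sketched in NumPy on a toy conjugate Gaussian model where $\log p_\theta(x)$ is available in closed form. This is an illustration under our own assumptions, not code from the paper; all names are ours.

```python
import numpy as np

def log_evidence_estimate(log_joint, log_q, sample_q, n, rng):
    """Stable log of the importance-sampling estimate
    p_hat(x) = (1/n) * sum_i p(x, z_i) / q(z_i | x),  z_i ~ q(.|x)."""
    z = sample_q(n, rng)                 # n i.i.d. proposal samples
    log_w = log_joint(z) - log_q(z)      # log importance weights
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # log-sum-exp trick

# Toy check: p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(x; 0, 2) exactly.
x = 1.0
log_joint = lambda z: -0.5 * z**2 - 0.5 * (x - z)**2 - np.log(2.0 * np.pi)
log_q = lambda z: -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)   # q = N(0, 1)
sample_q = lambda n, rng: rng.standard_normal(n)

rng = np.random.default_rng(0)
bound = np.mean([log_evidence_estimate(log_joint, log_q, sample_q, 50, rng)
                 for _ in range(200)])   # Monte Carlo estimate of the IWAE ELBO
true_log_evidence = -0.5 * np.log(4.0 * np.pi) - 0.25
print(bound, true_log_evidence)          # in expectation, bound <= log p(x)
```

With $n = 50$ importance samples the bound is already very close to $\log p_\theta(x)$ here; the point of the paper is that this closeness degrades as the latent dimension grows.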
However, to the authors' knowledge, no contribution discusses how an unbiased gradient of the resulting ELBO could be obtained. Indeed, the main difficulty in this computation arises from the MCMC transitions in AIS. As a result, when this estimator is used, alternatives to the ELBO (1) have often been considered; see e.g. (Ding & Freedman, 2019). Whereas AIS requires using MCMC transition kernels, SIS is more flexible and can exploit Markov transition kernels which are only approximately invariant w.r.t. a given target distribution, e.g. unadjusted Langevin kernels. In this case, the construction of the augmented target distribution, which is at the core of the estimator, requires the careful selection of a class of auxiliary backward Markov kernels. (Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016; Goyal et al., 2017; Huang et al., 2018) propose to learn such auxiliary kernels, parameterized with neural networks, through the ELBO. However, as illustrated in our simulations, this comes at an increase in the overall computational cost.

**Our contributions.** The contributions of this paper are as follows: (i) We show how to obtain an SIS-based ELBO relying on unadjusted Langevin dynamics which, contrary to (Salimans et al., 2015; Goyal et al., 2017; Huang et al., 2018), does not require introducing and optimizing backward Markov kernels. In addition, an unbiased gradient estimate of the resulting ELBO which exploits the reparameterization trick is derived. (ii) We propose an unbiased gradient estimate of the ELBO computed based on AIS. At the core of this estimate is a non-standard representation of Metropolis-Hastings type kernels which allows us to differentiate them. This is combined with a variance reduction technique for improved efficiency. (iii) We apply these new methods to build novel Monte Carlo VAEs and show their efficiency on real-world datasets. All the theoretical results are detailed in the supplementary material.
## 2. Variational Inference via Sequential Importance Sampling

### 2.1. SIS estimator

The design of efficient proposal importance distributions starting from an annealing schedule has been proposed in (Crooks, 1998; Neal, 2001), and was later extended by (Del Moral et al., 2006). Let $\{\gamma_k\}_{k=0}^K$ be a sequence of unnormalized bridge densities satisfying $\gamma_0(z) = q_\phi(z|x)$ and $\gamma_K(z) = p_\theta(x, z)$. Note that $\gamma_K$ is not normalized, in contrast to $\gamma_0$. Here we set

$$\gamma_k(z) = q_\phi(z|x)^{1-\beta_k}\, p_\theta(x, z)^{\beta_k} \quad (2)$$

for an annealing schedule $0 = \beta_0 < \cdots < \beta_K = 1$, but alternative sequences of intermediate densities could be considered. At each iteration, SIS samples a non-homogeneous Markov chain using the transition kernels $\{M_k\}_{k=1}^K$. In this section, we assume that $M_k$ admits a positive transition density $m_k$ such that $m_k$ leaves $\gamma_k$ invariant, i.e. $\int \gamma_k(z)\, m_k(z, z')\,\mathrm{d}z = \gamma_k(z')$, or is only approximately invariant. In particular, $m_k$ typically depends on the data $x$; to simplify notation, this dependence is omitted. Based on this sequence of transition densities, SIS considers the joint density on the path space $z_{0:K} \in \mathbb{R}^{d(K+1)}$,

$$q^K_\phi(z_{0:K}|x) = q_\phi(z_0|x) \prod_{k=1}^K m_k(z_{k-1}, z_k)\,, \quad (3)$$

where $z_{i:j} = (z_i, \dots, z_j)$ for $0 \le i \le j$. By construction, we expect the $K$-th marginal $q^K_\phi(z_K|x) = \int q^K_\phi(z_{0:K}|x)\,\mathrm{d}z_{0:K-1}$ to be close to $p_\theta(z_K|x)$. However, we cannot use importance sampling to correct for the discrepancy between these two densities, as $q^K_\phi(z_K|x)$ is typically intractable. To bypass this problem, we introduce another density on the path space,

$$p^K_\theta(x, z_{0:K}) = p_\theta(x, z_K) \prod_{k=K-1}^{0} \ell_k(z_{k+1}, z_k)\,, \quad (4)$$

where $\{\ell_k\}_{k=0}^{K-1}$ is a sequence of auxiliary positive transition densities. Note that in this case, the $K$-th marginal $\int p^K_\theta(x, z_{0:K})\,\mathrm{d}z_{0:K-1}$ is exactly $p_\theta(x, z_K)$. Using now importance sampling on the path space, we obtain the following unbiased SIS estimator (Del Moral et al., 2006; Salimans et al., 2015), by sampling independently $z^i_{0:K} \sim q^K_\phi(\cdot|x)$ and computing

$$\hat{p}_\theta(x) = n^{-1} \sum_{i=1}^n w_K(z^i_{0:K})\,, \qquad w_K(z_{0:K}) = \frac{p^K_\theta(x, z_{0:K})}{q^K_\phi(z_{0:K}|x)}\,. \quad (5)$$
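A minimal sketch of the bridge densities (2) and of the path-space log-weight $\log w_K$ of (3)-(5), under the convention that the forward log-densities $\log m_k(z_{k-1}, z_k)$ and backward log-densities $\log \ell_{k-1}(z_k, z_{k-1})$ evaluated along the sampled path are supplied as lists. The helper names are ours, purely for illustration:

```python
import numpy as np

def make_log_gamma(log_q, log_joint, beta):
    """Log of the bridge density (2): gamma_k = q^(1 - beta_k) * p(x, .)^beta_k."""
    return lambda z: (1.0 - beta) * log_q(z) + beta * log_joint(z)

def log_sis_weight(path, log_q, log_joint, log_ms, log_ls):
    """Log of the SIS path weight w_K(z_{0:K}) = p_K / q_K from (3)-(5).
    `path` is [z_0, ..., z_K]; `log_ms` / `log_ls` hold the forward and
    backward transition log-densities evaluated along the path."""
    z0, zK = path[0], path[-1]
    return log_joint(zK) + sum(log_ls) - log_q(z0) - sum(log_ms)

# Sanity check: at beta = 0 the bridge equals q, at beta = 1 it equals p(x, .),
# and for K = 0 the weight collapses to the plain importance weight p(x, z)/q(z).
log_q = lambda z: -0.5 * z**2
log_joint = lambda z: -0.5 * (z - 1.0)**2
z = 0.3
assert np.isclose(make_log_gamma(log_q, log_joint, 0.0)(z), log_q(z))
assert np.isclose(make_log_gamma(log_q, log_joint, 1.0)(z), log_joint(z))
print(log_sis_weight([z], log_q, log_joint, [], []))  # log p(x, z) - log q(z)
```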
### 2.2. AIS estimator

The selection of the kernels $\{\ell_k\}_{k=0}^{K-1}$ has a large impact on the variance of the estimator. The optimal reverse kernels $\ell_k$, minimizing the variance of $\hat{p}_\theta(x)$, are given by

$$\ell_{k-1}(z_k, z_{k-1}) = \frac{q_{\phi,k-1}(z_{k-1}|x)\, m_k(z_{k-1}, z_k)}{q_{\phi,k}(z_k|x)}\,, \quad (6)$$

where $q_{\phi,k}(z_k|x) = \int q^K_\phi(z_{0:K}|x)\,\mathrm{d}z^{-k}_{0:K}$ is the $k$-th marginal of $q^K_\phi$; see (Del Moral et al., 2006). However, the resulting estimator $\hat{p}_\theta(x)$ is usually intractable. An approximation to (6) leading to a tractable estimator is provided by AIS (Crooks, 1998; Neal, 2001). When $m_k$ is $\gamma_k$-invariant, then, by assuming that $q_{\phi,k} \approx q_{\phi,k-1} \approx \gamma_k$ in (6), we obtain

$$\ell_{k-1}(z_k, z_{k-1}) = \frac{\gamma_k(z_{k-1})\, m_k(z_{k-1}, z_k)}{\gamma_k(z_k)}\,. \quad (7)$$

We refer to this kernel as the reversal of $m_k$. In particular, if $m_k$ is $\gamma_k$-reversible, i.e. $\gamma_k(z_{k-1})\, m_k(z_{k-1}, z_k) = \gamma_k(z_k)\, m_k(z_k, z_{k-1})$, then $\ell_{k-1} = m_k$. Note that the weights (5) can be computed using the decomposition $w_K(z_{0:K}) = \prod_{k=1}^K w_k(z_{k-1}, z_k)$ with

$$w_k(z_{k-1}, z_k) = \frac{\gamma_k(z_k)\, \ell_{k-1}(z_k, z_{k-1})}{\gamma_{k-1}(z_{k-1})\, m_k(z_{k-1}, z_k)}\,, \quad (8)$$

which simplifies when using (7) as

$$w_k(z_{k-1}, z_k) = \gamma_k(z_{k-1})/\gamma_{k-1}(z_{k-1})\,. \quad (9)$$

In contrast to previous works, we will consider in the next section transition kernels $m_k$ which are only approximately $\gamma_k$-invariant, but still build $\ell_{k-1}$ in the spirit of (7).

### 2.3. SIS-ELBO using unadjusted Langevin

For any $k$, we consider the Langevin dynamics associated with $\gamma_k$ and the corresponding Euler-Maruyama discretization. Then, we choose for $m_k$ the transition density associated with this discretization,

$$m_k(z_{k-1}, z_k) = \mathrm{N}\big(z_k;\, z_{k-1} + \eta \nabla \log \gamma_k(z_{k-1}),\, 2\eta\, \mathrm{I}_d\big)\,. \quad (10)$$

Note that sampling from $q^K_\phi$ boils down to sampling $Z_0 \sim q_\phi$ and defining the Markov chain $\{Z_k\}_{k \le K}$ recursively by

$$Z_k = Z_{k-1} + \eta \nabla \log \gamma_k(Z_{k-1}) + \sqrt{2\eta}\, U_k\,, \quad (11)$$

where $U_k \sim \mathrm{N}(0, \mathrm{I}_d)$. Moreover, as the continuous Langevin dynamics is reversible w.r.t. $\gamma_k$, this suggests that we can approximate the reversal $\ell_{k-1}$ of $m_k$ by $m_k$ directly, as in AIS, and thus select $\ell_{k-1} = m_k$ for any $k$, as done in (Heng et al., 2020).
However, in this case, the weights $w_k(z_{k-1}, z_k)$ do not simplify to $\gamma_k(z_{k-1})/\gamma_{k-1}(z_{k-1})$, as $m_k$ is not exactly $\gamma_k$-invariant, and we need to rely on the general expression given in (8). This approach was concurrently and independently proposed in (Wu et al., 2020), but gradient computations are not discussed therein. Based on (1) and (5), we introduce

$$\mathcal{L}_{\mathrm{SIS}} = \int \log\Big(\prod_{k=1}^K w_k(z_{k-1}, z_k)\Big)\, q^K_\phi(z_{0:K}|x)\,\mathrm{d}z_{0:K}\,. \quad (12)$$

**Algorithm 1** Langevin Monte Carlo VAE
- **Input:** number of steps $K$, initial distribution $q_\phi$, unnormalized target distribution $p_\theta$, step size $\eta$, annealing schedule $\{\beta_k\}_{k=0}^K$.
- **Output:** SIS estimator $W$ of $\log p_\theta(x)$.
- Draw $z_0 \sim q_\phi(\cdot|x)$; set $W = -\log q_\phi(z_0|x)$.
- **for** $k = 1$ **to** $K$ **do**
  - Draw $u_k \sim \phi_d$;
  - Set $\log \gamma_k(\cdot) = \beta_k \log p_\theta(x, \cdot) + (1 - \beta_k) \log q_\phi(\cdot|x)$;
  - Set $z_k = z_{k-1} + \eta \nabla \log \gamma_k(z_{k-1}) + \sqrt{2\eta}\, u_k$;
  - Set $W = W + \log m_k(z_k, z_{k-1}) - \log m_k(z_{k-1}, z_k)$;
- **end for**
- Set $W = W + \log p_\theta(x, z_K)$; **return** $W$.

We consider a reparameterization of (12) based on the Langevin mappings associated with the target $\gamma_k$, given by

$$T_{k,u}(z_{k-1}) = z_{k-1} + \eta \nabla \log \gamma_k(z_{k-1}) + \sqrt{2\eta}\, u\,. \quad (13)$$

An easy change of variables based on the identity $Z_k = T_{k,U_k}(Z_{k-1})$ in (11) shows that $\mathcal{L}_{\mathrm{SIS}}$ can be written as

$$\int \log\Big(\prod_{k=1}^K w_k(z_{k-1}, z_k)\Big)\, q_\phi(z_0|x)\, \phi_{d,K}(u_{1:K})\,\mathrm{d}z_0\,\mathrm{d}u_{1:K}\,,$$

where $\phi_d$ stands for the density of the $d$-dimensional standard Gaussian distribution, $\phi_{d,K}(u_{1:K}) = \prod_{i=1}^K \phi_d(u_i)$, and we write $z_k = T_{k,u_k} \circ \cdots \circ T_{1,u_1}(z_0) = \circ_{i=1}^k T_{i,u_i}(z_0)$. By (5), and as $\ell_{k-1} = m_k$, our objective thus finally writes

$$\mathcal{L}_{\mathrm{SIS}}(\theta, \phi; x) = \int q_\phi(z_0|x)\, \phi_{d,K}(u_{1:K}) \log\Bigg(\frac{p_\theta(x, z_K) \prod_{k=1}^K m_k(z_k, z_{k-1})}{q_\phi(z_0|x) \prod_{k=1}^K m_k(z_{k-1}, z_k)}\Bigg)\,\mathrm{d}z_0\,\mathrm{d}u_{1:K}\,. \quad (14)$$

This defines the reparameterizable Langevin Monte Carlo VAE (L-MCVAE). The algorithm to obtain an unbiased SIS estimate of $p_\theta(x)$ is described in Algorithm 1. This estimate is related to the one presented in (Caterini et al., 2018); however, that work builds on deterministic dynamics, which limits the expressivity of the induced variational approximation. In contrast, we rely here on stochastic dynamics.
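Algorithm 1 can be sketched in NumPy as follows. The toy one-dimensional Gaussian model used as a sanity check is our own choice (not from the paper), picked so that $\log p_\theta(x)$ is known exactly; all function names are illustrative.

```python
import numpy as np

def l_mcvae_estimate(sample_q, log_q, log_joint, grad_log_q, grad_log_joint,
                     betas, eta, rng):
    """One pass of Algorithm 1: SIS estimator W of log p_theta(x) with ULA
    forward kernels m_k and backward kernels l_{k-1} = m_k (Section 2.3)."""
    z = sample_q(rng)
    d = z.size

    def log_m(s, t, grad_log_gamma):
        # log of the ULA transition density (10): N(t; s + eta * grad, 2 * eta * I)
        diff = t - s - eta * grad_log_gamma(s)
        return -0.5 * d * np.log(4.0 * np.pi * eta) - diff @ diff / (4.0 * eta)

    W = -log_q(z)                        # start of the telescoped log-weight (8)
    for b in betas[1:]:
        grad = lambda v, b=b: (1.0 - b) * grad_log_q(v) + b * grad_log_joint(v)
        u = rng.standard_normal(d)
        z_new = z + eta * grad(z) + np.sqrt(2.0 * eta) * u      # ULA move (11)
        W += log_m(z_new, z, grad) - log_m(z, z_new, grad)      # ratio in (8)
        z = z_new
    return W + log_joint(z)

# Toy check in 1-D: q = N(0, 1), p(x, z) = N(z; 0, 1) N(x; z, 1), so that
# log p(x) = log N(x; 0, 2) is known.
x = 1.0
log_q = lambda z: float(-0.5 * z @ z - 0.5 * np.log(2.0 * np.pi))
grad_log_q = lambda z: -z
log_joint = lambda z: float(-0.5 * z @ z - 0.5 * (x - z) @ (x - z)
                            - np.log(2.0 * np.pi))
grad_log_joint = lambda z: -z + (x - z)
sample_q = lambda rng: rng.standard_normal(1)

rng = np.random.default_rng(1)
betas = np.linspace(0.0, 1.0, 21)
est = np.mean([l_mcvae_estimate(sample_q, log_q, log_joint, grad_log_q,
                                grad_log_joint, betas, 0.05, rng)
               for _ in range(300)])
true = -0.5 * np.log(4.0 * np.pi) - 0.25
print(est, true)   # E[W] is an ELBO: at most log p(x), and close to it here
```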
While we limit ourselves here to unadjusted (overdamped) Langevin dynamics, this could also easily be extended to unadjusted underdamped Langevin dynamics (Monmarché, 2020).

## 3. Variational Inference via Annealed Importance Sampling

In Section 2.3, we derived the SIS estimate of the evidence computed using ULA kernels $M_k$, whose invariant distribution approximates $\gamma_k$. We can differentiate the resulting ELBO and exploit the reparameterization trick, as these kernels admit a density $m_k$ w.r.t. the Lebesgue measure. When computing AIS estimates, we need Markov kernels $M_k$ that are $\gamma_k$-invariant. Such Markov kernels most often rely on a Metropolis-Hastings (MH) step and therefore do not typically admit a transition density. While this does not invalidate the expression of the AIS estimate presented earlier, it significantly complicates the computation of an unbiased gradient estimate of the corresponding ELBO. In this section, we propose a way to compute an unbiased estimator of the ELBO gradient for MH Markov kernels. We use elementary measure-theoretic notations, which are recalled in Section 1 of the supplementary material.

### 3.1. Differentiating Markov kernels

**Markov kernel.** Let $\mathcal{B}(\mathbb{R}^d)$ denote the Borel $\sigma$-field associated with $\mathbb{R}^d$. A Markov kernel $M$ is a function defined on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$ such that, for any $z \in \mathbb{R}^d$, $M(z, \cdot)$ is a probability distribution on $\mathcal{B}(\mathbb{R}^d)$; i.e. $M(z, A)$ is the probability, starting from $z$, to hit the set $A \subset \mathbb{R}^d$. The simplest Markov kernel is deterministic, in which case $M(z, A) = \delta_{T(z)}(A)$, where $T$ is a measurable mapping on $\mathbb{R}^d$ and $\delta_y$ is the Dirac mass at $y$. Instead of a single mapping $T$, we can consider a family of indexed mappings $\{T_u : u \in \mathbb{R}^{d_u}\}$. If $g$ is a p.d.f. on $\mathbb{R}^{d_u}$, we consider

$$M(z, A) = \int \mathbf{1}_A\big(T_u(z)\big)\, g(u)\,\mathrm{d}u\,.$$

To sample from $M(z, \cdot)$, we first sample $u \sim g$ and then apply the mapping $T_u(z)$.
If $d = d_u$, we consider the function $G_z : \mathbb{R}^d \to \mathbb{R}^d$ defined for all $z \in \mathbb{R}^d$ by

$$G_z : u \mapsto G_z(u) = T_u(z)\,. \quad (15)$$

If $G_z$ is a diffeomorphism, then by applying the change of variables formula, we obtain

$$M(z, A) = \int \mathbf{1}_A\big(G_z(u)\big)\, g(u)\,\mathrm{d}u = \int \mathbf{1}_A(z')\, m(z, z')\,\mathrm{d}z'\,, \quad (16)$$

where, denoting by $J_{G_z^{-1}}(z')$ the absolute value of the Jacobian determinant of $G_z^{-1}$ evaluated at $z'$, we set

$$m(z, z') = J_{G_z^{-1}}(z')\, g\big(G_z^{-1}(z')\big)\,. \quad (17)$$

In this case, the Markov kernel has a transition density $m(z, z')$. This is the setting considered in the previous section. Finally, some Markov kernels have both a deterministic and a continuous component. This is for example the case of Metropolis-Hastings (MH) kernels:

$$M(z, A) = \int_{\mathsf{U}} Q_u(z, A)\, g(u)\,\mathrm{d}u\,, \quad \text{where} \quad Q_u(z, A) = \alpha_u(z)\, \delta_{T_u(z)}(A) + \big(1 - \alpha_u(z)\big)\, \delta_z(A)\,, \quad (18)$$

with $\alpha_u(z) = \alpha\big(z, T_u(z)\big)$ the acceptance function and $T_u : \mathbb{R}^d \to \mathbb{R}^d$ the proposal mapping. In the sequel, we denote $\alpha^1_u(z) = \alpha_u(z)$ and $\alpha^0_u(z) = 1 - \alpha_u(z)$, and set $T^1_u = T_u$ and $T^0_u(z) = z$. With these notations, (18) can be rewritten more concisely as

$$Q_u(z, A) = \sum_{a=0}^{1} \alpha^a_u(z)\, \delta_{T^a_u(z)}(A)\,. \quad (19)$$

To sample from $M(z, \cdot)$, we first draw $u \sim g$ and then compute the proposal $y = T_u(z)$. With probability $\alpha_u(z)$, the next state of the chain is set to $z' = y$, and $z' = z$ otherwise. If $G_z$ defined in (15) is a diffeomorphism, then the Metropolis-Hastings kernel may be expressed as

$$M(z, A) = \int \alpha(z, z')\, m(z, z')\, \mathbf{1}_A(z')\,\mathrm{d}z' + \big(1 - \bar{\alpha}(z)\big)\, \delta_z(A)\,,$$

where $\bar{\alpha}(z) = \int \alpha(z, z')\, m(z, z')\,\mathrm{d}z'$ is the mean acceptance probability at $z$ (the probability of accepting a move) and $m(z, z')$ is defined as in (17). In MH algorithms, the acceptance function $\alpha : \mathbb{R}^{2d} \to [0, 1]$ is chosen so that $M$ is $\pi$-reversible, $\pi(\mathrm{d}z)\, M(z, \mathrm{d}z') = \pi(\mathrm{d}z')\, M(z', \mathrm{d}z)$, where $\pi$ is the target distribution. This implies, in particular, that $M$ is $\pi$-invariant. Standard MH algorithms use $\alpha(z, z') = 1 \wedge \big[\pi(z')\, m(z', z)\big] / \big[\pi(z)\, m(z, z')\big]$; see (Tierney, 1994). To illustrate these definitions and constructions, consider first the symmetric Random Walk Metropolis algorithm (RWM).
In this case, $d_u = d$ and $g \equiv \phi_d$, where $\phi_d$ is the $d$-dimensional standard Gaussian density. The proposal mapping is given by $G^{\mathrm{RWM}}_z(u) = T^{\mathrm{RWM}}_u(z) = z + \Sigma^{1/2} u$, where $\Sigma$ is a positive definite matrix, and the acceptance function is given by $\alpha^{\mathrm{RWM}}_u(z) = 1 \wedge \pi\big(T^{\mathrm{RWM}}_u(z)\big)/\pi(z)$.

Consider now the Metropolis Adjusted Langevin Algorithm (MALA); see (Besag, 1994). Assume that $z \mapsto \log \pi(z)$ is differentiable and denote by $\nabla \log \pi(z)$ its gradient. The Langevin proposal mapping $T^{\mathrm{MALA}}_u$ is defined by

$$G^{\mathrm{MALA}}_z(u) = T^{\mathrm{MALA}}_u(z) = z + \eta \nabla \log \pi(z) + \sqrt{2\eta}\, u\,. \quad (20)$$

We set $g \equiv \phi_d$, and the acceptance function $\alpha^{\mathrm{MALA}}_u(z)$ is given by

$$\alpha^{\mathrm{MALA}}_u(z) = 1 \wedge \frac{\pi\big(T^{\mathrm{MALA}}_u(z)\big)\, m_\eta\big(T^{\mathrm{MALA}}_u(z), z\big)}{\pi(z)\, m_\eta\big(z, T^{\mathrm{MALA}}_u(z)\big)}\,, \quad (21)$$

where $m_\eta(z, z') = (2\eta)^{-d/2}\, \phi_d\big((2\eta)^{-1/2}\{z' - T^{\mathrm{MALA}}_0(z)\}\big)$, similarly to (10).

**Lemma 1.** For all $z \in \mathbb{R}^d$, $G^{\mathrm{MALA}}_z$ is a $C^1$-diffeomorphism. Moreover, assume that $\log \pi$ is $L$-smooth with $L > 0$, i.e. for all $z, z' \in \mathbb{R}^d$, $\|\nabla \log \pi(z') - \nabla \log \pi(z)\| \le L\, \|z' - z\|$. Then, if $0 \le \eta < L^{-1}$, for all $u \in \mathbb{R}^d$, $T^{\mathrm{MALA}}_u$ defined in (20) is a $C^1$-diffeomorphism.

The proof of Lemma 1 is postponed to Section 3 of the supplementary material.

### 3.2. Differentiable AIS-based ELBO

We now generalize the derivation of Section 2.1 to handle the case where $M_k$ and its reversal do not admit transition densities. In this case, the proposal and unnormalized target distributions are defined by

$$Q^K_\phi(\mathrm{d}z_{0:K}|x) = q_\phi(z_0|x)\,\mathrm{d}z_0 \prod_{k=1}^K M_k(z_{k-1}, \mathrm{d}z_k)\,, \qquad P^{\mathrm{un}}_\theta(x, \mathrm{d}z_{0:K}) = p_\theta(x, z_K)\,\mathrm{d}z_K \prod_{k=K}^{1} L_{k-1}(z_k, \mathrm{d}z_{k-1})\,, \quad (22)$$

where we define the reversal Markov kernel $L_{k-1}$ by

$$\gamma_k(z_{k-1})\,\mathrm{d}z_{k-1}\, M_k(z_{k-1}, \mathrm{d}z_k) = \gamma_k(z_k)\,\mathrm{d}z_k\, L_{k-1}(z_k, \mathrm{d}z_{k-1})\,.$$

We then consider the AIS estimator presented in Section 2.1 and given by (5), sampling independently $z^i_{0:K} \sim Q^K_\phi(\cdot|x)$ and with $w_K(z_{0:K}) = \prod_{k=1}^K w_k(z_{k-1})$, $w_k(z_{k-1}) = \gamma_k(z_{k-1})/\gamma_{k-1}(z_{k-1})$. A rigorous proof of the unbiasedness of the resulting estimator can be found in the supplementary material and is based on the formula

$$p_\theta(x) = \int w_K(z_{0:K})\, Q^K_\phi(\mathrm{d}z_{0:K}|x)\,. \quad (23)$$

In the sequel, we use Metropolis Adjusted Langevin Algorithm (MALA) kernels $\{M_k\}_{k=1}^K$ targeting $\gamma_k$ for each $k \in \{1, \dots, K\}$.
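A MALA transition in the mapping-plus-acceptance form (18)-(21) can be sketched as follows; this is a minimal NumPy implementation under our own assumptions, and the invariance check on a standard Gaussian target is our own sanity test, not an experiment from the paper.

```python
import numpy as np

def mala_step(z, log_pi, grad_log_pi, eta, rng):
    """One draw from the MALA kernel: sample u ~ phi_d, propose y = T_u(z)
    as in (20), accept with probability alpha_u(z) from (21).
    Returns the new state and the binary A/R outcome a."""
    d = z.size

    def log_m(s, t):
        # log m_eta(s, t): Gaussian move s -> t, mean s + eta * grad, cov 2 * eta * I
        diff = t - s - eta * grad_log_pi(s)
        return -0.5 * d * np.log(4.0 * np.pi * eta) - diff @ diff / (4.0 * eta)

    u = rng.standard_normal(d)
    y = z + eta * grad_log_pi(z) + np.sqrt(2.0 * eta) * u      # proposal T_u(z)
    log_alpha = min(0.0, log_pi(y) + log_m(y, z) - log_pi(z) - log_m(z, y))
    accept = np.log(rng.random()) < log_alpha
    return (y if accept else z), int(accept)

# Sanity check: the chain leaves pi = N(0, 1) invariant.
log_pi = lambda z: float(-0.5 * z @ z)
grad_log_pi = lambda z: -z
rng = np.random.default_rng(0)
z, samples = np.zeros(1), []
for _ in range(20000):
    z, _ = mala_step(z, log_pi, grad_log_pi, 0.5, rng)
    samples.append(z[0])
samples = np.array(samples)
print(samples.mean(), samples.var())   # approximately 0 and 1
```

Returning the binary outcome `a` alongside the state is exactly what the extended construction of Section 3.2 requires: the accept/reject variables become explicit random variables of the model.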
By construction, the Markov kernel $M_k$ is reversible w.r.t. $\gamma_k$, and we set the reversal kernel to $L_{k-1} = M_k$. Note that we could easily generalize to other cases, in particular inspired by recent works on non-reversible MCMC algorithms (Thin et al., 2020). For $k \in \{1, \dots, K\}$, we use the representation of the MALA kernel $M_k$ outlined in (18), with proposal mapping $T_{k,u}$ and acceptance function $\alpha_{k,u}$ defined as in (20) and (21) with $\pi \equiv \gamma_k$. We set

$$Q_{k,u}(z, \mathrm{d}z') = \sum_{a=0}^{1} \alpha^a_{k,u}(z)\, \delta_{T^a_{k,u}(z)}(\mathrm{d}z')\,. \quad (24)$$

By construction, the MALA kernel $M_k$ (see (18)) writes $M_k(z, \mathrm{d}z') = \int Q_{k,u}(z, \mathrm{d}z')\, \phi_d(u)\,\mathrm{d}u$. Plugging this representation into (22), we get

$$Q^K_\phi(\mathrm{d}z_{0:K}|x) = \int \Big[\prod_{k=1}^K Q_{k,u_k}(z_{k-1}, \mathrm{d}z_k)\Big]\, q_\phi(z_0|x)\, \phi_{d,K}(u_{1:K})\,\mathrm{d}z_0\,\mathrm{d}u_{1:K}\,,$$

writing $\phi_{d,K}(u_{1:K}) = \prod_{i=1}^K \phi_d(u_i)$. Eq. (24) suggests considering the extended distribution on $(z_{0:K}, a_{1:K}, u_{1:K})$:

$$Q^K_\phi(\mathrm{d}z_{0:K}, a_{1:K}, \mathrm{d}u_{1:K}|x) = q_\phi(z_0|x) \prod_{k=1}^K \alpha^{a_k}_{k,u_k}(z_{k-1}) \prod_{k=1}^K \delta_{T^{a_k}_{k,u_k}(z_{k-1})}(\mathrm{d}z_k)\, \phi_{d,K}(u_{1:K})\,\mathrm{d}z_0\,\mathrm{d}u_{1:K}\,,$$

which again admits $Q^K_\phi(\mathrm{d}z_{0:K}|x)$ as a marginal. Note that the variables $a_{1:K}$ correspond to the binary outcomes of the $K$ accept/reject (A/R) steps. By construction, $z_{1:K}$ are deterministic functions of $z_0$, $a_{1:K}$ and $u_{1:K}$: for each $k \in \{1, \dots, K\}$, $z_k = \circ_{i=1}^k T^{a_i}_{i,u_i}(z_0)$. Set $w_K(z_{0:K}) = \prod_{k=1}^K w_k(z_{k-1})$, and

$$\mathrm{A}(z_0, a_{1:K}, u_{1:K}) = \prod_{k=1}^K \alpha^{a_k}_{k,u_k}\big(\circ_{i=1}^{k-1} T^{a_i}_{i,u_i}(z_0)\big)\,, \qquad \mathrm{W}(z_0, a_{1:K}, u_{1:K}) = \log \prod_{k=1}^K w_k\big(\circ_{i=1}^{k-1} T^{a_i}_{i,u_i}(z_0)\big)\,.$$

Given $z_0$ and $u_{1:K}$, $\mathrm{A}(z_0, a_{1:K}, u_{1:K})$ is the conditional distribution of the A/R random variables $a_{1:K}$. It is easy to sample from this distribution recursively: for each $k \in \{1, \dots, K\}$, we sample $a_k$ from a Bernoulli distribution with parameter $\alpha^1_{k,u_k}(z_{k-1})$ and set $z_k = T^{a_k}_{k,u_k}(z_{k-1})$. Eqs. (5) and (9) naturally lead us to consider the ELBO

$$\mathcal{L}_{\mathrm{AIS}} = \int \sum_{a_{1:K}} Q^K_\phi(\mathrm{d}z_{0:K}, a_{1:K}, \mathrm{d}u_{1:K}|x)\, \log w_K(z_{0:K}) = \int \sum_{a_{1:K}} q_\phi(z_0|x)\, \mathrm{A}(z_0, a_{1:K}, u_{1:K})\, \mathrm{W}(z_0, a_{1:K}, u_{1:K})\, \phi_{d,K}(u_{1:K})\,\mathrm{d}z_0\,\mathrm{d}u_{1:K}\,,$$

where in the last expression we have integrated w.r.t. $z_{1:K}$. The fact that $\mathcal{L}_{\mathrm{AIS}}$ is an ELBO stems immediately from (23), by applying Jensen's inequality. We can optimize this ELBO w.r.t.
the different parameters at stake, possibly even the parameters of the proposal mappings. We reparameterize the latent variable distribution $q_\phi(z_0|x)$ in terms of a known base distribution and a differentiable transformation (such as a location-scale transformation). Assuming for simplicity that $q_\phi(z_0|x)$ is a Gaussian distribution $\mathrm{N}\big(z_0; \mu_\phi(x), \sigma^2_\phi(x)\big)$, the location-scale transformation using the standard Normal as a base distribution allows us to reparameterize $z_0$ as $z_0 = \mu_\phi(x) + \sigma_\phi(x) \odot u_0 = V_{\phi,x}(u_0)$, with $u_0 \sim \phi_d$. Using this reparameterization trick, we can write $\nabla \mathcal{L}_{\mathrm{AIS}} = \mathcal{L}_{\mathrm{AIS}1} + \mathcal{L}_{\mathrm{AIS}2}$ with

$$\mathcal{L}_{\mathrm{AIS}1} = \int \sum_{a_{1:K}} \mathrm{A}\big(V_{\phi,x}(u_0), a_{1:K}, u_{1:K}\big)\, \nabla \mathrm{W}\big(V_{\phi,x}(u_0), a_{1:K}, u_{1:K}\big)\, \phi_{d,K+1}(u_{0:K})\,\mathrm{d}u_{0:K}\,,$$

$$\mathcal{L}_{\mathrm{AIS}2} = \int \sum_{a_{1:K}} \mathrm{A}\big(V_{\phi,x}(u_0), a_{1:K}, u_{1:K}\big)\, \mathrm{W}\big(V_{\phi,x}(u_0), a_{1:K}, u_{1:K}\big)\, \nabla \log \mathrm{A}\big(V_{\phi,x}(u_0), a_{1:K}, u_{1:K}\big)\, \phi_{d,K+1}(u_{0:K})\,\mathrm{d}u_{0:K}\,. \quad (25)$$

The estimation of $\mathcal{L}_{\mathrm{AIS}1}$ is straightforward. We sample $n$ independent samples $u^{1:n}_{0:K} \sim \phi_{d,K+1}$ and, for $i \in \{1, \dots, n\}$, we set $z^i_0 = V_{\phi,x}(u^i_0)$; then, for $k \in \{1, \dots, K\}$, we sample the A/R variable $a^i_k \sim \mathrm{Ber}\big\{\alpha^1_{k,u^i_k}(z^i_{k-1})\big\}$ and set $z^i_k = T^{a^i_k}_{k,u^i_k}(z^i_{k-1})$; see Algorithm 2. Similarly, we then compute

$$\widehat{\mathcal{L}}_{\mathrm{AIS}2,n} = n^{-1} \sum_{i=1}^n \mathrm{W}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big)\, \nabla \log \mathrm{A}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big)\,.$$

The expression $\mathcal{L}_{\mathrm{AIS}2}$ is the REINFORCE gradient estimator (Williams, 1992) for the A/R probabilities. Indeed, we have to compute the gradient of the conditional distribution of the A/R variables given $(z_0, u_{0:K})$, and there is no obvious reparameterization for this purpose (see however (Maddison et al., 2016) for a possible solution to the problem; this solution was not investigated in this work). To reduce the variance of the REINFORCE estimator, we rely on control variates, in the spirit of (Mnih & Rezende, 2016). For $i \in \{1, \dots, n\}$, we define

$$\overline{\mathrm{W}}_{n,i} = \frac{1}{n-1} \sum_{j \ne i} \mathrm{W}\big(V_{\phi,x}(u^j_0), a^j_{1:K}, u^j_{1:K}\big)\,,$$

which is independent of $\mathrm{W}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big)$ and $\nabla \log \mathrm{A}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big)$ by construction.
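The leave-one-out control variate can be sketched as follows (illustrative function and argument names: `W` collects the $n$ values of $\mathrm{W}(\cdot)$ and `grad_log_A` the corresponding score vectors):

```python
import numpy as np

def reinforce_with_loo_baseline(W, grad_log_A):
    """REINFORCE estimator with a leave-one-out control variate: for each
    sample i, subtract the average of W over the other n - 1 samples before
    multiplying by grad log A(.).  W has shape (n,), grad_log_A has (n, p)."""
    n = W.size
    baseline = (W.sum() - W) / (n - 1.0)     # W-bar_{n,i}, independent of sample i
    return np.mean((W - baseline)[:, None] * grad_log_A, axis=0)

# The baseline leaves the expectation unchanged but removes the common level
# of W: with identical score vectors, the constant part cancels exactly.
W = np.array([1.0, 2.0, 3.0])
g = np.ones((3, 1))
print(reinforce_with_loo_baseline(W, g))   # [0.]
```

Because the baseline for sample $i$ never uses sample $i$ itself, the estimator stays unbiased while the common scale of $\mathrm{W}$ no longer inflates the variance.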
This provides the new unbiased estimator of the gradient,

$$\widehat{\mathcal{L}}_{\mathrm{AIS}2,n} = n^{-1} \sum_{i=1}^n \Big[\mathrm{W}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big) - \overline{\mathrm{W}}_{n,i}\Big]\, \nabla \log \mathrm{A}\big(V_{\phi,x}(u^i_0), a^i_{1:K}, u^i_{1:K}\big)\,. \quad (26)$$

Algorithm 2 shows how to compute $\mathrm{W}$ and $\log \mathrm{A}$.

**Algorithm 2** Annealed Importance Sampling VAE
- **Input:** number of steps $K$, proposal mappings $\{T_{k,u}\}$, acceptance functions $\{\alpha_{k,u}\}$, initial distribution $q_\phi$, unnormalized target distribution $p_\theta$, annealing schedule $\{\beta_k\}_{k=0}^K$.
- **Output:** AIS estimator $W$ of $\log p_\theta(x)$, sum $\log \mathrm{A}$ of the A/R log-probabilities.
- Draw $z_0 \sim q_\phi(\cdot|x)$; set $W = 0$ and $\log \mathrm{A} = 0$.
- **for** $k = 1$ **to** $K$ **do**
  - Draw $u_k \sim \phi_d$ and $a_k \sim \mathrm{Ber}\big(\alpha^1_{k,u_k}(z_{k-1})\big)$;
  - **if** $a_k = 1$ (accept) **then** set $z_k = z_{k-1} + \eta \nabla \log \gamma_k(z_{k-1}) + \sqrt{2\eta}\, u_k$; **else** set $z_k = z_{k-1}$;
  - Compute $\log w_k(z_{k-1}) = (\beta_k - \beta_{k-1})\big(\log p_\theta(x, z_{k-1}) - \log q_\phi(z_{k-1}|x)\big)$;
  - Set $W = W + \log w_k(z_{k-1})$ and $\log \mathrm{A} = \log \mathrm{A} + \log \alpha^{a_k}_{k,u_k}(z_{k-1})$;
- **end for**
- **Return** $W$, $\log \mathrm{A}$.

## 4. Experiments

### 4.1. Methods and practical guidelines

In what follows, we consider two sets of experiments.¹ In the first one, we aim at illustrating the expressivity and the efficiency of our estimator for VI. In the second, we tackle the problem of learning VAE: (a) classical VAE based on the mean-field approximation (Kingma & Welling, 2013); (b) Importance Weighted Autoencoder (IWAE, (Burda et al., 2015)); (c) L-MCVAE given by Algorithm 1; (d) A-MCVAE given by Algorithm 2.

¹The code to reproduce all of the experiments is available online at https://github.com/premolab/metflow/.

We provide in the following some guidelines on how to tune the step sizes and the annealing schedules in Algorithm 1 and Algorithm 2. A crucial hyperparameter of our method is the step size $\eta$. In principle, it could be learned by including it as an additional inference parameter $\phi$ and by maximizing the ELBO. However, it is then difficult to find a good tradeoff between having a high A/R ratio and a large step size $\eta$ at the same time. Instead, we suggest adjusting $\eta$ by targeting a fixed A/R ratio $\rho$.
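Algorithm 2 can be sketched in NumPy as follows, again on our own toy one-dimensional Gaussian model with known $\log p_\theta(x)$; all names are illustrative, and a single scalar step size $\eta$ is used in place of the preconditioned version discussed in Section 4.1.

```python
import numpy as np

def a_mcvae_estimate(sample_q, log_q, log_joint, grad_log_q, grad_log_joint,
                     betas, eta, rng):
    """One pass of Algorithm 2: AIS estimator W of log p_theta(x) with MALA
    kernels targeting gamma_k and AIS weights (9), together with the sum
    log A of the accept/reject log-probabilities needed for REINFORCE."""
    z = sample_q(rng)
    d = z.size
    W, logA = 0.0, 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_g = lambda v, b=b: (1.0 - b) * log_q(v) + b * log_joint(v)
        grad_g = lambda v, b=b: (1.0 - b) * grad_log_q(v) + b * grad_log_joint(v)

        def log_m(s, t):
            # log m_eta(s, t): Gaussian move s -> t, mean s + eta * grad, cov 2 * eta * I
            diff = t - s - eta * grad_g(s)
            return -0.5 * d * np.log(4.0 * np.pi * eta) - diff @ diff / (4.0 * eta)

        W += (b - b_prev) * (log_joint(z) - log_q(z))   # AIS weight (9) at z_{k-1}
        u = rng.standard_normal(d)
        y = z + eta * grad_g(z) + np.sqrt(2.0 * eta) * u            # proposal (20)
        alpha = np.exp(min(0.0, log_g(y) + log_m(y, z)
                                - log_g(z) - log_m(z, y)))          # (21)
        if rng.random() < alpha:           # a_k = 1: accept
            z = y
            logA += np.log(alpha)
        else:                              # a_k = 0: reject, stay at z_{k-1}
            logA += np.log1p(-alpha)
    return W, logA

# Toy check: q = N(0, 1), p(x, z) = N(z; 0, 1) N(x; z, 1), log p(x) = log N(x; 0, 2).
x = 1.0
log_q = lambda z: float(-0.5 * z @ z - 0.5 * np.log(2.0 * np.pi))
grad_log_q = lambda z: -z
log_joint = lambda z: float(-0.5 * z @ z - 0.5 * (x - z) @ (x - z)
                            - np.log(2.0 * np.pi))
grad_log_joint = lambda z: -z + (x - z)
sample_q = lambda rng: rng.standard_normal(1)

rng = np.random.default_rng(2)
betas = np.linspace(0.0, 1.0, 21)
vals = [a_mcvae_estimate(sample_q, log_q, log_joint, grad_log_q, grad_log_joint,
                         betas, 0.1, rng)[0] for _ in range(300)]
true = -0.5 * np.log(4.0 * np.pi) - 0.25
print(np.mean(vals), true)   # E[W] <= log p(x), close to it for K = 20
```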
It has proven effective to use a preconditioned version of (11), i.e. $Z_k = Z_{k-1} + \eta \odot \nabla \log \gamma_k(Z_{k-1}) + \sqrt{2\eta} \odot U_k$ with $\eta \in \mathbb{R}^d$, where we adapt each component of $\eta$ using the following rule:

$$\eta^{(i)} \leftarrow 0.9\, \eta^{(i)} + 0.1\, \eta_0 \Big/ \Big(\epsilon + \mathrm{std}\big[\nabla_{z^{(i)}} \log p_\theta(x, z)\big]\Big)\,. \quad (27)$$

Here $\mathrm{std}$ denotes the standard deviation over the batch $x$ of the quantity $\nabla_{z^{(i)}} \log p_\theta(x, z)$, and $\epsilon > 0$. The scalar $\eta_0$ is a tuning parameter which is adjusted to target the A/R ratio $\rho$. This strategy follows the same heuristics as Adam (Kingma & Ba, 2014). In the following, $\rho$ is set to 0.8 for A-MCVAE and 0.9 for L-MCVAE (keeping it high for L-MCVAE ensures that the Langevin dynamics stays almost reversible, thus keeping a low-variance SIS estimator).

Figure 1. Visualization of the posterior approximation obtained after optimization of the different bounds for the toy generation process. Top row, from left to right: true posterior, VAE posterior, IWAE posterior. Bottom row, from left to right: VI with Real NVP posterior, A-MCVAE posterior, L-MCVAE posterior.

An optimal choice of the temperature schedule $\{\beta_k\}_{k=0}^K$ for SIS and AIS is a difficult problem. We have focused in our experiments on three different settings. First, we consider the temperature schedule fixed and regularly spaced between 0 and 1. Following (Grosse et al., 2015), the second option is the sigmoidal tempering scheme, where $\beta_k = (\tilde{\beta}_k - \tilde{\beta}_1)/(\tilde{\beta}_K - \tilde{\beta}_1)$ with $\tilde{\beta}_k = \sigma\big(\delta(2k/K - 1)\big)$, $\sigma$ the sigmoid function and $\delta > 0$ a parameter that we optimize during the training phase. The last schedule consists in learning the temperatures $\{\beta_k\}_{k=0}^K$ directly as additional inference parameters $\phi$.

### 4.2. Toy 2D example and Probabilistic Principal Component Analysis

In the following two examples, we fix the parameters $\theta$ of the likelihood model and apply Algorithm 1 and Algorithm 2 to perform VI, i.e. to sample from $z \mapsto p_\theta(z|x)$. Consider first a toy hierarchical example where we generate some i.i.d. data $x = (x_i)_{i=1}^N \in \mathbb{R}^N$ from the i.i.d.
latent variables $z = (z_i)_{i=1}^N \in \mathbb{R}^{2N}$ as follows, for $\xi > 0$:

$$x_i \,|\, z_i \sim \mathrm{N}\big(\xi\, (\|z_i\|^2 + \zeta),\, \sigma^2\big) = p_\theta(x_i|z_i)\,.$$

We consider the variational approximation $q_\phi(z|x) = \mathrm{N}\big(z; \mu_\phi(x), \sigma_\phi(x)^2\, \mathrm{I}_d\big)$, where $\mu_\phi(x), \sigma_\phi(x) \in \mathbb{R}^d$ are the outputs of a fully connected neural network with parameters $\phi$. We compare these algorithms to VI using Real-valued Non-Volume Preserving transformations (Real NVP, Dinh et al. (2016)). Figure 1 displays the VI posterior approximations corresponding to the different schemes for a given observation $x$. It can be observed that MCVAEs benefit from flexible variational distributions compared to other classical schemes, which mostly fail to recover the true posterior. Additional results on the estimation of the parameters $\xi, \zeta$, given in the supplementary material, further support our claims; see Section 2.1 therein.

We now illustrate the performance of MCVAE on a probabilistic principal component analysis (pPCA) problem applied to MNIST (Salakhutdinov & Murray, 2008), as we can access in this case the exact likelihood and its gradient. We follow the set-up of (Ruiz et al., 2021, Section 6.1).

Figure 2. Representation of the different estimators (top) and their gradients (bottom) of the true log-likelihood. From left to right: a/ L-MCVAE, K = 5; b/ L-MCVAE, K = 10; c/ A-MCVAE, K = 5; d/ A-MCVAE, K = 10; e/ A-MCVAE, K = 5 with control variates.

We consider here a batch of size $N = 100$ for the model $p_\theta(x, z) = \mathrm{N}(z; 0, \mathrm{I}_d)\, \mathrm{N}(x; \theta_0 + \theta_1 z, \sigma^2\, \mathrm{I}_p)$, with $d = 100$ and $p = 784$. We fix $\theta_0, \theta_1$ arbitrarily and fit an amortized variational distribution $q_\phi(z|x)$ by maximizing the IWAE bound w.r.t. $\phi$ with $K = 100$ importance samples for a large number of epochs. The distribution $q_\phi(z|x) = \mathrm{N}\big(z; \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))\big)$ is a mean-field Gaussian distribution where $\mu_\phi, \sigma_\phi$ are linear functions of the observation $x$. We compare the Langevin SIS estimator (L-MCVAE) of the log evidence $\log p_\theta(x)$, with Langevin auxiliary kernels as described in Section 2.3, and the Langevin AIS estimate (A-MCVAE).
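Since the pPCA model is linear Gaussian, the exact evidence used as ground truth in this experiment is available in closed form: marginally $x \sim \mathrm{N}(\theta_0, \theta_1 \theta_1^\top + \sigma^2 \mathrm{I}_p)$. A sketch (the function name is ours):

```python
import numpy as np

def ppca_log_evidence(x, theta0, theta1, sigma2):
    """Exact log p(x) for the probabilistic PCA model of Section 4.2:
    p(x, z) = N(z; 0, I_d) N(x; theta0 + theta1 z, sigma2 I_p), so that
    marginally x ~ N(theta0, theta1 theta1^T + sigma2 I_p)."""
    p = x.size
    cov = theta1 @ theta1.T + sigma2 * np.eye(p)
    diff = x - theta0
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

# 1-D sanity check: with theta0 = 0, theta1 = 1, sigma2 = 1, x ~ N(0, 2),
# so the result equals log N(1; 0, 2) = -0.5 * log(4 * pi) - 1/4.
val = ppca_log_evidence(np.array([1.0]), np.array([0.0]), np.array([[1.0]]), 1.0)
print(val)
```

Having this closed form is what makes pPCA a useful benchmark: the Monte Carlo ELBOs and their gradients can be compared against the exact quantities.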
Moreover, we also compare the gradients of these quantities w.r.t. the parameters $\theta_0, \theta_1$. Figure 2 summarises the results with boxplots computed over 200 independent samples of each estimator. The quantity reported on the first boxplot corresponds to the Monte Carlo samples of $\log \hat{p}_\theta(x) - \log p_\theta(x)$. On the one hand, we note that the SIS estimator has larger variance than AIS, and that the latter achieves a better ELBO. Moreover, in both cases, increasing the number of steps $K$ tightens the bound. On the other hand, the estimator of the gradient of AIS is noisier than that of SIS, even though variance reduction techniques allow us to recover a similar variance. We also present in the supplementary material the Langevin SIS estimator using auxiliary backward kernels learnt with neural networks (as done in previous contributions); see Section 2.2 therein. The auxiliary neural backward kernels are set as $l(z, z') = \mathrm{N}\big(z'; \mu_\psi(z), \mathrm{diag}(\sigma^2_\psi(z))\big)$, $\mu_\psi(z), \sigma_\psi(z) \in \mathbb{R}^d$, where the parameters $\psi$ are learnt through the SIS ELBO, similarly to (Huang et al., 2018). The variance of this estimator and of its gradients is larger than that of SIS using the approximate reversals as backward kernels, i.e. $\ell_{k-1} = m_k$.

Figure 3. Log-likelihood of L-MCVAE depending on the number of Langevin steps K (1, 3, 5, 10), compared to the IWAE baseline. Increasing K improves performance, however at the expense of the computational complexity.

### 4.3. Numerical results for image datasets

Following (Wu et al., 2016), we propose to evaluate our models using AIS (not to be confused with the proposed AIS-based VI approach) to get an estimate of the negative log-likelihood. The base distribution is the distribution output by the encoder, and we perform $K$ steps of annealing to compute the estimator of the likelihood, as given by (5). In practice, we use $K = 5$ HMC steps with 3 leapfrog steps for evaluating our models.

Figure 4.
Evolution of the held-out log-likelihood during training for A-MCVAE, L-MCVAE, IWAE and VAE on MNIST.

We evaluate our models on three different datasets: MNIST, CIFAR-10 and CelebA. All the models we compare share the same architecture: the inference network $q_\phi$ is given by a convolutional network with 8 convolutional layers and one linear layer, which outputs the parameters $\mu_\phi(x), \sigma_\phi(x) \in \mathbb{R}^d$ of a factorized Gaussian distribution, while the generative model $p_\theta(\cdot|z)$ is given by another convolutional network $\pi_\theta$, where we use nearest-neighbour upsamplings. This network outputs the parameters of the factorized Bernoulli distribution (for the MNIST dataset), that is $p_\theta(x|z) = \prod_{i=1}^p \mathrm{Ber}\big(x^{(i)} \,\big|\, [\pi_\theta(z)]^{(i)}\big)$, and similarly the means of the Gaussian distributions for the colored datasets (CIFAR-10, CelebA). We compare A-MCVAE, L-MCVAE, IWAE and VAE with different settings. All the models are implemented using PyTorch (Paszke et al., 2019) and optimized using the Adam optimizer (Kingma & Ba, 2014) for 100 epochs each. The training process uses the PyTorch Lightning toolkit (Falcon, 2019).

First, consider the dynamically binarized MNIST dataset (Salakhutdinov & Murray, 2008). In this case, the latent dimension is set to $d = 64$. We present in Table 1 the results of the different models at different stages of the optimization. Moreover, we show in Figure 3 the performance of L-MCVAE for different values of $K$ compared to the IWAE baseline. In particular, we see that increasing $K$ increases the performance of our VAE, however at the expense of an increase in computational cost. We also display in Figure 4 the evolution of the held-out log-likelihood for various objectives during training. Adding Langevin transitions appears to help the convergence of the models. Second, we compare similarly the different models on CelebA and CIFAR; see Table 2 and Table 3. In this case, the latent dimension is chosen to be $d = 128$. Increasing the number of MCMC steps seems again to improve both the ELBO and the final log-likelihood estimate.
In each case, all models are run with 5 different seeds to compute the presented empirical standard deviation. 5. Discussion We have shown in this article how one can leverage state-ofthe-art Monte Carlo estimators of the evidence to develop novel competitive VAEs by developing novel gradient estimates of the corresponding ELBOs. For a given computational complexity, AIS based on MALA provides ELBO estimates which are typically tighter than SIS estimates based on ULA. However, the variance of the gradient estimates of the AIS-based ELBO (A-MCVAE) is also significantly larger than for the SIS-based ELBO Monte Carlo VAE Table 1. Results of the different models on MNIST. A more detailed version of this table is included in the supplementary material. negative ELBO estimate NLL estimate number of epochs 10 30 100 10 30 100 VAE 95.26 0.49 91.58 0.27 89.70 0.19 89.83 0.59 86.86 0.26 85.22 0.07 IWAE, K = 10 91.42 0.21 88.56 0.07 87.16 0.19 88.54 0.27 86.07 0.1 84.82 0.1 IWAE, K = 50 90.34 0.27 87.5 0.16 86.05 0.11 89.4 0.25 86.54 0.15 85.05 0.1 L-MCVAE, K = 5 96.62 3.24 88.58 0.75 87.51 0.41 90.59 2.01 85.68 0.49 84.92 0.24 L-MCVAE, K = 10 96.78 1.06 87.99 0.71 86.8 0.66 91.33 0.61 85.47 0.46 84.58 0.39 A-MCVAE, K = 3 96.21 3.43 88.64 0.78 87.63 0.42 90.42 2.34 85.77 0.65 85.02 0.37 A-MCVAE, K = 5 95.55 2.96 87.99 0.57 87.03 0.27 90.39 2.21 85.6 0.67 84.84 0.38 VAE with Real NVP 95.23 0.33 91.69 0.15 89.62 0.17 89.98 0.24 86.88 0.05 85.23 0.18 Table 2. Results of the different models on Celeb A. A more detailed version of this table is included in the supplementary material. 11400 must be added to all scores in this table. 
| Model | −ELBO, ep. 10 | −ELBO, ep. 30 | −ELBO, ep. 100 | NLL, ep. 10 | NLL, ep. 30 | NLL, ep. 100 |
|---|---|---|---|---|---|---|
| VAE | 23.78 ± 1.95 | 17.99 ± 0.4 | 14.72 ± 0.16 | 17.35 ± 1.7 | 12.68 ± 0.62 | 10.11 ± 0.32 |
| IWAE, K = 10 | 20.59 ± 0.71 | 15.45 ± 0.52 | 12.2 ± 0.3 | 18.25 ± 0.6 | 13.18 ± 0.42 | 10.14 ± 0.31 |
| IWAE, K = 50 | 19.05 ± 0.39 | 13.59 ± 0.5 | 10.48 ± 0.89 | 19.08 ± 0.42 | 13.17 ± 0.54 | 10.12 ± 0.86 |
| L-MCVAE, K = 5 | 21.61 ± 1.48 | 12.72 ± 0.43 | 11.6 ± 0.37 | 16.42 ± 1.47 | 9.62 ± 0.47 | 8.72 ± 0.4 |
| L-MCVAE, K = 10 | 20.7 ± 1.15 | 11.81 ± 0.34 | 10.6 ± 0.23 | 17.0 ± 1.87 | 9.29 ± 0.73 | 8.24 ± 0.52 |
| A-MCVAE, K = 3 | 21.59 ± 1.5 | 13.94 ± 0.42 | 12.84 ± 0.3 | 16.64 ± 1.37 | 10.98 ± 0.48 | 9.95 ± 0.3 |
| A-MCVAE, K = 5 | 20.95 ± 1.18 | 12.42 ± 0.42 | 11.13 ± 0.37 | 17.42 ± 1.49 | 9.97 ± 0.59 | 8.82 ± 0.57 |
| VAE with Real NVP | 15.12 ± 0.48 | 13.63 ± 0.27 | 12.58 ± 0.61 | 10.42 ± 0.33 | 9.04 ± 0.26 | 8.98 ± 0.2 |

Table 3. Results of the different models on CIFAR-10. A more detailed version of this table is included in the supplementary material. 2800 must be added to all scores in this table.

| Model | −ELBO, ep. 10 | −ELBO, ep. 30 | −ELBO, ep. 100 | NLL, ep. 10 | NLL, ep. 30 | NLL, ep. 100 |
|---|---|---|---|---|---|---|
| VAE | 69.57 ± 0.08 | 69.55 ± 0.51 | 68.84 ± 0.06 | 68.51 ± 0.07 | 68.41 ± 0.33 | 67.9 ± 0.03 |
| IWAE, K = 10 | 69.82 ± 0.03 | 69.35 ± 0.03 | 69.36 ± 0.36 | 68.56 ± 0.03 | 68.0 ± 0.03 | 68.02 ± 0.4 |
| IWAE, K = 50 | 69.94 ± 0.08 | 69.55 ± 0.04 | 69.43 ± 0.03 | 69.15 ± 0.15 | 68.37 ± 0.18 | 67.93 ± 0.02 |
| L-MCVAE, K = 5 | 70.62 ± 0.41 | 68.55 ± 0.18 | 68.09 ± 0.1 | 69.15 ± 0.38 | 67.73 ± 0.07 | 67.5 ± 0.07 |
| L-MCVAE, K = 10 | 70.99 ± 0.59 | 68.36 ± 0.04 | 68.03 ± 0.0 | 69.8 ± 0.67 | 67.76 ± 0.04 | 67.51 ± 0.03 |
| A-MCVAE, K = 3 | 69.97 ± 0.99 | 68.48 ± 0.29 | 68.18 ± 0.16 | 69.26 ± 0.76 | 67.77 ± 0.18 | 67.55 ± 0.1 |
| A-MCVAE, K = 5 | 70.1 ± 0.89 | 68.28 ± 0.2 | 68.01 ± 0.08 | 69.23 ± 0.75 | 67.71 ± 0.15 | 67.5 ± 0.07 |
| VAE with Real NVP | 70.01 ± 0.12 | 69.51 ± 0.07 | 69.19 ± 0.13 | 68.73 ± 0.05 | 68.35 ± 0.05 | 68.05 ± 0.02 |

(L-MCVAE), as the former has to rely on REINFORCE gradient estimates. While control variates can be considered to reduce this variance, they come at a significant increase in computational cost. Empirically, L-MCVAE should thus be favoured, as it provides both a tighter ELBO than standard techniques and low-variance gradient estimates.

Acknowledgements

The work was partly supported by ANR-19-CHIA-0002-01 SCAI and EPSRC CoSInES grant EP/R034710/1.
It was partly carried out under the framework of the HSE University Basic Research Program. The development of a software system for the experimental study of VAEs and its application to computer vision problems (Section 4) was supported by the Russian Science Foundation grant 20-71-10135. Part of this research has been carried out under the auspices of the Lagrange Center for Mathematics and Computing.

References

Besag, J. Comments on "Representations of knowledge in complex systems" by U. Grenander and M. Miller. J. Roy. Statist. Soc. Ser. B, 56:591–592, 1994.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pp. 8167–8177, 2018.

Crooks, G. E. Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(5-6):1481–1487, 1998.

Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.

Ding, X. and Freedman, D. J. Learning deep generative models with annealed importance sampling. arXiv preprint arXiv:1906.04904, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Domke, J. and Sheldon, D. R. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4470–4479, 2018.

Falcon, W. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning, 2019.

Goyal, A. G. A. P., Ke, N. R., Ganguli, S., and Bengio, Y. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.

Grosse, R. B., Ghahramani, Z., and Adams, R. P.
Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.

Heng, J., Bishop, A. N., Deligiannidis, G., and Doucet, A. Controlled sequential Monte Carlo. The Annals of Statistics, 48(5):2904–2929, 2020.

Huang, C.-W., Tan, S., Lacoste, A., and Courville, A. C. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems, pp. 9701–9711, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary deep generative models. In International Conference on Machine Learning, pp. 1445–1453. PMLR, 2016.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Maddison, C. J., Lawson, D., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. W. Filtering variational objectives. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6576–6586, 2017.

Mnih, A. and Rezende, D. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pp. 2188–2196. PMLR, 2016.

Monmarché, P. High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin. arXiv preprint arXiv:2007.05455, 2020.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.
PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.

Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333. PMLR, 2016.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Ruiz, F. J., Titsias, M. K., Cemgil, T., and Doucet, A. Unbiased gradient estimation for variational auto-encoders using coupled Markov chains. In Uncertainty in Artificial Intelligence, 2021.

Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879, 2008.

Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.

Thin, A., Kotelevskii, N., Andrieu, C., Durmus, A., Moulines, E., and Panov, M. Nonreversible MCMC from conditional invertible transforms: a complete recipe with convergence guarantees. arXiv preprint arXiv:2012.15550, 2020.

Tierney, L. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728, 1994.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wu, H., Köhler, J., and Noé, F. Stochastic normalizing flows. Advances in Neural Information Processing Systems, 2020.

Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.