# GO GRADIENT FOR EXPECTATION-BASED OBJECTIVES

Yulai Cong, Miaoyun Zhao, Ke Bai, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Correspondence to: Yulai Cong, Miaoyun Zhao.

Within many machine learning algorithms, a fundamental problem concerns efficient calculation of an unbiased gradient wrt parameters γ for expectation-based objectives $\mathbb{E}_{q_\gamma(y)}[f(y)]$. Most existing methods either (i) suffer from high variance, requiring (often complicated) variance-reduction techniques; or (ii) apply only to reparameterizable continuous random variables, employing the reparameterization trick. To address these limitations, we propose a General and One-sample (GO) gradient that (i) applies to many distributions associated with non-reparameterizable continuous or discrete random variables, and (ii) has the same low variance as the reparameterization trick. We find that the GO gradient often works well in practice based on only one Monte Carlo sample (although one can of course use more samples if desired). Alongside the GO gradient, we develop a means of propagating the chain rule through distributions, yielding statistical back-propagation, coupling neural networks to common random variables.

## 1 INTRODUCTION

Neural networks, typically trained using back-propagation for parameter optimization, have recently demonstrated significant success across a wide range of applications. There has been interest in coupling neural networks with random variables, so as to embrace greater descriptive capacity. Recent examples of this include black-box variational inference (BBVI) (Kingma & Welling, 2014; Rezende et al., 2014; Ranganath et al., 2014; Hernández-Lobato et al., 2016; Ranganath et al., 2016b; Li & Turner, 2016; Ranganath et al., 2016a; Zhang et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018). Unfortunately, efficiently back-propagating gradients through general distributions (random variables) remains a bottleneck. Most current methodology focuses on distributions with continuous random variables, for which the reparameterization trick may be readily applied (Kingma & Welling, 2014; Grathwohl et al., 2017).

As an example, the aforementioned bottleneck greatly constrains the applicability of BBVI, by limiting variational approximations to reparameterizable distributions. This limitation excludes discrete random variables and many types of continuous ones. From the perspective of GANs, the need to employ reparameterization has constrained most applications to continuous observations, although many forms of data are more naturally discrete.

The fundamental problem associated with the aforementioned challenges is the need to efficiently calculate an unbiased, low-variance gradient wrt parameters γ for an expectation objective of the form $\mathbb{E}_{q_\gamma(y)}[f(y)]$.¹ We are interested in general distributions $q_\gamma(y)$, for which the components of y may be either continuous or discrete. Typically the components of y have a hierarchical structure, and a subset of the components of y play a role in evaluating f(y). Unfortunately, classical methods for estimating gradients of $\mathbb{E}_{q_\gamma(y)}[f(y)]$ wrt γ have limitations.

¹ In this paper, we consider expectation objectives meeting basic assumptions: (i) $q_\gamma(y)$ is differentiable wrt γ; (ii) f(y) is differentiable for continuous y; and (iii) f(y) < ∞ for discrete y. For simplicity, these assumptions are made implicitly in the main paper, as are fundamental rules such as Leibniz's rule.
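To fix ideas, the sketch below shows the objective being differentiated as a Monte Carlo average whose samples themselves depend on γ, which is precisely what makes the two classical estimators discussed next necessary. It is illustrative only: the Gamma choice of $q_\gamma$, the quadratic f, and the function names are assumptions, not the paper's.

```python
# Minimal sketch (not from the paper): the expectation-based objective
# E_{q_gamma(y)}[f(y)], approximated by Monte Carlo. Here q_gamma is taken to be
# a Gamma(gamma, 1) distribution and f is an arbitrary test function; note that
# the samples y depend on gamma, so one cannot simply differentiate this average.
import numpy as np

rng = np.random.default_rng(0)

def f(y):
    return (y - 2.0) ** 2                      # example integrand f(y)

def objective_mc(gamma, num_samples=1000):
    """Monte Carlo estimate of E_{q_gamma(y)}[f(y)] with q_gamma = Gamma(gamma, 1)."""
    y = rng.gamma(shape=gamma, scale=1.0, size=num_samples)   # y ~ q_gamma(y)
    return f(y).mean()

print(objective_mc(3.0))
```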
The REINFORCE gradient (Williams, 1992), although generally applicable (e.g., to continuous and discrete random variables), exhibits high variance under Monte Carlo (MC) estimation of the expectation, forcing one to apply additional variance-reduction techniques. The reparameterization trick (Rep) (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) works well, with as few as only one MC sample, but it is limited to continuous reparameterizable y. Many efforts have been devoted to improving these two formulations, as detailed in Section 6. However, none of these methods is characterized by both generalization (applicability to general distributions) and efficiency (working well with as few as one MC sample).

The key contributions of this work are based on the recognition that REINFORCE and Rep seek to solve the same objective, but in practice Rep yields lower-variance estimates, albeit for a narrower class of distributions. Recent work (Ranganath et al., 2016b) has made a connection between REINFORCE and Rep, recognizing that the former numerically estimates a term that the latter evaluates analytically. The high variance with which REINFORCE approximates this term manifests as high variance in the gradient estimate. Extending these ideas, we make the following main contributions. (i) We propose a new General and One-sample (GO) gradient in Section 3, which principally generalizes Rep to many non-reparameterizable distributions and justifies two recent methods (Figurnov et al., 2018; Jankowiak & Obermeyer, 2018); the "One sample" motivating the name GO is meant to highlight the low variance of the proposed method, although of course one may use more than one sample if desired. (ii) We find that the core of the GO gradient is something we term a variable-nabla, which can be interpreted as the gradient of a random variable wrt a parameter. (iii) Utilizing variable-nablas to propagate the chain rule through distributions, we broaden the applicability of the GO gradient in Sections 4-5 and present statistical back-propagation, a statistical generalization of classic back-propagation (Rumelhart & Hinton, 1986). Through this generalization, we may couple neural networks to general random variables, and compute the needed gradients with low variance.

## 2 BACKGROUND

To motivate this paper, we begin by briefly elucidating common machine learning problems for which there is a need to efficiently estimate gradients wrt γ of functions of the form $\mathbb{E}_{q_\gamma(y)}[f(y)]$.

Assume access to data samples $\{x_i\}_{i=1,N}$, drawn i.i.d. from the true (and unknown) underlying distribution q(x). We seek to learn a model $p_\theta(x)$ to approximate q(x). A classic approach to such learning is to maximize the expected log-likelihood $\hat{\theta} = \operatorname{argmax}_\theta \mathbb{E}_{q(x)}[\log p_\theta(x)]$, perhaps with an added regularization term on θ. The expectation $\mathbb{E}_{q(x)}(\cdot)$ is approximated via the available data samples, as $\hat{\theta} = \operatorname{argmax}_\theta \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)$. It is often convenient to employ a model with latent variables z, i.e., $p_\theta(x) = \int p_\theta(x, z)\,dz = \int p_\theta(x|z)\,p(z)\,dz$, with prior p(z) on z.
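To make the latent-variable construction concrete, the sketch below estimates the marginal likelihood by naively averaging $p_\theta(x|z)$ over prior samples; such estimates are typically poor, which is part of why the variational bound introduced next is preferred. It is illustrative only: the standard-normal prior, the Bernoulli likelihood, and all names such as `log_px_given_z` are assumptions, not the paper's.

```python
# Sketch only: naive Monte Carlo estimate of log p_theta(x) = log E_{p(z)}[p_theta(x|z)]
# for a toy model with a standard-normal prior and a Bernoulli likelihood whose
# logits come from a (hypothetical) linear decoder W. All names are illustrative.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
D, K, S = 10, 2, 5000            # data dimension, latent dimension, number of prior samples
W = rng.normal(size=(D, K))      # stand-in for the decoder parameters theta
x = rng.integers(0, 2, size=D)   # one binary observation

def log_px_given_z(x, z):
    """log p_theta(x | z): Bernoulli likelihood with logits W @ z."""
    logits = W @ z
    return np.sum(x * logits - np.logaddexp(0.0, logits))

z = rng.normal(size=(S, K))                      # z_s ~ p(z) = N(0, I)
log_w = np.array([log_px_given_z(x, z_s) for z_s in z])
log_px = logsumexp(log_w) - np.log(S)            # log (1/S) sum_s p_theta(x | z_s)
print(log_px)
```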
The integral wrt z is typically intractable, motivating introduction of the approximate posterior $q_\phi(z|x)$, with parameters φ. The well-known evidence lower bound (ELBO) (Jordan et al., 1999; Bishop, 2006) is defined as

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x, z) - \log q_\phi(z|x)] \qquad (1)$$
$$= \log p_\theta(x) - \mathrm{KL}[q_\phi(z|x)\,\|\,p_\theta(z|x)] \le \log p_\theta(x), \qquad (2)$$

where $p_\theta(z|x)$ is the true posterior, and $\mathrm{KL}(\cdot\|\cdot)$ represents the Kullback-Leibler divergence. Variational learning seeks $(\hat{\theta}, \hat{\phi}) = \operatorname{argmax}_{\theta,\phi} \sum_{i=1}^{N} \mathrm{ELBO}(\theta, \phi; x_i)$.

While computation of the ELBO has been considered for many years, a problem introduced more recently concerns adversarial learning of $p_\theta(x)$, or, more precisely, learning a model that allows one to efficiently and accurately draw samples $x \sim p_\theta(x)$ that are similar to $x \sim q(x)$. With generative adversarial networks (GANs) (Goodfellow et al., 2014), one seeks to solve

$$\min_\theta \max_\beta \; \mathbb{E}_{q(x)}[\log D_\beta(x)] + \mathbb{E}_{p_\theta(x)}[\log(1 - D_\beta(x))], \qquad (3)$$

where $D_\beta(x)$ is a discriminator with parameters β, quantifying the probability that x was drawn from q(x), with $1 - D_\beta(x)$ representing the probability that it was drawn from $p_\theta(x)$. There have been many recent extensions of GAN (Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018), but the basic setup in (3) holds for most.

To optimize (1) and (3), the most challenging gradients that must be computed are of the form $\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)]$; for (1), y = z and γ = φ, while for (3), y = x and γ = θ. The need to evaluate expressions like $\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)]$ arises in many other machine learning problems, and consequently it has generated much prior attention.

Evaluation of the gradient of the expectation is simplified if the gradient can be moved inside the expectation. REINFORCE (Williams, 1992) is based on the identity

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \mathbb{E}_{q_\gamma(y)}\big[ f(y)\,\nabla_\gamma \log q_\gamma(y) \big]. \qquad (4)$$

While simple in concept, this estimate is known to have high variance when the expectation $\mathbb{E}_{q_\gamma(y)}(\cdot)$ is approximated (as needed in practice) by a finite number of samples. An approach (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) that has attracted recent attention is termed the reparameterization trick, applicable when $q_\gamma(y)$ can be reparameterized as $y = \tau_\gamma(\epsilon)$, with $\epsilon \sim q(\epsilon)$, where $\tau_\gamma(\epsilon)$ is a differentiable transformation and $q(\epsilon)$ is a simple distribution that may be readily sampled. For consistency with the literature, we call a distribution reparameterizable if and only if these conditions are satisfied.² In this case we have

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \mathbb{E}_{q(\epsilon)}\Big[ [\nabla_\gamma \tau_\gamma(\epsilon)]\,[\nabla_y f(y)]\big|_{y=\tau_\gamma(\epsilon)} \Big]. \qquad (5)$$

This gradient, termed Rep, is typically characterized by relatively low variance when approximating $\mathbb{E}_{q(\epsilon)}(\cdot)$ with a small number of samples $\epsilon \sim q(\epsilon)$. This approach has been widely employed for computation of the ELBO and within GANs, but it limits one to models that satisfy the assumptions of Rep.

## 3 GO GRADIENT

The reparameterization trick (Rep) is limited to reparameterizable random variables y with continuous components. There are situations for which Rep is not readily applicable, e.g., where the components of y may be discrete, or nonnegative Gamma distributed. We seek to gain insights from the relationship between REINFORCE and Rep, and to generalize the types of random variables y for which the latter approach may be effected. We term our proposed approach a General and One-sample (GO) gradient. In practice, we find that this approach works well with as few as one sample for evaluating the expectation, and it is applicable to more general settings than Rep.
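Before deriving the GO gradient, it helps to see the variance gap between (4) and (5) in a toy one-dimensional case. The sketch below is illustrative only (the Gaussian $q_\mu$ and quadratic f are assumptions, not the paper's experiments); it computes single-sample REINFORCE and Rep estimates of $\nabla_\mu \mathbb{E}_{\mathcal{N}(y;\mu,1)}[y^2]$, whose exact value is $2\mu$.

```python
# Sketch only: single-sample REINFORCE (4) vs Rep (5) for the toy objective
# E_{N(y; mu, 1)}[y^2], whose exact gradient wrt mu is 2*mu. The distribution
# and integrand are illustrative assumptions, not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 1.5, 1.0, 100000

eps = rng.normal(size=trials)
y = mu + sigma * eps                               # y = tau_mu(eps) ~ N(mu, sigma^2)

grad_reinforce = (y ** 2) * (y - mu) / sigma ** 2  # f(y) * d/dmu log q_mu(y)
grad_rep = 2.0 * y                                 # [d tau_mu / d mu] * [d f / d y]

print("exact     :", 2.0 * mu)
print("REINFORCE : mean %.3f, std %.2f" % (grad_reinforce.mean(), grad_reinforce.std()))
print("Rep       : mean %.3f, std %.2f" % (grad_rep.mean(), grad_rep.std()))
```

Both estimators are unbiased here, but the single-sample REINFORCE estimates are far more dispersed, reflecting the term that Rep handles analytically.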
Recall that Rep was first applied within the context of variational learning (Kingma & Welling, 2014), as in (1). Specifically, it was assumed $q_\gamma(y) = \prod_v q_\gamma(y_v)$, omitting explicit dependence on data x for notational convenience; $y_v$ is component v of y. In Kingma & Welling (2014), $q_\gamma(y_v)$ corresponded to a Gaussian distribution $q_\gamma(y_v) = \mathcal{N}(y_v; \mu_v(\gamma), \sigma^2_v(\gamma))$, with mean $\mu_v(\gamma)$ and variance $\sigma^2_v(\gamma)$. In the following we generalize $q_\gamma(y_v)$ such that it need not be Gaussian. Applying integration by parts (Ranganath et al., 2016b),

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \int f(y)\,\nabla_\gamma q_\gamma(y_v)\, dy_v \Big] \qquad (6)$$
$$= \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \underbrace{f(y)\,\nabla_\gamma Q_\gamma(y_v)\Big|_{y_v}}_{0} - \underbrace{\int [\nabla_\gamma Q_\gamma(y_v)]\,[\nabla_{y_v} f(y)]\, dy_v}_{\text{Key}} \Big], \qquad (7)$$

where $y_{-v}$ denotes y with $y_v$ excluded, and $Q_\gamma(y_v)$ is the cumulative distribution function (CDF) of $q_\gamma(y_v)$. The 0 term is readily proven to be zero for any $Q_\gamma(y_v)$, under the assumption that f(y) does not tend to infinity faster than $\nabla_\gamma Q_\gamma(y_v)$ tends to zero as $y_v$ approaches the boundaries of its support. The Key term exactly recovers the one-dimensional Rep when a reparameterization $y_v = \tau_\gamma(\epsilon_v)$, $\epsilon_v \sim q(\epsilon_v)$ exists (Ranganath et al., 2016b). Further, applying $\nabla_\gamma q_\gamma(y_v) = q_\gamma(y_v)\,\nabla_\gamma \log q_\gamma(y_v)$ in (6) yields REINFORCE. Consequently, it appears that Rep achieves low variance by analytically setting to zero the unnecessary but high-variance-injecting 0 term, while in contrast REINFORCE implicitly seeks to numerically implement both terms in (7).

We next generalize $q_\gamma(y)$ for discrete $y_v$, here assuming $y_v \in \{0, 1, \ldots, \infty\}$. It is shown in Appendix A.2 that this framework is also applicable to discrete $y_v$ with a finite alphabet. It may be shown (see Appendix A.2) that

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \sum_{y_v} f(y)\,\nabla_\gamma q_\gamma(y_v) \Big]$$
$$= \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \underbrace{[f(y)\,\nabla_\gamma Q_\gamma(y_v)]\big|_{y_v=\infty}}_{0} - \underbrace{\sum_{y_v} [\nabla_\gamma Q_\gamma(y_v)]\,[f(y_{-v}, y_v + 1) - f(y)]}_{\text{Key}} \Big].$$

² A Bernoulli random variable y ∼ Bernoulli(P) is identical to y = 1ϵ
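The discrete Key term above already suggests a one-sample estimator: rewriting the sum over $y_v$ as an expectation under $q_\gamma(y_v)$ gives $-[\nabla_\gamma Q_\gamma(y_v)/q_\gamma(y_v)]\,[f(y_{-v}, y_v+1) - f(y)]$ evaluated at a single draw $y_v \sim q_\gamma(y_v)$. The sketch below is illustrative only, not the paper's code: the Poisson choice of $q_\gamma$, the quadratic f, and the finite-difference CDF gradient are all assumptions. It checks the estimator against the exact gradient $1 + 2\lambda$ of $\mathbb{E}_{\mathrm{Pois}(\lambda)}[y^2] = \lambda + \lambda^2$.

```python
# Sketch only (not the paper's implementation): the discrete Key term rewritten as
# an expectation under q_lambda(y) gives the one-sample estimator
#   grad ~= [-d/dlambda Q_lambda(y) / q_lambda(y)] * [f(y+1) - f(y)],  y ~ q_lambda,
# illustrated here with q_lambda = Poisson(lambda) and f(y) = y^2.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam, eps, trials = 3.0, 1e-4, 200000

def f(y):
    return y.astype(float) ** 2

y = rng.poisson(lam, size=trials)
# Finite-difference approximation of d/dlambda Q_lambda(y), the CDF gradient wrt lambda.
dQ = (poisson.cdf(y, lam + eps) - poisson.cdf(y, lam - eps)) / (2 * eps)
g = -dQ / poisson.pmf(y, lam)                  # per-sample "variable-nabla"
go_estimates = g * (f(y + 1) - f(y))           # one-sample GO estimates of the gradient

print("exact gradient:", 1 + 2 * lam)
print("GO estimate   : mean %.3f, std %.2f" % (go_estimates.mean(), go_estimates.std()))
```

For the Poisson case the CDF gradient satisfies $\nabla_\lambda Q_\lambda(y) = -q_\lambda(y)$, so each single-sample estimate reduces to $f(y+1) - f(y)$, which is unbiased for the exact gradient and has low variance compared with a score-function estimate of the same quantity.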