# GO GRADIENT FOR EXPECTATION-BASED OBJECTIVES

Yulai Cong, Miaoyun Zhao, Ke Bai, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Correspondence to: Yulai Cong, Miaoyun Zhao.

Within many machine learning algorithms, a fundamental problem concerns efficient calculation of an unbiased gradient wrt parameters γ for expectation-based objectives $\mathbb{E}_{q_\gamma(y)}[f(y)]$. Most existing methods either (i) suffer from high variance, requiring (often complicated) variance-reduction techniques; or (ii) apply only to reparameterizable continuous random variables, employing the reparameterization trick. To address these limitations, we propose a General and One-sample (GO) gradient that (i) applies to many distributions associated with non-reparameterizable continuous or discrete random variables, and (ii) has the same low variance as the reparameterization trick. We find that the GO gradient often works well in practice based on only one Monte Carlo sample (although one can of course use more samples if desired). Alongside the GO gradient, we develop a means of propagating the chain rule through distributions, yielding statistical back-propagation, coupling neural networks to common random variables.

## 1 INTRODUCTION

Neural networks, typically trained using back-propagation for parameter optimization, have recently demonstrated significant success across a wide range of applications. There has been interest in coupling neural networks with random variables, so as to embrace greater descriptive capacity. Recent examples of this include black-box variational inference (BBVI) (Kingma & Welling, 2014; Rezende et al., 2014; Ranganath et al., 2014; Hernández-Lobato et al., 2016; Ranganath et al., 2016b; Li & Turner, 2016; Ranganath et al., 2016a; Zhang et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018). Unfortunately, efficiently back-propagating gradients through general distributions (random variables) remains a bottleneck. Most current methodology focuses on distributions with continuous random variables, for which the reparameterization trick may be readily applied (Kingma & Welling, 2014; Grathwohl et al., 2017).

As an example, the aforementioned bottleneck greatly constrains the applicability of BBVI, by limiting variational approximations to reparameterizable distributions. This limitation excludes discrete random variables and many types of continuous ones. From the perspective of GANs, the need to employ reparameterization has constrained most applications to continuous observations, although many forms of data are more naturally discrete.

The fundamental problem associated with the aforementioned challenges is the need to efficiently calculate an unbiased, low-variance gradient wrt parameters γ for an expectation objective of the form $\mathbb{E}_{q_\gamma(y)}[f(y)]$.¹ We are interested in general distributions $q_\gamma(y)$, for which the components of y may be either continuous or discrete. Typically the components of y have a hierarchical structure, and a subset of the components of y play a role in evaluating f(y). Unfortunately, classical methods for estimating gradients of $\mathbb{E}_{q_\gamma(y)}[f(y)]$ wrt γ have limitations.

¹ In this paper, we consider expectation objectives meeting basic assumptions: (i) $q_\gamma(y)$ is differentiable wrt γ; (ii) f(y) is differentiable for continuous y; and (iii) f(y) < ∞ for discrete y. For simplicity, these assumptions are made implicitly in the main paper, as are fundamental rules such as Leibniz's rule.
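To fix ideas, the sketch below shows the objective being differentiated as a Monte Carlo average whose samples themselves depend on γ, which is precisely what makes the two classical estimators discussed next necessary. It is illustrative only: the Gamma choice of $q_\gamma$, the quadratic f, and the function names are assumptions, not the paper's.

```python
# Minimal sketch (not from the paper): the expectation-based objective
# E_{q_gamma(y)}[f(y)], approximated by Monte Carlo. Here q_gamma is taken to be
# a Gamma(gamma, 1) distribution and f is an arbitrary test function; note that
# the samples y depend on gamma, so one cannot simply differentiate this average.
import numpy as np

rng = np.random.default_rng(0)

def f(y):
    return (y - 2.0) ** 2                      # example integrand f(y)

def objective_mc(gamma, num_samples=1000):
    """Monte Carlo estimate of E_{q_gamma(y)}[f(y)] with q_gamma = Gamma(gamma, 1)."""
    y = rng.gamma(shape=gamma, scale=1.0, size=num_samples)   # y ~ q_gamma(y)
    return f(y).mean()

print(objective_mc(3.0))
```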
The REINFORCE gradient (Williams, 1992), although generally applicable (e.g., to continuous and discrete random variables), exhibits high variance under Monte Carlo (MC) estimation of the expectation, forcing one to apply additional variance-reduction techniques. The reparameterization trick (Rep) (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) works well, with as few as only one MC sample, but it is limited to continuous reparameterizable y. Many efforts have been devoted to improving these two formulations, as detailed in Section 6. However, none of these methods is characterized by both generalization (applicability to general distributions) and efficiency (working well with as few as one MC sample).

The key contributions of this work are based on the recognition that REINFORCE and Rep seek to solve the same objective, but in practice Rep yields lower-variance estimates, albeit for a narrower class of distributions. Recent work (Ranganath et al., 2016b) has made a connection between REINFORCE and Rep, recognizing that the former numerically estimates a term that the latter evaluates analytically. The high variance with which REINFORCE approximates this term manifests as high variance in the gradient estimate. Extending these ideas, we make the following main contributions. (i) We propose a new General and One-sample (GO) gradient in Section 3, which principally generalizes Rep to many non-reparameterizable distributions and justifies two recent methods (Figurnov et al., 2018; Jankowiak & Obermeyer, 2018); the "One sample" motivating the name GO is meant to highlight the low variance of the proposed method, although of course one may use more than one sample if desired. (ii) We find that the core of the GO gradient is something we term a variable-nabla, which can be interpreted as the gradient of a random variable wrt a parameter. (iii) Utilizing variable-nablas to propagate the chain rule through distributions, we broaden the applicability of the GO gradient in Sections 4-5 and present statistical back-propagation, a statistical generalization of classic back-propagation (Rumelhart & Hinton, 1986). Through this generalization, we may couple neural networks to general random variables, and compute the needed gradients with low variance.

## 2 BACKGROUND

To motivate this paper, we begin by briefly elucidating common machine learning problems for which there is a need to efficiently estimate gradients wrt γ of functions of the form $\mathbb{E}_{q_\gamma(y)}[f(y)]$.

Assume access to data samples $\{x_i\}_{i=1,N}$, drawn i.i.d. from the true (and unknown) underlying distribution q(x). We seek to learn a model $p_\theta(x)$ to approximate q(x). A classic approach to such learning is to maximize the expected log-likelihood $\hat{\theta} = \operatorname{argmax}_\theta \mathbb{E}_{q(x)}[\log p_\theta(x)]$, perhaps with an added regularization term on θ. The expectation $\mathbb{E}_{q(x)}(\cdot)$ is approximated via the available data samples, as $\hat{\theta} = \operatorname{argmax}_\theta \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)$. It is often convenient to employ a model with latent variables z, i.e., $p_\theta(x) = \int p_\theta(x, z)\,dz = \int p_\theta(x|z)\,p(z)\,dz$, with prior p(z) on z.
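To make the latent-variable construction concrete, the sketch below estimates the marginal likelihood by naively averaging $p_\theta(x|z)$ over prior samples; such estimates are typically poor, which is part of why the variational bound introduced next is preferred. It is illustrative only: the standard-normal prior, the Bernoulli likelihood, and all names such as `log_px_given_z` are assumptions, not the paper's.

```python
# Sketch only: naive Monte Carlo estimate of log p_theta(x) = log E_{p(z)}[p_theta(x|z)]
# for a toy model with a standard-normal prior and a Bernoulli likelihood whose
# logits come from a (hypothetical) linear decoder W. All names are illustrative.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
D, K, S = 10, 2, 5000            # data dimension, latent dimension, number of prior samples
W = rng.normal(size=(D, K))      # stand-in for the decoder parameters theta
x = rng.integers(0, 2, size=D)   # one binary observation

def log_px_given_z(x, z):
    """log p_theta(x | z): Bernoulli likelihood with logits W @ z."""
    logits = W @ z
    return np.sum(x * logits - np.logaddexp(0.0, logits))

z = rng.normal(size=(S, K))                      # z_s ~ p(z) = N(0, I)
log_w = np.array([log_px_given_z(x, z_s) for z_s in z])
log_px = logsumexp(log_w) - np.log(S)            # log (1/S) sum_s p_theta(x | z_s)
print(log_px)
```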
The integral wrt z is typically intractable, motivating introduction of the approximate posterior $q_\phi(z|x)$, with parameters φ. The well-known evidence lower bound (ELBO) (Jordan et al., 1999; Bishop, 2006) is defined as

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x, z) - \log q_\phi(z|x)] \qquad (1)$$
$$= \log p_\theta(x) - \mathrm{KL}[q_\phi(z|x)\,\|\,p_\theta(z|x)] \le \log p_\theta(x), \qquad (2)$$

where $p_\theta(z|x)$ is the true posterior, and $\mathrm{KL}(\cdot\|\cdot)$ represents the Kullback-Leibler divergence. Variational learning seeks $(\hat{\theta}, \hat{\phi}) = \operatorname{argmax}_{\theta,\phi} \sum_{i=1}^{N} \mathrm{ELBO}(\theta, \phi; x_i)$.

While computation of the ELBO has been considered for many years, a problem introduced more recently concerns adversarial learning of $p_\theta(x)$, or, more precisely, learning a model that allows one to efficiently and accurately draw samples $x \sim p_\theta(x)$ that are similar to $x \sim q(x)$. With generative adversarial networks (GANs) (Goodfellow et al., 2014), one seeks to solve

$$\min_\theta \max_\beta \; \mathbb{E}_{q(x)}[\log D_\beta(x)] + \mathbb{E}_{p_\theta(x)}[\log(1 - D_\beta(x))], \qquad (3)$$

where $D_\beta(x)$ is a discriminator with parameters β, quantifying the probability that x was drawn from q(x), with $1 - D_\beta(x)$ representing the probability that it was drawn from $p_\theta(x)$. There have been many recent extensions of GAN (Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018), but the basic setup in (3) holds for most.

To optimize (1) and (3), the most challenging gradients that must be computed are of the form $\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)]$; for (1), y = z and γ = φ, while for (3), y = x and γ = θ. The need to evaluate expressions like $\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)]$ arises in many other machine learning problems, and consequently it has generated much prior attention.

Evaluation of the gradient of the expectation is simplified if the gradient can be moved inside the expectation. REINFORCE (Williams, 1992) is based on the identity

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \mathbb{E}_{q_\gamma(y)}\big[ f(y)\,\nabla_\gamma \log q_\gamma(y) \big]. \qquad (4)$$

While simple in concept, this estimate is known to have high variance when the expectation $\mathbb{E}_{q_\gamma(y)}(\cdot)$ is approximated (as needed in practice) by a finite number of samples. An approach (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) that has attracted recent attention is termed the reparameterization trick, applicable when $q_\gamma(y)$ can be reparameterized as $y = \tau_\gamma(\epsilon)$, with $\epsilon \sim q(\epsilon)$, where $\tau_\gamma(\epsilon)$ is a differentiable transformation and $q(\epsilon)$ is a simple distribution that may be readily sampled. For consistency with the literature, we call a distribution reparameterizable if and only if these conditions are satisfied.² In this case we have

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \mathbb{E}_{q(\epsilon)}\Big[ [\nabla_\gamma \tau_\gamma(\epsilon)]\,[\nabla_y f(y)]\big|_{y=\tau_\gamma(\epsilon)} \Big]. \qquad (5)$$

This gradient, termed Rep, is typically characterized by relatively low variance when approximating $\mathbb{E}_{q(\epsilon)}(\cdot)$ with a small number of samples $\epsilon \sim q(\epsilon)$. This approach has been widely employed for computation of the ELBO and within GANs, but it limits one to models that satisfy the assumptions of Rep.

## 3 GO GRADIENT

The reparameterization trick (Rep) is limited to reparameterizable random variables y with continuous components. There are situations for which Rep is not readily applicable, e.g., where the components of y may be discrete, or nonnegative Gamma distributed. We seek to gain insights from the relationship between REINFORCE and Rep, and to generalize the types of random variables y for which the latter approach may be effected. We term our proposed approach a General and One-sample (GO) gradient. In practice, we find that this approach works well with as few as one sample for evaluating the expectation, and it is applicable to more general settings than Rep.
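Before deriving the GO gradient, it helps to see the variance gap between (4) and (5) in a toy one-dimensional case. The sketch below is illustrative only (the Gaussian $q_\mu$ and quadratic f are assumptions, not the paper's experiments); it computes single-sample REINFORCE and Rep estimates of $\nabla_\mu \mathbb{E}_{\mathcal{N}(y;\mu,1)}[y^2]$, whose exact value is $2\mu$.

```python
# Sketch only: single-sample REINFORCE (4) vs Rep (5) for the toy objective
# E_{N(y; mu, 1)}[y^2], whose exact gradient wrt mu is 2*mu. The distribution
# and integrand are illustrative assumptions, not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 1.5, 1.0, 100000

eps = rng.normal(size=trials)
y = mu + sigma * eps                               # y = tau_mu(eps) ~ N(mu, sigma^2)

grad_reinforce = (y ** 2) * (y - mu) / sigma ** 2  # f(y) * d/dmu log q_mu(y)
grad_rep = 2.0 * y                                 # [d tau_mu / d mu] * [d f / d y]

print("exact     :", 2.0 * mu)
print("REINFORCE : mean %.3f, std %.2f" % (grad_reinforce.mean(), grad_reinforce.std()))
print("Rep       : mean %.3f, std %.2f" % (grad_rep.mean(), grad_rep.std()))
```

Both estimators are unbiased here, but the single-sample REINFORCE estimates are far more dispersed, reflecting the term that Rep handles analytically.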
Recall that Rep was first applied within the context of variational learning (Kingma & Welling, 2014), as in (1). Specifically, it was assumed $q_\gamma(y) = \prod_v q_\gamma(y_v)$, omitting explicit dependence on data x for notational convenience; $y_v$ is component v of y. In Kingma & Welling (2014), $q_\gamma(y_v)$ corresponded to a Gaussian distribution $q_\gamma(y_v) = \mathcal{N}(y_v; \mu_v(\gamma), \sigma^2_v(\gamma))$, with mean $\mu_v(\gamma)$ and variance $\sigma^2_v(\gamma)$. In the following we generalize $q_\gamma(y_v)$ such that it need not be Gaussian. Applying integration by parts (Ranganath et al., 2016b),

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \int f(y)\,\nabla_\gamma q_\gamma(y_v)\, dy_v \Big] \qquad (6)$$
$$= \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \underbrace{f(y)\,\nabla_\gamma Q_\gamma(y_v)\Big|_{y_v}}_{0} - \underbrace{\int [\nabla_\gamma Q_\gamma(y_v)]\,[\nabla_{y_v} f(y)]\, dy_v}_{\text{Key}} \Big], \qquad (7)$$

where $y_{-v}$ denotes y with $y_v$ excluded, and $Q_\gamma(y_v)$ is the cumulative distribution function (CDF) of $q_\gamma(y_v)$. The 0 term is readily proven to be zero for any $Q_\gamma(y_v)$, under the assumption that f(y) does not tend to infinity faster than $\nabla_\gamma Q_\gamma(y_v)$ tends to zero as $y_v$ approaches the boundaries of its support. The Key term exactly recovers the one-dimensional Rep when a reparameterization $y_v = \tau_\gamma(\epsilon_v)$, $\epsilon_v \sim q(\epsilon_v)$ exists (Ranganath et al., 2016b). Further, applying $\nabla_\gamma q_\gamma(y_v) = q_\gamma(y_v)\,\nabla_\gamma \log q_\gamma(y_v)$ in (6) yields REINFORCE. Consequently, it appears that Rep achieves low variance by analytically setting to zero the unnecessary but high-variance-injecting 0 term, while in contrast REINFORCE implicitly seeks to numerically implement both terms in (7).

We next generalize $q_\gamma(y)$ for discrete $y_v$, here assuming $y_v \in \{0, 1, \ldots, \infty\}$. It is shown in Appendix A.2 that this framework is also applicable to discrete $y_v$ with a finite alphabet. It may be shown (see Appendix A.2) that

$$\nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \sum_{y_v} f(y)\,\nabla_\gamma q_\gamma(y_v) \Big]$$
$$= \sum_v \mathbb{E}_{q_\gamma(y_{-v})}\Big[ \underbrace{[f(y)\,\nabla_\gamma Q_\gamma(y_v)]\big|_{y_v=\infty}}_{0} - \underbrace{\sum_{y_v} [\nabla_\gamma Q_\gamma(y_v)]\,[f(y_{-v}, y_v + 1) - f(y)]}_{\text{Key}} \Big].$$

² A Bernoulli random variable y ∼ Bernoulli(P) is identical to y = 1ϵ
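The discrete Key term above already suggests a one-sample estimator: rewriting the sum over $y_v$ as an expectation under $q_\gamma(y_v)$ gives $-[\nabla_\gamma Q_\gamma(y_v)/q_\gamma(y_v)]\,[f(y_{-v}, y_v+1) - f(y)]$ evaluated at a single draw $y_v \sim q_\gamma(y_v)$. The sketch below is illustrative only, not the paper's code: the Poisson choice of $q_\gamma$, the quadratic f, and the finite-difference CDF gradient are all assumptions. It checks the estimator against the exact gradient $1 + 2\lambda$ of $\mathbb{E}_{\mathrm{Pois}(\lambda)}[y^2] = \lambda + \lambda^2$.

```python
# Sketch only (not the paper's implementation): the discrete Key term rewritten as
# an expectation under q_lambda(y) gives the one-sample estimator
#   grad ~= [-d/dlambda Q_lambda(y) / q_lambda(y)] * [f(y+1) - f(y)],  y ~ q_lambda,
# illustrated here with q_lambda = Poisson(lambda) and f(y) = y^2.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam, eps, trials = 3.0, 1e-4, 200000

def f(y):
    return y.astype(float) ** 2

y = rng.poisson(lam, size=trials)
# Finite-difference approximation of d/dlambda Q_lambda(y), the CDF gradient wrt lambda.
dQ = (poisson.cdf(y, lam + eps) - poisson.cdf(y, lam - eps)) / (2 * eps)
g = -dQ / poisson.pmf(y, lam)                  # per-sample "variable-nabla"
go_estimates = g * (f(y + 1) - f(y))           # one-sample GO estimates of the gradient

print("exact gradient:", 1 + 2 * lam)
print("GO estimate   : mean %.3f, std %.2f" % (go_estimates.mean(), go_estimates.std()))
```

For the Poisson case the CDF gradient satisfies $\nabla_\lambda Q_\lambda(y) = -q_\lambda(y)$, so each single-sample estimate reduces to $f(y+1) - f(y)$, which is unbiased for the exact gradient and has low variance compared with a score-function estimate of the same quantity.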