# Pathwise Derivatives Beyond the Reparameterization Trick

Martin Jankowiak* and Fritz Obermeyer* (*Equal contribution, Uber AI Labs, San Francisco, USA)

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Abstract

We observe that gradients computed via the reparameterization trick are in direct correspondence with solutions of the transport equation in the formalism of optimal transport. We use this perspective to compute (approximate) pathwise gradients for probability distributions not directly amenable to the reparameterization trick: Gamma, Beta, and Dirichlet. We further observe that when the reparameterization trick is applied to the Cholesky-factorized multivariate Normal distribution, the resulting gradients are suboptimal in the sense of optimal transport. We derive the optimal gradients and show that they have reduced variance in a Gaussian Process regression task. We demonstrate with a variety of synthetic experiments and stochastic variational inference tasks that our pathwise gradients are competitive with other methods.

1. Introduction

Maximizing objective functions via gradient methods is ubiquitous in machine learning. When the objective function \mathcal{L} is defined as an expectation of a (differentiable) test function f_\theta(z) w.r.t. a probability distribution q_\theta(z),

\mathcal{L} = \mathbb{E}_{q_\theta(z)}\big[ f_\theta(z) \big]   (1)

computing exact gradients w.r.t. the parameters θ is often infeasible, so that optimization methods must instead make do with stochastic gradient estimates. If the gradient estimator is unbiased, then stochastic gradient descent with an appropriately chosen sequence of step sizes can be shown to have nice convergence properties (Robbins & Monro, 1951). If, however, the gradient estimator exhibits large variance, stochastic optimization algorithms may be impractically slow. Thus it is of general interest to develop gradient estimators with reduced variance.

We revisit the class of gradient estimators popularized in (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014), which go under the name of the pathwise derivative or the reparameterization trick. While this class of gradient estimators is not applicable to all choices of probability distribution q_\theta(z), empirically it has been shown to yield suitably low variance in many cases of practical interest and thus has seen wide use.

We show that the pathwise derivative in the literature is in fact a particular instance of a continuous family of gradient estimators. Drawing a connection to tangent fields in the field of optimal transport (see (Villani, 2003; Ambrosio et al., 2008) for a review), we show that one can define a unique pathwise gradient that is optimal in the sense of optimal transport. For the purposes of this paper, we will refer to these optimal gradients as OMT (optimal mass transport) gradients.

The resulting geometric picture is particularly intriguing in the case of multivariate distributions, where each choice of gradient estimator specifies a velocity field on the sample space. To make this picture more concrete, in Figure 1 we show the velocity fields that correspond to two different gradient estimators for the off-diagonal element of the Cholesky factor parameterizing a bivariate Normal distribution. We note that the velocity field that corresponds to the reparameterization trick has a large rotational component that makes it suboptimal in the sense of optimal transport. In Sec. 7 we show that this suboptimality can result in reduced performance when fitting a Gaussian Process to data.
The rest of this paper is organized as follows. In Sec. 2 we provide a brief overview of stochastic gradient variational inference. In Sec. 3 we show how to compute pathwise gradients for univariate distributions. In Sec. 4 we expand our discussion of pathwise gradients to the case of multivariate distributions, introduce the connection to the transport equation, and provide an analytic formula for the OMT gradient in the case of the multivariate Normal. In Sec. 5 we discuss how we can compute high precision approximate pathwise gradients for the Gamma, Beta, and Dirichlet distributions. In Sec. 6 we place our work in the context of related research. In Sec. 7 we demonstrate the performance of our gradient estimators with a variety of synthetic experiments and experiments on real world datasets. Finally, in Sec. 8 we conclude with a discussion of directions for future work.

Figure 1. Velocity fields for a bivariate Normal distribution parameterized by a Cholesky factor L. The gradient is w.r.t. the off-diagonal element L_21. On the left we depict the velocity field corresponding to the reparameterization trick and on the right we depict the velocity field that is optimal in the sense of optimal transport. The solid black circle denotes the 1-σ covariance ellipse, with the gray ellipses denoting displaced covariance ellipses that result from small increases in L_21. Note that the ellipses evolve the same way under both velocity fields, but individual particles flow differently to effect the same global displacement of mass.

2. Stochastic Gradient Variational Inference

One area where stochastic gradient estimators play a particularly central role is stochastic variational inference (Hoffman et al., 2013). This is especially the case for black-box methods (Wingate & Weber, 2013; Ranganath et al., 2014), where conjugacy and other simplifying structural assumptions are unavailable, with the consequence that Monte Carlo estimators become necessary. For concreteness, we will refer to this class of methods as Stochastic Gradient Variational Inference (SGVI). In this section we give a brief overview of this line of research, as it serves as the motivating use case for our work. Furthermore, in Sec. 7 SGVI will serve as the main testbed for our proposed methods.

Let p(x, z) define a joint probability distribution over observed data x and latent random variables z. One of the main tasks in Bayesian inference is to compute the posterior distribution p(z|x) = p(x, z) / p(x). For many models of interest, this is an intractably hard problem and so approximate methods become necessary. Variational inference recasts Bayesian inference as an optimization problem. Specifically, we define a variational family of distributions q_\theta(z) parameterized by θ and seek to find a value of θ that minimizes the KL divergence between q_\theta(z) and the (unknown) posterior p(z|x). This is equivalent to maximizing the ELBO (Jordan et al., 1999), defined as

\mathrm{ELBO} = \mathbb{E}_{q_\theta(z)}\big[ \log p(x, z) - \log q_\theta(z) \big]   (2)

For general choices of p(x, z) and q_\theta(z), this expectation (much less its gradients) cannot be computed analytically. In these circumstances a natural approach is to build a Monte Carlo estimator of the ELBO and its gradient w.r.t. θ.
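To make Eqn. 2 concrete, below is a minimal single-sample Monte Carlo estimate of the ELBO for a toy model with a diagonal Normal variational distribution. This is our own illustration, not the paper's code; the model, data, and parameter names are all hypothetical.

```python
import torch
import torch.distributions as dist

# Toy model: z ~ Normal(0, 1), x_i ~ Normal(z, 1); a few observed data points.
x = torch.tensor([0.5, 1.2, -0.3])

# Variational parameters of a diagonal Normal guide q_theta(z).
loc = torch.zeros(1, requires_grad=True)
log_scale = torch.zeros(1, requires_grad=True)

def elbo_estimate():
    q = dist.Normal(loc, log_scale.exp())
    z = q.sample()  # single sample; estimating the value only, not yet its gradient
    log_p = dist.Normal(0., 1.).log_prob(z).sum() + dist.Normal(z, 1.).log_prob(x).sum()
    log_q = q.log_prob(z).sum()
    return log_p - log_q  # unbiased single-sample estimate of the ELBO

print(elbo_estimate())
```

How to differentiate such an estimate w.r.t. θ is exactly the question addressed by the two estimators reviewed next.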
The properties of the chosen gradient estimator (especially its bias and variance) play a critical role in determining the viability of the resulting stochastic optimization. Next, we review two commonly used gradient estimators; we leave a brief discussion of more elaborate variants to Sec. 6.

2.1. Score Function Estimator

The score function estimator, also referred to as the log-derivative trick or REINFORCE (Glynn, 1990; Williams, 1992), provides a simple and broadly applicable recipe for estimating ELBO gradients (Paisley et al., 2012). The score function estimator expresses the gradient as an expectation with respect to q_\theta(z), with the simplest variant given by

\nabla_\theta \mathrm{ELBO} = \mathbb{E}_{q_\theta(z)}\big[ \nabla_\theta \log r + \log r \, \nabla_\theta \log q_\theta(z) \big]   (3)

where log r = log p(x, z) - log q_\theta(z). Monte Carlo estimates of Eqn. 3 can be formed by drawing samples from q_\theta(z) and computing the term in the square brackets. Although the score function estimator is very general (e.g. it applies to discrete random variables), it typically suffers from high variance, although this can be mitigated with the use of variance reduction techniques such as Rao-Blackwellization (Casella & Robert, 1996) and control variates (Ross, 2006).

2.2. Pathwise Gradient Estimator

The pathwise gradient estimator, a.k.a. the reparameterization trick (RT), is not as broadly applicable as the score function estimator, but it generally exhibits lower variance (Price, 1958; Salimans et al., 2013; Kingma & Welling, 2013; Glasserman, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014). It is applicable to continuous random variables whose probability density q_\theta(z) can be reparameterized such that we can rewrite expectations as

\mathbb{E}_{q_\theta(z)}\big[ f_\theta(z) \big] = \mathbb{E}_{q_0(\epsilon)}\big[ f_\theta(T(\epsilon; \theta)) \big]   (4)

where q_0(\epsilon) is a fixed distribution with no dependence on θ and T(\epsilon; \theta) is a differentiable θ-dependent transformation. Since the expectation w.r.t. q_0(\epsilon) has no θ dependence, gradients w.r.t. θ can be computed by pushing \nabla_\theta inside the expectation. This reparameterization can be done for a number of distributions, including for example the Normal distribution. Unfortunately the reparameterization trick is non-trivial to apply to a number of commonly used distributions, e.g. the Gamma and Beta distributions, since the required shape transformations T(\epsilon; \theta) inevitably involve special functions.

3. Univariate Pathwise Gradients

Consider an objective function given as the expectation of a test function f_\theta(z) with respect to a distribution q_\theta(z), where z is a continuous one-dimensional random variable:

\mathcal{L} = \mathbb{E}_{q_\theta(z)}\big[ f_\theta(z) \big]   (5)

Here q_\theta(z) and f_\theta(z) are parameterized by θ, and we would like to compute (stochastic) gradients of \mathcal{L} w.r.t. θ, where θ denotes a scalar component of the parameter vector:

\nabla_\theta \mathcal{L} = \nabla_\theta\, \mathbb{E}_{q_\theta(z)}\big[ f_\theta(z) \big]   (6)

Crucially we would like to avoid the log-derivative trick, which yields a gradient estimator that tends to have high variance. Doing so will be easy if we can rewrite the expectation in terms of a fixed distribution that does not depend on θ. A natural choice is to use the standard uniform distribution U:

\mathcal{L} = \mathbb{E}_{U(u)}\big[ f_\theta(F_\theta^{-1}(u)) \big]   (7)

where the transformation F_\theta^{-1}: u \to z is the inverse CDF of q_\theta(z). As desired, all dependence on θ is now inside the expectation. Unfortunately, for many continuous univariate distributions of interest (e.g. the Gamma and Beta distributions) the transformation F_\theta^{-1} (as well as its derivative w.r.t. θ) does not admit a simple analytic expression.
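For distributions whose inverse CDF is available in closed form, Eqn. 7 can be used directly. As a sketch (ours, not the paper's), consider the Exponential distribution, whose inverse CDF is the standard fact F_\lambda^{-1}(u) = -\log(1-u)/\lambda:

```python
import torch

lam = torch.tensor(2.0, requires_grad=True)

# Pathwise samples z = F^{-1}_lambda(u) with u ~ Uniform(0, 1).
u = torch.rand(10000)
z = -torch.log1p(-u) / lam        # differentiable w.r.t. lam

# Monte Carlo estimate of d/d(lam) E[f(z)] for f(z) = z^2.
loss = (z ** 2).mean()
loss.backward()
# E[z^2] = 2 / lam^2, so the exact gradient is -4 / lam^3 = -0.5 at lam = 2.
print(lam.grad)
```

The point of the next derivation is to obtain the same kind of pathwise derivative even when F_\theta^{-1} has no such closed form.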
Fortunately, by making use of implicit differentiation we can compute the gradient in Eqn. 6 without explicitly introducing F_\theta^{-1}. To complete the derivation, define u by

u \equiv F_\theta(z) = \int_{-\infty}^{z} q_\theta(z')\, dz'   (8)

and differentiate both sides of Eqn. 8 w.r.t. θ, making use of the fact that u ~ U does not depend on θ, to obtain

0 = \frac{dz}{d\theta}\, q_\theta(z) + \int_{-\infty}^{z} \nabla_\theta\, q_\theta(z')\, dz'   (9)

This then yields our master formula for the univariate case,

\frac{dz}{d\theta} = -\frac{\nabla_\theta F_\theta(z)}{q_\theta(z)}   (10)

where the corresponding gradient estimator is given by

\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\Big[ \frac{\partial f_\theta(z)}{\partial z}\, \frac{dz}{d\theta} + \nabla_\theta f_\theta(z) \Big]   (11)

While this derivation is elementary, it helps to clarify things: the key ingredient needed to compute pathwise gradients in Eqn. 6 is the ability to compute (or approximate) the derivative of the CDF, i.e. \nabla_\theta F_\theta(z). In the supplementary materials we verify that Eqn. 11 results in correct gradients.

It is worth emphasizing how this approach differs from a closely related alternative. Suppose we construct a (differentiable) approximation of the inverse CDF, \hat{F}_\theta^{-1}(u) \approx F_\theta^{-1}(u). For example, we might train a neural network nn(u, θ) \approx F_\theta^{-1}(u). We can then push samples u ~ U through nn(u, θ) and obtain approximate samples from q_\theta(z) as well as approximate derivatives dz/dθ via the chain rule; in this case, there will be a mismatch between the probability q_\theta(z) assigned to samples z and the actual distribution over z. By contrast, if we use the construction of Eqn. 10, our samples z will still be exact (or rather their exactness will be determined by the quality of our sampler for q_\theta(z), which is fully decoupled from how we compute derivatives dz/dθ) and the fidelity of our approximation of (the derivatives of) F_\theta(z) will only affect the accuracy of our approximation for dz/dθ.

4. Multivariate Pathwise Gradients

In the previous section we focused on continuous univariate distributions. Pathwise gradients can also be constructed for continuous multivariate distributions, although the analysis is in general expected to be much more complicated than in the univariate case, directly analogous to the difference between ordinary and partial differential equations. Before constructing estimators for particular distributions, we introduce the connection to the transport equation.

4.1. The Transport Equation

Consider a multivariate distribution q_\theta(z) in D dimensions and consider differentiating \mathbb{E}_{q_\theta(z)}[f(z)] with respect to the parameter θ. (Here, without loss of generality, we assume that f(z) has no dependence on θ, since computing \mathbb{E}_{q_\theta(z)}[\nabla_\theta f_\theta(z)] presents no difficulty; the difficulty stems from the dependence on θ in q_\theta(z).) As we vary θ we move q_\theta(z) along a curve in the space of distributions over the sample space. Alternatively, we can think of each distribution as a cloud of particles; as we vary θ from θ to θ + Δθ each particle undergoes an infinitesimal displacement dz. Any set of displacements that ensures that the displaced particles are distributed according to the displaced distribution q_{\theta+\Delta\theta}(z) is allowed. This intuitive picture can be formalized with the transport (a.k.a. continuity) equation (we refer the reader to (Villani, 2003) and (Ambrosio et al., 2008) for details):

\nabla_\theta\, q_\theta + \nabla_z \cdot \big( q_\theta\, v^\theta \big) = 0   (12)

Here the velocity field v^\theta is a vector field defined on the sample space that displaces samples (i.e. particles) z as we vary θ infinitesimally. Note that there is a velocity field v^\theta for each component θ of the parameter vector. This equation is readily interpreted in the language of fluid dynamics. In order for the total probability to be conserved, the term \nabla_\theta q_\theta(z), which is the rate of change of the number of particles in the infinitesimal volume element at z, has to be counterbalanced by the in/out-flow of particles as given by the divergence term.
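As a quick sanity check (ours, not taken from the paper), take q_\sigma(z) = \mathcal{N}(z \mid \mu, \sigma^2) with θ = σ. Using \nabla_\sigma F_\sigma(z) = -\tfrac{z-\mu}{\sigma}\, q_\sigma(z), Eqn. 10 gives

\frac{dz}{d\sigma} = -\frac{\nabla_\sigma F_\sigma(z)}{q_\sigma(z)} = \frac{z-\mu}{\sigma}, \qquad \text{i.e.} \quad v^{\sigma}(z) = \frac{z-\mu}{\sigma},

which recovers the familiar reparameterization z = \mu + \sigma\epsilon. A short calculation confirms that \nabla_\sigma q_\sigma + \partial_z (q_\sigma v^{\sigma}) = 0, i.e. this velocity field satisfies the one-dimensional form of Eqn. 12.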
4.2. Gradient Estimator

Given a solution to Eqn. 12, we can form the gradient estimator

\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\big[ v^\theta \cdot \nabla_z f \big]   (13)

which generalizes Eqn. 11 to the multivariate case. That this is an unbiased gradient estimator follows directly from the divergence theorem (see the supplementary materials).

4.3. Tangent Fields

In general Eqn. 12 admits an infinite dimensional space of solutions. In the context of our derivation of Eqn. 10, we might loosely say that different solutions of Eqn. 12 correspond to different ways of specifying quantiles of q_\theta(z). To determine a unique solution, the tangent field from the theory of optimal transport (we refer the reader to Ch. 8 of (Ambrosio et al., 2008) for details), we require that

\frac{\partial v^{\rm OMT}_i}{\partial z_j} = \frac{\partial v^{\rm OMT}_j}{\partial z_i} \quad \forall\, i, j   (14)

In this case it can be shown that v^{\rm OMT} minimizes the total kinetic energy, which is given by

\int dz\; q_\theta(z)\, \|v\|^2   (15)

Note that the univariate solution, Eqn. 10, is automatically the OMT solution.

4.4. Gradient Variance

The \|v\|^2 term that appears in Eqn. 15 might lead one to hope that v^{\rm OMT} provides gradients that minimize gradient variance. Unfortunately, the situation is more complicated. Denoting the (mean) gradient by g = \mathbb{E}_{q_\theta(z)}[v \cdot \nabla_z f(z)], the total gradient variance is given by

\mathbb{E}_{q_\theta(z)}\big[ \| v \cdot \nabla_z f \|^2 \big] - \|g\|^2   (16)

Since g is the same for all unbiased gradient estimators, the gradient estimator that minimizes the total variance is the one that minimizes the first term in Eqn. 16. For test functions f(z) that approximately satisfy \nabla_z f \approx 1 over the bulk of the support of q_\theta(z), the first term in Eqn. 16 is approximately proportional to the kinetic energy. In this case the OMT gradient estimator will be (nearly) optimal. Note that the kinetic energy weighs contributions from different components of v equally, whereas g scales different components of v with \nabla_z f. Thus we can think of the OMT gradient estimator as a good choice for generic choices of f(z) that are relatively flat and isotropic (or, alternatively, for choices of f(z) where we have little a priori knowledge about the detailed structure of \nabla_z f). So for any particular choice of a generic f(z) there will be some gradient estimator that has lower variance than the OMT gradient estimator. Still, for many choices of f(z) we expect the OMT gradient estimator to have lower variance than the RT gradient estimator, since the latter has no particular optimality guarantees (at least not in any coordinate system that we expect to be well adapted to f(z)).

4.5. The Multivariate Normal

In the case of a (zero mean) multivariate Normal distribution parameterized by a Cholesky factor L via z = Lε, where ε is white noise, the reparameterization trick yields the following velocity field for the element L_ab (note that the reparameterization trick already yields the OMT gradient for the location parameter µ):

v^{\rm RT}_i = \frac{\partial z_i}{\partial L_{ab}} = \delta_{ia}\, (L^{-1} z)_b   (17)

Note that Eqn. 17 is just a particular instance of the solution to the transport equation that is implicitly provided by the reparameterization trick, namely

v^\theta = \frac{\partial T(\epsilon; \theta)}{\partial \theta}\Big|_{\epsilon = T^{-1}(z; \theta)}   (18)

In the supplementary materials we verify that Eqn. 17 satisfies the transport equation Eqn. 12. However, it is evidently not optimal in the sense of optimal transport, since \partial v^{\rm RT}_i / \partial z_j = \delta_{ia} (L^{-1})_{bj} is not symmetric in i and j.
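This asymmetry is easy to see numerically. The following sketch (ours, with an arbitrary 2x2 Cholesky factor) builds the RT velocity field of Eqn. 17 for the element L_21 and inspects its Jacobian:

```python
import torch

# Arbitrary lower-triangular Cholesky factor (illustrative values only).
L = torch.tensor([[1.0, 0.0],
                  [0.8, 0.5]])
Linv = torch.inverse(L)

def v_rt(z):
    # RT velocity field for L_21 (Eqn. 17 with a=2, b=1):
    # only the second component is nonzero, v_2(z) = (L^{-1} z)_1.
    return torch.stack([0.0 * z[0], (Linv @ z)[0]])

z0 = torch.tensor([0.3, -1.1])
J = torch.autograd.functional.jacobian(v_rt, z0)
print(J)   # J[1, 0] = (L^{-1})_{11} = 1.0 while J[0, 1] = 0: not symmetric
```

The antisymmetric part of this Jacobian is exactly the rotational component visible on the left of Figure 1.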
In fact the tangent field takes the form

v^{\rm OMT}_i = \tfrac{1}{2}\Big[ \delta_{ia}\, (L^{-1} z)_b + z_a\, (L^{-1})_{bi} \Big] + (S^{ab} z)_i   (19)

where S^{ab} is a symmetric matrix whose precise form we give in the supplementary materials. We note that computing gradients with Eqn. 19 is O(D^3), since it involves a singular value decomposition of the covariance matrix. In Sec. 7 we show that the resulting gradient estimator can lead to reduced variance.

5. Numerical Recipes

In this section we show how Eqn. 10 can be used to obtain pathwise gradients in practice. In many cases of interest we will need to derive approximations to \nabla_\theta F_\theta(z) that balance the need for high accuracy (thus yielding gradient estimates with negligible bias) with the need for computational efficiency. In particular we will derive accurate approximations to Eqn. 10 for the Gamma, Beta, and Dirichlet distributions. These approximations will involve three basic components:

1. Elementary Taylor expansions
2. The Lugannani-Rice saddlepoint expansion (Lugannani & Rice, 1980; Butler, 2007)
3. Rational polynomial approximations in regions of (z, θ) that are analytically intractable

5.1. Gamma

The CDF of the Gamma distribution involves the (lower) incomplete gamma function γ(·, ·): F_{\alpha,\beta}(z) = \gamma(\alpha, \beta z) / \Gamma(\alpha). Unfortunately γ(·, ·) does not admit simple analytic expressions for derivatives w.r.t. its first argument, and so we must resort to numerical approximations. Since z ~ Gamma(α, β = 1) implies z/β ~ Gamma(α, β), it is sufficient to consider dz/dα for the standard Gamma distribution with β = 1. To give a flavor for the kinds of approximations we use, consider how we can approximate \nabla_\alpha \gamma(\alpha, z) in the limit z ≪ 1. We simply do a Taylor series in powers of z:

\gamma(\alpha, z) = \int_0^z (z')^{\alpha-1} e^{-z'}\, dz' = \int_0^z (z')^{\alpha-1}\big(1 - z' + \tfrac{1}{2} z'^2 - \dots\big)\, dz' = z^\alpha \Big( \frac{1}{\alpha} - \frac{z}{\alpha+1} + \frac{z^2}{2(\alpha+2)} - \dots \Big)

which can then be differentiated w.r.t. α term by term. In practice we use 6 terms in this expansion, which is accurate for z < 0.8. Details for the remaining approximations can be found in the supplementary materials.

5.2. Beta

The CDF of the Beta distribution, F^{\rm Beta}, is the (regularized) incomplete beta function; just like in the case of the Gamma distribution, its derivatives do not admit simple analytic expressions. We describe the numerical approximations we used in the supplementary materials.

5.3. Dirichlet

Let z ~ Dir(α) be Dirichlet distributed with n components. Noting that the z_i are constrained to lie within the unit (n-1)-simplex, we proceed by representing z in terms of n-1 mutually independent Beta variates (Wilks, 1962):

\tilde{z}_i \sim \mathrm{Beta}\Big(\alpha_i, \textstyle\sum_{j=i+1}^{n} \alpha_j\Big) \quad \text{for } i = 1, \dots, n-1
z_1 = \tilde{z}_1, \qquad z_i = \tilde{z}_i \prod_{j=1}^{i-1} (1 - \tilde{z}_j) \ \text{ for } i = 2, \dots, n-1, \qquad z_n = \prod_{j=1}^{n-1} (1 - \tilde{z}_j)

Without loss of generality, we will compute dz_i/dα_1 for i = 1, ..., n. Crucially, the only dependence on α_1 in this representation is through \tilde{z}_1. We find:

\frac{dz}{d\alpha_1} = -\frac{\nabla_{\alpha_1} F^{\rm Beta}(\tilde{z}_1 \mid \alpha_1, \alpha_{\rm tot} - \alpha_1)}{\mathrm{Beta}(\tilde{z}_1 \mid \alpha_1, \alpha_{\rm tot} - \alpha_1)} \left( 1,\; \frac{-z_2}{1 - z_1},\; \dots,\; \frac{-z_n}{1 - z_1} \right)   (20)

where \alpha_{\rm tot} \equiv \sum_j \alpha_j. Note that Eqn. 20 implies that \tfrac{d}{d\alpha_1} \sum_i z_i = 0, as it must because of the simplex constraint. Since we have already developed an approximation for the derivatives of F^{\rm Beta}, Eqn. 20 provides a complete recipe for pathwise Dirichlet gradients. Note that although we have used a stick-breaking construction to derive Eqn. 20, this in no way dictates the sampling scheme we use when generating z ~ Dir(α). In the supplementary materials we verify that Eqn. 20 satisfies the transport equation.

5.4. Implementation

It is worth emphasizing that pathwise gradient estimators of the form in Eqn. 13 have the advantage of being plug-and-play. We simply plug an approximate or exact velocity field into our favorite automatic differentiation engine (our approximations for pathwise gradients for the Gamma, Beta, and Dirichlet distributions are available in the 0.4 release of PyTorch (Paszke et al., 2017)) so that samples z and f_\theta(z) are differentiable w.r.t. θ. There is no need to construct a surrogate objective function to form the gradient estimator.
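For instance, with PyTorch 0.4 or later the pathwise derivative of a Dirichlet sample requires nothing more than rsample(); the test function and parameter values below are our own illustration:

```python
import torch
import torch.distributions as dist

alpha = torch.tensor([1.5, 2.0, 0.8], requires_grad=True)

# Reparameterized (pathwise) sample: gradients flow from z back to alpha
# via the CDF-derivative machinery described in this section.
z = dist.Dirichlet(alpha).rsample()

# Any differentiable test function of z works; no surrogate objective is needed.
f = (z * torch.log(z)).sum()   # e.g. negative entropy of the sampled point
f.backward()
print(alpha.grad)              # single-sample pathwise gradient estimate
```

The same pattern applies to torch.distributions.Gamma and Beta, which is what makes these estimators usable as drop-in components inside larger stochastic computation graphs.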
6. Related Work

A number of lines of research bear upon our work. There is a large body of work on constructing gradient estimators with reduced variance, much of which can be understood in terms of control variates (Ross, 2006): for example, (Mnih & Gregor, 2014) construct neural baselines for score-function gradients; (Schulman et al., 2015) discuss gradient estimators for stochastic computation graphs and their Rao-Blackwellization; and (Tucker et al., 2017; Grathwohl et al., 2017) construct adaptive control variates for discrete random variables. Another example of this line of work is reference (Miller et al., 2017), where the authors construct control variates that are applicable when q_\theta(z) is a diagonal Normal distribution. While our OMT gradient for the multivariate Normal distribution, Eqn. 19, can also be understood in the language of control variates (see Sec. 8 and the supplementary materials for a brief discussion), (Miller et al., 2017) relies on Taylor expansions of the test function f_\theta(z). (In addition, note that in their approach variance reduction for gradients w.r.t. the scale parameter σ necessitates a multi-sample estimator, at least for high-dimensional models where computing the diagonal of the Hessian is prohibitively expensive.)

In (Graves, 2016), the author derives formula Eqn. 10 and uses it to construct gradient estimators for mixture distributions. Unfortunately, the resulting gradient estimator is expensive, relying on a recursive computation that scales with the dimension of the sample space.

Figure 2. Derivatives dz/dα for samples z ~ Beta(1, 1), plotted against log(z / (1 - z)). We compare the OMT gradient to the gradient that is obtained when samples z ~ Beta(α, β) are represented as the ratio of two Gamma variates (each with its own pathwise derivative). The OMT derivative has a deterministic value for each sample z, whereas the Gamma representation induces a higher variance stochastic derivative due to the presence of an auxiliary random variable.

Another line of work constructs partially reparameterized gradient estimators for cases where the reparameterization trick is difficult to apply. The generalized reparameterization gradient (G-REP) (Ruiz et al., 2016) uses standardization via sufficient statistics to obtain a transformation T(ε; θ) that minimizes the dependence of q(ε) on θ. This results in a partially reparameterized gradient estimator that also includes a score function-like term, that is, a term in the gradient estimator that is proportional to the test function f_\theta(z). In RSVI (Naesseth et al., 2017) the authors consider gradient estimators in the case that q_\theta(z) can be sampled from efficiently via rejection sampling. This results in a gradient estimator with the same generic structure as G-REP, although in the case of RSVI the score function-like term can often be dropped in practice at the cost of small bias (with the benefit of reduced variance). Besides the fact that this gradient estimator is not fully pathwise, one key difference with our approach is that for many distributions of interest (e.g. the Beta and Dirichlet distributions), rejection sampling introduces auxiliary random variables, which results in additional stochasticity and thus higher variance (cf. Figure 2). In contrast, our pathwise gradients for the Beta and Dirichlet distributions are deterministic for a given z and θ.
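Schematically (our paraphrase of the decomposition used in (Ruiz et al., 2016; Naesseth et al., 2017)), these partially reparameterized estimators split the gradient into a reparameterized term and a score function-like correction,

\nabla_\theta \mathbb{E}_{q_\theta(z)}[f(z)] = \underbrace{\mathbb{E}_{q(\epsilon;\theta)}\big[ \nabla_z f(T(\epsilon;\theta))\, \nabla_\theta T(\epsilon;\theta) \big]}_{g_{\rm rep}} + \underbrace{\mathbb{E}_{q(\epsilon;\theta)}\big[ f(T(\epsilon;\theta))\, \nabla_\theta \log q(\epsilon;\theta) \big]}_{g_{\rm cor}},

where q(ε; θ) is the distribution of the (partially standardized or accepted) noise ε. RSVI drops or shrinks g_cor at the cost of a small bias, whereas the estimators in this paper consist of the pathwise term alone.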
Finally, (Knowles, 2015) uses (somewhat imprecise) approximations to the inverse CDF to derive gradient estimators for Gamma random variables.

As the final version of this manuscript was being prepared, we became aware of (Figurnov et al., 2018), which has some overlap with this work. In particular, (Figurnov et al., 2018) derives Eqn. 10 and an interesting generalization to the multivariate case. This allows the authors to construct pathwise derivatives for the Gamma, Beta, and Dirichlet distributions. For the latter two distributions, however, the derivatives include additional stochasticity that our pathwise derivatives avoid. Also, the authors do not draw the connection to the transport equation and optimal transport or consider the multivariate Normal distribution in any detail.

Figure 3. We compare the OMT gradient to the RSVI gradient with B = 4 for the test function f(z) = z^3 and q_\theta(z) = Beta(z|α, α). In the bottom panel we depict finite-sample bias for 25 million samples (this also includes effects from finite numerical precision).

7. Experiments

All experiments in this section use single-sample gradient estimators.

7.1. Synthetic Experiments

In this section we validate our pathwise gradients for the Beta, Dirichlet, and multivariate Normal distributions. Where appropriate we compare to the RT gradient, the score function gradient, or RSVI.

7.1.1. BETA DISTRIBUTION

In Fig. 3 we compare the performance of our OMT gradient for Beta random variables to the RSVI gradient estimator. We use a test function f(z) = z^3 for which we can compute the gradient exactly. We see that the OMT gradient performs favorably over the entire range of the parameter α that defines the distribution Beta(α, α) used to compute \mathcal{L}. For smaller α, where \mathcal{L} exhibits larger curvature, the variance of the estimator is noticeably reduced. Notice that one reason for the reduced variance of the OMT estimator as compared to the RSVI estimator is the presence of an auxiliary random variable in the latter case (cf. Figure 2).

7.1.2. DIRICHLET DISTRIBUTION

In Fig. 4 we compare the variance of our pathwise gradient for the Dirichlet distribution to the RSVI gradient estimator. We compute stochastic gradients of the ELBO for a Multinomial-Dirichlet model initialized at the exact posterior (where the exact gradient is zero). The Dirichlet distribution has 1995 components, and the single data point is a bag of words from a natural language document. We see that the pathwise gradient performs favorably over the entire range of the model hyperparameter α_0 considered. Note that as we crank up the shape augmentation setting B, the RSVI variance approaches that of the pathwise gradient. (As discussed in Sec. 6, the variance of the RSVI gradient estimator can also be reduced by dropping the score function-like term, at the cost of some bias.)
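A self-contained version of this kind of variance measurement for the pathwise gradient (our sketch; the vocabulary size and counts here are made up, in place of the 1995-component real document used above) initializes q at the exact conjugate posterior and records the spread of single-sample ELBO gradients:

```python
import torch
import torch.distributions as dist

alpha0 = torch.full((10,), 1.5)                                    # Dirichlet prior
counts = torch.tensor([4., 0., 2., 1., 0., 7., 3., 0., 1., 2.])    # bag-of-words data

# Conjugacy: the exact posterior is Dirichlet(alpha0 + counts).
alpha_q = (alpha0 + counts).clone().requires_grad_(True)

def single_sample_elbo_grad():
    alpha_q.grad = None
    q = dist.Dirichlet(alpha_q)
    z = q.rsample()                                  # pathwise sample of topic weights
    log_p = dist.Dirichlet(alpha0).log_prob(z) \
        + dist.Multinomial(int(counts.sum()), probs=z).log_prob(counts)
    elbo = log_p - q.log_prob(z)
    elbo.backward()
    return alpha_q.grad.clone()

grads = torch.stack([single_sample_elbo_grad() for _ in range(1000)])
print(grads.mean(0))   # approximately zero at the exact posterior
print(grads.var(0))    # per-component single-sample gradient variance
```

Figure 4. Gradient variance for the ELBO of a conjugate Multinomial-Dirichlet model. We compare the pathwise gradient to RSVI for different boosts B. See Sec. 7.1.2 for details.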
Figure 5. We compare the OMT gradient estimator for the multivariate Normal distribution to the RT estimator for three test functions (cosine, quadratic, and quartic). The horizontal axis controls the magnitude of the off-diagonal elements of the Cholesky factor L. The vertical axis depicts the ratio of the mean variance of the OMT estimator to that of the RT estimator for the off-diagonal elements of L.

7.1.3. MULTIVARIATE NORMAL

In Fig. 5 we use synthetic test functions to illustrate the amount of variance reduction that can be achieved with the OMT gradient estimator for the multivariate Normal distribution. The dimension is D = 50; the results are qualitatively similar for different dimensions.

7.2. Real World Datasets

In this section we investigate the performance of our gradient estimators for the Gamma, Beta, and multivariate Normal distributions in two variational inference tasks on real world datasets. Note that we include an additional experiment for the multivariate Normal distribution in Sec. 10 of the supplementary materials. All the experiments in this section were implemented in the Pyro probabilistic programming language (http://pyro.ai).

7.2.1. SPARSE GAMMA DEF

The Sparse Gamma DEF (Ranganath et al., 2015) is a probabilistic model with multiple layers of local latent random variables z^{(ℓ)}_{nk} and global random weights w^{(ℓ)}_{kk'} that mimics the architecture of a deep neural network. Here each n corresponds to an observed data point x_n, ℓ indexes the layer, and k and k' run over the latent components. We consider Poisson-distributed observations x_{nd} for each dimension d. Concretely, the model is specified as follows (this experiment closely follows the setup in (Ruiz et al., 2016) and (Naesseth et al., 2017)):

z^{(\ell)}_{nk} \sim \mathrm{Gamma}\Big( \alpha_z,\; \alpha_z \big/ \textstyle\sum_{k'} z^{(\ell+1)}_{nk'} w^{(\ell)}_{k'k} \Big) \quad \ell = 1, \dots, L-1
z^{(L)}_{nk} \sim \mathrm{Gamma}(\alpha_z, \alpha_z)
x_{nd} \sim \mathrm{Poisson}\Big( \textstyle\sum_{k} z^{(1)}_{nk} w^{(0)}_{kd} \Big)

We set α_z = 0.1 and use L = 3 layers with 100, 40, and 15 latent factors per data point (for ℓ = 1, 2, 3, respectively). We consider two model variants that differ in the prior placed on the weights. In the first variant we place Gamma priors over the weights with α = 0.3 and β = 0.1. In the second variant we place Beta prime (β′) priors over the weights with the same means and variances as in the first variant. (If z ~ Beta(α, β) then z/(1-z) ~ β′(α, β); thus, like the Gamma distribution, the Beta prime distribution has support on the positive real line.) The dataset we consider is the Olivetti faces dataset (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), which consists of 64 × 64 grayscale images of human faces.
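For concreteness, here is a toy-sized generative sketch of the first model variant (our own illustration; the mean-parameterization of the Gamma layers follows the Sparse Gamma DEF construction of (Ranganath et al., 2015) as written above, the weight priors use the stated α = 0.3, β = 0.1, and the number of data points is arbitrary):

```python
import torch
import torch.distributions as dist

alpha_z, alpha_w, beta_w = 0.1, 0.3, 0.1
N, D = 8, 64 * 64                       # a few data points, Olivetti-sized pixel count
sizes = [100, 40, 15]                   # latent factors per layer (l = 1, 2, 3)

# Global weights: w^(0): 100 x D, w^(1): 40 x 100, w^(2): 15 x 40.
w = [dist.Gamma(alpha_w, beta_w).sample((sizes[0], D)),
     dist.Gamma(alpha_w, beta_w).sample((sizes[1], sizes[0])),
     dist.Gamma(alpha_w, beta_w).sample((sizes[2], sizes[1]))]

# Top layer, then mean-parameterized Gamma layers down to the Poisson likelihood.
z3 = dist.Gamma(alpha_z, alpha_z).sample((N, sizes[2]))
z2 = dist.Gamma(alpha_z, alpha_z / (z3 @ w[2])).sample()
z1 = dist.Gamma(alpha_z, alpha_z / (z2 @ w[1])).sample()
x = dist.Poisson(z1 @ w[0]).sample()
print(x.shape)                           # (N, D) count-valued observations
```

In the actual experiment the Gamma (and Beta) latent variables in the variational family are where the pathwise gradients of Sec. 5 come into play.

In Fig. 6 we depict how the training set ELBO increases during the course of optimization. We find that on this task the performance of the OMT gradient estimator is nearly identical to RSVI. (Note that we do not compare to any alternative estimators such as G-REP, since (Naesseth et al., 2017) shows that RSVI has superior performance on this task.) Figure 6 suggests that gradient variance is not the limiting factor for this particular task and dataset.

Figure 6. ELBO during training for two variants of the Sparse Gamma DEF, one with and one without Beta random variables.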
We compare the OMT gradient to RSVI. At each iteration we depict a multi-sample estimate of the ELBO with N = 100 samples.

7.2.2. GAUSSIAN PROCESS REGRESSION

In this section we investigate the performance of our OMT gradient for the multivariate Normal distribution, Eqn. 19, in the context of a Gaussian Process regression task. We model the Mauna Loa CO2 data from (Keeling & Whorf, 2004) considered in (Rasmussen, 2004). We use a structured kernel that accommodates a long term linear trend as well as a periodic component. We fit the GP using a single-sample Monte Carlo ELBO gradient estimator and all D = 468 data points. The variational family is a multivariate Normal distribution with a Cholesky parameterization for the covariance matrix.

Progress on the ELBO during the course of training is depicted in Fig. 7. We can see that the OMT gradient estimator has superior sample efficiency due to its lower variance. By iteration 270 the OMT gradient estimator has attained the same ELBO that the RT estimator attains at iteration 500. Since each iteration of the OMT estimator is 1.9x slower than the corresponding RT iteration, the superior sample efficiency of the OMT estimator is largely canceled when judged by wall clock time. Nevertheless, the lower variance of the OMT estimator results in a higher ELBO than that obtained by the RT estimator.

Figure 7. ELBO during training for the Gaussian Process regression task in Sec. 7.2.2, plotted against both iteration and wall clock time. At each iteration we depict a multi-sample estimate of the ELBO with N = 100 samples. We compare the OMT gradient estimator to the RT estimator.

8. Discussion and Future Work

We have seen that optimal transport offers a fruitful perspective on pathwise gradients. On the one hand it has helped us formulate pathwise gradients in situations where this was assumed to be impractical. On the other hand it has focused our attention on a particular notion of optimality, which led us to develop a new gradient estimator for the multivariate Normal distribution. A better understanding of this notion of optimality and, more broadly, a better understanding of when pathwise gradients are preferable over score function gradients (or vice versa) would be useful in guiding the practical application of these methods.

Since each solution of the transport equation Eqn. 12 yields an unbiased gradient estimator, the difference between any two such estimators can be thought of as a control variate. In the case of the multivariate Normal distribution, where computing the OMT gradient has a cost O(D^3), an attractive alternative to using v^{OMT} is to adaptively choose v during the course of optimization, in direct analogy to adaptive control variate techniques. In future work we will explore this approach in detail, which promises lower variance than the OMT estimator at reduced computational cost.

The geometric picture from optimal transport (and thus the potential for non-trivial derivative applications) is especially rich for multivariate distributions. Here we have explored the multivariate Normal and Dirichlet distributions in some detail, but this just scratches the surface of multivariate distributions. It would be of general interest to develop pathwise gradients for a broader class of multivariate distributions, including for example mixture distributions.
Rich distributions with low variance gradient estimators are of special interest in the context of SGVI, where the need to approximate complex posteriors demands rich families of distributions that lend themselves to stochastic optimization. In future work we intend to explore this connection further.

Acknowledgements

We thank Peter Dayan and Zoubin Ghahramani for feedback on a draft manuscript and other colleagues at Uber AI Labs, especially Noah Goodman and Theofanis Karaletsos, for stimulating conversations during the course of this work. We also thank Christian Naesseth for clarifying details of the experimental setup for the deep exponential family experiment in (Naesseth et al., 2017).

References

Ambrosio, Luigi, Gigli, Nicola, and Savaré, Giuseppe. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.

Butler, Ronald W. Saddlepoint Approximations with Applications, volume 22. Cambridge University Press, 2007.

Casella, George and Robert, Christian P. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.

Figurnov, Michael, Mohamed, Shakir, and Mnih, Andriy. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.

Glasserman, Paul. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2013.

Glynn, Peter W. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.

Grathwohl, Will, Choi, Dami, Wu, Yuhuai, Roeder, Geoff, and Duvenaud, David. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.

Graves, Alex. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Keeling, Charles David and Whorf, Timothy P. Atmospheric CO2 concentrations derived from flask air samples at sites in the SIO network. Trends: A Compendium of Data on Global Change, 2004.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Knowles, David A. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.

Lugannani, Robert and Rice, Stephen. Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability, 12(2):475–490, 1980.

Miller, Andrew, Foti, Nick, D'Amour, Alexander, and Adams, Ryan P. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pp. 3711–3721, 2017.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Naesseth, Christian, Ruiz, Francisco, Linderman, Scott, and Blei, David. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics, pp. 489–498, 2017.

Paisley, John, Blei, David, and Jordan, Michael. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in PyTorch. 2017.

Price, Robert. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David. Deep exponential families. In Artificial Intelligence and Statistics, pp. 762–771, 2015.

Rasmussen, Carl Edward. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pp. 63–71. Springer, 2004.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Ross, Sheldon M. Simulation. Academic Press, San Diego, 2006.

Ruiz, Francisco R., Titsias, Michalis, and Blei, David. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pp. 460–468, 2016.

Salimans, Tim, Knowles, David A., et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.

Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pp. 1971–1979, 2014.

Tucker, George, Mnih, Andriy, Maddison, Chris J, Lawson, John, and Sohl-Dickstein, Jascha. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2624–2633, 2017.

Villani, Cédric. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.

Wilks, S. S. Mathematical Statistics. John Wiley and Sons Inc., 1962.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wingate, David and Weber, Theophane. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.