# reparameterization_gradient_for_nondifferentiable_models__95ea8084.pdf Reparameterization Gradient for Non-differentiable Models Wonyeol Lee Hangyeol Yu Hongseok Yang School of Computing, KAIST Daejeon, South Korea {wonyeol, yhk1344, hongseok.yang}@kaist.ac.kr We present a new algorithm for stochastic variational inference that targets at models with non-differentiable densities. One of the key challenges in stochastic variational inference is to come up with a low-variance estimator of the gradient of a variational objective. We tackle the challenge by generalizing the reparameterization trick, one of the most effective techniques for addressing the variance issue for differentiable models, so that the trick works for non-differentiable models as well. Our algorithm splits the space of latent variables into regions where the density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the algorithm applies the standard reparameterization trick and estimates the gradient restricted to the region. For each potentially non-differentiable boundary, it uses a form of manifold sampling and computes the direction for variational parameters that, if followed, would increase the boundary s contribution to the variational objective. The sum of all the estimates becomes the gradient estimate of our algorithm. Our estimator enjoys the reduced variance of the reparameterization gradient while remaining unbiased even for non-differentiable models. The experiments with our preliminary implementation confirm the benefit of reduced variance and unbiasedness. 1 Introduction Stochastic variational inference (SVI) is a popular choice for performing posterior inference in Bayesian machine learning. It picks a family of variational distributions, and formulates posterior inference as a problem of finding a member of this family that is closest to the target posterior. SVI, then, solves this optimization problem approximately using stochastic gradient ascent. One major challenge in developing an effective SVI algorithm is the difficulty of designing a low-variance estimator for the gradient of the optimization objective. Addressing this challenge has been the driver of recent advances for SVI, such as reparameterization trick [13, 30, 31, 26, 15], clever control variate [28, 7, 8, 34, 6, 23], and continuous relaxation of discrete distributions [20, 10]. Our goal is to tackle the challenge for models with non-differentiable densities. Such a model naturally arises when one starts to use both discrete and continuous random variables or specifies a model using programming constructs, such as if statement, as in probabilistic programming [4, 22, 37, 5]. The high variance of a gradient estimate is a more serious issue for these models than for those with differentiable densities. Key techniques for addressing it simply do not apply in the absence of differentiability. For instance, a prerequisite for the so called reparameterization trick is the differentiability of a model s density function. In the paper, we present a new gradient estimator for non-differentiable models. Our estimator splits the space of latent variables into regions where the joint density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the estimator applies the standard reparameterization trick and estimates the gradient restricted to the 32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada. region. For each potentially non-differentiable boundary, it uses a form of manifold sampling, and computes the direction for variational parameters that, if followed, would increase the boundary s contribution to the variational objective. This manifold sampling step cannot be skipped if we want to get an unbiased estimator, and it only adds a linear overhead to the overall estimation time for a large class of non-differentiable models. The result of our gradient estimator is the sum of all the estimated values for regions and boundaries. Our estimator generalizes the estimator based on the reparameterization trick. When a model has a differentiable density, these two estimators coincide. But even when a model s density is not differentiable and so the reparameterization estimator is not applicable, ours still applies; it continues to be an unbiased estimator, and enjoys variance reduction from reparameterization. The unbiasedness of our estimator is not trivial, and follows from an existing yet less well-known theorem on exchanging integration and differentiation under moving domain [3] and the divergence theorem. We have implemented a prototype of an SVI algorithm that uses our gradient estimator and works for models written in a simple first-order loop-free probabilistic programming language. The experiments with this prototype confirm the strength of our estimator in terms of variance reduction. 2 Variational Inference and Reparameterization Gradient Before presenting our results, we review the basics of stochastic variational inference. Let x and z be, respectively, observed and latent variables living in Rm and Rn, and p(x, z) a density that specifies a probabilistic model about x and z. We are interested in inferring information about the posterior density p(z|x0) for a given value x0 of x. Variational inference approaches this posterior-inference problem from the optimization angle. It recasts posterior inference as a problem of finding a best approximation to the posterior among a collection of pre-selected distributions {qθ(z)}θ Rd, called variational distributions, which all have easy-to-compute and easy-to-differentiate densities and permit efficient sampling. A standard objective for this optimization is to maximize a lower bound of log p(x0) called evidence lower bound or simply ELBO: argmaxθ ELBOθ , where ELBOθ Eqθ(z) log p(x0, z) It is equivalent to the objective of minimizing the KL divergence from qθ(z) to the posterior p(z|x0). Most of recent variational-inference algorithms solve the optimization problem (1) by stochastic gradient ascent. They repeatedly estimate the gradient of ELBOθ and move θ towards the direction of this estimate: θ θ + η θELBOθ The success of this iterative scheme crucially depends on whether it can estimate the gradient well in terms of computation time and variance. As a result, a large part of research efforts on stochastic variational inference has been devoted to constructing low-variance gradient estimators or reducing the variance of existing estimators. The reparameterization trick [13, 30] is the technique of choice for constructing a low-variance gradient estimator for models with differentiable densities. It can be applied in our case if the joint p(x, z) is differentiable with respect to the latent variable z. The trick is a two-step recipe for building a gradient estimator. First, it tells us to find a distribution q(ϵ) on Rn and a smooth function f : Rd Rn Rn such that fθ(ϵ) for ϵ q(ϵ) has the distribution qθ. Next, the reparameterization trick suggests us to use the following estimator: i=1 θlog r(fθ(ϵi)) qθ(fθ(ϵi)), where r(z) p(x0, z) and ϵ1, . . . , ϵN q(ϵ). (2) The reparameterization gradient in (2) is unbiased, and has variance significantly lower than the so called score estimator (or REINFORCE) [35, 27, 36, 28], which does not exploit differentiability. But so far its use has been limited to differentiable models. We will next explain how to lift this limitation. 3 Reparameterization for Non-differentiable Models Our main result is a new unbiased gradient estimator for a class of non-differentiable models, which can use the reparameterization trick despite the non-differentiability. Recall the notations from the previous section: x Rm and z Rn for observed and latent variables, p(x, z) for their joint density, x0 for an observed value, and qθ(z) for a variational distribution parameterized by θ Rd. Our result makes two assumptions. First, the variational distribution qθ(z) satisfies the conditions of the reparameterization gradient. Namely, qθ(z) is continuously differentiable with respect to θ Rd, and is the distribution of fθ(ϵ) for a smooth function f : Rd Rn Rn and a random variable ϵ Rn distributed by q(ϵ). Also, the function fθ on Rn is bijective for every θ Rd. Second, the joint density r(z) = p(x0, z) at x = x0 has the following form: k=1 1[z Rk] rk(z) (3) where rk is a non-negative continuously-differentiable function Rn R, Rk is a (measurable) subset of Rn with measurable boundary Rk such that R Rk dz = 0, and {Rk}1 k K is a partition of Rn. Note that r(z) is an unnormalized posterior under the observation x = x0. The assumption indicates that the posterior r may be non-differentiable at some z s, but all the non-differentiabilities occur only at the boundaries Rk of regions Rk. Also, it ensures that when considered under the usual Lebesgue measure on Rn, these non-differentiable points are negligible (i.e., they are included in a null set of the measure). As we illustrate in our experiments section, models satisfying our assumption naturally arise when one starts to use both discrete and continuous random variables or specifies models using programming constructs, such as if statement, as in probabilistic programming [4, 22, 37, 5]. Our estimator is derived from the following theorem: Theorem 1. Let hk(ϵ, θ) log rk(fθ(ϵ)) qθ(fθ(ϵ)), V (ϵ, θ) Rd n, V (ϵ, θ)ij f 1 θ (z) z=fθ(ϵ) θELBOθ = Eq(ϵ) k=1 1[fθ(ϵ) Rk] θhk(ϵ, θ) | {z } Rep Gradθ f 1 θ ( Rk) q(ϵ)hk(ϵ, θ)V (ϵ, θ) dΣ | {z } Bou Contrθ where the RHS of the plus uses the surface integral of q(ϵ)hk(ϵ, θ)V (ϵ, θ) over the boundary f 1 θ ( Rk) expressed in terms of ϵ, the dΣ is the normal vector of this boundary that is outward pointing with respect to f 1 θ (Rk), and the operation denotes the matrix-vector multiplication. The theorem says that the gradient of ELBOθ comes from two sources. The first is the usual reparameterized gradient of each hk but restricted to its region Rk. The second source is the sum of the surface integrals over the region boundaries Rk. Intuitively, the surface integral for k computes the direction to move θ in order to increase the contribution of the boundary Rk to ELBOθ. Note that the integrand of the surface integral has the additional V term. This term is a by-product of rephrasing the original integration over z in terms of the reparameterization variable ϵ. We write Rep Gradθ for the contribution from the first source, and Bou Contrθ for that from the second source. The proof of the theorem uses an existing but less known theorem about interchanging integration and differentiation under moving domain [3], together with the divergence theorem. It appears in the supplementary material of this paper. At this point, some readers may feel uneasy with the Bou Contrθ term in our theorem. They may reason like this. Every boundary Rk is a measure-zero set in Rn, and non-differentiabilities occur only at these Rk s. So, why do we need more than Rep Gradθ, the case-split version of the usual reparameterization? Unfortunately, this heuristic reasoning is incorrect, as indicated by the following proposition: Proposition 2. There are models satisfying this section s conditions s.t. θELBOθ = Rep Gradθ. Proof. Consider the model p(x, z) = N(z|0, 1) 1[z > 0]N(x|5, 1) + 1[z 0]N(x| 2, 1) for x R and z R, the variational distribution qθ(z) = N(z|θ, 1) for θ R, and its reparameterization fθ(ϵ) = ϵ + θ and q(ϵ) = N(ϵ|0, 1) for ϵ R. For an observed value x0 = 0, the joint density p(x0, z) becomes r(z) = 1[z > 0] c1N(z|0, 1) + 1[z 0] c2N(z|0, 1), where c1 = N(0|5, 1) and c2 = N(0| 2, 1). Notice that r is non-differentiable only at z = 0 and {0} is a null set in R. For any θ, θELBOθ is computed as follows: Since log(r(z)/qθ(z)) = 1[z > 0] (θ2/2 zθ + log c1) + 1[z 0] (θ2/2 zθ + log c2), we have1 ELBOθ = 1 2[ θ2 + erf(θ/ 2) log(c1/c2) + log(c1c2)] and thus obtain θELBOθ = θ + log(c1/c2) exp θ2/2 / On the other hand, Rep Gradθ is computed as follows: After reparameterizing z into ϵ, we have log r(fθ(ϵ))/qθ(fθ(ϵ)) = 1[ϵ + θ > 0] ( θ2/2 ϵθ+log c1)+1[ϵ + θ 0] ( θ2/2 ϵθ+log c2), so the term inside the expectation of Rep Gradθ is 1[ϵ + θ > 0] ( θ ϵ) + 1[ϵ + θ 0] ( θ ϵ) and we obtain Rep Gradθ = θ. Note that θELBOθ = Rep Gradθ for any θ. The difference between the two quantities is Bou Contrθ in Theorem 1. The main culprit here is that interchanging differentiation and integration is sometimes invalid: for D1, D2(θ) Rn and α1, α2 : Rn Rd R, the below equations do not hold in general if α1 is not differentiable in θ, and if D2( ) is not constant (even when α2 is differentiable in θ). D1 α1(ϵ, θ)dϵ = Z D1 θα1(ϵ, θ)dϵ and θ D2(θ) α2(ϵ, θ)dϵ = Z D2(θ) θα2(ϵ, θ)dϵ. The Rep Gradθ term in Theorem 1 can be easily estimated by the standard Monte Carlo: Rep Gradθ 1 N k=1 1 fθ(ϵi) Rk θhk(ϵi, θ) for i.i.d. ϵ1, . . . , ϵN q(ϵ). We write Rep Gradθ for this estimate. However, estimating the other Bou Contrθ term is not that easy, because of the difficulties in estimating surface integrals in the term. In general, to approximate a surface integral well, we need a parameterization of the surface, and a scheme for generating samples from it [2]; this general methodology and a known theorem related to our case are reviewed in the supplementary material. In this paper, we focus on a class of models that use relatively simple (reparameterized) boundaries f 1 θ ( Rk) and permit, as a result, an efficient method for estimating surface integrals in Bou Contrθ. A good way to understand our simple-boundary condition is to start with something even simpler, namely the condition that f 1 θ ( Rk) is an (n 1)-dimensional hyperplane {ϵ | a ϵ = c}. Here the operation denotes the dot-product. A surface integral over such a hyperplane can be estimated using the following theorem: Theorem 3. Let q(ϵ) = Qn i=1 q(ϵi) and S a measurable subset of Rn. Assume that S = {ϵ | a ϵ > c} or S = {ϵ | a ϵ c} for some a Rn and c R, and that aj = 0 for some j. Then, Z q(ϵ)F (ϵ) dΣ = Eq(ζ) [G(g(ζ)) n] for all measurable F : Rn Rd n. Here dΣ is the normal vector pointing outward with respect to S, ζ ranges over Rn 1, its density q(ζ) is the product of the densities for its components, and this component density q(ζi) is the same as the density q(ϵi ) for the i -th component of ϵ, where i = i + 1[i j]. Also, G(ϵ) q(ϵj)F (ϵ), g(ζ) ζ1, . . . , ζj 1, 1 aj (c a j ζ), ζj, . . . , ζn 1 , a j (a1, . . . , aj 1, aj+1, . . . , an), n sgn( aj) a1 aj , . . . , aj 1 aj , 1, aj+1 aj , . . . , an 1The error function erf is defined by erf(x) = 2 R x 0 exp t2 dt/ π. The theorem says that if the boundary S is an (n 1)-dimensional hyperplane {ϵ | a ϵ = c}, we can parameterize the surface by a linear map g : Rn 1 Rn and express the surface integral as an expectation over q(ζ). This distribution for ζ is the marginalization of q(ϵ) over the j-th component. Inside the expectation, we have the product of the matrix G and the vector n. The matrix comes from the integrand of the surface integral, and the vector is the direction of the surface. Note that G(ϵ) has q(ϵj) instead of q(ϵ); the missing part of q(ϵ) has been converted to the distribution q(ζ). When every f 1 θ ( Rk) is an (n 1)-dimensional hyperplane {ϵ | a ϵ = c} for a Rn and c R with ajk = 0, we can use Theorem 3 and estimate the surface integrals in Bou Contrθ as follows: Z f 1 θ ( Rk) q(ϵ)hk(ϵ, θ)V (ϵ, θ) dΣ 1 M i=1 W (g(ζi)) n for i.i.d. ζ1, . . . , ζM q(ζ), where W (ϵ) = q(ϵjk)hk(ϵ, θ)V (ϵ, θ). Let Bou Contr(θ,k) be this estimate. Then, our estimator for the gradient of ELBOθ in this case computes: Bou Contr(θ,k) V The estimator is unbiased because of Theorems 1 and 3: Corollary 4. E h θELBOθ Vi = θELBOθ. We now relax the condition that each boundary is a hyperplane, and consider a more liberal simpleboundary condition, which is often satisfied by non-differentiable models from a first-order loop-free probabilistic programming language. This new condition and the estimator under this condition are what we have used in our implementation. The relaxed condition is that the regions {f 1 θ (Rk)}1 k K are obtained by partitioning Rn with L (n 1)-dimensional hyperplanes. That is, there are affine maps Φ1, . . . , ΦL : Rn R such that for all 1 k K, f 1 θ (Rk) = l=1 Sl,(σk)l for some σk { 1, 1}L where Sl,1 = {ϵ | Φl(ϵ) > 0} and Sl, 1 = {ϵ | Φl(ϵ) 0}. Each affine map Φl defines an (n 1)-dimensional hyperplane Sl,1, and (σk)l specifies on which side the region f 1 θ (Rk) lies with respect to the hyperplane Sl,1. Every probabilistic model written in a first-order probabilistic programming language satisfies the relaxed condition, if the model does not contain a loop and uses only a fixed finite number of random variables and the branch condition of each if statement in the model is linear in the latent variable z; in such a case, L is the number of if statements in the model. Under the new condition, how can we estimate Bou Contrθ? A naive approach is to estimate the k-th surface integral for each k (in some way) and sum them up. However, with L hyperplanes, the number K of regions can grow as fast as O (Ln), implying that the naive approach is slow. Even worse the boundaries f 1 θ ( Rk) do not satisfy the condition of Theorem 3, and just estimating the surface integral over each f 1 θ ( Rk) may be difficult. A solution is to transform the original formulation of Bou Contrθ such that it can be expressed as the sum of surface integrals over Sl,1 s. The transformation is based on the following derivation: Bou Contrθ = f 1 θ ( Rk) q(ϵ)hk(ϵ, θ)V (ϵ, θ) dΣ q(ϵ)V (ϵ, θ) k=1 1 h ϵ f 1 θ (Rk) i (σk)lhk(ϵ, θ) dΣ (4) where T denotes the closure of T Rn, and dΣ in (4) is the normal vector pointing outward with respect to Sl,1. Since {f 1 θ (Rk)}k are obtained by partitioning Rn with { Sl,1}l, we can rearrange the sum of K surface integrals over complicated boundaries f 1 θ ( Rk), into the sum of L surface integrals over the hyperplanes Sl,1 as above. Although the expression inside the summation over k in (4) looks complicated, for almost all ϵ, the indicator function is nonzero for exactly two k s: k1 with (σk1)l = 1 and k 1 with (σk 1)l = 1. So, we can efficiently estimate the l-th surface integral in (4) using Theorem 3, and call this estimate Bou Contr(θ,l) . Then, our estimator for the gradient of ELBOθ in this more general case computes: Bou Contr(θ,l) V The estimator is unbiased because of Theorems 1 and 3 and Equation 4: Corollary 5. E h i = θELBOθ. 4 Experimental Evaluation We experimentally compare our gradient estimator (OURS) to the score estimator (SCORE), an unbiased gradient estimator that is applicable to non-differentiable models, and the reparameterization estimator (REPARAM), a biased gradient estimator that computes only Rep Gradθ (discussed in Section 3). REPARAM is biased in our experiments because it is applied to non-differentiable models. We implemented a black-box variational inference engine that accepts a probabilistic model written in a simple probabilistic programming language (which supports basic constructs such as sample, observe, and if statements) and performs variational inference using one of the three aforementioned gradient estimators. Our implementation2 is written in Python and uses autograd [18], an automatic differentiation package for Python, to automatically compute the gradient term in Rep Gradθ V for an arbitrary probabilistic model. Benchmarks. We evaluate our estimator on three models for small sequential data: temperature [33] models the random dynamics of a controller that attempts to keep the temperature of a room within specified bounds. The controller s state has a continuous part for the room temperature and a discrete part that records the on or off of an air conditioner. At each time step, the value of this discrete part decides which of two different random state updates is employed, and incurs the non-differentiability of the model s density. We use a synthetically-generated sequence of 21 noisy measurements of temperatures, and perform posterior inference on the sequence of the controller s states given these noisy measurements. This model consists of a 41-dimensional latent variable and 80 if statements. textmsg [1] is a model for the numbers of per-day SNS messages over the period of 74 days (skipping every other day). It allows the SNS-usage pattern to change over the period, and this change causes non-differentiability. Finding the posterior distribution over this change is the goal of the inference problem in this case. We use the data from [1]. This model consists of a 3-dimensional latent variable and 37 if statements. influenza [32] is a model for the US influenza mortality data in 1969. The mortality rate in each month depends on whether the dominant influenza virus is of type 1 or 2, and finding this type information from a sequence of observed mortality rates is the goal of the inference. The virus type is the cause of non-differentiability in this example. This model consists of a 37-dimensional latent variable and 24 if statements. Experimental setup. We optimize the ELBO objective using Adam [11] with two stepsizes: 0.001 and 0.01. We run Adam for 10000 iterations and at each iteration, we compute each estimator using N {1, 8, 16} Monte Carlo samples. For OURS, we use a single subsample l (drawn uniformly at random from {1, , L}) to estimate the summation in (5), and use N Monte Carlo samples to compute Bou Contr(θ,l) . While maximizing ELBO, we measure two things: the variance of estimated gradients of ELBO, and ELBO itself. Since each gradient is not scalar, we measure two kinds of variance of the gradient, as in [23]: Avg(V( )), the average variance of each of its components, and V( 2), the variance of its l2-norm. To estimate the variances and the ELBO objective, we use 16 and 1000 Monte Carlo samples, respectively. 2 Code is available at https://github.com/wonyeol/reparam-nondiff. Estimator Type of Variance temperature textmsg influenza REPARAM Avg(V( )) 4.45 10 9 2.91 10 2 4.38 10 3 V( 2) 2.45 10 8 2.92 10 2 2.12 10 3 OURS Avg(V( )) 1.85 10 6 2.77 10 2 4.89 10 3 V( 2) 7.59 10 5 2.46 10 2 2.36 10 3 (a) stepsize = 0.001 Estimator Type of Variance temperature textmsg influenza REPARAM Avg(V( )) 3.88 10 11 5.03 10 4 2.46 10 3 V( 2) 6.11 10 11 1.02 10 3 1.26 10 3 OURS Avg(V( )) 1.24 10 11 5.07 10 4 2.80 10 3 V( 2) 8.05 10 11 8.12 10 4 1.40 10 3 (b) stepsize = 0.01 Table 1: Ratio of {REPARAM, OURS} s average variance to SCORE s for N = 1. The values for SCORE are all 1, so omitted. The optimization trajectories used to compute the above variances are shown in Figure 1. Estimator temperature textmsg influenza SCORE 21.7 4.9 18.7 REPARAM 46.1 15.4 251.4 OURS 79.2 24.9 269.8 Table 2: Computation time (in ms) per iteration for N = 1. Results. Table 1 compares the average variance of each estimator for N = 1, where the average is taken over a single optimization trajectory. The table clearly shows that during the optimization process, OURS has several orders of magnitude (sometimes < 10 10 times) smaller variances than SCORE. Since OURS computes additional terms when compared with REPARAM, we expect that OURS would have larger variances than REPARAM, and this is confirmed by the table. It is noteworthy, however, that for most benchmarks, the averaged variances of OURS are very close to those of REPARAM. This suggests that the additional term Bou Contrθ in our estimator often introduces much smaller variances than the reparameterization term Rep Gradθ. Figure 1 shows the ELBO objective, for different estimators with different N s, as a function of the iteration number. As expected, using a larger N makes all estimators converge faster in a more stable manner. In all three benchmarks, OURS outperforms (or performs similarly to) the other two and converges stably, and REPARAM beats SCORE. Increasing the stepsize to 0.01 makes SCORE unstable in temperature and textmsg. It is also worth noting that REPARAM converges to sub-optimal values in temperature (possibly because REPARAM is biased). Table 2 shows the computation time per iteration of each approach for N = 1. Our implementation performs the worst in this wall-time comparison, but the gap between OURS and REPARAM is not huge: the computation time of OURS is less than 1.72 times that of REPARAM in all benchmarks. Furthermore, we want to point out that our implementation is an early unoptimized prototype, and there are several rooms to improve in the implementation. For instance, it currently constructs Python functions dynamically, and computes the gradients of these functions using autograd. But this dynamic approach is costly because autograd is not optimized for such dynamically constructed functions; this can also be observed in the bad performance of REPARAM, particularly in influenza, that employs the same strategy of dynamically constructing functions and taking their gradients. So one possible optimization is to avoid this gradient computation of dynamically constructed functions by building the functions statically during compilation. 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (a) temperature (stepsize = 0.001) 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (b) temperature (stepsize = 0.01) 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (c) textmsg (stepsize = 0.001) 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (d) textmsg (stepsize = 0.01) 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (e) influenza (stepsize = 0.001) 0 2000 4000 6000 8000 10000 Iteration Score Repar Ours (f) influenza (stepsize = 0.01) Figure 1: The ELBO objective as a function of the iteration number. {dotted, dashed, solid} lines represent {N = 1, N = 8, N = 16}. 5 Related Work A common example of a model with a non-differentiable density is the one that uses discrete random variables, typically together with continuous random variables.3 Coming up with an efficient algorithm for stochastic variational inference for such a model has been an active research topic. Maddison et al. [20] and Jang et al. [10] proposed continuous relaxations of discrete random variables that convert non-differentiable variational objectives to differentiable ones and make the reparameterization trick applicable. Also, a variety of control variates for the standard score estimator [35, 27, 36, 28] for the gradients of variational objectives have been developed [28, 7, 8, 34, 6, 23], some of which use biased yet differentiable control variates such that the reparameterization trick can be used to correct the bias [7, 34, 6]. Our work extends this line of research by adding a version of the reparameterization trick that can be applied to models with discrete random variables. For instance, consider a model p(x, z) with z discrete. By applying the Gumbel-Max reparameterization [9, 21] to z, we transform p(x, z) to p(x, z, c), where c is sampled from the Gumbel distribution and z in p(x, z, c) is defined determin- 3 Another common example of such a model is the one that uses if statements whose branch conditions contain continuous random variables, which is the main focus of our work. istically from c using the arg max operation. Since arg max can be written as if statements, we can express p(x, z, c) in the form of (3) to which our reparameterization gradient can be applied. Investigating the effectiveness of this approach for discrete random variables is an interesting topic for future research. The reparameterization trick was initially used with normal distribution [13, 30], but its scope was soon extended to other common distributions, such as gamma, Dirichlet, and beta [14, 31, 26]. Techniques for constructing normalizing flow [29, 12] can also be viewed as methods for creating distributions in a reparameterized form. In the paper, we did not consider these recent developments and mainly focused on the reparameterization with normal distribution. One interesting future avenue is to further develop our approach for these other reparameterization cases. We expect that the main challenge will be to find an effective method for handling the surface integrals in Theorem 1. 6 Conclusion We have presented a new estimator for the gradient of the standard variational objective, ELBO. The key feature of our estimator is that it can keep variance under control by using a form of the reparameterization trick even when the density of a model is not differentiable. The estimator splits the space of the latent random variable into a lower-dimensional subspace where the density may fail to be differentiable, and the rest where the density is differentiable. Then, it estimates the contributions of both parts to the gradient separately, using a version of manifold sampling for the former and the reparameterization trick for the latter. We have shown the unbiasedness of our estimator using a theorem for interchanging integration and differentiation under moving domain [3] and the divergence theorem. Also, we have experimentally demonstrated the promise of our estimator using three time-series models. One interesting future direction is to investigate the possibility of applying our ideas to recent variational objectives [24, 17, 19, 16, 25], which are based on tighter lower bounds of marginal likelihood than the standard ELBO. When viewed from a high level, our work suggests a heuristic of splitting the latent space into a bad yet tiny subspace and the remaining good one, and solving an estimation problem in each subspace separately. The latter subspace has several good properties and so it may allow the use of efficient estimation techniques that exploit those properties. The former subspace is, on the other hand, tiny and the estimation error from the subspace may, therefore, be relatively small. We would like to explore this heuristic and its extension in different contexts, such as stochastic variational inference with different objectives [24, 17, 19, 16, 25]. Acknowledgments We thank Hyunjik Kim, George Tucker, Frank Wood and anonymous reviewers for their helpful comments, and Shin Yoo and Seongmin Lee for allowing and helping us to use their cluster machines. This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF2018R1A5A1059921), and also by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7068177). [1] C. Davidson-Pilon. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. Addison-Wesley Professional, 2015. [2] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a Manifold, volume Volume 10 of Collections, pages 102 125. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2013. [3] H. Flanders. Differentiation Under the Integral Sign. The American Mathematical Monthly, 80(6):615 627, 1973. [4] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. In Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence (UAI), 2008. [5] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic Programming. In International Conference on Software Engineering (ICSE, FOSE track), 2014. [6] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. K. Duvenaud. Backpropagation through the Void: Optimizing control variates for black-box gradient estimation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. [7] S. Gu, S. Levine, I. Sutskever, and A. Mnih. Mu Prop: Unbiased Backpropagation for Stochastic Neural Networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016. [8] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017. [9] E. J. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications: a Series of Lectures. Number 33. US Govt. Print. Office, 1954. [10] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017. [11] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015. [12] D. P. Kingma, T. Salimans, R. Józefowicz, X. Chen, I. Sutskever, and M. Welling. Improving Variational Autoencoders with Inverse Autoregressive Flow. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016. [13] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014. [14] D. A. Knowles. Stochastic gradient variational Bayes for gamma approximating distributions. ar Xiv, page 1509.01631, 2015. [15] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic Differentiation Variational Inference. J. Mach. Learn. Res., 18(1):430 474, Jan. 2017. [16] T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-Encoding Sequential Monte Carlo. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. [17] Y. Li and R. E. Turner. RéNyi Divergence Variational Inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016. [18] D. Maclaurin. Modeling, Inference and Optimization with Composable Differentiable Procedures. Ph D thesis, Harvard University, 2016. [19] C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. W. Teh. Filtering Variational Objectives. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017. [20] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017. [21] C. J. Maddison, D. Tarlow, and T. Minka. A* Sampling. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2014. [22] V. K. Mansinghka, D. Selsam, and Y. N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. ar Xiv, 2014. [23] A. C. Miller, N. J. Foti, A. D Amour, and R. P. Adams. Reducing Reparameterization Gradient Variance. ar Xiv preprint ar Xiv:1705.07880, 2017. [24] A. Mnih and D. J. Rezende. Variational Inference for Monte Carlo Objectives. In Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML), 2016. [25] C. A. Naesseth, S. W. Linderman, R. Ranganath, and D. M. Blei. Variational Sequential Monte Carlo. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. To appear. [26] C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. M. Blei. Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. [27] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian Inference with Stochastic Search. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012. [28] R. Ranganath, S. Gerrish, and D. Blei. Black Box Variational Inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014. [29] D. J. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), 2015. [30] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014. [31] F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. The Generalized Reparameterization Gradient. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016. [32] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in Statistics). Springer-Verlag, 2005. [33] S. E. Z. Soudjani, R. Majumdar, and T. Nagapetyan. Multilevel Monte Carlo Method for Statistical Model Checking of Hybrid Systems. In Proceedings of the 14th International Conference on Quantitative Evaluation of Systems (QUEST), 2017. [34] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017. [35] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn., 8(3-4):229 256, May 1992. [36] D. Wingate and T. Weber. Automated Variational Inference in Probabilistic Programming. Co RR, abs/1301.1299, 2013. [37] F. Wood, J.-W. van de Meent, and V. Mansinghka. A New Approach to Probabilistic Programming Inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.