# Gradient Estimation with Discrete Stein Operators

Jiaxin Shi (Stanford University, jiaxins@stanford.edu) · Yuhao Zhou (Tsinghua University, yuhaoz.cs@gmail.com) · Jessica Hwang (Stanford University, jjhwang@stanford.edu) · Michalis K. Titsias (DeepMind, mtitsias@google.com) · Lester Mackey (Microsoft Research New England, lmackey@microsoft.com)

**Abstract.** Gradient estimation, i.e., approximating the gradient of an expectation with respect to the parameters of a distribution, is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.

## 1 Introduction

Modern machine learning relies heavily on gradient methods to optimize the parameters of a learning system. However, exact gradient computation is often difficult. For example, in variational inference for training latent variable models [29, 43], policy gradient algorithms in reinforcement learning [65], and combinatorial optimization [40], the exact gradient features an intractable sum or integral introduced by an expectation under an evolving probability distribution. To make progress, one resorts to estimating the gradient by drawing samples from that distribution [39, 50].

The two main classes of gradient estimators used in machine learning are the pathwise or reparameterization gradient estimators [29, 47, 58] and the REINFORCE or score function estimators [18, 65]. The pathwise estimators have shown great success in training variational autoencoders [29] but are only applicable to continuous probability distributions. The REINFORCE estimators are more general-purpose and easily accommodate discrete distributions but often suffer from excessively high variance.

To improve the quality of REINFORCE estimators, we develop a new variance reduction technique for discrete expectations. Our method is based upon Stein operators [see, e.g., 1, 55], computable functionals that generate mean-zero functions under a target distribution and provide a natural way of designing control variates (CVs) for stochastic estimation. We first provide a general recipe for constructing practical Stein operators for discrete distributions (Table 1), generalizing the prior literature [6, 8, 9, 16, 25, 46, 67]. We then develop a gradient estimation framework, RODEO, that augments REINFORCE estimators with mean-zero CVs generated from Stein operators. Finally, inspired by Double CV [60], we extend our method to develop CVs for REINFORCE leave-one-out estimators [30, 49] to further reduce the variance.

\*Work done while at Microsoft Research New England.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Table 1: Discrete Stein operators that generate mean-zero functions under $q$ (i.e., $\mathbb{E}_q[(Ah)(x)] = 0$).
| Stein Operator | $(Ah)(x)$ |
| --- | --- |
| Gibbs (4) | $\frac{1}{d}\sum_{i=1}^{d}\sum_{y:\, y_{-i} = x_{-i}} q(y_i \mid x_{-i})\, h(y) - h(x)$ |
| MPF (6) | $\sum_{y \in \mathcal{N}_x,\, y \neq x} \sqrt{q(y)/q(x)}\, (h(y) - h(x))$ |
| Barker (6) | $\sum_{y \in \mathcal{N}_x,\, y \neq x} \frac{q(y)}{q(x) + q(y)}\, (h(y) - h(x))$ |
| Difference (8) | $\frac{1}{d}\sum_{i=1}^{d} h(\mathrm{dec}_i(x)) - \frac{q(\mathrm{inc}_i(x))}{q(x)}\, h(x)$ |

The benefits of using Stein operators to construct discrete CVs are twofold. First, the operator structure permits us to learn CVs with a flexible functional form, such as those parameterized by neural networks. Second, since our operators are derived from Markov chains on the discrete support, they naturally incorporate information from neighboring states of the process for variance reduction.

We evaluate RODEO on 15 benchmark tasks, including training binary variational autoencoders (VAEs) with one or more stochastic layers. In most cases and with the same number of function evaluations, RODEO delivers lower variance and better training objectives than the state-of-the-art gradient estimators DisARM [14, 69], ARMS [13], Double CV [60], and RELAX [20].

## 2 Background

We consider the problem of maximizing the objective function $\mathbb{E}_{q_\theta}[f(x)]$ with respect to the parameters $\theta$ of a discrete distribution $q_\theta(x)$. Throughout the paper we assume $f(x)$ is a differentiable function of real-valued inputs $x \in \mathbb{R}^d$ but is only evaluated at a discrete subset $\mathcal{X}^d$ due to the discreteness of $q_\theta$ (this assumption holds for many discrete probabilistic models [21], including the binary VAEs of Section 6). Exact computation of the expectation is typically intractable due to the complex nature of $f(x)$ and $q_\theta(x)$. The standard workaround is to rewrite the gradient as $\nabla_\theta \mathbb{E}_{q_\theta}[f(x)] = \mathbb{E}_{q_\theta}[f(x)\nabla_\theta \log q_\theta(x)]$ and employ the Monte Carlo gradient estimator known as the score function or REINFORCE estimator [18, 65]:

$$\frac{1}{K}\sum_{k=1}^{K}\big(f(x^{(k)}) - b\big)\,\nabla_\theta \log q_\theta(x^{(k)}) \quad \text{for } x^{(1)}, \ldots, x^{(K)} \overset{\text{i.i.d.}}{\sim} q_\theta. \tag{REINFORCE}$$

Here, $b$ is a constant called the baseline, introduced to reduce the variance of the estimator by reducing the scaling effect of $f(x^{(k)})$. Since $\mathbb{E}_{q_\theta}[\nabla_\theta \log q_\theta(x)] = 0$, the REINFORCE estimator is unbiased for any choice of $b$, and the term $b\,\nabla_\theta \log q_\theta(x)$ is known as a control variate (CV) [42, Ch. 8]. The optimal baseline can be estimated using additional function evaluations [7, 45, 64]. A simpler approach is to use moving averages of $f$ from historical evaluations or to train function approximators to mimic those values [37]. When $K \geq 2$, a powerful variant of REINFORCE is obtained by replacing $b$ with the leave-one-out average of function values:

$$\frac{1}{K}\sum_{k=1}^{K}\Big(f(x^{(k)}) - \frac{1}{K-1}\sum_{j \neq k} f(x^{(j)})\Big)\nabla_\theta \log q_\theta(x^{(k)}). \tag{RLOO}$$

This approach is often called the REINFORCE leave-one-out (RLOO) estimator [30, 48, 49] and was recently observed to have very strong performance in training discrete latent variable models [14, 48].
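To make the two estimators above concrete, the following self-contained sketch (our illustration, not code from the paper) compares the empirical variance of plain REINFORCE (with $b = 0$) and RLOO for a factorized Bernoulli $q_\theta$; the quadratic target, dimensions, and sample counts are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 2
theta = np.zeros(d)                       # logits of a factorized Bernoulli q_theta

def f(x):                                 # toy target; any function on {0,1}^d works
    return np.sum((x - 0.45) ** 2)

def sample(K):
    p = 1 / (1 + np.exp(-theta))
    return (rng.random((K, d)) < p).astype(float)

def score(x):                             # grad_theta log q_theta(x) = x - p
    return x - 1 / (1 + np.exp(-theta))

def reinforce(xs, b=0.0):
    return np.mean([(f(x) - b) * score(x) for x in xs], axis=0)

def rloo(xs):
    fs = np.array([f(x) for x in xs])
    loo = (fs.sum() - fs) / (len(xs) - 1)  # leave-one-out baseline per sample
    return np.mean([(fs[k] - loo[k]) * score(x) for k, x in enumerate(xs)], axis=0)

# Compare empirical variances over repeated draws; RLOO is typically far lower.
g_r = np.stack([reinforce(sample(K)) for _ in range(5000)])
g_l = np.stack([rloo(sample(K)) for _ in range(5000)])
print("REINFORCE variance:", g_r.var(axis=0).sum())
print("RLOO variance:     ", g_l.var(axis=0).sum())
```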
All of the above methods construct a baseline that is independent of the point $x^{(k)}$ under consideration, but there are other ways to preserve the unbiasedness of the estimator. We are free to use a sample-dependent baseline $b(x^{(k)})$ as long as the expectation $c \triangleq \mathbb{E}_{q_\theta}[b(x)\nabla_\theta \log q_\theta(x)]$ is easily computed or has a low-variance unbiased estimate. In this case, we can correct for any bias introduced by $b(x^{(k)})$ by adding in this expectation: $\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)}) - b(x^{(k)}))\nabla_\theta \log q_\theta(x^{(k)}) + c$. For example, $b(x^{(k)})$ can be a lower bound or a Taylor expansion of $f(x^{(k)})$ [22, 43]. Taking this a step further, the Double CV estimator [60] proposed to treat $f(x^{(k)}) - b_k(x^{(k)})$ as the effective objective function and apply the leave-one-out idea:

$$\frac{1}{K}\sum_{k=1}^{K}\Big(f(x^{(k)}) - b_k(x^{(k)}) - \frac{1}{K-1}\sum_{j \neq k}\big(f(x^{(j)}) - b_j(x^{(j)})\big)\Big)\nabla_\theta \log q_\theta(x^{(k)}) + c.$$

The resulting estimator adds two CVs to RLOO: the global CV $b_k(x^{(k)})\nabla_\theta \log q_\theta(x^{(k)})$ and the local CVs $b_j(x^{(j)})$. Intuitively, $b_j(x^{(j)})$ is aimed at reducing the variance of the leave-one-out average. This is motivated by the fact that replacing $\frac{1}{K-1}\sum_{j \neq k} f(x^{(j)})$ with $\mathbb{E}_{q_\theta}[f]$ in RLOO leads to lower variance [60, Prop. 1]. To obtain a tractable correction term $c$, Titsias and Shi [60] adopt a linear design for $b_k$, namely $b_k(x) = \frac{1}{K-1}\sum_{j \neq k} f(x^{(j)})\, a^\top(x - \mu)$ for $\mu = \mathbb{E}_{q_\theta}[x]$ and a learnable coefficient $a$.

Although the Double CV framework points in a promising new direction for developing better REINFORCE estimators, one only obtains a significant reduction in variance when $b_k$ is strongly correlated with $f$. This may fail to hold for the linear $b_k$ above and a highly nonlinear $f$, especially in applications like training deep generative models. In the following two sections, we introduce a method that allows us to use very flexible CVs while still maintaining a tractable correction term. Our method enables online adaptation of CVs to minimize gradient variance (similar to RELAX [20]) but does not assume $q_\theta$ has a continuous reparameterization. We then apply it to generalize the linear CVs of Double CV to very flexible ones such as neural networks. Moreover, we provide an effective CV design based on surrogate functions that requires no additional evaluations of $f$ compared to RLOO.

## 3 Control Variates from Discrete Stein Operators

At the heart of our new estimator is a technique for generating flexible discrete CVs, that is, for generating a rich class of functions that have known expectations under a given discrete distribution $q$. One way to achieve this is to identify any discrete-time Markov chain $(X^{(t)})_{t=0}^{\infty}$ with stationary distribution $q$. Then the transition matrix $P$, defined via $P_{xy} = \mathbb{P}(X^{(t+1)} = y \mid X^{(t)} = x)$, satisfies $P^\top q = q$ and hence

$$\mathbb{E}_q[((P - I)h)(x)] = 0 \tag{1}$$

for any integrable function $h$. We overload our notation so that, for any suitable matrix $A$, $(Ah)(x) \triangleq \sum_y A_{xy} h(y)$. In other words, for any integrable $h$, the function $(P - I)h$ is a valid CV as it has known mean under $q$. Moreover, the linear operator $P - I$ is an example of a Stein operator [1, 55] in the sense that it generates mean-zero functions under $q$. In fact, both Stein et al. [56] and Dellaportas and Kontoyiannis [12] developed CVs of the form $(P - I)h$ based on reversible discrete-time Markov chains and linear input functions $h(x) = x_i$. More generally, Barbour [3] and Henderson [23] observed that if we identify a continuous-time Markov chain $(X^{(t)})_{t \geq 0}$ with $q$ as its stationary distribution, then the generator $A$, defined via

$$(Ah)(x) = \lim_{t \to 0^+} \frac{\mathbb{E}[h(X^{(t)}) \mid X^{(0)} = x] - h(x)}{t}, \tag{2}$$

satisfies $A^\top q = 0$ and hence $\mathbb{E}_q[Ah] = 0$ for all integrable $h$. Therefore, $A$ is also a Stein operator suitable for generating CVs. Moreover, since any discrete-time chain with transition matrix $P$ can be embedded into a continuous-time chain with transition rate matrix $A = P - I$, this continuous-time construction is strictly more general [57, Ch. 4].

### 3.1 Discrete Stein operators

We next present several examples of broadly applicable discrete Stein operators (summarized in Table 1) that can serve as practical defaults.

**Gibbs Stein operator.** The transition kernel of the random-scan Gibbs sampler with stationary distribution $q$ [see, e.g., 17] is

$$P_{xy} = \frac{1}{d}\sum_{i=1}^{d} q(y_i \mid x_{-i})\,\mathbb{1}(y_{-i} = x_{-i}), \tag{3}$$

where $\mathbb{1}(\cdot)$ is the indicator function. The associated Stein operator is $A = P - I$ with

$$(Ah)(x) = \frac{1}{d}\sum_{i=1}^{d}\sum_{y:\, y_{-i} = x_{-i}} q(y_i \mid x_{-i})\, h(y) - h(x). \tag{4}$$

In the binary variable setting, (4) recovers the operator that Bresler and Nagaraj [8] and Reinert and Ross [46] used to bound distances between the stationary distributions of Glauber dynamics Markov chains.
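As a quick sanity check of the mean-zero property (1) for the Gibbs operator (4), the sketch below (ours, not from the paper) enumerates all $2^d$ states of a small factorized Bernoulli $q$, for which $q(y_i \mid x_{-i})$ reduces to the marginal, and confirms $\mathbb{E}_q[(Ah)(x)] \approx 0$ for an arbitrary test function $h$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d = 4
p = rng.random(d)                          # factorized Bernoulli q over {0,1}^d

def q(x):
    return np.prod(np.where(x == 1, p, 1 - p))

def gibbs_stein(h, x):
    # (Ah)(x) = (1/d) sum_i sum_{y: y_{-i}=x_{-i}} q(y_i | x_{-i}) h(y) - h(x)
    total = 0.0
    for i in range(d):
        for v in (0, 1):
            y = x.copy()
            y[i] = v
            cond = p[i] if v == 1 else 1 - p[i]   # q(y_i | x_{-i}) = q(y_i) here
            total += cond * h(y)
    return total / d - h(x)

h = lambda x: np.sin(x @ np.arange(1, d + 1))     # arbitrary test function

# E_q[(Ah)(x)] should vanish: sum over all 2^d states.
mean = sum(q(np.array(x)) * gibbs_stein(h, np.array(x))
           for x in itertools.product((0, 1), repeat=d))
print(f"E_q[(Ah)(x)] = {mean:.2e}")               # ~0 up to floating point error
```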
**Zanella Stein operator.** Zanella [70] studied continuous-time Markov chains with generator

$$A_{xy} = \phi\big(q(y)/q(x)\big)\,\mathbb{1}(y \in \mathcal{N}_x,\ y \neq x) - \sum_{z \neq x} A_{xz}\,\mathbb{1}(y = x), \tag{5}$$

where $\mathcal{N}_x$ denotes the transition neighborhood of $x$ and $\phi$ is a continuous positive function satisfying $\phi(t) = t\,\phi(1/t)$. Conveniently, the neighborhood structure can be arbitrarily sparse. Moreover, the constraint on $\phi$ ensures that the chain satisfies detailed balance (i.e., $q(x)A_{xy} = q(y)A_{yx}$) and hence that $A^\top q = 0$. Hodgkinson et al. [24] call the associated operator the Zanella Stein operator:

$$(Ah)(x) = \sum_{y \in \mathcal{N}_x,\ y \neq x} \phi\big(q(y)/q(x)\big)\,(h(y) - h(x)). \tag{6}$$

Important special cases include the minimum probability flow [53] (MPF) Stein operator ($\phi(t) = \sqrt{t}$), discussed in Barp et al. [6], and the Barker [4] Stein operator ($\phi(t) = \frac{t}{t+1}$).

**Birth-death Stein operator.** Let $e_1, \ldots, e_d$ represent the standard basis vectors of $\mathbb{R}^d$. For any finite-cardinality space, we may index the elements of $\mathcal{X} = \{s_0, \ldots, s_{m-1}\}$, let $\mathrm{idx}(x_i)$ return the index of the element that $x_i$ represents, and define the increment and decrement operators

$$\mathrm{inc}_i(x) = x + e_i\big(s_{(\mathrm{idx}(x_i)+1) \bmod m} - x_i\big) \quad \text{and} \quad \mathrm{dec}_i(x) = x + e_i\big(s_{(\mathrm{idx}(x_i)-1) \bmod m} - x_i\big).$$

Then the product birth-death process [28] on $\mathcal{X}^d$ with birth rates $b_{i,x} = \frac{q(\mathrm{inc}_i(x))}{q(x)}$, death rates $d_{i,x} = 1$, and generator

$$A_{xy} = \frac{1}{d}\sum_{i=1}^{d} b_{i,x}\,\mathbb{1}(y = \mathrm{inc}_i(x)) + d_{i,x}\,\mathbb{1}(y = \mathrm{dec}_i(x)) - (b_{i,x} + d_{i,x})\,\mathbb{1}(y = x)$$

has $q$ as a stationary distribution, as $A^\top q = 0$. This construction yields the birth-death Stein operator

$$(Ag)(x) = \frac{1}{d}\sum_{i=1}^{d} b_{i,x}\big(g(\mathrm{inc}_i(x)) - g(x)\big) - d_{i,x}\big(g(x) - g(\mathrm{dec}_i(x))\big). \tag{7}$$

An analogous operator is available for countably infinite $\mathcal{X}$ [24], and Brown and Xia [9], Eichelsbacher and Reinert [16], and Holmes [25] used related operators to characterize convergence to discrete target distributions. Moreover, by substituting $h(x) = g(x) - g(\mathrm{inc}_i(x))$ in (7), we recover the difference Stein operator proposed by Yang et al. [67] without reference to birth-death processes:

$$(Ah)(x) = \frac{1}{d}\sum_{i=1}^{d} h(\mathrm{dec}_i(x)) - b_{i,x}\,h(x). \tag{8}$$

**Choosing a Stein operator.** Despite being better known, the difference and MPF operators often suffer from large variance and numerical instability due to their use of unbounded probability ratios $q(y)/q(x)$. As a result, we recommend the numerically stable Gibbs operator when each component of $x$ is binary or takes on a small number of values. When $x_i$ takes on a large number of values $m$, the Gibbs operator suffers from linear scaling with $m$. In this case, we recommend the Barker operator, for which a sparse neighborhood structure can be specified, such as $\mathcal{N}_x = \{\mathrm{inc}_i(x), \mathrm{dec}_i(x) : i \in [d]\}$. The Barker operator is numerically stable because its $\phi\big(q(y)/q(x)\big) = \frac{q(y)}{q(x) + q(y)}$ is bounded.
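The same mean-zero check applies to the Barker operator (6) with the sparse neighborhood $\mathcal{N}_x = \{\mathrm{inc}_i(x), \mathrm{dec}_i(x) : i \in [d]\}$ recommended above. The sketch below (our illustration; the factorized categorical $q$ and all sizes are arbitrary choices) verifies $\mathbb{E}_q[(Ah)(x)] = 0$, which follows from detailed balance.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 5                                 # d categorical variables, m values each
logits = rng.standard_normal((d, m))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def q(x):
    return np.prod(probs[np.arange(d), x])

def inc(x, i):
    y = x.copy()
    y[i] = (y[i] + 1) % m
    return y

def dec(x, i):
    y = x.copy()
    y[i] = (y[i] - 1) % m
    return y

def barker_stein(h, x):
    # (Ah)(x) = sum_{y in N_x} q(y)/(q(x)+q(y)) * (h(y) - h(x)),
    # with the sparse neighborhood N_x = {inc_i(x), dec_i(x) : i in [d]}.
    qx, out = q(x), 0.0
    for i in range(d):
        for y in (inc(x, i), dec(x, i)):
            qy = q(y)
            out += qy / (qx + qy) * (h(y) - h(x))
    return out

h = lambda x: float(np.cos(x @ np.ones(d)))  # arbitrary test function

mean = sum(q(np.array(x)) * barker_stein(h, np.array(x))
           for x in itertools.product(range(m), repeat=d))
print(f"E_q[(Ah)(x)] = {mean:.2e}")          # ~0: detailed balance gives A^T q = 0
```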
## 4 Gradient Estimation with Discrete Stein Operators

Recall that REINFORCE estimates the gradient $\mathbb{E}_{q_\theta}[f(x)\nabla_\theta \log q_\theta(x)]$. Because $\nabla_\theta \log q_\theta(x)$ enters as a vector-valued weighting function, we apply a discrete Stein operator to a vector-valued function $\bar{h} : \mathcal{X}^d \to \mathbb{R}^d$, one dimension at a time, to construct the following estimand with a mean-zero CV:

$$\mathbb{E}_{q_\theta}\big[f(x)\nabla_{\theta_i} \log q_\theta(x) + (A\bar{h}_i)(x)\big]. \tag{9}$$

Ideally, we want to choose $\bar{h}$ such that $A\bar{h}_i$ is strongly correlated with $f(x)\nabla_{\theta_i}\log q_\theta(x)$ to reduce its variance. The optimal $\bar{h}_i$ is given by the solution of the Poisson equation

$$\mathbb{E}_{q_\theta}[f\,\nabla_{\theta_i}\log q_\theta] - f\,\nabla_{\theta_i}\log q_\theta = A\bar{h}_i. \tag{10}$$

We could learn a separate $\bar{h}_i$ per dimension to approximate the solution of (10). However, this poses a difficult optimization problem that requires simultaneously solving $d$ Poisson equations. Instead, we will incorporate knowledge about the structure of the solution to simplify the optimization.

**Algorithm 1** Optimizing $\mathbb{E}_{q_\theta}[f_\theta(x)]$ with RODEO gradients

> **input:** objective $f_\theta$, sample points $x^{(1:K)} \overset{\text{i.i.d.}}{\sim} q_\theta$, Stein operator $A$, step sizes $\alpha_t, \beta_t$
> **for** $t = 1 : T$ **do**
> 1. $\{f_\theta(x^{(k)}),\ \nabla_\theta f_\theta(x^{(k)}),\ \nabla_x f_\theta(x^{(k)})\}_{k=1}^{K} \leftarrow \mathrm{autodiff}(f_\theta, x^{(1:K)})$.
> 2. Compute the surrogates $h_k(x^{(j)}), h^\star_k(x^{(j)})$ of (13), (14) for $j \neq k$ and $j, k = 1, \ldots, K$.
> 3. Compute the RODEO gradient estimator $g_\gamma(x^{(1:K)})$.
> 4. $\theta \leftarrow \theta + \alpha_t \frac{1}{K}\sum_{k=1}^{K}\nabla_\theta f_\theta(x^{(k)})$.
> 5. $\theta \leftarrow \theta + \alpha_t\, g_\gamma(x^{(1:K)})$.
> 6. Update hyperparameters: $\gamma \leftarrow \gamma - \beta_t \nabla_\gamma \|g_\gamma(x^{(1:K)})\|_2^2$.

To determine a candidate functional form for $\bar{h}$, we draw inspiration from an optimal Markov chain in which the current state is ignored entirely and the new state is generated independently from $q_\theta$. In this case, $(P - I)\bar{h}_i$ becomes $\mathbb{E}_{q_\theta}[\bar{h}_i] - \bar{h}_i$, and the optimal solution is $\bar{h}_i = f\,\nabla_{\theta_i}\log q_\theta$. Inspired by this, we consider $\bar{h}$ of the form

$$\bar{h}(x) = h(x)\,\nabla_\theta \log q_\theta(x), \tag{11}$$

where we now only need to learn a scalar-valued function $h$. Notably, when $h$ exactly equals $f$ and $A = P - I$ for any discrete-time Markov chain kernel $P$, our CV adjustment amounts to Rao-Blackwellization [42, Sec. 8.7], as we end up replacing $f(x)\nabla_{\theta_i}\log q_\theta(x)$ with its conditional expectation $P(f\,\nabla_{\theta_i}\log q_\theta)(x) = \mathbb{E}_{X_{t+1}\mid X_t = x}[f(X_{t+1})\nabla_{\theta_i}\log q_\theta(X_{t+1})]$. This yields a guaranteed variance reduction.

**Surrogate function design.** Based on the above reasoning, we can view $h$ as a surrogate for $f$. We avoid directly setting $h = f$ because our Stein operators evaluate $h$ at all neighbors of the sample points, and $f$ can be very expensive to evaluate [see, e.g., 51]. To avoid this problem, we first observe that $\bar{h}$ (11) can be made sample-specific, i.e., we can use a different $\bar{h}_k$ for each sample point $x^{(k)}$:

$$\frac{1}{K}\sum_{k=1}^{K}\big[f(x^{(k)})\nabla_\theta \log q_\theta(x^{(k)}) + (A\bar{h}_k)(x^{(k)})\big]. \tag{12}$$

We then consider the following choices of $h_k$ that are informed about $f$ while being cheap to evaluate:

$$h_k(y) = \frac{1}{K-1}\sum_{j \neq k} H\big(f(x^{(j)}),\ \nabla f(x^{(j)})^\top(y - x^{(j)})\big). \tag{13}$$

Here $H$ belongs to a flexible parametric family of functions, such as neural networks, and is chosen to have significantly lower cost than $f$; $y$ takes on the values of $x^{(k)}$ and its neighbors in the Markov chain. We omit $x^{(k)}$ from the sum (13) to ensure that the function $h_k$ is independent of $x^{(k)}$ and hence that $\mathbb{E}_{q_\theta}[\frac{1}{K}\sum_{k=1}^{K}(A\bar{h}_k)(x^{(k)})] = 0$. Moreover, this surrogate function design introduces no additional evaluations of $f$ beyond those required for the usual RLOO estimator. Also, as observed by Titsias and Shi [60], for many applications, including VAE training, where $f$ has parameters learned through gradient-based optimization, $\{\nabla f(x^{(k)})\}_{k=1}^{K}$ can be obtained for free from the same backpropagation used to compute $\nabla_\theta f(x^{(k)})$ (see Algorithm 1).

**RODEO.** We can further improve the estimator by leveraging discrete Stein operators and the above surrogate function design to construct both the global and local CVs in the Double CV framework [60] (see Section 2). We call our final estimator RODEO, for RLOO with Discrete StEin Operators:

$$\frac{1}{K}\sum_{k=1}^{K}\Big[\Big(f(x^{(k)}) - \frac{1}{K-1}\sum_{j \neq k}\big(f(x^{(j)}) + (Ah_j)(x^{(j)})\big)\Big)\nabla_\theta \log q_\theta(x^{(k)}) + (A\bar{h}^\star_k)(x^{(k)})\Big], \tag{RODEO}$$

where $\bar{h}^\star_k(y) = h^\star_k(y)\nabla_\theta \log q_\theta(y)$ and $\{h_k, h^\star_k\}_{k=1}^{K}$ are scalar-valued functions. Here, $(Ah_j)(x^{(j)})$ is a scalar-valued CV introduced to reduce the variance of the leave-one-out baseline $\frac{1}{K-1}\sum_{j \neq k} f(x^{(j)})$, while $(A\bar{h}^\star_k)(x^{(k)})$ acts as a global CV to further reduce the variance of the gradient estimate. We adopt the aforementioned designs of $h$ (13) and $\bar{h}$ (11) for the two CVs. In Appendix B, we show that the RODEO estimator is unbiased for $\nabla_\theta \mathbb{E}_{q_\theta}[f(x)]$ when each $h_k$ is defined as in (13) and

$$h^\star_k(y) = \frac{1}{K-1}\sum_{j \neq k} H^\star\big(f(x^{(j)}),\ \nabla f(x^{(j)})^\top(y - x^{(j)})\big). \tag{14}$$

**Optimization with RODEO.** In practice, we let the functions $H$ and $H^\star$ share a neural network architecture with two output units. Since the estimator is unbiased, we can optimize the parameters $\gamma$ of this network in an online fashion to minimize the variance of the estimator (similar to [20]). Specifically, if we denote the RODEO gradient estimate by $g_\gamma(x^{(1:K)})$, then $\nabla_\gamma \mathrm{Tr}(\mathrm{Var}(g_\gamma(x^{(1:K)}))) = \mathbb{E}[\nabla_\gamma \|g_\gamma(x^{(1:K)})\|_2^2]$, since the mean of $g_\gamma$ does not depend on $\gamma$. In Algorithm 1, we use an unbiased estimate of this hyperparameter gradient, $\nabla_\gamma \|g_\gamma(x^{(1:K)})\|_2^2$, to update $\gamma$ after each optimization step of $\theta$.
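To summarize how the pieces fit together, the sketch below (ours, not the paper's implementation) assembles a RODEO-style estimate for a factorized Bernoulli $q_\theta$ with the Gibbs operator (4). For simplicity it replaces the learned networks $H$ and $H^\star$ with a fixed first-order Taylor surrogate $H(f_j, \delta) = f_j + \delta$, so there are no $\gamma$ parameters to adapt; in the full method, $\gamma$ would additionally be updated with $\nabla_\gamma \|g_\gamma\|_2^2$ as in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 8, 2
theta = rng.standard_normal(d)               # logits of a factorized Bernoulli q_theta
p = 1 / (1 + np.exp(-theta))

def f(x):  return float(np.sum((x - 0.3) ** 2))  # toy differentiable target
def gf(x): return 2 * (x - 0.3)                  # grad_x f; free via backprop in practice
def score(x): return x - p                       # grad_theta log q_theta(x)

# Fixed first-order surrogate standing in for the learned H, H*:
# h_k(y) = (1/(K-1)) sum_{j != k} [ f(x^(j)) + grad f(x^(j))^T (y - x^(j)) ].
def h_k(y, k, xs, fs):
    return np.mean([fs[j] + gf(xs[j]) @ (y - xs[j]) for j in range(K) if j != k])

def gibbs_A(hfun, x):
    # Gibbs Stein operator (4) for binary x; hfun may be scalar- or vector-valued.
    out = -np.asarray(hfun(x), dtype=float)
    for i in range(d):
        for v, c in ((1.0, p[i]), (0.0, 1 - p[i])):
            y = x.copy()
            y[i] = v
            out = out + (c / d) * np.asarray(hfun(y), dtype=float)
    return out

def rodeo(xs):
    fs = [f(x) for x in xs]
    g = np.zeros(d)
    for k in range(K):
        # Leave-one-out baseline with local CVs (A h_j)(x^(j)).
        loo = np.mean([fs[j] + gibbs_A(lambda y, j=j: h_k(y, j, xs, fs), xs[j])
                       for j in range(K) if j != k])
        # Global CV (A hbar*_k)(x^(k)) with hbar*_k(y) = h*_k(y) * score(y).
        global_cv = gibbs_A(lambda y: h_k(y, k, xs, fs) * score(y), xs[k])
        g += ((fs[k] - loo) * score(xs[k]) + global_cv) / K
    return g

xs = (rng.random((K, d)) < p).astype(float)
print(rodeo(xs))                                 # one unbiased gradient estimate
```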
Table 2: Training binary latent VAEs with $K = 2, 3$ (except for RELAX, which uses 3 evaluations) on MNIST, Fashion-MNIST, and Omniglot. We report the average ELBO (± 1 standard error) on the training set after 1M steps over 5 independent runs. Test data bounds are reported in Table 4.

| Estimator | Bernoulli: MNIST | Bernoulli: Fashion-MNIST | Bernoulli: Omniglot | Gaussian: MNIST | Gaussian: Fashion-MNIST | Gaussian: Omniglot |
| --- | --- | --- | --- | --- | --- | --- |
| **K = 2** | | | | | | |
| DisARM | -102.75 ± 0.08 | -237.68 ± 0.13 | -116.50 ± 0.04 | 668.03 ± 0.61 | 182.65 ± 0.47 | 446.61 ± 1.16 |
| Double CV | -102.14 ± 0.06 | -237.55 ± 0.16 | -116.39 ± 0.10 | 676.87 ± 1.18 | 186.35 ± 0.64 | 447.65 ± 0.87 |
| RODEO (ours) | -101.89 ± 0.17 | -237.44 ± 0.09 | -115.93 ± 0.06 | 681.95 ± 0.37 | 191.81 ± 0.67 | 454.74 ± 1.11 |
| **K = 3** | | | | | | |
| ARMS | -100.84 ± 0.14 | -237.05 ± 0.12 | -115.21 ± 0.07 | 683.55 ± 1.01 | 193.07 ± 0.34 | 457.98 ± 1.03 |
| Double CV | -100.94 ± 0.09 | -237.40 ± 0.11 | -115.06 ± 0.12 | 686.48 ± 0.68 | 193.93 ± 0.20 | 457.44 ± 0.79 |
| RODEO (ours) | -100.46 ± 0.13 | -236.88 ± 0.12 | -115.01 ± 0.05 | 692.37 ± 0.39 | 196.56 ± 0.42 | 461.87 ± 0.90 |
| RELAX (3 evals) | -101.99 ± 0.04 | -237.74 ± 0.12 | -115.70 ± 0.08 | 688.58 ± 0.52 | 196.38 ± 0.66 | 462.23 ± 0.63 |

## 5 Related Work

As we have seen in Section 2, there is a long history of designing CVs for REINFORCE estimators using baselines [7, 37, 45, 64]. Recent progress is mostly driven by leave-one-out baselines [30, 38, 48, 49] and sample-dependent baselines [20, 22, 43, 60, 62]. REBAR [62] constructs the baseline through a continuous relaxation of the discrete distribution [26, 35] and applies reparameterization gradients to the correction term. As a result, REBAR uses three evaluations of $f$ for each $x^{(k)}$ instead of the usual single evaluation (see Appendix A for a detailed explanation). The RELAX [20] estimator generalizes REBAR by noticing that the continuous relaxation can be replaced with a free-form CV. However, in order to achieve strong performance, RELAX still includes the continuous relaxation in its CV and only adds a small deviation to it. Therefore, RELAX also uses three evaluations of $f$ for each $x^{(k)}$ and is usually considered more expensive than other estimators.

An attractive property of the RODEO estimator is that it incorporates information from neighboring states thanks to the Stein operator while avoiding additional costly $f$ evaluations thanks to the learned surrogate functions $h_k$. Estimators based on analytic local expectations [59, 61] and GO gradients [11] also use neighborhood information but only at the cost of many additional target function evaluations. In fact, the local expectation gradient [59] can be viewed as a Stein CV adjustment (RODEO with a Gibbs Stein operator and the target function $f$ used directly instead of the surrogate $h$). The downside of these approaches is that $f$ must be evaluated $Kd$ times per training step instead of $K$ times as in RODEO, a prohibitive cost when $f$ is expensive and $d \geq 200$ as in Section 6.

Prior work has also studied variance reduction methods based on sampling without replacement [31] and antithetic sampling [13-15, 68] for gradient estimation. DisARM [14, 69] was shown to outperform RLOO estimators when $K = 2$, and ARMS [13] generalizes DisARM to $K > 2$.
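To illustrate the antithetic coupling underlying ARM and DisARM, the following sketch (ours; it shows only the sampling step, not the full estimators) draws two negatively correlated Bernoulli samples from a single shared uniform variate.

```python
import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.1, 0.5, 0.9])      # Bernoulli success probabilities

# One shared uniform draw yields two negatively correlated samples,
# each still marginally Bernoulli(p): P(1 - u < p) = P(u > 1 - p) = p.
u = rng.random(p.shape)
x1 = (u < p).astype(float)
x2 = (1 - u < p).astype(float)     # antithetic partner of x1
print(x1, x2)
```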
Stein operators for continuous distributions have also been leveraged for effective variance reduction in a variety of learning tasks [2, 5, 36, 41, 52, 54], including gradient estimation [27, 34]. In particular, the gradient estimator of Liu et al. [34] is based on the Langevin Stein operator [19] for continuous distributions and coincides with the continuous counterpart of RELAX [20]. In contrast, our approach considers discrete Stein operators for Monte Carlo estimation in discrete distributions with exponentially large state spaces. Recently, Parmas and Sugiyama [44, App. E.4] used a probability flow perspective to characterize all unbiased gradient estimators satisfying a mild technical condition; our estimators fall into this broad class but were not specifically investigated.

## 6 Experiments

Python code replicating all experiments can be found at https://github.com/thjashin/rodeo.

### 6.1 Training Bernoulli VAEs

Following Dong et al. [14] and Titsias and Shi [60], we conduct experiments on training variational autoencoders [29, 47] (VAEs) with Bernoulli latent variables. VAEs are models with a joint density $p(y, x)$, where $x$ is the latent variable. They are typically learned by maximizing the evidence lower bound (ELBO) $\mathbb{E}_{q_\theta(x|y)}[f(x)]$ for an auxiliary inference network $q_\theta(x|y)$ and $f(x) \triangleq \log p(y, x) - \log q_\theta(x|y)$. In our experiments, $p(x)$ and $q_\theta(x|y)$ are high-dimensional distributions in which each dimension of the random variable is an independent Bernoulli. Since exact gradient computations are intractable, we use gradient estimators to learn the parameters of the inference network.

The VAE architecture and training setup follow Titsias and Shi [60]; details are given in Appendix D. The dimensionality of the latent variable $x$ is $d = 200$. The functions $H$ (13) and $H^\star$ (14) share a neural network architecture with two output units and a single hidden layer of 100 units. For the numerical stability and variance reasons discussed in Section 3, we use the Gibbs Stein operator (4) as the default choice in our experiments, but we revisit this choice in Section 6.3.

We consider the MNIST [33], Fashion-MNIST [66], and Omniglot [32] datasets with their standard train, validation, and test splits. We use both binary and continuous VAE outputs ($y$) as in Titsias and Shi [60]. In the binary output setting, data are dynamically binarized and the Bernoulli likelihood is used; in the continuous output setting, data are scaled to $[-1, 1]$ and a Gaussian likelihood with learnable diagonal covariance parameters is used.
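For concreteness, the sketch below (ours) spells out the ELBO target $f(x) = \log p(y, x) - \log q_\theta(x \mid y)$ for a single-layer Bernoulli-latent VAE; the linear encoder and decoder, the uniform Bernoulli(0.5) prior, and all shapes are hypothetical stand-ins for the architecture of Appendix D, and the plain `log(sigmoid(...))` would be replaced by a numerically stable log-sigmoid in practice.

```python
import numpy as np

rng = np.random.default_rng(5)
d, D = 200, 784                            # latent and observed dimensions

# Hypothetical stand-ins for the encoder/decoder networks of Appendix D.
W_enc = rng.standard_normal((D, d)) * 0.01
W_dec = rng.standard_normal((d, D)) * 0.01

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def elbo_target(x, y):
    # f(x) = log p(y, x) - log q(x | y) for one binarized image y.
    q_probs = sigmoid(y @ W_enc)           # encoder: q(x|y), factorized Bernoulli
    p_probs = sigmoid(x @ W_dec)           # decoder: p(y|x), factorized Bernoulli
    log_q = np.sum(x * np.log(q_probs) + (1 - x) * np.log(1 - q_probs))
    log_prior = d * np.log(0.5)            # illustrative uniform Bernoulli prior on x
    log_lik = np.sum(y * np.log(p_probs) + (1 - y) * np.log(1 - p_probs))
    return log_lik + log_prior - log_q

y = (rng.random(D) < 0.5).astype(float)    # a fake binarized image
x = (rng.random(d) < sigmoid(y @ W_enc)).astype(float)
print(elbo_target(x, y))
```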
*Figure 1: Training binary latent VAEs with 2 or 3 $f$ evaluations per step on binarized MNIST (gradient variance and training ELBO vs. training steps).*

**K = 2.** In the first set of experiments, we focus on the most common setting of $K = 2$ sample points and compare our variance reduction method with its counterparts, including DisARM [14] and Double CV [60]. Results for RLOO are omitted since it is consistently outperformed by Double CV [see 60]. Table 2 shows that, on all three datasets, RODEO achieves the best training ELBOs. In Figure 1 (left), we plot the gradient variance and average training ELBO against training steps for all estimators on dynamically binarized MNIST. RODEO outperforms DisARM and Double CV by a large margin in gradient variance.

Next, we consider VAEs with Gaussian likelihoods trained on non-binarized datasets. In this case, the gradient estimates suffer from even higher variance due to the large range of values $f(x)$ can take. The results are plotted in Figure 2. We see that RODEO has substantially lower variance than DisARM and Double CV, leading to significant improvements in training ELBOs. In Appendix Figure 7, we find that RODEO also consistently yields the best test set performance in all six settings.

*Figure 2: Training binary latent VAEs with Gaussian likelihoods, $K = 2$, and non-binarized datasets.*

**K = 3.** In the second set of experiments, we compare RELAX [20], which uses three evaluations of $f$ per training step, with RODEO, Double CV, and ARMS [13] for $K = 3$. Figure 1 (right) and Table 2 demonstrate that RODEO outperforms the three previous methods and generally attains the lowest variance. Although RELAX was often observed to have very strong performance in prior work [14, 60], our results in Figure 1 suggest that, for dynamically binarized datasets, much larger gains can be achieved by using the same number of function evaluations in other estimators.

For the experiments above, we report final training ELBOs in Table 2, test log-likelihood bounds in Appendix Table 4, and average running time on binarized MNIST in Appendix Table 5. For $K = 3$, RODEO has the best final performance in 5 out of 6 tasks and runtime nearly identical to RELAX. For $K = 2$, RODEO has the best final performance on all tasks and is about 1.5 times slower than Double CV and DisARM. We attribute the runtime gap to the Gibbs operator (4), which performs $2d$ evaluations of the auxiliary functions $H$ and $H^\star$ in (13) and (14). While this incurs some extra cost, the network parameterizing $H$ and $H^\star$ takes only 2-dimensional inputs, produces scalar outputs, and is typically small relative to the VAE model. As a result, the evaluation of $H$ and $H^\star$ is significantly cheaper than that of $f$, and its relative contribution to the runtime shrinks as the cost of $f$ grows. To demonstrate this, we include a wall-clock time comparison of our method with RLOO in Appendix C.1. In that experiment we replace the two-layer MLP-based VAE with a ResNet VAE, where the cost of $f$ is significantly higher than that of the single-hidden-layer MLP for $H$ and $H^\star$. In this case, RODEO and RLOO have very close per-iteration times (0.025s vs. 0.023s), and RODEO achieves better training ELBOs than RLOO for the same amount of time.

### 6.2 Training hierarchical Bernoulli VAEs

To investigate the performance of RODEO when scaling to hierarchical discrete latent variable models, we follow DisARM [14, 68] and train VAEs with 2, 3, or 4 stochastic layers, each consisting of 200 Bernoulli variables. We set $K = 2$ and compare our estimator with DisARM and Double CV on dynamically binarized MNIST, Fashion-MNIST, and Omniglot. For each stochastic layer, we use a separate CV network with the same architecture as in our VAE experiments from the previous section. More details are presented in Appendix D.

*Figure 3: Training hierarchical binary latent VAEs with four stochastic layers on Fashion-MNIST. In this experiment, the estimators have very different behaviors towards the beginning and the end of training. We show this on the right by zooming into the first 50K steps of the gradient variance plot.*

We plot the results for VAEs with four stochastic layers on Fashion-MNIST in Figure 3. The results for other datasets and for 2 and 3 stochastic layers can be found in Appendix E. RODEO generally achieves the best training ELBOs.
One difference we noticed when comparing the variance results with those of single-stochastic-layer VAEs is that the estimators behave very differently towards the beginning and the end of training. For example, in Figure 3, Double CV starts with lower variance than DisARM, but the gap diminishes after around 100K steps, and DisARM starts to perform better as training proceeds. In contrast, RODEO has the lowest variance in both phases for all datasets save Omniglot, where DisARM overtakes it in the long run.

### 6.3 Ablation study

Finally, we conduct an ablation study to gain more insight into the impact of each component of RODEO. In each experiment we train binary latent VAEs with $K = 2$.

*Figure 4: Ablation study of the impact of RODEO components on binary VAE training performance: (a) Stein operators, (b) LOO baseline and global and local CVs, (c) surrogate functions.*

**Impact of Stein operator.** Figure 4a explores the impact of the RODEO Stein operator choice on non-binarized MNIST. As expected, the less stable difference (8) and MPF (6) operators lead to significantly higher gradient variance and worse training ELBOs. In fact, the same operators led to divergent training on binarized MNIST. The Barker operator (6), with neighborhoods defined by a single differing coordinate, yields results very similar to Gibbs (4), as the operators themselves are very similar for binary variables (notice that $\frac{q(y)}{q(y) + q(x)} = q(y_i \mid x_{-i})$ when $x_{-i} = y_{-i}$).

**Impact of LOO baseline and global and local CVs.** Figure 4b explores binarized MNIST performance when RODEO is modified to retain (a) only the LOO baseline (equivalent to standard RLOO), (b) only the global CV, (c) the LOO baseline plus the global CV, or (d) the LOO baseline plus the local CV. We observe that the global and local Stein CVs both contribute to variance reduction and have complementary effects. Moreover, remarkably, the global Stein CV alone outperforms RLOO.

**Impact of surrogate functions.** To tease apart the benefits of our new surrogate functions (13) from those of the remaining RODEO components, we replace the surrogate function $c_\phi(z)$ in RELAX [20] with our surrogates, which require only a single evaluation of $f$ per $x^{(k)}$ (see Appendix A for more details). Figure 4c compares the performance of RODEO and this modified RELAX on binarized MNIST. Since the surrogate functions are matched, the consistent improvements over modified RELAX can be attributed to the Stein and double CV components of RODEO. In Appendix C.2, we also experiment with increasing the complexity of $H$ and $H^\star$ by using two hidden layers instead of one in the MLP. We did not observe significant improvements in variance reduction.

## 7 Conclusions, Limitations, and Future Work

This work tackles the gradient estimation problem for discrete distributions. We proposed a variance reduction technique that exploits Stein operators to generate control variates for REINFORCE leave-one-out estimators. Our RODEO estimator does not rely on a continuous reparameterization of the distribution, requires no additional function evaluations per sample point, and can be adapted online to learn very flexible control variates parameterized by neural networks. One potential drawback of our surrogate function constructions (13) and (14) is the need to evaluate an auxiliary function ($H$ or $H^\star$) at $K - 1$ locations.
This cost can be comfortably borne when $H$ and $H^\star$ are much cheaper to evaluate than $f$, as in the ResNet VAE example in Appendix C.1 and in many large VAE models used in practice [63]. And, in our experiments with the most common sample sizes, the runtime of RODEO was no worse than that of RELAX. To obtain a more favorable cost for large $K$, one could employ alternative surrogates that require only a constant number of auxiliary function evaluations, e.g.,

$$h_k(y) = H\Big(\tfrac{1}{K-1}\sum_{j \neq k} f(x^{(j)}),\ \tfrac{1}{K-1}\sum_{j \neq k} \nabla f(x^{(j)})^\top (y - x^{(j)})\Big).$$

The runtime of RODEO could also be improved by varying the Stein operator employed. For example, the Gibbs operator $(Ah)(x)$ (4) used in our experiments evaluates its surrogate function $h$ at $d$ neighboring locations of $x$. This evaluation complexity could be reduced by subsampling neighbors, resulting in a cheaper but still valid Stein operator, or by employing the numerically stable Barker operator (6) with fewer neighbors. Either strategy would introduce a speed-variance trade-off worthy of study in follow-up work. Finally, we have restricted our focus in this work to differentiable target functions $f$. In future work, this limitation could be overcome by designing effective surrogate functions that make no use of derivative information.

### Acknowledgments and Disclosure of Funding

We thank Heishiro Kanagawa for suggesting appropriate names for the Barker Stein operator.

### References

[1] Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E. Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, and Lester Mackey. Stein's method meets statistics: A review of some recent developments. arXiv preprint arXiv:2105.03481, 2021.

[2] Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83:4682–4685, 1999.

[3] A. D. Barbour. Stein's method and Poisson process convergence. Journal of Applied Probability, (Special Vol. 25A):175–184, 1988. A celebration of applied probability.

[4] A. A. Barker. Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.

[5] Alessandro Barp, Chris Oates, Emilio Porcu, and Mark Girolami. A Riemann-Stein kernel method. arXiv preprint arXiv:1810.04946, 2018.

[6] Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32:12964–12976, 2019.

[7] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[8] Guy Bresler and Dheeraj Nagaraj. Stein's method for stationary distributions of Markov chains and application to Ising models. The Annals of Applied Probability, 29(5):3230–3265, 2019.

[9] Timothy C. Brown and Aihua Xia. Stein's method and birth-death processes. The Annals of Probability, 29(3):1373–1403, 2001.

[10] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2015.

[11] Yulai Cong, Miaoyun Zhao, Ke Bai, and Lawrence Carin. GO gradient for expectation-based objectives. In International Conference on Learning Representations, 2019.

[12] Petros Dellaportas and Ioannis Kontoyiannis. Control variates for estimation based on reversible Markov chain Monte Carlo samplers.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):133–161, 2012.

[13] Aleksandar Dimitriev and Mingyuan Zhou. ARMS: Antithetic-REINFORCE-multi-sample gradient for binary variables. In International Conference on Machine Learning, volume 139, pages 2717–2727, 2021.

[14] Zhe Dong, Andriy Mnih, and George Tucker. DisARM: An antithetic gradient estimator for binary latent variables. In Advances in Neural Information Processing Systems, volume 33, pages 18637–18647, 2020.

[15] Zhe Dong, Andriy Mnih, and George Tucker. Coupled gradient estimators for discrete latent variables. arXiv preprint arXiv:2106.08056, 2021.

[16] Peter Eichelsbacher and Gesine Reinert. Stein's method for discrete Gibbs measures. The Annals of Applied Probability, 18(4):1588–1618, 2008.

[17] Charles J. Geyer. Introduction to Markov chain Monte Carlo. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, pages 3–48. CRC Press, 2011.

[18] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, October 1990.

[19] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pages 226–234, 2015.

[20] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.

[21] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops I took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841, 2021.

[22] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations, 2016.

[23] Shane G. Henderson. Variance reduction via an approximating Markov process. PhD thesis, Stanford University, 1997.

[24] Liam Hodgkinson, Robert Salomone, and Fred Roosta. The reproducing Stein kernel approach for post-hoc corrected sampling. arXiv preprint arXiv:2001.09266, 2020.

[25] Susan Holmes. Stein's method for birth and death chains. In Persi Diaconis and Susan Holmes, editors, Stein's Method: Expository Lectures and Applications, pages 42–65. 2004.

[26] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.

[27] Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pages 2235–2244. PMLR, 2018.

[28] Samuel Karlin and James McGregor. The classification of birth and death processes. Transactions of the American Mathematical Society, 86(2):366–400, 1957.

[29] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[30] W. Kool, H. van Hoof, and M. Welling. Buy 4 REINFORCE samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019.

[31] Wouter Kool, Herke van Hoof, and Max Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.

[32] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[33] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[34] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein's identity. arXiv preprint arXiv:1710.11198, 2017.

[35] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.

[36] Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.

[37] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.

[38] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.

[39] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.

[40] Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: Backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567–14579, 2021.

[41] Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.

[42] Art B. Owen. Monte Carlo Theory, Methods and Examples. 2013.

[43] John Paisley, David M. Blei, and Michael I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, pages 1363–1370, 2012.

[44] Paavo Parmas and Masashi Sugiyama. A unified view of likelihood ratio and reparameterization gradients. In International Conference on Artificial Intelligence and Statistics, pages 4078–4086. PMLR, 2021.

[45] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, pages 814–822, 2014.

[46] Gesine Reinert and Nathan Ross. Approximating stationary distributions of fast mixing Glauber dynamics, with applications to exponential random graphs. The Annals of Applied Probability, 29(5):3201–3229, 2019.

[47] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[48] Lorenz Richter, Ayman Boustati, Nikolas Nüsken, Francisco Ruiz, and Omer Deniz Akyildiz. VarGrad: A low-variance gradient estimator for variational inference. In Advances in Neural Information Processing Systems, volume 33, pages 13481–13492, 2020.

[49] Tim Salimans and David A. Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022, 2014.

[50] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems, 28, 2015.

[51] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
arXiv preprint arXiv:1701.06538, 2017.

[52] Shijing Si, Chris Oates, Andrew B. Duncan, Lawrence Carin, and François-Xavier Briol. Scalable control variates for Monte Carlo methods via stochastic optimization. arXiv preprint arXiv:2006.07487, 2020.

[53] Jascha Sohl-Dickstein, Peter Battaglino, and Michael R. DeWeese. Minimum probability flow learning. In International Conference on Machine Learning, pages 905–912, 2011.

[54] Leah F. South, Chris J. Oates, Antonietta Mira, and Christopher Drovandi. Regularised zero-variance control variates for high-dimensional variance reduction. arXiv preprint arXiv:1811.05073, 2018.

[55] Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.

[56] Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations. Lecture Notes-Monograph Series, pages 1–26, 2004.

[57] David Stirzaker. Stochastic Processes and Models. Oxford University Press, 2005.

[58] Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.

[59] Michalis K. Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. Advances in Neural Information Processing Systems, 28:2638–2646, 2015.

[60] Michalis K. Titsias and Jiaxin Shi. Double control variates for gradient estimation in discrete latent variable models. International Conference on Artificial Intelligence and Statistics, 2022.

[61] Seiya Tokui and Issei Sato. Evaluating the variance of likelihood-ratio gradient estimators. In International Conference on Machine Learning, pages 3414–3423, 2017.

[62] G. Tucker, A. Mnih, C. J. Maddison, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, 2017.

[63] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.

[64] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.

[65] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[66] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[67] Jiasen Yang, Qiang Liu, Vinayak Rao, and Jennifer Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, pages 5561–5570. PMLR, 2018.

[68] Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.

[69] Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization, 2020.

[70] Giacomo Zanella. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.