# adaptive_antithetic_sampling_for_variance_reduction__ac405d52.pdf Adaptive Antithetic Sampling for Variance Reduction Hongyu Ren * 1 Shengjia Zhao * 1 Stefano Ermon 1 Variance reduction is crucial in stochastic estimation and optimization problems. Antithetic sampling reduces the variance of a Monte Carlo estimator by drawing correlated, rather than independent, samples. However, designing an effective correlation structure is challenging and application specific, thus limiting the practical applicability of these methods. In this paper, we propose a general-purpose adaptive antithetic sampling framework. We provide gradient-based and gradient-free methods to train the samplers such that they reduce variance while ensuring that the underlying Monte Carlo estimator is provably unbiased. We demonstrate the effectiveness of our approach on Bayesian inference and generative model training, where it reduces variance and improves task performance with little computational overhead. 1. Introduction The problem of computing expectations that are too complex to be evaluated analytically is ubiquitous in machine learning. To address this difficulty, Monte Carlo estimation underlies the majority of modern machine learning algorithms. Instead of computing the exact expectation, Monte Carlo estimators draw samples from the underlying distribution and use them to compute an empirical mean. When the number of samples is sufficiently large, this empirical mean will approach the expectation according to the law of large numbers. Despite this guarantee, Monte Carlo estimators can still deviate substantially from the true expectation when the sample size is small. The crux of Monte Carlo estimation is variance. According to Chebyshev inequality, the estimation error is bounded *Equal contribution 1Stanford University. Correspondence to: Hongyu Ren , Shengjia Zhao , Stefano Ermon . Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). by the variance of the estimator. Thus, with lower variance we could achieve better estimation error guarantees. Increasing the number of samples is the most direct way to decrease the variance but at the cost of additional computation. Hence, a variety of variance reduction techniques have been proposed in the literature. Classic approaches include baselines (Weaver & Tao, 2001), control variates (Greensmith et al., 2004), importance sampling (Neal, 2001), rejection sampling, Rao-Blackwellization (Grisetti et al., 2007) etc. These techniques modify the underlying distribution, but still rely on the basic idea of taking independent and identically distributed (i.i.d.) samples and computing their empirical mean. However, i.i.d. sampling is neither necessary nor optimal. Antithetic sampling methods (Hammersley & Morton, 1956) forgo the i.i.d. sampling construction. Given a fixed number of samples, antithetic sampling attempts to jointly select all of them at once so that they are more representative. An intuitive example is sampling without replacement: drawing k objects without replacement is guaranteed to provide a more complete picture of the underlying space (compared to i.i.d. sampling) because it avoids redundancy by construction. In fact, sampling without replacement provably reduces variance. The reduction, however, is modest. Although a more effective, general-purpose antithetic sampling strategy is desirable, it is actually impossible. We will prove a no-free-lunch result: no antithetic sampling strategy can work better than sampling without replacement for a natural class of Monte Carlo estimation problems. Therefore, we cannot use a single antithetic sampling strategy and achieve significant variance reduction for every estimation problem; antithetic sampling must tailor to the specific problem at hand. In this paper, we propose a general framework to learn an antithetic sampling distribution that automatically adapts to the underlying Monte Carlo estimation problem. The key idea is to construct a flexible, parametric family of joint probability distributions over a set of samples that is both learnable and guarantees the unbiasedness of the resulting estimator. We then propose learning methods that choose a joint distribution to minimize the variance of the estimator. We provide both a gradient-based learning scheme which Adaptive Antithetic Sampling for Variance Reduction takes full advantage of recent advances in automatic differentiation, and a gradient free version when differentiation is not possible. We demonstrate the effectiveness of our method on a diverse set of problems that benefit from low variance Monte Carlo estimation, including numerical computation of expectations, likelihood estimation for variational Bayesian inference, and training of generative adversarial networks by stochastic gradient descent. For all of these tasks, our method successfully reduces variance, which in turn leads to improved performance on standard metrics for each task. In addition, the computational overhead of our method is small. By amortizing the cost of training the antithetic sampler, we still achieve superior performance when baseline methods are given the same amount of wall-clock time. 2. Background It is common in machine learning to evaluate expectations µ = Ep(x)[f(x)] for some function f and distribution p(x) on some sample space x X. In addition, we often want to optimize the expectation µ(φ) = Ep(x)[f(x; φ)] as a function of some parameters φ. To do so, first-order optimization methods such as gradient descent need to estimate the related expectation µ(φ) = Ep(x)[ f(x; φ)]. Stochastic estimation and optimization problems of this form are ubiquitous in Bayesian inference, reinforcement learning, variational inference, risk minimization, etc (Bishop, 2006). It is almost always too expensive to evaluate µ = Ep(x)[f(x)] exactly (e.g., analytically). In this case, Monte Carlo estimation is widely used: Ep(x)[f(x)] 1 i=1 f(xi) def = ˆµf(x1:m) (1) where x1:m = x1, , xm i.i.d p(x) are m independent, identically distributed samples from p(x). We will denote this i.i.d. distribution as x1:m p(x1:m). For any m 1, ˆµ is an unbiased estimator of µ, meaning that Ep(x1:m)[ˆµf(x1:m)] = µ. Chebychev s inequality provides (probabilistic) guarantees on how far ˆµ can be from µ: Pr [|ˆµf(x1:m) µ| ϵ] Varp(x1:m)[ˆµf(x1:m)] Variance plays a paramount role in Monte Carlo estimation. The smaller the variance, the more concentrated ˆµ is around µ. For i.i.d. samples x1:m p(x1:m) Varp(x1:m)[ˆµf(x1:m)] = Varp(x)[f(x)] so we can increase m to reduce variance. However this is usually expensive, as the computational cost increases (linearly) in m. It is therefore useful to design methods that reduce variance without increasing m. 2.1. Variance Reduction Techniques Given the importance of Monte Carlo estimation and variance reduction, a significant number of techniques have been applied to diverse problems (L Ecuyer & Owen, 2009). Example include stratified sampling (Neyman, 1934), control variate / baselines (Greensmith et al., 2004), importance sampling (Neal, 2001), rejection sampling (Grover et al., 2018), Rao-Blackwellization (Grisetti et al., 2007), etc. In this paper, we build on antithetic sampling (Geweke, 1988; Wu et al., 2019), a classic variance reduction technique that has received less attention in the ML literature. 2.2. Antithetic Sampling In the classic Monte Carlo estimator ˆµf(x1:m) = 1/m Pm i=1 f(xi), the samples x1, , xm are sampled independently from p(x). However, i.i.d. sampling is not necessary nor optimal. In fact, whenever they are sampled from a joint distribution q(x1:m) that preserves the marginals, i.e. it satisfies q(xi) = p(xi), i = 1, , m, our Monte Carlo estimator ˆµf(x1:m) is unbiased because i=1 Eq(xi)[f(xi)] (3) = Ep(x)[f(x)] (4) The core idea of antithetic sampling is to forgo i.i.d. sampling and use a joint distribution q(x1:m) such that Varq(x1:m)[ˆµf(x1:m)] < Varp(x1:m)[ˆµf(x1:m)] For example, when p(x) is symmetric (i.e. p(x) = p( x)), such as a zero-mean Gaussian, a classical choice for q(x1, x2) is given by the following process: sample x1 p(x), then set x2 = x1 (hence the name antithetic). It is easy to see that q(x1) = q(x2) = p(x1). This strategy works perfectly if f is an odd function, i.e. f( x) = f(x), because we are guaranteed to have 1 2(f(x1) + f(x2)) = 0 = Ep(x)[f(x)] Only two samples are required to estimate the expectation exactly. On the other hand, if f is an even function f( x) = f(x), then this strategy backfires because f(x1) = f(x2), so x2 is redundant and antithetic sampling doubles the variance compared to i.i.d. sampling. As the previous example indicates, the variance of our estimator under an antithetic distribution q(x1:m) depends on Adaptive Antithetic Sampling for Variance Reduction the function f. The natural question is: is there an antithetic sampling distribution that performs well for any f? We will now prove that the answer is yes and no: there are antithetic sampling distributions that always reduces variance, however, they only reduce variance by a very small amount in the worst-case. 2.3. No-Free-Lunch of Antithetics We first suppose x X only takes a finite set of values. We define qsr(x1:m) as the following sampling without replacement distribution: draw x1 p(x), and draw x2 p(x)I(x = x1) x3 p(x)I(x = x1)I(x = x2) Since i, qsr(xi) = p(x), sampling without replacement is an antithetic sampling method. In addition, it is guaranteed to improve variance (Sukhatme, 1957) compared to i.i.d. sampling. Sampling without replacement always reduces variance, but only by a tiny amount. It is only an effective strategy if the probability of sampling repeated elements is large, which rarely happens in practice. In fact, no antithetic distribution can do much better, as shown by the following theorem. Theorem 1 (No Free Lunch). Let p(x) be a uniform distribution on X, where X is a finite set. Let q be any antithetic distribution q(x1, x2) where x1, x2 X. Let F be the set of functions X R such that Varp(x1:2)[ˆµf(x1:2)] = 0, then max f F Varq(x1:2)[ˆµf(x1:2)]] Varp(x1:2)[ˆµf(x1:2)] 1 1 |X| 1 (5) For sampling without replacement, for any f F Varqsr(x1:2)[ˆµf(x1:2)]] Varp(x1:2)[ˆµf(x1:2)] = 1 1 |X| 1 (6) Theorem 1 proves that given any antithetic sampling distribution, there exists an estimation problem (a distribution and a worst case function f), such that it only improves variance by a tiny amount compared to i.i.d. sampling. In fact, for the uniform distribution p(x), sampling without replacement is the mini-max optimal antithetic distribution (i.e., it is optimal for the worst-case function f). Therefore, the antithetic sampling method must be adapted to the function f. Manually designing good antithetic distributions requires detailed knowledge of f and p(x), which is almost always not available when f is complex (e.g., the gradient of a deep neural network). An adaptive method is therefore necessary. 3. Gaussian-reparameterized Antithetics To develop an adaptive antithetic sampler, our strategy is to define a family of antithetic samplers qθ(x1:m), and choose the parameters θ to adapt to the specific distribution p(x) and integrand f. We will now focus on the case where X is continuous. The first roadblock is that for antithetic sampling to be unbiased, we need the marginals of qθ(x1:m) to satisfy qθ(xi) = p(xi), i = 1, , m (7) We now show that for a large set of distributions p(x) it is possible to define a flexible family of parameterized distributions qθ that satisfy this property by construction. Suppose x is univariate with density p(x). Let F(x) denote its cumulative density function (CDF), sampling x p(x) can be achieved by inverting the CDF via u U[0, 1], x = F 1(u). Let Φ denote the CDF of the standard Gaussian distribution N(0, 1). By the same argument, it is easy to see that we can alternatively sample from x p(x) via the following process ϵ N(0, 1) , x = F 1(Φ(ϵ)) := g(ϵ) (8) where g = F 1 Φ. Therefore an expectation with respect to p(x) can be converted into an expectation with respect to a standard Normal ϵ N(ϵ; 0, 1) by the law of the unconscious statistician: Ep(x)[f(x)] = EN(ϵ;0,1)[f(g(ϵ))] More generally, when p(x) is a multivariate density, we will assume that x p(x) can be equivalently obtained as x = g(ϵ) for ϵ N(0, Id), where Id denotes an d d identity matrix and g is a suitable function. This means that is possible to sample from p(x) by transforming some simple random variables ϵ, e.g., obtained using random number generation primitives in a programming language. This is a very mild restriction on p(x). Although the theory is general, in the experiments we will consider the most practically relevant case where g( ) can also be evaluated efficiently. We can define a family of antithetic samplers that satisfy the marginalization property of Eq.(7) as follows. Definition 1. Let x be a continuous random vector with density p(x) that can be sampled by x = g(ϵ) for ϵ N(0, Id). A Gaussian antithetic of order m for p(x) is the family of distributions qθ(x1, , xm) defined implicitly by the following sampling procedure (ϵ1, , ϵm) N(0, Σθ) xi = g(ϵi), i = 1, , m Adaptive Antithetic Sampling for Variance Reduction where Σθ Rmd md is positive-definite matrix parameterized by θ that is constrained to have d d identity blocks on the diagonal: Σθ Σunbiased def = {Σ Rmd md, Σ 0, ΣII = Id I = (id + 1, , id + d), i = 1, , m} (9) When x p(x) is obtained by inverting CDFs as in equation (8), the family resembles a Gaussian copula (Durante & Sempi, 2010), a classic approach to build joint distributions with known, fixed marginals. Our estimator for µ = Ep(x)[f(x)] remains unchanged compared to Eq.(1), except we use antithetic (not i.i.d.) samples x1:m qθ(x1, , xm) ˆµf(x1:m) = 1 i=1 f(xi) (10) i=1 f g(ϵi) := ˆµf g(ϵ1:m) (11) A Gaussian-reparameterized antithetic distribution satisfies several desirable properties. Proposition 1. Let qθ(x1:m) be a Gaussianreparameterized antithetic of order m for p(x). Then for any k: 1. For any Σθ Σunbiased, the estimator (10) is unbiased Eqθ(x1:m)[ˆµf(x1:m)] = Ep(x)[f(x)] 2. If Σθ = Imd, the Gaussian-reparameterized antithetic is equivalent to i.i.d sampling. 3. Given a Cholesky decomposition Σθ = LθLT θ , we can sample from qθ(x1:m) by drawing m i.i.d. samples δ = (δ1, , δm)T from N(0, Id), and x1:m = Lθδ. Proof. See Appendix Property 1 guarantees that for any choice of θ the corresponding estimator is unbiased, i.e., it will give the right answer in expectation. Property 2 guarantees that under mild conditions, there is a choice of θ where the performance is no worse than i.i.d. sampling. Property 3 states that drawing samples from qθ(x1:m) is not expensive. Given a Σθ Rmd md, we only have to compute the Cholesky decomposition once, whose time complexity is O(m3d3). If we draw k batches of samples, the time complexity becomes O(m3d3 + km2d2). The cost can thus be amortized and is comparable to i.i.d sampling when k md. 3.1. Parameterizing Gaussian Antithetics To facilitate the design of optimization algorithms over Σθ Σunbiased, we provide an explicit parameterization of the feasible set Σunbiased in Definition 1. To simplify the notation, we will use a slightly more abstract definition than in the previous section. Specifically, we will consider the matrix Σ Rmd md as a m m matrix where each element belongs to the ring of d d matrices. Let M be the ring of d d matrices, and e be its identity element. We rewrite the definition of Σunbiased in Eq.(9) as: Σunbiased def = {Σ Mm m, Σ 0, Σii = e, i} (12) To find a parameterization for Σunbiased we let ϵ > 0 be any real number. Given any θ Mm m, we turn it into an element of Σunbiased with the function ψ defined by Σ = ϵI + θθT , θ Mm m ψ(θ) = diag( Σ) 1/2 Σdiag( Σ) T/2 (13) where diag( Σ) sets all entries of Σ to zero except the diagonal, and A 1/2 is obtained from the Cholesky decomposition A = A1/2AT/2 for a positive semidefinite matrix A. The following proposition shows this parameterization is both correct (i.e., every parameter θ Mm m maps into Σunbiased) and lossless (i.e., every Σ Σunbiased can be obtained from a θ Mm m). Theorem 2. For any ϵ > 0, the map ψ defined in Eq.(13) is a surjection from Mm m into Σunbiased. Therefore we can optimize θ over the space of m m matrices Mm m, and use ψ(θ) to produce the corresponding Σθ Σunbiased that we need for our antithetic distribution. 4. Learning an Antithetic Distribution In the previous section we defined a family of antithetic distributions N(0, Σ) where Σθ Σunbiased. Proposition (1) guarantees that it can perform as well as i.i.d., but the question is of course if we can do better. In this section, we discuss how to find the optimal θ Mm m (equivalently, find the best Σ Σunbiased) by optimization, leveraging the parameterization of Σunbiased via ψ from Theorem 2. Given that the antithetic estimator is unbiased, a natural optimization criterion is for our estimator ˆµf(x1:m) to have small variance: min θ Mm m Varqθ(x1:m) [ˆµf(x1:m)] = EN(ϵ1:m;0,ψ(θ)) (ˆµf g(ϵ1:m) µ)2 Adaptive Antithetic Sampling for Variance Reduction where the equality follows from the definition xi = g(ϵi). When f is vector-valued (e.g., a gradient) we can minimize the trace of the covariance min θ Mm m tr Covqθ(x1:m) [ˆµf(x1:m)] = EN(ϵ1:m;0,ψ(θ)) h ˆµf g(ϵ1:m) µ 2i (14) Eq.(14) involves the value of µ, which is generally not known. An estimate of µ is helpful but not required. This is because by Proposition (1), µ does not depend on θ, so we can rewrite (14) as min θ Mm m EN(ϵ1:m;0,ψ(θ)) h ˆµf g(ϵ1:m) 2i µ 2 µ is thus a additive constant with no effect on the minimization of Eq.(14). We can, in fact, choose any constant in place of µ. Nonetheless, µ (or a good approximation to µ) is an effective control variate (Greensmith et al., 2004) and reduces the variance when estimating Eq.(14) from samples. We can use a bootstrap method to estimate µ: we sample multiple antithetic batches x1:m(1), , x1:m(k) qθ(x1:m); for each antithetic batch we can compute ˆµf(x1:m(i)). We can average over these estimate i ˆµf(x1:m (i)) (15) as an approximation to µ. Instead of minimizing ˆµf(x1:m) µ 2, we minimize ˆµf(x1:m) µ . Intuitively, this choice is also appealing because we are encouraging the empirical average for a batch of size m to approach the average of a larger batch of size mk. Notation: For convenience of notation we abbreviate ˆµf g(ϵ1:m) µ 2 as φ(ϵ1:m). We make its dependence on f, g and µ (which is replaced with µ) implicit. Our goal is to minimize EN(ϵ1:m;0,ψ(θ)) [φ(ϵ1:m)]. There are several methods to minimize Eq.(14). In particular, we use different methods depending on whether f g is differentiable with respect to ϵ. If it is differentiable, then φ(ϵ1:m) will also be differentiable with respect to each ϵi, and we can use gradient based optimization. Otherwise we will have to use gradient free optimizers. 4.1. Gradient Based Variance Minimization Suppose ϵf g(ϵ) can be computed, then we can also compute the gradient of ϵiφ(ϵ1:m). Then we can minimize Eq.(14) by reparameterization (Kingma & Welling, 2013). For any parameter θ Mm m, because ψ(θ) is positive definite, we can compute its Cholesky decomposition (proved in Appendix) as L(θ) = ψ(θ)1/2. Then for δ N(0, I), L(θ)δ will be distributed as N(0, ψ(θ)). Therefore we can rewrite our optimization objective Eq.(14) as EN(δ;0,I)[φ(L(θ)δ)] (16) which we can minimize with stochastic gradient descent. θEN(δ;0,I)[φ(L(θ)δ)] = EN(δ;0,I)[ θφ(L(θ)δ)] 4.2. Gradient Free Variance Optimization When the gradient ϵf g(ϵ) is not available (e.g., a black box function or g is not differentiable), we can use the reinforce estimator. θEN(ϵ1:m;0,Σθ)[φ(ϵ1:m)] (17) = EN(ϵ1:m;0,Σθ) [φ(ϵ1:m) θ log N(ϵ1:m; 0, Σθ)] If computing f g is expensive, then we can use the importance sampled reinforce estimator. Let r(ϵ1:m) be any distribution such that φ(ϵ1:m)N(ϵ1:m; 0, Σθ)/r(ϵ1:m) is finite, then the gradient of Eq.(17) is φ(ϵ1:m)N(ϵ1:m; 0, Σθ) r(ϵ1:m) θ log N(ϵ1:m; 0, Σθ) We can draw a set of samples ϵ1:m(i) from r(ϵ1:m) and store the values φ(ϵ1:m(i)) for each i. We can use this same set of samples to repeated compute the gradient Eq.(18) and reuse the φ(ϵ1:m(i)) we computed. 5. Reducing Computation Cost for Adaptive Antithetic Even though we can find a good antithetic distribution by optimizing Eq.(14), this is itself a stochastic optimization problem that can be expensive. The time invested in finding a good qθ needs to payoff in terms of reduced number of samples required. In this section we propose several strategies to reduce the computational overhead of adaptive antithetic sampling in practice. Combining these approaches, we are able to achieve superior wall-clock time in our experiments compared with i.i.d. and other baselines despite of the overhead of training qθ. 5.1. Reduction of Parameter Count Our parameterization of the feasible space by ψ has several desirable properties as shown in Theorem 2. However, θ Mm m (or equivalently Rmd md) has m2d2 learnable parameters, and becomes infeasible to use for large m or d. To reduce m, we do not perform antithetic sampling on the entire batch. Instead we define antithetic distributions Adaptive Antithetic Sampling for Variance Reduction over micro-batches of size k, and i.i.d. sample m/k microbatches as follows: x1:k, |{z} micro batch 1 xk+1:2k, | {z } micro batch 2 xm k+1:m | {z } micro batch m/k i.i.d qθ(x1:k) (19) Then because we only have to parameterize k dimensional Gaussian antithetic distributions, θ Mk k. This reduces number of parameters to k2d2. We can further reduce the number of trainable parameters by choosing θ Mk k for some k < k. Even for very small k , the model still has many free parameters because each element belongs to the matrix ring M. In experiments we observe that we can even find good Σ with k = 2. For most of our experiments on high dimensional problems, we choose k = 8, and θ Mk 2. The number of parameters will be 2kd2. To reduce d, we note that Theorem 2 (Appendix A) is still true if instead of using the ring M in Eq.(12,13), we use the subring of block diagonal matrices M. That is, given any partition of {1, , d}, we use the ring consisting of matrices of the following form A Rd d, Aij = 0 if i, j belong to different partitions We show in the appendix that ψ is still a surjection into Σunbiased under this new ring. The previous strategies are sufficient for our experiments, but this strategy could be necessary for even higher dimensional problems. 5.2. Amortization of the Learning Cost Monte Carlo estimates are widely used in stochastic optimization frameworks. In these applications we are not estimating Ep(x)[f(x)] for a single f, rather we have a sequence of related estimation problems Ep(x)[f1(x)], , Ep(x)[f T (x)] For example, for stochastic gradient descent, ft(x) is the gradient at the t-th iteration. Learning the antithetic distribution for every ft is clearly infeasible, however, assuming ft and ft+1 are similar, any qθ that achieves small variance on the estimation of Ep(x)[ft(x)] is also likely to achieve small variance on the estimation of Ep(x)[ft+1]. Then what we can do is to choose a subset of time steps {t1, } {1, , T} and update qθ by Eq.(14) only on these time steps. The cost of each update can be amortized over [ti, ti+1]. In particular, if updates are exponentially sparse (e.g., an update every ti = 2i steps), then the overhead of finding q is almost negligible. We show in our experiments that this leads to very good performance on deep learning tasks trained with stochastic gradient descent. 6. Experiments We test our adaptive Gaussian antithetic on three tasks. The first one is a controlled synthetic task, where we verify that our model demonstrates expected behavior. The second task is Bayesian inference, where we reduce variance and achieve better estimation of the posterior. The third task is improving stochastic gradient descent training of generative adversarial networks, where we reduce variance and obtain quantitative improvements in terms of inception scores for the same wall-clock time. 6.1. Simple Estimation Problems In this section we estimate several simple functions to visualize the behavior of our model. In both problems we use a one dimensional X and a one dimensional function f : X R. We choose p(x) as N(0, 1), and antithetic batch size m = 2. qθ(x1, x2) is a two dimensional Gaussian we can easily visualize by contour plots. The two functions we choose are f1 = ex + 2x sin(x) and f2 = x3. In Figure 1(A, B) we plot the optimal antithetic distribution q (x1, x2) that minimizes variance. This distribution is found by grid search over all possible parameters (up to precision 10 2). For f1 the antithetic distribution found by our model is identical to the optimal antithetic distribution (found by grid search). For f2, the optimal sampling distribution has singular covariance: x1 = x2 with probability 1 under q (x1, x2). Eq.(13) only parameterizes Gaussians with full rank covariances, so we cannot exactly represent q (x1, x2). Nonetheless our method learns a good approximation to q . (C) Learned covariance for 𝑓1 (A) Best covariance for 𝑓1 (D) Learned covariance for 𝑓2 (B) Best covariance for 𝑓2 Figure 1: Top: Best covariance for optimal antithetic Gaussian q (x1, x2) found by grid search. Bottom: Covariance found by our adaptive learning method. Adaptive Antithetic Sampling for Variance Reduction 6.2. Application to Variational Bayes In Bayesian estimation problems we often have an unobserved variable z Z, and an observed variable x X. Given a likelihood p(x|z) and a prior p(z), the objective is to either estimate log p(x) or the posterior p(z|x) = p(x|z)p(z) p(x) . For both applications, the major challenge is to compute p(x) or log p(x). A typical way to estimate log p(x) is to use importance sampling. Let q(z) be any distribution on Z. We can obtain the evidence lower bound (ELBO) to the log likelihood log p(x) = log Ep(z)[p(x|z)] = log Eq(z) log p(x, z) The bound is exact if q(z) = p(z|x), so we can maximize L(x) over a set of candidate importance sampling distributions q(z) to obtain the tightest bound. This procedure is usually known as stochastic variational inference (SVI) (Hoffman et al., 2013). However, q(z) is usually a simple distribution for computational reasons, so q(z) seldom approximates p(z|x) well. As a result, L(x) is usually a loose bound. To address this limitation, (Burda et al., 2015) proposed using multi-variable importance sampling Lis(x) = Eq(z1) q(zm) where it is guaranteed that L(x) Lis(x) log p(x). In fact we do not have to use a factorized distribution q(z1) , q(zm), which would correspond to i.i.d. sampling. Instead, we can use a correlated joint distribution qθ(z1, , zm), as long as it has the right marginals. Lanti(x) = Eqθ(z1, ,zm) Lanti(x) still lower bounds log p(x) as long as qθ is an antithetic distribution for q(z) because Lanti(x) log Eqθ(z1, ,zm) Therefore we can optimize over Gaussian-reparameterized antithetic distributions qθ to maximize Lanti(x) because it is a lower bound to log p(x) for any choice of qθ. To find a good antithetic distribution qθ we train on a small set of x, and apply the learned qθ to previously unseen inputs x. We find that qθ generalizes well across different x. Therefore, the cost of learning qθ is negligible when it is amortized over multiple estimation problems for log p(x). 6.2.1. DATASET AND EVALUATION We first train a variational autoencoder on MNIST and Omniglot for the pretrained model q(z|x), p(x|z). We evaluate whether our antithetic sampler can achieve a tighter bound Lanti on log p(x), where x are images in the test set. As our baselines, we use i.i.d. sampling to achieve lower bound Lis(x) as in Eq.(20). We also compared with negative sampling introduced in section 2.2. Since the objective Lanti is a function of x, we also evaluate the generalization of our model by training the antithetic sampler on a small set (100) of x but evaluate on a large number (1000) of samples, such that both the variance of this estimation and the additional computation overhead is negligible. 0.4 0.2 0.0 0.2 0.4 0.0% Our method VS negative sampling 0.4 0.2 0.0 0.2 0.4 0.0% Our method VS iid sampling Figure 2: For each sample x in the test set, we compute the difference between our lower bound Lanti and lower bound obtained by baseline methods. A positive difference indicates that our method is better. Left: Histogram of the difference between our method and negative antithetic. Right: Histogram of the difference between our method and i.i.d. sampling Lis. Negative bins only account for 5% of all data. This means that for almost all samples x in the test set, our method obtains a larger (tighter) lower bound to log p(x) compared to both baselines. 6.2.2. RESULTS Our results for MNIST are shown in Figure 2. The quantitative results for MNIST and Omniglot are shown in the Appendix. We consistently obtain larger (better) lower bound compared to baseline methods. In particular, our method shows strong generalization ability, although it is trained on a small number of x, we still achieve good performance (large Lanti(x)) for almost every x in the test set. If we amortize the cost of learning qθ over approximately 1000 estimation tasks, our improvement comes with negligible overhead. 6.3. Application to GAN Training Let X Rk be the space of images with k pixels, and Z Rd be the space of latent vectors. We are given an Adaptive Antithetic Sampling for Variance Reduction empirical distribution (i.e. set of images) pdata(x) on X, and a simple distribution p(z) on Z (e.g. Gaussian). Generative Adversarial Net (GAN) (Goodfellow et al., 2014) and its variants (Radford et al., 2015; Arjovsky et al., 2017; Gulrajani et al., 2017; Mao et al., 2017) has two sets of functions: a set of generator functions {G : Z X} that generates images by mapping latent vectors in Z into images X, and a set of discriminator functions {D : X R} that maps an image X to a real number R. Training a generative adversarial net consists of two objectives: find a generator that generates images that are indistinguishable to the discriminator, and find a discriminator that better distinguishes generator images G(z), z p(z), and input images pdata(x). A typical training objective is min G max D L(G, D) = Epdata(x)[D(x)] Ep(z)[D(G(z))] To optimize this objective the typical approach is by joint stochastic gradient descent. GL(G, D) = Ep(z)[ GD(G(z))] DL(G, D) = Epdata(x)[ DD(x)] Ep(z)[ DD(G(z))] Gnew = Gold GL(Gold, Dold) Dnew = Dold DL(Gold, Dold) However computation of both GL(G, D) and DL(G, D) require Monte Carlo estimation of expectations. Usually this is achieved by i.i.d. samples of z1, , zm i.i.d. p(z), and samples (that may not be i.i.d.) x1, , xm pdata(x). i GD(G(zi)) i DD(G(zi)) It has been empirically observed that large variance in the estimation of GL(G, D) and DL(G, D) hurts training. (Brock et al., 2018; Chavdarova et al., 2018) decrease the variance of gradient estimation to significantly improve training outcome (measured by FID/inception score). In particular, (Brock et al., 2018) reduces variance by using a larger batch size m; (Chavdarova et al., 2018) uses stochastic variance reduced gradient (SVRG). However, SVRG is computationally expensive, and it is usually more efficient to increase batch size m instead. Our method can be naturally applied to this setup. Instead of increasing batch size, we antithetically sample x1:m. We use exponentially infrequent updates to qθ described in Section 5.2, while for each update, we optimize with Eq.(14) until convergence. The intuition is that as training progresses and the model converges, training becomes increasingly stable; it is more likely that a qθ that achieves low variance for GL(G, D), can also achieve low variance for GL(Gnew, Dnew). Overall, our overhead is small enough, such that our method has more efficient wall clock time compared to increasing batch size and other baselines. Dataset and Evaluation Metrics We train WGAN-GP (Gulrajani et al., 2017) on MNIST dataset and Fashion MNIST. For comparison we use i.i.d. sampled z1, , zm, and negative sampling described in Section 2.2. To evaluate the performance of the learned generator G we use inception score (Salimans et al., 2016). Let n N be the space of labels (e.g. digit class). We train a classifier r(n|x) that maps each image x X to a distribution on the label space. Consider the joint distribution r(x, n) = r(x)r(n|x) where r(x) is defined by x = G(z), z p(z), inception score computes exp DKL(r(n|x) r(n)). Results The results for MNIST and Fashion MNIST are shown in Figure 3 in the Appendix. Given the same batch size m our method out-performs baseline methods on the variance of gradient estimation and inception score. When m is small, our method provides a marginal improvement, which does not justify the overhead. However, as soon as the batch size is large enough (e.g. 64), our method yields a significant improvement. Even when we take into account computation overhead of learning qθ, our method still outperforms baseline methods by a large margin. 7. Conclusion Variance reduction is a key challenge whenever Monte Carlo estimators are used in practice. In this paper, we investigated the use of antithetic sampling, a classic variance reduction technique that is currently not widely used in machine learning. We provided a general framework to automatically learn a good antithetic distribution based on the novel Gaussian-reparameterized antithetic family. Our approach provides provably unbiased estimates and strikes a good balance between flexibility and ease of training with gradient-based or gradient-free methods. Although there is a computational cost associated with learning a good antithetic family, we demonstrated empirically that it can be amortized and the variance reduction it affords pays off in terms of wall-clock time. Antithetic sampling can be easily combined with other techniques such as importance sampling, control variates, etc. Exploring synergies between these orthogonal strategies is an exciting direction for future work. Our methods have limitations with high dimensional random variables, we take it as future work to further reduce the number of parameters in order to increase scalability. Adaptive Antithetic Sampling for Variance Reduction Acknowledgements This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N0001419-1-2145), AFOSR (FA9550-19-1-0024), Amazon AWS, and JP Morgan. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. ar Xiv preprint ar Xiv:1701.07875, 2017. Bishop, C. M. Pattern recognition and machine learning. springer, 2006. Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. ar Xiv preprint ar Xiv:1809.11096, 2018. Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. ar Xiv preprint ar Xiv:1509.00519, 2015. Chavdarova, T., Stich, S., Jaggi, M., and Fleuret, F. Stochastic variance reduced gradient optimization of generative adversarial networks. 07 2018. Durante, F. and Sempi, C. Copula theory: an introduction. In Copula theory and its applications, pp. 3 31. Springer, 2010. Geweke, J. Antithetic acceleration of monte carlo integration in bayesian inference. Journal of Econometrics, 38 (1-2):73 89, 1988. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471 1530, 2004. Grisetti, G., Stachniss, C., and Burgard, W. Improved techniques for grid mapping with rao-blackwellized particle filters. IEEE transactions on Robotics, 23(1):34 46, 2007. Grover, A., Gummadi, R., Lazaro-Gredilla, M., Schuurmans, D., and Ermon, S. Variational rejection sampling. In Proc. 21st International Conference on Artificial Intelligence and Statistics, 2018. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Hammersley, J. and Morton, K. A new monte carlo technique: antithetic variates. In Mathematical proceedings of the Cambridge philosophical society, volume 52, pp. 449 475. Cambridge University Press, 1956. Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303 1347, 2013. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. L Ecuyer, P. and Owen, A. B. Monte Carlo and Quasi Monte Carlo Methods 2008. Springer, 2009. Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794 2802, 2017. Neal, R. M. Annealed importance sampling. Statistics and computing, 11(2):125 139, 2001. Neyman, J. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558 625, 1934. Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234 2242, 2016. Sukhatme, P. V. Sampling theory of surveys with applications. The Indian Society Of Agricultural Statistics; New Delhi, 1957. Weaver, L. and Tao, N. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 538 545. Morgan Kaufmann Publishers Inc., 2001. Wu, M., Goodman, N., and Ermon, S. Differentiable antithetic sampling for variance reduction in stochastic variational inference. 2019.