Journal of Machine Learning Research 21 (2020) 1-62. Submitted 03/19; Revised 05/20; Published 07/20.

Monte Carlo Gradient Estimation in Machine Learning

Shakir Mohamed 1 shakir@google.com
Mihaela Rosca 1 2 mihaelacr@google.com
Michael Figurnov 1 mfigurnov@google.com
Andriy Mnih 1 amnih@google.com
Equal contributions; 1 DeepMind, London; 2 University College London

Editor: Jon McAuliffe

This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis. In machine learning research, this gradient problem lies at the core of many learning problems, in supervised, unsupervised and reinforcement learning. We will generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, allowing them to be easily and efficiently used and analysed. We explore three strategies, the pathwise, score function, and measure-valued gradient estimators, exploring their historical development, derivation, and underlying assumptions. We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations. Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed. A deeper and more widely-held understanding of this problem will lead to further advances, and it is these advances that we wish to support.

Keywords: gradient estimation, Monte Carlo, sensitivity analysis, score-function estimator, pathwise estimator, measure-valued estimator, variance reduction

1.
Introduction

Over the past five decades the problem of computing the gradient of an expectation of a function (a stochastic gradient) has repeatedly emerged as a fundamental tool in the advancement of the state of the art in the computational sciences. An ostensibly anodyne gradient lies invisibly within many of our everyday activities: within the management of modern supply-chains (Siekman, 2000; Kapuscinski and Tayur, 1999), in the pricing and hedging of financial derivatives (Glasserman, 2013), in the control of traffic lights (Rubinstein and Shapiro, 1993), and in the major milestones in the ongoing research in artificial intelligence (Silver et al., 2016). Yet, computing the stochastic gradient is not without complexity, and its fundamental importance requires that we deepen our understanding of it to sustain future progress. This is our aim in this paper: to provide a broad, accessible, and detailed understanding of the tools we have to compute gradients of stochastic functions.

©2020 S. Mohamed, M. Rosca, M. Figurnov, A. Mnih. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-346.html.

We also aim to describe their instantiations in other research fields, to consider the trade-offs we face in choosing amongst the available solutions, and to consider questions for future research. Our central question is the computation of a general probabilistic objective F of the form

F(θ) := ∫ p(x; θ) f(x; φ) dx = E_{p(x;θ)}[f(x; φ)].  (1)

Equation (1) is a mean-value analysis, in which a function f of an input variable x with structural parameters φ is evaluated on average with respect to an input distribution p(x; θ) with distributional parameters θ. We will refer to f as the cost and to p(x; θ) as the measure, following the naming used in existing literature.
We will make one restriction in this review, and that is to problems where the measure p is a probability distribution that is continuous in its domain and differentiable with respect to its distributional parameters. This is a general framework that allows us to cover problems from queueing theory and variational inference to portfolio design and reinforcement learning. The need to learn the distributional parameters θ makes the gradient of the function (1) of greater interest to us:

η := ∇_θ F(θ) = ∇_θ E_{p(x;θ)}[f(x; φ)].  (2)

Equation (2) is the sensitivity analysis of F, i.e. the gradient of the expectation with respect to the distributional parameters. It is this gradient that lies at the core of tools for model explanation and optimisation. But this gradient problem is difficult in general: we will often not be able to evaluate the expectation in closed form; the integrals over x are typically high-dimensional, making quadrature ineffective; it is common to request gradients with respect to high-dimensional parameters θ, easily in the order of tens-of-thousands of dimensions; and in the most general cases, the cost function may itself not be differentiable, or be a black-box function whose output is all we are able to observe. In addition, for applications in machine learning, we will need efficient, accurate and parallelisable solutions that minimise the number of evaluations of the cost. These are all challenges we can overcome by developing Monte Carlo estimators of these integrals and gradients.

Overview. We develop a detailed understanding of Monte Carlo gradient estimation by first introducing the general principles and considerations for Monte Carlo methods in Section 2, and then showing how stochastic gradient estimation problems of the form of (2) emerge in five distinct research areas. We then develop three classes of gradient estimators: the score-function estimator, the pathwise estimator, and the measure-valued gradient estimator in Sections 4–6.
We discuss methods to control the variance of these estimators in Section 7. Using a set of case studies in Section 8, we explore the behaviour of gradient estimators in different settings to make their application and design more concrete. We discuss the generalisation of the estimators developed and other methods for gradient estimation in Section 9, and conclude in Section 10 with guidance on choosing estimators and some of the opportunities for future research. This paper follows in the footsteps of several related reviews that have had an important influence on our thinking, specifically, Fu (1994), Pflug (1996), Vázquez-Abad (2000), and Glasserman (2013), and which we generally recommend reading.

Notation. Throughout the paper, bold symbols indicate vectors, otherwise variables are scalars. f(x) is a function of variables x, which may depend on structural parameters φ, although we will not explicitly write out this dependency since we will not consider these parameters in our exposition. We indicate a probability distribution using the symbol p, and use the notation p(x; θ) to represent the distribution over random vectors x with distributional parameters θ. ∇_θ represents the gradient operator that collects all the partial derivatives of a function with respect to parameters in θ, i.e. ∇_θ f = [∂f/∂θ_1, . . . , ∂f/∂θ_D], for D-dimensional parameters; ∇_θ will also be used for scalar θ. By convention, we consider vectors to be columns and gradients as rows. We represent the sampling or simulation of variates x̂ from a distribution p(x) using the notation x̂ ∼ p(x). We use E_p[f] and V_p[f] to denote the expectation and variance of the function f under the distribution p, respectively. Appendix A lists the shorthand notation used for distributions, such as N or M for the Gaussian and double-sided Maxwell distributions, respectively.

Reproducibility.
Code to reproduce Figures 2 and 3, sets of unit tests for gradient estimation, and for the experimental case study using Bayesian logistic regression in Section 8.3 are available at https://www.github.com/deepmind/mc_gradients .

2. Monte Carlo Methods and Stochastic Optimisation

This section briefly reviews the Monte Carlo method and the stochastic optimisation setting we rely on throughout the paper. Importantly, this section provides the motivation for why we consider the gradient estimation problem (2) to be so fundamental, by exploring an impressive breadth of research areas in which it appears.

2.1. Monte Carlo Estimators

The Monte Carlo method is one of the most general tools we have for the computation of probabilities, integrals and summations. Consider the mean-value analysis problem (1), which evaluates the expected value of a general function f under a distribution p. In most problems, the integral (1) will not be known in closed form, and not amenable to evaluation using numerical quadrature. However, by using the Monte Carlo method (Metropolis and Ulam, 1949) we can easily approximate the value of the integral. The Monte Carlo method says that we can numerically evaluate the integral by first drawing independent samples x̂^(1), . . . , x̂^(N) from the distribution p(x; θ), and then computing the average of the function evaluated at these samples:

F̄_N = (1/N) ∑_{n=1}^{N} f(x̂^(n)), where x̂^(n) ∼ p(x; θ) for n = 1, . . . , N.  (3)

The quantity F̄_N is a random variable, since it depends on the specific set of random variates {x̂^(1), . . . , x̂^(N)} used, and we can repeat this process many times by constructing multiple sets of such random variates. Equation (3) is a Monte Carlo estimator of the expectation (1). As long as we can write an integral in the form of Equation (1), as a product of a function and a distribution that we can easily sample from, we will be able to apply the Monte Carlo method and develop Monte Carlo estimators. This is the strategy we use throughout this paper.
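The estimator (3) can be sketched in a few lines. The integrand and measure below are our own illustrative choices, not examples from the paper: f(x) = x² under a Gaussian N(μ, σ²), for which the expectation has the closed form μ² + σ², giving a known value to compare the estimate against.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Monte Carlo estimator (3): average f over n draws from the measure."""
    x = sampler(n)
    return np.mean(f(x))

# Illustrative check: E[x^2] under N(mu, sigma^2) equals mu^2 + sigma^2.
mu, sigma = 1.0, 2.0
true_value = mu**2 + sigma**2  # = 5.0
est = mc_estimate(f=lambda x: x**2,
                  sampler=lambda n: rng.normal(mu, sigma, size=n),
                  n=100_000)
```

With 100,000 samples, the estimate lands close to the true value of 5; the same pattern, a sampler plus an averaged function, recurs throughout the paper.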
There are four properties we will always ask of a Monte Carlo estimator:

Consistency. As we increase the number of samples N in (3), the estimate F̄_N should converge to the true value of the integral E_{p(x;θ)}[f(x)]. This usually follows from the strong law of large numbers.

Unbiasedness. If we repeat the estimation process many times, we should find that the estimate is centred on the actual value of the integral on average. The Monte Carlo estimators for (1) satisfy this property easily:

E_{p(x;θ)}[F̄_N] = E_{p(x;θ)}[(1/N) ∑_{n=1}^{N} f(x̂^(n))] = (1/N) ∑_{n=1}^{N} E_{p(x;θ)}[f(x̂^(n))] = E_{p(x;θ)}[f(x)].  (4)

Unbiasedness is always preferred because it allows us to guarantee the convergence of a stochastic optimisation procedure. Biased estimators can sometimes be useful but require more care in their use (Mark and Baram, 2001).

Minimum variance. Because an estimator (3) is a random variable, if we compare two unbiased estimators using the same number of samples N, we will prefer the estimator that has lower variance. We will repeatedly emphasise the importance of low variance estimators for two reasons: the resulting gradient estimates are themselves more accurate, which is essential for problems in sensitivity analysis where the actual value of the gradient is the object of interest; and where the gradient is used for stochastic optimisation, low-variance gradients make learning more efficient, allowing larger step-sizes (learning rates) to be used, potentially reducing the overall number of steps needed to reach convergence and hence resulting in faster training.

Computational efficiency. We will always prefer an estimator that is computationally efficient, such as those that allow the expectation to be computed using the fewest number of samples, those that have a computational cost linear in the number of parameters, and those whose computations can be easily parallelised.
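The unbiasedness and variance properties can be checked empirically by running the estimator many times. The sketch below assumes the same illustrative integrand, f(x) = x² under N(μ, σ²) with known mean μ² + σ²; both estimators are centred on the truth, but the larger-N estimator fluctuates far less.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative integrand (our assumption): E[x^2] = mu^2 + sigma^2 exactly.
mu, sigma = 1.0, 2.0
true_value = mu**2 + sigma**2  # = 5.0

def estimator_samples(n, repeats=2000):
    """Run the N-sample Monte Carlo estimator (3) many times;
    return one estimate per repeat."""
    x = rng.normal(mu, sigma, size=(repeats, n))
    return (x**2).mean(axis=1)

small_n = estimator_samples(n=10)
large_n = estimator_samples(n=1000)
# Unbiasedness: both sets of estimates are centred on true_value.
# Minimum variance: the N=1000 estimates are far more concentrated.
```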
We can typically assume that our estimators are consistent because of the generality of the law of large numbers. Therefore most of our effort will be directed towards characterising their unbiasedness, variance and computational cost, since it is these properties that affect the choice we make between competing estimators. Monte Carlo methods are widely studied, and the books by Robert and Casella (2013) and Owen (2013) provide deep coverage of their wider theoretical properties and practical considerations.

2.2. Stochastic Optimisation

The gradient (2) supports at least two key computational tasks, those of explanation and optimisation. Because the gradient provides a computable value that characterises the behaviour of the cost (the cost's sensitivity to changing settings), a gradient is directly useful as a tool with which to explain the behaviour of a probabilistic system. More importantly, the gradient (2) is the key quantity needed for optimisation of the distributional parameters θ. Figure 1 (adapted from Pflug (1996), sect. 1.2.5) depicts the general stochastic optimisation loop, which consists of two phases: a simulation phase and an optimisation phase. This is a stochastic system because the system or the environment has elements of randomness, i.e. the input parameters θ influence the system in a probabilistic manner. We will consider several case studies in Section 8 that all operate within this optimisation framework. Whereas in a typical optimisation we would make a call to the system for a function value, which is deterministic, in stochastic optimisation we make a call for an estimate of the function value; instead of calling for the gradient, we ask for an estimate of the gradient; instead of calling for the Hessian, we will ask for an estimate of the Hessian; all these estimates are random variables.
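The interplay of the two phases can be made concrete on a toy problem. The sketch below is our own illustrative assumption, not an example from the paper: it minimises F(θ) = E_{N(x; θ, 1)}[x²], whose optimum is θ = 0. The simulation phase draws randomness and returns an unbiased gradient estimate via the substitution x = θ + ε with ε ∼ N(0, 1), so that ∇_θ F = E[2x] (an instance of the pathwise estimator developed later in the paper); the optimisation phase takes a stochastic gradient step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy doubly-stochastic loop (illustrative assumption, not from the paper):
# minimise F(theta) = E_{N(x; theta, 1)}[x^2], whose minimiser is theta = 0.
theta, step_size = 5.0, 0.1
for _ in range(200):
    eps = rng.normal(size=100)           # simulation phase: draw randomness
    x = theta + eps                      # samples from N(theta, 1)
    grad_estimate = np.mean(2.0 * x)     # unbiased estimate of dF/dtheta
    theta -= step_size * grad_estimate   # optimisation phase: update theta
```

After a few hundred steps θ fluctuates around the optimum at 0, with the residual jitter set by the variance of the gradient estimate and the step-size.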
Figure 1: Stochastic optimisation loop comprising a simulation phase and an optimisation phase. The simulation phase produces a simulation of the stochastic system or interaction with the environment, as well as unbiased estimators of the gradient.

Making a clear separation between the simulation and optimisation phases allows us to focus our attention on developing the best estimators of gradients we can, knowing that when used with the gradient-based optimisation methods available to us, we can guarantee convergence so long as the estimate is unbiased, i.e. the estimate of the gradient is correct on average (Kushner and Yin, 2003). If the optimisation phase also uses stochastic approximation (Robbins and Monro, 1951; Kushner and Yin, 2003), such as the stochastic gradient descent that is widely used in machine learning, then this loop can be described as a doubly-stochastic optimisation (Titsias and Lázaro-Gredilla, 2014). In this setting, there are now two sources of stochasticity. One source arises in the simulation phase from the use of Monte Carlo estimators of the gradient, which are random variables because of the repeated sampling from the measure, like those in (3). A second source of stochasticity arises in the optimisation phase from the use of the stochastic approximation method, which introduces a source of randomness through the subsampling of data points (mini-batching) when computing the gradient. In Section 8.3 we explore some of the performance effects from the interaction between these sources of stochasticity.

2.3.
The Central Role of Gradient Estimation

Across a breadth of research areas, whether in approximate inference, reinforcement learning, experimental design, or active learning, the need for accurate and efficient computation of stochastic gradients and their corresponding Monte Carlo estimators appears, making the gradient question one of the fundamental problems of statistical and machine learning research. We make the problem (2) more concrete by briefly describing its instantiation in five areas, each with independent and thriving research communities of their own. In them, we can see the central role of gradient estimation by matching the pattern of the problem that arises, and in so doing, see their shared foundations.

Variational Inference. Variational inference is a general method for approximating complex and unknown distributions by the closest distribution within a tractable family. Wherever the need to approximate distributions appears in machine learning, whether in supervised, unsupervised, or reinforcement learning, a form of variational inference can be used. Consider a generic probabilistic model p(x|z)p(z) that defines a generative process in which observed data x is generated from a set of unobserved variables z using a data distribution p(x|z) and a prior distribution p(z). In supervised learning, the unobserved variables might correspond to the weights of a regression problem, and in unsupervised learning to latent variables. The posterior distribution of this generative process p(z|x) is unknown, and is approximated by a variational distribution q(z|x; θ), which is a parameterised family of distributions with variational parameters θ, e.g., the mean and variance of a Gaussian distribution.
Finding the distribution q(z|x; θ) that is closest to p(z|x) (e.g., in the KL sense) leads to an objective, the variational free-energy, that optimises an expected log-likelihood log p(x|z) subject to a regularisation constraint that encourages closeness between the variational distribution q and the prior distribution p(z) (Jordan et al., 1999; Blei et al., 2017). Optimising the distribution q requires the gradient of the free energy with respect to the variational parameters θ:

η = ∇_θ E_{q(z|x;θ)}[log p(x|z) − log(q(z|x; θ)/p(z))].  (5)

This is an objective that is in the form of Equation (1): the cost is the difference between a log-likelihood and a log-ratio (of the variational distribution and the prior distribution), and the measure is the variational distribution q(z|x; θ). This problem also appears in other research areas, especially in statistical physics, information theory and utility theory (Honkela and Valpola, 2004). Many of the solutions that have been developed for scene understanding, representation learning, photo-realistic image generation, or the simulation of molecular structures also rely on variational inference (Eslami et al., 2018; Kusner et al., 2017; Higgins et al., 2017). In variational inference we find a thriving research area where the problem (2) of computing gradients of expectations lies at its core.

Reinforcement Learning. Model-free policy search is an area of reinforcement learning where we learn a policy, a distribution over actions, that on average maximises the accumulation of long-term rewards. Through interaction in an environment, we can generate trajectories τ = (s_1, a_1, s_2, a_2, . . . , s_T, a_T) that consist of pairs of states s_t and actions a_t for time period t = 1, . . . , T. A policy is learnt by following the policy gradient (Sutton and Barto, 1998)

η = ∇_θ E_{p(τ;θ)}[∑_{t=0}^{T} γ^t r(s_t, a_t)],  (6)

which again has the form of Equation (1).
The cost is the return over the trajectory, which is a weighted sum of rewards obtained at each time step r(s_t, a_t), with the discount factor γ ∈ [0, 1]. The measure is the joint distribution over states and actions p(τ; θ) = [∏_{t=0}^{T−1} p(s_{t+1}|s_t, a_t) p(a_t|s_t; θ)] p(a_T|s_T; θ), which is the product of a state transition probability p(s_{t+1}|s_t, a_t) and the policy distribution p(a_t|s_t; θ) with policy parameters θ. The Monte Carlo gradient estimator used to compute this gradient with respect to the policy parameters (see score function estimator later) leads to the policy gradient theorem, one of the key theorems in reinforcement learning, which lies at the heart of many successful applications, such as the AlphaGo system for complex board games (Silver et al., 2016), and robotic control (Deisenroth et al., 2013). In reinforcement learning, we find yet another thriving area of research where gradient estimation plays a fundamental role.

Sensitivity Analysis. The field of sensitivity analysis is dedicated to the study of problems of the form of (2), and asks what the sensitivity (another term for gradient) of an expectation is to its input parameters. Computational finance is one area where sensitivity analysis is widely used to compare investments under different pricing and return assumptions, in order to choose a strategy that provides the best potential future payout. Knowing the value of the gradient gives information needed to understand the sensitivity of future payouts to different pricing assumptions, and provides a direct measure of the financial risk that an investment strategy will need to manage. In finance, these sensitivities, or gradients, are referred to as the greeks.
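As a concrete instance of such a sensitivity, the sketch below estimates the delta of a European call option under a lognormal price model (the classic Black-Scholes setting discussed next) by Monte Carlo, and compares it against the closed-form value. All parameter values are illustrative assumptions, and the per-path derivative used here is an instance of the pathwise estimator developed later in the paper.

```python
import numpy as np
from math import log, sqrt, erf

rng = np.random.default_rng(3)

# Illustrative parameters (our assumption): initial price s0, strike K,
# risk-free rate r, volatility sigma, time to maturity T.
s0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0

# Lognormal price at maturity: s_T = s0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z).
z = rng.normal(size=1_000_000)
s_T = s0 * np.exp((r - 0.5 * sigma**2) * T + sigma * sqrt(T) * z)

# Pathwise delta: differentiate the discounted payoff e^{-rT} max(s_T - K, 0)
# with respect to s0 along each sampled path, then average.
delta_mc = np.mean(np.exp(-r * T) * (s_T > K) * s_T / s0)

# Closed-form Black-Scholes delta Phi(d1), for comparison.
d1 = (log(s0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
delta_exact = 0.5 * (1.0 + erf(d1 / sqrt(2.0)))
```

With a million paths the Monte Carlo delta agrees with the closed-form value to two or three decimal places; in more general models no closed form exists and only the estimator is available.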
A classic example of this problem is the Black-Scholes option pricing model, which can be written in the form of Equation (1): the cost function is the discounted value of an option with price s_T at the time of maturity T, using discount factor γ; and the measure is a log-normal distribution over the price at maturity p(s_T; s_0), and is a function of the initial price s_0. The gradient with respect to the initial price is known as the Black-Scholes Delta (Chriss, 1996)

Δ = ∇_{s_0} E_{p(s_T; s_0)}[e^{−γT} max(s_T − K, 0)],  (7)

where K is a minimum expected return on the investment; delta is the risk measure that is actively minimised in delta-neutral hedging strategies. The Black-Scholes delta (7) above can be computed in closed form, and the important greeks have closed or semi-closed form formulae in the Black-Scholes formulation. Gradient estimation methods are used when the cost function (the payoff) is more complicated, or in more realistic models where the measure is not log-normal. In these more general settings, the gradient estimators we review in this paper are the techniques that are still widely used in financial computations today (Glasserman, 2013).

Discrete Event Systems and Queuing Theory. An enduring problem in operations research is the study of queues, the waiting systems and lines that we all find ourselves in as we await our turn for a service. Such systems are often described by a stochastic recursion L_t = x_t + max(L_{t−1} − a_t, 0), where L_t is the total system time (of people in the queue plus in service), x_t is the service time for the t-th customer, and a_t is the inter-arrival time between the t-th and the (t+1)-th customer (Vázquez-Abad, 2000; Rubinstein et al., 1996). The number of customers served in this system is denoted by τ.
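The stochastic recursion above is straightforward to simulate. The sketch below assumes exponential service and inter-arrival times, i.e. an M/M/1 special case of the general setting, with illustrative rates; for M/M/1 the steady-state mean total system time has the known form 1/(μ − λ), giving a value to compare the Monte Carlo average against.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate L_t = x_t + max(L_{t-1} - a_t, 0) for an M/M/1 queue
# (illustrative assumption): arrival rate lam, service rate mu.
lam, mu, num_customers = 0.5, 1.0, 200_000

a = rng.exponential(1.0 / lam, size=num_customers)  # inter-arrival times
x = rng.exponential(1.0 / mu, size=num_customers)   # service times
L = np.empty(num_customers)
L[0] = x[0]
for t in range(1, num_customers):
    L[t] = x[t] + max(L[t - 1] - a[t], 0.0)

avg_system_time = L.mean()        # Monte Carlo estimate of expected system time
steady_state = 1.0 / (mu - lam)   # M/M/1 closed form: 2.0 for these rates
```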
For convenience we can write the service time and the inter-arrival time using the variable y_t = {x_t, a_t}, and characterise it by a distribution p(y_t; θ) = p(x_t; θ_x)p(a_t; θ_a) with distributional parameters θ = {θ_x, θ_a}. The expected steady-state behaviour of this system gives a problem of the form of (2), where the cost is the ratio of the average total system time to the average number of customers served, and the measure is the joint distribution over service times p(y_{1:T}; θ) (Rubinstein, 1986):

η = ∇_θ E_{p(y_{1:T};θ)}[∑_{t=1}^{T} L_t(y_{1:t}) / τ].  (8)

In Kendall's notation, this is a general description for G/G/1 queues (Newell, 2013): queues with general distributions for the arrival rate, general distributions for the service time, and a single-server first-in-first-out queue. This problem also appears in many other areas and under many other names, particularly in regenerative and semi-Markov processes, and discrete event systems (Cassandras and Lafortune, 2009). In all these cases, the gradient of the expectation of a cost, described as a ratio, is needed to optimise a sequential process. Queues permeate all parts of our lives, hidden from us in global supply chains and in internet routing, and visible to us at traffic lights and in our online and offline shopping, all made efficient through the use of Monte Carlo gradient estimators.

Experimental Design. In experimental design we are interested in finding the best designs, the inputs or settings to a possibly unknown system, that result in outputs that are optimal with respect to some utility or score. Designs are problem configurations, such as a hypothesis in an experiment, the locations of sensors on a device, or the hyperparameters of a statistical model (Chaloner and Verdinelli, 1995). We evaluate the system using a given design θ and measure its output y, usually in settings where the measurements are expensive to obtain and hence cannot be taken frequently.
One such problem is the probability-of-improvement, which allows us to find designs θ that on average improve over a currently best-known outcome. This is an objective that can be written in the form of Equation (1), where the cost is the score of a new design being better than the current best design, and the measure is a predictive function p(y; θ), which allows us to simulate the output y for an input design θ. To find the best design, we will need to compute the gradient of the probability-of-improvement with respect to the design θ: η = ∇_θ E_{p(y|θ)}[1{y