# Asymptotic optimality of adaptive importance sampling

Bernard Delyon, IRMAR, University of Rennes 1, bernard.delyon@univ-rennes1.fr
François Portier, Télécom ParisTech, University of Paris-Saclay, francois.portier@gmail.com

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract.** Adaptive importance sampling (AIS) uses past samples to update the sampling policy $q_t$. Each stage $t$ is formed of two steps: (i) explore the space with $n_t$ points drawn according to $q_t$ and (ii) exploit the current amount of information to update the sampling policy. The fundamental question raised in this paper concerns the behavior of empirical sums based on AIS. Without making any assumption on the allocation policy $(n_t)$, the theory developed here involves no restriction on the split of computational resources between the explore step (i) and the exploit step (ii). It is shown that AIS is asymptotically optimal: the asymptotic behavior of AIS is the same as that of an oracle strategy that knows the targeted sampling policy from the beginning. From a practical perspective, weighted AIS is introduced, a new method that makes it possible to forget poor samples from early stages.

## 1 Introduction

The adaptive choice of a sampling policy lies at the heart of many fields of machine learning where former Monte Carlo experiments guide the forthcoming ones. This includes for instance reinforcement learning [19, 27, 30], where the optimal policy maximizes the reward; inference in Bayesian [6] or graphical models [21]; optimization based on stochastic gradient descent [34] or without using the gradient [18]; and rejection sampling [12]. Adaptive importance sampling (AIS) [25, 2], which extends the basic Monte Carlo integration approach, offers a natural probabilistic framework to describe the evolution of sampling policies. The present paper establishes, under fairly reasonable conditions, that AIS is asymptotically optimal, i.e., learning the sampling policy has no cost asymptotically.

Suppose we are interested in computing some integral value $\int \varphi$, where $\varphi : \mathbb{R}^d \to \mathbb{R}$ is called the integrand. The importance sampling estimate of $\int \varphi$ based on the sampling policy $q$ is given by

$$ I_n = n^{-1} \sum_{i=1}^{n} \frac{\varphi(x_i)}{q(x_i)}, \qquad (1) $$

where $(x_1, \ldots, x_n)$ are i.i.d. with density $q$. This estimate is unbiased. It is well known, e.g., [16, 13], that the optimal sampling policy regarding the variance is obtained when $q$ is proportional to $|\varphi|$.

A slightly different context where importance sampling still applies is Bayesian estimation. Here the targeted quantity is $\int \varphi \pi$ and we only have access to an unnormalized version $\pi_u$ of the density $\pi = \pi_u / \int \pi_u$. Estimators usually employed are of the self-normalized form

$$ \frac{\sum_{i=1}^{n} \varphi(x_i)\, \pi_u(x_i) / q(x_i)}{\sum_{i=1}^{n} \pi_u(x_i) / q(x_i)}. \qquad (2) $$

In this case, the optimal sampling policy $q$ is proportional to $|\varphi - \int \varphi \pi| \, \pi$ (see [9] or section B.3 in the supplementary material).

Because appropriate policies naturally depend on $\varphi$ or $\pi$, we generally cannot simulate from them. They are then approximated adaptively, by densities from which we can simulate, using the information gathered from the past stages. This is the very spirit of AIS. At each stage $t$, the value $I_t$, standing for the current estimate, is updated using i.i.d. new samples $x_{t,1}, \ldots, x_{t,n_t}$ from $q_t$, where $q_t$ is a probability density function that might depend on the past stages $1, \ldots, t-1$. The distribution $q_t$, called the sampling policy, targets some optimal, or at least suitable, sampling policy. The sequence $(n_t) \subset \mathbb{N}^*$, called the allocation policy, gives the number of particles generated at each stage.
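To fix ideas, here is a minimal Python sketch of the two plain (non-adaptive) importance sampling estimators (1) and (2); the function names and the toy integrand are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def is_estimate(phi, q_sample, q_pdf, n, rng=None):
    """Unnormalized importance sampling estimate (1) of the integral of phi."""
    rng = np.random.default_rng() if rng is None else rng
    x = q_sample(n, rng)                  # x_1, ..., x_n i.i.d. from q
    return np.mean(phi(x) / q_pdf(x))     # average of phi(x_i) / q(x_i)

def is_estimate_normalized(phi, pi_u, q_sample, q_pdf, n, rng=None):
    """Self-normalized estimate (2) of int phi dpi, pi known only up to a constant via pi_u."""
    rng = np.random.default_rng() if rng is None else rng
    x = q_sample(n, rng)
    w = pi_u(x) / q_pdf(x)                # unnormalized importance weights
    return np.sum(phi(x) * w) / np.sum(w)

# Toy check: phi(x) = x^2 * N(0,1)-density integrates to 1; sample from q = N(0, 2^2).
phi = lambda x: x**2 * norm.pdf(x)
print(is_estimate(phi, lambda n, rng: rng.normal(0.0, 2.0, n),
                  lambda x: norm.pdf(x, 0.0, 2.0), 10_000))
```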
The following algorithm describes the AIS scheme for the classical integration problem. For the Bayesian problem, it suffices to change the estimate according to (2). This is a generic representation of AIS, as no explicit update rule is specified (this will be discussed just below).

**Algorithm 1 (AIS).** Inputs: the number of stages $T \in \mathbb{N}^*$, the allocation policy $(n_t)_{t=1,\ldots,T} \subset \mathbb{N}^*$, the sampler update procedure, the initial density $q_0$. Set $S_0 = 0$, $N_0 = 0$. For $t$ in $1, \ldots, T$:

(i) (Explore) Generate $(x_{t,1}, \ldots, x_{t,n_t})$ from $q_{t-1}$.

(ii) (Exploit)

(a) Update the estimate:
$$ S_t = S_{t-1} + \sum_{i=1}^{n_t} \frac{\varphi(x_{t,i})}{q_{t-1}(x_{t,i})}, \qquad N_t = N_{t-1} + n_t, \qquad I_t = N_t^{-1} S_t. $$

(b) Update the sampler $q_t$.
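As a concrete illustration of Algorithm 1, here is a minimal Python sketch with a pluggable update rule; the names `update_sampler`, `n_alloc`, and `history` are our own conventions, not from the paper.

```python
import numpy as np

def ais(phi, q0_sample, q0_pdf, update_sampler, n_alloc, rng=None):
    """Generic AIS loop (Algorithm 1): alternate explore and exploit steps."""
    rng = np.random.default_rng() if rng is None else rng
    q_sample, q_pdf = q0_sample, q0_pdf
    S, N = 0.0, 0
    history = []                                   # past particles and their sampling densities
    for n_t in n_alloc:
        x = q_sample(n_t, rng)                     # (i) explore: draw n_t points from q_{t-1}
        S += np.sum(phi(x) / q_pdf(x))             # (ii)(a) update the running sum S_t
        N += n_t
        I = S / N                                  # current estimate I_t
        history.append((x, q_pdf(x)))
        q_sample, q_pdf = update_sampler(history)  # (ii)(b) update the sampler q_t
    return I
```

Any of the update rules of section 3 (method of moments, GMM, Kullback-Leibler, variance) could be plugged in through `update_sampler`.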
Pioneering works on adaptive schemes include [20], where, within a two-stage procedure, the sampling policy is chosen out of a parametric family; this is further formalized in [14]; [25] introduces the idea of a multi-stage approach where all the previous stages are used to update the sampling policy (see also [29] regarding the choice of the loss function); [26] investigates the use of control variates coupled with importance sampling; the population Monte Carlo approach [3, 2] offers a general framework for AIS and has been further studied using parametric mixtures [8, 9]; see also [5, 32] for a variant called multiple adaptive importance sampling and [11] for a recent review. In [33, 23], nonparametric importance sampling based on kernel smoothing is introduced. The approach of choosing $q_t$ out of a parametric family should also be contrasted with the nonparametric approach based on particles, often referred to as sequential Monte Carlo [6, 4, 10], whose context is different as traditionally the targeted distribution changes with $t$. The distribution $q_{t-1}$ is then a weighted sum of Dirac masses $\sum_i w_{t-1,i} \delta_{x_{t-1,i}}$, and updating $q_t$ follows from adjusting the weights.

The theoretical properties of adaptive schemes are difficult to derive because of the recycling of past samples at each stage and hence the lack of independence between samples. Among updates based on a parametric family, the convergence properties of the Kullback-Leibler divergence between the estimated and the targeted distribution are studied in [8]. Properties related to the asymptotic variance are given in [9]. Among nonparametric updates, [33] establishes fast convergence rates in a two-stage strategy where the number of samples used in each stage goes to infinity. For sequential Monte Carlo, limit theorems are given for instance in [6, 4, 10]. All these results are obtained when $T$ is fixed and $n_T \to \infty$ and therefore miss the true nature of adaptive schemes, for which the asymptotics should be taken with respect to $T$. Recently, a more realistic asymptotic regime was considered in [22], in which the allocation policy $(n_t)$ is a fixed growing sequence of integers. The authors establish the consistency of the estimate when the update is conducted with respect to a parametric family but depends only on the last stage. They focus on multiple adaptive importance sampling [5, 32], which is different from AIS (see Remark 2 below for more details).

In this paper, following the same spirit as [8, 9, 2], we study parametric AIS as presented in Algorithm 1, where the policy is chosen out of a parametric family of probability density functions. Our analysis focuses on the following three key points, which are new to the best of our knowledge.

- A central limit theorem is established for the AIS estimate $I_t$. It involves high-level conditions on the sampling policy estimate $q_t$ (which will be easily satisfied for parametric updates). Based on the martingale property associated with certain sequences of interest, the asymptotics are not taken with $T$ fixed and $n_T \to \infty$, but with respect to the total number of samples $n_1 + \cdots + n_T$. In particular, the allocation policy $(n_t)$ is not required to grow to infinity. This is presented in section 2.
- The high-level conditions are verified in the case of parametric sampling policies, with updates taking place in a general framework inspired by the paradigm of empirical risk minimization (several concrete examples are provided). This establishes the asymptotic optimality of AIS in the sense that the rate and the asymptotic variance coincide with those of an oracle procedure in which the targeted policy is known from the beginning. The details are given in section 3.
- A new method, called weighted AIS (wAIS), is designed in section 4 to eventually forget bad samples drawn during the early stages of AIS. Our numerical experiments show that (i) wAIS accelerates significantly the convergence of AIS and (ii) small allocation policies $(n_t)$ (implying more frequent updates) give better results than large ones (at an equal number of requests to $\varphi$). This last point supports empirically the theoretical framework adopted in the paper.

All the proofs are given in the supplementary material.

## 2 Central limit theorems for AIS

The aim of this section is to provide conditions on the sampling policy $(q_t)$ under which a central limit theorem holds for AIS and normalized AIS. For the sake of generality, and because it will be useful in the treatment of normalized estimators, we consider the multivariate case where $\varphi = (\varphi_1, \ldots, \varphi_p) : \mathbb{R}^d \to \mathbb{R}^p$. In the whole paper, $\int \varphi$ is with respect to the Lebesgue measure, $\| \cdot \|$ is the Euclidean norm, and $I_p$ is the identity matrix of size $(p, p)$.

To study the AIS algorithm, it is appropriate to work at the sample time scale as described below, rather than at the sampling policy scale as described in the introduction. The sample $x_{t,i}$ (resp. the policy $q_t$) of the previous section ($t$ is the block index and $i$ the sample index within the block) is now simply denoted $x_j$ (resp. $q_j$), where $j = n_1 + \cdots + n_{t-1} + i$ is the sample index in the whole sequence $1, \ldots, n$, with $n = N_T$. The following algorithm is the same as Algorithm 1 (no explicit update rule is provided) but is expressed at the sample scale.

**Algorithm 2 (AIS at sample scale).** Inputs: the number of stages $T \in \mathbb{N}^*$, the allocation policy $(n_t)_{t=1,\ldots,T} \subset \mathbb{N}^*$, the sampler update procedure, the initial density $q_0$. Set $S_0 = 0$. For $j$ in $1, \ldots, n$:

(i) (Explore) Generate $x_j$ from $q_{j-1}$.

(ii) (Exploit)

(a) Update the estimate:
$$ S_j = S_{j-1} + \frac{\varphi(x_j)}{q_{j-1}(x_j)}, \qquad I_j = j^{-1} S_j. $$

(b) Update the sampler $q_j$ whenever $j \in \{N_t = \sum_{s=1}^{t} n_s : t \geq 1\}$.

### 2.1 The martingale property

Define $\varepsilon_j$ as the $j$-th centered contribution to the sum $S_j$: $\varepsilon_j = \varphi(x_j)/q_{j-1}(x_j) - \int \varphi$. Define, for all $n \geq 1$, $M_n = \sum_{j=1}^{n} \varepsilon_j$. The filtration we consider is given by $\mathcal{F}_n = \sigma(x_1, \ldots, x_n)$. The quadratic variation of $M$ is given by $\langle M \rangle_n = \sum_{j=1}^{n} \mathbb{E}[\varepsilon_j \varepsilon_j^T \mid \mathcal{F}_{j-1}]$. Set

$$ V(q, \varphi) = \int \left( \frac{\varphi(x)}{q(x)} - \int \varphi \right) \left( \frac{\varphi(x)}{q(x)} - \int \varphi \right)^T q(x) \, dx. \qquad (3) $$

**Lemma 1.** Assume that for all $1 \leq j \leq n$, the support of $q_j$ contains the support of $\varphi$. Then the sequence $(M_n, \mathcal{F}_n)$ is a martingale. In particular, $I_n$ is an unbiased estimate of $\int \varphi$. In addition, the quadratic variation of $M$ satisfies $\langle M \rangle_n = \sum_{j=1}^{n} V(q_{j-1}, \varphi)$.
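The matrix $V(q, \varphi)$ of (3), which also appears as the asymptotic variance in Theorem 1 below, can be approximated numerically by sampling under $q$. A minimal sketch, assuming $\int \varphi$ is known (as in a toy example) and that `phi` returns arrays of shape $(n, p)$:

```python
import numpy as np

def V_matrix(phi, q_sample, q_pdf, true_integral, n=100_000, rng=None):
    """Monte Carlo approximation of V(q, phi) in (3): the covariance under q
    of phi(x)/q(x), centered at the true integral of phi."""
    rng = np.random.default_rng() if rng is None else rng
    x = q_sample(n, rng)
    z = phi(x) / q_pdf(x)[:, None] - np.asarray(true_integral)   # centered contributions
    return z.T @ z / n                                           # (p, p) matrix
```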
### 2.2 A central limit theorem for AIS

The following theorem describes the asymptotic behavior of AIS. The conditions will be verified for parametric updates in section 3 (see Theorem 3), in which case the asymptotic variance $V$ will be given explicitly.

**Theorem 1 (central limit theorem for AIS).** Assume that the sequence $q_n$ satisfies

$$ V(q_n, \varphi) \to V, \quad \text{a.s.}, \qquad (4) $$

for some $V \geq 0$, and that there exists $\eta > 0$ such that

$$ \sup_{j \geq 1} \int \frac{\|\varphi\|^{2+\eta}}{q_j^{1+\eta}} < \infty, \quad \text{a.s.} \qquad (5) $$

Then we have $\sqrt{n} \left( I_n - \int \varphi \right) \xrightarrow{d} \mathcal{N}(0, V)$.

**Remark 1 (zero-variance estimate).** Suppose that $p = 1$ (recalling that $\varphi : \mathbb{R}^d \to \mathbb{R}^p$). Theorem 1 includes the degenerate case $V = 0$. This happens when the integrand has constant sign and the sampling policy is well chosen, i.e., $q_n \to |\varphi| / \int |\varphi|$. In this case, we have that $\sqrt{n}(I_n - \int \varphi) = o_P(1)$, meaning that the standard Monte Carlo convergence rate ($1/\sqrt{n}$) has been improved. This is in line with the results presented in [33], where fast rates of convergence (compared to standard Monte Carlo) are obtained under restrictive conditions on the allocation policy $(n_t)$. Note that other techniques such as control variates, kernel smoothing or Gaussian quadrature can achieve fast convergence rates [24, 28, 7, 1].

**Remark 2 (adaptive multiple importance sampling).** Another way to compute the importance weights, called multiple adaptive importance sampling, has been introduced in [32] and has been successfully used in [26, 5]. It consists in replacing $q_{j-1}$ in the computation of $S_j$ by $\bar q_{j-1} = \sum_{i=1}^{j} q_{i-1} / j$, with $x_j$ still being drawn under $q_{j-1}$. The intuition is that this averaging will reduce the effect of exceptional points $x_j$ for which $|\varphi(x_j)| \gg q_{j-1}(x_j)$ (but $|\varphi(x_j)| \not\gg \bar q_{j-1}(x_j)$). Our approach is not able to handle this variant, simply because the martingale property described previously is no longer satisfied.

### 2.3 Normalized AIS

The normalization technique described in (2) is designed to compute $\int \varphi \pi$, where $\pi$ is a density. It is useful in the Bayesian context where $\pi$ is only known up to a constant. As this technique seems to provide substantial improvements compared to unnormalized estimates (i.e., (1) with $\varphi$ replaced by $\varphi \pi$), we recommend using it even when the normalizing constant of $\pi$ is known. Normalized estimators are given by

$$ I_n^{(\mathrm{norm})} = \frac{I_n(\varphi \pi)}{I_n(\pi)}, \qquad \text{with} \quad I_n(\psi) = n^{-1} \sum_{j=1}^{n} \psi(x_j) / q_{j-1}(x_j). $$

Interestingly, normalized estimators are weighted least-squares estimates, as they minimize the function $a \mapsto \sum_{j=1}^{n} (\pi(x_j)/q_{j-1}(x_j)) (\varphi(x_j) - a)^2$. In contrast with $I_n$, $I_n^{(\mathrm{norm})}$ has the following shift-invariance property: whenever $\varphi$ is shifted by $\mu$, $I_n^{(\mathrm{norm})}$ simply becomes $I_n^{(\mathrm{norm})} + \mu$. Because $I_n(\varphi \pi)$ and $I_n(\pi)$ are of the same kind as $I_n$ defined in Algorithm 2, the following corollary is a straightforward application of Theorem 1 (with $(\varphi^T \pi, \pi)^T$ in place of $\varphi$).

**Corollary 1 (central limit theorem for normalized AIS).** Suppose that (4) and (5) hold with $(\varphi^T \pi, \pi)^T$ in place of $\varphi$. Then we have $\sqrt{n} \left( I_n^{(\mathrm{norm})} - \int \varphi \pi \right) \xrightarrow{d} \mathcal{N}(0, U V U^T)$, with $U = (I_p, -\int \varphi \pi)$.

## 3 Parametric sampling policy

From this point forward, the sampling policies $q_t$, $t = 1, \ldots, T$ (we are back again to the sampling policy scale, as in Algorithm 1), are chosen out of a parametric family of probability density functions $\{q_\theta : \theta \in \Theta\}$. All our examples fit the general framework of empirical risk minimization over the parameter space $\Theta \subset \mathbb{R}^q$, where $\theta_t$ is given by

$$ \theta_t \in \operatorname*{argmin}_{\theta \in \Theta} R_t(\theta), \qquad R_t(\theta) = \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{m_\theta(x_{s,i})}{q_{s-1}(x_{s,i})}, \qquad (6) $$

where $q_s$ is a shortcut for $q_{\theta_s}$ and $m_\theta : \mathbb{R}^d \to \mathbb{R}$ might be understood as a loss function (see the next section for examples). Note that $R_t / N_t$ is an unbiased estimate of the risk $r(\theta) = \int m_\theta$.
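A generic sketch of the update (6), assuming a black-box loss `m_theta(theta, x)` and a derivative-free optimizer; in practice closed formulas or online stochastic gradient steps (Remark 3 below) would typically replace the generic call to `scipy.optimize.minimize`. The names `m_theta` and `history` are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def erm_update(m_theta, history, theta_init):
    """Update (6): theta_t minimizes R_t(theta), the importance-weighted
    empirical risk accumulated over all past stages."""
    def R_t(theta):
        # sum over stages s and particles i of m_theta(x_{s,i}) / q_{s-1}(x_{s,i})
        return sum(np.sum(m_theta(theta, x) / q_vals) for x, q_vals in history)
    res = minimize(R_t, theta_init, method="Nelder-Mead")
    return res.x
```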
### 3.1 Examples of sampling policy

We start by introducing a particular case, which is one of the simplest ways to implement AIS. We then provide more general approaches. In what follows, the targeted policy, denoted by $f$, is chosen by the user and represents the distribution from which we wish to sample. It often reflects some prior knowledge on the problem of interest. If $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ with $p = 1$, then (as discussed in the introduction) $f \propto |\varphi|$ is optimal for (1) and $f \propto |\varphi - \int \varphi \pi| \pi$ is optimal for (2). In the Bayesian context, where many integrals $\int (\varphi_1, \ldots, \varphi_p) \, d\pi$ need to be computed, a usual choice is $f = \pi$. All the following methods only require calls to an unnormalized version of $f$.

**Method of moments with Student distributions.** In this case, $(q_\theta)_{\theta \in \Theta}$ is the family of multivariate Student distributions with $\nu > 2$ degrees of freedom (fixed parameter). The parameter $\theta$ contains a location parameter $\mu$ and a scale parameter $\Sigma$. This family has two advantages: the parameter $\nu$ allows tuning for heavy tails, and estimation is easy because the moments of $q_\theta$ are explicitly related to $\theta$. A simple unbiased estimate for $\mu$ is $(1/N_t) \sum_{s=1}^{t} \sum_{i=1}^{n_s} x_{s,i} f(x_{s,i}) / q_{s-1}(x_{s,i})$, but, as mentioned in section 2.3, we prefer to use the normalized estimates (using the shortcut $q_s$ for $q_{\theta_s}$):

$$ \mu_t = \frac{\sum_{s=1}^{t} \sum_{i=1}^{n_s} x_{s,i} \, f(x_{s,i}) / q_{s-1}(x_{s,i})}{\sum_{s=1}^{t} \sum_{i=1}^{n_s} f(x_{s,i}) / q_{s-1}(x_{s,i})}, \qquad (7) $$

$$ \Sigma_t = \frac{\sum_{s=1}^{t} \sum_{i=1}^{n_s} (x_{s,i} - \mu_t)(x_{s,i} - \mu_t)^T \, f(x_{s,i}) / q_{s-1}(x_{s,i})}{\sum_{s=1}^{t} \sum_{i=1}^{n_s} f(x_{s,i}) / q_{s-1}(x_{s,i})}. \qquad (8) $$

**Generalized method of moments (GMM).** This approach includes the previous example. The policy is chosen according to a moment matching condition, i.e., $\int g \, q_\theta = \int g \, f$ for some function $g : \mathbb{R}^d \to \mathbb{R}^D$. For instance, $g$ might be given by $x \mapsto x$ or $x \mapsto x x^T$ (both are considered in the Student case). Following [17], choosing $\theta$ such that the empirical moments of $g$ coincide exactly with $E_\theta(g) = \int g \, q_\theta$ might be impossible. We rather compute $\theta_t$ as the minimizer of

$$ \left\| E_\theta(g) - \frac{\sum_{s=1}^{t} \sum_{i=1}^{n_s} g(x_{s,i}) \, f(x_{s,i}) / q_{s-1}(x_{s,i})}{\sum_{s=1}^{t} \sum_{i=1}^{n_s} f(x_{s,i}) / q_{s-1}(x_{s,i})} \right\|^2. $$

Equivalently,

$$ \theta_t \in \operatorname*{argmin}_{\theta \in \Theta} \sum_{s=1}^{t} \sum_{i=1}^{n_s} \left\| E_\theta(g) - g(x_{s,i}) \right\|^2 \frac{f(x_{s,i})}{q_{s-1}(x_{s,i})}, $$

which embraces the form given by (6), with $m_\theta = \| E_\theta(g) - g \|^2 f$.

**Kullback-Leibler approach.** Following [31, section 5.5], define the Kullback-Leibler risk as $r(\theta) = -\int \log(q_\theta) f$. The update of $\theta_t$ is done by minimizing the current estimate of $N_t \, r(\theta)$, given by

$$ R_t(\theta) = R_{t-1}(\theta) - \sum_{i=1}^{n_t} \log(q_\theta(x_{t,i})) \frac{f(x_{t,i})}{q_{t-1}(x_{t,i})}. \qquad (9) $$

**Variance approach.** Another approach, when $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ with $p = 1$, consists in minimizing the variance over the class of sampling policies. In this case, define $r(\theta) = \int \varphi^2 / q_\theta$, and follow a similar approach as before by minimizing, at each stage,

$$ R_t(\theta) = R_{t-1}(\theta) + \sum_{i=1}^{n_t} \frac{\varphi(x_{t,i})^2}{q_\theta(x_{t,i}) \, q_{t-1}(x_{t,i})}. \qquad (10) $$

This case represents a different situation from the Kullback-Leibler approach and the GMM. Here, the sampling policy is selected optimally with respect to a particular function $\varphi$, whereas for KL and GMM the sampling policy is driven by a targeted distribution $f$.

**Remark 3 (computation cost).** The update rule (6) might be computationally costly, but alternatives exist. For instance, when $q_\theta$ is a family of Gaussian distributions, closed formulas are available for (9). In fact we are in the case of weighted maximum likelihood estimation, for which we find exactly (7) and (8), with $\nu = \infty$. This is computed online at no cost. Another strategy to reduce the computation time is to use online stochastic gradient descent in (6).

**Remark 4 (block estimator).** In [22], the authors suggest updating $\theta$ based only on the particles from the last stage. For the Kullback-Leibler update, (9) would be replaced by $R_t(\theta) = -\sum_{i=1}^{n_t} \log(q_\theta(x_{t,i})) f(x_{t,i}) / q_{t-1}(x_{t,i})$. While this update makes the theoretical analysis easier (assuming that $n_t \to \infty$), its main drawback is that most of the computing effort is forgotten at each stage, as the previous computations are not used.
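To make Remark 3 concrete, here is a minimal sketch of the closed-form update for a Gaussian family, where weighted maximum likelihood reduces to the normalized weighted mean and covariance, i.e. (7) and (8) with $\nu = \infty$; the `history` layout (one pair of particles and sampler values per stage) is an illustrative convention.

```python
import numpy as np

def gaussian_kl_update(history, f):
    """Closed-form KL update for a Gaussian family: weighted MLE with
    weights f(x) / q_{s-1}(x), accumulated over all past stages."""
    xs = np.vstack([x for x, _ in history])                        # all particles, shape (N_t, d)
    w = np.concatenate([f(x) / q_vals for x, q_vals in history])   # importance weights f / q_{s-1}
    w = w / w.sum()
    mu = w @ xs                                                    # normalized weighted mean, cf. (7)
    xc = xs - mu
    Sigma = (w[:, None] * xc).T @ xc                               # normalized weighted covariance, cf. (8)
    return mu, Sigma
```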
### 3.2 Consistency of the sampling policy and asymptotic optimality of AIS

The updates described before, using GMM, the Kullback-Leibler divergence or the variance, all fit within the framework of empirical risk minimization given by (6), which, rewritten at the sample scale, gives

$$ R_j(\theta) = R_{j-1}(\theta) + \frac{m_\theta(x_j)}{q_{j-1}(x_j)}, \qquad \begin{cases} \theta_j \in \operatorname*{argmin}_{\theta \in \Theta} R_j(\theta), \quad q_j = q_{\theta_j}, & \text{if } j \in \{N_t : t \geq 1\}, \\ q_j = q_{j-1}, & \text{otherwise}. \end{cases} $$

The proof follows a standard approach from M-estimation theory [31, Theorem 5.7], but particular attention must be paid to the uniform law of large numbers because the sequences of interest are not i.i.d.

**Theorem 2 (consistency of the sampling policy).** Set $M(x) = \sup_{\theta \in \Theta} m_\theta(x)$. Assume that $\Theta \subset \mathbb{R}^q$ is a compact set and that

$$ \int M(x) \, dx < \infty, \qquad \int \sup_{\theta \in \Theta} q_\theta(x) \, dx < \infty, \qquad \text{and} \qquad \forall \theta \neq \theta^*, \quad r(\theta) = \int m_\theta > \int m_{\theta^*}. \qquad (11) $$

If moreover, for any $x \in \mathbb{R}^d$, the function $\theta \mapsto m_\theta(x)$ is continuous on $\Theta$, then $\theta_n \to \theta^*$, a.s.

The conclusion of Theorem 2 permits checking the conditions of Theorem 1. This leads to the following result.

**Theorem 3 (asymptotic optimality of AIS).** Under the assumptions of Theorem 2, if there exists $\eta > 0$ such that $\sup_{\theta \in \Theta} \int \|\varphi\|^{2+\eta} / q_\theta^{1+\eta} < \infty$, we have

$$ \sqrt{n} \left( I_n - \int \varphi \right) \xrightarrow{d} \mathcal{N}\left(0, V(q_{\theta^*}, \varphi)\right), $$

where $V(\cdot, \cdot)$ is defined in Equation (3).

**Remark 5 (the oracle property).** From (11), we deduce that $q_{\theta^*}$ is the unique minimizer of the risk function $r$. The risk function based on GMM or the Kullback-Leibler approach (described in section 3.1) is derived from a certain targeted density $f$, in such a way that if $q_\theta = f$, then $r(\theta)$ is a minimum. Hence, under the identifiability conditions of Theorem 2, if in addition $f \in \{q_\theta : \theta \in \Theta\}$, we have that $q_{\theta^*} = f$. This means that, asymptotically, AIS achieves the same variance as the oracle importance sampling method based on the (fixed) sampler $f$.

**Corollary 2 (asymptotic optimality for normalized AIS).** Under the assumptions of Theorem 2, if there exists $\eta > 0$ such that $\sup_{\theta \in \Theta} \int \|(\varphi^T \pi, \pi)\|^{2+\eta} / q_\theta^{1+\eta} < \infty$, we have

$$ \sqrt{n} \left( I_n^{(\mathrm{norm})} - \int \varphi \pi \right) \xrightarrow{d} \mathcal{N}\left(0, U \, V(q_{\theta^*}, (\varphi^T \pi, \pi)^T) \, U^T\right), $$

with $U$ defined in Corollary 1 and $V(\cdot, \cdot)$ defined in Equation (3).

## 4 Weighted AIS

We follow ideas from [9, section 4] to develop a novel method to estimate $\int \varphi \pi$. The method, called weighted adaptive importance sampling (wAIS), automatically re-weights each sample depending on its accuracy. In practice it makes it possible to forget poor samples generated during the early stages. For clarity, suppose that $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ with $p = 1$. Define the weighted estimate, for any function $\psi$,

$$ I_T^{(\alpha)}(\psi) = N_T^{-1} \sum_{t=1}^{T} \alpha_{T,t} \sum_{i=1}^{n_t} \frac{\psi(x_{t,i})}{q_{t-1}(x_{t,i})}. $$

Note that for any sequence $(\alpha_{T,1}, \ldots, \alpha_{T,T})$ such that $\sum_{t=1}^{T} n_t \alpha_{T,t} = N_T$, $I_T^{(\alpha)}(\psi)$ is an unbiased estimate of $\int \psi$. Let $\sigma_t^2 = \mathbb{E}[V(q_{t-1}, \varphi)]$, where $V(\cdot, \cdot)$ is defined in Equation (3). The variance of $I_T^{(\alpha)}(\varphi)$ is $N_T^{-2} \sum_{t=1}^{T} \alpha_{T,t}^2 n_t \sigma_t^2$, which, minimized with respect to $(\alpha)$, gives $\alpha_{T,t} \propto \sigma_t^{-2}$ for each $t = 1, \ldots, T$. In [9], a re-weighting is proposed using estimates of $\sigma_t$ (based on the sample of the $t$-th stage). We propose the weights

$$ \alpha_{T,t} \propto \hat\sigma_t^{-2}, \qquad \hat\sigma_t^2 = \frac{1}{n_t} \sum_{i=1}^{n_t} \left( \frac{\pi(x_{t,i})}{q_{t-1}(x_{t,i})} - 1 \right)^2, \qquad (12) $$

normalized so as to satisfy the constraint $\sum_{t=1}^{T} n_t \alpha_{T,t} = N_T$. The wAIS estimate is the (weighted and normalized) AIS estimate given by

$$ I_T^{(\alpha)}(\varphi \pi) \, / \, I_T^{(\alpha)}(\pi). \qquad (13) $$

In contrast with the approach in [9], because our weights are based on the estimated variance of $\pi / q_{t-1}$, our proposal is free from the integrand $\varphi$ and thus reflects the overall quality of the $t$-th sample. This makes sense whenever many functions need to be integrated, which makes a re-weighting that depends on a specific function inappropriate. Another difference with [9] is that we use the true expectation, $1$, in the estimate of the variance, rather than the estimate $(1/n_t) \sum_{i=1}^{n_t} \pi(x_{t,i}) / q_{t-1}(x_{t,i})$. This makes it possible to avoid the situation (common in high-dimensional settings) where a poor sampler $q_{t-1}$ is such that $\pi(x_{t,i}) / q_{t-1}(x_{t,i}) \approx 0$ for all $i = 1, \ldots, n_t$, implying that the classical estimate of the variance is near $0$, leading (unfortunately) to a large weight.
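A minimal sketch of the wAIS estimate (13) with the stage weights of (12), assuming (our reading of the text) that the proportionality $\alpha_{T,t} \propto \hat\sigma_t^{-2}$ is resolved through the constraint $\sum_t n_t \alpha_{T,t} = N_T$; `history` again stores per-stage particles and sampler values.

```python
import numpy as np

def wais_estimate(phi, pi, history):
    """Weighted, normalized AIS estimate (13) with stage weights alpha_{T,t} from (12)."""
    n_t = np.array([len(x) for x, _ in history])
    # per-stage variance estimate of pi/q_{t-1}, centered at its true expectation 1
    sig2 = np.array([np.mean((pi(x) / q_vals - 1.0) ** 2) for x, q_vals in history])
    alpha = 1.0 / sig2
    alpha *= n_t.sum() / (n_t * alpha).sum()        # enforce sum_t n_t * alpha_{T,t} = N_T
    num = den = 0.0
    for a, (x, q_vals) in zip(alpha, history):
        num += a * np.sum(phi(x) * pi(x) / q_vals)  # contribution to I_T^(alpha)(phi * pi)
        den += a * np.sum(pi(x) / q_vals)           # contribution to I_T^(alpha)(pi)
    return num / den
```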
## 5 Numerical experiments

In this section, we study a toy Gaussian example to illustrate the practical behavior of AIS. Special interest is dedicated to the effect of the dimension $d$, the practical choice of $(n_t)$, and the gain provided by wAIS, introduced in the previous section. We set $N_T = 10^5$ and we consider $d = 2, 4, 8, 16$. The code is made available at https://github.com/portierf/AIS.

The aim is to compute $\mu^* = \int x \, \varphi_{\mu^*, \sigma^*}(x) \, dx$, where $\varphi_{\mu, \sigma} : \mathbb{R}^d \to \mathbb{R}$ is the probability density of $\mathcal{N}(\mu, \sigma^2 I_d)$, $\mu^* = (5, \ldots, 5)^T \in \mathbb{R}^d$ and $\sigma^* = 1$. The sampling policy is taken in the collection of multivariate Student distributions with $\nu = 3$ degrees of freedom, denoted by $\{q_{\mu, \Sigma_0} : \mu \in \mathbb{R}^d\}$, with $\Sigma_0 = \sigma_0 I_d (\nu - 2)/\nu$ and $\sigma_0 = 5$. The initial sampling policy is set with $\mu_0 = (0, \ldots, 0) \in \mathbb{R}^d$. The mean $\mu_t$ is updated at each stage $t = 1, \ldots, T$ following the GMM approach described in section 3, leading to the simple update formula

$$ \mu_t = \frac{\sum_{s=1}^{t} \sum_{i=1}^{n_s} x_{s,i} \, f(x_{s,i}) / q_{s-1}(x_{s,i})}{\sum_{s=1}^{t} \sum_{i=1}^{n_s} f(x_{s,i}) / q_{s-1}(x_{s,i})}, \qquad \text{with } f = \varphi_{\mu^*, \sigma^*}. $$

[Figure 1: From left to right and top to bottom, $d = 2, 4, 8, 16$. AIS and wAIS are computed with $T = 50$ with a constant allocation policy $n_t = 2 \times 10^3$. Plotted is the logarithm of the MSE (computed for each method over 100 replicates) with respect to the number of requests to the integrand. Methods shown: AIS, wAIS, AMH, oracle.]

In section C of the supplementary file, other results considering the update of the variance within the Student family are provided. As the results for the unnormalized approaches were far from being competitive with the normalized ones, we consider only normalized estimators. We also tried the weights proposed in [9], but the results were not competitive. The (normalized) AIS estimate of $\mu^*$ is simply given by $\mu_t$ as displayed above. The wAIS estimate of $\mu^*$ is computed using (13) with weights (12). We also include the adaptive Metropolis-Hastings (AMH) algorithm proposed in [15], where the proposal, assuming that $X_{i-1} = x$, is given by $\mathcal{N}\left(x, (2.4)^2 (C_i + \epsilon I_d)/d\right)$ if $i > i_0$, and $\mathcal{N}(x, I_d)$ if $i \leq i_0$, with $C_i$ the empirical covariance matrix of $(X_0, X_1, \ldots, X_{i-1})$, $i_0 = 1000$ and $\epsilon = 0.05$ (other configurations, for instance using only half of the chain, have been tested without improving the results).

[Figure 2: From left to right and top to bottom, $d = 2, 4, 8, 16$. AIS and wAIS are computed with $T = 5, 20, 50$, each with a constant allocation policy, resp. $n_t = 2 \times 10^4, 5 \times 10^3, 2 \times 10^3$. Plotted is the logarithm of the MSE (computed for each method over 100 replicates) with respect to the number of requests to the integrand. Methods shown: wAIS, AMH, oracle.]
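For illustration, here is a compact sketch of the plain AIS part of this toy experiment (mean-only update, fixed scale); it is not the released code at the URL above, and the wAIS re-weighting as well as the AMH and oracle baselines are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal, multivariate_t

def toy_ais(d=4, T=50, n_t=2000, nu=3, sigma0=5.0, seed=0):
    """Estimate the mean of N(mu*, I_d), mu* = (5,...,5), with a Student-t
    sampling family whose location follows the normalized GMM/mean update."""
    rng = np.random.default_rng(seed)
    mu_star = np.full(d, 5.0)
    target = multivariate_normal(mean=mu_star, cov=np.eye(d))      # f = phi_{mu*, sigma*}
    Sigma0 = sigma0 * np.eye(d) * (nu - 2) / nu                    # fixed scale matrix
    mu = np.zeros(d)                                               # initial location mu_0
    num, den = np.zeros(d), 0.0
    for _ in range(T):
        q = multivariate_t(loc=mu, shape=Sigma0, df=nu)            # current policy q_{t-1}
        x = q.rvs(size=n_t, random_state=rng)                      # (i) explore
        w = target.pdf(x) / q.pdf(x)                               # importance weights f / q_{t-1}
        num += w @ x
        den += w.sum()
        mu = num / den                                             # (ii) normalized mean update
    return mu                                                      # normalized AIS estimate of mu*

print(toy_ais())
```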
Finally, we consider a so-called oracle method: importance sampling with the fixed policy $q_{\mu^*, \Sigma^*}$, with $\Sigma^* = \sigma^* I_d (\nu - 2)/\nu$. For each method returning an estimate $\hat\mu$, the mean squared error (MSE) is computed as the average of $\|\hat\mu - \mu^*\|^2$ over 100 replicates of $\hat\mu$.

In Figure 1, we compare the evolution of all the mentioned algorithms with respect to the stages $t = 1, \ldots, T = 50$ with constant allocation policy $n_t = 2 \times 10^3$ (for AIS and wAIS). The clear winner is wAIS. Note that the oracle policy $q_{\mu^*, \Sigma^*}$, which is not the optimal one (see section B.3 in the supplementary material), seems to give worse results than the policy $q_{\mu^*, \Sigma_0}$, as wAIS (which uses $\Sigma_0$) performs better than the oracle after some time. In Figure 2, we examine three constant allocation policies given by $T = 50$ and $n_t = 2 \times 10^3$; $T = 20$ and $n_t = 5 \times 10^3$; $T = 5$ and $n_t = 2 \times 10^4$. We clearly notice that the rate of convergence is influenced by the number of update steps (at least at the beginning). The results call for updating the sampling policy as soon as possible. This empirical evidence supports the theoretical framework studied in the paper, which imposes no condition on the growth of $(n_t)$.

## Acknowledgments

The authors are grateful to Rémi Bardenet for useful comments and additional references.

## References

[1] Rémi Bardenet and Adrien Hardy. Monte Carlo with determinantal point processes. arXiv preprint arXiv:1605.00361, 2016.

[2] Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447-459, 2008.

[3] Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907-929, 2004.

[4] Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics, 32(6):2385-2411, 2004.

[5] Jean Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798-812, 2012.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411-436, 2006.

[7] Bernard Delyon, François Portier, et al. Integral approximation by kernel smoothing. Bernoulli, 22(4):2177-2208, 2016.

[8] Randal Douc, Arnaud Guillin, J.-M. Marin, and Christian P. Robert. Convergence of adaptive mixtures of importance sampling schemes. The Annals of Statistics, pages 420-448, 2007.

[9] Randal Douc, Arnaud Guillin, J.-M. Marin, and Christian P. Robert. Minimum variance importance sampling via population Monte Carlo. ESAIM: Probability and Statistics, 11:427-447, 2007.

[10] Randal Douc and Eric Moulines. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. The Annals of Statistics, pages 2344-2376, 2008.

[11] Víctor Elvira, Luca Martino, David Luengo, and Mónica F. Bugallo. Generalized multiple importance sampling. arXiv preprint arXiv:1511.03095, 2015.

[12] Akram Erraqabi, Michal Valko, Alexandra Carpentier, and Odalric Maillard. Pliable rejection sampling. In International Conference on Machine Learning, pages 2121-2129, 2016.

[13] Michael Evans and Tim Swartz. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford Statistical Science Series. Oxford University Press, Oxford, 2000.

[14] John Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica: Journal of the Econometric Society, pages 1317-1339, 1989.
[15] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223-242, 2001.

[16] John Michael Hammersley and David Christopher Handscomb. General principles of the Monte Carlo method. In Monte Carlo Methods, pages 50-75. Springer, 1964.

[17] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029-1054, 1982.

[18] Tatsunori B. Hashimoto, Steve Yadlowsky, and John C. Duchi. Derivative free optimization via repeated classification. arXiv preprint arXiv:1804.03761, 2018.

[19] Tang Jie and Pieter Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems, pages 1000-1008, 2010.

[20] Tuen Kloek and Herman K. Van Dijk. Bayesian estimates of equation system parameters: an application of integration by Monte Carlo. Econometrica: Journal of the Econometric Society, pages 1-19, 1978.

[21] Qi Lou, Rina Dechter, and Alexander T. Ihler. Dynamic importance sampling for anytime bounds of the partition function. In Advances in Neural Information Processing Systems, pages 3199-3207, 2017.

[22] Jean-Michel Marin, Pierre Pudlo, and Mohammed Sedki. Consistency of the adaptive multiple importance sampling. arXiv preprint arXiv:1211.2548, 2012.

[23] Jan C. Neddermeyer. Computationally efficient nonparametric importance sampling. Journal of the American Statistical Association, 104(486):788-802, 2009.

[24] Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. J. R. Statist. Soc. B, 79(3):695-718, 2017.

[25] Man-Suk Oh and James O. Berger. Adaptive importance sampling in Monte Carlo integration. J. Statist. Comput. Simulation, 41(3-4):143-168, 1992.

[26] Art Owen and Yi Zhou. Safe and effective importance sampling. J. Amer. Statist. Assoc., 95(449):135-143, 2000.

[27] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages 1607-1612, Atlanta, 2010.

[28] François Portier and Johan Segers. Monte Carlo integration with a growing number of control variates. arXiv preprint arXiv:1801.01797, 2018.

[29] Jean-Francois Richard and Wei Zhang. Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2):1385-1411, 2007.

[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

[31] A. W. van der Vaart. Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.

[32] Eric Veach and Leonidas J. Guibas. Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 419-428. ACM, 1995.

[33] Ping Zhang. Nonparametric importance sampling. J. Amer. Statist. Assoc., 91(435):1245-1253, 1996.

[34] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1-9, 2015.