# Efficient Gradient-Free Variational Inference using Policy Search

Oleg Arenz¹, Mingjun Zhong², Gerhard Neumann¹,³

¹Computational Learning for Autonomous Systems, TU Darmstadt, Darmstadt, Germany. ²Machine Learning Lab, University of Lincoln, Lincoln, UK. ³Lincoln Center for Autonomous Systems, University of Lincoln, Lincoln, UK. Correspondence to: Oleg Arenz.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Inference from complex distributions is a common problem in machine learning needed for many Bayesian methods. We propose an efficient, gradient-free method for learning general GMM approximations of multimodal distributions based on recent insights from stochastic search methods. Our method establishes information-geometric trust regions to ensure efficient exploration of the sampling space and stability of the GMM updates, allowing for efficient estimation of multivariate Gaussian variational distributions. For GMMs, we apply a variational lower bound to decompose the learning objective into sub-problems given by learning the individual mixture components and the coefficients. The number of mixture components is adapted online in order to allow for arbitrarily exact approximations. We demonstrate on several domains that we can learn significantly better approximations than competing variational inference methods and that the quality of samples drawn from our approximations is on par with samples created by state-of-the-art MCMC samplers that require significantly more computational resources.

1. Introduction

We consider the problem of sampling or inference using a complex probability distribution $p^*(x) = \tilde{p}(x)/Z$ for which we can evaluate $\tilde{p}(x)$ but not the normalization constant $Z = \int_x \tilde{p}(x)\,dx$. This problem is ubiquitous in machine learning. For example, in Bayesian inference, $\tilde{p}(x)$ corresponds to the product of prior and likelihood. As we cannot sample directly from the distribution $p^*(x)$ or use it for inference, a common approach is to use Variational Inference (VI) to approximate the target distribution $p^*(x)$ with a tractable distribution $p$ such as multivariate Gaussians (Blei et al., 2017; Regier et al., 2017) or Gaussian Mixture Models (GMMs) (Miller et al., 2017; Guo et al., 2016; Zobay, 2014). The optimization problem to obtain this approximation is commonly framed as minimizing the reverse Kullback-Leibler (KL) divergence

$$\mathrm{KL}(p\,\|\,p^*) = \int_x p(x;\theta)\,\log\frac{p(x;\theta)}{p^*(x)}\,dx \qquad (1)$$

with respect to the parameters $\theta$ of the approximation. This objective is typically evaluated on samples drawn from $p(x;\theta)$ and optimized by stochastic optimization algorithms (Fan et al., 2015; Gershman et al., 2012). However, in order to perform the KL minimization efficiently, $p(x)$ is often restricted to belong to a simple family of models or is assumed to have non-correlating degrees of freedom (Blei et al., 2017; Peterson & Hartman, 1989), which is known as the mean field approximation. Unfortunately, such restrictions can introduce significant approximation errors.

Our approach focuses on learning multivariate Gaussian Mixture Models (GMMs) with full covariance matrices for approximating the target distribution. GMMs are desirable for VI because they are capable of representing any continuous probability density function arbitrarily well, while inference with GMMs is relatively cheap.
Naturally, variational inference with GMM approximations has been considered in the past. However, in order to make the minimization of objective (1) feasible, previous work either assumed factorized (Jaakkola & Jordan, 1998; Bishop et al., 1998) or isotropic (Gershman et al., 2012) mixture components, or applied boosting by successively adding and optimizing new components while keeping previously added components fixed (Miller et al., 2017; Guo et al., 2016).

As areas of high density $\tilde{p}(x)$ are initially unknown, variational inference can essentially be seen as a search problem that is afflicted by the exploration-exploitation dilemma typical for reinforcement learning or policy search problems. The algorithm needs to explore the sample space in order to ensure that all relevant areas are covered, while it also needs to exploit the current approximation $p(x;\theta)$ in order to fine-tune it in areas of high density. This exploration-exploitation based view is so far underdeveloped in the variational inference community but essential to achieve good approximations with a small number of function evaluations. Our method transfers information-geometric insights used in policy search (Deisenroth et al., 2013) and reinforcement learning (Sutton & Barto, 1998) for solving such exploration-exploitation dilemmas efficiently. We therefore call our algorithm Variational Inference by Policy Search (VIPS). We extend the stochastic search method MORE (Abdolmaleki et al., 2016) to the variational inference setup and show that this version of MORE can efficiently learn single multivariate Gaussian variational distributions. We further extend the algorithm for training GMM distributions using a variational lower bound that enables us to train the mixture components and the coefficients individually. Our optimization starts from an initial mixture model (e.g., a Gaussian prior), and the number of components is adapted online by adding components in promising regions and by deleting components with negligible weight. We compare our method to other state-of-the-art VI methods (Miller et al., 2017; Gershman et al., 2012) and show that our algorithm can find approximations of much higher quality with significantly fewer evaluations of $\tilde{p}(x)$. We further compare to existing sampling methods (Murray et al., 2010; Neal, 2003; Calderhead, 2014; Liu & Wang, 2016) and show that we can achieve similar sample quality with orders of magnitude fewer function evaluations.

2. Preliminaries

We will now formalize the problem statement and introduce relevant concepts from information-geometric policy search.

2.1. Problem Formulation

As stated above, we want to minimize the KL divergence between the approximation $p$ and the target distribution $p^*$. This direct minimization is infeasible as the normalization constant of $p^*$ is unknown. However, it can easily be shown that the objective can be rewritten as

$$\mathrm{KL}(p\,\|\,p^*) = \underbrace{\int_x p(x;\theta)\,\log\frac{p(x;\theta)}{\tilde{p}(x)}\,dx}_{-\mathrm{ELBO}} + \log Z, \qquad (2)$$

where the term $\log Z$ can be ignored as it does not depend on $\theta$. The integral term is the negative of the evidence lower bound (ELBO) used in many variational inference methods (Blei et al., 2017). As we want to use insights from policy search, we rewrite the ELBO as a reward maximization problem with an additional entropy objective,

$$L(\theta) = \int_x p(x;\theta)\,R(x)\,dx + H(p), \qquad (3)$$

where the reward is given by $R(x) = \log \tilde{p}(x)$ and $H(p) = -\int_x p(x;\theta)\log p(x;\theta)\,dx$ is the entropy of $p$.

Please note the change of sign due to the switch from a minimization to a maximization problem. Hence, policy search algorithms can directly be applied to VI if they can incorporate an additional entropy objective. The ELBO objective can typically not be evaluated in closed form but is estimated from samples drawn from $p(x;\theta)$. Stochastic optimization can be used to optimize this sample-based objective (Blei et al., 2017).
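As a minimal illustration (not part of the paper's implementation), the sample-based estimate of objective (3) for a single multivariate Gaussian could look as follows; the function name and the default sample size are illustrative choices.

```python
import numpy as np

def estimate_objective(log_p_tilde, mean, cov, n_samples=1000, rng=None):
    """Monte-Carlo estimate of L(theta) = E_p[R(x)] + H(p) from Eq. (3) for a
    single Gaussian p(x; mean, cov), where R(x) = log p_tilde(x) is the
    unnormalized log-density of the target."""
    rng = np.random.default_rng() if rng is None else rng
    d = mean.shape[0]
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    expected_reward = np.mean([log_p_tilde(x) for x in samples])
    # Entropy of a multivariate Gaussian: 0.5 * log det(2 * pi * e * cov)
    entropy = 0.5 * (d * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(cov)[1])
    return expected_reward + entropy
```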
2.2. Information-Geometric Distribution Updates

Policy search algorithms must solve the exploration-exploitation dilemma when updating the policy, i.e., they must control how much information from the current sample set is integrated into the policy versus how much we rely on keeping the current exploration strategy to generate more information. Information-geometric trust regions can be used to effectively control this exploration-exploitation trade-off (Peters et al., 2010; Schulman et al., 2015; Abdolmaleki et al., 2017; 2016). We will heavily rely on a recent stochastic search algorithm called MORE (Abdolmaleki et al., 2016), which for the first time allows for a closed-form solution of information-geometric trust regions for Gaussian distributions. Stochastic search is a special case of policy search where the policy can be interpreted as a search distribution $p(x)$ in the parameter space $x$ of a low-level policy.

The MORE algorithm solves a constrained optimization problem where the objective is given by maximizing the average reward. The information-geometric trust region is implemented by limiting the KL divergence between the old and new search distribution, which is equivalent to limiting the information gain of the policy update. Moreover, a second constraint limits the entropy loss of the distribution update to avoid a collapse of the search distribution. The resulting optimization program has the following form:

$$\underset{p(x)}{\text{maximize}} \int_x p(x)\,R(x)\,dx, \quad \text{s.t.} \quad \mathrm{KL}(p\,\|\,q) \le \epsilon, \quad H(q) - H(p) \le \gamma. \qquad (4)$$

Here, $q(x)$ is the old search distribution, $R(x)$ is the reward and $H$ denotes the entropy of a distribution. The optimization problem can be solved using Lagrangian multipliers. The solution for $p(x)$ is given by

$$p(x) \propto q(x)^{\frac{\eta}{\eta+\omega}} \exp\big(R(x)\big)^{\frac{1}{\eta+\omega}}, \qquad (5)$$

where $\eta$ and $\omega$ are Lagrangian multipliers, which can be found by solving the convex dual optimization problem (Abdolmaleki et al., 2016).

2.3. Fitting a Reward Surrogate

In order to use Equation 5, Gaussianity needs to be enforced. This can either be done by fitting a Gaussian (Daniel et al., 2012; Kupcsik et al., 2017) on samples that are weighted according to Equation 5, which is prone to overfitting (as we need to fit a full covariance matrix), or by fitting a surrogate $\tilde{R}(x) \approx R(x)$ that is compatible with the Gaussian distribution (Abdolmaleki et al., 2016). As the Gaussian distribution is log-linear in quadratic features of $x$, the surrogate needs to be a quadratic approximation of $R(x)$, i.e., $\tilde{R}(x) = -\tfrac{1}{2} x^\top A x + a^\top x + a_0$. The parameters $A$, $a$ and $a_0$ are learned from the current sample set using linear regression. The surrogate therefore only needs to be a local approximation of the reward function. Using $\tilde{R}(x)$ for $R(x)$ in Equation 5 yields a Gaussian distribution for $p(x)$ where the mean $\mu$ and the full covariance matrix $\Sigma$ can be evaluated in closed form. For the exact equations we refer to Abdolmaleki et al. (2016).
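As a rough illustration of this step, the sketch below fits a quadratic surrogate by (weighted) least squares and forms the resulting closed-form Gaussian of Eq. (5). It assumes the parameterization $\tilde{R}(x) = -\tfrac{1}{2}x^\top A x + a^\top x + a_0$, so that $A$ enters like a precision matrix; the exact update equations and the dual optimization of $\eta$ and $\omega$ are given by Abdolmaleki et al. (2016) and are omitted here, so the multipliers are simply passed in as arguments. Function names are illustrative.

```python
import numpy as np
from itertools import combinations_with_replacement

def fit_quadratic_surrogate(X, R, weights=None):
    """Least-squares fit of R_tilde(x) = -0.5 x^T A x + a^T x + a0 to rewards R
    observed at sample locations X (N, D). Optional per-sample weights allow
    importance-weighted regression (cf. Section 3.5). Returns (A, a, a0)."""
    n, d = X.shape
    idx = list(combinations_with_replacement(range(d), 2))
    quad = np.stack([X[:, i] * X[:, j] for i, j in idx], axis=1)
    Phi = np.hstack([np.ones((n, 1)), X, quad])          # design matrix
    y = np.asarray(R, dtype=float)
    if weights is not None:
        sw = np.sqrt(weights)
        Phi, y = Phi * sw[:, None], y * sw
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    a0, a = beta[0], beta[1:d + 1]
    B = np.zeros((d, d))                                  # symmetric quadratic part
    for c, (i, j) in enumerate(idx):
        B[i, j] = B[j, i] = beta[d + 1 + c] if i == j else 0.5 * beta[d + 1 + c]
    return -2.0 * B, a, a0                                # x^T B x == -0.5 x^T A x

def more_gaussian_update(mean_q, cov_q, A, a, eta, omega):
    """Gaussian implied by Eq. (5), p(x) proportional to
    q(x)^(eta/(eta+omega)) * exp(R_tilde(x))^(1/(eta+omega)),
    for the old Gaussian q = N(mean_q, cov_q) and the surrogate (A, a)."""
    prec_q = np.linalg.inv(cov_q)
    prec_new = (eta * prec_q + A) / (eta + omega)         # must be positive definite
    cov_new = np.linalg.inv(prec_new)
    mean_new = cov_new @ ((eta * prec_q @ mean_q + a) / (eta + omega))
    return mean_new, cov_new
```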
3. Variational Inference by Policy Search

As VI uses samples to evaluate the ELBO, VI is essentially a search problem where we need to find samples in areas of high density of $p^*$ but also make sure that these samples are distributed with the correct entropy. Interpreting VI as a search problem, the current approximation $p(x;\theta)$ is used to search in the space of $x$. We can obtain samples from our current search distribution $p(x;\theta)$ and use them to update $p(x;\theta)$ such that it becomes a better approximation of $p^*(x)$. Hence, VI is afflicted by the exploration-exploitation dilemma in a similar way as reinforcement learning and policy search algorithms.

In this paper, we want to use information-geometric trust regions for controlling the exploration-exploitation trade-off in the VI objective. Information-geometric trust regions have been shown to yield efficient closed-form updates for Gaussian distributions (Abdolmaleki et al., 2016). However, in order to cope with more complex, multimodal distributions, we extend the information-geometric updates to Gaussian Mixture Models. Hence, our variational distribution is given by a GMM

$$p(x) = \sum_o p(o)\,p(x|o),$$

where $o$ is the index of the mixture component, $p(o)$ are the mixture weights and $p(x|o) = \mathcal{N}(\mu_o, \Sigma_o)$ is a multivariate normal distribution with mean $\mu_o$ and full covariance matrix $\Sigma_o$. To improve readability, we will omit the parameter vector $\theta$ when writing the distribution $p(x)$ in most cases. However, as we are dealing with mixture models, it should be noted that $\theta$ consists of the mean vectors, covariance matrices and mixture weights of all components.

We will first introduce our objective including the trust regions and subsequently introduce a variational lower bound to decompose the objective into tractable optimization problems for the individual mixture components and the mixture coefficients. The number of components is automatically adapted by deleting components that have low weight and by creating new components in promising regions.

3.1. Objective

We approximate the target distribution $p^*(x)$ by maximizing the objective $L(\theta)$ given in Equation 3 on the current set of samples. As we will use local approximations of the reward function around each component, we introduce individual trust regions for each component and for the coefficients. The resulting optimization problem has the form

$$\underset{p(x|o),\,p(o)}{\text{maximize}} \quad J(\theta) = \int_x p(x)\big(R(x) - \log p(x)\big)\,dx,$$
$$\text{subject to} \quad \forall o: \int_x p(x|o)\,\log\frac{p(x|o)}{q(x|o)}\,dx \le \epsilon(o), \qquad \sum_o p(o)\,\log\frac{p(o)}{q(o)} \le \epsilon_w,$$

where $q(o)$ and $q(x|o)$ are the old mixture weights and components, respectively, and $\epsilon_w$ and $\epsilon(o)$ upper-bound the corresponding KL divergences. However, the occurrence of the log-density of the GMM, $\log p(x)$, prevents us from updating each component independently.

3.2. Variational Lower Bound

By introducing an auxiliary distribution $\tilde{p}(o|x)$, the objective of the optimization problem can be decomposed into a lower bound $U(\theta, \tilde{p})$ and an expected KL term, namely

$$J(\theta) = U(\theta, \tilde{p}) + \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(o|x)\,\|\,\tilde{p}(o|x)\big)\Big], \qquad (6)$$
$$U(\theta, \tilde{p}) = \sum_o \int_x p(x,o)\Big(R(x) - \log p(x,o) + \log \tilde{p}(o|x)\Big)\,dx,$$
$$\mathrm{KL}\big(p(o|x)\,\|\,\tilde{p}(o|x)\big) = \sum_o p(o|x)\,\log\frac{p(o|x)}{\tilde{p}(o|x)}.$$

We also used the identity $p(x,o) = p(x|o)\,p(o)$ to keep the notation uncluttered. Eq. 6 can easily be verified by using Bayes' theorem, i.e., by setting $\log \tilde{p}(o|x) = \log p(x,o) - \log p(x)$, all introduced terms vanish and only the original objective remains; see the supplement. The second term in Eq. 6 is always greater than or equal to zero. Hence, $U(\theta, \tilde{p})$ is a lower bound on the objective.
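For completeness, the verification that is deferred to the supplement amounts to the following short calculation, written here in the notation introduced above:

```latex
\begin{align*}
U(\theta,\tilde p) + \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(o|x)\,\|\,\tilde p(o|x)\big)\right]
  &= \sum_o \int_x p(x,o)\big(R(x) - \log p(x,o) + \log \tilde p(o|x)\big)\,dx \\
  &\quad + \sum_o \int_x p(x)\,p(o|x)\big(\log p(o|x) - \log \tilde p(o|x)\big)\,dx \\
  &= \sum_o \int_x p(x,o)\big(R(x) - \log p(x,o) + \log p(o|x)\big)\,dx \\
  &= \int_x p(x)\big(R(x) - \log p(x)\big)\,dx \;=\; J(\theta),
\end{align*}
```

where the second equality uses $p(x,o) = p(x)\,p(o|x)$ and the last one uses $\log p(x,o) - \log p(o|x) = \log p(x)$. In particular, the $\log \tilde{p}(o|x)$ terms cancel, so the decomposition holds for any choice of the auxiliary distribution.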
This decomposition is closely related to the decomposition used in the expectation maximization (EM) algorithm (Bishop, 2006). However, as we want to minimize the reverse KL instead of the maximum likelihood (which corresponds to the forward KL) that is the objective in standard EM, the KL used for $\tilde{p}(o|x)$ also needs to be reversed to obtain the lower bound. Following the standard EM procedure, we can maximize the objective function by iteratively maximizing the lower bound (M-step) and tightening the lower bound (E-step). The E-step can be computed by setting $\tilde{p}(o|x) = p(x|o)p(o)/p(x)$ using the current GMM. Hence, after the E-step the lower bound is tight, as the KL is set to 0. Consequently, improving the lower bound also improves the original objective $J(\theta)$ at each EM iteration. It can easily be seen that the lower bound no longer contains the term $\log p(x)$ and decomposes into individual terms for the weight distribution and the single components, such that the updates can be performed independently. Moreover, an interesting observation is that the term $\log \tilde{p}(o|x)$ acts as an additional reward. After each sampling process from $p(x)$, we perform 10 EM iterations to fine-tune the mixture components with the newly obtained samples.

3.3. M-step for Component Updates

When updating a single component $p(x|o)$, maximizing $U$ is equivalent to maximizing

$$U_o(\mu_o, \Sigma_o) = \int_x p(x|o)\Big(R(x) + \log \tilde{p}(o|x)\Big)\,dx + H\big(p(x|o)\big) + \eta(o)\Big(\epsilon(o) - \int_x p(x|o)\,\log\frac{p(x|o)}{q(x|o)}\,dx\Big),$$

where we already added the Lagrangian multiplier $\eta(o)$ of the KL constraint for component $o$. This optimization problem is very similar to the optimization problem that is solved in MORE (Eq. 4), which becomes evident when defining

$$r_o(x) = R(x) + \log \tilde{p}(o|x) \qquad (7)$$

as reward function. In fact, the only difference compared to MORE is that the entropy of the component enters the optimization via a constant factor of 1 rather than a Lagrangian multiplier. Hence, we only need to optimize the Lagrangian multiplier of the KL constraint, $\eta(o)$, as $\omega$ is set to 1 in the MORE equations, see Eq. 5. We refer to the supplement and to the original MORE paper (Abdolmaleki et al., 2016) for the closed-form updates of this optimization problem based on quadratic reward surrogates $\tilde{r}_o(x) \approx r_o(x)$. Note that, in contrast to the original MORE paper, we use standard linear regression to obtain the quadratic models. The dimensionality reduction technique reported by Abdolmaleki et al. (2016) was not needed to achieve satisfactory results.

3.4. M-step for Weight Updates

Maximizing $U$ with respect to $p(o)$ is equivalent to maximizing

$$U_w\big(p(o)\big) = \sum_o p(o)\,r_w(o) + H\big(p(o)\big) + \eta_w\Big(\epsilon_w - \sum_o p(o)\,\log\frac{p(o)}{q(o)}\Big),$$

where we already added the Lagrangian multiplier for the KL constraint on the weights and

$$r_w(o) = \int_x p(x|o)\,r_o(x)\,dx + H\big(p(x|o)\big).$$

This optimization problem corresponds to optimizing a discrete distribution with an additional entropy objective. The solution for $p(o)$ is given by

$$p(o) \propto q(o)^{\frac{\eta_w}{\eta_w+1}} \exp\big(r_w(o)\big)^{\frac{1}{\eta_w+1}}. \qquad (8)$$

Please refer to the supplement for the dual function and gradient of this optimization problem.
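To illustrate the weight update of Eq. (8), the following sketch computes the new weights for a given multiplier and, instead of optimizing the dual function mentioned above, searches $\eta_w$ directly by bisection until the KL constraint on the weights is satisfied. This is a simplification, not the paper's dual-based solver, and it assumes that the KL decreases monotonically as $\eta_w$ grows (the update approaches the old weights $q$).

```python
import numpy as np

def weight_update(q, r_w, eps_w, eta_max=1e6, iters=60):
    """Trust-region weight update, Eq. (8):
    p(o) propto q(o)^(eta/(eta+1)) * exp(r_w(o))^(1/(eta+1)),
    with eta chosen such that KL(p || q) <= eps_w."""
    def new_weights(eta):
        log_p = (eta * np.log(q) + r_w) / (eta + 1.0)
        log_p -= np.max(log_p)                      # for numerical stability
        p = np.exp(log_p)
        return p / p.sum()

    def kl(p):
        return np.sum(p * (np.log(p) - np.log(q)))

    if kl(new_weights(0.0)) <= eps_w:               # constraint already inactive
        return new_weights(0.0)
    lo, hi = 0.0, eta_max
    for _ in range(iters):                          # KL shrinks as eta grows
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(new_weights(mid)) > eps_w else (lo, mid)
    return new_weights(hi)
```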
3.5. Sample Reuse

Samples are used for approximating $r_o(x)$ for the component updates as well as $r_w(o)$ for the weight updates. We fit a quadratic surrogate to approximate $r_o(x)$ using linear regression, where the samples constitute the independent variables. The surrogate should be most accurate in the vicinity of $p(x|o)$, which can be achieved by using independent variables that are distributed according to $p(x|o)$. However, we also want to make use of samples from previous iterations or from different components. We therefore perform weighted least squares, where the weight of sample $i$ is given by the self-normalized importance weight

$$w_i(o) = \frac{1}{Z}\,\frac{p(x_i|o)}{z(x_i)}, \qquad Z = \sum_j \frac{p(x_j|o)}{z(x_j)}.$$

The sampling distribution $z(x)$ for the current set of $N_s$ samples is a Gaussian mixture model given by

$$z(x) = \frac{1}{N_s}\sum_{i=1}^{N_s} \mathcal{N}_i(x),$$

where $\mathcal{N}_i(x)$ corresponds to the normal distribution that was used to obtain sample $i$. For better efficiency, we include several samples from each sampling component $\mathcal{N}_i(x)$ such that $z(x)$ comprises fewer than $N_s$ different components.

We approximate $r_w(o)$ for the weight updates by using an importance-weighted Monte Carlo estimate based on the same importance weights that are used for the component updates. Hence, we replace $r_w(o)$ in Eq. 8 by

$$\sum_{i=1}^{N_s} w_i(o)\,r_o(x_i) + H\big(p(x|o)\big).$$

The set of active samples is replaced directly after new samples have been drawn and is held constant during the EM iterations. For all our experiments we selected the $3 N_{\text{new}}$ most recent samples, where $N_{\text{new}}$ corresponds to the number of samples that have been drawn during the last round of sampling. Each round of sampling consists of drawing $10 D$ samples from each mixture component, where $D$ corresponds to the number of dimensions of $x$.

3.6. Adding and Deleting Mixture Components

In order to adapt the complexity of the GMM to the complexity of the target distribution, we initialize our algorithm with only one mixture component with high variance and gradually increase the number of mixture components. We consider the locations of all previous samples as candidates for new components. Every $n_{\text{add}}$ iterations, we heuristically choose the most promising candidate and use its location as the mean of a newly added component. The covariance matrix is initialized by interpolating the covariance matrices of neighboring components using the responsibilities.

As we want the new component to eventually achieve a high weight, we want to add it in an area where its component-specific reward $r_o(x)$ will become large. We therefore try to find an area that has high likelihood under the target distribution but also yields high log-responsibilities for the newly added component, see Equation 7. However, the responsibilities of a newly initialized component hardly relate to the responsibilities it will eventually achieve. Instead, we choose the candidate that maximizes the score

$$e_i = \log \tilde{p}(x_i) - \max\Big(\log p(x_i),\; \max_j \log p(x_j) - \gamma\Big).$$

The second term prefers locations that are poorly covered by the current approximation and behaves similarly to the log-responsibilities, which also saturate for far-away areas. Without such saturation, the second term might dominate the score when the target distribution has heavy tails. In order to add components without impairing the stability of the optimization, we initialize them with very low weight such that their effect on the approximation is negligible. As we draw the same number of samples from each component irrespective of their weight, such low-weight components can still improve and eventually contribute to the approximation. However, keeping unnecessary components is costly in terms of computation and sample efficiency. We therefore delete components that did not have a noticeable effect on the approximation during the last $n_{\text{del}}$ iterations.
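As a small illustration (not the paper's implementation), the candidate score above could be computed as follows; the function name and the usage line are hypothetical.

```python
import numpy as np

def component_candidate_score(log_p_tilde, log_p_model, gamma):
    """Score from Section 3.6 for selecting the location of a new component:
    e_i = log p_tilde(x_i) - max(log p(x_i), max_j log p(x_j) - gamma).
    log_p_tilde: (N,) unnormalized target log-densities at the candidate samples,
    log_p_model: (N,) log-densities of the current GMM approximation,
    gamma: saturation offset that keeps heavy tails from dominating the score."""
    saturation = np.max(log_p_model) - gamma
    return log_p_tilde - np.maximum(log_p_model, saturation)

# The highest-scoring candidate would serve as the mean of the new component, e.g.:
# best = np.argmax(component_candidate_score(lt, lm, gamma))
```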
The full algorithm is outlined in Algorithm 1. An open-source implementation is available online at https://github.com/OlegArenz/VIPS.

Algorithm 1 Variational Inference by Policy Search
1: Input: initial parameters θ0
2: for j = 1 to maxIter do
3:   Add and delete components according to heuristics; store the new parameters in θj.
4:   Draw samples si from each component and store them along with ri = log p̃(si) and the parameters of the responsible component.
5:   (s⋆, r⋆, z⋆(x)) ← select_active_samples()
6:   Compute z⋆ ← z⋆(s⋆)
7:   for k = 1 to maxIterEM do
8:     Recompute p ← p(s⋆; θj) and p̄ ← p(o|s⋆; θj)
9:     θj ← update_weights(p, z⋆, p̄, s⋆, r⋆, θj)
10:    Recompute p ← p(s⋆; θj) and p̄ ← p(o|s⋆; θj)
11:    for each component do
12:      θj ← update_component(p, z⋆, p̄, s⋆, r⋆, θj)
13:    end for
14:  end for
15: end for

4. Related Work

Although MCMC samplers cannot directly be used for approximating distributions, they are for many applications the main alternative to VI. Prominent examples of gradient-free MCMC methods include (random walk) Metropolis-Hastings (Hastings, 1970), Gibbs sampling (Geman & Geman, 1984), slice sampling (Neal, 2003) and elliptical slice sampling (Murray et al., 2010; Nishihara et al., 2014). If the gradient of the target distribution is available, Hamiltonian MCMC (Duane et al., 1987) and the Metropolis-adjusted Langevin algorithm (Roberts & Stramer, 2002) are also popular choices. The No-U-Turn sampler (NUTS) (Hoffman & Gelman, 2014) is a notable variant of Hamiltonian MCMC that is appealing for not requiring hyper-parameter tuning. While many of these MCMC methods have problems with multimodal distributions in terms of mixing time, other methods use multiple chains and can therefore better explore multimodal sample spaces (Neal, 1996; Nishihara et al., 2014; Calderhead, 2014).

Many VI methods are only applicable to Gaussian variational distributions (Fan et al., 2015; Regier et al., 2017). The approach by Fan et al. (2015) can learn Gaussians with full covariance matrices using fast second-order optimization. This idea has been extended by Regier et al. (2017) to trust-region optimization. However, in contrast to our approach, a Euclidean trust region is used in the parameter space of the variational distribution. Such an approach requires the computation of the Hessian of the objective, which is only tractable for mean-field approximations of single Gaussian distributions. In contrast, we use trust regions directly on the change of the distributions instead of the change of their parameters. The information-geometric trust regions in this paper allow for efficient estimation of GMMs with full covariance matrices without requiring gradient information from $\tilde{p}$.

Black-box variational inference methods do not make strong assumptions on the model family (Salimans & Knowles, 2013; Ranganath et al., 2014) and can therefore also be used for learning GMM approximations. Salimans & Knowles (2013) derive a fixed-point update of the natural parameters of a distribution from the exponential family that corresponds to a Monte Carlo estimate of the gradient of Eq. (1), preconditioned by the inverse of their empirical covariance. By making structural assumptions on the target distribution, they extend their method to mixture models and show its applicability to bivariate GMMs. Ranganath et al. (2014) also use Monte Carlo gradient estimates of Eq. (1)
and apply Rao-Blackwellization and control variates to reduce their variance. The work of Weber et al. (2015) already explored the use of reinforcement learning for VI by formalizing VI as a sequential decision problem. However, only simple policy gradient methods have been proposed in this context, which are unsuitable for learning GMMs.

Closely related to our work are two recent approaches for variational inference that concurrently explored the idea of applying boosting to make the training of GMM approximations tractable (Miller et al., 2017; Guo et al., 2016). These methods start by minimizing the ELBO objective for a single component and then successively add and optimize new components and learn an optimal weighting between the previous mixture and the newly added component. However, they do not use information-geometric trust regions to efficiently explore the sample space and therefore have problems finding all the modes as well as accurate estimates of the covariance matrices. Non-parametric variational inference (NPVI) (Gershman et al., 2012) learns GMMs with uniform weights using a second-order approximation of the ELBO for efficient gradient updates. However, this method only allows for mean-field approximations for the mixture components, which is a severe limitation as shown in our comparisons. GMMs are also used by Zobay (2014), where an approximation of the GMM entropy is used to make the optimization tractable. The optimization is gradient-based and does not consider exploration of the sample space. It is therefore limited to rather low-dimensional problems.

5. Experiments

We evaluate VIPS with respect to the efficiency of the optimization as well as the quality of the learned approximations. For assessing efficiency, we focus on the number of function evaluations, but also include a comparison with respect to wall-clock time. As the ELBO objective is hard to use for comparisons because it depends on the current sample set, we assess the quality of the approximation by comparing samples drawn from the learned model with ground-truth samples based on their Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). Please refer to the supplement for how the MMD and the ground-truth samples are computed. Please note that the computation of the ground-truth samples is based on generalized elliptical slice sampling (GESS) (Nishihara et al., 2014), which is in most cases computationally very expensive. Due to the huge computational demands, we do not consider GESS as a competitor.
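Since the exact kernel and bandwidth used for the MMD comparisons are only given in the supplement, the following is a generic sketch of an unbiased squared-MMD estimate with a Gaussian kernel; the function name and the bandwidth handling are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def mmd_squared(X, Y, bandwidth):
    """Unbiased estimate of the squared Maximum Mean Discrepancy
    (Gretton et al., 2012) between sample sets X (N, D) and Y (M, D) under a
    Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def kernel(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * bandwidth**2))
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    n, m = len(X), len(Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```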
We compare our method to a variety of state-of-the-art MCMC and VI approaches on unimodal and multimodal problems; namely, we compare to Variational Boosting (Miller et al., 2017), Non-Parametric Variational Inference (Gershman et al., 2012), Stein Variational Gradient Descent (Liu & Wang, 2016), Hamiltonian Monte Carlo (Duane et al., 1987), Slice Sampling (Neal, 2003), Elliptical Slice Sampling (Murray et al., 2010), Parallel Tempering MCMC (Calderhead, 2014) and, indirectly, black-box variational inference (Salimans & Knowles, 2013), since it is used for initializing variational boosting. However, due to the high computational demands, we do not compare to every method on each experiment but rather select promising candidates based on the sampling problem or on the preliminary experiments that we had to conduct for hyper-parameter tuning. For VIPS, we use the same hyper-parameters for all experiments. However, we do not add new components if this would increase the number of components above a certain threshold, and we evaluate different values for this threshold.

Figure 2: Comparison of VIPS to other VI and MCMC methods for two logistic regression tasks ((a) German Credit, (b) Breast Cancer); the MMD is plotted over the number of function evaluations. VIPS considerably outperformed all other methods already when using only one mixture component (VIPS1) and could only be slightly improved by the GMM on the breast cancer domain.

We perform three experiments on (essentially) unimodal sampling problems taken from related work. We perform Bayesian logistic regression on the German Credit and Breast Cancer datasets (Lichman, 2013) as described by Nishihara et al. (2014), and approximate the posterior of the hierarchical Poisson GLM described by Miller et al. (2017). We designed two challenging toy tasks for evaluating our approach on multimodal problems: sampling from an unknown twenty-dimensional GMM with full covariance matrices and distant modes, and sampling the joint configurations of a ten-dimensional planar robot.

We start the experimental evaluation of VIPS by visualizing the iterative improvements of the GMM approximation for the task of approximating an unknown two-dimensional GMM. Figure 1 shows the log-probability densities of the learned approximation during the optimization and of the target distribution (right). We can see that VIPS gradually adds more components and improves the GMM. The final fit with 20 components closely approximates the target distribution.

Figure 1: We start with an initial isotropic Gaussian distribution (left) and iteratively improve the approximation and add additional components in order to approximate the desired distribution (right). The three plots in the middle visualize the approximation directly after adding the fifth, tenth and twentieth component.

5.1. Bayesian Logistic Regression

We perform two experiments for binary classification on the German Credit and Breast Cancer datasets (Lichman, 2013). For the German Credit dataset, twenty-five parameters are learned, whereas the Breast Cancer dataset is thirty-one dimensional. We standardize both datasets and perform linear logistic regression where we put zero-mean Gaussian priors with variance 100 on all parameters. On the German Credit dataset we compare against NPVI, ESS, SVGD, HMC and variational boosting. For variational boosting we examine rank-0, rank-5 and rank-10 approximations of the covariance matrices. We also performed experiments with a full-rank approximation, but the optimization always failed after a few iterations. For our approach we examine two variants: VIPS1, which learns only a single component, and VIPS40, which stops adding new components after forty components.
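For concreteness, the unnormalized target density of these logistic regression experiments could be sketched as follows; the exact preprocessing follows Nishihara et al. (2014) and may differ from this minimal version, and the function name is illustrative.

```python
import numpy as np

def log_posterior(theta, X, y, prior_var=100.0):
    """Unnormalized log-posterior log p_tilde(theta) for Bayesian logistic
    regression: standardized features X (N, D) including a bias column, labels
    y in {0, 1}, and independent zero-mean Gaussian priors with variance 100
    on all parameters."""
    logits = X @ theta
    # Numerically stable Bernoulli log-likelihood: y * logit - log(1 + exp(logit))
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum(theta**2) / prior_var
    return log_lik + log_prior
```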
Figure 2a shows that the sample quality achieved by VIPS is unmatched by any variational inference method, and ESS needs more than two orders of magnitude more function evaluations to achieve an MMD similar to VIPS1. However, we could not measure an advantage of using a GMM instead of a single Gaussian distribution on this dataset. As VIPS1 only learns a single Gaussian, it is also more data-efficient than VIPS40. This result also illustrates the importance of using Gaussian distributions with an accurate estimate of the full covariance, as VIPS1 could already outperform the competing methods. Figure 2b depicts the achieved MMDs on the breast cancer dataset. We only compared to the most promising methods from the German Credit dataset. Although the posterior distribution is unimodal, the GMM variant of our method learns a slightly better approximation than VIPS1, presumably by matching higher-order moments of the posterior.

5.2. Multi-level Poisson GLM

We evaluate our method on the problem of learning the posterior of a hierarchical Poisson GLM on the 37-dimensional stop-and-frisk dataset, where we refer to Miller et al. (2017) for the description of the hierarchical model. We compared VIPS1 and VIPS5 to HMC, SVGD, variational boosting and non-parametric variational inference. The MMD with respect to the baseline samples is shown in Figure 3a.

Figure 3: MMD plotted over the number of function evaluations. (a) On the stop-and-frisk dataset, VIPS1 closely approximates the posterior with roughly 33000 function evaluations. (b) On the 10-link planar robot experiment, VIPS learns a significantly better approximation than competing VI methods and achieves similar sample quality to MCMC.

5.3. Planar Robot

We evaluate our approach on a planar robot with ten degrees of freedom. We want to sample joint configurations such that the end-effector of the robot is close to a desired position at x = 7 and y = 0. The likelihood of a configuration is a Gaussian distribution in Cartesian end-effector space with a variance of 1e-4 in both dimensions. We assume a zero-mean Gaussian prior in configuration space, where the first joint has variance 1 and the remaining joints have variance 0.04. We compare VIPS40 with ESS, parallel tempering MCMC, SVGD, slice sampling, NPVI and VBOOST. The MMD is shown in Figure 3b.

Figure 4 visualizes the learned models of NPVI, VBOOST and VIPS and compares the ground-truth samples with the samples obtained by VIPS. All other VI methods cannot represent the complex structure of the modes and get attracted by single modes, resulting in bad approximations. We believe the reason for this behavior is a missing principled treatment of the exploration-exploitation trade-off. VIPS achieves a high sample quality that is comparable to MCMC methods with orders of magnitude fewer function evaluations.

Figure 4: The first three plots visualize typical component means and weights learned by NPVI, VBOOST and VIPS. Components with higher weight are drawn darker. The two rightmost plots show the samples drawn from the VIPS approximation and ground-truth samples, which have been collected during two days on 120 cores using GESS.
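For reference, a minimal sketch of the unnormalized target density used in this planar robot experiment is given below. The paper does not state the link lengths, so unit-length links and the usual cumulative-angle forward kinematics of a planar chain are assumed here; the constant names are illustrative.

```python
import numpy as np

GOAL = np.array([7.0, 0.0])                     # desired end-effector position
LIK_VAR = 1e-4                                  # likelihood variance per dimension
PRIOR_VAR = np.array([1.0] + [0.04] * 9)        # broad first joint, narrow remaining joints

def log_target_planar_robot(theta, link_length=1.0):
    """Unnormalized log-density for the 10-link planar robot: Gaussian prior over
    joint angles and a Gaussian end-effector likelihood centered at (7, 0)."""
    angles = np.cumsum(theta)                   # absolute orientation of each link
    ee = link_length * np.array([np.sum(np.cos(angles)),
                                 np.sum(np.sin(angles))])
    log_lik = -0.5 * np.sum((ee - GOAL)**2) / LIK_VAR
    log_prior = -0.5 * np.sum(theta**2 / PRIOR_VAR)
    return log_lik + log_prior
```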
5.4. Gaussian Mixture Model

We evaluate our method on the task of approximating unknown, randomly generated 20-dimensional GMMs comprising ten components. Each dimension of the component means is drawn uniformly in the interval [-50, 50]. The covariance matrices are given by $\Sigma = A A^\top + I_{20}$, where each entry of the 20x20-dimensional matrix $A$ is sampled from a normal distribution with mean 0 and variance 20. Note that each component of the target distribution can have a highly correlated covariance matrix, which is a problem even for the tested MCMC methods. Figure 5 shows the MMD of VIPS40, SVGD, ESS and parallel tempering MCMC plotted over the number of iterations as well as over wall-clock time. VIPS was the only method that could reliably be applied to this task.

Figure 5: MMD on the GMM experiment, plotted over the number of function evaluations (a) and over wall-clock time (b; VIPS run on 1, 2 and 4 cores). VIPS was able to closely approximate the true mixture model and reliably discovers all ten components. When the number of computing cores is small compared to the number of components, VIPS can scale almost linearly with the number of cores.

We also briefly evaluated the scaling of the performance with the number of cores. Although parallel computing is not the focus of this paper, the possibility of performing independent updates of the mixture components suggests that the method can make good use of multi-threading. Figure 5b shows that VIPS scales almost linearly with the number of cores, at least if their number is small, showing the potential of parallelizing our algorithm.

6. Conclusion

VIPS is motivated by the insight from stochastic search that information-geometric trust regions allow for controlled exploration and stable optimization. We transferred the stochastic search method MORE to the field of variational inference and demonstrated that it is significantly more efficient than state-of-the-art VI approaches for learning the mean and full covariance of a multivariate normal distribution. Based on our variant of MORE, we derived a novel method for learning GMM approximations and demonstrated that it is capable of learning high-quality approximations of complex, multimodal distributions with a limited number of function evaluations. Our method makes few assumptions on the unnormalized density function of the desired distribution and is thereby applicable to non-differentiable problems. However, for higher-dimensional problems (e.g., more than 100 dimensions), fitting quadratic surrogates may require too many samples, which could be alleviated by using gradient information to constrain the surrogate. Furthermore, learning diagonal surrogates can be preferable for such problems due to better sample and computational efficiency. For exploration, we start with an approximation with high entropy and decrease it slowly during optimization, which can miss modes that are not discovered in the beginning. Actively sampling unexplored regions would result in an anytime algorithm, capable of further improving the approximation in the later stages of optimization.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645582 (RoMaNS). Calculations for this research were conducted on the Lichtenberg high performance computer of the TU Darmstadt.

References

Abdolmaleki, A., Lioutikov, R., Lau, N., Reis, L. P., Peters, J., and Neumann, G. Model-Based Relative Entropy Stochastic Search. In Advances in Neural Information Processing Systems (NIPS), pp. 153-154, 2016.

Abdolmaleki, A., Price, B., Lau, N., Reis, L. P., and Neumann, G. Deriving and Improving CMA-ES with Information Geometric Trust Regions. In The Genetic and Evolutionary Computation Conference (GECCO 2017), July 2017.
Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, 2006.

Bishop, C. M., Lawrence, N. D., Jaakkola, T., and Jordan, M. I. Approximating Posterior Distributions in Belief Networks Using Mixtures. In Advances in Neural Information Processing Systems, pp. 416-422, 1998.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 2017.

Calderhead, B. A General Construction for Parallelizing Metropolis-Hastings Algorithms. Proceedings of the National Academy of Sciences of the United States of America (PNAS), Nov 2014.

Daniel, C., Neumann, G., and Peters, J. Hierarchical Relative Entropy Policy Search. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Deisenroth, M. P., Neumann, G., and Peters, J. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, pp. 388-403, 2013.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Fan, K., Wang, Z., Beck, J., Kwok, J. T., and Heller, K. Fast Second-order Stochastic Backpropagation for Variational Inference. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pp. 1387-1395, 2015.

Geman, S. and Geman, D. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721-741, 1984.

Gershman, S. J., Hoffman, M. D., and Blei, D. M. Nonparametric Variational Inference. In Proceedings of the 29th International Conference on Machine Learning, pp. 235-242, 2012.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A Kernel Two-sample Test. Journal of Machine Learning Research, 13:723-773, March 2012. ISSN 1532-4435.

Guo, F., Wang, X., Fan, K., Broderick, T., and Dunson, D. B. Boosting Variational Inference. arXiv:1611.05559v2 [stat.ML], 2016.

Hastings, W. K. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57(1):97-109, 1970.

Hoffman, M. D. and Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

Jaakkola, T. S. and Jordan, M. I. Improving the Mean Field Approximation via the Use of Mixture Distributions. Learning in Graphical Models, 89:163-174, 1998.

Kupcsik, A., Deisenroth, M. P., Peters, J., Loh, A. P., Vadakkepat, P., and Neumann, G. Model-Based Contextual Policy Search for Data-Efficient Generalization of Robot Skills. Artificial Intelligence, December 2017.

Lichman, M. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.

Liu, Q. and Wang, D. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2378-2386. Curran Associates, Inc., 2016.

Miller, A. C., Foti, N. J., D'Amour, A., and Adams, R. P. Variational Boosting: Iteratively Refining Posterior Approximations. In Proceedings of the International Conference on Machine Learning, 2017.

Murray, I., Adams, R., and MacKay, D. Elliptical Slice Sampling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 541-548, 2010.
Neal, R. M. Sampling from Multimodal Distributions using Tempered Transitions. Statistics and Computing, 6(4):353-366, Dec 1996.

Neal, R. M. Slice Sampling. The Annals of Statistics, 31(3):705-767, 06 2003. doi: 10.1214/aos/1056562461.

Nishihara, R., Murray, I., and Adams, R. P. Parallel MCMC with Generalized Elliptical Slice Sampling. Journal of Machine Learning Research, 15(1):2087-2112, January 2014.

Peters, J., Muelling, K., and Altun, Y. Relative Entropy Policy Search. In Proceedings of the 24th National Conference on Artificial Intelligence (AAAI), 2010.

Peterson, C. and Hartman, E. Explorations of the Mean Field Theory Learning Algorithm. Neural Networks, 2(6):457-494, 1989.

Ranganath, R., Gerrish, S., and Blei, D. Black Box Variational Inference. In Artificial Intelligence and Statistics, pp. 814-822, 2014.

Regier, J., Jordan, M. I., and McAuliffe, J. Fast Black-box Variational Inference through Stochastic Trust-Region Optimization. In Advances in Neural Information Processing Systems 30, pp. 2399-2408, 2017.

Roberts, G. O. and Stramer, O. Langevin Diffusions and Metropolis-Hastings Algorithms. Methodology and Computing in Applied Probability, 4(4):337-357, 2002.

Salimans, T. and Knowles, D. A. Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis, 8(4):837-882, 2013.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust Region Policy Optimization. In Proceedings of the International Conference on Machine Learning, 2015.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, Boston, MA, 1998.

Weber, T., Heess, N., Eslami, A., Schulman, J., Wingate, D., and Silver, D. Reinforced Variational Inference. In Advances in Neural Information Processing Systems (NIPS) Workshops, 2015.

Zobay, O. Variational Bayesian Inference with Gaussian Mixture Approximations. Electronic Journal of Statistics, 8(1):355-389, 2014. doi: 10.1214/14-EJS887.