# Interacting Particle Markov Chain Monte Carlo

Tom Rainforth¹* (twgr@robots.ox.ac.uk), Christian A. Naesseth²* (christian.a.naesseth@liu.se), Fredrik Lindsten³ (fredrik.lindsten@it.uu.se), Brooks Paige¹ (brooks@robots.ox.ac.uk), Jan-Willem van de Meent¹ (jwvdm@robots.ox.ac.uk), Arnaud Doucet¹ (doucet@stats.ox.ac.uk), Frank Wood¹ (fwood@robots.ox.ac.uk)

*Equal contribution. ¹The University of Oxford, Oxford, United Kingdom. ²Linköping University, Linköping, Sweden. ³Uppsala University, Uppsala, Sweden.

Abstract

We introduce interacting particle Markov chain Monte Carlo (iPMCMC), a PMCMC method based on an interacting pool of standard and conditional sequential Monte Carlo samplers. Like related methods, iPMCMC is a Markov chain Monte Carlo sampler on an extended space. We present empirical results that show significant improvements in mixing rates relative to both non-interacting PMCMC samplers and a single PMCMC sampler with an equivalent memory and computational budget. An additional advantage of the iPMCMC method is that it is suitable for distributed and multi-core architectures.

1. Introduction

MCMC methods are a fundamental tool for generating samples from a posterior density in Bayesian data analysis (see e.g. Robert & Casella (2013)). Particle Markov chain Monte Carlo (PMCMC) methods, introduced by Andrieu et al. (2010), make use of sequential Monte Carlo (SMC) algorithms (Gordon et al., 1993; Doucet et al., 2001) to construct efficient proposals for the MCMC sampler. One particularly widely used PMCMC algorithm is particle Gibbs (PG). The PG algorithm modifies the SMC step in the PMCMC algorithm to sample the latent variables conditioned on an existing particle trajectory, resulting in what is called a conditional sequential Monte Carlo (CSMC) step.

The PG method was first introduced as an efficient Gibbs sampler for latent variable models with static parameters (Andrieu et al., 2010). Since then, the PG algorithm and its extension by Lindsten et al. (2014) have found numerous applications in, e.g., Bayesian nonparametrics (Valera et al., 2015; Tripuraneni et al., 2015), probabilistic programming (Wood et al., 2014; van de Meent et al., 2015) and graphical models (Everitt, 2012; Naesseth et al., 2014; 2015).

A drawback of PG is that it can be particularly adversely affected by path degeneracy in the CSMC step. Conditioning on an existing trajectory means that whenever resampling of the trajectories results in a common ancestor, this ancestor must correspond to the retained trajectory. Consequently, the mixing of the Markov chain for the early steps in the state sequence can become very slow, as the particle set typically coalesces to a single ancestor during the CSMC sweep.

In this paper we propose the interacting particle Markov chain Monte Carlo (iPMCMC) sampler. In iPMCMC we run a pool of CSMC and unconditional SMC algorithms as parallel processes that we refer to as nodes. After each run of this pool, we apply successive Gibbs updates to the indices of the CSMC nodes. Hence, the nodes from which retained particles are sampled can change from one MCMC iteration to the next. This lets us trade off exploration (SMC) and exploitation (CSMC) to achieve improved mixing of the Markov chain.
Crucially, the pool provides numerous candidate indices at each Gibbs update, giving a significantly higher probability that an entirely new retained particle will be switched in than in non-interacting alternatives. This interaction requires only minimal communication: each node must report an estimate of the marginal likelihood and receive a new role (SMC or CSMC) for the next sweep. This means that iPMCMC is embarrassingly parallel and can be run in a distributed manner on multiple computers.

We prove that iPMCMC is a partially collapsed Gibbs sampler on the extended space containing the particle sets for all nodes. In the special case where iPMCMC uses only one CSMC node, it can in fact be seen as a non-trivial and previously unstudied instance of the α-SMC-based (Whiteley et al., 2016) PMCMC method introduced by Huggins & Roy (2015). With iPMCMC, however, we extend this further to allow for an arbitrary number of interacting CSMC and standard SMC algorithms. Our experimental evaluation shows that iPMCMC outperforms both independent PG samplers and a single PG sampler with the same number of particles run for longer to give a matching computational budget. An implementation of iPMCMC is provided in the probabilistic programming system Anglican (Wood et al., 2014), available at http://www.robots.ox.ac.uk/~fwood/anglican, whilst illustrative MATLAB code, similar to that used for the experiments, is provided at https://bitbucket.org/twgr/ipmcmc.

2. Background

We start by briefly reviewing sequential Monte Carlo (Gordon et al., 1993; Doucet et al., 2001) and the particle Gibbs algorithm (Andrieu et al., 2010). Consider a non-Markovian latent variable model of the form

$x_t \mid x_{1:t-1} \sim f_t(x_t \mid x_{1:t-1}),$  (1a)
$y_t \mid x_{1:t} \sim g_t(y_t \mid x_{1:t}),$  (1b)

where $x_t \in \mathsf{X}$ is the latent variable and $y_t \in \mathsf{Y}$ the observation at time step $t$, with transition densities $f_t$ and observation densities $g_t$; $x_1$ is drawn from some initial distribution $\mu(\cdot)$. The method we propose is not restricted to this model; it can in fact be applied to an arbitrary sequence of targets. We are interested in calculating expectations with respect to the posterior distribution $p(x_{1:T} \mid y_{1:T})$ over the latent variables $x_{1:T} := (x_1, \ldots, x_T)$ conditioned on observations $y_{1:T} := (y_1, \ldots, y_T)$, which is proportional to the joint distribution $p(x_{1:T}, y_{1:T})$:

$p(x_{1:T} \mid y_{1:T}) \propto \mu(x_1) \prod_{t=2}^{T} f_t(x_t \mid x_{1:t-1}) \prod_{t=1}^{T} g_t(y_t \mid x_{1:t}).$

In general, computing the posterior $p(x_{1:T} \mid y_{1:T})$ is intractable and we have to resort to approximations. We will in this paper focus on, and extend, the family of particle Markov chain Monte Carlo algorithms originally proposed by Andrieu et al. (2010). The key idea in PMCMC is to use SMC to construct efficient proposals of the latent variables $x_{1:T}$ for an MCMC sampler.
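To ground the notation, the sketch below fixes a toy instance of model (1): a one-dimensional linear-Gaussian model with the bootstrap proposal $q_t = f_t$, under which the weight in (3) below reduces to the observation density $g_t(y_t \mid x_{1:t})$. We use Python/NumPy for all illustrative snippets in this document; the model choice and function names here are our own assumptions, not the paper's companion code (which is in MATLAB and Anglican).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu(n):
    """Draw n samples from the initial density mu(x_1); here N(0, 1)."""
    return rng.normal(0.0, 1.0, size=n)

def sample_f(t, x_prev):
    """Sample x_t ~ f_t(. | x_{1:t-1}); here f_t depends only on x_{t-1}."""
    return 0.9 * x_prev + rng.normal(0.0, 1.0, size=np.shape(x_prev))

def log_g(t, x_t, y_t):
    """Evaluate log g_t(y_t | x_{1:t}); here g_t depends only on x_t."""
    return -0.5 * (y_t - x_t) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Simulate a synthetic dataset y_{1:T} from the toy model
T = 50
x_true = np.empty(T)
x_true[0] = sample_mu(1)[0]
for t in range(1, T):
    x_true[t] = sample_f(t, x_true[t - 1])
y = x_true + rng.normal(0.0, 1.0, size=T)
```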
2.1. Sequential Monte Carlo

The SMC method is a widely used technique for approximating a sequence of target distributions; in our case $p(x_{1:t} \mid y_{1:t}) = p(y_{1:t})^{-1} p(x_{1:t}, y_{1:t})$, $t = 1, \ldots, T$. At each time step $t$ we generate a particle system $\{(x_{1:t}^i, w_t^i)\}_{i=1}^{N}$ which provides a weighted approximation to $p(x_{1:t} \mid y_{1:t})$. Given such a weighted particle system at time $t-1$, it is propagated forward to time $t$ by first drawing an ancestor variable $a_{t-1}^i$ for each particle from its corresponding distribution

$P(a_{t-1}^i = \ell) = \bar w_{t-1}^{\ell}, \quad \ell = 1, \ldots, N,$  (2)

where $\bar w_{t-1}^{\ell} = w_{t-1}^{\ell} / \sum_i w_{t-1}^i$. This is commonly known as the resampling step. We introduce the ancestor variables $\{a_{t-1}^i\}_{i=1}^{N}$ explicitly to simplify the exposition of the theoretical justification given in Section 3.1. We continue by simulating from some given proposal density, $x_t^i \sim q_t(x_t \mid x_{1:t-1}^{a_{t-1}^i})$, and re-weight the system of particles as

$w_t^i = \frac{g_t(y_t \mid x_{1:t}^i) \, f_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})}{q_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})},$  (3)

where $x_{1:t}^i = (x_{1:t-1}^{a_{t-1}^i}, x_t^i)$. This results in a new particle system $\{(x_{1:t}^i, w_t^i)\}_{i=1}^{N}$ that approximates $p(x_{1:t} \mid y_{1:t})$. A summary is given in Algorithm 1.

Algorithm 1 Sequential Monte Carlo (all steps for $i = 1, \ldots, N$)
1: Input: data $y_{1:T}$, number of particles $N$, proposals $q_t$
2: $x_1^i \sim q_1(x_1)$
3: $w_1^i = \frac{g_1(y_1 \mid x_1^i)\, \mu(x_1^i)}{q_1(x_1^i)}$
4: for $t = 2$ to $T$ do
5: $\quad a_{t-1}^i \sim \mathrm{Discrete}\big(\{\bar w_{t-1}^{\ell}\}_{\ell=1}^{N}\big)$
6: $\quad x_t^i \sim q_t(x_t \mid x_{1:t-1}^{a_{t-1}^i})$
7: $\quad$ Set $x_{1:t}^i = (x_{1:t-1}^{a_{t-1}^i}, x_t^i)$
8: $\quad w_t^i = \frac{g_t(y_t \mid x_{1:t}^i)\, f_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})}{q_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})}$
9: end for

2.2. Particle Gibbs

The PG algorithm (Andrieu et al., 2010) is a Gibbs sampler on the extended space composed of all random variables generated at one iteration, which retains the original target distribution as a marginal. Though PG allows for inference over both latent variables and static parameters, we will in this paper focus on sampling of the former. The core idea of PG is to iteratively run conditional sequential Monte Carlo (CSMC) sweeps, as shown in Algorithm 2, whereby each conditional trajectory is sampled from the surviving trajectories of the previous sweep. The retained particle index, $b$, is sampled with probability proportional to the final particle weights $w_T^i$.

Algorithm 2 Conditional sequential Monte Carlo
1: Input: data $y_{1:T}$, number of particles $N$, proposals $q_t$, conditional trajectory $x'_{1:T}$
2: $x_1^i \sim q_1(x_1)$, $i = 1, \ldots, N-1$, and set $x_1^N = x_1'$
3: $w_1^i = \frac{g_1(y_1 \mid x_1^i)\, \mu(x_1^i)}{q_1(x_1^i)}$, $i = 1, \ldots, N$
4: for $t = 2$ to $T$ do
5: $\quad a_{t-1}^i \sim \mathrm{Discrete}\big(\{\bar w_{t-1}^{\ell}\}_{\ell=1}^{N}\big)$, $i = 1, \ldots, N-1$
6: $\quad x_t^i \sim q_t(x_t \mid x_{1:t-1}^{a_{t-1}^i})$, $i = 1, \ldots, N-1$
7: $\quad$ Set $a_{t-1}^N = N$ and $x_t^N = x_t'$
8: $\quad$ Set $x_{1:t}^i = (x_{1:t-1}^{a_{t-1}^i}, x_t^i)$, $i = 1, \ldots, N$
9: $\quad w_t^i = \frac{g_t(y_t \mid x_{1:t}^i)\, f_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})}{q_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i})}$, $i = 1, \ldots, N$
10: end for
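The sketch below implements Algorithms 1 and 2 in a single function, using the toy model functions defined earlier and bootstrap proposals: passing a conditional trajectory clamps particle $N$ as in Algorithm 2, otherwise a plain SMC sweep is run. It also accumulates the log marginal-likelihood estimate that iPMCMC will need (eq. (4) in the next section). Again, this is an illustrative sketch under our own assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def logsumexp(a):
    """Numerically stable log(sum(exp(a))); also reused by later sketches."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def smc_sweep(y, N, x_cond=None):
    """One sweep of Algorithm 1 (x_cond=None) or Algorithm 2 (x_cond given),
    with bootstrap proposals q_t = f_t so that the weight (3) reduces to
    g_t(y_t | x_{1:t}). Assumes sample_mu, sample_f, log_g from the sketch above.
    Returns trajectories (N x T), normalised final weights and log Z-hat."""
    T = len(y)
    X = np.empty((N, T))
    X[:, 0] = sample_mu(N)
    if x_cond is not None:
        X[-1, 0] = x_cond[0]                   # clamp retained particle (Alg. 2)
    logw = log_g(0, X[:, 0], y[0])
    log_Z = logsumexp(logw) - np.log(N)
    for t in range(1, T):
        wbar = np.exp(logw - logsumexp(logw))
        anc = rng.choice(N, size=N, p=wbar)    # multinomial resampling, eq. (2)
        X = X[anc]                             # carry whole ancestral paths
        X[:, t] = sample_f(t, X[:, t - 1])
        if x_cond is not None:                 # a^N_{t-1} = N and x^N_t = x'_t
            X[-1, :t + 1] = x_cond[:t + 1]
        logw = log_g(t, X[:, t], y[t])
        log_Z += logsumexp(logw) - np.log(N)
    return X, np.exp(logw - logsumexp(logw)), log_Z

# Example: one unconditional sweep on the toy data from the previous sketch
# X, wbar, log_Z = smc_sweep(y, N=100)
```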
3. Interacting Particle Markov Chain Monte Carlo

The main goal of iPMCMC is to increase the efficiency of PMCMC, in particular particle Gibbs. The basic PG algorithm is especially susceptible to the path degeneracy effect of SMC samplers, i.e. sample impoverishment due to frequent resampling. Whenever the ancestral lineage collapses at the early stages of the state sequence, the common ancestor is, by construction, guaranteed to be equal to the retained particle. This results in high correlation between samples and poor mixing of the Markov chain. Counteracting this can require a very large number of particles, which may be infeasible due to, for example, limited available memory. iPMCMC alleviates this issue by, from time to time, switching out a CSMC particle system with a completely independent SMC one, resulting in improved mixing.

iPMCMC, summarized in Algorithm 3, consists of $M$ separate interacting CSMC and SMC algorithms, exchanging only very limited information at each iteration to draw new MCMC samples. We refer to these internal CSMC and SMC algorithms as nodes, and assign each an index $m = 1, \ldots, M$. At every iteration, $P$ nodes run local CSMC algorithms, with the remaining $M - P$ nodes running independent SMC. The CSMC nodes are given identifiers $c_j \in \{1, \ldots, M\}$, $j = 1, \ldots, P$, with $c_j \neq c_k$ for $k \neq j$, and we write $c_{1:P} = \{c_1, \ldots, c_P\}$. Let $x_m^i = x_{1:T,m}^i$ denote the internal particle trajectories of node $m$. Suppose we have access to $P$ trajectories $x'_{1:P}[0] = (x'_1[0], \ldots, x'_P[0])$ corresponding to the initial retained particles, where the index $[\cdot]$ denotes MCMC iteration. At each iteration $r$, the nodes $c_{1:P}$ run CSMC (Algorithm 2) with the previous MCMC samples $x'_j[r-1]$ as the retained particles. The remaining $M - P$ nodes run standard (unconditional) SMC, i.e. Algorithm 1. Each node $m$ returns an estimate of the marginal likelihood for its internal particle system,

$\hat Z_m = \prod_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} w_{t,m}^i.$  (4)

The new conditional nodes are then set using a single loop $j = 1 : P$ of Gibbs updates, sampling new indices $c_j$ according to

$P(c_j = m \mid c_{1:P \setminus j}) = \hat\zeta_m^j,$  (5)

$\hat\zeta_m^j = \frac{\hat Z_m \mathbb{1}_{m \notin c_{1:P \setminus j}}}{\sum_{n=1}^{M} \hat Z_n \mathbb{1}_{n \notin c_{1:P \setminus j}}},$  (6)

where $c_{1:P \setminus j} := \{c_1, \ldots, c_{j-1}, c_{j+1}, \ldots, c_P\}$. We thus loop once through the conditional node indices and resample them from the union of the current node index and the unconditional node indices (i.e. all $m \notin c_{1:P}$ at that point in the loop; this may include nodes that just ran a CSMC sweep but were switched out earlier in the loop), in proportion to their marginal likelihood estimates. This is the key step that lets us completely switch the nodes from which the retained particles are drawn. One MCMC iteration $r$ is concluded by setting the new samples $x'_{1:P}[r]$, simulating from the internal particle system of the corresponding conditional node $c_j$:

$P(b_j = i \mid c_j) = \bar w_{T,c_j}^i, \qquad x'_j[r] = x_{c_j}^{b_j}.$  (7)

Algorithm 3 iPMCMC sampler
1: Input: number of nodes $M$, conditional nodes $P$, MCMC steps $R$, initial trajectories $x'_{1:P}[0]$
2: for $r = 1$ to $R$ do
3: $\quad$ Workers $1:M \setminus c_{1:P}$ run Algorithm 1 (SMC)
4: $\quad$ Workers $c_{1:P}$ run Algorithm 2 (CSMC), conditional on $x'_{1:P}[r-1]$ respectively
5: $\quad$ for $j = 1$ to $P$ do
6: $\qquad$ Select a new conditional node by simulating $c_j$ according to (5)
7: $\qquad$ Set the new MCMC sample $x'_j[r] = x_{c_j}^{b_j}$ by simulating $b_j$ according to (7)
8: $\quad$ end for
9: end for

The potential to pick from updated nodes $c_j$, having run independent SMC algorithms, decreases correlation and improves mixing of the MCMC sampler. Furthermore, as each Gibbs update corresponds to a one-to-many comparison for maintaining the same conditional index, the probability of switching is much higher than in an analogous non-interacting system.

The theoretical justification for iPMCMC is independent of how the initial trajectories $x'_{1:P}[0]$ are generated. One simple and effective method, which we use in our experiments, is to run standard SMC sweeps for the conditional nodes at the first iteration. The iPMCMC samples $x'_{1:P}[r]$ can be used to estimate expectations of test functions $f : \mathsf{X}^T \to \mathbb{R}$ in the standard Monte Carlo sense:

$\mathbb{E}[f(x)] \approx \frac{1}{RP} \sum_{r=1}^{R} \sum_{j=1}^{P} f(x'_j[r]).$  (8)

However, we can improve upon this if we have access to all particles generated by the algorithm; see Section 3.2.

We note that iPMCMC is well suited to distributed and multi-core architectures. In practice, the particle to be retained, should the node be a conditional node at the next iteration, can be sampled upfront and discarded if unused. Therefore, at each iteration, only a single particle trajectory and normalisation constant estimate need be communicated between the nodes, whilst the time taken to calculate the updates of $c_{1:P}$ is negligible. Further, iPMCMC should be amenable to an asynchronous adaptation under the assumption of a random execution time that is independent of $x'_j[r-1]$ in Algorithm 3; we leave this asynchronous variant to future work.
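A compact sketch of Algorithm 3 is given below, written as a serial loop over nodes for clarity (in practice each node would be a parallel worker). It assumes `smc_sweep` and `logsumexp` from the previous sketch are in scope; all names are our own, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def ipmcmc(y, M, P, N, R):
    """Illustrative sketch of Algorithm 3. Returns the retained trajectories
    x'_{1:P}[r] for r = 1, ..., R as an array of shape (R, P, T)."""
    T = len(y)
    # initialise retained particles with standard SMC sweeps (Section 3)
    retained = []
    for _ in range(P):
        X, wbar, _ = smc_sweep(y, N)
        retained.append(X[rng.choice(N, p=wbar)].copy())
    c = np.arange(P)                          # conditional-node indices c_{1:P}
    out = np.empty((R, P, T))
    for r in range(R):
        trajs, wbars, log_Z = [], [], np.empty(M)
        for m in range(M):                    # in practice: one worker per node
            idx = np.flatnonzero(c == m)
            if idx.size:                      # CSMC node (Algorithm 2)
                X, wbar, lZ = smc_sweep(y, N, x_cond=retained[idx[0]])
            else:                             # standard SMC node (Algorithm 1)
                X, wbar, lZ = smc_sweep(y, N)
            trajs.append(X); wbars.append(wbar); log_Z[m] = lZ
        for j in range(P):                    # Gibbs updates of c_j, eqs. (5)-(6)
            allowed = np.ones(M, dtype=bool)
            allowed[np.delete(c, j)] = False  # exclude c_{1:P \ j}
            zeta = np.zeros(M)
            zeta[allowed] = np.exp(log_Z[allowed] - logsumexp(log_Z[allowed]))
            c[j] = rng.choice(M, p=zeta)
            b = rng.choice(N, p=wbars[c[j]])  # retained-particle index, eq. (7)
            retained[j] = trajs[c[j]][b].copy()
        out[r] = np.stack(retained)
    return out

# Example usage on the toy data: samples = ipmcmc(y, M=8, P=4, N=100, R=500)
```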
3.1. Theoretical Justification

In this section we give the key results justifying the proposed iPMCMC sampler. Due to space constraints the exposition is fairly brief, and it is helpful to be familiar with the proof of PG in Andrieu et al. (2010).

We start by defining some additional notation. Let $\xi := \{x_t^i\}_{t=1:T}^{i=1:N} \cup \{a_t^i\}_{t=1:T-1}^{i=1:N}$ denote all particles and ancestor variables generated by a (C)SMC sampler, and write $\xi_m$ for the variables of the sampler local to node $m$. Let the conditional particle trajectory and corresponding ancestor variables for node $c_j$ be denoted by $\{x_{c_j}^{b_j}, \mathbf{b}_{c_j}\}$, with $\mathbf{b}_{c_j} = (\beta_{1,c_j}, \ldots, \beta_{T,c_j})$, $\beta_{T,c_j} = b_j$ and $\beta_{t,c_j} = a_{t,c_j}^{\beta_{t+1,c_j}}$. Let the posterior distribution of the latent variables be denoted $\pi_T(x) := p(x_{1:T} \mid y_{1:T})$, with normalisation constant $Z := p(y_{1:T})$. Finally, we note that the SMC and CSMC algorithms induce the following distributions over the random variables generated by the procedures:

$q_{\mathrm{SMC}}(\xi) = \prod_{i=1}^{N} q_1(x_1^i) \prod_{t=2}^{T} \prod_{i=1}^{N} \bar w_{t-1}^{a_{t-1}^i} q_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i}),$

$q_{\mathrm{CSMC}}(\xi \setminus \{x', \mathbf{b}\} \mid x', \mathbf{b}) = \prod_{i \neq \beta_1} q_1(x_1^i) \prod_{t=2}^{T} \prod_{i \neq \beta_t} \bar w_{t-1}^{a_{t-1}^i} q_t(x_t^i \mid x_{1:t-1}^{a_{t-1}^i}).$

Note that running Algorithm 2 corresponds to simulating from $q_{\mathrm{CSMC}}$ with the fixed choice $\mathbf{b} = (N, \ldots, N)$ for the index variables. These indices facilitate the proof of validity of the proposed method but have no practical relevance; in a practical implementation they can be set to arbitrary values, as is done in Algorithm 2.

Now we are ready to state the main theoretical result.

Theorem 1. The interacting particle Markov chain Monte Carlo sampler of Algorithm 3 is a partially collapsed Gibbs sampler (Van Dyk & Park, 2008) for the target distribution

$\pi(\xi_{1:M}, c_{1:P}, b_{1:P}) = \frac{1}{N^{PT} M_P} \prod_{m \notin c_{1:P}} q_{\mathrm{SMC}}(\xi_m) \prod_{j=1}^{P} \pi_T\big(x_{c_j}^{b_j}\big)\, \mathbb{1}_{c_j \notin c_{1:j-1}}\, q_{\mathrm{CSMC}}\big(\xi_{c_j} \setminus \{x_{c_j}^{b_j}, \mathbf{b}_{c_j}\} \mid x_{c_j}^{b_j}, \mathbf{b}_{c_j}\big),$  (9)

where $M_P := M!/(M-P)!$ counts the ordered selections of the $P$ conditional node indices.

Proof. See Appendix A at the end of the paper.

Remark 1. The marginal distribution of $(x_{c_{1:P}}^{b_{1:P}}, c_{1:P}, b_{1:P})$, with $x_{c_{1:P}}^{b_{1:P}} = (x_{c_1}^{b_1}, \ldots, x_{c_P}^{b_P})$, under (9) is given by

$\pi\big(x_{c_{1:P}}^{b_{1:P}}, c_{1:P}, b_{1:P}\big) = \frac{\prod_{j=1}^{P} \pi_T\big(x_{c_j}^{b_j}\big)\, \mathbb{1}_{c_j \notin c_{1:j-1}}}{N^{P} M_P}.$  (10)

This means that each trajectory $x_{c_j}^{b_j}$ is marginally distributed according to the posterior distribution of interest, $\pi_T$. Indeed, the $P$ retained trajectories of iPMCMC will in the limit $R \to \infty$ be independent draws from $\pi_T$.

Note that adding a backward or ancestor simulation step can drastically improve mixing when sampling the conditional trajectories $x'_j[r]$ (Lindsten & Schön, 2013). In the iPMCMC sampler we can replace simulation from the final weights on line 7 of Algorithm 3 with a backward simulation step. Another option for the CSMC nodes is to replace this step with internal ancestor sampling (Lindsten et al., 2014) steps and simulate from the final weights as normal.

3.2. Using All Particles

At each MCMC iteration $r$, we generate $MN$ full particle trajectories. Using only $P$ of these, as in (8), might seem wasteful. We can, however, make use of all particles to estimate expectations of interest by, for each Gibbs update $j$, averaging over the possible new values of the conditional node index $c_j$ and corresponding particle index $b_j$. We do this by replacing $f(x'_j[r])$ in (8) with

$\mathbb{E}_{c_j \mid c_{1:P \setminus j}}\Big[\mathbb{E}_{b_j \mid c_j}\big[f(x'_j[r])\big]\Big] = \sum_{m=1}^{M} \hat\zeta_m^j \sum_{i=1}^{N} \bar w_{T,m}^i f(x_m^i).$

This procedure is a Rao-Blackwellization of a statistical estimator and is (in terms of variance) never worse than the original. We highlight that each $\hat\zeta_m^j$, as defined in (6), depends on which indices have been sampled earlier in the index reassignment loop. Further details, along with a derivation, are provided in the supplementary material.
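As a sketch, one inner term of this Rao-Blackwellised estimator can be computed from the per-node quantities produced inside the iPMCMC loop above; the function and argument names are ours, and the $\hat\zeta_m^j$ are recomputed at the appropriate point of the reassignment loop.

```python
import numpy as np

def rb_term(f_vals, wbars, log_Z, c, j):
    """One inner term of the Section 3.2 estimator:
    E_{c_j|c_{1:P\j}} E_{b_j|c_j}[f(x'_j[r])] = sum_m zeta^j_m sum_i wbar^i_{T,m} f(x^i_m).
    f_vals[m][i] holds f(x^i_m), wbars[m] the normalised final weights of node m,
    log_Z the M log marginal-likelihood estimates, c the current c_{1:P}."""
    log_Z = np.asarray(log_Z)
    M = len(log_Z)
    allowed = np.ones(M, dtype=bool)
    allowed[np.delete(np.asarray(c), j)] = False   # exclude c_{1:P \ j}
    zeta = np.zeros(M)                             # zeta-hat^j_m, eq. (6)
    zeta[allowed] = np.exp(log_Z[allowed] - np.max(log_Z[allowed]))
    zeta /= zeta.sum()
    return sum(zeta[m] * np.dot(wbars[m], f_vals[m]) for m in range(M))
```

Averaging these terms over $j = 1, \ldots, P$ and $r = 1, \ldots, R$ and dividing by $RP$ then replaces (8).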
3.3. Choosing P

Before jumping into the full details of our experiments, we briefly consider the choice of $P$. Intuitively, the independent SMC sweeps are only useful if they are selected as the next conditional node. The probability that at least one conditional node switches with an unconditional one is given by

$P(\{\text{switch}\}) = 1 - \mathbb{E}\left[\prod_{j=1}^{P} \frac{\hat Z_{c_j}}{\hat Z_{c_j} + \sum_{m \notin c_{1:P}} \hat Z_m}\right].$  (11)

There exist theoretical and experimental results (Pitt et al., 2012; Bérard et al., 2014; Doucet et al., 2015) showing that the distributions of the normalisation constant estimates are well approximated by their log-Normal limiting distributions. With $\sigma^2$ (which scales as $1/N$) denoting the variance of the (C)SMC estimate, this means that at stationarity we have $\log\big(Z^{-1}\hat Z_{c_j}\big) \sim \mathcal{N}\big(\tfrac{\sigma^2}{2}, \sigma^2\big)$ and $\log\big(Z^{-1}\hat Z_m\big) \sim \mathcal{N}\big(-\tfrac{\sigma^2}{2}, \sigma^2\big)$ for $m \notin c_{1:P}$, where $Z$ is the true normalisation constant. Under this assumption, we can accurately estimate the switching probability (11) for different choices of $P$; an example is shown in Figure 1a, with additional analysis in the supplementary material. These results provide strong empirical evidence that the switching probability is maximised for $P = M/2$. In practice we also see that the best results are achieved when $P$ makes up roughly half of the nodes; see Figure 1b for performance on the state space model introduced in (12). Note also that the accuracy appears fairly robust to the choice of $P$. Based on these results, we set $P = M/2$ for the rest of our experiments.

[Figure 1. (a) Estimated switching probability, as a function of $P/M$, assuming the log-Normal limiting distribution for $\hat Z_m$ with $\sigma = 3$. (b) Median error in mean estimate for different choices of $P$ and $M$ ($M = 4, 8, 16, 32, 64$) over 10 synthetic datasets of the linear Gaussian state space model (12) after 1000 MCMC iterations; errors are normalized by the error of a multi-start PG sampler, the special case of iPMCMC with $P = M$ (see Section 4).]
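The switching probability (11) under the log-Normal assumption is straightforward to estimate by Monte Carlo. The short sketch below is our own (using the stated means $\pm\sigma^2/2$) and qualitatively reproduces the shape of Figure 1a, with the estimate peaking near $P = M/2$.

```python
import numpy as np

rng = np.random.default_rng(3)

def switch_prob(M, P, sigma, n_mc=100_000):
    """Monte Carlo estimate of the switching probability (11) under the
    log-Normal limiting assumption: log(Zhat/Z) ~ N(+sigma^2/2, sigma^2)
    for conditional sweeps and N(-sigma^2/2, sigma^2) for unconditional ones."""
    Zc = np.exp(rng.normal(sigma**2 / 2, sigma, size=(n_mc, P)))
    Zu = np.exp(rng.normal(-sigma**2 / 2, sigma, size=(n_mc, M - P)))
    # P(no switch) = prod_j Zc_j / (Zc_j + sum of unconditional Zhat's)
    no_switch = np.prod(Zc / (Zc + Zu.sum(axis=1, keepdims=True)), axis=1)
    return 1.0 - no_switch.mean()

# Sweep P for M = 32, sigma = 3
for P in (1, 4, 8, 16, 24, 28, 31):
    print(f"P = {P:2d}: P(switch) ~ {switch_prob(32, P, 3.0):.3f}")
```

Note that (11) treats the set of unconditional estimates as fixed across the loop over $j$, so this is an approximation to the exact sequential Gibbs updates.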
4. Experiments

To demonstrate the empirical performance of iPMCMC we report experiments on two state space models. Although both models considered are Markovian, we emphasise that iPMCMC goes far beyond this setting and can be applied to arbitrary graphical models. We focus our comparison on trivially distributed alternatives, whereby $M$ independent PMCMC samplers are run in parallel: PG, particle independent Metropolis-Hastings (PIMH) (Andrieu et al., 2010), and the alternate move PG sampler (APG) (Holenstein, 2009). Comparisons to other alternatives, including independent SMC, serialized implementations of PG and PIMH, and running a mixture of independent PG and PIMH samplers, are provided in the supplementary material. None outperformed the methods considered here, with the exception of a serialized PG implementation with an increased number of particles, which requires significantly more memory ($O(MN)$ as opposed to $O(M+N)$).

In PIMH a new particle set is proposed at each MCMC step using an independent SMC sweep, and is then either accepted or rejected using the standard Metropolis-Hastings acceptance ratio. APG interleaves PG steps with PIMH steps in an attempt to overcome the issues caused by path degeneracy in PG. We refer to the trivially distributed versions of these algorithms as multi-start PG, PIMH and APG respectively (mPG, mPIMH and mAPG). We use Rao-Blackwellization, as described in Section 3.2, to average over all generated particles for all methods, weighting the independent Markov chains equally for mPG, mPIMH and mAPG. We note that mPG is a special case of iPMCMC for which $P = M$. For simplicity, multinomial resampling was used in the experiments, with the prior transition distribution of the latent variables taken as the proposal. $M = 32$ nodes and $N = 100$ particles were used unless otherwise stated. The retained particles for iPMCMC and mPG were initialized using standard SMC sweeps.

4.1. Linear Gaussian State Space Model

We first consider a linear Gaussian state space model (LGSSM) with 3-dimensional latent states $x_{1:T}$, 20-dimensional observations $y_{1:T}$, and dynamics given by

$x_1 \sim \mathcal{N}(\mu, V),$  (12a)
$x_t = \alpha x_{t-1} + \delta_{t-1}, \quad \delta_{t-1} \sim \mathcal{N}(0, \Omega),$  (12b)
$y_t = \beta x_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \Sigma).$  (12c)

We set $\mu = [0, 1, 1]^\mathsf{T}$, $V = 0.1\,I$, $\Omega = I$ and $\Sigma = 0.1\,I$, where $I$ denotes the identity matrix. The constant transition matrix, $\alpha$, corresponds to successively applying rotations of $7\pi/10$ and $\pi/20$ about the first, second and third dimensions of $x_{t-1}$ respectively, followed by a scaling of 0.99 to ensure that the dynamics remain stable. A total of 10 different synthetic datasets of length $T = 50$ were generated by simulating from (12a)-(12c), each with a different emission matrix $\beta$, generated by sampling each column independently from a symmetric Dirichlet distribution with concentration parameter 0.2.

[Figure 2. Mean squared error averaged over all dimensions and steps in the state sequence as a function of MCMC iterations (left), and mean squared error after $10^4$ iterations, averaged over dimensions, as a function of position in the state sequence (right), for (12) with 50 time steps; shown for iPMCMC with $P = 16$ and the benchmark methods. Solid lines show the median error across the 10 synthetic datasets; shading shows the upper and lower quartiles. Ground truth was calculated using the Rauch-Tung-Striebel smoother (Rauch et al., 1965).]

Figure 2a shows convergence of the estimates of the latent variable means to the ground-truth solution for iPMCMC and the benchmark algorithms as a function of MCMC iterations. It shows that iPMCMC comfortably outperforms the alternatives from around 200 iterations onwards, with only iPMCMC and mAPG demonstrating behaviour consistent with the Monte Carlo convergence rate, suggesting that mPG and mPIMH are still far from the ergodic regime. Figure 2b shows the same errors after $10^4$ MCMC iterations as a function of position in the state sequence. This demonstrates that iPMCMC outperformed all the other algorithms for the early stages of the state sequence, for which mPG performed particularly poorly. Toward the end of the state sequence, iPMCMC, mPG and mAPG all gave similar performance, whilst that of mPIMH was significantly worse.
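For reference, a Kalman filter followed by a Rauch-Tung-Striebel backward pass of the kind used to compute the ground truth can be sketched as below for a generic model of the form (12), with `A`, `Q`, `C`, `Rm` standing in for $\alpha$, $\Omega$, $\beta$, $\Sigma$. This is our own generic illustration, not the authors' code.

```python
import numpy as np

def rts_smoother(y, A, Q, C, Rm, mu0, V0):
    """Kalman filter + Rauch-Tung-Striebel smoother for the LGSSM
    x_t = A x_{t-1} + N(0, Q), y_t = C x_t + N(0, Rm), x_1 ~ N(mu0, V0).
    y has shape (T, p). Returns smoothed means (T, d) and covariances (T, d, d)."""
    T, d = len(y), len(mu0)
    mf = np.empty((T, d)); Pf = np.empty((T, d, d))   # filtered moments
    mp = np.empty((T, d)); Pp = np.empty((T, d, d))   # predicted moments
    m, P = mu0, V0
    for t in range(T):
        if t > 0:                                     # predict step
            m, P = A @ m, A @ P @ A.T + Q
        mp[t], Pp[t] = m, P
        S = C @ P @ C.T + Rm                          # update step
        K = P @ C.T @ np.linalg.inv(S)
        m = m + K @ (y[t] - C @ m)
        P = P - K @ S @ K.T
        mf[t], Pf[t] = m, P
    ms, Ps = mf.copy(), Pf.copy()                     # RTS backward pass
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        ms[t] = mf[t] + J @ (ms[t + 1] - mp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return ms, Ps
```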
4.2. Nonlinear State Space Model

We next consider the one-dimensional nonlinear state space model (NLSSM) considered by, among others, Gordon et al. (1993) and Andrieu et al. (2010):

$x_1 \sim \mathcal{N}(\mu, v^2),$  (13a)
$x_t = \frac{x_{t-1}}{2} + 25\frac{x_{t-1}}{1 + x_{t-1}^2} + 8\cos(1.2t) + \delta_{t-1},$  (13b)
$y_t = \frac{x_t^2}{20} + \varepsilon_t,$  (13c)

where $\delta_{t-1} \sim \mathcal{N}(0, \omega^2)$ and $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$. We set the parameters as $\mu = 0$ and $v = 10$. Unlike the LGSSM, this model does not admit an analytic solution and one must therefore resort to approximate inference methods. Further, the multi-modal nature of the latent space makes full posterior inference over $x_{1:T}$ challenging for long state sequences.

To examine the relative mixing of iPMCMC we calculate an effective sample size (ESS) for different steps in the state sequence. In order to calculate the ESS, we condense identical samples, as done for example in van de Meent et al. (2015). Let $u_t^k \in \{x_{t,m}^i[r]\}_{m=1:M}^{i=1:N,\, r=1:R}$, $k = 1, \ldots, K$, $t = 1, \ldots, T$, denote the unique samples of $x_t$ generated by all nodes and sweeps of a particular algorithm after $R$ iterations, where $K$ is the total number of unique samples generated. The weight assigned to unique sample $u_t^k$, denoted $v_t^k$, is given by the combined weights of all particles for which $x_t$ takes the value $u_t^k$:

$v_t^k = \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{i=1}^{N} \bar w_{t,m}^{i,r} \, \eta_m^r \, \delta_{x_{t,m}^i[r]}(u_t^k),$  (14)

where $\delta_{x_{t,m}^i[r]}(u_t^k)$ is the Kronecker delta function and $\eta_m^r$ is a node weight. For iPMCMC the node weight is given by the Rao-Blackwellized estimator described in Section 3.2. For mPG and mPIMH, $\eta_m^r$ is simply $\frac{1}{RM}$, as samples from the different nodes are weighted equally in the absence of interaction. Finally, we define the effective sample size as

$\mathrm{ESS}_t = \left(\sum_{k=1}^{K} \big(v_t^k\big)^2\right)^{-1}.$

[Figure 3. Normalized effective sample size (NESS), as a function of position in the state sequence, for the LGSSM (left) and the NLSSM (right); shown for iPMCMC with $P = 16$ and the benchmark methods.]

Figure 3 shows the ESS for the LGSSM and NLSSM as a function of position in the state sequence. We omit the samples generated by the initialization step, as this SMC sweep is common to all tested algorithms, and we further normalize by the number of MCMC iterations so as to give an idea of the rate at which unique samples are generated. For both models, the ESS of iPMCMC, mPG and mAPG is similar towards the end of the state sequence, but iPMCMC outperforms all the other methods at the early stages. The ESS of mPG was particularly poor at early time steps. mPIMH performed poorly throughout, reflecting the very low observed acceptance ratio of around 7.3% on average.

It should be noted that the ESS is not a direct measure of performance for these models. For example, the equal weighting of nodes is likely to make the ESS artificially high for mPG, mPIMH and mAPG when compared with methods, such as iPMCMC, that assign a weighting to the nodes at each iteration. To acknowledge this, we also plot histograms of the marginal distributions at a number of different positions in the state sequence, as shown in Figure 4. These confirm that iPMCMC and mPG have similar performance at the later state sequence steps, whilst iPMCMC is superior at the earlier stages, with mPG producing almost no new samples beyond those from the initialization sweep due to degeneracy. The performance of mPIMH was consistently worse than iPMCMC throughout the state sequence, with even the final step exhibiting noticeable noise.
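A sketch of this ESS computation at a single time step, condensing identical samples as in (14), is given below; the pooling of the inputs into flat arrays is our own convention.

```python
import numpy as np

def ess_t(samples, weights):
    """Effective sample size at one time step t, condensing identical samples.
    samples: pooled x_t values over nodes m, sweeps r and particles i;
    weights: matching products wbar^{i,r}_{t,m} * eta^r_m from eq. (14).
    An illustrative sketch of the Section 4.2 diagnostic."""
    u, inv = np.unique(samples, return_inverse=True)   # unique values u^k_t
    v = np.zeros(len(u))
    np.add.at(v, inv, weights)                         # v^k_t as in eq. (14)
    v = v / v.sum()                                    # normalise defensively
    return 1.0 / np.sum(v ** 2)                        # ESS_t = (sum_k (v^k_t)^2)^(-1)
```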
5. Discussion and Future Work

The iPMCMC sampler overcomes degeneracy issues in PG by allowing the newly sampled particles from SMC nodes to replace the retained particles in CSMC nodes. Our experimental results demonstrate that, for the models considered, this switch-in rate is far higher than the rate at which PG generates fully independent samples. Moreover, the results in Figure 1b suggest that the degree of improvement over an mPG sampler with the same total number of nodes increases with the total number of nodes in the pool.

The mAPG sampler performs an accept-reject step that compares the marginal likelihood estimate of a single CSMC sweep with that of a single SMC sweep. In the iPMCMC sampler the CSMC estimate of the marginal likelihood is instead compared with a population sample of SMC estimates, resulting in a higher probability that at least one of the SMC nodes will become a CSMC node.

Since the original PMCMC paper in 2010 there have been several papers studying (Chopin & Singh, 2015; Lindsten et al., 2015) and improving upon the basic PG algorithm. Key contributions for combating the path degeneracy effect are backward simulation (Whiteley et al., 2010; Lindsten & Schön, 2013) and ancestor sampling (Lindsten et al., 2014). These can also be used to improve the iPMCMC method even further.

A. Proof of Theorem 1

The proof follows similar ideas to Andrieu et al. (2010). We prove that the interacting particle Markov chain Monte Carlo sampler is in fact a standard partially collapsed Gibbs sampler (Van Dyk & Park, 2008) on the extended space $\Upsilon := \mathsf{X}^{MTN} \times [N]^{M(T-1)N} \times [M]^P \times [N]^P$.

Proof. Assume the setup of Section 3. With $\pi(\cdot)$ as per (9), we will show that the Gibbs sampler on the extended space $\Upsilon$ defined by

$\xi_{1:M} \setminus \{x_{c_{1:P}}^{b_{1:P}}, \mathbf{b}_{c_{1:P}}\} \sim \pi(\cdot \mid x_{c_{1:P}}^{b_{1:P}}, \mathbf{b}_{c_{1:P}}, c_{1:P}, b_{1:P}),$  (15a)
$c_j \sim \pi(\cdot \mid \xi_{1:M}, c_{1:P \setminus j}), \quad j = 1, \ldots, P,$  (15b)
$b_j \sim \pi(\cdot \mid \xi_{1:M}, c_{1:P}), \quad j = 1, \ldots, P,$  (15c)

is equivalent to the iPMCMC method in Algorithm 3.

[Figure 4. Histograms of generated samples at $t = 1$, 100 and 200 for a single dataset generated from (13) with $T = 200$, for iPMCMC with $P = 16$, mPIMH, mPG and mAPG. The dashed red line shows an approximate estimate of the ground truth, found by running a kernel density estimator on the combined samples from a small number of independent SMC sweeps, each with $10^7$ particles.]

First, the initial step (15a) corresponds to sampling from

$\pi\big(\xi_{1:M} \setminus \{x_{c_{1:P}}^{b_{1:P}}, \mathbf{b}_{c_{1:P}}\} \mid x_{c_{1:P}}^{b_{1:P}}, \mathbf{b}_{c_{1:P}}, c_{1:P}, b_{1:P}\big) = \prod_{m \notin c_{1:P}} q_{\mathrm{SMC}}(\xi_m) \prod_{j=1}^{P} q_{\mathrm{CSMC}}\big(\xi_{c_j} \setminus \{x_{c_j}^{b_j}, \mathbf{b}_{c_j}\} \mid x_{c_j}^{b_j}, \mathbf{b}_{c_j}\big).$

This, excluding the conditional trajectories, corresponds exactly to steps 3-4 in Algorithm 3, i.e. running $P$ CSMC and $M - P$ SMC algorithms independently.

We continue with a reformulation of (9) which will be useful to prove correctness of the other two steps. Multiplying and dividing each factor in the product over $j$ by $N^T \bar w_{T,c_j}^{b_j} q_{\mathrm{SMC}}(\xi_{c_j})$, and using the standard PMCMC identity (Andrieu et al., 2010)

$\frac{\pi_T\big(x_{c_j}^{b_j}\big)\, q_{\mathrm{CSMC}}\big(\xi_{c_j} \setminus \{x_{c_j}^{b_j}, \mathbf{b}_{c_j}\} \mid x_{c_j}^{b_j}, \mathbf{b}_{c_j}\big)}{N^T\, \bar w_{T,c_j}^{b_j}\, q_{\mathrm{SMC}}(\xi_{c_j})} = \frac{\hat Z_{c_j}}{Z},$

we get

$\pi(\xi_{1:M}, c_{1:P}, b_{1:P}) = \frac{1}{M_P} \prod_{m=1}^{M} q_{\mathrm{SMC}}(\xi_m) \prod_{j=1}^{P} \frac{\hat Z_{c_j}}{Z}\, \mathbb{1}_{c_j \notin c_{1:j-1}}\, \bar w_{T,c_j}^{b_j}.$  (16)

Furthermore, we note that by marginalising (collapsing) the reformulation (16) over $b_{1:P}$ we get

$\pi(\xi_{1:M}, c_{1:P}) = \frac{1}{M_P} \prod_{m=1}^{M} q_{\mathrm{SMC}}(\xi_m) \prod_{j=1}^{P} \frac{\hat Z_{c_j}}{Z}\, \mathbb{1}_{c_j \notin c_{1:j-1}}.$

From this it is easy to see that $\pi(c_j \mid \xi_{1:M}, c_{1:P \setminus j}) = \hat\zeta_{c_j}^j$, which corresponds to sampling the conditional node indices, i.e. step 6 in Algorithm 3.
Finally, from (16) we see that simulating $b_{1:P}$ can be done independently:

$\pi(b_{1:P} \mid \xi_{1:M}, c_{1:P}) = \frac{\pi(b_{1:P}, \xi_{1:M}, c_{1:P})}{\pi(\xi_{1:M}, c_{1:P})} = \prod_{j=1}^{P} \bar w_{T,c_j}^{b_j}.$

This corresponds to step 7 of the iPMCMC sampler, Algorithm 3. The procedure defined by (15) is therefore a partially collapsed Gibbs sampler, derived from (9), and we have shown that it is exactly equal to the iPMCMC sampler described in Algorithm 3. ∎

Acknowledgments

Tom Rainforth is supported by a BP industrial grant. Christian A. Naesseth is supported by CADICS, a Linnaeus Center, funded by the Swedish Research Council (VR). Fredrik Lindsten is supported by the project Learning of complex dynamical systems (contract number 637-2014-466), also funded by the Swedish Research Council. Frank Wood is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement number FA8750-14-2-0006, Sub Award number 61160290-111668.

References

Andrieu, Christophe, Doucet, Arnaud, and Holenstein, Roman. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342, 2010.

Bérard, Jean, Del Moral, Pierre, and Doucet, Arnaud. A lognormal central limit theorem for particle approximations of normalizing constants. Electronic Journal of Probability, 19(94):1-28, 2014.

Chopin, Nicolas and Singh, Sumeetpal S. On particle Gibbs sampling. Bernoulli, 21(3):1855-1883, 2015. doi: 10.3150/14-BEJ629.

Doucet, Arnaud, de Freitas, Nando, and Gordon, Neil. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2001.

Doucet, Arnaud, Pitt, Michael, Deligiannidis, George, and Kohn, Robert. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 2015.

Everitt, Richard G. Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4):940-960, 2012.

Gordon, Neil J, Salmond, David J, and Smith, Adrian FM. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2):107-113, 1993.

Holenstein, Roman. Particle Markov chain Monte Carlo. PhD thesis, The University of British Columbia, Vancouver, 2009.

Huggins, Jonathan H. and Roy, Daniel M. Convergence of sequential Monte Carlo-based sampling methods. arXiv e-prints, arXiv:1503.00966v1, March 2015.

Lindsten, Fredrik and Schön, Thomas B. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1-143, 2013.

Lindsten, Fredrik, Jordan, Michael I., and Schön, Thomas B. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research, 15:2145-2184, June 2014.

Lindsten, Fredrik, Douc, Randal, and Moulines, Eric. Uniform ergodicity of the particle Gibbs sampler. Scandinavian Journal of Statistics, 42(3):775-797, 2015.

Naesseth, Christian A., Lindsten, Fredrik, and Schön, Thomas B. Sequential Monte Carlo for graphical models. In Advances in Neural Information Processing Systems 27, pp. 1862-1870. Curran Associates, Inc., 2014.

Naesseth, Christian A., Lindsten, Fredrik, and Schön, Thomas B. Nested sequential Monte Carlo methods. In The 32nd International Conference on Machine Learning, volume 37 of JMLR W&CP, pp. 1292-1301, Lille, France, July 2015.

Pitt, Michael K, dos Santos Silva, Ralph, Giordani, Paolo, and Kohn, Robert.
On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134-151, 2012.

Rauch, Herbert E, Striebel, CT, and Tung, F. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445-1450, 1965.

Robert, Christian and Casella, George. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Tripuraneni, Nilesh, Gu, Shixiang, Ge, Hong, and Ghahramani, Zoubin. Particle Gibbs for infinite hidden Markov models. In Advances in Neural Information Processing Systems 28, pp. 2386-2394. Curran Associates, Inc., 2015.

Valera, Isabel, Ruiz, Francisco, Svensson, Lennart, and Perez-Cruz, Fernando. Infinite factorial dynamical model. In Advances in Neural Information Processing Systems 28, pp. 1657-1665. Curran Associates, Inc., 2015.

van de Meent, Jan-Willem, Yang, Hongseok, Mansinghka, Vikash, and Wood, Frank. Particle Gibbs with ancestor sampling for probabilistic programs. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 986-994, 2015.

Van Dyk, David A and Park, Taeyoung. Partially collapsed Gibbs samplers: Theory and methods. Journal of the American Statistical Association, 103(482):790-796, 2008.

Whiteley, Nick, Andrieu, Christophe, and Doucet, Arnaud. Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. arXiv e-prints, arXiv:1011.2437, 2010.

Whiteley, Nick, Lee, Anthony, and Heine, Kari. On the role of interaction in sequential Monte Carlo algorithms. Bernoulli, 22(1):494-529, 2016.

Wood, Frank, van de Meent, Jan-Willem, and Mansinghka, Vikash. A new approach to probabilistic programming inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pp. 2-46, 2014.