# Nested Sequential Monte Carlo Methods

Christian A. Naesseth (CHRISTIAN.A.NAESSETH@LIU.SE), Linköping University, Linköping, Sweden
Fredrik Lindsten (FREDRIK.LINDSTEN@ENG.CAM.AC.UK), The University of Cambridge, Cambridge, United Kingdom
Thomas B. Schön (THOMAS.SCHON@IT.UU.SE), Uppsala University, Uppsala, Sweden

*Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).*

We propose nested sequential Monte Carlo (NSMC), a methodology to sample from sequences of probability distributions, even where the random variables are high-dimensional. NSMC generalises the SMC framework by requiring only approximate, properly weighted, samples from the SMC proposal distribution, while still resulting in a correct SMC algorithm. Furthermore, NSMC can in itself be used to produce such properly weighted samples. Consequently, one NSMC sampler can be used to construct an efficient high-dimensional proposal distribution for another NSMC sampler, and this nesting of the algorithm can be done to an arbitrary degree. This allows us to consider complex and high-dimensional models using SMC. We show results that motivate the efficacy of our approach on several filtering problems with dimensions in the order of 100 to 1 000.

## 1. Introduction

Inference in complex and high-dimensional statistical models is a very challenging problem that is ubiquitous in applications. Examples include, but are definitely not limited to, climate informatics (Monteleoni et al., 2013), bioinformatics (Cohen, 2004) and machine learning (Wainwright & Jordan, 2008). In particular, we are interested in sequential Bayesian inference, which involves computing integrals of the form

$$\pi_k(f) := \mathbb{E}_{\pi_k}[f(X_{1:k})] = \int f(x_{1:k})\,\pi_k(x_{1:k})\,\mathrm{d}x_{1:k}, \qquad (1)$$

for some sequence of probability densities

$$\pi_k(x_{1:k}) = Z_{\pi_k}^{-1}\,\bar\pi_k(x_{1:k}), \qquad k \geq 1, \qquad (2)$$

with normalisation constants $Z_{\pi_k} = \int \bar\pi_k(x_{1:k})\,\mathrm{d}x_{1:k}$. Note that $x_{1:k} := (x_1, \ldots, x_k) \in \mathsf{X}^k$.

*Figure 1. Example of a spatio-temporal model (time slices $k-1$, $k$, $k+1$ shown) where $\pi_k(x_{1:k})$ is given by a $k \times 2 \times 3$ undirected graphical model and $x_k \in \mathbb{R}^{2 \times 3}$.*

The typical scenario that we consider is the well-known problem of inference in time series or state space models (Shumway & Stoffer, 2011; Cappé et al., 2005). Here the index $k$ corresponds to time and we want to process some observations $y_{1:k}$ in a sequential manner to compute expectations with respect to the filtering distribution $\pi_k(\mathrm{d}x_k) = \mathbb{P}(X_k \in \mathrm{d}x_k \mid y_{1:k})$. To be specific, we are interested in settings where (i) $X_k$ is high-dimensional, i.e. $X_k \in \mathbb{R}^d$ with $d \gg 1$, and (ii) there are local dependencies among the latent variables $X_{1:k}$, both w.r.t. time $k$ and between the individual components of the (high-dimensional) vectors $X_k$. One example of the type of models we consider are the so-called spatio-temporal models (Wikle, 2015; Cressie & Wikle, 2011; Rue & Held, 2005). In Figure 1 we provide a probabilistic graphical model representation of a spatio-temporal model that we will explore further in Section 6.

Sequential Monte Carlo (SMC) methods, reviewed in Section 2.1, comprise one of the most successful methodologies for sequential Bayesian inference. However, SMC struggles in high dimensions and these methods are rarely used for dimensions, say, $d \geq 10$ (Rebeschini & van Handel, 2015). The purpose of the NSMC methodology is to push this limit well beyond $d = 10$.
The basic strategy, described in Section 2.2, is to mimic the behaviour of a so-called fully adapted SMC algorithm. Full adaptation can drastically improve the efficiency of SMC in high dimensions. Unfortunately, it can rarely be implemented in practice since the fully adapted proposal distributions are typically intractable. NSMC addresses this difficulty by requiring only approximate, properly weighted, samples from the proposal distribution. The proper weighting condition ensures the validity of NSMC, thus providing a generalisation of the family of SMC methods. Furthermore, NSMC will itself produce properly weighted samples. Consequently, it is possible to use one NSMC procedure within another to construct efficient high-dimensional proposal distributions, and this nesting of the algorithm can be done to an arbitrary degree. For instance, for the model depicted in Figure 1 we could use three nested samplers, one for each dimension of the volume.

The main methodological development is concentrated in Sections 3 and 4. We introduce the concept of proper weighting, approximations of the proposal distribution, and nesting of Monte Carlo algorithms. Throughout Section 3 we consider simple importance sampling and in Section 4 we extend the development to the sequential setting. We deliberately defer the discussion of the existing body of related work until Section 5, to allow for a better understanding of its relationship to the new developments presented in Sections 3 and 4. We also discuss various attractive features of NSMC that are of interest in high-dimensional settings, e.g. the fact that the computation is easy to distribute, which results in improved memory efficiency and lower communication costs. Finally, Section 6 profiles our method extensively against a state-of-the-art competing algorithm on several high-dimensional data sets. We also show the inference performance and the modularity of the method on a d = 1 056 dimensional climatological spatio-temporal model (Fu et al., 2012) structured according to Figure 1.

## 2. Background and Inference Strategy

### 2.1. Sequential Monte Carlo

Evaluating $\pi_k(f)$, as well as the normalisation constant $Z_{\pi_k}$ in (2), is typically intractable and we need to resort to approximations. SMC methods, or particle filters (PF), constitute a popular class of numerical approximations for sequential inference problems. Here we give a high-level introduction to the concepts underlying SMC methods and postpone the details to Section 4. For a more extensive treatment we refer to Doucet & Johansen (2011); Cappé et al. (2005); Doucet et al. (2001). In particular, we will use the auxiliary SMC method as proposed by Pitt & Shephard (1999).

At iteration $k-1$, the SMC sampler approximates the target distribution $\pi_{k-1}$ by a collection of weighted particles $\{(X_{1:k-1}^i, W_{k-1}^i)\}_{i=1}^N$. These samples define an empirical point-mass approximation of the target distribution

$$\pi_{k-1}^N(\mathrm{d}x_{1:k-1}) := \sum_{i=1}^N \frac{W_{k-1}^i}{\sum_{\ell} W_{k-1}^{\ell}}\, \delta_{X_{1:k-1}^i}(\mathrm{d}x_{1:k-1}), \qquad (3)$$

where $\delta_X(\mathrm{d}x)$ denotes a Dirac measure at $X$. Each iteration of the SMC method can then conceptually be described by three steps: resampling, propagation, and weighting. The resampling step puts emphasis on the most promising particles by discarding the unlikely ones and duplicating the likely ones. The propagation and weighting steps essentially correspond to using importance sampling when changing the target distribution from $\pi_{k-1}$ to $\pi_k$, i.e. simulating new particles from a proposal distribution and then computing corresponding importance weights.
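As a concrete illustration of the resampling, propagation and weighting steps, the following is a minimal Python sketch of a bootstrap PF for a one-dimensional linear Gaussian state space model. The model, the parameter values and all names are our own illustrative assumptions, not part of the paper; the fully adapted variant underlying NSMC is introduced next.

```python
# Minimal bootstrap particle filter sketch for the illustrative model
#   x_k = a x_{k-1} + N(0, q),   y_k = x_k + N(0, r).
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf(y, N=500, a=0.9, q=1.0, r=1.0):
    x = rng.normal(0.0, 1.0, N)               # initial particles (assumed prior)
    filter_means = []
    for yk in y:
        # Propagation: sample new particles from the system dynamics.
        x = a * x + rng.normal(0.0, np.sqrt(q), N)
        # Weighting: importance weights from the observation likelihood.
        logw = -0.5 * (yk - x) ** 2 / r
        w = np.exp(logw - logw.max())
        w /= w.sum()
        filter_means.append(np.sum(w * x))    # estimate of E[x_k | y_{1:k}]
        # Resampling: duplicate likely particles, discard unlikely ones.
        x = x[rng.choice(N, size=N, p=w)]
    return np.array(filter_means)
```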
### 2.2. Adapting the Proposal Distribution

The first working SMC algorithm was the bootstrap PF by Gordon et al. (1993), which propagates particles by sampling from the system dynamics and computes importance weights according to the observation likelihood (in the state space setting). However, it is well known that the bootstrap PF suffers from weight collapse in high-dimensional settings (Bickel et al., 2008), i.e. the estimate becomes dominated by a single particle with a weight close to one. This is an effect of the mismatch between the importance sampling proposal and the target distribution, which typically gets more pronounced in high dimensions. More efficient proposals, partially alleviating the degeneracy issue for some models, can be designed by adapting the proposal distribution to the target distribution (see Section 4.2).

In Naesseth et al. (2014a) we make use of the fully adapted SMC method (Pitt & Shephard, 1999) for inference in a (fairly) high-dimensional discrete model, where $x_k$ is a 60-dimensional discrete vector. We can then make use of forward filtering and backward simulation, operating on the individual components of each $x_k$, in order to sample from the fully adapted SMC proposals. However, this method is limited to models where the latent space is either discrete or Gaussian and the optimal proposal can be identified with a tree-structured graphical model. Our development here can be seen as a non-trivial extension of this technique. Instead of coupling one SMC sampler with an exact forward filter/backward simulator (which in fact reduces to an instance of standard SMC), we derive a way of coupling multiple SMC samplers and SMC-based backward simulators. This allows us to construct procedures for mimicking the efficient fully adapted proposals for arbitrary latent spaces and structures in high-dimensional models.

## 3. Proper Weighting and Nested Importance Sampling

In this section we lay the groundwork for the derivation of the class of NSMC algorithms. We start by considering the simpler case of importance sampling (IS), which is a fundamental component of SMC, and introduce the key concepts that we make use of. In particular, we will use a (slightly nonstandard) presentation of an algorithm as an instance of a class, in the object-oriented sense, and show that these classes can be nested to an arbitrary degree.

### 3.1. Exact Approximation of the Proposal Distribution

Let $\pi(x) = Z_{\pi}^{-1}\,\bar\pi(x)$ be a target distribution of interest. IS can be used to estimate an expectation $\pi(f) := \mathbb{E}_{\pi}[f(X)]$ by sampling from a proposal distribution $q(x) = Z_q^{-1}\,\bar q(x)$ and computing the estimator $\big(\sum_{i=1}^N W^i\big)^{-1} \sum_{i=1}^N W^i f(X^i)$, with $W^i = Z_q\,\bar\pi(X^i)/\bar q(X^i)$, where $\{(X^i, W^i)\}_{i=1}^N$ are the weighted samples. It is possible to replace the IS weight by a nonnegative unbiased estimate and still obtain a valid (consistent, etc.) algorithm (Liu, 2001, p. 37). One way to motivate this approach is to consider the random weight as an auxiliary variable and to extend the target distribution accordingly. Our development is in the same flavour, but we will use a more explicit condition on the relationship between the random weights and the simulated particles. Specifically, we will make use of the following key property to formally justify the proposed algorithms.

**Definition 1** (Properly weighted sample). A (random) pair $(X, W)$ is properly weighted for an unnormalised distribution $\bar p$ if $W \geq 0$ and $\mathbb{E}[f(X)\,W] = \bar p(f) := \int f(x)\,\bar p(x)\,\mathrm{d}x$ for all measurable functions $f$.
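As a numerical sanity check of Definition 1, the sketch below verifies that the classical IS pair $X \sim q$, $W = \bar p(X)/q(X)$ is properly weighted for an unnormalised $\bar p$: the weighted averages recover $\bar p(f)$ and $Z_p$. The concrete densities are our own illustrative choices.

```python
# Check E[f(X) W] = p_bar(f) for X ~ q and W = p_bar(X) / q(X).
import numpy as np

rng = np.random.default_rng(1)

p_bar = lambda x: 2.0 * np.exp(-0.5 * (x - 1.0) ** 2)           # unnormalised N(1, 1); Z_p = 2 sqrt(2 pi)
q_pdf = lambda x: np.exp(-x ** 2 / 8.0) / np.sqrt(8.0 * np.pi)  # N(0, 4) proposal density

x = rng.normal(0.0, 2.0, 10 ** 6)       # X ~ q
w = p_bar(x) / q_pdf(x)                 # proper weights for p_bar

print(np.mean(w))                       # ~ Z_p = 2 sqrt(2 pi) ~ 5.01
print(np.mean(x * w) / np.mean(w))      # self-normalised estimate of E_p[X] ~ 1
```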
Note that proper weighting of $\{(X^i, W^i)\}_{i=1}^N$ implies unbiasedness of the estimate of the normalising constant of $\bar p$. Indeed, taking $f(x) \equiv 1$ gives $\mathbb{E}\big[\frac{1}{N}\sum_{i=1}^N W^i\big] = \int \bar p(x)\,\mathrm{d}x =: Z_p$. Interestingly, to construct a valid IS algorithm for our target $\pi$ it is sufficient to generate samples that are properly weighted w.r.t. the proposal distribution $\bar q$. To formalise this claim, assume that we are not able to simulate exactly from $q$, but that it is possible to evaluate the unnormalised density $\bar q$ point-wise. Furthermore, assume we have access to a class $\mathsf{Q}$, which works as follows. The constructor of $\mathsf{Q}$ requires the specification of an unnormalised density function, say $\bar q$, which will be approximated by the procedures of $\mathsf{Q}$. Furthermore, to highlight the fact that we will typically use IS (and SMC) to construct $\mathsf{Q}$, the constructor also takes as an argument a precision parameter $M$, corresponding to the number of samples used by the internal Monte Carlo procedure. An object is then instantiated as $\mathsf{q} = \mathsf{Q}(\bar q, M)$. The class $\mathsf{Q}$ is assumed to have the following properties:

**(A1)** Let $\mathsf{q} = \mathsf{Q}(\bar q, M)$. Assume that:
1. The construction of $\mathsf{q}$ results in the generation of a (possibly random) member variable, accessible as $\hat Z_q = \mathsf{q}.\mathrm{GetZ}()$. The variable $\hat Z_q$ is a nonnegative, unbiased estimate of the normalising constant $Z_q = \int \bar q(x)\,\mathrm{d}x$.
2. $\mathsf{Q}$ has a member function Simulate which returns a (possibly random) variable $X = \mathsf{q}.\mathrm{Simulate}()$, such that $(X, \hat Z_q)$ is properly weighted for $\bar q$.

With the definition of $\mathsf{Q}$ in place, it is possible to generalise¹ the basic importance sampler as in Algorithm 1, which generates weighted samples $\{(X^i, W^i)\}_{i=1}^N$ targeting $\pi$. Note that Algorithm 1 is different from a random-weight IS, since it approximates the proposal distribution (and not just the importance weights).

**Algorithm 1 Nested IS (steps 1–3 for $i = 1, \ldots, N$)**
1. Initialise $\mathsf{q}^i = \mathsf{Q}(\bar q, M)$.
2. Set $\hat Z_q^i = \mathsf{q}^i.\mathrm{GetZ}()$ and $X^i = \mathsf{q}^i.\mathrm{Simulate}()$.
3. Set $W^i = \hat Z_q^i\,\bar\pi(X^i)/\bar q(X^i)$.
4. Compute $\hat Z_\pi = \frac{1}{N}\sum_{i=1}^N W^i$.

To see the validity of Algorithm 1 we can interpret the sampler as a standard IS algorithm for an extended target distribution, defined as $\Pi(x, u) := u\, Q(x, u)\, \pi(x)\, \bar q^{-1}(x)$, where $Q(x, u)$ is the joint PDF of the random pair $(\mathsf{q}.\mathrm{Simulate}(), \mathsf{q}.\mathrm{GetZ}())$. Note that $\Pi$ is indeed a PDF that admits $\pi$ as a marginal; for any measurable subset $A \subseteq \mathsf{X}$,

$$\Pi(A \times \mathbb{R}_+) = \int 1_A(x)\, u\, \frac{\pi(x)}{\bar q(x)}\, Q(x, u)\, \mathrm{d}x\,\mathrm{d}u = \mathbb{E}\Big[\hat Z_q\, 1_A(X)\, \frac{\pi(X)}{\bar q(X)}\Big] = \int 1_A(x)\,\pi(x)\,\mathrm{d}x = \pi(A),$$

where the penultimate equality follows from the fact that $(X, \hat Z_q)$ is properly weighted for $\bar q$. Furthermore, the standard unnormalised IS weight for a sampler with target $\Pi$ and proposal $Q$ is given by $u\,\bar\pi/\bar q$, in agreement with Algorithm 1. Algorithm 1 is an example of what is referred to as an exact approximation; see e.g. Andrieu & Roberts (2009); Andrieu et al. (2010). Algorithmically, the method appears to be an approximation of IS, but samples generated by the algorithm nevertheless target the correct distribution $\pi$.

¹With $\mathsf{q}.\mathrm{GetZ}() \mapsto Z_q$ and $\mathsf{q}.\mathrm{Simulate}()$ returning an exact sample from $q$ we obtain the standard IS method.
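The following sketch translates Algorithm 1 and property (A1) into Python. The class `ISQ` builds GetZ and Simulate from a self-normalised IS over an internal proposal $r$, anticipating the construction in Section 3.2; `ISQ`, `nested_is` and the default choice of $r$ are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

class ISQ:
    """A class Q satisfying (A1) via internal importance sampling: get_z()
    is unbiased for Z_q and (simulate(), get_z()) is properly weighted for
    the unnormalised density q_bar."""
    def __init__(self, q_bar, M,
                 r_sample=lambda M: rng.normal(0.0, 3.0, M),   # internal proposal r = N(0, 9)
                 r_pdf=lambda x: np.exp(-x ** 2 / 18.0) / np.sqrt(18.0 * np.pi)):
        self.x = r_sample(M)
        self.w = q_bar(self.x) / r_pdf(self.x)
        self.z_hat = self.w.mean()                 # unbiased estimate of Z_q
    def get_z(self):
        return self.z_hat
    def simulate(self):
        p = self.w / self.w.sum()
        return self.x[rng.choice(len(self.x), p=p)]   # return X^B, B ~ Categorical(p)

def nested_is(pi_bar, q_bar, N=1000, M=50):
    """Algorithm 1: weighted samples targeting pi and an estimate of Z_pi."""
    xs, ws = np.empty(N), np.empty(N)
    for i in range(N):
        q = ISQ(q_bar, M)                                  # step 1
        xs[i] = q.simulate()                               # step 2
        ws[i] = q.get_z() * pi_bar(xs[i]) / q_bar(xs[i])   # step 3
    return xs, ws, ws.mean()                               # step 4: Z_pi estimate
```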
### 3.2. Modularity of Nested IS

To be able to implement Algorithm 1 we need to define a class $\mathsf{Q}$ with the required properties (A1). The modularity of the procedure (as well as its name) comes from the fact that we can use Algorithm 1 also in this respect. Indeed, let us now view $\pi$, the target distribution of Algorithm 1, as the proposal distribution for another Nested IS procedure and consider the following definition of $\mathsf{Q}$:

1. Algorithm 1 is executed at the construction of the object $\mathsf{p} = \mathsf{Q}(\bar\pi, N)$, and $\mathsf{p}.\mathrm{GetZ}()$ returns the normalising constant estimate $\hat Z_\pi$.
2. $\mathsf{p}.\mathrm{Simulate}()$ simulates a categorical random variable $B$ with $\mathbb{P}(B = i) = W^i/\sum_{\ell=1}^N W^\ell$ and returns $X^B$.

A simple computation now yields that for any measurable $f$ we have $\mathbb{E}[f(X^B)\,\hat Z_\pi] = \pi(f)\, Z_\pi$. This implies that $(X^B, \hat Z_\pi)$ is properly weighted for $\bar\pi$ and that our definition of $\mathsf{Q}(\bar\pi, N)$ indeed satisfies condition (A1). The Nested IS algorithm in itself is unlikely to be of direct practical interest. However, in the next section we will, essentially, repeat the preceding derivation in the context of SMC to develop the NSMC method.

## 4. Nested Sequential Monte Carlo

### 4.1. Fully Adapted SMC Samplers

Let us return to the sequential inference problem. As before, let $\pi_k(x_{1:k}) = Z_{\pi_k}^{-1}\,\bar\pi_k(x_{1:k})$ denote the target distribution at time $k$. The unnormalised density $\bar\pi_k$ can be evaluated point-wise, but the normalising constant $Z_{\pi_k}$ is typically unknown. We will use SMC to simulate sequentially from the distributions $\{\pi_k\}_{k=1}^n$. In particular, we consider the fully adapted SMC sampler (Pitt & Shephard, 1999), which corresponds to a specific choice of resampling weights and proposal distribution, chosen in such a way that the importance weights are all equal to $1/N$. Specifically, the proposal distribution (often referred to as the optimal proposal) is given by $q_k(x_k \mid x_{1:k-1}) = Z_{q_k}(x_{1:k-1})^{-1}\,\bar q_k(x_k \mid x_{1:k-1})$, where $\bar q_k(x_k \mid x_{1:k-1}) := \bar\pi_k(x_{1:k})/\bar\pi_{k-1}(x_{1:k-1})$. In addition, the normalising constant $Z_{q_k}(x_{1:k-1}) = \int \bar q_k(x_k \mid x_{1:k-1})\,\mathrm{d}x_k$ is further used to define the resampling weights, i.e. the particles at time $k-1$ are resampled according to $Z_{q_k}(x_{1:k-1})$ before they are propagated to time $k$. For notational simplicity we use the conventions $x_{1:0} = \emptyset$, $\bar q_1(x_1 \mid x_{1:0}) = \bar\pi_1(x_1)$ and $Z_{q_1}(x_{1:0}) = Z_{\pi_1}$. The fully adapted auxiliary SMC sampler is given in Algorithm 2.

**Algorithm 2 SMC (fully adapted)**
1. Set $\hat Z_{\pi_0} = 1$.
2. for $k = 1$ to $n$
   (a) Compute $\hat Z_{\pi_k} = \hat Z_{\pi_{k-1}}\,\frac{1}{N}\sum_{j=1}^N Z_{q_k}(X_{1:k-1}^j)$.
   (b) Draw $m_k^{1:N}$ from a multinomial distribution with probabilities $Z_{q_k}(X_{1:k-1}^j)/\sum_{\ell=1}^N Z_{q_k}(X_{1:k-1}^\ell)$ for $j = 1, \ldots, N$.
   (c) Set $L \leftarrow 0$.
   (d) for $j = 1$ to $N$
      i. Draw $X_k^i \sim q_k(\cdot \mid X_{1:k-1}^j)$ and let $X_{1:k}^i = (X_{1:k-1}^j, X_k^i)$ for $i = L+1, \ldots, L+m_k^j$.
      ii. Set $L \leftarrow L + m_k^j$.

As mentioned above, at each iteration $k = 1, \ldots, n$ the method produces unweighted samples $\{X_{1:k}^i\}_{i=1}^N$ approximating $\pi_k$. It also produces an unbiased estimate $\hat Z_{\pi_k}$ of $Z_{\pi_k}$ (Del Moral, 2004, Proposition 7.4.1). The algorithm is expressed in a slightly non-standard form; at iteration $k$ we loop over the ancestor particles, i.e. the particles after resampling at iteration $k-1$, and let each ancestor particle $j$ generate $m_k^j$ offspring. (The variable $L$ is just for bookkeeping.) This is done to clarify the connection with the NSMC procedure below. Furthermore, we have included a (completely superfluous) resampling step at iteration $k = 1$, where the dummy variables $\{X_{1:0}^i\}_{i=1}^N$ are resampled according to the (all equal) weights $\{Z_{q_1}(X_{1:0}^i)\}_{i=1}^N = \{Z_{\pi_1}\}_{i=1}^N$. The analogue of this step is, however, used in the NSMC algorithm, where the initial normalising constant $Z_{\pi_1}$ is estimated, and we thus have to resample the corresponding initial particle systems accordingly.
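For models where $Z_{q_k}$ and $q_k$ are tractable, Algorithm 2 can be implemented directly. The sketch below does so for the same illustrative one-dimensional linear Gaussian model as before, where $Z_{q_k}(x_{k-1}) = p(y_k \mid x_{k-1})$ and $q_k(x_k \mid x_{k-1}) = p(x_k \mid x_{k-1}, y_k)$ are Gaussians in closed form; the model and all names are our own assumptions, and the initial step is simplified to sampling from a fixed prior.

```python
import numpy as np

rng = np.random.default_rng(3)

def fully_adapted_smc(y, N=500, a=0.9, q=1.0, r=1.0):
    """Algorithm 2 for x_k = a x_{k-1} + N(0, q), y_k = x_k + N(0, r)."""
    x = rng.normal(0.0, 1.0, N)            # simplified initialisation
    log_Z = 0.0
    s2 = 1.0 / (1.0 / q + 1.0 / r)         # variance of p(x_k | x_{k-1}, y_k)
    for yk in y:
        # Resampling weights: Z_qk(x_{k-1}) = p(y_k | x_{k-1}) = N(y_k; a x_{k-1}, q + r).
        z = np.exp(-0.5 * (yk - a * x) ** 2 / (q + r)) / np.sqrt(2.0 * np.pi * (q + r))
        log_Z += np.log(z.mean())                     # step 2(a), in log scale
        anc = rng.choice(N, size=N, p=z / z.sum())    # step 2(b): multinomial resampling
        # Step 2(d): propagate from the optimal proposal q_k(x_k | x_{k-1}, y_k).
        mean = s2 * (a * x[anc] / q + yk / r)
        x = mean + np.sqrt(s2) * rng.normal(size=N)
    return x, log_Z            # unweighted particles and log Z estimate
```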
### 4.2. Fully Adapted Nested SMC Samplers

In analogy with Section 3, assume now that we are able neither to simulate exactly from $q_k$ nor to compute $Z_{q_k}$. Instead, we have access to a class $\mathsf{Q}$ which satisfies condition (A1). The proposed NSMC method is then given by Algorithm 3, which can be seen as an exact approximation of the fully adapted SMC sampler in Algorithm 2. (In Naesseth et al. (2015) we provide a formulation of NSMC with arbitrary proposals and resampling weights.) We replace the exact computation of $Z_{q_k}$ and the exact simulation from $q_k$ by the approximate procedures available through $\mathsf{Q}$.

**Algorithm 3 Nested SMC (fully adapted)**
1. Set $\hat Z_{\pi_0} = 1$.
2. for $k = 1$ to $n$
   (a) Initialise $\mathsf{q}^j = \mathsf{Q}(\bar q_k(\cdot \mid X_{1:k-1}^j), M)$ for $j = 1, \ldots, N$.
   (b) Set $\hat Z_{q_k}^j = \mathsf{q}^j.\mathrm{GetZ}()$ for $j = 1, \ldots, N$.
   (c) Compute $\hat Z_{\pi_k} = \hat Z_{\pi_{k-1}}\big\{\frac{1}{N}\sum_{j=1}^N \hat Z_{q_k}^j\big\}$.
   (d) Draw $m_k^{1:N}$ from a multinomial distribution with probabilities $\hat Z_{q_k}^j/\sum_{\ell=1}^N \hat Z_{q_k}^\ell$ for $j = 1, \ldots, N$.
   (e) Set $L \leftarrow 0$.
   (f) for $j = 1$ to $N$
      i. Compute $X_k^i = \mathsf{q}^j.\mathrm{Simulate}()$ and let $X_{1:k}^i = (X_{1:k-1}^j, X_k^i)$ for $i = L+1, \ldots, L+m_k^j$.
      ii. Delete $\mathsf{q}^j$.
      iii. Set $L \leftarrow L + m_k^j$.

Despite this approximation, Algorithm 3 is a valid SMC method. This is formalised by the following theorem.

**Theorem 1.** Assume that $\mathsf{Q}$ satisfies condition (A1). Then, under certain regularity conditions on the function $f : \mathsf{X}^k \to \mathbb{R}^d$ and for an asymptotic variance $\Sigma_k^M(f)$, both specified in Naesseth et al. (2015), we have

$$\sqrt{N}\left(\frac{1}{N}\sum_{i=1}^N f(X_{1:k}^i) - \pi_k(f)\right) \xrightarrow{D} \mathcal{N}\big(0, \Sigma_k^M(f)\big),$$

where $\{X_{1:k}^i\}_{i=1}^N$ are generated by Algorithm 3 and $\xrightarrow{D}$ denotes convergence in distribution.

*Proof.* See Naesseth et al. (2015).

**Remark 1.** The key point of Theorem 1 is that, under certain regularity conditions, the NSMC method converges at rate $\sqrt{N}$ even for a fixed (and finite) value of the precision parameter $M$. The asymptotic variance $\Sigma_k^M(f)$, however, will depend on the accuracy and properties of the approximative procedures of $\mathsf{Q}$. We leave it as future work to establish more informative results, relating the asymptotic variance of NSMC to that of the ideal, fully adapted SMC sampler.

### 4.3. Backward Simulation and Modularity of NSMC

As previously mentioned, the NSMC procedure is modular in the sense that we can make use of Algorithm 3 also to define the class $\mathsf{Q}$. Thus, we now view $\bar\pi_n$ as the proposal distribution that we wish to approximately sample from using NSMC. Algorithm 3 directly generates an estimate $\hat Z_{\pi_n}$ of the normalising constant of $\bar\pi_n$ (which indeed is unbiased; see Theorem 2). However, we also need to generate a sample $\widetilde X_{1:n}$ such that $(\widetilde X_{1:n}, \hat Z_{\pi_n})$ is properly weighted for $\bar\pi_n$. The simplest approach, akin to the Nested IS procedure described in Section 3.2, is to draw $B_n$ uniformly on $\{1, \ldots, N\}$ and return $\widetilde X_{1:n} = X_{1:n}^{B_n}$. This will indeed result in a valid definition of the Simulate procedure. However, this approach suffers from the well-known path degeneracy of SMC samplers. In particular, since we call $\mathsf{q}^j.\mathrm{Simulate}()$ multiple times in Step 2(f)i of Algorithm 3, we risk obtaining (very) strongly correlated samples with this simple approach.

It is possible to improve the performance of the above procedure by instead making use of a backward simulator (Godsill et al., 2004; Lindsten & Schön, 2013) to simulate $\widetilde X_{1:n}$. The backward simulator, given in Algorithm 4, is a type of smoothing algorithm; it makes use of the particles generated by a forward pass of Algorithm 3 to simulate, backward in time, a trajectory $\widetilde X_{1:n}$ approximately distributed according to $\pi_n$.

**Algorithm 4 Backward simulator (fully adapted)**
1. Draw $B_n$ uniformly on $\{1, \ldots, N\}$.
2. Set $\widetilde X_n = X_n^{B_n}$.
3. for $k = n-1$ to $1$
   (a) Compute $\widetilde W_k^j = \dfrac{\bar\pi_n\big((X_{1:k}^j, \widetilde X_{k+1:n})\big)}{\bar\pi_k(X_{1:k}^j)}$ for $j = 1, \ldots, N$.
   (b) Draw $B_k$ from a categorical distribution with probabilities $\widetilde W_k^j/\sum_{\ell=1}^N \widetilde W_k^\ell$ for $j = 1, \ldots, N$.
   (c) Set $\widetilde X_{k:n} = (X_k^{B_k}, \widetilde X_{k+1:n})$.

**Remark 2.** Algorithm 4 assumes unweighted particles and can thus be used in conjunction with the fully adapted NSMC procedure of Algorithm 3. If, however, the forward filter is not fully adapted, the weights need to be accounted for in the backward simulation; see Naesseth et al. (2015).
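A minimal sketch of Algorithm 3, reusing the (A1) interface of the `ISQ` sketch from Section 3: any class exposing `get_z()` and `simulate()` with the stated properties can be plugged in. The factory function, all names, and the omission of Algorithm 4 (this sketch keeps trajectories by ancestry only, corresponding to the simple uniform-selection Simulate discussed above) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def nested_smc(q_bar_factory, Q_cls, n, N=100, M=50):
    """Algorithm 3: q_bar_factory(k, x_hist) returns the unnormalised optimal
    proposal q_bar_k(. | x_{1:k-1}) for trajectory x_hist; Q_cls is any class
    satisfying (A1), e.g. the ISQ sketch above."""
    hist = [[] for _ in range(N)]          # particle trajectories x_{1:k}
    log_Z = 0.0
    for k in range(1, n + 1):
        # Steps 2(a)-(b): one inner sampler per ancestor, unbiased Z_qk estimates.
        inner = [Q_cls(q_bar_factory(k, hist[j]), M) for j in range(N)]
        z = np.array([q.get_z() for q in inner])
        log_Z += np.log(z.mean())          # step 2(c), in log scale
        # Step 2(d): multinomial number of offspring per ancestor.
        m = rng.multinomial(N, z / z.sum())
        # Step 2(f): ancestor j spawns m[j] offspring via independent Simulate calls.
        hist = [hist[j] + [inner[j].simulate()]
                for j in range(N) for _ in range(m[j])]
    return hist, log_Z
```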
The modularity of NSMC is established by the following result.

**Definition 2.** Let $\mathsf{p} = \mathsf{Q}(\bar\pi_n, N)$ be defined as follows:
1. The constructor executes Algorithm 3 with target distribution $\bar\pi_n$ and with $N$ particles, and $\mathsf{p}.\mathrm{GetZ}()$ returns the estimate of the normalising constant $\hat Z_{\pi_n}$.
2. $\mathsf{p}.\mathrm{Simulate}()$ executes Algorithm 4 and returns $\widetilde X_{1:n}$.

**Theorem 2.** The class $\mathsf{Q}$ defined as in Definition 2 satisfies condition (A1).

*Proof.* See Naesseth et al. (2015).

A direct, and important, consequence of Theorem 2 is that NSMC can be used as a component of powerful learning algorithms, such as the particle Markov chain Monte Carlo (PMCMC) method (Andrieu et al., 2010) and many of the other methods discussed in Section 5. Since standard SMC is a special case of NSMC, Theorem 2 implies proper weighting also of SMC.

## 5. Practicalities and Related Work

There has been much recent interest in using SMC within SMC in various ways. The SMC² method by Chopin et al. (2013) and the recent method by Crisan & Míguez (2013) are sequential learning algorithms for state space models, where one SMC sampler for the parameters is coupled with another SMC sampler for the latent states. Johansen et al. (2012) and Chen et al. (2011) address the state inference problem by splitting the state variable into different components and running coupled SMC samplers for these components. These methods differ substantially from NSMC; they solve different problems and the internal SMC samplers are constructed in a different way (for approximate marginalisation rather than for approximate simulation). Another related method is the random-weight PF of Fearnhead et al. (2010a), which requires exact samples from $q$ and estimates the importance weights using a nested Monte Carlo algorithm.

The method most closely related to NSMC is the space-time particle filter (ST-PF) (Beskos et al., 2014a), which has been developed independently of, and in parallel with, our work. The ST-PF is also designed for solving inference problems in high-dimensional models. It can be seen as an island PF (Vergé et al., 2013) implementation of the method presented by Naesseth et al. (2014b). Specifically, for a spatio-temporal model they run an island PF over both the spatial and temporal dimensions. However, the ST-PF does not generate an approximation of the fully adapted SMC sampler. Another key distinction between NSMC and the ST-PF is that in the latter each particle in the outer SMC sampler comprises a complete particle system from the inner SMC sampler. For NSMC, on the other hand, the particles simply correspond to different hypotheses about the latent variables (as in standard SMC), regardless of how many samplers are nested. This is a key feature of NSMC, since it implies that the algorithm is easily distributed over the particles: the main computational effort of Algorithm 3 is the construction of $\{\mathsf{q}^j\}_{j=1}^N$ and the calls to the Simulate procedure, which can be done independently for each particle, as sketched below. This leads to improved memory efficiency and lower communication costs.
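A minimal sketch of this per-particle distribution, assuming a process pool on a single machine; `inner_worker` is a hypothetical stand-in for constructing $\mathsf{Q}(\bar q_k(\cdot \mid X_{1:k-1}^j), M)$ and calling GetZ and Simulate for one ancestor.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def inner_worker(args):
    """Hypothetical inner sampler for one ancestor j: returns (z_hat, x_k)."""
    seed, ancestor, M = args
    rng = np.random.default_rng(seed)     # independent random stream per particle
    x = rng.normal(ancestor, 1.0, M)      # placeholder for the inner Monte Carlo run
    return x.mean(), x[0]

if __name__ == "__main__":
    ancestors = np.zeros(8)               # placeholder ancestor states
    jobs = [(seed, a, 100) for seed, a in enumerate(ancestors)]
    with ProcessPoolExecutor() as pool:   # steps 2(a)-(b) and Simulate, in parallel
        z_hats, xs = zip(*pool.map(inner_worker, jobs))
```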
Furthermore, we have found (see Section 6) that NSMC can outperform the ST-PF even when run on a single machine with matched computational costs. Another strength of NSMC methods is their relative ease of implementation, which we demonstrate in Section 6.3: we use the framework to sample from what is essentially a cubic-grid Markov random field (MRF) model just by implementing three nested samplers, each with a target distribution defined on a simple chain.

There are also other SMC-based methods designed for high-dimensional problems, e.g. the block PF studied by Rebeschini & van Handel (2015), the location particle smoother by Briggs et al. (2013) and the PF-based methods reviewed in Djurić & Bugallo (2013). However, these methods are all inconsistent, as they are based on various approximations that result in systematic errors. The previously mentioned PMCMC (Andrieu et al., 2010) is a related method, where SMC is used as a component of an MCMC algorithm. We make use of a very similar extended-space approach to motivate the validity of our algorithm. Note that our proposed algorithm can be used as a component in PMCMC and in most of the other algorithms mentioned above, which further increases the scope of models it can handle.

## 6. Experimental Results

We illustrate NSMC on three high-dimensional examples, with both real and synthetic data. We compare NSMC with the standard (bootstrap) PF and the ST-PF of Beskos et al. (2014a), with equal computational budgets on a single machine (i.e. neglecting the fact that NSMC is more easily distributed). These methods are, to the best of our knowledge, the only other available consistent online methods for full Bayesian inference in general sequential models. For more detailed explanations of the models and additional results, see Naesseth et al. (2015).²

²Code available at https://github.com/can-cs/nestedsmc

### 6.1. Gaussian State Space Model

We start by considering a high-dimensional Gaussian state space model, where we have access to the true solution from the Kalman filter (Kalman, 1960). The latent variables and measurements $\{X_{1:k}, Y_{1:k}\}$, with $\{X_k, Y_k\} = \{X_{k,l}, Y_{k,l}\}_{l=1}^d$, are modelled by a $d \times k$ lattice Gaussian MRF, which can be identified with a linear Gaussian state space model (see Naesseth et al. (2015)).

We run a 2-level NSMC sampler. The outer level is fully adapted, i.e. the proposal distribution is $q_k = p(x_k \mid x_{k-1}, y_k)$, which thus constitutes the target distribution for the inner level. To generate properly weighted samples from $q_k$, we use a bootstrap PF operating on the $d$ components of the vector $x_k$. Note that we only use bootstrap proposals where the actual sampling takes place, and that the conditional distribution $p(x_k \mid x_{k-1}, y_k)$ is not explicitly used. We simulate data from this model for $k = 1, \ldots, 100$ for different values of $d = \dim(x_k) \in \{50, 100, 200\}$. The exact filtering marginals are computed using the Kalman filter. We compare with both the ST-PF and the standard (bootstrap) PF. The results are evaluated based on the effective sample size (ESS; see e.g. Fearnhead et al. (2010b)), defined as

$$\mathrm{ESS}(x_{k,l}) = \frac{\sigma_{k,l}^2}{\mathbb{E}\big[(\hat x_{k,l} - \mu_{k,l})^2\big]}, \qquad (4)$$

where $\hat x_{k,l}$ denotes the mean estimate, and $\mu_{k,l}$ and $\sigma_{k,l}^2$ denote the true mean and variance of $x_{k,l} \mid y_{1:k}$ obtained from the Kalman filter. The expectation in (4) is approximated by averaging over 100 independent runs of the involved algorithms. The ESS reflects the estimator accuracy, as is clear from the definition, which is tightly related to the mean squared error. Intuitively, the ESS corresponds to the equivalent number of i.i.d. samples needed for the same accuracy.
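A short sketch of how (4) can be read operationally: given the exact Kalman mean and variance per component and the filter means from $R$ independent runs, the ESS is the component-wise ratio of the true posterior variance to the Monte Carlo mean squared error. All names are illustrative.

```python
import numpy as np

def ess(estimates, mu, sigma2):
    """estimates: (R, d) array of filter means from R independent runs;
    mu, sigma2: exact posterior mean and variance per component (Kalman
    filter). Returns the ESS (4) per component, i.e. the equivalent number
    of i.i.d. samples that would give the same mean squared error."""
    mse = np.mean((estimates - mu) ** 2, axis=0)   # Monte Carlo estimate of E[(xhat - mu)^2]
    return sigma2 / mse
```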
We use $N = 500$ and $M = 2d$ for NSMC and match the computational time for the ST-PF and the bootstrap PF. We report the results in Figure 2. Note that the bootstrap PF is omitted for $d = 100, 200$ due to its poor performance already for $d = 50$ (which is to be expected). Each dimension $l = 1, \ldots, d$ provides us with a value of the ESS, so we present the median (lines) and 15–85% percentiles (shaded regions) in the first row of Figure 2.

*Figure 2. Median (over dimension) ESS (4) and 15–85% percentiles (shaded region) for $d \in \{50, 100, 200\}$, with time index $k = 1, \ldots, 100$ on the horizontal axis. The results are based on 100 independent runs for the Gaussian MRF with dimension $d$.*

We have conducted additional experiments with different model parameters and different choices of $N$ and $M$ (some additional results are given in Naesseth et al. (2015)). Overall the results seem to be in agreement with the ones presented here; however, the ST-PF seems to be more robust to the trade-off between $N$ and $M$. A rule of thumb for NSMC is to keep $N$ as high as possible while still maintaining a reasonable estimate of $Z_{q_k}$.

### 6.2. Non-Gaussian State Space Model

Next, we consider an example with a non-Gaussian SSM, borrowed from Beskos et al. (2014a), where the full details of the model are given. The transition probability $p(x_k \mid x_{k-1})$ is a localised Gaussian mixture and the measurement probability $p(y_k \mid x_k)$ is t-distributed. The model dimension is d = 1 024. Beskos et al. (2014a) report improvements for the ST-PF over both the bootstrap PF and the block PF of Rebeschini & van Handel (2015). We use $N = M = 100$ for both the ST-PF and NSMC (the special structure of this model implies that there is no significant computational overhead from running backward sampling), and the bootstrap PF is given N = 10 000. In Figure 3 we report the ESS (4), estimated according to Carpenter et al. (1999).

*Figure 3. Median ESS with 15–85% percentiles (shaded region) for the non-Gaussian SSM (NSMC, ST-PF and bootstrap PF), with time index $k = 1, \ldots, 100$ on the horizontal axis.*

The ESS for the bootstrap PF is close to 0, for the ST-PF around 1–2, and for NSMC slightly higher at 7–8. However, we note that all methods perform quite poorly on this model, and to obtain satisfactory results it would be necessary to use more particles.

### 6.3. Spatio-Temporal Model: Drought Detection

In this final example we study the problem of detecting droughts based on measured precipitation data (Jones & Harris, 2013) for different locations on Earth. We look at the situation in North America during the years 1901–1950 and in the Sahel region in Africa during the years 1950–2000, time frames that include the so-called Dust Bowl in the US during the 1930s (Schubert et al., 2004) and the decades-long drought in the Sahel region starting in the 1960s (Foley et al., 2003; Hoerling et al., 2006). We consider the spatio-temporal model defined by Fu et al. (2012) and compare with the results therein.
Each location in a region is modelled to be in either a normal state 0 or an abnormal state 1 (drought). Measurements are given by precipitation (in millimetres) for each location and year. At every time instance $k$ our latent structure is described by a rectangular 2D grid $X_k = \{X_{k,i,j}\}_{i=1,j=1}^{I,J}$; in essence this is the model showcased in Figure 1. Fu et al. (2012) consider the problem of finding the maximum a posteriori configuration, using a linear programming relaxation. We will instead compute an approximation of the full posterior filtering distribution $\pi_k(x_k) = p(x_k \mid y_{1:k})$.

The rectangular structure is used to instantiate an NSMC method that on the first level targets the full posterior filtering distribution, on the second level the columns, and on the third level the rows of $X_k$. Properly weighted samples are generated using a bootstrap PF for the third level. The structure of our NSMC method applied to this particular problem is illustrated in Figure 5.

*Figure 5. Illustration of the three-level NSMC, with second-level column targets $X_{k,1:2,1}$, $X_{k,1:2,2}$, $X_{k,1:2,3}$ shown for a small example.*

*Figure 4. Top: number of locations with estimated $p(x = 1) > \{0.5, 0.7, 0.9\}$ for the North America and Sahel regions. Bottom: estimate of $p(x_{t,i} = 1)$ for all sites in North America over the years 1939–1941. All results for $N = 100$, $N_1 \in \{30, 40\}$, $N_2 = 20$.*

Figure 4 gives the results for the regions that we consider. The first row shows the number of locations where the estimate of $p(x_{k,i,j} = 1)$ exceeds $\{0.5, 0.7, 0.9\}$, for both regions. These results seem to be in agreement with Fu et al. (2012, Figures 3 and 6). However, we also obtain an approximation of the full posterior and can visualise the uncertainty in our estimates, as illustrated by the three different levels of posterior probability for drought. In general, we obtain a rich sample diversity from the posterior distribution. However, for some problematic years the sampler degenerates, with the result that the three credibility levels all coincide. This is also visible in the second row of Figure 4, where we show the posterior estimates $p(x_{k,i,j} \mid y_{1:k})$ for the years 1939–1941, overlaid on the region of interest. Naturally, one way to improve the estimates is to run the sampler with a larger number of particles, which has been kept very low in this proof of concept.

We have shown that a straightforward NSMC implementation with fairly few particles can attain reasonable approximations to the filtering problem for dimensions in the order of hundreds, or even thousands. This means that NSMC methods take the SMC framework an important step closer to being viable for high-dimensional statistical inference problems. However, NSMC is not a silver bullet for solving high-dimensional inference problems, and the approximation accuracy will be highly model dependent. Hence, much work remains to be done, for instance on combining NSMC with other techniques for high-dimensional inference, such as localisation (Rebeschini & van Handel, 2015) and annealing (Beskos et al., 2014b), in order to solve even more challenging problems.
## Acknowledgments

This work was supported by the projects *Learning of complex dynamical systems* (contract number: 637-2014-466) and *Probabilistic modeling of dynamical systems* (contract number: 621-2013-5524), both funded by the Swedish Research Council.

## References

Andrieu, C. and Roberts, G. O. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.

Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

Beskos, A., Crisan, D., Jasra, A., Kamatani, K., and Zhou, Y. A stable particle filter in high-dimensions. arXiv:1412.3501, December 2014a.

Beskos, A., Crisan, D., and Jasra, A. On the stability of sequential Monte Carlo methods in high dimensions. Annals of Applied Probability, 24(4):1396–1445, 2014b.

Bickel, P., Li, B., and Bengtsson, T. Sharp failure rates for the bootstrap particle filter in high dimensions. Volume 3 of IMS Collections, pp. 318–329. Institute of Mathematical Statistics, Beachwood, OH, USA, 2008.

Briggs, J., Dowd, M., and Meyer, R. Data assimilation for large-scale spatio-temporal systems using a location particle smoother. Environmetrics, 24(2):81–97, 2013.

Cappé, O., Moulines, E., and Rydén, T. Inference in Hidden Markov Models. Springer-Verlag, New York, NY, USA, 2005. ISBN 0387402640.

Carpenter, J., Clifford, P., and Fearnhead, P. Improved particle filter for nonlinear problems. IEE Proceedings: Radar, Sonar and Navigation, 146(1):2–7, 1999.

Chen, T., Schön, T. B., Ohlsson, H., and Ljung, L. Decentralized particle filter with arbitrary state decomposition. IEEE Transactions on Signal Processing, 59(2):465–478, February 2011.

Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. SMC²: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.

Cohen, J. Bioinformatics: an introduction for computer scientists. ACM Computing Surveys, 36(2):122–158, 2004.

Cressie, N. and Wikle, C. K. Statistics for Spatio-Temporal Data. Wiley, 2011.

Crisan, D. and Míguez, J. Nested particle filters for online parameter estimation in discrete-time state-space Markov models. arXiv:1308.1883, August 2013.

Del Moral, P. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004.

Djurić, P. M. and Bugallo, M. F. Particle filtering for high-dimensional systems. In 2013 IEEE 5th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 352–355. IEEE, 2013.

Doucet, A. and Johansen, A. M. A tutorial on particle filtering and smoothing: Fifteen years later. In Crisan, D. and Rozovsky, B. (eds.), Nonlinear Filtering Handbook. Oxford University Press, 2011.

Doucet, A., De Freitas, N., and Gordon, N. An introduction to sequential Monte Carlo methods. Springer, 2001.

Fearnhead, P., Papaspiliopoulos, O., Roberts, G. O., and Stuart, A. Random-weight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):497–512, 2010a.

Fearnhead, P., Wyncoll, D., and Tawn, J. A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464, 2010b.
Foley, J. A., Coe, M. T., Scheffer, M., and Wang, G. Regime shifts in the Sahara and Sahel: Interactions between ecological and climatic systems in northern Africa. Ecosystems, 6:524–539, 2003.

Fu, Q., Banerjee, A., Liess, S., and Snyder, P. K. Drought detection of the last century: An MRF-based approach. In Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 24–34, Anaheim, CA, USA, April 2012.

Godsill, S. J., Doucet, A., and West, M. Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465):156–168, March 2004.

Gordon, N. J., Salmond, D. J., and Smith, A. F. M. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F: Radar and Signal Processing, 140(2):107–113, April 1993.

Hoerling, M., Hurrell, J., Eischeid, J., and Phillips, A. Detection and attribution of twentieth-century northern and southern African rainfall change. Journal of Climate, 19:3989–4008, 2006.

Johansen, A. M., Whiteley, N., and Doucet, A. Exact approximation of Rao-Blackwellised particle filters. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), pp. 488–493, Brussels, Belgium, 2012.

Jones, P. D. and Harris, I. CRU TS3.21: Climatic Research Unit (CRU) time-series (TS) version 3.21 of high resolution gridded data of month-by-month variation in climate (Jan. 1901–Dec. 2012). NCAS British Atmospheric Data Centre, September 2013. URL http://dx.doi.org/10.5285/D0E1585D-3417-485F-87AE-4FCECF10A992.

Kalman, R. E. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82:35–45, 1960.

Lindsten, F. and Schön, T. B. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1–143, 2013.

Liu, J. S. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2001.

Monteleoni, C., Schmidt, G. A., Alexander, F., Niculescu-Mizil, A., Steinhaeuser, K., Tippett, M., Banerjee, A., Blumenthal, M. B., Ganguly, A. R., Smerdon, J. E., and Tedesco, M. Climate informatics. In Yu, T., Chawla, N., and Simoff, S. (eds.), Computational Intelligent Data Analysis for Sustainable Development. Chapman and Hall/CRC, London, 2013.

Naesseth, C. A., Lindsten, F., and Schön, T. B. Capacity estimation of two-dimensional channels using sequential Monte Carlo. In The 2014 IEEE Information Theory Workshop (ITW), pp. 431–435, November 2014a.

Naesseth, C. A., Lindsten, F., and Schön, T. B. Sequential Monte Carlo for graphical models. In Advances in Neural Information Processing Systems 27, pp. 1862–1870. Curran Associates, Inc., 2014b.

Naesseth, C. A., Lindsten, F., and Schön, T. B. Nested sequential Monte Carlo methods. arXiv:1502.02536, 2015.

Pitt, M. K. and Shephard, N. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.

Rebeschini, P. and van Handel, R. Can local particle filters beat the curse of dimensionality? Annals of Applied Probability (to appear), 2015.

Rue, H. and Held, L. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC, Boca Raton, FL, USA, 2005.

Schubert, S. D., Suarez, M. J., Pegion, P. J., Koster, R. D., and Bacmeister, J. T. On the cause of the 1930s Dust Bowl. Science, 303:1855–1859, 2004.

Shumway, R. H. and Stoffer, D. S. Time Series Analysis and Its Applications: With R Examples. Springer Texts in Statistics. Springer, New York, NY, USA, third edition, 2011.
Vergé, C., Dubarry, C., Del Moral, P., and Moulines, E. On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing, pp. 1–18, 2013.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wikle, C. K. Modern perspectives on statistics for spatio-temporal data. WIREs Computational Statistics, 7(1):86–98, 2015.