# Replica Conditional Sequential Monte Carlo

Alexander Y. Shestopaloff 1 2   Arnaud Doucet 3 2

1 School of Mathematics, University of Edinburgh, Edinburgh, UK. 2 The Alan Turing Institute, London, UK. 3 Department of Statistics, University of Oxford, Oxford, UK. Correspondence to: Alexander Y. Shestopaloff.

Abstract

We propose a Markov chain Monte Carlo (MCMC) scheme to perform state inference in non-linear non-Gaussian state-space models. Current state-of-the-art methods for this problem rely on particle MCMC techniques and their variants, such as the iterated conditional Sequential Monte Carlo (cSMC) scheme, which uses a Sequential Monte Carlo (SMC) type proposal within MCMC. A deficiency of standard SMC proposals is that they only use observations up to time t to propose states at time t, even when an entire observation sequence is available. More sophisticated SMC proposals based on lookahead techniques could be used, but they can be difficult to put into practice. We propose here replica cSMC, where we build SMC proposals for one replica using information from the entire observation sequence by conditioning on the states of the other replicas. This approach is easily parallelizable and we demonstrate its excellent empirical performance when compared to the standard iterated cSMC scheme at fixed computational complexity.

1. Introduction

We consider discrete-time state-space models. They can be described by a latent Markov process $(X_t)_{t \geq 1}$ and an observation process $(Y_t)_{t \geq 1}$, with $(X_t, Y_t)$ taking values in $\mathcal{X} \times \mathcal{Y}$, which satisfy

$$X_1 \sim \mu(\cdot), \qquad X_{t+1} \mid \{X_t = x\} \sim f(\cdot \mid x), \qquad Y_t \mid \{X_t = x\} \sim g(\cdot \mid x) \tag{1}$$

for $t \geq 1$. Our goal is to sample from the posterior distribution of the latent states $X_{1:T} := (X_1, \dots, X_T)$ given a realization of the observations $Y_{1:T} = y_{1:T}$. This distribution admits a density given by

$$p(x_{1:T} \mid y_{1:T}) \propto \mu(x_1) g(y_1 \mid x_1) \prod_{t=2}^{T} f(x_t \mid x_{t-1}) g(y_t \mid x_t). \tag{2}$$

This sampling problem is now commonly addressed using an MCMC scheme known as the iterated cSMC sampler (Andrieu et al., 2010) and extensions of it; see, e.g., (Shestopaloff & Neal, 2018). This algorithm relies on an SMC-type proposal mechanism. A limitation of these algorithms is that they typically use data only up to time t to propose candidate states at time t, whereas the entire sequence $y_{1:T}$ is observed in the context we are interested in. To address these issues, various lookahead techniques have been proposed in the SMC literature; see (Chen et al., 2013) for a review. Alternative approaches relying on a parametric approximation of the backward information filter used for smoothing in state-space models (Briers et al., 2010) have also been proposed recently (Scharth & Kohn, 2016; Guarniero et al., 2017; Ruiz & Kappen, 2017; Heng et al., 2017). When applicable, these iterative methods have demonstrated good performance. However, it is unclear how these ideas could be adapted to the MCMC framework investigated here. Additionally, these methods are difficult to put into practice for multimodal posterior distributions.

In this paper, we propose a novel approach to building cSMC proposals that take all of the observed data into account, based on conditioning on replicas of the state variables. Our approach is based purely on Monte Carlo sampling, bypassing any need for approximating functions in the estimate of the backward information filter.
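Before proceeding, the following minimal Python sketch spells out the unnormalised log density (2) that all the samplers below target. The callables `log_mu`, `log_f` and `log_g` are placeholders for the model's log prior, transition and observation densities; they are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def log_posterior(x, y, log_mu, log_f, log_g):
    """Unnormalised log p(x_{1:T} | y_{1:T}) as in (2).
    x, y: arrays of shape (T, dim_x) and (T, dim_y)."""
    lp = log_mu(x[0]) + log_g(y[0], x[0])
    for t in range(1, len(x)):
        lp += log_f(x[t], x[t - 1]) + log_g(y[t], x[t])
    return lp
```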
The rest of this paper is organized as follows. In Section 2, we review the iterated cSMC algorithm and outline its limitations. Section 3 introduces the replica iterated cSMC methodology. In Section 4, we demonstrate the methodology on a linear Gaussian model, two non-Gaussian state-space models from (Shestopaloff & Neal, 2018), as well as the Lorenz-96 model from (Heng et al., 2017).

2. Iterated cSMC

The iterated cSMC sampler is an MCMC method for sampling from a target distribution of density $\pi(x_{1:T}) := \pi_T(x_{1:T})$. It relies on a modified SMC scheme targeting a sequence of auxiliary target probability densities $\{\pi_t(x_{1:t})\}_{t=1,\dots,T-1}$ and a sequence of proposal densities $q_1(x_1)$ and $q_t(x_t \mid x_{t-1})$ for $t \in \{2, \dots, T\}$. These target densities are such that $\pi_t(x_{1:t})/\pi_{t-1}(x_{1:t-1}) \propto \beta_t(x_{t-1}, x_t)$.

2.1. Algorithm

We define the incremental importance weights for $t \geq 2$ as

$$w_t(x_{t-1}, x_t) := \frac{\pi_t(x_{1:t})}{\pi_{t-1}(x_{1:t-1}) \, q_t(x_t \mid x_{t-1})} \propto \frac{\beta_t(x_{t-1}, x_t)}{q_t(x_t \mid x_{t-1})} \tag{3}$$

and for $t = 1$ as

$$w_1(x_0, x_1) := \frac{\pi_1(x_1)}{q_1(x_1)}. \tag{4}$$

We introduce a dummy variable $x_0$ to simplify notation. We let $N \geq 2$ be the number of particles used by the algorithm and $[N] := \{1, \dots, N\}$. We introduce the notation $\mathbf{x}_t = (x^1_t, \dots, x^N_t) \in \mathcal{X}^N$, $\mathbf{a}_t = (a^1_t, \dots, a^N_t) \in \{1, \dots, N\}^N$, $\mathbf{x}_{1:T} = (\mathbf{x}_1, \dots, \mathbf{x}_T)$, $\mathbf{a}_{1:T-1} = (\mathbf{a}_1, \dots, \mathbf{a}_{T-1})$, $x^{-b_t}_t = \mathbf{x}_t \setminus x^{b_t}_t$, $x^{-b_{1:T}}_{1:T} = (x^{-b_1}_1, \dots, x^{-b_T}_T)$, $a^{-b_t}_{t-1} = \mathbf{a}_{t-1} \setminus a^{b_t}_{t-1}$, $a^{-b_{2:T}}_{1:T-1} = (a^{-b_2}_1, \dots, a^{-b_T}_{T-1})$, and set $b_t = a^{b_{t+1}}_t$ for $t = 1, \dots, T-1$.

Algorithm 1: Iterated cSMC kernel $K(x_{1:T}, x'_{1:T})$

cSMC step.
1. At time t = 1:
   (a) Sample $b_1$ uniformly on $[N]$ and set $x^{b_1}_1 = x_1$.
   (b) For $i \in [N] \setminus \{b_1\}$, sample $x^i_1 \sim q_1(\cdot)$.
   (c) Compute $w_1(x_0, x^i_1)$ for $i \in [N]$.
2. At times t = 2, ..., T:
   (a) Sample $b_t$ uniformly on $[N]$ and set $x^{b_t}_t = x_t$.
   (b) For $i \in [N] \setminus \{b_t\}$, sample $a^i_{t-1} \sim \mathrm{Cat}\{w_{t-1}(x^{a^j_{t-2}}_{t-2}, x^j_{t-1}); j \in [N]\}$.
   (c) For $i \in [N] \setminus \{b_t\}$, sample $x^i_t \sim q_t(\cdot \mid x^{a^i_{t-1}}_{t-1})$.
   (d) Compute $w_t(x^{a^i_{t-1}}_{t-1}, x^i_t)$ for $i \in [N]$.

Backward sampling step.
1. At time t = T:
   (a) Sample $b_T \sim \mathrm{Cat}\{w_T(x^{a^j_{T-1}}_{T-1}, x^j_T); j \in [N]\}$.
2. At times t = T-1, ..., 1:
   (a) Sample $b_t \sim \mathrm{Cat}\{\beta_{t+1}(x^j_t, x^{b_{t+1}}_{t+1}) \, w_t(x^{a^j_{t-1}}_{t-1}, x^j_t); j \in [N]\}$.

Output $x'_{1:T} = x^{b_{1:T}}_{1:T} := (x^{b_1}_1, \dots, x^{b_T}_T)$.

It can be shown that the iterated cSMC kernel, described in Algorithm 1, is invariant w.r.t. $\pi(x_{1:T})$. Given the current state $x_{1:T}$, the cSMC step introduced in (Andrieu et al., 2010) samples from the following distribution

$$\Phi(x^{-b_{1:T}}_{1:T}, a^{-b_{2:T}}_{1:T-1} \mid x^{b_{1:T}}_{1:T}, b_{1:T}) = \delta_{x_{1:T}}(x^{b_{1:T}}_{1:T}) \prod_{i=1, i \neq b_1}^{N} q_1(x^i_1) \prod_{t=2}^{T} \prod_{i=1, i \neq b_t}^{N} \lambda(a^i_{t-1}, x^i_t \mid \mathbf{x}_{t-1}), \tag{5}$$

where

$$\lambda(a^i_{t-1} = k, x^i_t \mid \mathbf{x}_{t-1}) = \frac{w_{t-1}(x^{a^k_{t-2}}_{t-2}, x^k_{t-1})}{\sum_{j=1}^{N} w_{t-1}(x^{a^j_{t-2}}_{t-2}, x^j_{t-1})} \, q_t(x^i_t \mid x^k_{t-1}). \tag{6}$$

This can be combined with a backward sampling step introduced in (Whiteley, 2010); see (Finke et al., 2016; Shestopaloff & Neal, 2018) for a detailed derivation. It can be shown that the combination of these two steps defines a Markov kernel that leaves invariant the following extended target distribution

$$\gamma(\mathbf{x}_{1:T}, a^{-b_{2:T}}_{1:T-1}, b_{1:T}) := \frac{\pi(x^{b_{1:T}}_{1:T})}{N^T} \, \Phi(x^{-b_{1:T}}_{1:T}, a^{-b_{2:T}}_{1:T-1} \mid x^{b_{1:T}}_{1:T}, b_{1:T}). \tag{7}$$

In particular, it follows that if $x_{1:T} \sim \pi$ then $x'_{1:T} \sim \pi$. The algorithm is described in Algorithm 1, where we use the notation $\mathrm{Cat}\{c_i; i \in [N]\}$ to denote the categorical distribution with probabilities $p_i \propto c_i$. Iterated cSMC has been widely adopted for state-space models, i.e. when the target is $\pi(x_{1:T}) = p(x_{1:T} \mid y_{1:T})$.
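To make the steps of Algorithm 1 concrete, here is a minimal Python sketch of one sweep of the kernel (cSMC step followed by backward sampling). The callables `sample_q`, `log_w` and `log_beta` are illustrative assumptions, not part of the paper; weights are handled in the log domain and time is 0-indexed.

```python
import numpy as np

def iterated_csmc_kernel(x_ref, N, sample_q, log_w, log_beta, rng):
    """One application of the iterated cSMC kernel of Algorithm 1 (sketch).
    x_ref    : (T, d) array, the conditioning trajectory x_{1:T}.
    sample_q : sample_q(t, x_prev) draws from q_t(. | x_prev); x_prev is None at t = 0.
    log_w    : log_w(t, x_prev, x) returns log w_t(x_{t-1}, x_t).
    log_beta : log_beta(t, x, x_next) returns log beta_{t+1}(x_t, x_{t+1})."""
    T = x_ref.shape[0]
    X = np.empty((T, N) + x_ref.shape[1:])   # particles
    A = np.zeros((T, N), dtype=int)          # ancestor indices
    logW = np.empty((T, N))

    def categorical(logp):
        p = np.exp(logp - logp.max())
        return rng.choice(N, p=p / p.sum())

    # cSMC step, time 1: place the reference state in a uniformly chosen slot
    b_prev = rng.integers(N)
    for i in range(N):
        X[0, i] = x_ref[0] if i == b_prev else sample_q(0, None)
        logW[0, i] = log_w(0, None, X[0, i])

    # cSMC step, times 2, ..., T
    for t in range(1, T):
        b = rng.integers(N)
        for i in range(N):
            if i == b:
                X[t, i], A[t, i] = x_ref[t], b_prev   # reference keeps its ancestor
            else:
                A[t, i] = categorical(logW[t - 1])    # resample an ancestor
                X[t, i] = sample_q(t, X[t - 1, A[t, i]])
            logW[t, i] = log_w(t, X[t - 1, A[t, i]], X[t, i])
        b_prev = b

    # backward sampling step
    x_new = np.empty_like(x_ref)
    j = categorical(logW[T - 1])
    x_new[T - 1] = X[T - 1, j]
    for t in range(T - 2, -1, -1):
        logp = logW[t] + np.array([log_beta(t, X[t, i], x_new[t + 1]) for i in range(N)])
        j = categorical(logp)
        x_new[t] = X[t, j]
    return x_new
```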
The default sequence of auxiliary targets is $\pi_t(x_{1:t}) = p(x_{1:t} \mid y_{1:t})$ for $t = 1, \dots, T-1$, resulting in the incremental importance weights

$$w_t(x_{t-1}, x_t) \propto \frac{f(x_t \mid x_{t-1}) \, g(y_t \mid x_t)}{q_t(x_t \mid x_{t-1})} \tag{8}$$

for $t \geq 2$ and $w_1(x_0, x_1) \propto \mu(x_1) g(y_1 \mid x_1)/q_1(x_1)$ for $t = 1$. Typically we will attempt to select a proposal which minimizes the variance of the incremental weight, which at time $t \geq 2$ is

$$q^{\mathrm{opt}}_t(x_t \mid x_{t-1}) = p(x_t \mid x_{t-1}, y_t) \propto g(y_t \mid x_t) f(x_t \mid x_{t-1}),$$

or an approximation of it.

2.2. Limitations of Iterated cSMC

When using the default sequence of auxiliary targets for state-space models, iterated cSMC does not exploit a key feature of the problem at hand. The cSMC step typically uses a proposal at time t that only relies on the observation $y_t$, i.e. $q_t(x_t \mid x_{t-1}) = p(x_t \mid x_{t-1}, y_t)$, as it targets at time t the posterior density $p(x_{1:t} \mid y_{1:t})$. In high dimensions and/or in the presence of highly informative observations, the discrepancy between successive posterior densities $\{p(x_{1:t} \mid y_{1:t})\}_{t \geq 1}$ will be high. Consequently, the resulting importance weights $\{w_t(x^{a^i_{t-1}}_{t-1}, x^i_t); i \in [N]\}$ will have high variance and the resulting procedure will be inefficient.

Ideally one would like to use the sequence of marginal smoothing densities as auxiliary densities, that is $\pi_t(x_{1:t}) = p(x_{1:t} \mid y_{1:T})$ for $t = 1, \dots, T-1$. Unfortunately, this is not possible as $p(x_{1:t} \mid y_{1:T}) \propto p(x_{1:t} \mid y_{1:t-1}) \, p(y_{t:T} \mid x_t)$ cannot be evaluated pointwise up to a normalizing constant. To address this problem in a standard SMC framework, recent contributions (Scharth & Kohn, 2016; Guarniero et al., 2017; Ruiz & Kappen, 2017; Heng et al., 2017) build an analytical approximation $\hat{p}(y_{t:T} \mid x_t)$ of the backward information filter $p(y_{t:T} \mid x_t)$ based on an iterative particle mechanism and target instead $\{\hat{p}(x_{1:t} \mid y_{1:T})\}_{t \geq 1}$, where $\hat{p}(x_{1:t} \mid y_{1:T}) \propto p(x_{1:t} \mid y_{1:t-1}) \, \hat{p}(y_{t:T} \mid x_t)$, using proposals of the form $q_t(x_t \mid x_{t-1}) \propto f(x_t \mid x_{t-1}) \, \hat{p}(y_{t:T} \mid x_t)$. These methods can perform well, but they require a careful design of the analytical approximation and are difficult to put into practice for multimodal posteriors. Additionally, it is unclear how they could be adapted to an iterated cSMC framework without introducing any bias. Versions of iterated cSMC using an independent approximation to the backward information filter based on Particle Efficient Importance Sampling (Scharth & Kohn, 2016) have been proposed (Grothe et al., 2016), though they still require a choice of analytical approximation and use an approximation to the backward information filter which is global. This can become inefficient in high-dimensional state scenarios.

3. Replica Iterated cSMC

We introduce a way to use the iterated cSMC algorithm directly to target a sequence of approximations $\{\hat{p}(x_{1:t} \mid y_{1:T})\}_{t \geq 1}$ to the marginal smoothing densities of a state-space model. Our proposed method is based on sampling from a target over multiple copies of the space, as done in, for instance, the Parallel Tempering or Ensemble MCMC (Neal, 2010) approaches. However, unlike in these techniques, we use copies of the space to define a sequence of intermediate distributions in the cSMC step informed by the whole dataset. This enables us to draw samples of $X_{1:T}$ that incorporate information about all of the observed data. Related recent work includes (Leimkuhler et al., 2018), where information sharing amongst an ensemble of replicas is used to improve MCMC proposals.

3.1. Algorithm

We start by defining the replica target for some $K \geq 2$ by

$$\pi(x^{(1:K)}_{1:T}) = \prod_{k=1}^{K} p(x^{(k)}_{1:T} \mid y_{1:T}). \tag{10}$$
Each of the replicas $x^{(k)}_{1:T}$ is updated in turn by running Algorithm 1 with a different sequence of intermediate targets, which we describe here. Consider updating replica k and let $\hat{p}^{(k)}(y_{t+1:T} \mid x_t)$ be an estimator of the backward information filter, built using the replicas other than the k-th one, $x^{(-k)}_{t+1} = (x^{(1)}_{t+1}, \dots, x^{(k-1)}_{t+1}, x^{(k+1)}_{t+1}, \dots, x^{(K)}_{t+1})$. For convenience of notation, we take $\hat{p}^{(k)}(y_{T+1:T} \mid x_T) := 1$. At time t, the cSMC targets an approximation of the marginal smoothing distribution $p(x_{1:t} \mid y_{1:T})$, as in (Scharth & Kohn, 2016; Guarniero et al., 2017; Ruiz & Kappen, 2017; Heng et al., 2017). This approximation is of the form

$$\hat{p}^{(k)}(x_{1:t} \mid y_{1:T}) \propto p(x_{1:t} \mid y_{1:t}) \, \hat{p}^{(k)}(y_{t+1:T} \mid x_t).$$

This means that the cSMC for replica k uses the novel incremental weights at time $t \geq 2$

$$w^{(k)}_t(x_{t-1}, x_t) := \frac{\hat{p}^{(k)}(x_{1:t} \mid y_{1:T})}{\hat{p}^{(k)}(x_{1:t-1} \mid y_{1:T}) \, q_t(x_t \mid x_{t-1})} \propto \frac{g(y_t \mid x_t) f(x_t \mid x_{t-1}) \, \hat{p}^{(k)}(y_{t+1:T} \mid x_t)}{\hat{p}^{(k)}(y_{t:T} \mid x_{t-1}) \, q_t(x_t \mid x_{t-1})} \tag{11}$$

and $w^{(k)}_1(x_0, x_1) \propto g(y_1 \mid x_1) \mu(x_1) \hat{p}^{(k)}(y_{2:T} \mid x_1)/q_1(x_1)$. We would like to use the proposal minimizing the variance of the incremental weight, which at time $t \geq 2$ is

$$q^{\mathrm{opt}}_t(x_t \mid x_{t-1}) \propto g(y_t \mid x_t) f(x_t \mid x_{t-1}) \, \hat{p}^{(k)}(y_{t+1:T} \mid x_t),$$

or an approximation of it. The full replica cSMC update for $\pi$ is described in Algorithm 2 and is simply an application of Algorithm 1 to a sequence of target densities for each replica. A proof of the validity of the algorithm is provided in the Supplementary Material.

Algorithm 2: Replica cSMC update

For k = 1, ..., K:
1. Build an approximation $\hat{p}^{(k)}(y_{t+1:T} \mid x_t)$ of $p(y_{t+1:T} \mid x_t)$ using the replicas $(x^{(1)}_{t+1}, \dots, x^{(k-1)}_{t+1}, x^{(k+1)}_{t+1}, \dots, x^{(K)}_{t+1})$ for $t = 1, \dots, T-1$.
2. Run Algorithm 1 with target $\pi(x_{1:T}) = p(x_{1:T} \mid y_{1:T})$ and auxiliary targets $\pi_t(x_{1:t}) = \hat{p}^{(k)}(x_{1:t} \mid y_{1:T})$ for $t = 1, \dots, T-1$, with initial state $x^{(k)}_{1:T}$, to return $x^{(k)\prime}_{1:T}$.

Output $x^{(1:K)\prime}_{1:T}$.

One sensible way to initialize the replicas is to set them to sequences sampled from standard independent SMC passes. This will start the Markov chain not too far from equilibrium. For multimodal distributions, initialization is particularly crucial, since we need to ensure that different replicas are well distributed amongst the various modes at the start of the run.

3.2. Setup and Tuning

The replica cSMC sampler requires an estimator $\hat{p}^{(k)}(y_{t+1:T} \mid x_t)$ of the backward information filter based on $x^{(-k)}_{t+1}$. For our algorithm, we propose an estimator $\hat{p}^{(k)}(y_{t+1:T} \mid x_t)$ that is not based on any analytical approximation of $p(y_{t+1:T} \mid x_t)$ but simply on a Monte Carlo approximation built using the other replicas,

$$\hat{p}^{(k)}(y_{t+1:T} \mid x_t) \propto \sum_{j \neq k} \frac{f(x^{(j)}_{t+1} \mid x_t)}{p(x^{(j)}_{t+1} \mid y_{1:t})}, \tag{12}$$

where $p(x_{t+1} \mid y_{1:t})$ denotes the predictive density of $x_{t+1}$. The rationale for this approach is that at equilibrium the components of $x^{(-k)}_{t+1}$ are an i.i.d. sample from a product of K-1 copies of the smoothing density $p(x_{t+1} \mid y_{1:T})$. Therefore, as K increases, (12) converges to

$$\int \frac{f(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{1:t})} \, p(x_{t+1} \mid y_{1:T}) \, dx_{t+1} \propto \int f(x_{t+1} \mid x_t) \, p(y_{t+1:T} \mid x_{t+1}) \, dx_{t+1} = p(y_{t+1:T} \mid x_t). \tag{13}$$

In practice, the predictive density is also unknown and we need to use an approximation of it. Whatever approximation $\hat{p}(x_{t+1} \mid y_{1:t})$ of $p(x_{t+1} \mid y_{1:t})$ we use, the algorithm remains valid. We note that for K = 2, any approximation of the predictive density results in the same incremental importance weights.
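For concreteness, the following is a minimal Python sketch of how the Monte Carlo estimator (12) and the corresponding incremental weight (11) could be evaluated in the log domain. The callables `log_f`, `log_g`, `log_q` and the approximate log predictive densities are user-supplied assumptions of this sketch, not part of the paper.

```python
import numpy as np

def log_bif_hat(x_t, other_replicas, log_f, log_pred):
    """Log of the backward-information-filter estimate (12), up to a constant:
    log sum_j f(x^{(j)} | x_t) / p_hat(x^{(j)} | y_past)."""
    logs = np.array([log_f(xj, x_t) - log_pred(xj) for xj in other_replicas])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())     # log-sum-exp for stability

def log_replica_weight(x_prev, x_t, y_t, curr_replicas, next_replicas,
                       log_f, log_g, log_q, log_pred_curr, log_pred_next):
    """Log incremental weight (11) for replica cSMC at time t >= 2, up to a constant.
    curr_replicas / next_replicas are the other replicas' states at times t and t+1;
    pass an empty list for next_replicas at t = T, where p_hat(y_{T+1:T} | x_T) := 1."""
    log_num = log_g(y_t, x_t) + log_f(x_t, x_prev)
    if len(next_replicas) > 0:
        log_num += log_bif_hat(x_t, next_replicas, log_f, log_pred_next)
    log_den = log_q(x_t, x_prev) + log_bif_hat(x_prev, curr_replicas, log_f, log_pred_curr)
    return log_num - log_den
```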
We propose to approximate the predictive density in (13) by a constant over the entire latent space, i.e. $\hat{p}(x_{t+1} \mid y_{1:t}) = 1$. We justify this choice as follows. If we assume that we have informative observations, which is typical in many state-space modelling scenarios, then $p(x_{t+1} \mid y_{1:T})$ will tend to be much more concentrated than $p(x_{t+1} \mid y_{1:t})$. Thus, over the region where the posterior has high density, the predictive density will be approximately constant relative to the posterior density. This suggests approximating the predictive density in (13) by its mean with respect to the posterior density,

$$\int \frac{f(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{1:t})} \, p(x_{t+1} \mid y_{1:T}) \, dx_{t+1} \approx \frac{\int f(x_{t+1} \mid x_t) \, p(x_{t+1} \mid y_{1:T}) \, dx_{t+1}}{\int p(x_{t+1} \mid y_{1:t}) \, p(x_{t+1} \mid y_{1:T}) \, dx_{t+1}} \approx \frac{\frac{1}{K}\sum_{k=1}^{K} f(x^{(k)}_{t+1} \mid x_t)}{\frac{1}{K}\sum_{k=1}^{K} p(x^{(k)}_{t+1} \mid y_{1:t})}. \tag{14}$$

Since the importance weights in cSMC at each time are defined up to a constant, sampling is not affected by the specific value of $\frac{1}{K}\sum_{k=1}^{K} p(x^{(k)}_{t+1} \mid y_{1:t})$. Therefore, when doing the computation it can simply be set to any value, which is what we do. We note that while the asymptotic argument does not hold for the estimator in (14), when the variance of the predictive density is greater than the variance of the posterior density, we expect the estimators in (12) and (14) to be close for any finite K.

An additional benefit of approximating the predictive density by a constant is a reduction in the variance of the mixture weights in (12). To see why this can be the case, consider the following example. Suppose the predictive density of $x_{t+1}$ is $N(\mu, \sigma_0^2)$ and the posterior density is $N(0, \sigma_1^2)$, where $\sigma_1^2 < \sigma_0^2$. Computing the variance of the mixture weight $1/p(x_{t+1} \mid y_{1:t})$ under the posterior, we get

$$\mathrm{Var}\left[\frac{1}{p(x_{t+1} \mid y_{1:t})}\right] = 2\pi\sigma_0^2 \left[ \frac{\sigma_0}{\sqrt{\sigma_0^2 - 2\sigma_1^2}} \exp\!\left(\frac{\mu^2}{\sigma_0^2 - 2\sigma_1^2}\right) - \frac{\sigma_0^2}{\sigma_0^2 - \sigma_1^2} \exp\!\left(\frac{\mu^2}{\sigma_0^2 - \sigma_1^2}\right) \right],$$

which is finite provided $\sigma_0^2 > 2\sigma_1^2$. From this we can see that the variance increases exponentially with the squared difference of the predictive and posterior means, $\mu^2$. As a result, we can get outliers in the mixture weight distribution. If this happens, many of the replicas will end up having low weights in the mixture. This will reduce the effective number of replicas used. Using a constant approximation weights all of the replicas uniformly, and allows us to construct better proposals, as illustrated in Section 4.1. (A small numerical illustration of this effect is sketched at the end of this subsection.)

A natural extension of the proposed method is to update some of the replicas with updates other than replica cSMC. Samples from these replicas can then be used in estimates of the backward information filter when doing a replica cSMC update. This makes it possible to parallelize the method, at least to some extent. For instance, one possibility is to do parallel independent cSMC updates on some of the replicas.

Performing updates other than replica cSMC on some of the replicas can be useful in multimodal scenarios. If all replicas are located in an isolated mode, and the replica cSMC updates use an estimate of the backward information filter based on replicas in that mode, then the overall Markov chain will tend not to transition well to other modes. Using samples from other types of updates in the estimate of the backward information filter can help counteract this effect by making transitions to other high-density regions possible.
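As a quick numerical illustration of the variance argument above, the following sketch estimates the variance of the mixture weight $1/p(x_{t+1} \mid y_{1:t})$ by Monte Carlo. The values of mu, sigma0 and sigma1 are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma0, sigma1 = 2.0, 0.5                       # predictive and posterior std devs
x = rng.normal(0.0, sigma1, size=200_000)       # draws from the "posterior" N(0, sigma1^2)

for mu in [0.0, 1.0, 2.0, 3.0]:
    w = 1.0 / norm.pdf(x, loc=mu, scale=sigma0)  # mixture weight 1/p(x_{t+1} | y_{1:t})
    print(f"mu = {mu}: sample variance of 1/p = {w.var():.3e}")
# The variance grows rapidly (exponentially in mu^2, per the argument above), so when
# the predictive and posterior means differ, a few replicas dominate the mixture.
```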
4. Examples

We consider four models to illustrate the performance of our method. In all examples, we assume that the model parameters are known. The first is a simple linear Gaussian model. We use this model to demonstrate that it is sensible to use a constant approximation to the predictive density in our estimator of the backward information filter. We also use the linear Gaussian model to better understand the accuracy and performance of replica cSMC. The second model, from (Shestopaloff & Neal, 2018), demonstrates that our proposed replica cSMC method is competitive with existing state-of-the-art methods at drawing latent state sequences in a unimodal context. The third model, also from (Shestopaloff & Neal, 2018), demonstrates that by updating some replica coordinates with a standard iterated cSMC kernel, our method is able to efficiently handle multimodal sampling without the use of specialized flip updates. The fourth model is the Lorenz-96 model from (Heng et al., 2017), which has very low observation noise, making it a challenging case for standard iterated cSMC.

To do our computations, we used MATLAB on an OS X system, running on an Intel Core i5 1.3 GHz CPU. As a performance metric for the sampler, we used the autocorrelation time, which is a measure of approximately how many steps of an MCMC chain are required to obtain the equivalent of one independent sample. The autocorrelation time is estimated from a set of runs as follows. First, we estimate the overall mean using all of the runs. Then, we use this overall mean to estimate autocovariances for each of the runs. The autocovariance estimates are then averaged and used to estimate the autocorrelations $\hat{\rho}_k$. The autocorrelation time is then estimated as $1 + 2\sum_{m=1}^{M} \hat{\rho}_m$, where M is chosen such that for m > M the autocorrelations are approximately 0. The code to reproduce all the results is publicly available at https://github.com/ayshestopaloff/replicacsmc.

4.1. A Linear Gaussian Model

Let $X_t = (X_{1,t}, \dots, X_{d,t})$ for $t = 1, \dots, T$. The latent process for this model is defined as $X_1 \sim N(0, \Sigma_1)$ and $X_t \mid \{X_{t-1} = x_{t-1}\} \sim N(\Phi x_{t-1}, \Sigma)$ for $t = 2, \dots, T$, where $\Phi = \mathrm{diag}(\phi_1, \dots, \phi_d)$, $\Sigma$ has unit diagonal entries and off-diagonal entries equal to $\rho$, and $\Sigma_1$ has diagonal entries $\sigma^2_{1,i}$ and off-diagonal entries $\rho\,\sigma_{1,i}\sigma_{1,j}$, with $\sigma^2_{1,i} = 1/(1 - \phi_i^2)$ for $i = 1, \dots, d$. The observations are $Y_{i,t} \mid \{X_{i,t} = x_{i,t}\} \sim N(x_{i,t}, 1)$ for $i = 1, \dots, d$ and $t = 1, \dots, T$.

We set T = 250, d = 5 and the model's parameters to $\rho = 0.7$ and $\phi_i = 0.9$ for $i = 1, \dots, d$. We generate a sequence from this model to use for our experiments. Since this is a linear Gaussian model, we are able to compute the predictive density in (12) exactly using a Kalman filter. So for replica k, we can use the following importance densities,

$$q_1(x_1) \propto \mu(x_1) \sum_{j \neq k} \frac{f(x^{(j)}_2 \mid x_1)}{p(x^{(j)}_2 \mid y_1)}, \qquad q_t(x_t \mid x_{t-1}) \propto f(x_t \mid x_{t-1}) \sum_{j \neq k} \frac{f(x^{(j)}_{t+1} \mid x_t)}{p(x^{(j)}_{t+1} \mid y_{1:t})}, \qquad q_T(x_T \mid x_{T-1}) = f(x_T \mid x_{T-1}), \tag{17}$$

where $t = 2, \dots, T-1$. Since these densities are Gaussian mixtures, they can be sampled from exactly. However, as pointed out in the previous section, this approach can be inefficient. We will show experimentally that using a constant approximation to the predictive density in (12) actually improves performance.

In all experiments, we initialize all replicas to a sample from an independent SMC pass with the same number of particles as used for the cSMC updates. Also, the different runs in our experiments use different random number generator seeds.
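The following is a minimal sketch of simulating data from the linear Gaussian model above under the stated parameters (T = 250, d = 5, rho = 0.7, phi_i = 0.9); the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, rho = 250, 5, 0.7
phi = np.full(d, 0.9)

Phi = np.diag(phi)                                      # transition matrix
Sigma = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)   # innovation covariance: 1 on diag, rho off-diag
s1 = np.sqrt(1.0 / (1.0 - phi**2))                      # stationary std devs, sigma_{1,i}
Sigma1 = rho * np.outer(s1, s1)                         # initial covariance
np.fill_diagonal(Sigma1, s1**2)

x = np.empty((T, d))
x[0] = rng.multivariate_normal(np.zeros(d), Sigma1)
for t in range(1, T):
    x[t] = Phi @ x[t - 1] + rng.multivariate_normal(np.zeros(d), Sigma)
y = x + rng.normal(size=(T, d))                         # observations Y_{i,t} ~ N(x_{i,t}, 1)
```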
We first check that our replica method produces answers that agree with the posterior mean computed by a Kalman smoother. To do this, we do 10 replica cSMC runs with 100 particles and 2 replicas for 25,000 iterations, updating each replica conditional on the other. We then look at whether the posterior mean of $x_{i,t}$ computed using a Kalman smoother lies within two standard errors of the overall mean of the 10 replica cSMC runs. We find this happens for about 91.4% of the $x_{i,t}$. This indicates strong agreement between the answers obtained by replica cSMC and the Kalman smoother.

[Figure 1. Estimated autocorrelation times for each latent variable (x-axis: time t; y-axis: autocorrelation time). Panels: (a) replica cSMC, 2 replicas; (b) replica cSMC, 75 replicas; (c) replica cSMC, 75 replicas, constant approximation to predictive. Different coloured lines correspond to different latent state components.]

Next, we investigate the effect of using more replicas. To do this, we compare replica cSMC using 2 versus 75 replicas. We do 5 runs of each sampler. Both samplers use 100 particles and we do a total of 5,000 iterations per run. For the sampler using 75 replicas, we update replica 1 at every iteration and replicas 2 to 75 in sequence, one at every 20th iteration. For the sampler using 2 replicas, we update both replicas at every iteration. In both samplers, we update replica 1 with replica cSMC and the remaining replica(s) with iterated cSMC. After discarding 10% of each run as burn-in, we use all runs for a sampler to compute the autocorrelation time. We can clearly see in Figures 1a and 1b that using more replicas improves performance, before adjusting for computation time. We note that for this simple example, there is no benefit from using replica cSMC with a large number of replicas if we take into account computation time.

To check the performance of using the constant approximation versus the exact predictive density, we run replica cSMC with 75 replicas and the same settings as earlier, except using a constant approximation to the predictive density. Figure 1c shows that using a constant approximation to the predictive density results in better performance than when using the true predictive density. This is consistent with our discussion in Section 3.2.

[Figure 2. Estimated autocorrelation times for each latent variable (x-axis: time t, 0 to 1,500; y-axis: autocorrelation time). Different coloured lines correspond to different latent state components.]

The linear Gaussian model can also be used to demonstrate that, due to looking ahead, a fixed level of precision can be achieved with far fewer particles with replica cSMC than with standard iterated cSMC. In scenarios where the state is high-dimensional and the observations are informative, it is difficult to efficiently sample the variables $x_{i,1}$ with standard iterated cSMC using the initial density as the proposal. We do 20 runs of 2,500 iterations of both iterated cSMC with 700 particles and of replica cSMC with 35 particles and 2 replicas, with each replica updated given the other. We then use the runs to estimate the standard error of the overall mean over the 20 runs. For the variable $x_{1,1}$ sampled with iterated cSMC we estimate the standard error to be approximately 0.0111, whereas for replica cSMC the estimated standard error is a comparable 0.0081, achieved using only 5% of the particles.

Finally, we verify that the proposed method works well on longer time series by running it on the linear Gaussian model with the length of the observed sequence set to T = 1,500.
We use 2 replicas, each updated given the other, and do 5 runs of 5,000 iterations of the sampler to estimate the autocorrelation time for sampling the latent variables. In Figure 2 we can see that the replica cSMC method does not suffer from a decrease in performance when used on longer time series.

4.2. Two Poisson-Gaussian Models

In this example, we consider the two models from (Shestopaloff & Neal, 2018). Model 1 uses the same latent process as Section 4.1 with T = 250, d = 10 and $Y_{i,t} \mid \{X_{i,t} = x_{i,t}\} \sim \mathrm{Poisson}(\exp(c + \sigma x_{i,t}))$ for $i = 1, \dots, d$ and $t = 1, \dots, T$, where $c = 0.4$ and $\sigma = 0.6$. For Model 2, we again use the latent process of Section 4.1, with T = 500, d = 15 and $Y_{i,t} \mid \{X_{i,t} = x_{i,t}\} \sim \mathrm{Poisson}(\sigma |x_{i,t}|)$ for $i = 1, \dots, d$ and $t = 1, \dots, T$, where $\sigma = 0.8$. We assume the observations are independent given the latent states. We generate one sequence of observations from each model. A plot of the simulated data along dimension i = 1 is shown in Figure 3.

[Figure 3. Simulated data from the Poisson-Gaussian models (x-axis: time t). Panels: (a) data for Model 1; (b) data for Model 2.]

[Figure 4. Model 1. Estimated autocorrelation times for each latent variable, adjusted for computation time (x-axis: time t). Panels: (a) iterated cSMC+Metropolis; (b) replica cSMC. Different coloured lines correspond to different latent state components.]

We set the importance densities $q_t$ for the replica cSMC sampler to the same ones as in Section 4.1, with a constant approximation to the predictive density. We use replica cSMC with 5 replicas, updating each replica conditional on the others. We start with the sequences initialized to 0. We set the number of particles to 200. We do a total of 5 runs of the sampler with 5,000 iterations, each run with a different random number generator seed. Each iteration of replica cSMC takes approximately 0.80 seconds. We discard 10% of each run as burn-in. Plots of the autocorrelation time comparing replica cSMC to the best method in (Shestopaloff & Neal, 2018) for sampling each of the latent variables are shown in Figure 4. The benchmark method takes approximately 0.21 seconds per iteration. We can see that the proposed replica cSMC method performs relatively well when compared to their best method after adjusting for computation time. The figure for iterated cSMC+Metropolis was reproduced using code available with (Shestopaloff & Neal, 2018).

For Model 2, the challenge is to move between the many different modes of the latent state due to conditioning on $|x_{i,t}|$ in the observation density. The marginal posterior of $x_{i,t}$ has two modes and is symmetric around 0. Additional modes appear due to uncertainty in the signs of state components.

[Figure 5. Trace plots for Model 2. Panels: (a) trace plot for $x_{1,300}$; (b) trace plot for $x_{3,208}x_{4,208}$.]

We use a total of 50 replicas and update 49 of the 50 replicas with iterated cSMC and one replica with replica cSMC, as sketched below. This is done to prevent the Markov chain from being stuck in a single mode, while at the same time enabling the replica cSMC update to use an estimate of the backward information filter based on replicas that are distributed across the state space.
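The following is a minimal sketch of one sweep of this mixed update schedule. The two update callables are placeholders standing in for Algorithm 2 (step 2) and Algorithm 1; their signatures are assumptions of this sketch.

```python
def mixed_sweep(replicas, replica_csmc_update, iterated_csmc_update, rng):
    """One sweep: replica 0 gets a replica cSMC update built from the remaining
    replicas, which are themselves moved by independent iterated cSMC updates so
    that they can occupy separate modes of the target."""
    others = replicas[1:]
    replicas[0] = replica_csmc_update(replicas[0], others, rng)
    for k in range(1, len(replicas)):
        replicas[k] = iterated_csmc_update(replicas[k], rng)
    return replicas
```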
We initialize all replicas using sequences drawn from independent SMC passes with 1,000 particles, and run the sampler for a total of 2,000 iterations. Both the replica cSMC and iterated cSMC updates use 100 particles. In Figure 5 we plot every other sample of the same functions of the state as in (Shestopaloff & Neal, 2018) for the replica updated with replica cSMC. These are the coordinate $x_{1,300}$, with true value 1.99, and the product $x_{3,208}x_{4,208}$, with true value 4.45. The first has two well-separated modes and the second is ambiguous with respect to sign. We see that the sampler is able to explore different modes, without requiring any specialized flip updates or having to use a much larger number of particles, as is the case in (Shestopaloff & Neal, 2018).

We note that the replicas doing iterated cSMC updates tend to get stuck in separate modes for long periods of time, as expected. However, as long as these replicas are well distributed across the state space and eventually explore it, the bias in the estimate of the backward information filter will be low and will vanish asymptotically. The samples from the replica cSMC update will consequently be a good approximation to samples from the target density. Further improvement of the estimate of the backward information filter based on replicas in multimodal scenarios remains an open problem.

4.3. Lorenz-96 Model

Finally, we look at the Lorenz-96 model in a low-noise regime from (Heng et al., 2017). The state process for this model is the Itô process $\xi(s) = (\xi_1(s), \dots, \xi_d(s))$ defined as the weak solution of the stochastic differential equation (SDE)

$$d\xi_i = \left( -\xi_{i-1}\xi_{i-2} + \xi_{i-1}\xi_{i+1} - \xi_i + \alpha \right) ds + \sigma_f \, dB_i \tag{18}$$

for $i = 1, \dots, d$, where indices are computed modulo d, $\alpha$ is a forcing parameter, $\sigma_f^2$ is a noise parameter and $B(s) = (B_1(s), \dots, B_d(s))$ is a d-dimensional standard Brownian motion. The initial condition for the SDE is $\xi(0) \sim N(0, \sigma_f^2 I_d)$. We observe the process on a regular grid of size h > 0 as $Y_t \sim N(H\xi(th), R)$, where $t = 0, \dots, T$. We assume that the process is only partially observed, with $H_{ii} = 1$ for $i = 1, \dots, p$ and $H_{ij} = 0$ otherwise, where $p = d/2$.

We discretize the SDE (18) by numerically integrating the drift using a fourth-order Runge-Kutta scheme and adding Brownian increments. Let u be the mapping obtained by numerically integrating the drift of (18) on [0, h]. This discretization produces a state-space model with $X_1 \sim N(0, \sigma_f^2 I)$, $X_t \mid \{X_{t-1} = x_{t-1}\} \sim N(u(x_{t-1}), \sigma_f^2 h I)$ for $t = 2, \dots, T+1$ and $Y_t \mid \{X_t = x_t\} \sim N(Hx_t, R)$ for $t = 1, \dots, T+1$. We set d = 16, $\sigma_f^2 = 10^{-2}$, $R = 10^{-3} I_p$ and $\alpha = 4.8801$. The process is observed for 10 time units, which corresponds to h = 0.1, T = 100, and a step size of $10^{-2}$ for the Runge-Kutta scheme.

[Figure 6. Simulated data from the Lorenz-96 model along coordinate i = 1 (x-axis: time t, 0 to 100).]

We compare the performance of replica cSMC with two replicas, updating each replica conditional on the other, to an iterated cSMC scheme. For iterated cSMC, we use the model's initial density as $q_1$ and the model's transition density as $q_t$ for $t \geq 2$. For replica cSMC, we use the following importance densities for replica k,

$$q_1(x_1) \propto f(x_1) \sum_{j \neq k} \phi(x_1 \mid x^{(j)}_2), \qquad q_t(x_t \mid x_{t-1}) \propto f(x_t \mid x_{t-1}) \sum_{j \neq k} \phi(x_t \mid x^{(j)}_{t+1}), \qquad q_T(x_T \mid x_{T-1}) = f(x_T \mid x_{T-1}), \tag{19}$$

where $t = 2, \dots, T-1$ and $\phi$ is the p-dimensional Gaussian density with mean $Hu^{-1}(x^{(j)}_{t+1})$ and variance $\sigma_f^2 h I_p$; that is, the mean is computed by running the Runge-Kutta scheme backward in time starting at the replica state $x^{(j)}_{t+1}$.
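To make the discretisation concrete, the following sketch implements the Lorenz-96 drift and the Runge-Kutta map u of (18), together with a backward-in-time map as used for the proposal means in (19). The parameter values follow the stated setup (d = 16, alpha = 4.8801, h = 0.1, RK step size 1e-2); the number of inner steps per observation interval is an assumption of this sketch.

```python
import numpy as np

def drift(xi, alpha=4.8801):
    # Lorenz-96 drift: dxi_i/ds = (xi_{i+1} - xi_{i-2}) xi_{i-1} - xi_i + alpha, indices mod d
    return (np.roll(xi, -1) - np.roll(xi, 2)) * np.roll(xi, 1) - xi + alpha

def rk4_step(xi, dt, sign=1.0):
    # one fourth-order Runge-Kutta step; sign = -1 integrates the drift backward in time
    f = lambda z: sign * drift(z)
    k1 = f(xi); k2 = f(xi + 0.5 * dt * k1); k3 = f(xi + 0.5 * dt * k2); k4 = f(xi + dt * k3)
    return xi + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def u(xi, h=0.1, dt=1e-2):
    # forward map: integrate the drift over one observation interval [0, h]
    for _ in range(int(round(h / dt))):
        xi = rk4_step(xi, dt)
    return xi

def u_inv(xi, h=0.1, dt=1e-2):
    # backward map: run the same scheme with the drift negated (time reversal)
    for _ in range(int(round(h / dt))):
        xi = rk4_step(xi, dt, sign=-1.0)
    return xi
```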
[Figure 7. Lorenz-96 model. Comparison of standard cSMC and replica cSMC (x-axis: MCMC sample). Panels: (a) standard cSMC trace of $x_{1,45}$; (b) replica cSMC trace of $x_{1,45}$.]

We initialize the iterated cSMC sampler and each replica in the replica cSMC sampler with a sequence drawn from an independent SMC pass with 3,000 particles. We run replica cSMC with 200 particles for 30,000 iterations (0.7 seconds per iteration) and compare to standard iterated cSMC with 600 particles, which we also run for 30,000 iterations (0.7 seconds per iteration), thus making the computational time equal. Figure 7 shows the difference in performance of the two samplers through trace plots of $x_{1,45}$ (true value 0.23) from one of the runs, plotting the samples every 30th iteration. We can see that replica cSMC performs noticeably better than standard iterated cSMC.

5. Conclusion

We presented a novel sampler for latent sequences of a non-linear state-space model. Our proposed method leads to several questions. The first is whether there are other ways to estimate the predictive density that do not result in mixture weights with high variance. Another is to develop better guidelines on choosing the number of replicas to use in a given scenario. It would also be interesting to look at applications of replica cSMC in non-time-series examples. Finally, while the proposed method offers an approach for sampling in models with multimodal state distributions, further improvement is needed.

References

Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):269-342, 2010.

Briers, M., Doucet, A., and Maskell, S. Smoothing algorithms for state-space models. Annals of the Institute of Statistical Mathematics, 62(1):61-89, 2010.

Chen, R., Ming, L., and Liu, J. S. Lookahead strategies for sequential Monte Carlo. Statistical Science, 28(1):69-94, 2013.

Finke, A., Doucet, A., and Johansen, A. On embedded hidden Markov models and particle Markov chain Monte Carlo methods. Technical report, arXiv:1610.08962, 2016.

Grothe, O., Kleppe, T., and Liesenfeld, R. Bayesian analysis in non-linear non-Gaussian state-space models using particle Gibbs. Technical report, arXiv:1601.01125, 2016.

Guarniero, P., Johansen, A. M., and Lee, A. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636-1647, 2017.

Heng, J., Bishop, A., Deligiannidis, G., and Doucet, A. Controlled sequential Monte Carlo. Technical report, arXiv:1708.08396, 2017.

Leimkuhler, B., Matthews, C., and Weare, J. Ensemble preconditioning for Markov chain Monte Carlo simulation. Statistics and Computing, 28:277-290, 2018.

Neal, R. MCMC using ensembles of states for problems with fast and slow variables such as Gaussian process regression. Technical report, arXiv:1101.0387, 2010.

Ruiz, H. C. and Kappen, H. J. Particle smoothing for hidden diffusion processes: adaptive path integral smoother. IEEE Transactions on Signal Processing, 65(12):3191-3203, 2017.

Scharth, M. and Kohn, R. Particle efficient importance sampling. Journal of Econometrics, 190(1):133-147, 2016.

Shestopaloff, A. and Neal, R. Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method.
Bayesian Analysis, 13(3):797-822, 2018.

Whiteley, N. Discussion of Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):306-307, 2010.