# online_variational_sequential_monte_carlo__2aa88580.pdf Online Variational Sequential Monte Carlo Alessandro Mastrototaro 1 Jimmy Olsson 1 Being the most classical generative model for serial data, state-space models (SSM) are fundamental in AI and statistical machine learning. In SSM, any form of parameter learning or latent state inference typically involves the computation of complex latent-state posteriors. In this work, we build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference by combining particle methods and variational inference. While standard VSMC operates in the offline mode, by re-processing repeatedly a given batch of data, we distribute the approximation of the gradient of the VSMC surrogate ELBO in time using stochastic approximation, allowing for online learning in the presence of streams of data. This results in an algorithm, online VSMC, that is capable of performing efficiently, entirely on-the-fly, both parameter estimation and particle proposal adaptation. In addition, we provide rigorous theoretical results describing the algorithm s convergence properties as the number of data tends to infinity as well as numerical illustrations of its excellent convergence properties and usefulness also in batch-processing settings. 1. Introduction Being the most classical structured probabilistic generative model for serial data, state-space models (SSM), also known as general state-space hidden Markov models, are fundamental and ubiquitous in AI and statistical machine learning (Capp e et al., 2005; Bishop, 2016). In SSM, any form of parameter learning or state inference typically involves the computation of complex joint posterior distributions of latent-state variables the so-called joint-smoothing 1Department of Mathematics, KTH Royal Institute of Technology , Stockholm, Sweden. Correspondence to: Alessandro Mastrototaro . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). distributions given records of observations, which is a delicate problem whose analytical solution is intractable outside the limited cases of linear Gaussian models or models with finite state space. In a recent line of research (Le et al., 2018; Maddison et al., 2017; Naesseth et al., 2018), this problem is addressed by combining variational inference (Blei et al., 2017; Kingma & Welling, 2014) and sequential Monte Carlo (SMC) methods (Gordon et al., 1993; Doucet et al., 2001; Chopin & Papaspiliopoulos, 2020) in order to design flexible families of variational joint-state distributions in the form of particle-trajectory laws. By optimizing the Kullback Leibler divergence from (KLD) the law of the particle trajectories to the joint-smoothing distribution, this variational SMC (VSMC) approach is killing two birds with one stone by learning not only the unknown model parameters but also, in parallel, an optimal particle proposal kernel, the latter being a problem that has received a lot of attention in the literature (Doucet et al., 2000; Cornebise et al., 2008; Gu et al., 2015). The procedure can be viewed as a non-trivial extension of the importance-weighted autoencoder (IWAE) proposed by Burda et al. (2016), where standard self-normalized importance sampling has been replaced by sequential importance sampling with systematic resampling in order to obtain a tighter evidence lower bound (ELBO). 
The objective of VSMC is a similar ELBO whose gradient is approximated by the expectation, under the law of the random numbers generated by the SMC algorithm, of the gradient of the logarithm of the standard particle-based likelihood estimator, the latter being unbiased (Del Moral, 2004; Chopin & Papaspiliopoulos, 2020), is obtained as a by-product of the particle approximation of the marginalstate posterior or filter distribution flow; thus, whereas traditional particle-based inference in SSM typically relies on particle approximation of the joint-smoothing distributions (with aim of approximating, e.g., the intermediate quantity of the expectation maximization, EM, algorithm or the score function directly via the Fisher identity), which is cumbersome due to the so-called particle-path degeneracy phenomenon (Kitagawa, 1996; Capp e et al., 2005, Section 8.3), VSMC allows, using the reparameterization trick, the model parameters as well as an optimal particle proposal kernel to be simultaneously learned by processing repeatedly the given data batch using a standard particle filter. Online Variational Sequential Monte Carlo In its basic form, VSMC is an offline inference technique, in the sense that it requires the full data batch to be processed between every update of the model and variational parameters. Still, in a wide range of AI and machine-learning contexts, observed data become available sequentially through a data stream, requiring learning to be performed in an online fashion. The increasing interest in online machinelearning technology also stems from the need of processing data batches of ever-increasing sizes. Thus, in this work we propose an online, stochastic approximation-based version of VSMC, online variational SMC (OVSMC), which can be used for simultaneous online model learning and proposal adaptation. On the contrary to traditional approaches to particle-based online parameter learning in SSM such as particle-based recursive maximum-likelihood (RML, Le Gland & Mevel, 1997; Del Moral et al., 2015) or online EM (Mongillo & Den eve, 2008; Capp e, 2011), which rely on particle-based online approximation of the tangent filter (filter derivative) and the EM-intermediate quantity, respectively, our approach does not involve any particle smoothing, which, in order to avoid the problem of collapsing particle ancestral genealogies, typically calls for backward-sampling techniques which can be computationally very costly. In a recent work, Campbell et al. (2021) used variational approximations of the backward kernels for online state and parameter learning. Although this, interestingly, provides a non-particle-based methodology, it is very computationally intensive (see Section 5). In addition, our method allows for effective online adaptation of the particle proposal kernel in a way that differs from typical approaches in the literature; indeed, whereas traditional approaches aim to optimize the proposal time step by time step on the basis of local criteria (see, e.g., Cornebise et al., 2008; 2014; Zhao et al., 2022), our approach is based on a global KLD-based criterion (described in detail in Section 3) which allows the particle cloud to be effectively guided toward state-space regions of high local likelihood, without running the risk of over-adapting locally the proposal to current (or temporarily neighboring) observations. Although our proposed method has similarities with the streaming variational Monte Carlo methodology proposed by Zhao et al. 
(2022), it is essentially different from the same. An important difference is that the aforementioned work focuses on local optimization of model and variational parameters, such that these are assumed to vary with time and are therefore optimized time step by time step, while OVSMC estimate global, amortized parameters using stochastic approximation. In addition, in order to show that our approach is statistically well founded, we provide rigorous theoretical results describing its convergence (Theorem 4.7). Under certain strong mixing assumptions, the time-normalized gradient guiding the learning of batch VSMC can, using Birkhoff s ergodic theorem, be shown to converge as the observation batch size increases towards infinity to a deterministic function depending on the parameter as well as the particle sample size; we show that as the number of observations tends to infinity, OVSMC is solving the same problem as this ideal, asymptotic VSMC, in the sense that the mean field targeted by OVSMC coincides with the time-normalized asymptotic gradient of VSMC. Finally, we illustrate OVSMC numerically on a number of classical SSM and more complex generative models, for which the method exhibits fast parameter learning and efficient adaptation of the particle proposal kernel. In the same numerical study, we also show that OVSMC is a strong challenger of VSMC on batch problems. The paper is structured as follows. In Section 2, we review particle filtering in the context of SSM, and their relation with variational inference in some recent works. In Section 3, we present our methodology for online learning of proposal distributions and parameters and Section 4 and Section 5 provide our theoretical results and numerical experiments, respectively. 2. Background An SSM is a bivariate, time-homogeneous Markov chain (Xt, Yt)t N evolving on some general measurable product space (X Y, X Y). In most applications, (X, X) and (Y, Y) are Euclidean and furnished with the corresponding Borel σ-fields. More specifically, the marginal process (Xt)t N, referred to as the state process, is itself assumed to be a time-homogeneous Markov with transition density mθ(xt+1 | xt) and initial-state density m0(x0) (with respect to the same dominating measure dx, typically the Lebesgue measure) on X. The state process is latent but partially observed through the observation process (Yt)t N, whose values are assumed conditionally independent given the state process such that the conditional distribution of each Yt depends on the corresponding Xt only, and we denote by gθ(yt | xt) the density (with respect to some dominating measure) on Y of the latter. Using this notation, the joint density of a given vector X0:t = (X0, . . . , Xt) (this is our generic notation for vectors) of states and corresponding observations Y0:t is given by pθ(x0:t, y0:t) = m0(x0)gθ(y0 | x0) s=0 mθ(xs+1 | xs)gθ(ys+1 | xs+1). (1) The model dynamics is governed by some parameter vector θ belonging to some parameter space Θ. In using these models, the focus is generally on inferring the hidden states given data, typically by determining the joint-smoothing distributions pθ(x0:t | y0:t) = pθ(x0:t, y0:t)/pθ(y0:t) or their Online Variational Sequential Monte Carlo marginals, such as the filter distributions pθ(xt | y0:t). Here the joint density pθ(x0:t, y0:t) is given by (1), whereas the likelihood pθ(y0:t) of the observations is the marginal of (1) with respect to y0:t. 
As the calculation of the likelihood requires the computation of complex integrals in high dimensions, the joint-smoothing and filter distributions are generally except in the cases where the model is linear Gaussian or the state space X is a finite set intractable. The calculation of the joint-smoothing distributions is of critical importance also when the parameter θ is unknown and must be estimated using maximum-likelihood or Bayesian methods (see Capp e et al., 2005; Kantas et al., 2015, and the references therein). In the framework of SSMs we may distinguish between batch methods, where parameter and state inference is carried through given a fixed record of observations, and online methods, where inference is carried through in real time as new data becomes available through a data stream. In the online mode, SMC algorithms are particularly well suited for state inference; furthermore, these methods also provide a basis for online parameter estimation (Poyiadjis et al., 2011; Del Moral et al., 2015; Olsson & Westerborn Alenl ov, 2020). We next review briefly SMC and the principles of variational inference. 2.1. Sequential Monte Carlo Methods In the context of SSM, SMC methods approximate the smoothing distribution flow (pθ(x0:t | y0:t))t N by forming iteratively a sequence (ξi 0:t, ωi t)N i=1, t N, of random samples of particles (the ξi 0:t s) with associated weights (the ωi t s). For t N, let Xt := X X (t times); then for any real-valued measurable function ht on Xt+1 that is integrable with respect to pθ(x0:t | y0:t), it holds, letting Ωt := PN i=1 ωi t, ωi t Ωt ht(ξi 0:t) Z ht(x0:t)pθ(x0:t | y0:t) dx0:t. See Del Moral (2004) for a comprehensive treatment of the theory of SMC. The SMC procedure consists of two core operations performed alternately: a selection step, which resamples the particles with replacement according to their weights, and a mutation step, which propagates randomly the selected particles to new locations. After importance sampling-based initialization of (ξi 0, ωi 0)N i=1, the random sample is updated according to Algorithm 1, in which mutation (Line 4) is executed using some generic proposal Markov transition kernel, possibly depending on yt+1, with density rλ(xt+1 | xt, yt+1) parameterized by some vector λ belonging to some parameter space Λ. Line 3 corresponds to the selection step, where indices (Ii t+1)N i=1 guiding the resampling are drawn from the categorical distribution on {1, . . . , N} induced by the particle weights. Determining a good proposal distribution rλ is crucial for Algorithm 1 Particle filter 1: Input: (ξi 0:t, ωi t)N i=1, yt+1. 2: for i = 1, . . . , N do 3: draw Ii t+1 cat((ωℓ t)N ℓ=1); 4: draw ξi t+1 rλ( | ξ Ii t+1 t , yt+1); 5: set ξi 0:t+1 (ξ Ii t+1 0:t , ξi t+1); 6: set ωi t+1 mθ(ξi t+1 | ξ Ii t+1 t )gθ(yt+1 | ξi t+1) rλ(ξi t+1 | ξ Ii t+1 t , yt+1) ; 7: end for 8: return (ξi 0:t+1, ωi t+1)N i=1. the performance of the particle filter (see, e.g., Cornebise et al., 2008; 2014; Gu et al., 2015; Zhao et al., 2022), and the particles should be guided towards state-space regions of non-vanishing likelihood in order to avoid computational waste. For instance, if the latent process is diffuse while the observations are highly informative, letting naively, as in the so-called bootstrap particle filter (Gordon et al., 1993), rλ mθ may result in many particles being assigned a negligible weight and thus significant sample depletion. 
A more appealing, data-driven option is to use the locally optimal proposal satisfying rλ(xt+1 | xt, yt+1) mθ(xt+1 | xt)gθ(yt+1 | xt+1) (and minimizing, e.g., the Kullback Leibler divergence (Cornebise et al., 2008)); however, this proposal is available in a closed form only in a few cases (Doucet et al., 2000; Capp e et al., 2005, Section 7.2.2.2). In Section 3 we will present a technique that allows to learn simultaneously, in an online fashion, both unknown model parameters and an efficient proposal. The proposed method is based on variational inference, which is briefly reviewed in the next section. 2.2. Variational Inference Let T N be a fixed time horizon. To approximate pθ(x0:T | y0:T ) by a variational-inference procedure, a family {qλ(x0:T | y0:T ) : λ Λ} of variational distributions is designed, whereby the ELBO LVI(λ, θ) := Eqλ log pθ(X0:T , y0:T ) qλ(X0:T | y0:T ) pθ(X0:T , y0:T ) qλ(X0:T | y0:T ) = log pθ(y0:T ) is maximized with respect to (λ, θ). Here the bound follows from Jensen s inequality. We note that the equality is satisfied in the ideal case qλ(x0:T | y0:T ) = pθ(x0:T | y0:T ). The optimization is performed by a combination of Monte Carlo sampling and stochastic gradient ascent. The importance weighted autoencoder (IWAE, Burda et al., 2016) extends this idea further by optimizing a similar but Online Variational Sequential Monte Carlo improved ELBO, where the expectation is taken with respect to the law of N independent and qλ-distributed random variables, denoted by q N λ : LIWAE(λ, θ) := Eq N λ pθ(Xi 0:T , y0:T ) qλ(Xi 0:T | y0:T ) pθ(Xi 0:T , y0:T ) qλ(Xi 0:T | y0:T ) pθ(X0:T , y0:T ) qλ(X0:T | y0:T ) = log pθ(y0:T ). (2) Again, the bound follows from Jensen s inequality and the fact that the average that is logarithmised is an unbiased estimator of the likelihood. The IWAE provides an improvement on standard VI as LIWAE provides a tighter lower bound on the likelihood log p(y0:T ) and gets, assuming bounded weights, arbitrarily close to the same as N tends to infinity (Burda et al., 2016, Theorem 1). Still, once T is reasonably large and qλ(x0:T | y0:T ) is not sufficiently close to pθ(x0:T | y0:T ), approximating LIWAE by standard Monte Carlo implies generally high variance; indeed, in the extreme case, the highly skewed distribution of the terms of the likelihood estimator will effectively reduce IWAE to standard VI. This degeneracy problem can be counteracted by replacing standard Monte Carlo approximation with resampling-based SMC. For this reason, Le et al. (2018), Maddison et al. (2017) and Naesseth et al. (2018) all define an ELBO similar to (2), but where the expectation is taken with respect to the law QN λ,θ of the random variables (ξi 0)N i=1 and (ξi t, Ii t)N i=1, t {1, . . . , T}, generated by the particle filter: LSMC(λ, θ) := EQN λ,θ which is again a lower bound on log pθ(x0:T ) as the argument of the logarithm is an unbiased estimator of the likelihood (see, e.g., Chopin & Papaspiliopoulos, 2020, Proposition 16.3). Like in standard VI, the maximization of LSMC is carried out by alternately (1) processing the given data batch y0:t with the particle filter and (2) taking a stochastic gradient ascent step. The latter involves the differentiation of LSMC with respect to (λ, θ), which can be carried through using the reparameterization trick (Kingma & Welling, 2014). 
More specifically, it is assumed that rλ( | x, y) is reparameterizable in the sense that there exists some auxiliary random variable ε, taking on values in some measurable space (E, E) and having distribution ν on (E, E) (the latter not depending on λ), and some function fλ on X Y E, parameterized by λ, such that for every (x, y) X Y, the pushforward distribution ν f 1 λ (x, y, ) coincides with that governed by rλ( | x, y). In addition, importantly, for any given argument (x, y, ε), fλ(x, y, ε) is assumed to be differentiable with respect to λ. A similar reparameterization assumption is made for some initial proposal distribution denoted by r0,λ. The previous assumptions allow us to reparameterize Algorithm 1 by splitting the procedure on Line 4 into two suboperations: first sampling εi t+1 ν and then letting ξi t+1 fλ(ξ Ii t+1 t , yt+1, εi t+1). After this, the importance weights can be calculated as explicit functions of (λ, θ) according to ωi t+1(λ, θ) := mθ(fλ(ξ Ii t+1 t , yt+1, εi t+1) | ξ Ii t+1 t ) gθ(yt+1 | fλ(ξ Ii t+1 t , yt+1, εi t+1)) rλ(fλ(ξ Ii t+1 t , yt+1, εi t+1) | ξ Ii t+1 t , yn+1) (3) and, at initialization, ξi 0 fλ(y0, εi 0) and ωi 0(λ, θ) := m0(fλ(y0, εi 0))gθ(y0 | fλ(y0, εi 0))/r0,λ(fλ(y0, εi 0) | y0). Le et al. (2018), Maddison et al. (2017) and Naesseth et al. (2018) have shown that the Monte Carlo approximation of λ,θLSMC(λ, θ) can, in order to avoid unmanageable variance, be advantageously carried through by targeting the surrogate gradient GT (λ, θ) := EQN λ,θ 1 N Ωt(λ, θ) where QN λ,θ now corresponds to the law of the random variables (εi 0)N i=1 and (εi t, Ii t)N i=1, t {1, . . . , T}, generated by the particle filter using the reparameterization trick. The approximate gradient GT (λ, θ) is indeed different from (λ,θ)LSMC(λ, θ) in that the latter contains one additional term corresponding to the expectation of the product of (λ,θ) log QN λ,θ and the logarithm of the unbiased likelihood estimator (see Le et al., 2018, Appendix A, for details). Nonetheless, since, as demonstrated by Le et al. (2018), Maddison et al. (2017) and Naesseth et al. (2018), this term generally does not make a significant contribution to the gradient and is also very difficult to estimate with reasonable accuracy, we proceed as in the mentioned works and simply discard the same. Roeder et al. (2017), Tucker et al. (2018) and Finke & Thiery (2019) discussed biased gradient approximation in the framework of the IWAE, but we are not aware of any in-depth analyses in the context of VSMC. As mentioned earlier, the variational SMC (VSMC) approach described above is an offline method in the sense that each parameter update requires GT (λ, θ) to be estimated by processing the entire data batch y0:T with the particle filter. However, in many applications it is of utmost importance to be able to update both parameter estimates and proposal distributions in real time, and in the next section we will therefore provide an online extension of VSMC to the setting where the data become available sequentially. Online Variational Sequential Monte Carlo 3. Online Implementation of VSMC In this section our primary goal is to learn, for a given stream (yt)t 0 of observations, the amortized proposal and model parameters as the particles evolve and new observations become available. By rewriting (4) as GT (λ, θ) = EQN λ,θ t=0 (λ,θ) log Ωt(λ, θ) we notice that the computation of the gradient is distributed over time, making it possible to adapt the method to the online setting. 
More precisely, in our scheme, the given parameters (θt, λt) are updated as λt+1 λt + γλ t+1 λ log Ωt+1(λt, θt), (5) θt+1 θt + γθ t+1 θ log Ωt+1(λt+1, θt), (6) where (γλ t )t N>0 and (γθ t )t N>0 are given step sizes (learning rates). A pseudocode for our algorithm, which we refer to as online variational SMC (OVSMC), is displayed in Algorithm 2, from which it can be seen that the updates of λ and θ on Line 7 and Line 14 are based on two distinct sampling steps with different sample sizes L and N, respectively. Indeed, as pointed out by Le et al. (2018) and Zhao et al. (2022), the quantity log Ωt+1(λ, θ) is a biased but consistent estimator of the log-predictive likelihood log pθ(yt+1 | y0:t) and will therefore, regardless of the proposal used, be arbitrarily close to this quantity as the number of particles increases. In contrast to the estimation of θ log Ωt(λ, θ), this is generally problematic in the estimation of λ log Ωt(λ, θ). In fact, a large number of particles reduces the signal-to-noise ratio of the estimator of the latter gradient, up to a point where it reduces to pure noise. For this reason, we take, in the spirit of the alternating ELBOs strategy of Le et al. (2018, Section 4.1, in which the authors consider IWAE and VSMC ELBOs with alternating sample sizes) an approach where λ and θ are updated through two distinct optimization steps, the one for λ with a small number L of particles, typically less than ten, and the one for θ with a possibly large sample size N. Appealingly, as clear from Algorithm 2, the method is based only on particle approximation of the filter distribution flow and therefore does not require saving the trajectories of the particles. This results in an online algorithm with memory requirements that remain uniformly limited in time and are, just like the computational complexity of the algorithm, linear in the number N of particles. In Section 4 we provide a rigorous theoretical justification of (a slightly modified version of) Algorithm 2. In particular, we show that (5) (6) form a classical Robbins Monro scheme (Robbins & Monro, 1951) with state-dependent Markov noise targeting a mean field corresponding to an asymptotic (as the number t of data tends to infinity) VSMC. Algorithm 2 Online Variational SMC (OVSMC) 1: Input: (ξi t, ωi t)N i=1, yt+1, θt, λt. 2: for i 1, . . . , L do 3: draw Ii t+1 cat((ωℓ t)N ℓ=1); 4: draw εi t+1 ν; 5: compute ωi t+1(λt, θt) according to (3); 6: end for 7: set λt+1 λt + γλ t+1 λ log Ωt+1(λt, θt); 8: for i 1, . . . , N do 9: draw Ii t+1 cat((ωℓ t)N ℓ=1); 10: draw εi t+1 ν; 11: set ξi t+1 fλt+1(ξ Ii t+1 t , yt+1, εi t+1); 12: compute ωi t+1(λt+1, θt) according to (3); 13: end for 14: set θt+1 θt + γθ t+1 θ log Ωt+1(λt+1, θt); 15: return (ξi t+1, ωi t+1)N i=1, θt+1, λt+1. 4. Theoretical Results In this section we study the limiting behavior of OVSMC and discuss its connection with batch VSMC. As explained in Section 3, the purpose of the double optimization steps in Algorithm 2 is to improve the performance of the algorithm in practical use; however, here, for simplicity, we move the parameters λ into θ, resulting in the latter also containing algorithmic parameters. Therefore, the algorithm that we theoretically analyze corresponds to Lines 8 14, with reparameterization function fθt and ωi t+1 depending only on θt. 
Furthermore, for technical reasons related to the ergodicity of an extended Markov chain that we will define below and the Lipschitz continuity (in θ) of its transition kernel, the analyzed algorithm repeats twice the generation of the auxiliary variable on Line 10, where one variable is used to estimate the gradient and the other to particle propagation. All details are provided in Appendix A, where the modified procedure is displayed in Algorithm 3 and we also provide all proofs. Let, for t N, Zt := (Xt, Yt, (ξIi t t 1)N i=1, (εi t)N i=1), where (ξIi t t 1)N i=1 and (εi t)N i=1 are generated according to the modified version of Algorithm 2. Then (Zt)t N is a state dependent Markov chain with transition kernel Tθ (described in the supplement), in the sense that given Z0:t, Zt+1 is distributed according to Tθt(Zt, ) (note that θt is deterministic function of Z0:t). Let P denote the law of (Zt)t N when initialized as described previously. Assumption 4.1. The data (yt)t N is the output of an SSM (Xt, Yt)t N on (X Y, X Y) with state and observation transition densities m(xt+1 | xt) and g(yt | xt), respectively. Assumption 4.2. The transition densities m, mθ, gθ and rθ are uniformly bounded from above and below (in all their Online Variational Sequential Monte Carlo arguments as well as in θ). The strong mixing assumptions of Assumption 4.2 are standard in the literature and point to applications where the state space X is a compact set. Proposition 4.3. Let Assumptions 4.1 4.2 hold. Then for every θ Θ, the canonical Markov chain (Zθ t )t N induced by Tθ is uniformly ergodic and admits a stationary distribution τθ. Now, letting Hθ(Zt) := θ log Ωt(θ), we may define the mean field h(θ) := R Hθ(z) τθ(dz) (depending implicitly on the sample size N). Note that by the law of large numbers for ergodic Markov chains it holds, a.s., that limt Pt s=0 Hθ(Zθ s)/t = h(θ). Thus, since VSMC estimates the gradient Gt(θ) by Pt s=0 Hθ(Zθ s), finding a zero of the mean field h(θ) can be considered equivalent to applying VSMC to an infinitely large batch of observations from the model. From this perspective, it is interesting to note that our proposed OVSMC algorithm is a Robbins Monro scheme targeting h(θ), with updates given by θt+1 θt + γθ t+1Hθt(Zt+1) and Zt+1 Tθt(Zt, ), t N. Moreover, as established by our main result, Theorem 4.7, the scheme produces a parameter sequence (θt)t N making the gradient arbitrarily close to zero (in the L2 sense). We preface this result by some further assumptions. Assumption 4.4. The gradients θmθ, θgθ and θrθ, and their compositions with the reparameterization function fθ, are all uniformly bounded and Lipschitz. Assumption 4.5. There exists a bounded function V on Θ (the Lyapunov function) such that h = θV . Assumption 4.6. For every t N>0, γθ t+1 γθ t . In addition, there exist constants a > 0, a > 0 and c (0, 1) such that for every t, γθ t aγθ t+1, γθ t γθ t+1 a (γθ t+1)2 and γ1 c. Theorem 4.7. Let Assumptions 4.1, 4.2, 4.4, 4.5 and 4.6 hold. Then there exist constants b > 0 and b > 0, possibly depending on V , such that for every t N, E[ h(θτ) 2] b + b Pt s=0(γθ s+1)2 Pt s=0 γθ s+1 , where τ cat((γθ s+1)t s=0) is a random variable on {0, . . . , t}. Corollary 4.8. Let the assumptions of Theorem 4.7 hold and let, for every t N>0, γθ t = c/ t, where c > 0 is given in Assumption 4.6. Then E[ h(θτ) 2] = O(log t/ 5. Experiments In this section we illustrate numerically the performance of OVSMC. 
We illustrate the method s capability of learning online good amortized proposals and model parameters and more complex generative models, but also show that it can serve as a strong competitor to VSMC in batch scenarios. Stochastic gradients are passed to the ADAM optimizer (Kingma & Ba, 2015) in Tensorflow 21. All the experiments are run on an Apple Mac Book Pro M1 2020, memory 8GB. 5.1. Linear Gaussian SSM Our first example is a standard linear Gaussian SSM with X = Rdx and Y = Rdy for some (dx, dy) N2 >0. More precisely, we let mθ( | x) and gθ( | x) be Ndx(Ax, S u Su) and Ndy(Bx, S v Sv), respectively, where A and Su are dx dx matrices, B Rdy dx and Sv Rdy dy. We start with the simplest case dx = dy = 1 and generate data under A = 0.8, B = 1 and Su = 0.5. We consider two cases for Sv {0.2, 1.2} corresponding to informative and more non-informative observations, respectively. The state process is initialized with its stationary distribution N(0, S2 u/(1 A2)). Our aim is to estimate A and Su and, in parallel, optimizing the particle proposal. For this purpose, we let rλ( | xt, yt+1) be N(µλ(xt, yt+1), σ2 λ(xt, yt+1)), where µλ and σ2 λ are two distinct neural networks with one dense hidden layer having three and two nodes, respectively, and relu activation functions. In Figure 1 we observe the convergence of the unknown parameters for a set of distinct starting values, noticing that more informative observations yields, as expected, faster convergence. Even though values close to the true parameters are obtained already after about 10000 to 20000 iterations, the algorithm is kept running until t = 50000 for the purpose of learning the proposal. We observe an improvement of the effective sample size (ESS) of the particle cloud as rλ is converging towards the optimal proposal, as shown in Figure 8 in Appendix B. Figure 2 displays conditional proposal densities for some given (xt, yt+1) in a single run; we see that for both choices of Sv, the learned proposal is generally close to the locally optimal one, except in the presence of unlikely outliers in the data, in which case the learned proposal tend to move, as is desirable, its mode towards that of the prior (bootstrap) kernel. Next, we show that our method can also be usefully applied in the batch-data mode. Following Naesseth et al. (2018) and Zhao et al. (2022), let dx = dy = 10 and let the matrix A have elements Aij = 0.42|i j|+1. Furthermore, let Su = I and Sv = 0.5I. We consider two parameterizations of B, one sparse, B = I, and one dense where its elements are random, independent and standard normally distributed. With these parameterizations, a data record y0:T , T = 100, was generated starting from an N(0, I)-distributed initial state. In this part, we view the model parameters θ as being known and focus on learning of the proposal. Instead 1The Python code may be found at https://bitbucket. org/amastrot/ovsmc. Online Variational Sequential Monte Carlo 0 10000 20000 30000 40000 50000 Time iterations t Parameter values Estimated A (mean and 5%-95% quantiles) Estimated Su (mean and 5%-95% quantiles) True value A True value Su 0 10000 20000 30000 40000 50000 Time iterations t Parameter values Estimated A (mean and 5%-95% quantiles) Estimated Su (mean and 5%-95% quantiles) True value A True value Su Figure 1: Parameter learning curves for the one-dimensional linear Gaussian SSM in Section 5.1, obtained using algorithm 2 with L = 5 and N = 10000 for Sv = 0.2 (left) and Sv = 1.2 (right). 
The means and the quantiles are calculated on the basis of 100 learning curves, each starting with a different initial value and based on independently generated observation data. of learning, as done by Naesseth et al. (2018) and Zhao et al. (2022), different local proposals for each time step, we learn, in both batch and online mode, an amortized Gaussian proposal where the mean vector and diagonal covariance matrix are functions of xt and yt+1, modeled by neural networks. These neural networks, which are time invariant, have each a single hidden layer with 16 nodes each and the relu activation function. In this setting, we (1) processed the given observation batch M times using standard VSMC, with gradient-based parameter updates between every sweep of the data, and (2) compared the result with the output of OVSMC when executed on a data sequence of length (T + 1)M formed by repeating the given data record M times. The number of particles was equal to L = 5 for both methods. Figure 3 displays the resulting ELBO evolutions for the two methods and some different ADAM learning rates, and it is clear from this plot that OVSMC, with its appealing convergence properties and reduced noise, is indeed a challenger of VSMC in this batch context. COMPARISON TO CAMPBELL ET AL. (2021) As anticipated in Section 1, the online variational filtering (OVF) approach of Campbell et al. (2021) performs online parameter learning, although considering a variational family that is not particle-based and thus not directly comparable to the VSMC one. Even if it is algorithmically and computationally complex and includes a large number of hyperparameters to be tuned, OVF is a relevant recent work in the context of online variational learning. Thus, we used it as a benchmark for OVSMC, and let the latter solve the same problem as in Figure 1b in Campbell et al. 5 0 5 xt + 1 (xt, yt + 1) = (0, 0) 5 0 5 xt + 1 (xt, yt + 1) = (-1.5, 1.5) 5 0 5 xt + 1 (xt, yt + 1) = (3, -3) 5 0 5 xt + 1 (xt, yt + 1) = (-4, 4) 5 0 5 xt + 1 (xt, yt + 1) = (0, 0) Locally optimal proposal Learned proposal Bootstrap proposal 5 0 5 xt + 1 (xt, yt + 1) = (-1.5, 1.5) 5 0 5 xt + 1 (xt, yt + 1) = (3, -3) 5 0 5 xt + 1 (xt, yt + 1) = (-3, 12) Figure 2: Comparisons of the bootstrap, locally optimal and learned proposals for different (xt, yt+1). Here OVSMC was run for 50000 iterations with L = 5 and N = 10000 for Sv = 0.2 (top) and Sv = 1.2 (bottom). (2021, Section 5.1). The result is displayed in Figure 4, where we demonstrate the comparability between the two methods for learning A and B; however, we note a significant difference in the computational time, since using the code provided by Campbell et al. (2021) OVF takes about 17 hours while OVSMC uses less than one hour for a single run. Even though we are aware that part of this difference may be due to the particular implementation used, the complexity gap between the two methods is definitely non-negligible. 5.2. Stochastic Volatility and RML In this example we focus on parameter estimation and compare OVSMC with Pa RIS-based RML (Olsson & Westerborn Alenl ov, 2020). We consider a univariate stochastic volatility model (Hull & White, 1987), where for x X = R, mθ( | x) and gθ( | x) are N(αx, σ2) and N(0, β2 exp(x)), respectively, with α (0, 1), σ > 0 and β > 0. The state process is initialized according to its N(0, σ2/(1 α2)) stationary distribution. 
Figure 5 reports the learning curves of the estimated parameters θ = (α, σ, β) obtained with OVSMC and particle-based RML, starting from different parameter vectors, for 27 independent data sequences generated under θ = (0.975, 0.165, 0.641). In both algorithms, the learning rate was 10 3. The learning curves show similar behavior, with OVSMC converging faster than RML in several cases when estimating β, but fluctuating somewhat more when learning α and σ. This is to be expected, since the RML procedure is based on particle smoothing. However, we should remember that OVSMC does not use any backward sampling and in addition to learning θ also has the ability to optimize online the proposal kernel over the same family of deep Gaussian proposals as in Section 5.1. As the number of particles grows, the computational speed is also in favor of OVSMC, since the Pa RIS-based RML has a computational bottle- Online Variational Sequential Monte Carlo 0 500 1000 1500 Iterations M of all data l.r. = 0.001 log p (y0 : T) 0 500 1000 1500 Iterations M of all data l.r. = 0.005 0 500 1000 1500 Iterations M of all data l.r. = 0.01 0 500 1000 1500 Iterations M of all data l.r. = 0.05 0 500 1000 1500 Iterations M of all data OVSMC l.r. = 0.0005 0 500 1000 1500 Iterations M of all data OVSMC l.r. = 0.001 0 500 1000 1500 Iterations M of all data l.r. = 0.001 log p (y0 : T) 0 500 1000 1500 Iterations M of all data l.r. = 0.005 0 500 1000 1500 Iterations M of all data l.r. = 0.01 0 500 1000 1500 Iterations M of all data l.r. = 0.05 0 500 1000 1500 Iterations M of all data OVSMC l.r. = 0.0005 0 500 1000 1500 Iterations M of all data OVSMC l.r. = 0.001 Figure 3: ELBO evolutions of VSMC and OVSMC (each running with L = 5) for the multivariate linear Gaussian SSM in Section 5.1 (top: B sparse; bottom: B dense). In each plot, which corrsponds to a particular learning rate, five independent runs och each algorithm are displayed on top of each other and compared with the target log-likelihood. neck stemming from its inherent backward-sampling mechanism; with our implementation, OVSMC was three times faster than Pa RIS-based RML in this specific case, and this ratio grows even more in favor of OVSMC as N increases. 5.3. Deep Generative Model of Moving Agent In this section we study the applicability of OVSMC to more complex and high-dimensional models. Inspired by Le et al. (2018, Section 5.3), we produced a long and partially observable video sequence, with frames represented by 32 32 arrays, showing a moving agent. The motion is similar as the one described in the referenced paper, with the difference that the agent exhibits a stationary behavior by bouncing against the edges of the image. The agent is occluded from the image whenever it moves into the central region of the frame, covered by a 16 30 horizontal rectangle. The data generation is described in more detail in Appendix C.1. Following Le et al. (2018, Section C.1), we model the generative process and the proposal through the framework of variational recurrent neural networks (Chung et al., 2015), with a similar architecture described in more detail in Appendix C.2. 
0 25000 50000 Time iterations t (A) Mean Absolute Error OVSMC (mean) OVSMC (5%-95% quantiles) OVF (5 realizations) 0 25000 50000 Time iterations t (B) Figure 4: Mean absolute errors estimating A (left) and B (right) of five independent runs of OVF, along with the distribution of 40 independent runs of OVSMC, with L = 5 and N = 104, proposal kernel as described in Section 5.1, with 64 nodes in the hidden layer of each neural network, and ADAM learning rates 10 3. Here dx = dy = 10, Su = 0.1I, and Sv = 0.25I and matrices A and B are diagonal with i.i.d. Unif(0.5, 1)-distributed elements. Parameter values OVSMC Learning curves OVSMC Learning curves OVSMC Learning curves OVSMC True values 0 10000 20000 30000 40000 50000 Time iterations t Parameter values RML Learning curves RML Learning curves RML Learning curves RML True values Figure 5: Parameter learning curves obtained with OVSMC (with L = 5 and N = 1000) and Pa RIS-based RML (Olsson & Westerborn Alenl ov, 2020) (with N = 1000), for the stochastic volatility SSM in Section 5.2, with learning rate 10 3. In this setting, we generated a single video sequence with T = 105 frames, and processed the same by iterating Algorithm 2, with N = 20 and L = 5, T times. We remark that since we have to run the algorithm for a large number of iterations, it is clearly infeasible and possibly not even useful to input the whole history of frames and latent states to the recurrent neural network of the model; instead, we include only a fixed window (of length 40) of recent frames and states and discard all the previous information. This allows the average time and memory requirement per iteration to be kept at a constant value. Figures 6 7 show a visual illustration of the method s ability to generate probable videos. The goal is to analyze the quality of the learned generative model by visualizing some Online Variational Sequential Monte Carlo independently sampled videos and comparing them to a reference video generated from a different seed than the one used for training. We consider a 150 frames long reference video. For each sample we use the proposal distribution to generate the latent states for the first 60 iterations, i.e., we access the true frame to sample and then decode the states, and after t = 60 we continue with the generative model only. The images show every third frame for the true video and five independent sample videos. We note that the first 60 frames of the sample videos look similar to the true video, as expected since the model has access to the true observations. After that, the sample videos continue similarly in an initial phase, but start to move at different speeds when the agent disappears behind the rectangle after t = 75. As the generated frame sequences show, the model predicts a smooth movement pattern of the agent, with some random perturbations, as in the second sample, where the agent leaves the rectangle after t = 96 from the top instead of below. Figure 6: See caption in Figure 7. 6. Conclusions We have presented the OVSMC algorithm, a procedure that extends the batch VSMC to the context of streaming data. The proposed methodology allows us to learn, on-the-fly, unknown model parameters and optimal particle proposals in both standard SSM as well as and more complex generative models. Under strong mixing assumptions, which are standard in the literature, we provide theoretical support showing that the stochastic approximation scheme of Figure 7: Comparison between original video and five samples under the learned model. 
Until t = 60, frames are generated using the learned proposal and the given data to encode the latent states; after this, the frames are produced using the generative model. OVSMC solves the same problem as VSMC for an infinite batch size. The interesting question of whether the estimates produced by OVSMC are still consistent for the true parameter in the case of correctly specified SSM despite the biased approximation of the ELBO gradient remains open for future work. Moreover, from an application point of view, it would be appealing to explore, empirically and theoretically, the case where the parameters vary through time (i.e. when dealing with non-stationary data), with the aim of identifying the conditions under which OVSMC would still work. Acknowledgments This work is supported by the Swedish Research Council, Grant 2018-05230. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Online Variational Sequential Monte Carlo Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2016. Blei, D. M., Kucukelbir, A., and Mc Auliffe, J. D. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017. Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In International Conference on Learning Representations, 2016. Campbell, A., Shi, Y., Rainforth, T., and Doucet, A. Online variational filtering and parameter learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 18633 18645. Curran Associates, Inc., 2021. Capp e, O. Online EM algorithm for hidden Markov models. J. Comput. Graph. Statist., 20(3):728 749, 2011. Capp e, O., Moulines, E., and Ryd en, T. Inference in Hidden Markov Models. Springer, New York, 2005. Chopin, N. and Papaspiliopoulos, O. An introduction to sequential Monte Carlo methods. Springer, New York, 2020. Chung, J., Kastner, K., Dinh, L., Goel, K.and Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, volume 28, pp. 2980 2988. Curran Associates, Inc., 2015. Cornebise, J., Moulines, E., and Olsson, J. Adaptive methods for sequential importance sampling with application to state space models. Stat. Comput., 18(4):461 480, 2008. Cornebise, J., Moulines, E., and Olsson, J. Adaptive sequential Monte Carlo by means of mixture of experts. Stat. Comput., 24(3):317 337, 2014. Del Moral, P. Feynman Kac Formulae. Genealogical and Interacting Particle Systems with Applications. Springer, New York, 2004. Del Moral, P., Doucet, A., and Singh, S. S. Uniform stability of a particle approximation of the optimal filter derivative. SIAM Journal on Control and Optimization, 53(3):1278 1304, 2015. Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Spinger, 2018. Doucet, A., Godsill, S., and Andrieu, C. On sequential Monte-Carlo sampling methods for Bayesian filtering. Stat. Comput., 10:197 208, 2000. Doucet, A., De Freitas, N., and Gordon, N. (eds.). Sequential Monte Carlo Methods in Practice. Springer, New York, 2001. Finke, A. and Thiery, A. H. On importance-weighted autoencoders. ar Xiv preprint ar Xiv:1907.10477, 2019. Gordon, N., Salmond, D., and Smith, A. F. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. 
IEE Proc. F, Radar Signal Process., 140:107 113, 1993. Gu, S. S., Ghahramani, Z., and Turner, R. E. Neural adaptive sequential Monte Carlo. Advances in neural information processing systems, 28, 2015. Hull, J. and White, A. The pricing of options on assets with stochastic volatilities. J. Finance, 42:281 300, 1987. Kantas, N., Doucet, A., Singh, S. S., Maciejowski, J., Chopin, N., et al. On particle methods for parameter estimation in state-space models. Statistical science, 30 (3):328 351, 2015. Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pp. 1944 1974. PMLR, 2019. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. Kitagawa, G. Monte-Carlo filter and smoother for non Gaussian nonlinear state space models. J. Comput. Graph. Statist., 1:1 25, 1996. Le, T. A., Igl, M., Rainforth, T., Jin, T., and Wood, F. Autoencoding sequential Monte Carlo. In International Conference on Learning Representations, 2018. Le Gland, F. and Mevel, L. Recursive estimation in HMMs. In Proc. IEEE Conf. Decis. Control, pp. 3468 3473, 1997. Liu, J. S. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput., 6:113 119, 1996. Maddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. Filtering variational objectives. Advances in Neural Information Processing Systems, 30, 2017. Online Variational Sequential Monte Carlo Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Cambridge University Press, London, 2009. Mongillo, G. and Den eve, S. Online learning with hidden Markov models. Neural Computation, 20(7):1706 1716, 2008. doi: 10.1162/neco.2008.10-06-351. Naesseth, C., Linderman, S., Ranganath, R., and Blei, D. Variational sequential Monte Carlo. In International conference on artificial intelligence and statistics, pp. 968 977. PMLR, 2018. Olsson, J. and Westerborn Alenl ov, J. Particle-based online estimation of tangent filters with application to parameter estimation in nonlinear state-space models. Annals of the Institute of Statistical Mathematics, 72:545 576, 2020. Poyiadjis, G., Doucet, A., and Singh, S. S. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65 80, 2011. Robbins, H. and Monro, S. A stochastic approximation method. Ann. Math. Statist., 22:400 407, 1951. Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Tadic, V. Z. B. and Doucet, A. Asymptotic properties of recursive maximum likelihood estimation in non-linear state-space models. ar Xiv preprint ar Xiv:1806.09571, 2018. Tucker, G., Lawson, D., Gu, S., and Maddison, C. J. Doubly reparameterized gradient estimators for monte carlo objectives. In International Conference on Learning Representations, 2018. Zhao, Y., Nassar, J., Jordan, I., Bugallo, M., and Park, I. M. Streaming variational monte carlo. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1150 1161, 2022. Online Variational Sequential Monte Carlo A. Proof of Theorem 4.7 In the first part of this supplement we provide a detailed proof of Theorem 4.7, proceeding as follows. Preliminaries (Section A.1). For technical reasons, the proof of Theorem 4.7 to be presented calls for some more sophisticated notation, introduced in Section A.1.1. In addition, in Section A.1.2 we present the slight modification of Algorithm 2 that is the subject of our theoretical analysis and in which we optimize a unique parameter vector θ containing both model and proposal parameters. Intermediate results (Section A.2). In this part, we redefine the extended state-dependent Markov chain (Zt)t N (discussed in Section 4), including the original SSM as well as the random variables generated by Algorithm 3. Under strong mixing assumptions, we establish, in Proposition A.4 (corresponding to Proposition 4.3) uniform ergodicity of the Markov transition kernel Tθ governing (Zt)t N and the existence of a stationary distribution. After this, we consider the Robbins Monroe scheme with state-dependent Markov noise targeting the mean field h(θ) defined (as in Section 4) as the expectation of the noisy measurement function Hθ under the stationary distribution of Tθ, and prove that it is bounded. Under the assumption that the state-process and emission transition densities as well as their gradients (with respect to θ) are bounded and Lipschitz continuous in θ, Proposition A.6 then establishes the existence of a solution to the Poisson equation associated with the Markov kernel Tθ. Main proofs (Section A.3). Next, we introduce further assumptions regarding the Lyapunov function associated with the mean field and the step-size sequence of the stochastic update, under which we restate and prove Theorem 4.7 and Corollary 4.8. Auxiliary results (Section A.4). Finally, we prove Proposition A.6 using Lemmas A.11 and A.12. The approach to these proofs is partially inspired by the work of Tadic & Doucet (2018). We remark that our analysis is carried through for a fixed particle sample size N N>0 in the SMC algorithm, and it is beyond the scope of our paper to optimize the derived theoretical bounds with respect to N. A.1. Preliminaries A.1.1. NOTATION We let R+ and R + be the sets of nonnegative and positive real numbers, respectively, and denote vectors by xm:n := (xm, xm+1, . . . , xn 1, xn) or, alternatively, by xm:n := (xm, xm+1, . . . , xn 1, xn), depending on the situation. For some general state space (S, S), we let M(S) be the set of measures on S, and M1(S) M(S) the probability measures. For a signed measure σ(ds) on (S, S), we denote by |σ| (ds) its total variation. We now reintroduce the SSM (Xt, Yt)t N, specified in Section 2, in a somewhat more rigorous way. More specifically, (Xt, Yt)t N is defined as a bivariate Markov chain evolving on (X Y, X Y) according to the Markov transition kernel Kθ : (X Y) (X Y) ((x, y), A) 7 ZZ 1A(x , y ) Mθ(x, dx ) Gθ(x , dy ), Mθ : X X (x, A) 7 Z 1A(x ) mθ(x, x ) µ(dx ), Gθ : X Y (x, B) 7 Z 1B(y) gθ(x, y) η(dy), with mθ : X X R+ and gθ : X Y R+ being the state and emission transition densities and µ M(X) and η M(Y) reference measures. (Here we have slightly modified the notation in Section 2, by using the short-hand notation mθ(x, x ) = mθ(x | x) and gθ(x, y) = gθ(y | x).) 
The chain is initialized according to χ Gθ : X Y A 7 R A χ(dx) Gθ(x, dy), where χ is some probability measure on (X, X) having density m0(x) with respect to µ. As specified in Section 2, only the process (Yt)t N is observed, whereas the state process (Xt)t N is unobserved and hence referred to as latent or hidden. Moreover, θ is a parameter belonging to some vector space Θ and governing the dynamics of the model. Online Variational Sequential Monte Carlo A.1.2. ALGORITHM We let Rθ : X Y X [0, 1] be some Markov kernel, the so-called proposal kernel, parameterized by θ Θ as well and having transition density rθ : X X Y R+ with respect to µ, such that for every (x, y, A) X Y X, Rθ((x, y), A) = 0 Z 1A(x )gθ(x , y) Mθ(x, dx ) = 0. On the basis of the proposal kernel, define the weight function wθ(x, x , y) := mθ(x, x )gθ(x , y)/rθ(x, x , y) for (x, x , y) X X Y such that rθ(x, x , y) > 0. In order to express the particles as explicit differentiable functions of θ, the proposal is assumed to be reparameterizable. More precisely, we assume that there exist some state-space (E, E), an easily sampleable probability measure ν M1(E) not depending on θ, and a function fθ : X Y E X such that for all (x, y) X Y and θ Θ, it holds that R h(fθ(x, y, v)) ν(dv) = R h(x ) Rθ((x, y), dx ) for all bounded real-valued measurable functions h on X; in other words, the pushforward distribution ν f 1 θ (x, y, ) coincides with Rθ((x, y), ). Algorithm 3 displays the procedure studied in our analysis. In this slightly modified version of Algorithm 2, the particle system will be represented by the resampled particles, denoted here as ( ξi t)N i=1, t 0. In order to avoid introducing further notation, the resampled particles are conventionally all initialized at time 1 by some arbitrary value u X such that {fθ(u, y0, εi 0)}N i=1, corresponding to (ξi 0)N i=1, are i.i.d. according to some initial proposal distribution Rθ((u, y0), ) depending on u, where (εi 0)N i=1 ν N. As we will see, the cloud of resampled particles will be included in the statedependent Markov chain (Zt)t N governing the perturbations of the stochastic-approximation scheme under consideration; the initialization according to the constant u is described in (7). In Algorithm 3, we operate with two samples of mutated particles, one used for the propagation of the particle cloud and one used for approximation of the gradient, in the sense that the noise variables (εi t+1)N i=1 generated on Line 9 are not used to propagate the particles. Consequently, fθt( ξi t, yt+1, εi t+1) is generally different from ξi t+1. Importantly, this decoupling makes the Markov transition kernel Tθ non-collapsed (free from Dirac components), which, as we will see, facilitates significantly the theoretical analysis of the algorithm. Algorithm 3 OVSMC (modified version) 1: Input: ( ξi t 1)N i=1, yt:t+1, θt 2: for i = 1, . . . , N do 3: draw ξi t Rθt(( ξi t 1, yt), ); 4: set ωi t wθt( ξi t 1, ξi t, yt); 5: end for 6: for i = 1, . . . , N do 7: draw Ii t+1 cat((ωℓ t)N ℓ=1); 8: set ξi t ξ Ii t+1 t ; 9: draw εi t+1 ν; 10: end for 11: set θt+1 θt + γt+1 log PN i=1 wθt( ξi t, fθt( ξi t, yt+1, εi t+1), yt+1) ; 12: return ( ξi t)N i=1, θt+1. A.2. Intermediate results A.2.1. CONSTRUCTION OF (Zt)t N We first provide a more detailed statement of Assumption 4.1, which assumes that the law of the data is governed by an unspecified SSM. Assumption A.1. 
The observed data stream (Yt)t N is the output of an SSM (Xt, Yt)t N on (X Y, X Y) with some state and observation transition kernels M(x, dx ) and G(x, dy), respectively. These kernels have transition densities m(x, x ) and g(x, y) with respect to µ and η, respectively. Online Variational Sequential Monte Carlo As explained in Section 4, the stochastic process (Zt)t N, evolving on the product space (Z, Z) := (X Y XN EN, X Y X N E N), includes the data-generating SSM well as the random variables generated by Algorithm 3, i.e., for t N, Zt := (Xt, Yt, ξ1:N t 1, ε1:N t ). Let P and (Ft)t N be the law and natural filtration of (Zt)t N; then, as described in Section 4, (Zt)t N is a state-dependent Markov chain with transition kernel Tθ, in the sense that for any bounded measurable function h on Z, P-a.s., E[h(Zt+1) | Ft] = Tθth(Zt). The kernel Tθ is given by, with zt = (xt, yt, x1:N t 1, v1:N t ), Tθ(zt, dzt+1) := M(xt, dxt+1) G(xt+1, dyt+1) Z k=1 rθ( xk t 1, xk t , yt) wθ( xi t 1, xi t, yt) PN ℓ=1 wθ( xℓ t 1, xℓ t, yt) δxi t(d xj t) ν N(dv1:N t+1) µ N(dx1:N t ), where we have written R x1:N t to indicate that the integral is with respect to (xi t)N i=1. The initial state Zθ 0 is initialized according to the probability measure τ0(dz0) := χ(dx0) G(x0, dy0) δ N u (d x1:N 1 ) ν N(dv1:N 0 ), (7) where the dummy particles x1:N 1 we are initialized at an arbitrary point u X. Under the following assumption, we will establish that the kernel Tθ is uniformly ergodic and has an invariant distribution. Assumption A.2. There exists ϵ (0, 1) such that for every θ Θ, (x, x ) X2 and y Y, ϵ m(x, x ) 1 ϵ , ϵ mθ(x, x ) 1 ϵ , ϵ gθ(x, y) 1 ϵ , ϵ rθ(x, x , y) 1 The strong mixing assumptions of Assumption A.2 (as well as Assumption A.5 introduced later) are standard in the literature and point to applications where the state and parameter space are compact sets. Note that from Assumption A.2 it follows directly that ϵ3 wθ(x, x , y) ϵ 3 for all θ Θ, (x, x ) X2 and y Y. Let (Zθ t )t N and Pθ denote the canonical Markov chain and its law induced by the Markov transition kernel Tθ. Remark A.3. Note that by Assumption A.2 it follows that the reference measure µ is a finite measure on (X, X). Thus, without loss of generality we assume in the following that µ is a probability measure. We now rewrite and prove Proposition 4.3. Proposition A.4 (Proposition 4.3). Let Assumptions A.1 and A.2 hold. Then for every θ Θ, (Zθ t )t N is geometrically uniformly ergodic and admits a unique stationary distribution τθ M1(Z). Proof. In order to establish uniform ergodicity we show that Tθ allows the state space Z as a ν1-small set for some ν1 M(Z). Indeed for any zt = (xt, yt, x1:N t 1, v1:N t ) Z and A Z, Tθ(zt, A) = Z Z 1A(xt+1, yt+1, x1:N t , v1:N t+1) m(xt, xt+1) g(xt+1, yt+1) µ(dxt+1) η(dyt+1) k=1 rθ( xk t 1, xk t , yt) wθ( xi t 1, xi t, yt) PN ℓ=1 wθ( xℓ t 1, xℓ t, yt) δxi t(d xj t) µ N(dx1:N t ) ν N(dv1:N t+1) ϵ7N+1 Z Z 1A(xt+1, yt+1, x1:N t , v1:N t+1) g(xt+1, yt+1) µ(dxt+1) η(dyt+1) i=1 δxi t(d xj t) µ N(dx1:N t ) ν N(dv1:N t+1). Thus, we may conclude that Tθ allows Z as a ν1-small set with ν1(dz) = ϵ7N+1 g(x, y) µ(dx) η(dy) Z i=1 δxi(d xj) µ N(dx1:N) ν N(dv1:N). (8) Online Variational Sequential Monte Carlo Then, by Meyn & Tweedie (2009, Theorem 16.0.2(v)) or Douc et al. (2018, Theorem 15.3.1(iii)) it follows that (Zθ t )t 0 is geometrically uniformly ergodic. 
Then the Dobrushin coefficient of $T_\theta$ is strictly less than one for all $\theta \in \Theta$, which implies that $T_\theta$ admits a unique stationary distribution $\tau_\theta$ (see, e.g., Douc et al., 2018, Theorems 18.2.4–18.2.5).

A.2.2. STOCHASTIC APPROXIMATION UPDATE AND MEAN FIELD

Now define, for $z = (x, y, \tilde{x}^{1:N}, v^{1:N}) \in \mathsf{Z}$,
$$H_\theta(z) := \nabla \ln \sum_{i=1}^N w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y) = \frac{\sum_{i=1}^N \nabla w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y)}{\sum_{i=1}^N w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y)},$$
where, here and everywhere in the following, gradients are taken with respect to $\theta$, i.e., $\nabla = \nabla_\theta$. Then Algorithm 3 is equivalent to the Robbins–Monro (Robbins & Monro, 1951) stochastic-approximation scheme
$$\theta_{t+1} \leftarrow \theta_t + \gamma_{t+1} H_{\theta_t}(Z_{t+1}), \quad t \in \mathbb{N},$$
where $Z_{t+1} \sim T_{\theta_t}(Z_t, \cdot)$, initialized by some starting guess $\theta_0$. This procedure aims to find a zero of the mean field
$$h(\theta) := \int H_\theta(z)\, \tau_\theta(dz).$$

The next result, Proposition A.6, establishes the boundedness of the mean field $h(\theta)$, the convergence of the expectation of $H_\theta(Z^\theta_t)$ to the same, and the existence of a solution to the Poisson equation associated with $T_\theta$. This intermediate result will be instrumental in the proofs of Section A.3. Proposition A.6 will be proven under the assumption that the gradients of the model and proposal densities are bounded and Lipschitz in $\theta$.

Assumption A.5. There exists $\kappa \in [1, \infty)$ such that for all $(\theta, \theta') \in \Theta^2$, $(x, x') \in \mathsf{X}^2$, $y \in \mathsf{Y}$ and $v \in \mathsf{E}$,
$$\max\{\|\nabla m_\theta(x, f_\theta(x, y, v))\|, \|\nabla g_\theta(f_\theta(x, y, v), y)\|, \|\nabla r_\theta(x, f_\theta(x, y, v), y)\|\} \leq \kappa,$$
$$\max\{|m_\theta(x, x') - m_{\theta'}(x, x')|, |m_\theta(x, f_\theta(x, y, v)) - m_{\theta'}(x, f_{\theta'}(x, y, v))|, \|\nabla m_\theta(x, f_\theta(x, y, v)) - \nabla m_{\theta'}(x, f_{\theta'}(x, y, v))\|\} \leq \kappa \|\theta - \theta'\|,$$
$$\max\{|g_\theta(x, y) - g_{\theta'}(x, y)|, |g_\theta(f_\theta(x, y, v), y) - g_{\theta'}(f_{\theta'}(x, y, v), y)|, \|\nabla g_\theta(f_\theta(x, y, v), y) - \nabla g_{\theta'}(f_{\theta'}(x, y, v), y)\|\} \leq \kappa \|\theta - \theta'\|,$$
$$\max\{|r_\theta(x, x', y) - r_{\theta'}(x, x', y)|, |r_\theta(x, f_\theta(x, y, v), y) - r_{\theta'}(x, f_{\theta'}(x, y, v), y)|, \|\nabla r_\theta(x, f_\theta(x, y, v), y) - \nabla r_{\theta'}(x, f_{\theta'}(x, y, v), y)\|\} \leq \kappa \|\theta - \theta'\|.$$

Proposition A.6. Under Assumptions A.2 and A.5 the following hold true.
(i) $h(\theta)$ is well defined and bounded on $\Theta$.
(ii) For every $\theta \in \Theta$, $h(\theta) = \lim_{t \to \infty} \mathbb{E}_\theta[H_\theta(Z^\theta_t)]$.
(iii) There exists a measurable function $\hat{H}_\theta$ on $\mathsf{Z}$ that satisfies the Poisson equation $H_\theta(z) - h(\theta) = \hat{H}_\theta(z) - T_\theta \hat{H}_\theta(z)$ for every $\theta \in \Theta$ and $z \in \mathsf{Z}$.
(iv) There exists a real number $\alpha \in [1, \infty)$ such that for every $(\theta, \theta') \in \Theta^2$ and $z \in \mathsf{Z}$,
$$\max\{\|H_\theta(z)\|, \|\hat{H}_\theta(z)\|, \|T_\theta \hat{H}_\theta(z)\|\} \leq \alpha,$$
$$\max\{\|\hat{H}_\theta(z) - \hat{H}_{\theta'}(z)\|, \|T_\theta \hat{H}_\theta(z) - T_{\theta'} \hat{H}_{\theta'}(z)\|, \|h(\theta) - h(\theta')\|\} \leq \alpha \|\theta - \theta'\|.$$

The proof of Proposition A.6 is found in Section A.4.

A.3. Proof of the main results

We are now ready to prove Theorem 4.7 and Corollary 4.8, which are restated in some more detail below. Our proofs are based on theory developed by Karimi et al. (2019), where the minimization of a non-convex smooth objective function is considered. Thus, we assume, in Assumption A.7, that the mean field $h(\theta)$ is indeed the gradient of some smooth function of $\theta$, the so-called Lyapunov function (depending on $N$), maximized by the algorithm. However, since the surrogate gradient considered in VSMC, whose time-normalized asymptotic limit is the target of OVSMC, is a biased approximation of the gradient of the ELBO $\mathcal{L}^{\mathrm{SMC}}$ (see Section 2.2), the Lyapunov function does not have a straightforward interpretation in this case.

Assumption A.7. There exists a bounded function $V$ on $\Theta$ (the Lyapunov function) such that $h = \nabla V$.

Assumption A.8. For every $t \in \mathbb{N}_{>0}$, $\gamma_{t+1} \leq \gamma_t$. In addition, there exist constants $a > 0$ and $a' > 0$ such that for every $t$, $\gamma_t \leq a \gamma_{t+1}$, $\gamma_t - \gamma_{t+1} \leq a' \gamma^2_{t+1}$ and $\gamma_1 \leq 1/(2\alpha + 2c_h)$, where the constant $\alpha \in [1, \infty)$ is given in Proposition A.6 and $c_h := \alpha(a + 1)/2 + \alpha^2(2a + 1) + \alpha a'$.
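Before stating the main results, the following self-contained Python toy (our illustration, not the particle-based chain of the paper) shows a Robbins–Monro scheme of the above form driven by Markovian perturbations, here a simple AR(1) chain in place of $(Z_t)_{t \in \mathbb{N}}$:

```python
import numpy as np

# Robbins-Monro with Markovian noise: H_theta(z) = z - theta, so the mean
# field h(theta) = E_pi[Z] - theta vanishes at the stationary mean of Z.
rng = np.random.default_rng(0)
theta, z = 0.0, 0.0
phi, c = 0.8, 0.1          # AR(1): Z_{t+1} = c + phi * Z_t + noise, mean c/(1-phi)
for t in range(1, 50_000):
    z = c + phi * z + 0.1 * rng.standard_normal()   # state-dependent-free toy chain
    gamma = 1.0 / np.sqrt(t)                        # decreasing steps, cf. Corollary A.10
    theta += gamma * (z - theta)                    # theta_{t+1} = theta_t + gamma H(Z_{t+1})
print(theta)   # close to 0.5 = c / (1 - phi), the zero of the mean field
```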
Theorem A.9 (Theorem 4.7). Let Assumptions A.1, A.2, A.5, A.7 and A.8 hold. Then for every $t \in \mathbb{N}$,
$$\mathbb{E}[\|h(\theta_\tau)\|^2] \leq \frac{2\left(d_{0,t} + c_{0,t} + (4\alpha^3 + c_\gamma) \sum_{s=0}^t \gamma^2_{s+1}\right)}{\sum_{s=0}^t \gamma_{s+1}},$$
where $\tau \sim \mathrm{cat}((\gamma_{s+1})_{s=0}^t)$, $\alpha$ is defined in Proposition A.6 and
$$d_{0,t} := \mathbb{E}[V(\theta_{t+1}) - V(\theta_0)], \quad c_\gamma := \alpha^2(2 + \alpha), \quad c_{0,t} := \alpha(\gamma_1 - \gamma_{t+1} + 2).$$

Proof. The proof follows directly from Karimi et al. (2019, Theorem 2), with $V$, $H_\theta$ and $h$ being multiplied by $-1$ as we deal with a maximization problem. We note that Karimi et al. (2019, Assumptions A1–A3) are satisfied by our Assumption A.7, with $c_0 = d_0 = 0$ and $c_1 = d_1 = 1$, and by Proposition A.6, letting $L = \alpha$. Moreover, Karimi et al. (2019, Assumptions A5–A7) are satisfied by Proposition A.6, letting $L^{(0)}_{PH} = L^{(1)}_{PH} = \sigma = \alpha$.

Corollary A.10 (Corollary 4.8). Let the assumptions of Theorem A.9 hold and let the step-size sequence $(\gamma_t)_{t \in \mathbb{N}}$ be given by $\gamma_t = 1/(\sqrt{t}\,(2\alpha + 2c_h))$, where $\alpha$ and $c_h$ are as in Assumption A.8. Then for every $t \in \mathbb{N}_{>0}$,
$$\mathbb{E}[\|h(\theta_\tau)\|^2] = O\!\left(\frac{\log t}{\sqrt{t}}\right),$$
where $\tau \sim \mathrm{cat}((\gamma_{s+1})_{s=0}^t)$.

Proof. Noting that Assumption A.8 is satisfied with $a = \sqrt{2}$ and $a' = 2(\alpha + c_h)$, the result is a direct implication of Theorem A.9.
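The randomized iterate selection $\tau \sim \mathrm{cat}((\gamma_{s+1})_{s=0}^t)$ of Theorem A.9 and the step-size schedule of Corollary A.10 can be sketched as follows; the values of alpha and c_h below are placeholders, since in practice they are determined by Proposition A.6 and Assumption A.8.

```python
import numpy as np

# Step sizes gamma_s = 1 / (sqrt(s) * (2*alpha + 2*c_h)) for s = 1, ..., t+1,
# and a draw of the random index tau at which the gradient norm is evaluated.
alpha, c_h = 1.0, 4.0          # placeholder constants (problem-dependent)
t = 10_000
gammas = 1.0 / (np.sqrt(np.arange(1, t + 2)) * (2 * alpha + 2 * c_h))
rng = np.random.default_rng(1)
tau = rng.choice(t + 1, p=gammas / gammas.sum())
# Sum of gammas grows like sqrt(t) while the sum of squares grows like log(t),
# which is exactly what produces the O(log t / sqrt t) rate of Corollary A.10.
print(tau, gammas.sum(), (gammas ** 2).sum())
```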
A.4. Auxiliary results

In this section we establish Proposition A.6. The proof is prefaced by two lemmas.

Lemma A.11. Let Assumptions A.2 and A.5 hold. Then there exists $\tilde{\kappa} \in [1, \infty)$ such that for all $(\theta, \theta') \in \Theta^2$, $(x, x') \in \mathsf{X}^2$, $y \in \mathsf{Y}$ and $v \in \mathsf{E}$,
$$\|\nabla w_\theta(x, f_\theta(x, y, v), y)\| \leq \tilde{\kappa},$$
$$\max\{|w_\theta(x, x', y) - w_{\theta'}(x, x', y)|, |w_\theta(x, f_\theta(x, y, v), y) - w_{\theta'}(x, f_{\theta'}(x, y, v), y)|, \|\nabla w_\theta(x, f_\theta(x, y, v), y) - \nabla w_{\theta'}(x, f_{\theta'}(x, y, v), y)\|\} \leq \tilde{\kappa} \|\theta - \theta'\|.$$

Proof. First, write
$$\nabla w_\theta(x, f_\theta(x, y, v), y) = \frac{1}{r_\theta(x, f_\theta(x, y, v), y)} \Big(g_\theta(f_\theta(x, y, v), y)\, \nabla m_\theta(x, f_\theta(x, y, v)) + m_\theta(x, f_\theta(x, y, v))\, \nabla g_\theta(f_\theta(x, y, v), y) - w_\theta(x, f_\theta(x, y, v), y)\, \nabla r_\theta(x, f_\theta(x, y, v), y)\Big),$$
implying, by Assumptions A.2 and A.5,
$$\|\nabla w_\theta(x, f_\theta(x, y, v), y)\| \leq \frac{1}{\epsilon^2}\big(\|\nabla m_\theta(x, f_\theta(x, y, v))\| + \|\nabla g_\theta(f_\theta(x, y, v), y)\|\big) + \frac{1}{\epsilon^4} \|\nabla r_\theta(x, f_\theta(x, y, v), y)\| \leq \frac{\kappa(2\epsilon^2 + 1)}{\epsilon^4}.$$
In order to show that $w_\theta$ is Lipschitz, we apply Assumptions A.2 and A.5 according to
$$|w_\theta(x, x', y) - w_{\theta'}(x, x', y)| \leq \frac{m_\theta(x, x')\,|g_\theta(x', y) - g_{\theta'}(x', y)| + g_{\theta'}(x', y)\,|m_\theta(x, x') - m_{\theta'}(x, x')|}{r_\theta(x, x', y)} + \frac{m_{\theta'}(x, x')\, g_{\theta'}(x', y)\,|r_{\theta'}(x, x', y) - r_\theta(x, x', y)|}{r_\theta(x, x', y)\, r_{\theta'}(x, x', y)} \leq \frac{\kappa(2\epsilon^2 + 1)}{\epsilon^4} \|\theta - \theta'\|.$$
Proceeding similarly, we obtain
$$|w_\theta(x, f_\theta(x, y, v), y) - w_{\theta'}(x, f_{\theta'}(x, y, v), y)| \leq \frac{(2\epsilon^2 + 1)\kappa}{\epsilon^4} \|\theta - \theta'\|. \quad (9)$$
Moreover, in order to show that also the gradient $\nabla w_\theta$ is Lipschitz, consider the decomposition (writing $f_\theta$ for $f_\theta(x, y, v)$ for brevity)
$$\|\nabla w_\theta(x, f_\theta, y) - \nabla w_{\theta'}(x, f_{\theta'}, y)\| \leq \left\|\frac{g_\theta(f_\theta, y)\, \nabla m_\theta(x, f_\theta)}{r_\theta(x, f_\theta, y)} - \frac{g_{\theta'}(f_{\theta'}, y)\, \nabla m_{\theta'}(x, f_{\theta'})}{r_{\theta'}(x, f_{\theta'}, y)}\right\| + \left\|\frac{m_\theta(x, f_\theta)\, \nabla g_\theta(f_\theta, y)}{r_\theta(x, f_\theta, y)} - \frac{m_{\theta'}(x, f_{\theta'})\, \nabla g_{\theta'}(f_{\theta'}, y)}{r_{\theta'}(x, f_{\theta'}, y)}\right\| + \left\|\frac{w_{\theta'}(x, f_{\theta'}, y)\, \nabla r_{\theta'}(x, f_{\theta'}, y)}{r_{\theta'}(x, f_{\theta'}, y)} - \frac{w_\theta(x, f_\theta, y)\, \nabla r_\theta(x, f_\theta, y)}{r_\theta(x, f_\theta, y)}\right\|, \quad (10)$$
where, by Assumptions A.2 and A.5,
$$\left\|\frac{g_\theta(f_\theta, y)\, \nabla m_\theta(x, f_\theta)}{r_\theta(x, f_\theta, y)} - \frac{g_{\theta'}(f_{\theta'}, y)\, \nabla m_{\theta'}(x, f_{\theta'})}{r_{\theta'}(x, f_{\theta'}, y)}\right\| \leq \frac{g_\theta(f_\theta, y)\, \|\nabla m_\theta(x, f_\theta) - \nabla m_{\theta'}(x, f_{\theta'})\|}{r_\theta(x, f_\theta, y)} + \frac{\|\nabla m_{\theta'}(x, f_{\theta'})\|\, |g_\theta(f_\theta, y) - g_{\theta'}(f_{\theta'}, y)|}{r_\theta(x, f_\theta, y)} + \frac{\|\nabla m_{\theta'}(x, f_{\theta'})\|\, g_{\theta'}(f_{\theta'}, y)\, |r_{\theta'}(x, f_{\theta'}, y) - r_\theta(x, f_\theta, y)|}{r_\theta(x, f_\theta, y)\, r_{\theta'}(x, f_{\theta'}, y)} \leq \frac{\kappa}{\epsilon^3}(\epsilon + \epsilon^2 \kappa + \kappa)\, \|\theta - \theta'\|.$$
Using (9), the other terms of (10) may be treated similarly, yielding
$$\|\nabla w_\theta(x, f_\theta(x, y, v), y) - \nabla w_{\theta'}(x, f_{\theta'}(x, y, v), y)\| \leq \left(\frac{2\kappa}{\epsilon^3}(\epsilon + \epsilon^2\kappa + \kappa) + \frac{\kappa}{\epsilon^5}(\epsilon + 2\epsilon^2\kappa + 2\kappa)\right) \|\theta - \theta'\| = \frac{(\epsilon + 2\epsilon^3 + 2\kappa + 2\epsilon^4\kappa + 4\epsilon^2\kappa)\kappa}{\epsilon^5} \|\theta - \theta'\|.$$
The proof is then concluded by letting
$$\tilde{\kappa} := \max\left\{\frac{(2\epsilon^2 + 1)\kappa}{\epsilon^4}, \frac{(\epsilon + 2\epsilon^3 + 2\kappa + 2\epsilon^4\kappa + 4\epsilon^2\kappa)\kappa}{\epsilon^5}\right\} = \frac{(\epsilon + 2\epsilon^3 + 2\kappa + 2\epsilon^4\kappa + 4\epsilon^2\kappa)\kappa}{\epsilon^5}.$$

Our second prefatory lemma establishes Lipschitz continuity and exponential contraction of the Markov dynamics underlying the state-dependent process $(Z_t)_{t \in \mathbb{N}}$. Recall that under Assumptions A.1 and A.2, Proposition A.4 provides the existence of a unique stationary distribution $\tau_\theta$ of the canonical Markov chain $(Z^\theta_t)_{t \in \mathbb{N}}$ induced by $T_\theta$. We may then define, for $t \in \mathbb{N}_{>0}$,
$$\bar{T}^t_\theta : \mathsf{Z} \times \mathcal{Z} \ni (z, A) \mapsto T^t_\theta(z, A) - \tau_\theta(A),$$
where $T^t_\theta$ is the $t$-skeleton defined recursively by $T^1_\theta = T_\theta$ and $T^{s+1}_\theta(z, A) = \int T^s_\theta(z, dz')\, T_\theta(z', A)$ for $(z, A) \in \mathsf{Z} \times \mathcal{Z}$. By convention, we let $T^0_\theta(z, A) := \delta_z(A)$.

Lemma A.12. Let Assumptions A.2 and A.5 hold. Then there exists a constant $\varsigma \in [1, \infty)$ (possibly depending on $N$) such that for every $(\theta, \theta') \in \Theta^2$, $z \in \mathsf{Z}$, bounded measurable function $h$ on $\mathsf{Z}$ and $t \in \mathbb{N}$,
(i) $|\bar{T}^t_\theta h(z)| \leq \varrho^t \|h\|_\infty$,
(ii) $|\bar{T}^t_\theta h(z) - \bar{T}^t_{\theta'} h(z)| \leq \varsigma \varrho^{t/2} \|h\|_\infty \|\theta - \theta'\|$,
(iii) $\max\{|\tau_\theta h - \tau_{\theta'} h|, |T_\theta h(z) - T_{\theta'} h(z)|\} \leq \varsigma \|h\|_\infty \|\theta - \theta'\|$,
where $\varrho = 1 - \epsilon^{7N+1}$.

Proof. The first bound (i) follows by Proposition A.4, establishing that $T_\theta$ allows $\mathsf{Z}$ as a $\nu_1$-small set, with $\nu_1$ defined in (8). Then by Meyn & Tweedie (2009, Theorem 16.2.4) it holds that for all $z \in \mathsf{Z}$ and bounded measurable functions $h$ on $\mathsf{Z}$,
$$|\bar{T}^t_\theta h(z)| = |T^t_\theta h(z) - \tau_\theta h| \leq \varrho^t \|h\|_\infty,$$
where $\varrho := 1 - \nu_1(\mathsf{Z}) = 1 - \epsilon^{7N+1}$.

We turn to (ii) and (iii) and introduce the short-hand notations
$$a^k_\theta := r_\theta(\tilde{x}^k_{t-1}, x^k_t, y_t), \quad \beta^N_\theta(d\tilde{x}^{1:N}_t) := \prod_{j=1}^N \sum_{i=1}^N \frac{w_\theta(\tilde{x}^i_{t-1}, x^i_t, y_t)}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)}\, \delta_{x^i_t}(d\tilde{x}^j_t),$$
which depend implicitly on $\tilde{x}^{1:N}_{t-1}$ and $x^{1:N}_t$. We may then write, for given $z_t \in \mathsf{Z}$ and bounded measurable function $h$ on $\mathsf{Z}$,
$$|T_\theta h(z_t) - T_{\theta'} h(z_t)| \leq \iint |h(x_{t+1}, y_{t+1}, \tilde{x}^{1:N}_t, v^{1:N}_{t+1})|\, m(x_t, x_{t+1})\, g(x_{t+1}, y_{t+1})\, \mu(dx_{t+1})\, \eta(dy_{t+1}) \left|\prod_{k=1}^N a^k_\theta\, \beta^N_\theta - \prod_{k=1}^N a^k_{\theta'}\, \beta^N_{\theta'}\right|(d\tilde{x}^{1:N}_t)\, \mu^{\otimes N}(dx^{1:N}_t)\, \nu^{\otimes N}(dv^{1:N}_{t+1}). \quad (11)$$
Here the total-variation measure inside the integral can be bounded according to
$$\left|\prod_{k=1}^N a^k_\theta\, \beta^N_\theta - \prod_{k=1}^N a^k_{\theta'}\, \beta^N_{\theta'}\right| \leq \prod_{k=1}^N a^k_\theta\, |\beta^N_\theta - \beta^N_{\theta'}| + \sum_{i=1}^N \left(\prod_{k=1}^{i-1} a^k_\theta\right) |a^i_\theta - a^i_{\theta'}| \left(\prod_{k=i+1}^N a^k_{\theta'}\right) \beta^N_{\theta'} \leq \frac{1}{\epsilon^N} |\beta^N_\theta - \beta^N_{\theta'}| + \frac{1}{\epsilon^{N-1}} \sum_{i=1}^N |a^i_\theta - a^i_{\theta'}|\, \beta^N_{\theta'}, \quad (12)$$
where we have applied Assumption A.2. To bound the second term in (12), we first note that by Assumption A.5,
$$\sum_{i=1}^N |a^i_\theta - a^i_{\theta'}| \leq N\kappa \|\theta - \theta'\|. \quad (13)$$
Moreover, by rewriting the measure $\beta^N_\theta$ as
$$\beta^N_\theta = \prod_{j=1}^N \sum_{i=1}^N \frac{w_\theta(\tilde{x}^i_{t-1}, x^i_t, y_t)}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)}\, \delta_{x^i_t} = \sum_{i^{1:N} \in \{1,\dots,N\}^N} w^{i^{1:N}}_\theta\, \delta_{x^{i^{1:N}}_t},$$
where we have defined $w^{i^{1:N}}_\theta := \prod_{j=1}^N \big(w_\theta(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t) / \sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)\big)$, we may bound the same as $\beta^N_\theta \leq \bar{\mu} / \epsilon^{6N}$, where we have defined the occupation measure on $\{x^1_t, \dots, x^N_t\}^N$,
$$\bar{\mu}(d\tilde{x}^{1:N}_t) := \frac{1}{N^N} \sum_{i^{1:N} \in \{1,\dots,N\}^N} \delta_{x^{i^{1:N}}_t}(d\tilde{x}^{1:N}_t).$$
Consequently, by combining this with (13) we obtain the bound
$$\frac{1}{\epsilon^{N-1}} \sum_{i=1}^N |a^i_\theta - a^i_{\theta'}|\, \beta^N_{\theta'} \leq \frac{\kappa N}{\epsilon^{7N-1}} \|\theta - \theta'\|\, \bar{\mu} \quad (14)$$
on the second term in (12). We now bound the first term in (12).
For this purpose, write
$$\beta^N_\theta - \beta^N_{\theta'} = \sum_{i^{1:N} \in \{1,\dots,N\}^N} \big(w^{i^{1:N}}_\theta - w^{i^{1:N}}_{\theta'}\big)\, \delta_{x^{i^{1:N}}_t}, \quad (15)$$
where, telescoping over the $N$ factors of $w^{i^{1:N}}_\theta$ and bounding, by Lemma A.11, each factor difference according to
$$\left|\frac{w_\theta(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t)}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)} - \frac{w_{\theta'}(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t)}{\sum_{\ell=1}^N w_{\theta'}(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)}\right| \leq \frac{|w_\theta(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t) - w_{\theta'}(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t)|}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)} + \frac{w_{\theta'}(\tilde{x}^{i^j}_{t-1}, x^{i^j}_t, y_t)}{\sum_{\ell=1}^N w_{\theta'}(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)} \cdot \frac{\sum_{\ell=1}^N |w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t) - w_{\theta'}(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)|}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell_{t-1}, x^\ell_t, y_t)} \leq \frac{\tilde{\kappa}}{N\epsilon^3} \|\theta - \theta'\| + \frac{\tilde{\kappa}}{N\epsilon^9} \|\theta - \theta'\|,$$
while each remaining factor is at most $1/(N\epsilon^6)$, we obtain
$$|w^{i^{1:N}}_\theta - w^{i^{1:N}}_{\theta'}| \leq \frac{(1 + \epsilon^6)\tilde{\kappa} N}{N^N \epsilon^{6N+3}} \|\theta - \theta'\|,$$
implying, via (15), that the first term in (12) can be bounded as
$$\frac{1}{\epsilon^N} |\beta^N_\theta - \beta^N_{\theta'}| \leq \frac{(1 + \epsilon^6)\tilde{\kappa} N}{\epsilon^{7N+3}} \|\theta - \theta'\|\, \bar{\mu}. \quad (16)$$
Thus, combining (12), (14) and (16) yields
$$\left|\prod_{k=1}^N a^k_\theta\, \beta^N_\theta - \prod_{k=1}^N a^k_{\theta'}\, \beta^N_{\theta'}\right| \leq \frac{(\epsilon^6 \tilde{\kappa} + \tilde{\kappa} + \epsilon^4 \kappa) N}{\epsilon^{7N+3}} \|\theta - \theta'\|\, \bar{\mu}. \quad (17)$$
Now, by plugging the bound (17) into (11) and using that $\bar{\mu}$ is a probability measure, we obtain
$$|T_\theta h(z_t) - T_{\theta'} h(z_t)| \leq \|h\|_\infty \iint m(x_t, x_{t+1})\, g(x_{t+1}, y_{t+1})\, \mu(dx_{t+1})\, \eta(dy_{t+1})\, \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}} \|\theta - \theta'\| \int_{x^{1:N}_t} \bar{\mu}(d\tilde{x}^{1:N}_t)\, \mu^{\otimes N}(dx^{1:N}_t)\, \nu^{\otimes N}(dv^{1:N}_{t+1}) = \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}} \|h\|_\infty \|\theta - \theta'\|. \quad (18)$$
Now, using the decomposition, for $t \in \mathbb{N}_{>0}$,
$$T^{t+1}_\theta - T^{t+1}_{\theta'} = \sum_{s=0}^t \big(T^{t-s}_\theta T_\theta T^s_{\theta'} - T^{t-s}_\theta T_{\theta'} T^s_{\theta'}\big) = \sum_{s=0}^t T^{t-s}_\theta (T_\theta - T_{\theta'})\, T^s_{\theta'},$$
we obtain, using (18) and (i) (note that $(T_\theta - T_{\theta'}) T^s_{\theta'} h = (T_\theta - T_{\theta'}) \bar{T}^s_{\theta'} h$, since constants are annihilated), the bound
$$|T^{t+1}_\theta h(z) - T^{t+1}_{\theta'} h(z)| \leq \sum_{s=0}^t \int T^{t-s}_\theta(z, dz')\, |(T_\theta - T_{\theta'}) \bar{T}^s_{\theta'} h(z')| \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}} \|\theta - \theta'\| \sum_{s=0}^t \varrho^s \|h\|_\infty \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}(1 - \varrho)} \|h\|_\infty \|\theta - \theta'\|.$$
Similarly, for the centered skeletons,
$$|\bar{T}^{t+1}_\theta h(z) - \bar{T}^{t+1}_{\theta'} h(z)| = \left|\sum_{s=0}^t \iint \bar{T}^s_{\theta'} h(z'')\, (T_\theta - T_{\theta'})(z', dz'')\, \bar{T}^{t-s}_\theta(z, dz')\right| \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}} \|\theta - \theta'\| \sum_{s=0}^t \varrho^{t-s}\, \varrho^s \|h\|_\infty \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}}\, (t+1)\varrho^t\, \|h\|_\infty \|\theta - \theta'\|.$$
Finally, we can write, for arbitrary $t \in \mathbb{N}_{>0}$ and $z \in \mathsf{Z}$,
$$|\tau_\theta h - \tau_{\theta'} h| \leq |T^t_\theta h(z) - T^t_{\theta'} h(z)| + |\bar{T}^t_\theta h(z)| + |\bar{T}^t_{\theta'} h(z)| \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}(1 - \varrho)} \|h\|_\infty \|\theta - \theta'\| + 2\varrho^t \|h\|_\infty,$$
and letting $t \to \infty$ implies
$$|\tau_\theta h - \tau_{\theta'} h| \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}(1 - \varrho)} \|h\|_\infty \|\theta - \theta'\|.$$
The proof of (ii) and (iii) is now concluded by simply noting that
$$|\bar{T}^t_\theta h(z) - \bar{T}^t_{\theta'} h(z)| \leq \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}}\, t\varrho^{t/2 - 1}\, \varrho^{t/2} \|h\|_\infty \|\theta - \theta'\|$$
and letting
$$\varsigma := \frac{(\epsilon^6\tilde{\kappa} + \tilde{\kappa} + \epsilon^4\kappa) N}{\epsilon^{7N+3}} \max\left\{\sup_{t \in \mathbb{N}} t\varrho^{t/2 - 1}, \frac{1}{1 - \varrho}\right\}.$$

We are now ready to prove Proposition A.6.

Proof of Proposition A.6. Using Assumptions A.2 and A.5 and Lemma A.11, we conclude that for all $\theta \in \Theta$ and $z \in \mathsf{Z}$,
$$\|H_\theta(z)\| \leq \frac{\sum_{i=1}^N \|\nabla w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y)\|}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell, f_\theta(\tilde{x}^\ell, y, v^\ell), y)} \leq \frac{\tilde{\kappa}}{\epsilon^3}, \quad (19)$$
from which (i) immediately follows. We turn to (ii). Using Lemma A.12(i) and (19), for every $t \in \mathbb{N}$ and $z \in \mathsf{Z}$,
$$\|\mathbb{E}_\theta[H_\theta(Z^\theta_t) \mid Z^\theta_0 = z] - h(\theta)\| = \|T^t_\theta H_\theta(z) - \tau_\theta H_\theta\| = \|\bar{T}^t_\theta H_\theta(z)\| \leq \frac{\tilde{\kappa}}{\epsilon^3} \varrho^t.$$
Thus, for every $\theta \in \Theta$,
$$0 \leq \liminf_{t\to\infty} \|\mathbb{E}_\theta[H_\theta(Z^\theta_t)] - h(\theta)\| \leq \limsup_{t\to\infty} \|\mathbb{E}_\theta[\mathbb{E}_\theta[H_\theta(Z^\theta_t) \mid Z^\theta_0] - h(\theta)]\| \leq \limsup_{t\to\infty} \frac{\tilde{\kappa}}{\epsilon^3} \varrho^t = 0,$$
which proves (ii). In order to establish (iii), let
$$\hat{H}_\theta(z) := \sum_{s=0}^\infty \big(T^s_\theta H_\theta(z) - h(\theta)\big), \quad z \in \mathsf{Z}.$$
Indeed, again by Lemma A.12(i), for every $z \in \mathsf{Z}$,
$$\sum_{s=0}^\infty \|T^s_\theta H_\theta(z) - h(\theta)\| \leq \frac{\tilde{\kappa}}{\epsilon^3} \sum_{s=0}^\infty \varrho^s = \frac{\tilde{\kappa}}{\epsilon^3(1 - \varrho)},$$
which implies that $\hat{H}_\theta$ and $T_\theta \hat{H}_\theta$ are well defined and bounded. Moreover, by the dominated convergence theorem, for every $z \in \mathsf{Z}$,
$$T_\theta \hat{H}_\theta(z) = \sum_{s=1}^\infty \big(T^s_\theta H_\theta(z) - h(\theta)\big),$$
implying (iii).
To prove (iv), write, using Lemma A.11 and (19),
$$\|H_\theta(z) - H_{\theta'}(z)\| \leq \frac{\sum_{i=1}^N \|\nabla w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y) - \nabla w_{\theta'}(\tilde{x}^i, f_{\theta'}(\tilde{x}^i, y, v^i), y)\|}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell, f_\theta(\tilde{x}^\ell, y, v^\ell), y)} + \|H_{\theta'}(z)\|\, \frac{\sum_{i=1}^N |w_\theta(\tilde{x}^i, f_\theta(\tilde{x}^i, y, v^i), y) - w_{\theta'}(\tilde{x}^i, f_{\theta'}(\tilde{x}^i, y, v^i), y)|}{\sum_{\ell=1}^N w_\theta(\tilde{x}^\ell, f_\theta(\tilde{x}^\ell, y, v^\ell), y)} \leq \left(\frac{\tilde{\kappa}}{\epsilon^3} + \frac{\tilde{\kappa}^2}{\epsilon^6}\right) \|\theta - \theta'\|, \quad (20)$$
and by (20) and Lemma A.12 it holds that for every $z \in \mathsf{Z}$,
$$\|\bar{T}^t_\theta H_\theta(z) - \bar{T}^t_{\theta'} H_{\theta'}(z)\| \leq \|\bar{T}^t_\theta (H_\theta - H_{\theta'})(z)\| + \|\bar{T}^t_\theta H_{\theta'}(z) - \bar{T}^t_{\theta'} H_{\theta'}(z)\| \leq \left(\frac{\tilde{\kappa}}{\epsilon^3} + \frac{\tilde{\kappa}^2}{\epsilon^6}\right) \varrho^t \|\theta - \theta'\| + \frac{\tilde{\kappa}}{\epsilon^3}\, \varsigma \varrho^{t/2} \|\theta - \theta'\| \leq \bar{\varsigma}\, \varrho^{t/2} \|\theta - \theta'\|,$$
with $\bar{\varsigma} := \tilde{\kappa}/\epsilon^3 + \tilde{\kappa}^2/\epsilon^6 + \tilde{\kappa}\varsigma/\epsilon^3$. Thus, we may write
$$\|T_\theta \hat{H}_\theta(z) - T_{\theta'} \hat{H}_{\theta'}(z)\| = \left\|\sum_{s=1}^\infty \big(T^s_\theta H_\theta(z) - h(\theta)\big) - \sum_{s=1}^\infty \big(T^s_{\theta'} H_{\theta'}(z) - h(\theta')\big)\right\| \leq \sum_{s=0}^\infty \|\bar{T}^s_\theta H_\theta(z) - \bar{T}^s_{\theta'} H_{\theta'}(z)\| \leq \frac{\bar{\varsigma}}{1 - \varrho^{1/2}} \|\theta - \theta'\|,$$
and the same bound holds for $\|\hat{H}_\theta(z) - \hat{H}_{\theta'}(z)\|$. Finally, by Lemma A.12 and (20) again, we have
$$\|h(\theta) - h(\theta')\| \leq \int \|H_\theta(z) - H_{\theta'}(z)\|\, \tau_\theta(dz) + \|\tau_\theta H_{\theta'} - \tau_{\theta'} H_{\theta'}\| \leq \left(\frac{\tilde{\kappa}}{\epsilon^3} + \frac{\tilde{\kappa}^2}{\epsilon^6} + \frac{\tilde{\kappa}}{\epsilon^3}\varsigma\right) \|\theta - \theta'\|,$$
which allows us, by letting $\alpha$ be the maximum of $1$ and the constants appearing in (19), in the bound on $\|\hat{H}_\theta\|$ and in the three Lipschitz bounds above, to conclude the proof of (iv).

B. ESS improvement for the model in Section 5.1

Figure 8 shows how the effective sample size (ESS; Liu, 1996) of the particle cloud improves during a single run of OVSMC while $r_\lambda$ is being learned in the univariate linear Gaussian model of Section 5.1, finally reaching the performance of the locally optimal proposal. This is evident for $S_v = 0.2$; on the other hand, when $S_v = 1.2$, the particles are well propagated into regions of non-negligible likelihood even under the bootstrap proposal, so that the normalized ESS stays close to one regardless of the proposal.

[Figure 8 (two panels, x-axis: time iterations $t$; curves: locally optimal proposal, learned proposals, bootstrap proposal): Evolution of (every 100th) normalized ESS for the one-dimensional linear Gaussian SSM in Section 5.1, for $S_v = 0.2$ (left) and $S_v = 1.2$ (right).]
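For reference, the normalized ESS monitored in Figure 8 is the standard estimator $(\sum_i \omega^i)^2 / (N \sum_i (\omega^i)^2)$; a small sketch of its computation (our implementation, working in the log domain for stability):

```python
import numpy as np

# Normalized effective sample size (ESS, Liu, 1996) of a weighted particle cloud.
def normalized_ess(log_weights: np.ndarray) -> float:
    w = np.exp(log_weights - log_weights.max())   # stabilize before normalizing
    w /= w.sum()
    # With normalized weights, (sum w)^2 / (N sum w^2) reduces to 1 / (N sum w^2).
    return 1.0 / (len(w) * np.sum(w ** 2))

print(normalized_ess(np.zeros(100)))      # uniform weights -> 1.0
print(normalized_ess(np.arange(100.0)))   # near-degenerate weights -> close to 1/N
```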
C. Details on the deep generative model of Section 5.3

In this section we provide more details on the model presented in Section 5.3.

C.1. Data generation

First, we describe how the frames of the video are generated. The movement can briefly be described as a partially observed Gaussian random walk confined to a rectangle. More specifically, the agent moves on the rectangle $[0, 1] \times [0, 5] \subset \mathbb{R}^2$, starting from a uniformly sampled point on $[0, 1] \times \{5\}$. It then moves according to a bivariate Gaussian random walk with covariance matrix $0.004 I$ and vertical drift being initially $0.15$ and changing sign every time either the bottom or the top edge is hit. In practice, the agent bounces every time it hits any edge. Each frame is then created by projecting the rectangle onto a $32 \times 32$ array and giving the agent an approximately round shape by overlapping, as a plus sign, two $3 \times 5$ rectangles, one vertical and one horizontal. In the created arrays, the background has value zero, while the pixels representing the agent are equal to one. All frames have a horizontal $16 \times 30$ rectangle in the center, which (partially) occludes the view of the agent every time it (partially) falls into that area. This area is represented by 0.5-valued pixels.

C.2. Model architecture

Inspired by Le et al. (2018, Section C.1), the model is constructed on the basis of the variational recurrent neural network (VRNN; Chung et al., 2015) framework. More precisely, the generative model is represented by the process $(X_{t+1}, H_t, Y_{t+1})_{t \in \mathbb{N}_{>0}}$, with joint density
$$p(x_{1:T}, h_{0:T}, y_{1:T}) = p_0(h_0) \prod_{t=0}^{T-1} m_\theta(x_{t+1} \mid h_t)\, g_\theta(y_{t+1} \mid h_t, x_{t+1})\, p_{\lambda,\theta}(h_{t+1} \mid h_t, x_{t+1}, y_{t+1}),$$
where $T$ is a fixed time horizon, $y_{1:T}$ represent the frames of the video, $x_{1:T}$ some lower-dimensional latent states, and $h_{0:T}$ are the so-called hidden states of the gated recurrent unit (GRU), which is the specific recurrent neural network (RNN) used within the whole architecture. The initial and transition distributions of the generative model are
$$H_0 \sim \mathcal{N}(0, I), \quad X_{t+1} \mid H_t = h_t \sim \mathcal{N}(\mu^x_\theta(h_t), \sigma^x_\theta(h_t)^2),$$
$$Y_{t+1} \mid H_t = h_t, X_{t+1} = x_{t+1} \sim \mathrm{Bernoulli}(\mu^y_\theta(\phi^x_\theta(x_{t+1}), h_t)),$$
$$H_{t+1} \mid H_t = h_t, X_{t+1} = x_{t+1}, Y_{t+1} = y_{t+1} \sim \delta_{\mathrm{GRU}_\lambda(h_t, \phi^x_\theta(x_{t+1}), \phi^y_\lambda(y_{t+1}))}$$
for $t \in \mathbb{N}$, while the proposal $r_\lambda(x_{t+1} \mid y_{t+1}, h_t)$ is given by
$$X_{t+1} \mid Y_{t+1} = y_{t+1}, H_t = h_t \sim \mathcal{N}(\mu^p_\lambda(\phi^y_\lambda(y_{t+1}), h_t), \sigma^p_\lambda(\phi^y_\lambda(y_{t+1}), h_t)^2).$$

In more detail, the model comprises the following neural networks.

- $\mu^x_\theta$ and $\sigma^x_\theta$ have two dense layers each, the second one being the output, with 128 nodes each (corresponding to the size of the latent states), and share the first layer. The activations of the first layer are ReLU functions, while the activations of the output layers are linear and softplus, respectively.
- $\phi^x_\theta$ is a single dense output layer with 128 nodes and the ReLU activation function.
- $\phi^y_\lambda$ is the encoder of the frames and is represented by a sequential architecture with four convolutional layers, all with $4 \times 4$ filters, stride two and padding one, except the last one, which has stride one and zero padding; the numbers of filters are, in order, 32, 128, 64 and 32. The activation functions of the convolutions are leaky ReLUs with slope 0.2, and we add batch-normalization layers after each of them (except the last one, which simply has linear activation). In the end, the output is flattened to obtain a tensor in $\mathbb{R}^{32}$.
- $\mu^y_\theta$ is the decoder, which is modeled by a sequential architecture with transposed convolutions. In particular, the first layer has 128 $4 \times 4$ filters with stride one and zero padding. Then we have two layers with 64 and 32 $4 \times 4$ filters, respectively, while the last one has a single $4 \times 4$ filter; all of these have stride two and padding one. We again use leaky ReLUs with slope 0.2 as activations, together with batch-normalization layers. The activation function of the output layer is instead a sigmoid, in order to obtain an output in $(0, 1)^{32 \times 32}$.
- $\mathrm{GRU}_\lambda$ is a gated recurrent unit RNN, which takes as input the concatenated outputs of $\phi^x_\theta$ and $\phi^y_\lambda$ to produce deterministically the next hidden state $h_{t+1}$ of the RNN. Here $\mathrm{GRU}_\lambda$ takes $h_t$ as an additional input to model the recurrence, but in practice, during training, we need to input a time series of (functions of) latent states and frames. Since we deal with streaming data, it would be infeasible to input the whole history at every iteration, and we hence include only the 40 most recent latent states and frames when learning $\mathrm{GRU}_\lambda$.
- $\mu^p_\lambda$ and $\sigma^p_\lambda$ have three dense layers of size 128, including the output, and share the first two layers, with ReLU activations except for the last layers, which have linear and softplus activations, respectively.

We note that the subscripts of the neural networks defined above indicate which optimizer (the one for the generative model or the one for the proposal) is used. Even if the GRU is supposed to be part of the generative model, we noticed a learning improvement when considering it part of the proposal, motivated by the fact that it is the first (deterministic) sampling operation in the propagation of the particles. In order to describe our procedure more clearly, we display its pseudocode in Algorithm 4; a simplified code sketch of a single VRNN particle step is given below. In our notation, $(\xi^{x,i}_{t-39:t})_{i=1}^N$ are the particles representing the latent states of the process, while $(\xi^{h,i}_{t-40:t-1})_{i=1}^N$ refer to the hidden states of the GRU. Note that for most of the variables involved, direct dependencies on the parameters are omitted to keep the notation less cumbersome.
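The following PyTorch sketch illustrates one VRNN-style propose-weight-update step of the kind used above, with heavily reduced dimensions (latent size 8 and flat 16-pixel frames instead of 128 and $32 \times 32$) and dense layers in place of the convolutional encoder and decoder; all layer sizes and names here are our simplifications, not the exact architecture.

```python
import torch
import torch.nn as nn

D_X, D_Y, D_H = 8, 16, 32  # toy latent, frame and GRU-state sizes

phi_x = nn.Sequential(nn.Linear(D_X, D_X), nn.ReLU())          # latent-state embedding
phi_y = nn.Sequential(nn.Linear(D_Y, D_X), nn.ReLU())          # frame encoder (conv stack in the paper)
mu_x, log_sig_x = nn.Linear(D_H, D_X), nn.Linear(D_H, D_X)     # prior m_theta(x_{t+1} | h_t)
dec = nn.Sequential(nn.Linear(D_X + D_H, D_Y), nn.Sigmoid())   # Bernoulli means g_theta(y | h, x)
gru = nn.GRUCell(2 * D_X, D_H)                                 # deterministic h_{t+1} update
mu_p, log_sig_p = nn.Linear(D_X + D_H, D_X), nn.Linear(D_X + D_H, D_X)  # proposal r_lambda

def step(h, y_next):
    """One particle step: propose x_{t+1}, compute its log-weight, update h_{t+1}."""
    enc = phi_y(y_next)
    inp = torch.cat([enc, h], dim=-1)
    mean_p, std_p = mu_p(inp), torch.exp(log_sig_p(inp))
    x = mean_p + std_p * torch.randn_like(mean_p)              # reparameterized draw
    prior = torch.distributions.Normal(mu_x(h), torch.exp(log_sig_x(h)))
    prop = torch.distributions.Normal(mean_p, std_p)
    lik = torch.distributions.Bernoulli(probs=dec(torch.cat([x, h], dim=-1)))
    # log m_theta(x | h) + log g_theta(y | h, x) - log r_lambda(x | y, h)
    log_w = (prior.log_prob(x) - prop.log_prob(x)).sum(-1) + lik.log_prob(y_next).sum(-1)
    h_next = gru(torch.cat([phi_x(x), enc], dim=-1), h)        # delta-distributed h_{t+1}
    return x, h_next, log_w

# Toy usage with a batch of 5 particles and a random binary frame.
x, h, log_w = step(torch.zeros(5, D_H), torch.bernoulli(torch.full((5, D_Y), 0.3)))
```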
We remark that for iterations $t < 40$, the starting time of the particle vectors must be taken as one for $x$ and zero for $h$.

Algorithm 4 OVSMC for the deep generative model of the moving agent of Section 5.3
1: Input: $(\xi^{x,i}_{t-39:t}, \xi^{h,i}_{t-40:t-1}, \omega^i_t)_{i=1}^N$, $y_{t-39:t+1}$, $\theta_t$, $\lambda_t$
2: for $i = 1, \dots, N$ do
3:   draw $I^i_{t+1} \sim \mathrm{cat}((\omega^\ell_t)_{\ell=1}^N)$;
4:   set $\xi^{h,i}_{t-39:t} \leftarrow \mathrm{GRU}_{\lambda_t}(\text{initial state} = \xi^{h,I^i_{t+1}}_{t-40}, \phi^x_{\theta_t}(\xi^{x,I^i_{t+1}}_{t-39:t}), \phi^y_{\lambda_t}(y_{t-39:t}))$;
5:   draw $\varepsilon^i_{t+1} \sim \mathcal{N}(0, I_{128})$;
6:   set $\xi^{x,i}_{t+1} \leftarrow \mu^p_{\lambda_t}(\phi^y_{\lambda_t}(y_{t+1}), \xi^{h,i}_t) + \sigma^p_{\lambda_t}(\phi^y_{\lambda_t}(y_{t+1}), \xi^{h,i}_t)\, \varepsilon^i_{t+1}$;
7:   set $\omega^i_{t+1}(\lambda_t, \theta_t) \leftarrow \dfrac{m_{\theta_t}(\xi^{x,i}_{t+1} \mid \xi^{h,i}_t)\, g_{\theta_t}(y_{t+1} \mid \xi^{h,i}_t, \xi^{x,i}_{t+1})}{r_{\lambda_t}(\xi^{x,i}_{t+1} \mid y_{t+1}, \xi^{h,i}_t)}$;
8: end for
9: set $\lambda_{t+1} \leftarrow \lambda_t + \gamma^\lambda_{t+1} \nabla_\lambda \log \sum_{i=1}^N \omega^i_{t+1}(\lambda_t, \theta_t)$;
10: for $i = 1, \dots, N$ do
11:   draw $I^i_{t+1} \sim \mathrm{cat}((\omega^\ell_t)_{\ell=1}^N)$;
12:   set $\xi^{h,i}_{t-39:t} \leftarrow \mathrm{GRU}_{\lambda_{t+1}}(\text{initial state} = \xi^{h,I^i_{t+1}}_{t-40}, \phi^x_{\theta_t}(\xi^{x,I^i_{t+1}}_{t-39:t}), \phi^y_{\lambda_{t+1}}(y_{t-39:t}))$;
13:   draw $\varepsilon^i_{t+1} \sim \mathcal{N}(0, I_{128})$;
14:   set $\xi^{x,i}_{t+1} \leftarrow \mu^p_{\lambda_{t+1}}(\phi^y_{\lambda_{t+1}}(y_{t+1}), \xi^{h,i}_t) + \sigma^p_{\lambda_{t+1}}(\phi^y_{\lambda_{t+1}}(y_{t+1}), \xi^{h,i}_t)\, \varepsilon^i_{t+1}$;
15:   set $\omega^i_{t+1}(\lambda_{t+1}, \theta_t) \leftarrow \dfrac{m_{\theta_t}(\xi^{x,i}_{t+1} \mid \xi^{h,i}_t)\, g_{\theta_t}(y_{t+1} \mid \xi^{h,i}_t, \xi^{x,i}_{t+1})}{r_{\lambda_{t+1}}(\xi^{x,i}_{t+1} \mid y_{t+1}, \xi^{h,i}_t)}$;
16:   set $\xi^{x,i}_{t-38:t+1} \leftarrow (\xi^{x,I^i_{t+1}}_{t-38:t}, \xi^{x,i}_{t+1})$;
17: end for
18: set $\theta_{t+1} \leftarrow \theta_t + \gamma^\theta_{t+1} \nabla_\theta \log \sum_{i=1}^N \omega^i_{t+1}(\lambda_{t+1}, \theta_t)$;
19: return $(\xi^{x,i}_{t-38:t+1}, \xi^{h,i}_{t-39:t}, \omega^i_{t+1})_{i=1}^N$, $\theta_{t+1}$, $\lambda_{t+1}$.
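The alternating updates on Lines 9 and 18, with separate step sizes $\gamma^\lambda$ and $\gamma^\theta$ for the proposal and model parameter groups, can be realized with two separate optimizers; the following toy sketch (with placeholder parameters and weight functions, not the actual networks) illustrates the pattern:

```python
import torch

# Two-optimizer pattern for Lines 9 and 18 of Algorithm 4; since the paper
# writes plain gradient-ascent steps, we negate the objective for descent-style
# optimizers. The parameters and weights below are stand-ins.
lam = torch.nn.Parameter(torch.zeros(4))       # stand-in for proposal parameters
theta = torch.nn.Parameter(torch.zeros(4))     # stand-in for model parameters
opt_lam = torch.optim.SGD([lam], lr=1e-3)      # plays the role of gamma^lambda
opt_theta = torch.optim.SGD([theta], lr=1e-4)  # plays the role of gamma^theta

def toy_log_weights(lam, theta):
    # Placeholder for log omega^i_{t+1}; any differentiable function works here.
    return -(lam.sum() - 1.0) ** 2 - (theta.sum() + 2.0) ** 2 + torch.zeros(8)

# Line 9: update lambda with theta frozen, using weights omega(lam_t, theta_t).
opt_lam.zero_grad()
(-torch.logsumexp(toy_log_weights(lam, theta.detach()), 0)).backward()
opt_lam.step()
# Lines 10-18: recompute weights omega(lam_{t+1}, theta_t), then update theta.
opt_theta.zero_grad()
(-torch.logsumexp(toy_log_weights(lam.detach(), theta), 0)).backward()
opt_theta.step()
```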