Published as a conference paper at ICLR 2022

SELF-SUPERVISED INFERENCE IN STATE-SPACE MODELS

David Ruhe
AI4Science, AMLab, Anton Pannekoek Institute
University of Amsterdam, The Netherlands
d.ruhe@uva.nl

Patrick Forré
AI4Science, AMLab
University of Amsterdam, The Netherlands
p.d.forre@uva.nl

ABSTRACT

We perform approximate inference in state-space models with nonlinear state transitions. Without parameterizing a generative model, we apply Bayesian update formulas using a local linearity approximation parameterized by neural networks. This is accompanied by a maximum likelihood objective that requires no supervision via uncorrupt observations or ground truth latent states. The optimization backpropagates through a recursion similar to the classical Kalman filter and smoother. Additionally, using an approximate conditional independence, we can perform smoothing without having to parameterize a separate model. In scientific applications, domain knowledge can give a linear approximation of the latent transition maps, which we can easily incorporate into our model. Usage of such domain knowledge is reflected in excellent results (despite our model's simplicity) on the chaotic Lorenz system compared to fully supervised and variational inference methods. Finally, we show competitive results on an audio denoising experiment.

1 INTRODUCTION

Many sequential processes in industry and research involve noisy measurements that describe latent dynamics. A state-space model is a type of graphical model that effectively represents such noise-afflicted data (Bishop, 2006). The joint distribution is assumed to factorize according to a directed graph that encodes the dependencies between variables using conditional probabilities. One is usually interested in performing inference, i.e., obtaining reasonable estimates of the posterior distribution of the latent states or uncorrupt measurements. Approaches involving sampling (Neal et al., 2011), variational inference (Kingma & Welling, 2013), or belief propagation (Koller & Friedman, 2009) have been proposed before. Assuming a hidden Markov process (Koller & Friedman, 2009), the celebrated Kalman filter and smoother (Kalman, 1960; Rauch et al., 1965) are classical approaches to solving the posterior inference problem. However, the Markov assumption, together with linear Gaussian transition and emission probabilities, limits their flexibility.

We present filtering and smoothing methods that are related to the classical Kalman filter updates but are augmented with flexible function estimators, without using a constrained graphical model. By noting that the filtering and smoothing recursions can be backpropagated through, these estimators can be trained with a principled maximum-likelihood objective reminiscent of the noise2noise objective (Lehtinen et al., 2018; Laine et al., 2019). By using a locally linear transition distribution, the posterior distribution remains tractable despite the use of non-linear function estimators. Further, we show how a linearized smoothing procedure can be applied directly to the filtering distributions, removing the need to train a separate model for smoothing.

To verify these claims, we perform three experiments: (1) a linear dynamics filtering experiment, in which we show that our models approximate the optimal solution given sufficient data and that including expert knowledge can yield better estimates of the latent states; (2) a more challenging chaotic Lorenz smoothing experiment, which shows that our models perform on par with recently proposed supervised models; and (3) an audio denoising experiment that uses real-world noise, showing the practical applicability of the methods.
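To make the above concrete, here is a minimal, self-contained sketch (not the paper's implementation) of the general recipe: a small neural network predicts a locally linear transition and process noise, a Kalman-style filtering recursion is unrolled over the noisy measurements, and the whole recursion is backpropagated through a self-supervised objective that scores each noisy observation under the one-step-ahead predictive distribution, in the spirit of the noise2noise objective. Everything here — the network architecture, state dimension, identity emission, assumed-known noise covariance, and the toy rotating-state data — is an illustrative assumption for this sketch only.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import MultivariateNormal

D = 2                          # latent state dimension (illustrative)
R = (0.3 ** 2) * torch.eye(D)  # measurement noise covariance, assumed known here


class LocalLinearTransition(nn.Module):
    """Predicts a state-dependent transition matrix A_k and diagonal process noise Q_k."""

    def __init__(self, d, hidden=32):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, d * d + d))

    def forward(self, x):
        out = self.net(x)
        A = out[: self.d * self.d].reshape(self.d, self.d)
        q = F.softplus(out[self.d * self.d:]) + 1e-4
        return A, torch.diag(q)


def filter_step(mu, P, y, transition, R):
    """One Kalman-style predict/update step with a learned, locally linear transition."""
    A, Q = transition(mu)                          # linearization around the filtered mean
    mu_pred = A @ mu                               # predictive mean
    P_pred = A @ P @ A.T + Q                       # predictive covariance
    K = P_pred @ torch.linalg.inv(P_pred + R)      # Kalman gain (emission = identity)
    mu_new = mu_pred + K @ (y - mu_pred)           # filtered mean
    P_new = (torch.eye(len(mu)) - K) @ P_pred      # filtered covariance
    P_new = 0.5 * (P_new + P_new.T)                # keep numerically symmetric
    return mu_pred, P_pred, mu_new, P_new


def self_supervised_loss(ys, transition, R):
    """Average negative log-likelihood of each *noisy* observation under the
    one-step-ahead predictive distribution; no clean targets are used anywhere."""
    mu, P = torch.zeros(D), torch.eye(D)
    nll = 0.0
    for y in ys:
        mu_pred, P_pred, mu, P = filter_step(mu, P, y, transition, R)
        nll = nll - MultivariateNormal(mu_pred, P_pred + R).log_prob(y)
    return nll / len(ys)


# Toy data: noisy observations of a slowly rotating 2-D latent state (illustrative only).
torch.manual_seed(0)
A_true = torch.tensor([[math.cos(0.1), -math.sin(0.1)],
                       [math.sin(0.1), math.cos(0.1)]])
x, ys = torch.tensor([1.0, 0.0]), []
for _ in range(100):
    x = A_true @ x
    ys.append(x + 0.3 * torch.randn(D))

# Train by backpropagating through the entire filtering recursion.
transition = LocalLinearTransition(D)
optimizer = torch.optim.Adam(transition.parameters(), lr=1e-2)
for step in range(200):
    optimizer.zero_grad()
    loss = self_supervised_loss(ys, transition, R)
    loss.backward()
    optimizer.step()
```

In this sketch, each observation is scored under the one-step-ahead predictive distribution, i.e., before the filter has seen it, which avoids the degenerate solution of simply copying the current noisy observation.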
[Figure 1, five panels from left to right: Noisy measurements; Extended Kalman Smoother; Recurrent Smoother (Ours); Recursive Smoother (Ours); Ground Truth.]
Figure 1: Best viewed on screen. Qualitative results of our work. To the noisy measurements (1st from left), we apply an extended Kalman smoother (2nd). From the noisy measurements, we learn a recurrent model that does slightly better (3rd). Our recursive model combines expert knowledge with inference (4th), yielding the best result. Ground truth is provided for comparison (5th).

Our contributions can be summarized as follows.

1. We show that the posterior inference distribution of a state-space model remains tractable while parameter estimation is performed by neural networks. This means that we can apply the classical recursive Bayesian updates, akin to the Kalman filter and smoother, with mild assumptions on the generative process.

2. Our proposed method is optimized using maximum likelihood in a self-supervised manner. That is, ground truth values of states and measurements are not assumed to be available for training. Still, despite our model's simplicity, our experiments show that it performs better than or on par with several baselines.

3. We show that the model can be combined with prior knowledge about the transition and emission probabilities, allowing for better applicability in low-data regimes and incentivizing the model to provide more interpretable estimates of the latent states.

4. A linearized smoothing approach is presented that does not require explicit additional parameterization and learning of the smoothing distribution.

2 RELATED WORK

Becker et al. (2019) provide a detailed discussion of recent related work, which we build on here and in table 1. An early method that extends the classical Kalman filter by allowing nonlinear transitions and emissions is the Extended Kalman filter (Ljung, 1979). It is limited by its naive approach of locally linearizing the transition and emission distributions. Furthermore, the transition and emission mechanisms are usually assumed to be known, or estimated with Expectation Maximization (Moon, 1996). More flexible methods that combine deep learning with variational inference include Black Box Variational Inference (Archer et al., 2015), Structured Inference Networks (Krishnan et al., 2017), the Kalman Variational Autoencoder (Fraccaro et al., 2017), Deep Variational Bayes Filters (Karl et al., 2017), Variational Sequential Monte Carlo (Naesseth et al., 2018), and the Disentangled Sequential Autoencoder (Yingzhen & Mandt, 2018). However, the lower-bound objective makes these approaches less scalable and accurate (see also Becker et al. (2019)). Furthermore, all of the above methods explicitly assume a graphical model, imposing a strong but potentially harmful inductive bias. The Backprop KF (Haarnoja et al., 2016) and the Recurrent Kalman Network (Becker et al., 2019) move away from variational inference and borrow Bayesian filtering techniques from the Kalman filter. We follow this direction but do not require supervision through ground truth latent states or uncorrupt emissions. Satorras et al. (2019) combine Kalman filters with graph neural networks through message passing to perform hybrid inference.
We perform some of their experiments, likewise incorporating expert knowledge. However, contrary to their approach, we do not need supervision. Finally, concurrently to this work, Revach et al. (2021) develop KalmanNet, which proposes similar techniques but evaluates them in a supervised manner. The authors do, however, suggest that an unsupervised approach can also be feasible. Additionally, we more explicitly state what generative assumptions are required, then target the posterior distribution of interest, and develop the model and objective function from there. Moreover, the current paper includes linearized smoothing (section 6), parameterized smoothing (appendix A), and the recurrent model (appendix C). We also state theoretical guarantees under the noise2noise objective.

[Table 1 compares the following methods along six criteria (scalable, state est., uncertainty, noise, dir. opt., self-sup.): Ljung (1979); Hochreiter & Schmidhuber (1997); Cho et al. (2014); Wahlström et al. (2015); Watter et al. (2015); Archer et al. (2015); Krishnan et al. (2017); Fraccaro et al. (2017); Karl et al. (2017); Naesseth et al. (2018); Yingzhen & Mandt (2018); Rangapuram et al. (2018); Doerr et al. (2018); Satorras et al. (2019); Haarnoja et al. (2016); Becker et al. (2019).]
Table 1: We compare whether algorithms are scalable, state estimation can be performed, models provide uncertainty estimates, noisy or missing data can be handled, optimization is performed directly, and if supervision is required. "/" means that it depends on the parameterization.

[Figure 2: graphical model with nodes e_{k-1}, e_k, e_{k+1}; x_{k-1}, x_k, x_{k+1}; and y_{k-1}, y_k, y_{k+1}.]
Figure 2: State-space model with deeper latent structure.

3 GENERATIVE MODEL ASSUMPTIONS

In this section, we explicitly state the model's generative process assumptions. First, we assume that we can measure (at least) one run of (noise-afflicted) sequential data $y_{0:K} := (y_0, \ldots, y_K)$, where each $y_k \in \mathbb{R}^M$, $k = 0, \ldots, K$. We abbreviate $y_{l:k} := (y_l, \ldots, y_k)$ and y
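For concreteness, the sketch below shows one way a run of noise-afflicted sequential data $y_{0:K}$ of this form could be generated from a latent process. The linear-Gaussian transition and emission, the dimensions, and the noise covariances are arbitrary assumptions made only for this illustration and are not taken from the paper.

```python
# Minimal sketch (illustrative assumptions only, not the paper's setup): generating one
# run of noise-afflicted sequential data y_0, ..., y_K from a latent state-space process
# with a linear-Gaussian transition and emission.
import numpy as np

rng = np.random.default_rng(0)
K = 100                           # final time index, so the run has K + 1 measurements
D, M = 2, 2                       # latent dimension and measurement dimension (y_k in R^M)

A = np.array([[0.99, -0.10],      # latent transition matrix (illustrative)
              [0.10,  0.99]])
H = np.eye(M, D)                  # emission matrix (identity for simplicity)
Q = 0.01 * np.eye(D)              # process noise covariance
R = 0.25 * np.eye(M)              # measurement noise covariance

x = np.array([1.0, 0.0])          # initial latent state
ys = []
for k in range(K + 1):
    ys.append(H @ x + rng.multivariate_normal(np.zeros(M), R))  # noisy measurement y_k
    x = A @ x + rng.multivariate_normal(np.zeros(D), Q)         # latent transition
y_0K = np.stack(ys)               # y_{0:K}, one run of shape (K + 1, M)
```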