Published as a conference paper at ICLR 2022

SELF-SUPERVISED INFERENCE IN STATE-SPACE MODELS

David Ruhe
AI4Science, AMLab, Anton Pannekoek Institute
University of Amsterdam, The Netherlands
d.ruhe@uva.nl

Patrick Forré
AI4Science, AMLab
University of Amsterdam, The Netherlands
p.d.forre@uva.nl

ABSTRACT

We perform approximate inference in state-space models with nonlinear state transitions. Without parameterizing a generative model, we apply Bayesian update formulas using a local linearity approximation parameterized by neural networks. This is accompanied by a maximum likelihood objective that requires no supervision via uncorrupt observations or ground truth latent states. The optimization backpropagates through a recursion similar to the classical Kalman filter and smoother. Additionally, using an approximate conditional independence, we can perform smoothing without having to parameterize a separate model. In scientific applications, domain knowledge can give a linear approximation of the latent transition maps, which we can easily incorporate into our model. Usage of such domain knowledge is reflected in excellent results (despite our model's simplicity) on the chaotic Lorenz system compared to fully supervised and variational inference methods. Finally, we show competitive results on an audio denoising experiment.

1 INTRODUCTION

Many sequential processes in industry and research involve noisy measurements that describe latent dynamics. A state-space model is a type of graphical model that effectively represents such noise-afflicted data (Bishop, 2006). The joint distribution is assumed to factorize according to a directed graph that encodes the dependencies between variables using conditional probabilities. One is usually interested in performing inference, i.e., obtaining reasonable estimates of the posterior distribution of the latent states or uncorrupt measurements. Approaches involving sampling (Neal et al., 2011), variational inference (Kingma & Welling, 2013), or belief propagation (Koller & Friedman, 2009) have been proposed before. Assuming a hidden Markov process (Koller & Friedman, 2009), the celebrated Kalman filter and smoother (Kalman, 1960; Rauch et al., 1965) are classical approaches to solving the posterior inference problem. However, the Markov assumption, together with linear Gaussian transition and emission probabilities, limits their flexibility.

We present filtering and smoothing methods that are related to the classical Kalman filter updates but are augmented with flexible function estimators, without using a constrained graphical model. By noting that the filtering and smoothing recursions can be backpropagated through, these estimators can be trained with a principled maximum-likelihood objective reminiscent of the noise2noise objective (Lehtinen et al., 2018; Laine et al., 2019). By using a locally linear transition distribution, the posterior distribution remains tractable despite the use of non-linear function estimators. Further, we show how a linearized smoothing procedure can be applied directly to the filtering distributions, removing the need to train a separate model for smoothing.

To verify these claims, we perform three experiments: (1) a linear dynamics filtering experiment, in which we show that our models approximate the optimal solution given sufficient data and that including expert knowledge can yield better estimates of the latent states; (2) a more challenging chaotic Lorenz smoothing experiment, which shows that our models perform on par with recently proposed supervised models; and (3) an audio denoising experiment that uses real-world noise, showing the practical applicability of the methods.
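To make the above concrete, here is a minimal, self-contained sketch (not the paper's implementation) of the general recipe: a small neural network predicts a locally linear transition and process noise, a Kalman-style filtering recursion is unrolled over the noisy measurements, and the whole recursion is backpropagated through a self-supervised objective that scores each noisy observation under the one-step-ahead predictive distribution, in the spirit of the noise2noise objective. Everything here — the network architecture, state dimension, identity emission, assumed-known noise covariance, and the toy rotating-state data — is an illustrative assumption for this sketch only.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import MultivariateNormal

D = 2                          # latent state dimension (illustrative)
R = (0.3 ** 2) * torch.eye(D)  # measurement noise covariance, assumed known here


class LocalLinearTransition(nn.Module):
    """Predicts a state-dependent transition matrix A_k and diagonal process noise Q_k."""

    def __init__(self, d, hidden=32):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, d * d + d))

    def forward(self, x):
        out = self.net(x)
        A = out[: self.d * self.d].reshape(self.d, self.d)
        q = F.softplus(out[self.d * self.d:]) + 1e-4
        return A, torch.diag(q)


def filter_step(mu, P, y, transition, R):
    """One Kalman-style predict/update step with a learned, locally linear transition."""
    A, Q = transition(mu)                          # linearization around the filtered mean
    mu_pred = A @ mu                               # predictive mean
    P_pred = A @ P @ A.T + Q                       # predictive covariance
    K = P_pred @ torch.linalg.inv(P_pred + R)      # Kalman gain (emission = identity)
    mu_new = mu_pred + K @ (y - mu_pred)           # filtered mean
    P_new = (torch.eye(len(mu)) - K) @ P_pred      # filtered covariance
    P_new = 0.5 * (P_new + P_new.T)                # keep numerically symmetric
    return mu_pred, P_pred, mu_new, P_new


def self_supervised_loss(ys, transition, R):
    """Average negative log-likelihood of each *noisy* observation under the
    one-step-ahead predictive distribution; no clean targets are used anywhere."""
    mu, P = torch.zeros(D), torch.eye(D)
    nll = 0.0
    for y in ys:
        mu_pred, P_pred, mu, P = filter_step(mu, P, y, transition, R)
        nll = nll - MultivariateNormal(mu_pred, P_pred + R).log_prob(y)
    return nll / len(ys)


# Toy data: noisy observations of a slowly rotating 2-D latent state (illustrative only).
torch.manual_seed(0)
A_true = torch.tensor([[math.cos(0.1), -math.sin(0.1)],
                       [math.sin(0.1), math.cos(0.1)]])
x, ys = torch.tensor([1.0, 0.0]), []
for _ in range(100):
    x = A_true @ x
    ys.append(x + 0.3 * torch.randn(D))

# Train by backpropagating through the entire filtering recursion.
transition = LocalLinearTransition(D)
optimizer = torch.optim.Adam(transition.parameters(), lr=1e-2)
for step in range(200):
    optimizer.zero_grad()
    loss = self_supervised_loss(ys, transition, R)
    loss.backward()
    optimizer.step()
```

In this sketch, each observation is scored under the one-step-ahead predictive distribution, i.e., before the filter has seen it, which avoids the degenerate solution of simply copying the current noisy observation.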
[Figure 1, five panels from left to right: Noisy measurements; Extended Kalman Smoother; Recurrent Smoother (Ours); Recursive Smoother (Ours); Ground Truth.]
Figure 1: Best viewed on screen. Qualitative results of our work. To the noisy measurements (1st from left), we apply an extended Kalman smoother (2nd). From the noisy measurements, we learn a recurrent model that does slightly better (3rd). Our recursive model combines expert knowledge with inference (4th), yielding the best result. Ground truth is provided for comparison (5th).

Our contributions can be summarized as follows.

1. We show that the posterior inference distribution of a state-space model remains tractable while parameter estimation is performed by neural networks. This means that we can apply the classical recursive Bayesian updates, akin to the Kalman filter and smoother, with mild assumptions on the generative process.

2. Our proposed method is optimized using maximum likelihood in a self-supervised manner. That is, ground truth values of states and measurements are not assumed to be available for training. Still, despite our model's simplicity, our experiments show that it performs better than or on par with several baselines.

3. We show that the model can be combined with prior knowledge about the transition and emission probabilities, allowing for better applicability in low-data regimes and incentivizing the model to provide more interpretable estimates of the latent states.

4. A linearized smoothing approach is presented that does not require explicit additional parameterization and learning of the smoothing distribution.

2 RELATED WORK

Becker et al. (2019) provide a detailed discussion of recent related work, which we build on here and in table 1. An early method that extends the classical Kalman filter by allowing nonlinear transitions and emissions is the Extended Kalman filter (Ljung, 1979). It is limited by its naive approach of locally linearizing the transition and emission distributions. Furthermore, the transition and emission mechanisms are usually assumed to be known, or estimated with Expectation Maximization (Moon, 1996). More flexible methods that combine deep learning with variational inference include Black Box Variational Inference (Archer et al., 2015), Structured Inference Networks (Krishnan et al., 2017), the Kalman Variational Autoencoder (Fraccaro et al., 2017), Deep Variational Bayes Filters (Karl et al., 2017), Variational Sequential Monte Carlo (Naesseth et al., 2018), and the Disentangled Sequential Autoencoder (Yingzhen & Mandt, 2018). However, the lower-bound objective makes these approaches less scalable and accurate (see also Becker et al. (2019)). Furthermore, all of the above methods explicitly assume a graphical model, imposing a strong but potentially harmful inductive bias. The Backprop KF (Haarnoja et al., 2016) and the Recurrent Kalman Network (Becker et al., 2019) move away from variational inference and borrow Bayesian filtering techniques from the Kalman filter. We follow this direction but do not require supervision through ground truth latent states or uncorrupt emissions. Satorras et al. (2019) combine Kalman filters with graph neural networks through message passing to perform hybrid inference.
We perform some of their experiments, likewise incorporating expert knowledge. However, contrary to their approach, we do not need supervision. Finally, concurrently to this work, Revach et al. (2021) develop KalmanNet, which proposes similar techniques but evaluates them in a supervised manner. The authors do, however, suggest that an unsupervised approach can also be feasible. Additionally, we more explicitly state what generative assumptions are required, then target the posterior distribution of interest, and develop the model and objective function from there. Moreover, the current paper includes linearized smoothing (section 6), parameterized smoothing (appendix A), and the recurrent model (appendix C). We also state theoretical guarantees under the noise2noise objective.

[Table 1 compares the following methods along six criteria (scalable, state est., uncertainty, noise, dir. opt., self-sup.): Ljung (1979); Hochreiter & Schmidhuber (1997); Cho et al. (2014); Wahlström et al. (2015); Watter et al. (2015); Archer et al. (2015); Krishnan et al. (2017); Fraccaro et al. (2017); Karl et al. (2017); Naesseth et al. (2018); Yingzhen & Mandt (2018); Rangapuram et al. (2018); Doerr et al. (2018); Satorras et al. (2019); Haarnoja et al. (2016); Becker et al. (2019).]
Table 1: We compare whether algorithms are scalable, state estimation can be performed, models provide uncertainty estimates, noisy or missing data can be handled, optimization is performed directly, and if supervision is required. "/" means that it depends on the parameterization.

[Figure 2: graphical model with nodes e_{k-1}, e_k, e_{k+1}; x_{k-1}, x_k, x_{k+1}; and y_{k-1}, y_k, y_{k+1}.]
Figure 2: State-space model with deeper latent structure.

3 GENERATIVE MODEL ASSUMPTIONS

In this section, we explicitly state the model's generative process assumptions. First, we assume that we can measure (at least) one run of (noise-afflicted) sequential data $y_{0:K} := (y_0, \ldots, y_K)$, where each $y_k \in \mathbb{R}^M$, $k = 0, \ldots, K$. We abbreviate $y_{l:k} := (y_l, \ldots, y_k)$ and y
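For concreteness, the sketch below shows one way a run of noise-afflicted sequential data $y_{0:K}$ of this form could be generated from a latent process. The linear-Gaussian transition and emission, the dimensions, and the noise covariances are arbitrary assumptions made only for this illustration and are not taken from the paper.

```python
# Minimal sketch (illustrative assumptions only, not the paper's setup): generating one
# run of noise-afflicted sequential data y_0, ..., y_K from a latent state-space process
# with a linear-Gaussian transition and emission.
import numpy as np

rng = np.random.default_rng(0)
K = 100                           # final time index, so the run has K + 1 measurements
D, M = 2, 2                       # latent dimension and measurement dimension (y_k in R^M)

A = np.array([[0.99, -0.10],      # latent transition matrix (illustrative)
              [0.10,  0.99]])
H = np.eye(M, D)                  # emission matrix (identity for simplicity)
Q = 0.01 * np.eye(D)              # process noise covariance
R = 0.25 * np.eye(M)              # measurement noise covariance

x = np.array([1.0, 0.0])          # initial latent state
ys = []
for k in range(K + 1):
    ys.append(H @ x + rng.multivariate_normal(np.zeros(M), R))  # noisy measurement y_k
    x = A @ x + rng.multivariate_normal(np.zeros(D), Q)         # latent transition
y_0K = np.stack(ys)               # y_{0:K}, one run of shape (K + 1, M)
```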