# pseudolikelihood_inference__3b54956c.pdf Pseudo-Likelihood Inference Theo Gruner 1,2 Boris Belousov 3 Fabio Muratore 4 Daniel Palenicek 1,2 Jan Peters 1,2,3,5 1 Intelligent Autonomous Systems Group, Technical University of Darmstadt 2 hessian.AI 3 German Research Center for AI (DFKI) 4 Bosch Center for Artificial Intelligence 5 Centre for Cognitive Science theo_sunao.gruner@tu-darmstadt.de Simulation-Based Inference (SBI) is a common name for an emerging family of approaches that infer the model parameters when the likelihood is intractable. Existing SBI methods either approximate the likelihood, such as Approximate Bayesian Computation (ABC), or directly model the posterior, such as Sequential Neural Posterior Estimation (SNPE). While ABC is efficient on low-dimensional problems, on higher-dimensional tasks, it is generally outperformed by SNPE which leverages function approximation. In this paper, we propose Pseudo-Likelihood Inference (PLI), a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. By utilizing integral probability metrics, we introduce a smooth likelihood kernel with an adaptive bandwidth that is updated based on information-theoretic trust regions. Thanks to this formulation, our method (i) allows for optimizing neural posteriors via gradient descent, (ii) does not rely on summary statistics, and (iii) enables multiple observations as input. In comparison to SNPE, it leads to improved performance when more data is available. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system, showing particular advantages on stochastic simulations and multi-modal posterior landscapes. 1 Introduction Parametric stochastic simulators are a well-established tool for predicting the behavior of real-world phenomena. These statistical models find widespread use in various scientific fields such as physics, economics, biology, ecology, computer science, and robotics, where they help to gain knowledge about the underlying stochastic processes [27, 34] or generate additional data for subsequent downstream tasks [38]. In both cases, the practitioner seeks to explain the observations as accurately as possible while incorporating all available information. The output of such a simulator is largely determined by its parameters and their values. When estimating these parameters using Bayesian inference, given observations from a physical system, which are inevitably subject to measurement noise, we obtain a distribution over values instead of a point estimate. Additionally, there might be several parameter configurations yielding the same observation, hence rendering the resulting distribution to be multimodal. Moreover, the likelihood function might be unknown or too expensive to evaluate for many practical use cases. The combination of these difficulties makes obtaining a posterior distribution over simulator parameters challenging for state-of-the-art inference methods, both regarding effectiveness and efficiency. SBI approaches address the issue of intractable likelihoods by using (stochastic) simulators as forward models to generate observations from proposal distributions over parameters. The approaches are also often called likelihood-free, which can be easily misunderstood since some of them directly approximate the likelihood [7]. ABC is a family of SBI methods that approximate the posterior with 37th Conference on Neural Information Processing Systems (Neur IPS 2023). a set of weighted particles which are obtained from Monte Carlo simulations and updated based on an empirical estimation of the intractable likelihood [48]. For an ABC approach to work well, three criteria have to be fulfilled: (i) the likelihood kernel is capable of measuring the similarity of observations meaningfully, (ii) the proposal distribution samples close to the posterior, (iii) the decision-making rule balances between accepting a sufficient amount of samples from the proposal and steering inference towards the posterior distribution. Constructing a suitable likelihood kernel often means tailoring summary statistics to the problem at hand. However, recent advances promise to replace this heuristic-based process by employing Integral Probability Metrics (IPMs) to measure statistical distances in observation space [3, 16, 17]. While these methods significantly increase the required number of simulations, approximations of the statistical distances [25, 8] can be computed in parallel, hence facilitating the parallelization of the whole inference pipeline. Following up on the shortcomings of ABC, the family of SNPE approaches provide Bayesian inference methods that leverage conditional neural density estimators to approximate the posterior [40, 24, 41, 19]. The benchmarking study of Lueckmann et al. [32] concludes that, generally, SNPE approaches are to be preferred over ABC as they are superior in terms of expressibility and accuracy across a wide range of benchmarking tasks. However, it is important to point out that the analysis of the posterior inference has solely been reported for single observations. These single-sample scenarios favor SNPE in high dimensions since ABC relies on summary statistics to evaluate the likelihood. Therefore, it remains an open question whether SNPE methods can transfer their benefits to settings where the (approximated) posterior is conditioned on multiple observations at once. Contributions. By deriving the ABC posterior from a constrained variational inference objective, we introduce a novel SBI method called Pseudo-Likelihood Inference (PLI). Our formulation allows for approximating the posterior with neural density estimators. PLI updates this posterior from pseudo-likelihoods which are exponentially transformed statistical distances computed using IPMs. To further remove heuristics from the inference process, we derive an adaptive bandwidth update of PLI s likelihood kernel that bounds the loss of information based on information-geometric trust-region principles. This way, PLI can update its neural posterior solely given observations from a (stochastic) black-box simulator. Moreover, the usage of IPMs enables PLI to simultaneously condition on a variable number of observations, while SNPE methods need to concatenate them and, therefore, degrade when the number of observations increases. We compare PLI against ABC and one SNPE method on two SBI benchmarking tasks as well as a highly dynamical double pendulum task. For both, ABC and PLI, we investigate two IPMs: the Maximum Mean Discrepancy (MMD) and the Wasserstein distance. Motivated by recent benchmarking results, we chose Automatic Posterior Transformation (APT) [24] to represent SNPE approaches. Our experiments investigate the dependency of the trained density estimator s performance on the number of observations, where performance is measured in observation as well as in parameter space. We show the merits and disadvantages of all methods and conclude with concrete recommendations. 2 Bayesian inference with intractable likelihoods The objective of Bayesian inference is to find the posterior parameter distribution p(ξ|x 1:N) given a set of reference data points x 1:N which are assumed to be drawn from the likelihood model p(x|ξ). Given a prior belief over the parameters p(ξ), the posterior is expressed via Bayes rule p(ξ|x 1:N) p(x 1:N|ξ) p(ξ). (1) In the following, we describe SBI methodologies that aim to approximate the posterior (1) when the likelihood is given by a simulator model, from which only sampling is possible, x1:M p(x|ξ) but evaluating the likelihood is infeasible. 2.1 Approximate Bayesian computation ABC methods perform Bayesian inference without explicitly computing the likelihood function p(x 1:N|ξ) = R p(x 1:N|x1:M, ξ)p(x1:M|ξ) dx1:M [48]. Instead, they approximate it pβ(x 1:N|ξ) Z Kβ(D(x 1:N, x1:M)) p(x1:M|ξ)dx1:M (2) using Monte Carlo samples x1:M from the simulator as the reference points and smoothening them with a kernel Kβ(D(x 1:N, x1:M)) [28]. The kernel assesses the similarity of the reference data x 1:N and the simulated data x1:M based on a distance measure D, the kernel type K, and the kernel bandwidth β. The uniform kernel 1{D(x 1:N,x1:M) β} (see Table A.1) has emerged as the default kernel choice of many ABC methods [49, 15, 30, 31]. In this case, the bandwidth represents a rejection threshold that assigns zero probability to all parameters whose simulations lie outside of the β-ball in terms of the distance D. The uniform kernel exhibits the favorable characteristic of converging to the likelihood in the limit [28] limβ 0 pβ(x 1:N|ξ) = R 1{x 1:N}(x1:M)p(x1:M|ξ)dx1:M = p(x 1:N|ξ). (3) Once the approximate likelihood (2) is obtained, ABC draws samples from the approximate posterior pβ(ξ|x 1:N) pβ(x 1:N|ξ) p(ξ). (4) There exist multiple ways of implementing this sampling procedure. In rejection ABC [49], the simplest form, proposal parameters are drawn from the prior distribution p(ξ) and are accepted if the simulated data falls close to the true data, as measured by the kernel function. While rejection ABC yields a simple algorithm with desirable convergence properties, finding posterior samples for small bandwidths β in high dimensions often becomes computationally infeasible [33]. Therefore, the research on ABC focuses on three directions of improvement: (i) replacing the prior with a sequentially updated proposal distribution πt(ξ) to reduce the search space during sampling, (ii) adapting the bandwidth β to draw samples with an appropriate acceptance rate, and (iii) finding sufficient statistics to represent the simulated output in low dimensions [48]. MCMC-ABC [33] and SMC-ABC [47, 51, 15, 31] build upon sampling strategies based on Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) to sequentially update the proposal distribution. MCMC-ABC does not allow for an adaptive bandwidth, and thus, SMC sampling strategies have evolved as the leading ABC methods for these cases [15]. 2.2 Sequential Monte Carlo ABC SMC ABC builds on SMC samplers introduced by Del Moral et al. [14]. Fundamentally, SMCABC approximates the posterior distribution through a sequence of intermediate target posterior distributions pβt(ξ|x 1:N) (4) that are characterized by an adaptable bandwidth parameter βt, where t denotes the inference time. Furthermore, SMC-ABC uses importance sampling from a sequentially updated proposal distribution πt(ξ) to improve the sample efficiency. The proposal distribution is represented by an empircal distribution πt(ξ) = 1/M PM i=0 δξ(i) t (ξ) that is defined through a set of particles {ξ(i) t }. Importance sampling then enables the approximation of the target posterior pβt(ξ|x 1:N) from the proposal distribution pβt(ξ|x 1:N) qt(ξ) = i=1 W (i) t δξ(i) t (ξ); W (i) t = pβt(ξ(i) t |x 1:N) πt(ξ(i)) . (5) Here, Wt are the weights between the target posterior and the proposal distribution. SMC-ABC methods follow three steps to carry out inference for the next target posterior pβt+1(ξ|x 1:N): (i) A new bandwidth βt+1 of the target posterior pβt+1(x 1:N|ξ) is estimated. Typically, the update is based on heuristics, such as the Effective Sample Size (ESS) [15] to ensure that the particle variance does not degrade. (ii) New proposal particles ξ(i) t are sampled from a forward Markov kernel ξ(i) t+1 Kt(ξ(i) t 1, ξ(i) t ) to stay close to the target posterior of the next iteration pβt+1(ξ|x 1:N). (iii) The weights of the particles are adjusted based on approximations of (5). As the weight update is typically numerically intractable [14], different SMC-ABC methods [15, 30, 31] have been introduced which propose approximations to the optimal weight update. We refer to Appendix B for a more detailed explanation of SMC-ABC and its different approaches. 3 Pseudo-likelihood inference The proposed PLI methodology, summarized in Figure 1, generalizes the ABC approaches by introducing exponential likelihood kernels with adaptive bandwidth updates, which are motivated from a Variational Inference (VI) perspective. proposal πt 1 simulate p(x|ξ) pseudolikelihood posterior πt πt(ξ|x 1:N) p(ξ) p(ξ|x 1:N) πt 1(ξ|x 1:N) KL (πt||πt 1) ε Prior Posterior β D(p (x), p(x|ξ)) 0.0 0.5 1.0 p(x 1:N|ξ(i)) Figure 1: Schematic overview of the introduced iterative Pseudo-Likelihood Inference (PLI) approach. (top) Based on samples drawn from the simulator, the pseudo-likelihood (8) is evaluated based on the discrepancy between the empirical data-generating and likelihood distributions. The bar chart shows how the pseudo-likelihood evaluation changes for different bandwidths. (bottom) The evaluation of the pseudo-likelihood is used to estimate a target posterior πt under trust-region constraints that moves from the prior distribution to the final posterior. 3.1 Exponential likelihood kernels PLI adopts the view of SMC-ABC on approximating a smoothed target posterior pt(ξ|x 1:N), in the following denoted by πt(ξ), by formulating the following constrained VI problem for each inference step t, πt(ξ) = argmin π(ξ) P(ξ) KL (π(ξ) || p(ξ|x 1:N)), s.t. KL (π(ξ) || πt 1(ξ)) ε. (6) The optimization is balanced between fitting the posterior distribution p(ξ|x 1:N) and constraining the information loss between two inference steps πt 1(ξ) and πt(ξ). The loss of information is incorporated as a trust-region constraint with the bound ε > 0 in the space of probability distributions P(ξ) through the Kullback-Leibler (KL) divergence KL (π(ξ) || πt 1(ξ)). Theorem 1. The optimal target distribution πt(ξ) in the optimization problem (6) is given by πt(ξ) p(ξ) πt 1(ξ) 1 1+ηt p(x 1:N|ξ) 1 1+ηt πt 1(ξ) (7) where ηt > 0 is a dual Lagrangian variable corresponding to the trust-region constraint. Proof. See Appendix A.1. The temperature parameter ηt plays the role of an adaptive step size that controls the update step from πt 1(ξ) to πt(ξ). In the limit of small step sizes ηt 0 at convergence, the target posterior (7) turns into the true posterior πt(ξ) p(ξ)p(x 1:N|ξ). In the spirit of ABC-based methods, we approximate the intractable likelihood p(x 1:N|ξ) with a Gibbs distribution p(x 1:N|ξ), which we call pseudo-likelihood, and which is based on a discrepancy measure between the empirical data distribution p (x) = 1/N P i δx i (x) and the simulator likelihood p(x|ξ) p(x 1:N|ξ) := 1 Z(ξ) exp D(p (x), p(x|ξ)) Here, Z(ξ) is the normalization constant, and β > 0 is a bandwidth parameter that controls the sharpness of the approximation. When the KL divergence is used as the discrepancy measure D, we recover the true likelihood, as the following lemma states. Lemma 1. When D = KL and 2β = 1/N in (8), the pseudo-likelihood p(x 1:N|ξ) equals the true likelihood p(x 1:N|ξ). Proof. Since KL(p (x) p(x|ξ)) = H[p (x)] N 1 log p(x 1:N|ξ), then provided 2β = 1/N, the exponential in (8) is given by exp( NKL(p (x) p(x|ξ))) p(x 1:N|ξ) and Z(ξ) is a constant. However, the KL divergence is intractable in SBI because we cannot evaluate the likelihood. Therefore, we propose replacing the KL divergence with Integral Probability Metrics (IPMs), such as the MMD and the Wasserstein distance, which can be evaluated on distribution samples. Although theoretical analysis is less straightforward in these cases, some results have been obtained in prior works. Consistency and robustness of an MMD-based posterior estimator were shown by Chérief Abdellatif and Alquier [5] and Wasserstein-based exponential kernels were studied by De Plaen et al. [12]. In this paper, we focus on a practical instantiation of pseudo-likelihood inference, which can accommodate a variety of divergence functions and obtain superior empirical results by leveraging neural posterior approximators and adaptive step-size updates. The following subsections introduce the key components that constitute our method. 3.2 Bandwidth adaptation from trust-region principles The Lagrangian parameter ηt has a particularly interesting property. From (7), we see that pulling ηt into the pseudo-likelihood (2), yields a time-dependent tempered pseudo-likelihood pt(x 1:N|ξ) := Z(ξ)1+ηt exp( (2βt) 1D(p (x), p(x|ξ)) = p(x 1:N|ξ) 1 1+ηt . (9) Here, we introduce the adaptive bandwidth βt = (1 + ηt)β that approaches β in the limit ηt 0. The dual formulation of the stochastic search problem (6) leads to a tractable solution for the optimal Lagrangian parameter ηt, and hence an optimal bandwidth βt (see Appendix A.1 for more details). g(ηt) = ηtϵ (1 + ηt) log Eπt 1(ξ) p(ξ) πt 1(ξ) 1 1+ηt pt(x 1:N|ξ) . (10) 5 10 15 20 step ϵ = 0.3 ϵ = 0.7 ϵ = 1.0 Figure 2: Bandwidth ηt recorded for different ε on the Gaussian location task. The bandwidth is monotonically decreasing over iterations. While we obtain the primal optimal point in closed form (7) to obtain the optimal dual variable ηt, we need to resort to numerical optimization of the Lagrangian dual objective. The optimal bandwidth parameter βt, that is obtained by maximizing (10), can be seen as an information-bounded trust region update to move the pseudo-likelihood towards the likelihood. In the early inference stages, the proposal prior pt 1(ξ) is typically uninformative, and thus the information loss is moderate even if pt(ξ) moves far away from pt 1(ξ). In the later inference steps, the proposal distribution is typically pronounced, and small deviations may lead to significant information loss. This intuition suggests that the bandwidth βt should decay over iterations, and indeed Figure 2 shows that βt quickly decays towards zero over a range of values of ε on the Gaussian location task (Sec. C.2). The exact decay schedule of βt is problem-dependent. Therefore, it is convenient to set an information-loss bound ε and obtain an adaptive bandwidth schedule by optimizing the dual (10) rather than pre-specifying a decay schedule by hand for each problem. 3.3 Approximate Bayesian inference with pseudo-likelihoods Pseudo-Likelihood Inference (PLI) is a sequential SBI methodology based on approximating the target posterior πt(ξ) (7). It is closely tied to SMC-ABC by (i) sequentially approximating the target posteriors, (ii) sequentially adapting the bandwidth parameter, and (iii) sequentially updating the proposal distribution for higher sample efficiency. Instead of representing the posterior through a set of weighted particles, the PLI formulation allows for various powerful neural density estimators. A parameterized density model qϕ(ξ) is trained to approximate the PLI posterior (7) using the mprojection minϕ Φ KL(πt(ξ) qϕ(ξ)), which results in the Weighted Maximum Likelihood (WML) objective with parameter samples drawn from the proposal prior ξ(k) πt 1(ξ(k)) maxϕ Φ PK k=1 w(k) log qϕ(ξ(k)); w(k) = p(ξ(k)) πt 1(ξ(k)) 1 1+ηt pt(x 1:N|ξ(k)). (11) Algorithm 1 Pseudo-Likelihood Inference (PLI) 1: input: reference data x 1:N, prior p(ξ), stochastic simulator p(x|ξ), IPM D( , ), posterior approximator qϕ(ξ), max. iter. T 2: initialize proposal prior π0(ξ) = p(ξ) 3: for t in 1:T do 4: sample parameters ξ(1:K) πt 1(ξ) 5: for each ξ(k) do 6: simulate data x(k) 1:M p(x|ξ(k)) 7: compute IPM s(k) = D(x(k) 1:M, x 1:N) 8: end for 9: update ηt by maximizing the dual (10) 10: evaluate pt(x 1:N|ξ(k)) (9) 11: fit qϕt(ξ) by WML (11) 12: set new proposal prior πt(ξ) = qϕt(ξ) 13: end for 14: output: approximate posterior qϕT (ξ) Thus, we derive a practical PLI algorithm by leveraging this empirical WML objective. Further details on the objective derivation are described in Appendix A.2. Our proposed PLI Algorithm 1 consists of four main steps. First, in lines 4 8, training pairs from the proposal and simulator are drawn, and the discrepancy measure D(x 1:N, x1:M) between the observations and the simulations is evaluated for each ξ(k). We follow Gretton et al. [25] to approximate the MMD between two discrete probability measures, whereas we make use of the entropy regularized optimal transport formulation to approximate the Wasserstein distance [8] (see Appendix C.1). Both versions facilitate parallelization on the GPU. Second, in line 9, the optimal bandwidth under trust region constraint is estimated, and the tempered pseudo-likelihood is evaluated. by maximizing the dual (10). Third, in line 11, the parameterized density estimator qϕ(ξ) is trained to approximate the target posterior πt(ξ) via the m-projection (11). Note that the expectation w.r.t. the proposal distribution πt 1(ξ) enables gradient descent on the qϕt estimator without requiring a differentiable simulator. Fourth, in line 12, we set the current posterior approximation qϕt(ξ) as the proposal πt(ξ) for the next inference step, thus leveraging bootstrapping of the density estimator. While we restrict the analysis in this paper to the m-projection, we note that the i-projection can also be employed, as shown in Appendix A.3. Normalization. The normalization term Z(ξ) in the definition of the pseudo-likelihood (8) requires taking an integral over the reference data x 1:N, which is infeasible in practice. When the KL divergence is used in the kernel, Z(ξ) does not depend on ξ, as shown in Lemma 1. While in general, the dependence on ξ cannot be neglected, its influence on the weights in (10) and (11) may be negligible, provided the relative ranking of the samples is not affected significantly. In Appendix A.4, we provide an ablation study on low-dimensional problems where the integral over x 1:N can be approximated by sampling. We observed that, even though the ranking correlation of the weights w(k) in (11) is different with and without estimating Z(ξ), the final posterior is not affected. Therefore, in the subsequent experiments, we treat it as a constant, as in the ideal case of the KL divergence. Nevertheless, this is a point where our practical implementation does not follow the theoretical derivation strictly, and this issue should be addressed in future work. 4 Experiments We compare the PLI framework against SMC-ABC [15] and APT [24] on five diverse tasks. Our implementation is based on Wasserstein-ABC [3], but instead of the employed r-hit kernel [30], our implementation is based on population Monte Carlo (Alogrithm 3 [31]) because we observed improved performance in preliminary studies. A summary of the different ABC methods is given in Appendix B and Table A.1. APT was chosen as the representative for the class of SNPE algorithms. We leverage Neural Spline Flows (NSFs) [19] as density estimators for both PLI and APT. Both neural flow configurations share the same base network architecture, but for APT, the conditional flow is augmented with an embedding network (Appendix C). All experiments are implemented in JAX [4], and each ran on a single Nvidia RTX 3090.1 To make the experiments comparable, the simulation budgets of PLI and APT were fixed to 5000 samples per inference step over 20 episodes, while ABC ran for 200 episodes on 1000 particles. 4.1 Evaluation metrics The model is compared against the reference posterior samples ξ when available. We also quantify the methods performances based on their realizations by computing the Wasserstein distance and 1https://github.com/theogruner/pseudo_likelihood_inference 10 1 10 3 MMD2(ξ , ξ) Gaussian Location 10 1 10 3 10 5 W1(ξ , ξ) MMD-PLI W-PLI MMD-ABC W-ABC APT Gaussian Mixture 0.1 0.2 0.3 0.4 MMD2(x 1:N, x1:N) 10 2 10 1 100 W2 2(x 1:N, x1:N) Figure 3: Evaluation of the posterior performance on five different tasks. We report the mean and 95% ci over 10 random seeds, each carried out using N data points for conditioning. We compare samples from the approximate posterior ξ q(ξ|x 1:N) against reference posterior samples using MMD and the Wasserstein distance. No posterior samples were available for the Furuta pendulum. Therefore, the performance is evaluated in the observation space. Lower values are better for all metrics. PLI is the preferred method for conditioning on multiple observations due to its steady improvement with increasing N. ABC performs better than PLI on Gaussian Location and Gaussian Mixture tasks but lags in more complex tasks. APT excels with few observations but degrades as N increases. the MMD. The comparison is carried out on 10 000 samples each. Furthermore, we use Posterior Predictive Checks (PPCs) to evaluate the predictive capabilities of the posterior models in the observation space Eq(ξ|x 1:N) [D(x 1:N, x1:M)]. Due to the computational limits, the PPCs are carried out on 1000 simulations against the reference data. Lueckmann et al. [32] also report results with classifier-based tests and the kernelized Stein-discrepancy. However, since benchmarking is not our focus, we restrict the analysis to comparing with the Wasserstein and MMD. We evaluate PLI on four common benchmarking tasks within the SBI community [32]: Gaussian Location, a Gaussian Mixture Model, Simple-Likelihood Complex-Posterior, and SIR. Further, we add a system identification task on a Furuta pendulum representing a highly dynamic continuous control system. The tasks specifications are listed in Appendix C. For each task, we conduct experiments for different numbers of available reference observations N = {1, 2, 5, 10, 20, 50, 100, 1000}. The reference observations are simulated based on a pre-defined ground-truth parameter ξgt. Although PLI and ABC can cope with varying numbers of observations N and numbers of simulations per parameter M, we choose N = M for all experiments since it is required by APT. In the following paragraphs, we first discuss the results of the benchmarking tasks and present a separate discussion for the Furuta pendulum. Figure 3 gives a quantitative overview of the benchmarking tasks compared to the reference posteriors, while Figure C.1 in the Appendix complements the study by showing the posterior predictive performances. Benchmarking tasks. Each benchmarking task presents different challenges that must be addressed by the SBI methods. In Gaussian Location, the task is to infer a uni-modal 10-dimensional Gaussian distribution. Gaussian Mixture Model and SLCP feature multi-modal posteriors that require flexible density estimators. SIR is a well-known epidemiological model that features 10-dimensional data of the dynamical system. All methods generally depict a reoccurring behavior on the different benchmarking tasks, as shown in Figure 3. For fewer observations (N 20), APT matches the reference posterior better than the other approaches, whereas ABC and PLI match the posterior data better with increasing N. In particular, PLI consistently improves with an increasing number of reference samples. The influence of N on the shape of the posterior is further visualized in Figure C.2, which compares the posterior approximations of all methods for N = 2 and N = 100 reference observations. For the SLCP task, ABC struggles to capture the multi-modality of the SLCP task. This effect is further illustrated when comparing Figures C.3 and C.4, which allow for a qualitative 9 10 11 0.12 0.0 0.5 1.0 0.0 0.5 1.0 t 0.0 0.5 1.0 MMD-PLI W-PLI MMD-ABC W-ABC APT Figure 4: Empirical and quantitative evaluation on the Furuta pendulum. (top) Snapshots of the posterior evolution on the g mp plane on the Furuta pendulum for N = 1000. (bottom) Predictive performance of the learned MMD-PLI posterior for the angular rotation sin θr. The stochasticity of the simulator is removed by synchronizing the initial state between the reference and predicted simulations. Thus, the only discrepancies between trajectories are due to the model not capturing the dynamics parameters of the system. After the inference has been completed (step 20), the predictive simulator ( MMD-PLI) can completely recover the ground truth dynamics ( Reference). (right) Evaluation of the mean accumulated error over 1000 trajectories with synchronized initial states between the simulation and the reference trajectory. All approaches improve with rising N while PLI with MMD matches the reference data best. assessment of the posterior approximations. On SIR, PLI variants show significantly improved performance compared to the other baselines. Generally, we find that ABC and PLI perform better with MMD than with Wasserstein distance. This observation can be attributed to the Wasserstein distance not scaling well to high dimensions, which has been reported by recent studies [17, 16]. Furuta pendulum. The Furuta pendulum is an inverted double pendulum setup [21]. While the system s dynamics are inherently deterministic, small perturbations of the initial state around its unstable equilibrium point lead to highly diverse trajectories. The observation space is T 6 dimensional, where T is the number of time steps per trajectory. We set the sampling frequency of the simulation to 100 Hz and the duration to 1 sec, resulting in 600-dimensional observations. No reference posterior is available for this task; thus, the analysis is restricted to quantifying the observed data. Given the similarity of the Wasserstein distance and the MMD in parameter and observation space on the previous tasks, we argue that a comparison based on PPCs, i.e., W2 2(x 1:N, x1:M) and MMD2(x 1:N, x1:M), is sufficient. In Figure 3, the Wasserstein PPC and MMD PPC exhibit divergent behaviors, with the former indicating enhanced performance as reference observations increase, in contrast to the moderate improvement suggested by MMD. Therefore, we evaluate the models predictive performances on the deterministic system by synchronizing the initial states of the reference data and the simulations in Figure 4. This modification ensures that the similarity between two rollouts can be evaluated by the accumulated error err = P i |x i xi|. While all approaches perform better with more reference observations, MMD-PLI matches the reference dynamics best. Additionally, MMD is favored here, as MMD-based PLI and ABC outperform their Wasserstein counterparts, with both the MMD PPC plot and error plot showcasing congruent trends for N 20. The appended posterior plots in Figures C.5 and C.6 reveal that for N = 2, all methods are widely spread over the prior region, yet converge to the ground truth. However, APT cannot recover the ground truth for N 100, whereas PLI and ABC center around the ground truth. 5 Related work In the previous sections, we have seen that PLI is algorithmically similar to ABC methods with SMC samplers. Therefore, approximating the likelihood by the empirical pseudo-likelihood (2) enables drawing from the rich toolbox of existing approximate inference algorithms. This section introduces related research fields and shows how PLI fits among them. Sequential neural density estimation. With the enriched class of neural density estimators, amortized SBI methods have received increasing interest in recent years. Similar to ABC, synthetic samples from the simulator are used to approximate the posterior. Sequential neural density estimation methods can be further classified into methods that directly train a posterior estimator [40, 24], a neural likelihood [41, 22], or a neural ratio estimator [19, 36]. All methods have in common that they do not rely on an approximation of the posterior model but are optimized solely on pairs of parameter samples from a proposal distribution ξ(k) pt(ξ) and its corresponding simulation x(k) p(x|ξ(k)). We note that the original papers have only reported posteriors conditioned on a single observation x. While technically, these methods can incorporate multiple data points, this requires either stacking multiple observations or falling back to summary statistics. As noted in [50], neural likelihood estimators can sidestep these requirements by evaluating the log-likelihood of single observations and carrying out MCMC sampling on the joint log-likelihood. Yet, leveraging the neural likelihood restricts the evaluation of the posterior. Summary statistics. ABC has commonly relied on reducing the dimensionality of the raw observations with summary statistics [48]. These summaries must be carefully chosen and are often task-specific, restricting the general applicability of ABC. Recent additions to ABC methods report on replacing summary statistics with statistical distances [17]. While direct comparison of the raw data suffers from the curse of dimensionality, comparing the observations through empirical measures sidesteps this issue [17]. Bernton et al. [3] report on augmenting the likelihood kernel with the Wasserstein distance, while Park et al. [42] leverage the kernelized approximation of the MMD [25]. Other contributions include the Cramér-von-Mises distance [20] and the energy distance [39]. While statistical distances are appealing due to their general applicability, Drovandi and Frazier [17] conclude that they are limited by their high computational requirements. Approaches proposed for automated summary design include ABC with indirect inference, which utilizes an auxiliary model to evaluate data summaries [23, 18]. Particle mirror descent. A posterior updating similar to ours (7) has been derived in Particle Mirror Descent (PMD) [10]. PMD tackles particle depletion by incorporating the proposal distribution of the previous round into the optimization process. Furthermore, the authors show that the proposed method converges to the posterior given m posterior samples by O(1/ m). Our version can be seen as extending their approach to the case of intractable likelihoods. We extend PMD to neural density estimators using samples from the proposal posterior (7) as a training set. Geometric path and likelihood tempering. Rewriting the optimal posterior (7) reveals a close relation of the optimal PLI posterior (4) and the geometric path formulation [6, p. 335], πt(ξ) π1 λ t 1 (ξ) p(x 1:N, ξ)λ. The optimal posterior moves from the proposal distribution πt(ξ) at inference time t to the target posterior πt(ξ|x 1:N) p(x 1:N, ξ) along the geometric path that is parameterized by λ. The formulation differentiates from likelihood tempering in SMC samplers [6] by leveraging the proposal instead of the prior distribution. Note, however, that for t = 0, the proposal mimics the prior, and thus the PLI geometric path has the same boundary values as in classical likelihood tempering. While the tempered posterior cannot be applied to SMC samplers due to its dependence on the proposal distribution, the geometric path formulation based on the prior distribution gives rise to sequential annealing ABC [1]. Generalized variational inference. Introduced by Knoblauch et al. [29], Generalized Variational Inference (GVI) is an extension of the standard variational inference framework that starts from the optimization view of Bayes rule and generalizes it by considering different losses, divergences, and variational families. Therefore, various Bayesian inference methods can be seen as instantiations of GVI with different choices of these three parameters. In particular, ABC and PLI can be seen as GVI with the choices P(Kβ(x 1:N, x1:M), KL, P(Θ)) since they employ a likelihood kernel as a loss. Crucially, specifying a different loss function instead of the typical log-likelihood can be shown to add robustness against model misspecification [29]. PAC-Bayes [26] can be seen as a generalization of ABC as it covers the whole space of possible loss functions l(x 1:N, ξ). The PAC-Bayesian theory provides a broad array of risk bounds for generalized Bayesian learning methods. 6 Conclusion We propose Pseudo-Likelihood Inference (PLI), a new addition to the toolbox of SBI methods. PLI is targeted for Bayesian inference tasks in which the posterior is conditioned on multiple observations simultaneously. For that, we derive a softened ABC posterior from a constrained variational inference problem and leverage IPMs between the empirical observations to assess the intractable likelihood. The derived posterior formulation enables the learning of flexible neural density estimators from black-box simulators, extending the range of applicability for ABC methods. Our experiments assess how well PLI, ABC, and SNPE perform based on their generative power when given varying amounts of reference observations as a condition. Given few observations, SNPE-based methods perform better than ABC and PLI, which rely on statistical distances. However, when more data is available, ABC and PLI methods perform better. When the posterior distribution is simple, ABC is efficient at reproducing it using fast particle updates. However, PLI is a better option for complex posterior distributions because of its more adaptable neural density estimator. Additionally, PLI evaluates the posterior probability, which is useful in downstream tasks that require uncertainty quantification. Limitations. In PLI, the computational cost is distributed among three main computations: the simulation, the summary statistics estimation, and the normalizing flow training. While the computational effort is large for every computation, PLI leverages parallelization on GPUs for all three computations. In the provided experiments, the training process of the neural model takes the main computational budget, while simulation and summary statistics are negligible. On a Nvidia RTX 3090, ABC typically runs 2-10 min, while PLI and SNPE take 60-90 min, depending on the task. Simpler models however, such as multivariate Gaussian or GMM, reduce the computation time to the simulation. Furthermore, we would like to express that high-fidelity simulators might increase the simulation time significantly, making the simulation the most costly operation within the inference pipeline. Instead of utilizing IPMs, one could exploit adversarial strategies to approximate the KL divergence, as explored by Mescheder et al. [35] and Santana and Hernández-Lobato [45]. Acknowledgements This work was funded by the Hessian Ministry of Science and the Arts (HMWK) through the projects The Third Wave of Artificial Intelligence - 3AI , hessian.AI, and the grant Einrichtung eines Labors des Deutschen Forschungszentrum für Künstliche Intelligenz (DFKI) an der Technischen Universität Darmstadt . Fabio Muratore was employed by the Robert Bosch Gmb H during the time of this collaboration. [1] Carlo Albert, Hans R Künsch, and Andreas Scheidegger. A simulated annealing approach to approximate bayes computations. Statistics and Computing, 25(6):1217 1232, 2015. [2] Mark A. Beaumont. Approximate bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379 406, 2010. [3] Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert. Approximate bayesian computation with the wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(2):235 269, 2019. [4] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. [5] Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Mmd-bayes: Robust bayesian estimation via maximum mean discrepancy. In Symposium on Advances in Approximate Bayesian Inference, 2019. [6] Nicolas Chopin, Omiros Papaspiliopoulos, et al. An introduction to sequential Monte Carlo, volume 4. Springer, 2020. [7] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055 30062, 2020. [8] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013. [9] Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. Optimal transport tools (ott): A jax toolbox for all things wasserstein. ar Xiv preprint ar Xiv:2201.12324, 2022. [10] Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, 2016. [11] Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17:93:1 93:50, 2016. [12] Henri De Plaen, Michaël Fanuel, and Johan AK Suykens. Wasserstein exponential kernels. In 2020 International Joint Conference on Neural Networks. IEEE, 2020. [13] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1 2):1 142, 2013. [14] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 68(3):411 436, 2006. [15] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential monte carlo method for approximate bayesian computation. Statistics and Computing, 22(5):1009 1020, 2012. [16] Charita Dellaporta, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. Robust bayesian inference for simulator-based models via the MMD posterior bootstrap. In Artificial Intelligence and Statistics, 2022. [17] Christopher C. Drovandi and David T. Frazier. A comparison of likelihood-free methods with and without summary statistics. Statistics and Computing, 32(3):42, 2022. [18] Christopher C Drovandi, Anthony N Pettitt, and Anthony Lee. Bayesian indirect inference using a parametric auxiliary model. Statistical Science, 30(1):72 95, 2015. [19] Conor Durkan, Iain Murray, and George Papamakarios. On contrastive learning for likelihoodfree inference. In International Conference on Machine Learning, 2020. [20] David T Frazier. Robust and efficient approximate bayesian computation: A minimum distance approach. ar Xiv preprint ar Xiv:2006.14126, 2020. [21] Katsuhisa Furuta, M Yamakita, and S Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 206(4):263 269, 1992. [22] Pierre Glaser, Michael Arbel, Arnaud Doucet, and Arthur Gretton. Maximum likelihood learning of energy-based models for simulation-based inference. ar Xiv preprint ar Xiv:2210.14756, 2022. [23] Alexander Gleim and Christian Pigorsch. Approximate bayesian computation with indirect summary statistics. Draft paper: http://ect-pigorsch. mee. uni-bonn. de/data/research/papers, 2013. [24] David S. Greenberg, Marcel Nonnenmacher, and Jakob H. Macke. Automatic posterior transformation for likelihood-free inference. In International Conference on Machine Learning, 2019. [25] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012. [26] Benjamin Guedj. A primer on pac-bayesian learning. ar Xiv preprint ar Xiv:1901.05353, 2019. [27] Florian Hartig, Justin M Calabrese, Björn Reineking, Thorsten Wiegand, and Andreas Huth. Statistical inference for stochastic simulation models theory and application. Ecology letters, 14(8):816 827, 2011. [28] George Karabatsos and Fabrizio Leisen. An approximate likelihood perspective on ABC methods. Statistics Surveys, 12:66 104, 2018. [29] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on bayes rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1 109, 2022. [30] Anthony Lee. On the choice of MCMC kernels for approximate bayesian computation with SMC samplers. In Winter Simulation Conference, 2012. [31] Maxime Lenormand, Franck Jabot, and Guillaume Deffuant. Adaptive approximate bayesian computation for complex models. Computational Statistics, 28(6):2777 2796, 2013. [32] Jan-Matthis Lueckmann, Jan Boelts, David S. Greenberg, Pedro J. Gonçalves, and Jakob H. Macke. Benchmarking simulation-based inference. In Artificial Intelligence and Statistics, 2021. [33] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain monte carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324 15328, 2003. [34] Kevin Mc Goff, Sayan Mukherjee, and Natesh Pillai. Statistical inference for dynamical systems: A review. Statistics Surveys, 9:209 252, 2015. [35] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017. [36] Benjamin Kurt Miller, Christoph Weniger, and Patrick Forré. Contrastive neural ratio estimation. ar Xiv preprint ar Xiv:2210.06170, 2022. [37] Fabio Muratore, Theo Gruner, Florian Wiese, Boris Belousov, Michael Gienger, and Jan Peters. Neural posterior domain randomization. In Conference on Robot Learning, 2021. [38] Fabio Muratore, Fabio Ramos, Greg Turk, Wenhao Yu, Michael Gienger, and Jan Peters. Robot learning from randomized simulations: A review. Frontiers in Robotics and AI, 9, 2022. [39] Hien Duy Nguyen, Julyan Arbel, Hongliang Lü, and Florence Forbes. Approximate bayesian computation via the energy statistic. IEEE Access, 8:131683 131698, 2020. [40] George Papamakarios and Iain Murray. Fast ϵ-free inference of simulation models with bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 2016. [41] George Papamakarios, David C. Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In Artificial Intelligence and Statistics, 2019. [42] Mijung Park, Wittawat Jitkrittum, and Dino Sejdinovic. K2-ABC: approximate bayesian computation with kernel embeddings. In Artificial Intelligence and Statistics, 2016. [43] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Conference on Artificial Intelligence, 2010. [44] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Found. Trends Mach. Learn., 11(5-6):355 607, 2019. [45] Simón Rodríguez Santana and Daniel Hernández-Lobato. Adversarial α-divergence minimization for bayesian approximate inference. Neurocomputing, 471:260 274, 2022. [46] Richard Sinkhorn. A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices. The Annals of Mathematical Statistics, 35(2):876 879, 1964. [47] S. A. Sisson, Y. Fan, and Mark M. Tanaka. Sequential monte carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760 1765, 2007. [48] Scott A Sisson, Yanan Fan, and Mark Beaumont. Handbook of approximate Bayesian computation (1st ed.). CRC Press, 2018. [49] Simon Tavaré, David J Balding, R C Griffiths, and Peter Donnelly. Inferring Coalescence Times From DNA Sequence Data. Genetics, 145(2):505 518, 1997. [50] Alvaro Tejero-Cantero, Jan Boelts, Michael Deistler, Jan-Matthis Lueckmann, Conor Durkan, Pedro J. Gonçalves, David S. Greenberg, and Jakob H. Macke. sbi: A toolkit for simulationbased inference. Journal of Open Source Software, 5(52):2505, 2020. [51] Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael P.H Stumpf. Approximate bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187 202, 2009. A Algorithmic details: pseudo-likelihood inference A.1 Deriving the optimal PLI parameter distribution Theorem 1. The optimal target distribution πt(ξ) in the optimization problem (6) is given by πt(ξ) p(ξ) πt 1(ξ) 1 1+ηt p(x 1:N|ξ) 1 1+ηt πt 1(ξ) (12) where ηt > 0 is a dual Lagrangian variable corresponding to the trust-region constraint. Proof. The solution to the stochastic search problem (6) can be obtained from Lagrangian optimization. The optimization problem is restated here for readability πt(ξ) = arg min π(ξ) KL (π(ξ) || p(ξ|x 1:N)), s.t. KL (π(ξ) || πt 1(ξ)) ε, Z π(ξ) dξ = 1. We decompose the KL objective into two terms by applying Bayes rule KL (π(ξ) || p(ξ|x 1:N)) = E π(ξ) [log p(x 1:N|ξ)] + KL (π(ξ) || p(ξ)). (13) The constrained optimization problem (6) can be reformulated with Lagrange multipliers as L(π) = Z π(ξ) log p(x 1:N|ξ)dξ + Z π(ξ) log π(ξ) + η Z π(ξ) log π(ξ) πt 1(ξ)dξ ε + λ Z π(ξ)dξ 1 = Z π(ξ) log p(x 1:N|ξ) + log π(ξ) p(ξ) + η log π(ξ) πt 1(ξ) + λ dξ ηε λ. (14) Here, we leveraged the assumption that the likelihood p(x1:M|ξ) is fixed for all joint distributions, and thus, the joint distributions can be split into the likelihood p(x 1:N|ξ) and their associated prior/proposal distributions. The gradient of the Lagrangian vanishes for the optimal parameter distribution L π π=πt = log p(x 1:N|ξ) + log πt(ξ) p(ξ) + 1 + η log πt(ξ) πt 1(ξ) + 1 + λ = 0. (15) Reformulation yields πt(ξ) = p 1 1+η (ξ) π η 1+η t 1 (ξ) exp log p(x 1:N|ξ) 1 + η 1 + η + λ = Q 1(η) p(ξ) πt 1(ξ) 1 1+η exp log p(x 1:N|ξ) 1 + η The normalization constant Q(η) = exp((1 + η + λ)/(1 + η)) follows by marginalization of (16) Q(η) = E πt 1(ξ) " p(ξ) πt 1(ξ) 1 1+η exp log p(x 1:N|ξ) 1 + η We further obtain the dual of the Lagrangian by reinserting (16) into the Lagrangian (14) g(η) = ηε (1 + η) log(Q(η)). (17) A.2 Weighted maximum likelihood optimization (m-projection) The m-projection of the optimal posterior onto the approximation family results in a weighted maximum likelihood formulation min ϕ KL (πt(ξ) || qϕ(ξ)) (18) Z log qϕ(ξ) πt(ξ) dξ p(ξ) πt 1(ξ) 1 1+ηt p(x 1:N|ξ) 1 1+ηt πt 1(ξ) log qϕ(ξ) dξ = max ϕ E πt 1(ξ) p(ξ) pt 1(ξ) 1 1+ηt p(x 1:N|ξ) 1 1+ηt | {z } w The weighting term w is independent of ϕ, and as such, the formulation facilitates optimizing neural density estimators with gradient descent. This optimization resolves to weighted maximum likelihood where the weights are obtained from the pseudo-likelihood (8) and the importance weights. The weighted maximum likelihood formulation can be optimized in closed form for linear Gaussian models [43, 13], with Expectation-Maximization (EM) using Gaussian Mixture Models (GMMs) [11] or with gradient descent as done in this paper. A.3 Optimizing with the i-projection We can reformulate the minimization problem for the i-projection in the following way min ϕ KL (qϕ(ξ) || πt(ξ)) (20) = min ϕ E qϕ(ξ) Q 1 ϕ p(ξ) pt 1(ξ) 1/(1+ηt) p(x 1:N|ξ) 1 1+ηt πt 1(ξ) = min ϕ KL (qϕ(ξ) || πt 1(ξ)) E πt 1(ξ) πt 1(ξ) log p(ξ) pt 1(ξ) 1/(1+η) p(x 1:N|ξ) The equation above alleviates the issue of back-propagating through the simulator by using importance sampling. The optimization problem is fitting the posterior estimator q to the proposal p(ξ) while having a regularization term that forces the distribution to fit the reference data. The temperature parameter η can thus be interpreted as weighting the regularization term. Small values of η put more emphasis on the regularization term, while large values concentrate on containing the information of the proposal. For linear Gaussian models, a closed-form expression for the KL term in (21) exists. Therefore, it can be directly optimized using any non-linear optimization method. Table A.1: Kernels used for the ABC and PLI variants during the experiments in Section 4. An IPM denoted by D plays the role of a distance measure between the reference data x 1:N and simulated samples x1:M. Parameter βt controls the kernel bandwidth (see Section 3.1). Effective Sample Size (ESS) is defined as the inverse of the normalized weight s variance. Kβt(D(x1:M, x 1:N)) Algorithm βt estimation Update 1{D(x1:M,x 1:N) βt} SMC ABC ESS β t = arg minβt ESS(wt, βt) αESS(wt 1, βt 1) PMC ABC α-Quantile β t = QD(xt 1:M,x 1:N)(α) exp D(x1:M,x 1:N) 2βt PLI Trust-region β t = β(1 + arg maxηt g(ηt)), see (10) A.4 Analysis of the partition function We shed some light on the intractable log-partition function Z(ξ) introduced in the pseudolikelihood (8). The partition function Z(ξ) of (8) is an integral over sample space of x X X exp D(x 1:N, x1:M) dx 1:N. (22) To approximate the intractable quantity, we approximate the integral through Monte Carlo simulations with a uniform distribution U i=1 exp D((x 1:N)i, x1:M) ; (x 1:N)i U( ; x 5 p 0 5 10 15 20 step Gaussian Location nobs = 1 nobs = 10 nobs = 100 0 5 10 15 20 step Gaussian Mixture 0 5 10 15 20 step 0 5 10 15 20 step 0 5 10 15 20 step Figure A.1: Spearman correlation coefficient r(w, ˆw) [ 1, 1] between the partition corrected weights ˆwi and the uncorrected weights wi. High values of r correspond to a high correlation of the weight rankings, thus meaning that the weights preserve relative ordering. Gaussian Linear Gaussian Mixture 5 10 15 20 step MMD-PLI MMD-PLI with Z(ξ) W-PLI W-PLI with Z(ξ) 5 10 15 20 step 5 10 15 20 step 5 10 15 20 step Figure A.2: Training plots comparing PLI trained with and without the partition function Z(ξ) on N = 100 samples. Top row: MMD between posterior samples and model samples. Bottom row: Wasserstein distance between posterior samples and model samples. On all tasks but SLCP, the inclusion of the partition function does not change the posterior inference. As we cannot sample over the whole space X, we choose to sample over the 5σ interval of the exponential kernel, where x represents the mean of 100 prior simulations. We use 10000 samples from U to approximate the partition function and evaluate the Z(ξ) based on 100 samples from the target posterior πt(ξ). We evaluate the influence of the partition function Z(ξ) on the performance of our PLI algorithm by comparing the weights ˆwi and wi as defined in (11) with and without Z(ξ), respectively. Despite the weights being numerically different, we hypothesize that they do not change the relative ordering of the samples and, therefore, do not affect the Weighted Maximum Likelihood (WML) update (11) significantly. Therefore, we employ the Spearman correlation coefficient, a nonparametric measure of rank correlation, to capture such dependencies, which we report in Figure A.1. Contrary to our hypothesis, the weights with and without normalization are not perfectly rank-correlated in four out of five experiments when there is more than one observed data point. Potentially, this could lead to a different WML update, but when we evaluate the whole PLI algorithm with and without the partition function Z(ξ), we find that its influence is marginal for both MMD-PLI and W-PLI in Figure A.2. Only on the SLCP task, an improvement of W-PLI can be seen when using the partition function, whereas MMD-PLI performs even better without it. We have motivated PLI from a VI perspective. As mentioned in the related work section (see Sec. 5), this is similar to the General Variational Inference (GVI) approach [29], which also defines the posterior as a solution to an optimization problem, namely π(ξ) = argmin π βN E π(ξ) [LN(ξ)] + KL (π(ξ) || p(ξ)). (24) When the loss is defined as the log-likelihood, LN(ξ) = log p(x 1:N|ξ) with β = 1/N, the PLI objective (6) is recovered. However, when we substitute the pseudo-likelihood (8) into the GVI objective, we obtain an additional term in the loss which is given as an expectation over the logpartition function Eπ[ log Z(ξ)]. Therefore, PLI can be seen as GVI with an additional loss term, or GVI can be equated with PLI using unnormalized pseudo-likelihood. B Algorithmic details: sequential Monte Carlo ABC The foundations of SMC-ABC have been laid by Del Moral et al. [14], who introduced SMC samplers. These samplers describe an approximate inference routine in which the posterior is approximated through a sequence of intermediate target posteriors. In the context of ABC, the sequence of intermediate posteriors is defined by an adaptive bandwidth βt of the approximate posterior pβt(ξ|x 1:N) (4). Additionally, the sample efficiency of ABC is improved by replacing the prior as the sampling distribution with a proposal distribution πt(ξ). The proposal distribution is represented by a set of particles πt(ξ) = 1/M PM i=1 δξ(i) t (ξ) and through importance sampling an approximation of the target posterior pβt(ξ|x 1:N) can be obtained, (see (5)), pβt(ξ|x 1:N) qt(ξ) = i=1 W (i) t δξ(i) t (ξ); W (i) t = pβt(ξ(i) t |x 1:N) πt(ξ(i)) , (25) where W (i) t denote the importance weights. The proposal distribution πt(ξ) should ideally stay close to the target posterior pβt(ξ|x 1:N) to improve the sample efficiency. Therefore, the proposal distribution is updated based on a Markov kernel Kt+1(ξt, ξt+1) which is the transition probability from ξt to ξt+1. The update of the proposal distribution is typically numerically intractable as it requires marginalization, i.e., integration over ξt for each inference step 0 : t πt+1(ξt+1) = Z Kt+1(ξt, ξt+1)πt(ξt)dξt. (26) To alleviate the computational burden, Del Moral et al. [14] show that the joint representation of the proposal distribution πt(ξ0:t) can be efficiently calculated as it only requires solving the product over t transitions πt(ξ0:t) = π0(ξ0) τ=0 Kt+1(ξt, ξt+1). (27) We define the joint proposal distribution as the empirical distribution πt(ξ0:t) = M 1 PM i=1 δξ(i) 0:t(ξ0:t) defined by a set of joint particles ξ(i) 0:t. Thus, the joint posterior approxi- mation of pβt(ξ0:t|x 1:N) based on the importance weights reads as i=1 w(i) t δξ(i) 0:t(ξ0:t); w(i) t = pβt(ξ(i) 0:t|x 1:N) πt(ξ(i) 0:t) (28) i=1 w(i) t δξ(i) t (ξt); w(i) t = pβt(ξ(i) 0:t|x 1:N) πt(ξ(i) 0:t) . (29) The marginal target posterior approximation qt(ξt) can be directly recovered from the joint approximation qt(ξ0:t). Furthermore, both distributions share their weights which means that it is only required to estimate the weights wt in order to approximate the target posteriors pβt(ξt|x 1:N). In general, the probability of the target joint posterior pβt(ξ(i) 0:t|ξ) is intractable. Therefore, the authors introduce an auxiliary backward Markov kernel Lt(ξt+1, ξt) to simplify the computation pβt(ξ0:t|x 1:N) = pβt(ξt|x 1:N) τ=0 Lτ(ξτ+1, ξτ). (30) Assuming that a posterior approximation of the target posterior pβt(ξt|x 1:N) is available through the set of weighted particles {(w(i) t , ξ(i) t )} and the particles of the proposal distribution πt(ξ) = M 1 PM i=1 δξ(i) t (ξ) are updated based on a kernel transition ξ(i) t+1 Kt+1(ξ(i) t , ξ(i) t+1), then the importance weights wt+1 are updated based on the following recursion wt+1 = pβt+1(ξ0:t+1) πt+1(ξ0:t+1) = pβt+1(ξt+1) Qt τ=0 Lt(ξτ+1, ξτ) π0(ξ0) Qt τ=0 Kt(ξτ, ξτ+1) (31) = pβt+1(ξt+1)Lt(ξt+1, ξt) pβt(ξt)Kt+1(ξt, ξt+1) | {z } ˆ wt+1 pβt(ξt) Qt 1 τ=0 Lt(ξτ+1, ξτ) π0(ξ0) Qt 1 τ=0 Kt(ξτ, ξτ+1) | {z } wt Thus, the sequential update is performed by updating the current weights wt with the marginal weights ˆwt+1. Up to now, the choice of the backward kernel has been neglected. As it is an auxiliary quantity, several approximations can be made to model the backward kernel. Del Moral et al. [14] refer to the optimal backward kernel as the Markov kernel that minimizes the variance of the particles Lopt t (ξt+1, ξt) = π(ξt)Kt+1(ξt, ξt+1) πt+1(ξt+1) . They further show that the optimal backward kernel recovers the marginal weights from (25) wopt t+1 := pβt+1(ξ|x 1:N) πt+1(ξ) = Wt+1. (33) In general, the optimal backward kernel is numerically intractable and has led to several other approximations summarized in Table B.2. Depending on choice of approximation, a number of different SMC-ABC methods have evolved, namely the classical SMC-ABC approach by Del Moral et al. [15], Population Monte Carlo (PMC)-ABC [51, 2, 31], and Metropolis-Hastings ABC [30]. Please refer to those references as well as Algorithms 2 and 3 for implementation details of these approaches. Algorithm 2 Sequential Monte Carlo ABC [15] 1: input: reference data x 1:N, prior p(ξ), stochastic simulator p(x|ξ), IPM D( , ), max. iteration count T, forward kernel K(ξt, ξt+1), resampling threshold V , α 2: initialize particles ξ(k) 0 p( ) 3: initialize particle weights w(k) 0 = 1/K 4: for t in 1:T do 5: for each ξ(k) t 1 do 6: simulate x(k) 1:M = {x(k) m p(x|ξ(k) t 1)} 7: compute IPM s(k) t 1 = D(x 1:N, x(k) 1:M) 8: end for 9: update the bandwidth βt by solving ESS({w(k) t }, βt) = α ESS({w(k) t 1}, βt 1) w(k) t w(k) t 1 1{s(k) t 1 βt} 1{s(k) t 1 βt 1} 10: if ESS({w(k) t }, βt) < V then 11: resample K particles ξ(k) t from {ξ(k) t 1} 12: set weights w(k) t = 1/K 13: end if 14: sample K particles ξ(k) t K(ξ(k) t 1, ξ(k) t ) 15: end for 16: output: posterior particles ξ(k) T Algorithm 3 PMC-ABC [31] 1: input: reference data x 1:N, prior p(ξ), stochastic simulator p(x|ξ), IPM D( , ), max. iteration count T, forward kernel Kt+1(ξt, ξt+1), α-Quantile α 2: initialize particles ξ(k) 0 p( ) 3: initialize particle weights w(k) 0 = 1/K 4: store number of best particles Kα = αK, 5: for t in 1:T do 6: elect Kα best particles ˆξ(k) t 1 7: sample K Kα proposal particles ξ(l) t Kt(ˆξ(k) t 1, ξ(l) t ) 8: for each ξ(l) t do 9: simulate x(l) 1:M = {x(l) m p(x| ξ(l) t )} 10: compute IPM s(l) t = D(x 1:N, x(l) 1:M) 11: end for 12: update bandwidth based on the empirical α-Quantile βt = Q{s(k) t 1,s(l) t }(α) 13: update weights w(k) t = p( ξ(k) t ) PKα i=1 w(i) t 1 PKα j=1 w(j) t 1 Kt(ξ(i) t 1, ξ(k) t ) 14: set K new particles ξk t = {ˆξ(k) t 1, ξ(k) t } 15: update forward kernel Kt+1(ξt, ξt+1) 16: end for 17: output: posterior particles ξ(k) T C Experimental details Here we detail the experimental configurations to reproduce the results covered in Figure 3 and C.1. A small grid search has been carried out over the learning rate, the trust-region parameter ε, the batch size, and the number of training samples on the SLCP and Furuta task for N = 50. The best-fitting hyperparameters over the two tasks are reported in Table C.3 and used throughout the experiments. The remaining parameters are taken from [32] to make the results comparable. We complement Figure 3 with Figure C.1 to compare the results in the observation space and include the Furuta pendulum. Table B.2: Approximations of the optimal backward kernel Lopt t (ξt+1, ξt) lead to different SMCABC approaches. Algorithm Assumption Lt ˆwt+1 Optimal - πt(ξt)Kt+1(ξt,ξt+1) πt+1(ξt+1) - PMC-ABC π pβt pβt(ξt|x 1:N)Kt+1(ξt,ξt+1) R pβt(ξt|x 1:N)Kt+1(ξt,ξt+1)dξt pβt+1(ξt+1|x 1:N) R pβt(ξt|x 1:N)Kt+1(ξt,ξt+1)dξt SMC-ABC pβt+1 pβt pβt(ξt|x 1:N)Kt+1(ξt,ξt+1) pβt(ξt+1|x 1:N) pβt+1(ξt|x 1:N) pβt(ξt|x 1:N) MH-ABC Lt(ξt+1, ξt) = Kt+1(ξt+1, ξt) pβt+1(ξt+1|x 1:N)Kt+1(ξt+1,ξt) pβt(ξt|x 1:N)Kt+1(ξt,ξt+1) MMD2(x 1:N, x1:N) Gaussian Location Gaussian Mixture W2 2(x 1:N, x1:N) log q(ξgt|X ) 1 2 5 10 20 50 1001000 MMD2(ξ , ξ) 1 2 5 10 20 50 1001000 N 1 2 5 10 20 50 1001000 N 1 2 5 10 20 50 1001000 N 1 2 5 10 20 50 1001000 N MMD-PLI W-PLI MMD-ABC W-ABC APT Figure C.1: Evaluation of the posterior performance on three different tasks (displayed along the columns). A node represents the mean and standard deviation of 10 experiments with different random seeds, each carried out using N data points for conditioning. The samples from the approximate posterior ξ q(ξ|x 1:N) are compared against the reference posterior samples with the Wasserstein distance and the MMD when available. Additionally, the log probability of the ground-truth parameter ξgt is evaluated and posterior predictive checks are carried out on all tasks. The ground-truth parameters are described in Appendix C. Lower values are better for all metrics. C.1 Approximation of the integral probability metrics In the context of this paper, we consider two instances of IPMs D(p (x), p(x|ξ)) between the data generating distribution p (x) and the likelihood p(x|ξ) the maximum mean discrepancy MMD and the squared 2-Wasserstein distance W2. To simplify the notation, we formulate the discrepancy between the pdfs p(x) and q(x) whose empirical probability distributions are denoted by p(x) = 1/N PN i=1 δxi(x) and q(y) = 1/M PM j=1 δyj(y). Furthermore, we denote the cost between individual samples by c(x, y). Maximum mean discrepancy The MMD [25] can be formulated with respect to an evaluation kernel k(x, y) as the sum of three terms MMD2(p, q) = E x p(x) y p(y) [k(x, y)] 2 E x p(x) y q(y) [k(x, y)] + E x q(x) y q(y) [k(x, y)] . (34) As the expectations are generally intractable, an unbiased estimate of MMD based on samples drawn from p and q is used [25] MMD2( p, q) 1 N(N 1) i =i k(xi, xi ) 2 NM i,j=1 k(xi, yj) + 1 M(M 1) j =j k(xj, xj ). (35) Here, xi and yj represent samples drawn from the sampling distributions x1:N p(x) and y1:M q(y). In this paper, a Gaussian kernel exp( 1/(2ℓ)c(x, y)) with bandwidth ℓis employed. The bandwidth is known to be very sensitive, which is why the kernel is evaluated over a variety of bandwidths ℓ= {1, 10, 20, 40, 80, 100, 130, 200, 400, 800, 1000} by summing over the bandwidths k(xi, yj) = P ℓkℓ(xi, yj) [16]. Wasserstein distance. In the experiments, we consider the squared 2-Wasserstein, which can be formulated as the solution to the optimal transport problem W2 2(p, q) = inf γ Z c(x, y)γ(x, y) dxdy. (36) The problem is also known as the Kantorovich problem, searching for the optimal coupling γ Γ(p, q) in the set of joint distributions that admit p and q through marginalization. For the empirical measures p and q, the Kantorovich problem can be formulated as a linear program W1 = min P U i,j Pij Cij. (37) Here, Cij = c(xi, yj) is the cost matrix containing the pairwise comparisons between the samples drawn from p(x) and q(y). The linear program searches for the optimal coupling matrix P among the set of doubly stochastic matrices U = {P RN M + : P1M = 1/N1N, P 1N = 1/M1M}. Peyré and Cuturi [44] show that introducing an entropy regularization term εH(P) to the objective (37) leads to an iterative scheme that can be solved with the Sinkhorn algorithm [46]. The iterative procedure enables parallelization on hardware accelerators to efficiently solve the optimal transport problem. Furthermore, it can be seen that the 2-Wasserstein distance can be recovered in the limit ε 0. We leverage the JAX library, OTT [9], to approximate W2 2 with Sinkhorn iterations for the computations. C.2 Gaussian location The Gaussian location model is a 10-dimensional Gaussian model. The ten dimensional parameters ξ [ 1, 1]10 define the means of the model N(x|µ = ξ, Σ = 0.1I). We choose a Gaussian prior p(ξ) = N(ξ|0, 0.1I) for which the posterior can be recovered in closed form. The ground-truth parameter is sampled uniformly within the posterior support ξgt U( 1, 1). 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 1.0 0.5 0.0 0.5 1.0 ξ1 Figure C.2: Slice of the posterior through the ξ1 ξ2 plane. The upper row shows experiments conducted on N = 2 reference observations. The lower row shows the approximate posteriors for N = 100 reference observations. The dotted line represents the ground truth parameter ξ(gt) that was used to generate x 1:N. All approaches show that the posterior becomes denser when conditioned on more data. C.3 Gaussian mixture model The task is to infer the mean parameters of a two-dimensional multivariate Gaussian mixture model [47] p(x|ξ) = N(x|µ = ξ, Σ = I) + N(x|µ = ξ, Σ = 0.01I) (38) The two-dimensional observation space represents samples from the Gaussian mixture model. We assume a uniform prior p(ξ) = U( 10, 10). C.4 Simple-likelihood complex-posterior The Simple-Likelihood Complex-Posterior (SLCP) task consists of a 5-dimensional parameter space Ξ [ 3, 3]5 with a uniform prior U( 3, 3). The ground-truth parameter, from which the reference observations are generated, is set to ξ(gt) = (0.7, 1.5, 1.0, 0.9, 0.6) . The observations represent four samples from a 2-dimensional Gaussian distribution x = [x1, x2, x3, x4, ] , xi = N(xi, µ(ξ), Σ(ξ)). (39) For further information, we refer to the SBI benchmarking paper from Lueckmann et al. [32]. The SIR model is an epidemiological time-series model that models the spreading of a disease. The name derives from the three states, (i) susceptible, (ii) infectious, (iii) recovered, that an individual can be in. The parameters of the dynamics model are the contact rate β and the mean recovery rate γ (0, 2] (0, 0.5] The prior is a log-normal distribution over β and γ β Log Normal(log(0.4), 0.5) (41) γ Log Normal(log(0.125), 0.2) (42) We rollout the dynamics over 160 timesteps and evaluate the simulation at 20 equidistant time-steps by taking a sample from the binomial, xi Binom(1000, Ii/N). Here N denotes the total population at the start of the simulation. See Lueckmann et al. [32] for details on the dynamics of the SIR model. C.6 The Furuta pendulum This inverted double pendulum can be described by the angular deflection [θr, θp] of the rods w.r.t. their equilibrium position [0, 0]. The equations of motion can be derived by formulating the Euler-Lagrange equation [37]: dr θr + τ dp θp 12mrl2 r + mpl2 r + 1 4mpl2 p sin2 θp 1 2mplplr cos θp 1 2mplplr cos θp 1 3mpl2 p 4mpl2 p sin 2θp θr θp 1 2mplplr sin θp θ2 p 1 8mpl2 p sin 2θp θ2 r + 1 2mplpg sin θp Here, the mild assumption is made that the pole length is significantly greater than its diameter for which the moments of inertia of the poles around their pivot are Ji = 1/3 mil2 i , i {r, p}. The mass matrix contains entries from the translatory and rotational movement of the two poles. As the reference coordinate systems are constantly rotating w.r.t. the basis coordinate system, Coriolis forces occur. They are complemented by gravitation which works on the rotational pole. The left-hand side considers damping in the joints, represented by the damping coefficients dr and dp, and the torque τ which is applied from a servo motor. For this paper, we omit external forces, i.e., τ = 0 Nm. The Furuta pendulum is set into motion by perturbing the initial state around its unstable equilibrium. For the system identification tasks we select the five system parameters g lr mr lp mp [9, 11] [0.08, 0.09] [0.08, 0.1] [0.12, 0.135] [0.02, 0.03] 9.81 0.085 0.095 0.129 0.024 with a uniform prior on the predefined ranges. Table C.3: Hyper-parameter settings of the SBI methods as used for the experiments in Section 4. Forward slashes symbolize layers of a neural network. Parameter Value Likelihood kernel Exponential Kernel Trust-region threshold ε 0.5 Model Neural Spline Flow (NSF) Bijector Rational Quadratic Spline with param size D # Bins 10 Conditioning MLP input dim / 50 / 50 / 50 / D # Bijectors / Transforms 5 Base distribution N(0, 1) Learning rate 1 10 5 Epochs 20 Train samples per iteration 5000 Batch size 125 PMC-ABC [31] Likelihood kernel Uniform Kernel Likelihood update α-Quantile, α = 0.1 (see Table A.1) α 0.1 Reverse transition kernel Lt reverse PMC kernel (see Table B.2) Particles 1000 Epochs 200 Perturbation kernel GMM with 5 components Model Conditional Neural Spline Flow (NSF) Bijector Rational Quadratic Spline with param size D # Bins 10 # Bijectors / Transforms 5 Conditioning MLP input dim / 32 / 32 / 32 / D Base distribution N(0, 1) # Atoms 10 Learning rate 1 10 5 Epochs 20 Train samples per epoch 5000 Batch size 500 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 Figure C.3: Results of posterior inference on the SLCP task with N = 2 reference observations. The unimodal distribution of the parameters ξ1 and ξ2) are depicted well by all approaches. On the contrary, the multi-modality is only represented properly by the APT posterior ( Reference, MMD-PLI, APT, MMD-ABC). 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 2.5 0.0 2.5 2.5 0.0 2.5 2.5 0.0 2.5 2.5 0.0 2.5 2.5 0.0 2.5 2.5 0.0 2.5 ξ1 2.5 0.0 2.5 ξ2 2.5 0.0 2.5 ξ3 2.5 0.0 2.5 ξ4 2.5 0.0 2.5 ξ5 Figure C.4: Results of posterior inference on the SLCP task with N = 100 reference observations. Compared to the posterior given N = 2 observations (Figure C.3), the posterior ( Reference) is distributed tightly around distinct points. Here, MMD-PLI captures all modes of the posterior, MMD-ABC centers around a uni-mode, while APT cannot represent the multi-modality. 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp Figure C.5: Results of posterior inference on the Furuta pendulum with N = 2 reference observations. All methods center around the ground truth parameter. APT finds the expected correlations among the parameters while MMD-PLI and MMD-ABC remain more widespread. 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp 0.08 0.10 lr 0.08 0.09 mr 0.020 0.025 0.12 0.13 mp Figure C.6: Results of posterior inference on the Furuta pendulum with N = 100 reference observations. All models capture the ground-truth parameter well. In contrast to the N = 2 setting (Figure C.5) MMD-PLI reveals pairwise correlations between the domain parameters, and MMDABC is less densely distributed. Note, that APT clusters outside of the ground-truth parameter.