# learning_likelihoodfree_reference_priors__97739344.pdf

Learning Likelihood-Free Reference Priors

Nicholas Bishop * 1 Daniel Jarne Ornia * 1 Joel Dyer * 1 Anisoara Calinescu 1 Michael Wooldridge 1

Simulation modeling offers a flexible approach to constructing high-fidelity synthetic representations of complex real-world systems. However, the increased complexity of such models introduces additional complications, for example when carrying out statistical inference procedures. This has motivated a large and growing literature on likelihood-free or simulation-based inference methods, which approximate (e.g., Bayesian) inference without assuming access to the simulator s intractable likelihood function. A hitherto neglected problem in the simulation-based Bayesian inference literature is the challenge of constructing minimally informative reference priors for complex simulation models. Such priors maximise an expected Kullback-Leibler distance from the prior to the posterior, thereby influencing posterior inferences minimally and enabling an objective approach to Bayesian inference that does not necessitate the incorporation of strong subjective prior beliefs. In this paper, we propose and test a selection of likelihood-free methods for learning reference priors for simulation models, using variational approximations to these priors and a variety of mutual information estimators. Our experiments demonstrate that good approximations to reference priors for simulation models are in this way attainable, providing a first step towards the development of likelihood-free objective Bayesian inference procedures.

1. Introduction

Simulation models have played a crucial role across diverse scientific disciplines, offering a powerful framework for understanding complex systems. From epidemiology (Kerr

*Equal contribution 1University of Oxford. Correspondence to: Nicholas Bishop <nicholas.bishop@cs.ox.ac.uk>, Daniel Jarne Ornia <daniel.jarneornia@cs.ox.ac.uk>, Joel Dyer <joel.dyer@cs.ox.ac.uk>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

et al., 2021) to economics (Wiese et al., 2024; Dyer et al., 2024a) and robotics (Todorov et al., 2012), these models can be used to explore intricate dynamics, test hypotheses, and make predictions about real-world phenomena. However, these models often lack an analytically tractable likelihood function for the outputs they produce. This intractability prevents the direct application of classical statistical inference methods. For this reason, a set of simulation-based inference (SBI) techniques has emerged, including Approximate Bayesian Computation (Beaumont, 2019) and modern methods based on density and density ratio estimation (see, e.g., Thomas et al., 2016; Papamakarios & Murray, 2016; Hermans et al., 2020; Greenberg et al., 2019). These methods approximate common statistical inference procedures using only the ability to simulate from the simulator, rather than the ability to evaluate the model s likelihood function.

Given a simulator with implicit density pθ, a prior distribution π over model parameters θ Θ Rd, and observed iid data x1:n := (x1, . . . , xn), xi X, the goal of simulationbased Bayesian inference methods, is to approximate the posterior distribution given by Bayes s theorem:

πx1:n(θ) pθ(x1:n)π(θ). (1)

The prior, π, is intended to encapsulate pre-existing knowledge or beliefs about the plausible parameter values, while πx1:n encapsulates the updated beliefs about plausible values for θ. However, in many practical scenarios, strong prior information regarding likely values for the model parameters may be unavailable or unreliable. Even when prior beliefs exist and can be readily expressed in the form of a prior distribution, they may not be confidently held, or the modeller might fear that they are unduly influencing the results of their Bayesian analyses by incorporating strong prior beliefs into π. The modeller may therefore wish to carry out Bayesian inference in a way that minimises the influence of their own prior beliefs on the resulting inferences, and determine what posterior beliefs would arise when minimal prior information is encoded into the inference procedure.

Such considerations have motivated the development of objective Bayesian methods, which offer a principled approach to constructing priors that are minimally informative and, correspondingly, posteriors that are maximally datadriven. This has often been solved through reference priors (Bernardo, 1979; Berger & Yang, 1994; Bernardo, 1997;

Learning Likelihood-Free Reference Priors

Berger et al., 2009), defined to be a prior distribution π , within some class Π of distributions on Θ, that maximizes the expected information gain from the data and minimises the influence of the prior on the posterior:

π n = arg max π Π E θ π x1:n pθ

log πx1:n(θ)

The right-hand side of Equation (2) can be recognised as the mutual information Iπ(x1:n, θ) between x1:n and θ π. Such reference priors (RPs) have additional desirable properties, such as invariance under reparameterization, good frequentist coverage (Consonni et al., 2018), and they asymptotically achieve the minimax entropy risk when Π is the class of continuous priors on Θ (Clarke & Barron, 1994).

Solving Equation (2) is, however, difficult except for in the simplest of cases. As described above, it is generally not the case that the model s likelihood function is tractable, let alone the mutual information (MI) between the model s parameter and output. Identifying a prior π n that solves Equation (2) for a given simulator is therefore challenging.

In this paper, we address this problem by proposing and testing a selection of likelihood-free approaches to learning RPs for implicit simulation models. A subset of these approaches allow for the simulator to be differentiable, while others address the case of non-differentiable simulators. To learn these RPs, we leverage recent advances in information theory and machine learning to estimate and maximise Iπ(x1:n, θ). Specifically, we consider both non-parametric entropy estimators, such as the Kozachenko-Leonenko estimator (Kozachenko & Leonenko, 1987; Kraskov et al., 2004), and more recent powerful neural MI estimators (Oord et al., 2018; Song & Ermon, 2020; Letizia et al., 2023). By maximizing the estimated MI, we approximate the RP without requiring explicit access to the likelihood function or the MI between θ and x1:n. Through experiments, we demonstrate the effectiveness of our approach and compare methods across a series of benchmarks, showcasing their potential for enabling objective Bayesian inference in simulation modelling.

2. Background

2.1. Reference Priors

Reference priors (RPs) were introduced in Bernardo (1979) as an approach to designing prior distributions representing a vague initial state of knowledge. Such priors aim to be minimally informative about the parameter θ, in the sense of maximising the missing information to be gained from the data. This can be seen more explicitly by rewriting Equation (2) as a difference between the prior and posterior entropies:

π n = arg max π Hπ [θ] Ex1:n mπHπx1:n [θ] , (3)

where Hπ [θ] = Eθ π [ log π(θ)] is the entropy of θ when generated from π and

mπ(x1:n) = Eθ π [pθ(x1:n)] (4)

is the marginal likelihood associated with prior π. Equation (3) makes it apparent that RPs aim to maximise the initial uncertainty in the prior while minimising, in expectation, the posterior uncertainty.

By specifying a maximally vague initial state of knowledge, RPs provide a useful tool for conducting prior sensitivity analysis: they provide the modeller with a way to assess what posterior inferences would ensue when those inferences are dominated by the observed data, rather than by the modeller s own subjective prior (Bernardo, 1997). In this way, the modeller can assess the relative influence of their own subjective prior on the final inference.

It is typical (Bernardo, 1979; Berger & Bernardo, 1992; Lafferty & Wasserman, 2001) to distinguish between the n-reference prior, defined by Equation (2) for finite n, and the reference prior, defined to be

π (θ) = lim n π n(θ) π n(θ0), (5)

where θ0 Θ is an arbitrary reference value for θ for which π n(θ0) > 0 for all n. This latter definition provides instead a measure of the total information that is missing about the value of θ, while circumventing issues resulting from the fact that Iπ(x1:n, θ) often diverges with infinite data.

A known result in the objective1 prior literature is that nreference priors are often only supported on a finite, discrete set of atoms in Θ (Berger et al., 1988; Zhang, 1994). This restriction can be undesirable, and in conflict with the aim for a minimally informative prior. To counteract this, it is sometimes preferred to consider only continuous, positive priors that maximise the objective in Equation (2) (Nalisnick & Smyth, 2017); we will assume this approach when describing the methods under consideration in Section 4.

A further motivation for restricting attention to continuous priors that solve Equation (2) is provided by Bernardo (1979); Clarke & Barron (1994). Here, it is established that asymptotically (i.e., as n ) and under regularity conditions (such as asymptotic Normality of the posterior distribution; see Clarke & Barron (1994) for details) Jeffreys prior, which has density

πJ(θ) (det(F(θ)))1/2 (6)

1The term objective has been historically adopted for methods giving rise to minimally informative priors, according to different definitions of minimally informative . We use this term to align with existing literature, but note that objective is in some senses a problematic term; see, e.g., Bernardo (1997).

Learning Likelihood-Free Reference Priors

is a solution, and is the unique solution among continuous positive priors. Here, F(θ) is the Fisher information matrix

F(θ) = Ex1:n pθ [ θ log pθ(x1:n) θ log pθ(x1:n)]

with determinant det(F(θ)), and denotes an outer product. Clarke & Barron (1994) prove that the continuous positive prior asymptotically solving Equation (2) i.e., Jeffreys prior, Equation (6) induces a marginal likelihood function mπJ that asymptotically minimises the largest distance between the space Mn of all densities on X n and members of the model family {pθ : θ Θ}; that is, as n ,

mπJ = arg inf m Mn

sup θ Θ Ex1:n pθ

log pθ(x1:n)

That is, asymptotically, RPs identified in the space of continuous positive distributions also satisfy the alternative notion of uninformativeness captured by (7).

2.2. Simulation-based Inference

Approximate, simulation-based inference (SBI) procedures for performing parameter inference for complex models have been in development for multiple decades. Beginning with approaches such as Approximate Bayesian Computation (Diggle & Gratton, 1984; Pritchard et al., 1999; Beaumont, 2019; Dyer et al., 2024b) and synthetic likelihood (Wood, 2010) methods, the past decade has seen multiple innovations in SBI methods from the machine learning community, such as through the use of neural density (e.g., Papamakarios & Murray, 2016; Greenberg et al., 2019) or density ratio (e.g., Thomas et al., 2016; Hermans et al., 2020) estimation techniques. This literature has, however, primarily focused on the problem of estimating the posterior density for a given prior and simulator, rather than on the problem of finding priors with different properties of interest. Our work addresses this gap.

3. Related Work

Early work on RPs have proposed procedures for pointwise estimation or generating samples via, e.g., Markov chain Monte Carlo (MCMC). An example of the former is Algorithm 1 of Berger et al. (2009), which exploits the following form of π as obtained through a calculus of variations argument (Bernardo, 1979; Berger & Bernardo, 1992):

π (θ) exp (Ex1:n pθ [log πx1:n(θ)]) . (8)

The posterior appearing in the exponent on the right-hand side of this expression is typically taken to be the limiting posterior as n , which is (under certain regularity conditions, see Van der Vaart (2000)) independent of the (continuous, positive) prior that generates it. Berger et al.

(2009) then propose to evaluate the right-hand side of Equation (8), which we denote below with π , pointwise using Mote Carlo estimates of the expectations in Equation (8) and an arbitrary positive prior c : Θ R+:

r=1 log pθ(x(r) 1:n)c(θ)

1 L PL l=1 pθ(l)(x(r) 1:n)

with x(r) 1:n pθ, θ(l) c, and n large. In contrast, Figure 1 of Lafferty & Wasserman (2001), describes an iterative

MCMC algorithm inspired by the Blahut-Arimoto algorithm (Blahut, 1972; Arimoto, 1972) for sampling from a model s RP. In each of these procedures, knowledge of and the ability to easily calculate (exactly or numerically) the likelihood function corresponding to the model under consideration is assumed. More recent and more closely related work by Nalisnick & Smyth (2017) consider the problem of learning variational approximations to RPs, once again by assuming knowledge and tractability of the likelihood function associated with the model. Gao et al. (2022) consider how n-reference priors can be used to pre-train deep neural networks in classification settings. In contrast to these prior works, we do not assume that the modeller enjoys access to the likelihood function associated with their simulator, expanding the scope of the literature on objective priors in Bayesian inference to the case of likelihood-free simulation models that are either differentiable or non-differentiable.

In this section, we provide details on the methods we propose and test for learning (n-)reference priors in likelihoodfree settings for simulation models. Each of the methods we consider entails optimising a (lower bound on a )n estimate ˆI for the MI between x1:n and θ. The main structure of the optimisation loop is shown diagrammatically in Figure 1.

Throughout, we assume the use of a parameterised variational family Π := {πϕ | ϕ Φ} of proper priors (see Appendix A.1) with tractable density function, such as normalising flows (Rezende & Mohamed, 2015). We will at times omit the dependence on ϕ for ease of notation. Such continuous approximations to (n-)reference priors can be advantageous, as discussed in Section 2.1. Further, by changing the number n of independent replications of the simulator output, the modeller can decide whether to approximate an n-reference prior for smaller values of n, or approximate the RP (5) by taking larger n.

We also assume that data samples x1:n are generated from a simulator with implicit likelihood function pθ that is dependent on parameters θ. By sequentially sampling parameters θ π and data x1:n pθ we can construct a dataset Dϕ = (x(r) 1:n, θ(r))r=1,...,R generated from the joint distribu-

Learning Likelihood-Free Reference Priors

Flow πϕ ϕ θ Simulator pθ x1, . . . , xn

Encoder sφ sφ(x1, . . . , xn) MI Estimator ˆI

Figure 1. A schematic of the overall pipeline for our methods.

tion (x1:n, θ) hπ, where

hπ(x1:n, θ) = pθ(x1:n)π(θ) (10)

is the joint distribution of (x1:n, θ) under prior π, and discard the θ(r) to produce samples from mπ.

4.1. Difference of Entropy Estimators

From the definition of MI between model parameters and outputs, we obtain the following equivalent expressions (see Appendix A.2):

Iπ(x1:n, θ) = Hπ[θ] Ex1:n mπHπx1:n[θ] (11)

= Hπ[θ] + Hmπ[x1:n] Hhπ[x1:n, θ]. (12)

As one of the main approaches, we consider a set of methods that rely on estimating MI through a difference of entropy relation, either through estimating the entropy of the prior and the posterior (11), or through estimating the prior, marginal and joint entropies (12). The problem of estimating (and maximising) MI now becomes a problem of effectively estimating the entropy of these different distributions.

Given that we have assumed a variational prior family with tractable density π(θ), the first term in equations (11) and (12) can be directly estimated without bias through a Monte Carlo entropy estimate for B prior samples:

b=1 log π(θ(b)). (13)

Then, to estimate the terms Ex1:n mπHπx1:n[θ], Hmπ[x1:n] and Hhπ[x1:n, θ] in likelihood free SBI settings there exist different parametric (Pichler et al., 2022) and nonparametric (Kraskov et al., 2004) approaches. In a nonparametric approach, we can follow Jarne Ornia et al. (2024) and exploit the differentiability of entropy estimators (e.g., the Kozachenko-Leonenko estimator, Kozachenko & Leonenko (1987)) and of differentiable simulators to propagate gradients backwards from outputs to prior; a parametric approach would instead construct density estimates of the corresponding distributions and compute the entropy directly. We propose the use of a parametric estimator in the interest of stability and flexibility2.

2Non-parametric, sample based entropy estimators are usually high variance and numerically unstable (Bonachela et al., 2008).

4.1.1. GENERATIVE ENTROPY DIFFERENCE ESTIMATOR

Making use of (11), we devise an approach for estimating the mutual information. Recall that we can obtain estimates ˆHπ[θ] directly from (13). Therefore, it only remains to find an estimate for Ex1:n mπHπx1:n[θ]. Taking inspiration from Pichler et al. (2022), a practical approach is to construct parameterised estimators for the conditional distribution πx1:n, denoted as ˆπψ x1:n (for parameters ψ Ψ), and use these to directly estimate this term using Monte Carlo sampling as in (13). Pichler et al. (2022) propose doing so by assuming the parameters ψ to be a function of the conditioning variable (the observed data). This yields the estimator:

ˆIψ,ϕ(Dϕ) = ˆHπ[θ] + 1

b=1 log ˆπψ x(b) 1:n(θ(b)). (14)

An important caveat is that in our case, the data marginal mπ depends on a prior π that changes iteratively (as the parameters ϕ evolve). Then, such an approach would entail repeating the following iteratively: (1) For the current prior π, sample Dϕ from hπ. (2) Build the estimator ˆπψ x1:n. (3) Compute ˆIψ,ϕ(Dϕ). (4) Update ϕ by SGD on ϕ ˆIψ,ϕ(Dϕ).

Step 2 above requires possibly learning a new estimator each time we update ϕ, which can be sampling and computationally inefficient. Instead, we adopt a two time-scale approach (Borkar, 1997). We consider the prior π and the estimator ˆπx1:n to pertain to the same parameterised model class (particularly, a normalizing flow). We then update the conditional density estimator ˆπx1:n with a faster learning rate3 than the prior π.

MI Estimator: Generative Entropy Difference (GED) Therefore, to learn a RP using a parametric MI estimator, we propose defining π and ˆπx1:n to belong to some parameterised class of conditional density estimators, and use ˆIψ,ϕ(Dϕ) (14) as an estimate of the MI. We then update both ˆπx1:n at each iteration through maximum likelihood on the sampled data, and π through SGD on the MI estimator, with a slower learning rate. For further implementation details, see Appendix B.6.

4.2. Variational Lower Bound Estimators

When the simulator is differentiable, we may also exploit variational lower bounds on the mutual information to learn an approximate RP. A wide range of variational lower

3The formal two time-scale argument by Borkar (1997) requires learning rates α(t) and β(t) decay to zero for t , to have infinite sum and finite sum of squares. Then, for the resulting stochastic approximation to guarantee convergence, limt α(t) β(t) = 0 must hold. We instead adopt an approximated argument, and choose fixed β α. This proved sufficient to learn the RPs correctly. See appendix for further details.

Learning Likelihood-Free Reference Priors

bounds have been proposed in the literature (see Poole et al. (2019) for a thorough overview). Many such bounds rely on the following variational characterization of the KL-divergence (Donsker & Varadhan, 1975) between two distributions P and Q:

DKL(P Q) = sup T L EP [T] log EQ[e T ], (15)

where L is the space of essentially bounded measurable functions. In particular, Belghazi et al. (2018) exploit Equation (15) by proposing MINE, which parameterises T with a neural network (commonly referred to as a critic) so that a reasonably tight lower bound on the MI can be learned via stochastic gradient ascent. Due to the second term of Equation (15), which corresponds to the log-partition function of Q, MINE suffers from high variance, especially when the MI is large (Song & Ermon, 2020; Mc Allester & Stratos, 2020). To mitigate this issue, Song & Ermon (2020) propose SMILE, which clips empirical approximations of the log-parition function to lie in the range [ τ, τ], where τ > 0 is a hyperparameter controlling the bias-variance trade-off associated with truncating the log-partition function.

In the context of contrastive predictive coding, the Info NCE objective was proposed by Oord et al. (2018) for the purpose of density ratio estimation:

sup T EP n(X,Y )

i=1 log T(xi, yi) 1 n P

j =i T(xi, yj)

Oord et al. (2018) observes that Info NCE implicitly maximises a variational lower bound on the MI. Song & Ermon (2020) further show that MI estimation methods derived from the Barber-Akov inequality (Barber & Agakov, 2003) may be reformulated as an optimization over density ratios. That is, from an estimate of the density ratio one may estimate MI and, by optimising a variational lower bound on the MI, one obtains an estimate of the density ratio.

Given a variational lower bound on the MI, we may jointly train a critic T and a variational prior to learn a RP. The variational family is responsible for producing a prior that induces high MI between x1:n and θ by maximising the variational lower bound. Meanwhile, the critic is tasked with keeping the variational lower bound tight so that the variational family has a good approximation of the MI to benchmark against. Both networks may be simultaneously updated via stochastic gradient ascent. For instance, using Info NCE, a critic Tµ with parameters µ, and a flow πϕ we may proceed as follows:

1. Sample Dϕ hϕ.

2. Compute the variational lower bound using critic Tµ:

ˆIϕ,µ(Dϕ) = 1

b=1 log Tµ(x(b) 1:n, θ(b))

a =b Tµ(x(b) 1:n, θ(a)) .

3. Update the parameters ϕ and µ via stochastic gradient ascent using ϕ,µ ˆIϕ,µ(Dϕ).

The procedure outlined above relies on differentiability of the simulator, as gradients must backpropagate through the critic and then the simulator before reaching the variational prior. We show in experiments that this requirement, while apparently stringent, can be reliably circumvented using surrogate gradients (Blondel & Roulet, 2024).

Additionally, note that the critic and the variational prior are updated simultaneously. As a result, the critic is optimizing a non-stationary objective. If the variational prior changes too rapidly the MI estimate provided by the critic may degrade, in turn causing the performance of the variational prior to degrade. In practice, we stabilize training through a two time-scale scheme, as described in Section 4.1.

In experiments, we adopt both SMILE and Info NCE as variational objectives, for several reasons. Many real-world simulators produce high dimensional outputs such as time series, leading to the possibility of high variance. SMILE is naturally suited to this setting due to the hyperparameter τ that enables explicit management of the bias-variance trade-off. Meanwhile, Info NCE typically exhibits low variance, since the optimal critic does not depend on batch size (Poole et al., 2019). This property is especially important for expensive simulators, since only smaller batch sizes can be used under limited simulation budgets.

4.3. Performing Simulation-based Inference

It is worthwhile highlighting that, with each of the methods described in the previous subsections, the ability to perform SBI for the implicit model and learned RP ˆπ comes at no further training cost. In the case of GED, the method entails the construction of an amortised (in x1:n) estimator ˆπψ x1:n(θ) for the posterior density θ that results from the use of the learned prior ˆπ in Bayes s theorem, which can be immediately reused to generate samples from ˆπψ x1:n through, for example, Markov chain Monte Carlo (MCMC) or forward passage through the network defining the normalising flow model. Similarly, in the case of the approaches outlined in Section 4.2, which are based on optimising a variational lower bound to the mutual information between x1:n and θ, the learned discriminators ˆT estimate a function w of the density ratio hˆπ (x1:n, θ)/mˆπ (x1:n)ˆπ (θ). (For example, in the case of SMILE, w is the identity.) Since we have assumed that the variational RP density ˆπ is tractable through the use of, for example, a normalising flow model for the variational family, estimates log ˆπ x1:n(θ) of the logposterior density log π x1:n(θ) resulting from the use of the true RP π may then be obtained as

log ˆπ x1:n(θ) = log w 1( ˆT(x1:n, θ)) + log ˆπ (θ). (17)

Learning Likelihood-Free Reference Priors

(a) Gaussian Scale Model (b) Exponential Model (c) Triangular Model (d) AR(1) Model

Figure 2. Comparison of proposed methods for learning reference priors on tractable models.

Equation (17) can then be used in, e.g., an MCMC procedure to generate samples from the posterior ˆπ x1:n. In both cases, Bayesian SBI can immediately be performed with no further training of density ratio or posterior estimators.

5. Experiments

Here, we present a series of experiments4 to assess the RPlearning methods described in Section 4. In the first instance, we consider their ability to recover the RPs for a collection of simulators with known RPs, before considering more complex simulation models whose RPs are unknown.

5.1. Tractable Examples

5.1.1. MODEL OVERVIEWS

Gaussian Scale Model. The first tractable example we consider simulates Gaussian random variables. For this model, n samples are generated iid from N(µ, σ2) as xt = µ + θut, with ut N(0, 1), and where µ R is known, θ > 0 is a free parameter, and N(a, b2) is a Normal distribution with mean a and variance b2. Here, the RP (Yang & Berger, 1996) for θ is π (θ) 1/θ.

Exponential Rate Model. We next consider an Exponential model, whose generative process is as follows: for t = 1, . . . , n, we generate random variables xt from an Exp(θ) density as xt = log (1 ut) /θ, where ut U(0, 1), θ > 0 is a parameter, and U(a, b) is a uniform distribution on [a, b]. As with the Gaussian scale model, the RP for θ is known (Yang & Berger, 1996) to be π (θ) 1/θ.

Triangular Model. In this example, we generate random variables from a Triangular distribution on [0, 1]. Here, iid data is generated for t = 1, . . . , n as

xt = I [ut θ] p

+ (1 I [ut θ]) 1 p

(1 ut) (1 θ) , (18)

4Code available at https://github.com/ joelnmdyer/lf_reference_priors.

where θ (0, 1) is a free parameter and ut U(0, 1) is a random variable distributed uniformly on [0, 1]. While the RP for θ in this model is not available analytically, it is known (Berger et al., 2009) to be approximated well by a Beta(1/2, 1/2) distribution. Additionally, although the derivative of xt with respect to θ for fixed ut is 0 almost everywhere when defined in the usual sense, we may nonetheless define an approximate surrogate gradient through, e.g., the straight-through gradient trick (Bengio et al., 2013) in order to backpropagate through xt. In this way, we may continue to apply the methods described in Section 4.2, which require a differentiable simulator.

Autoregressive Time-series Model. Finally, we consider the standard autoregressive time-series model of order 1 (AR(1)). Using ut N(0, 1), t = 1, . . . , n, this model generates a time-series x1, . . . , xn as

x1 = σu1, and xt = θxt 1+σut for t = 2, . . . , n, (19)

where σ > 0 is fixed and θ [ 1, 1] is a free parameter. It can be shown (Berger & Yang, 1994) that the corresponding RP for θ [ 1, 1] is π (θ) 1 θ2 1/2.

5.1.2. RESULTS

Figure 2 shows the obtained RPs for each method in Section 4 alongside their ground-truth counterparts. Each prior is generated from N = 104 samples and building a histogram (with fixed number of 30 bins). In general, the GED and variational lower bound (VLB) methods obtain accurate approximations of the ground truth RPs. However, bigger discrepancies can be observed especially near asymptotic parameter limits: both the Gaussian and Exponential models have asymptotic densities at θ = 0, the AR(1) model has asymptotes at θ { 1, 1} and the Triangular model at θ {0, 1}. These asymptotes are in general difficult to reproduce for generative models; for example, we see that Info NCE and SMILE slightly overestimate densities close to the boundaries in Figure 2c. However, the proposed architecture of bounded generative flows are overall able to reconstruct these densities and approximate the true priors

Learning Likelihood-Free Reference Priors

Table 1. Performance metrics (mean [standard deviation] from 5 repeats) for different prior methods. Bold indicates best performance.

WASSERSTEIN C2ST TASK NUMERICAL INFONCE SMILE GED NUMERICAL INFONCE SMILE GED

GAUSS. 6.95 [0.64] 7.08 [0.56] 6.77 [0.53] 1.66 [0.20] 0.90 [0.01] 0.62 [<0.005] 0.63 [0.01] 0.49 [0.01] EXP. 4.25 [0.39] 2.43 [0.61] 3.83 [0.75] 2.04 [0.32] 0.91 [0.01] 0.58 [0.04] 0.61 [0.02] 0.59 [0.06] TRI. 0.08 [0.01] 0.04 [<0.005] 0.05 [0.01] 0.02 [0.01] 0.64 [0.01] 0.54 [0.02] 0.55 [0.02] 0.51 [0.02] AR(1) 0.14 [0.01] 0.06 [0.02] 0.05 [0.03] 0.06 [0.02] 0.62 [0.01] 0.50 [0.02] 0.50 [0.01] 0.51 [0.01]

relatively accurately.

To provide a quantitative analysis, we compare each learned prior to the corresponding ground truth RP using two common metrics in the SBI literature: the Wasserstein distance and a classifier two-sample test (C2ST) (Lueckmann et al., 2021). Lower values are preferred in both cases, and indicate closer match to the ground truth RP. As a baseline, we use the numerical method described in Berger et al. (2009), which generates pointwise evaluations of the unnormalised RP using (9). Full results are provided in Table 1. From this we see that, for the Exponential, Triangular, and AR(1) models, all of our proposed methods perform consistently well. Indeed, each outperforms the numerical method from Berger et al. (2009) according to both metrics. For the Gaussian Scale model, Info NCE marginally underperforms compared to this numerical method in terms of the Wasserstein distance to the true RP; this is likely due to the difficulties of capturing asymptotic behaviour with proper densities.

5.2. Complex Examples

5.2.1. SIMPLE LIKELIHOOD, COMPLEX POSTERIOR

We next consider the popular SBI benchmark task SLCPD (Lueckmann et al., 2021), based on the experiment first introduced by Papamakarios et al. (2019). This simulator has 5 parameters, θ = (θ1, . . . , θ5), taking values in Θ = [ 3, 3]5 and parameterising the following data-generating process: zi N(µ(θ), Σ(θ)), i = 1, . . . , k, where

µ(θ) = (θ1, θ2) , Σ(θ) = θ4 3 ρ θ2 3 θ2 4 ρ θ2 3 θ2 4 θ4 4

and ρ = tanh θ5. Distractor variables δi are also drawn from a mixture of Student t-distributions, with the final model output constructed as xi = (zi, δi). Further details are provided in Appendix C.1. The RP for θ is not analytically known. Further, this model has a higher dimensional parameter space than the experiments in Section 5.1, and is known to generate complex posteriors; this makes this experiment more challenging than those in Section 5.1.

While the RP for θ in the SLCP model is unknown, it should be expected that the RP for the transformed parameter ϑ = (µ(θ), Σ(θ)) should adhere to the shape of the RP for a multivariate Normal. In particular, we should expect the

marginal distributions for the µi(θ) to be uniform in the range [ 3, 3], and the marginal distribution for det Σ(θ), where det Σ(θ) is the determinant of the model s covariance matrix, to decay rapidly in det Σ(θ) (Yang & Berger, 1996).

Results In Figure 3, we plot the marginal distributions for the priors learned via the Info NCE lower bound in Section 4.2 and the GED objective in Section 4.1; we see that both methods have been able to recover this general shape, with Info NCE doing so more successfully than GED. This is in line with our expectation that GED will underperform in this case, given that SLCP-D is known to generate complex posteriors that can be difficult to approximate accurately.

To conclude this experiment, we consider how the quality of the estimated RPs might be assessed when the form of the RP is unknown, as will be the case in practical applications. Given that the purpose of the RP is to maximise the MI between θ and model output, a simple method for assessing the relative performance of a procedure for learning RPs is to perform a classifier two-sample test. In particular, we train a binary classifier to distinguish between samples pairs (x1:n, θ) drawn jointly from hπϕ and drawn from the product of the marginal distributions, πϕ and mπϕ. Such metrics for assessing performance are common in SBI (see, e.g., Lueckmann et al., 2021).

In Table 2, we report classification accuracies for this classification task under four different choices for priors: a Uniform distribution on Θ, and the approximate RPs estimated via Info NCE, SMILE, and GED. From this, we see that the RPs estimated through our methods produce a higher classification accuracy than the Uniform distribution a common default choice of prior with the variational lower bound (VLB) methods outperforming the GED method in this instance. In summary, this practical method for assessing the quality of the learned RP corroborates our expectations that a Uniform distribution is a poor approximation to the RP for this model, while GED generates a better approximation that is inferior to those generated by the VLB methods.

5.2.2. AN EPIDEMIC AGENT-BASED MODEL

As a final example of a complex simulation model, we consider a susceptible-infected-recovered (SIR) agent-based model of disease spread in a population of N interacting

Learning Likelihood-Free Reference Priors

Figure 3. Marginal distributions of the priors learned for SLCP with distractors (SLCP-D) simulator.

Table 2. C2ST metrics (mean [standard deviation] from 5 repeats) for different priors in the SLCP-D experiment (Section 5.2.1). Higher values are better; bold denotes best performance.

UNIFORM INFONCE SMILE GED

0.55 [0.01] 0.99 [<0.005] 0.98 [<0.005] 0.69 [0.10]

individuals (the agents ). Agents in the model interact via a network, through which disease transmission occurs. Specifically, at time t = 1, . . . , τ, the state zi,t of agent i is

zi,t = I[zi,t 1 = 0] I[ui,t (1 (1 β)ηi,t)]

+ I[zi,t 1 = 1] (I[ui,t γ] + 1) + 2 I[zi,t 1 = 2], (21)

where the states zi,t = 0, 1, 2 represent, respectively, that agent i is susceptible, infected, or recovered at time t. In the above, ui,t U(0, 1), ηi,t counts the number of agent i s neighbours that are infected at time t, and β, γ (0, 1) are parameters that determine the rate of infection and recovery. The initial states zi,0 Bernoulli(i0), where i0 (0, 1) determines the proportion of individuals who are infected at time 0. Further details on the model and its implementation are given in Appendix C.2. Finally, we extract the total number of infected agents at each step, such that

i=1 I[zi,t = 1].

Taking i0 = 0.1, τ = 50, and N = 200, we learn an n-reference prior for the parameter θ = (β, γ) (0, 1)2, where in this case n is the number of iid length-τ trajectories generated at each θ.

Results To illustrate the impact of RPs in simulation models, we plot in Figure 4 trajectories generated from the marginal likelihood functions corresponding to each of the learned priors for the SIR model. From each learned prior, we sample 10 values for θ, and for each of these we generate 10 trajectories. Trajectories simulated from the same parameter value share the same colour. Visually, it is apparent that the priors that have been trained to maximise MI (via Info NCE and GED) produce a much more diverse set of

outcomes while retaining low conditional entropy of individual parameter outcomes; in contrast, while the conditional entropies are still relatively low for the Uniform prior, the diversity in the sampled trajectories from across all parameter values is visibly lower. This can be especially useful for model designers, as with very few runs RPs produce specific (and diverse) outcomes.

Finally, we demonstrate our claim in Section 4.3 that SBI can be performed at no extra cost using the posterior and critic networks trained during GED and VLB-based methods. In particular, we compare inferences obtained by sampling directly from the posterior network trained in GED; performing MCMC using the critic network produced from the Info NCE objective; sampling directly from a neural posterior estimator (NPE) trained as in Greenberg et al. (2019); performing MCMC using a neural density ratio estimator (NRE) trained as in Miller et al. (2022). For NPE (resp. NRE) we use the RP learned via GED (resp. Info NCE) as the prior density, and use 104 simulations. In Figure 5, we show that samples from the posterior predictive distributions for each method are comparable, demonstrating that SBI can be performed accurately at no additional training cost once RPs are learned via the GED and VLB methods we consider.

6. Discussion

Main Results In this paper, we study and test multiple approaches to learning RPs for arbitrary simulation models. Through experiments on tractable and intractable examples, we have shown that these methods can to some extent construct good RPs and attain properties of interest: they have been able to recover the overall shape of known RPs, in many cases being indistinguishable from the ground truth according to standard two-sample tests, and produce behaviour that can be seen visibly as achieving high MI between model parameters and model output in comparison to common default priors used in Bayesian inference (e.g., the Uniform distribution). Using these methods for constructing approximate RPs may therefore be useful in practice for complementing subjective, modeller-specified priors during

Learning Likelihood-Free Reference Priors

(a) Info NCE SIR results (b) GED SIR results (c) Uniform SIR results

Figure 4. Prior predictive simulations for the SIR model, as induced by (a) the Info NCE prior, (b) the GED prior, and (c) a Uniform prior.

Figure 5. Posterior predictive samples for the SIR model conditioned on synthetic data sampled from learned RPs. The top rows displays samples generated using the density ratio/posterior estimator trained whilst learning each RP, whilst the bottom row shows samples generated using density ratio/posterior estimators seperately trained via NRE/NPE.

real-world Bayesian analyses involving simulation models: they can allow modellers to assess the degree to which their inferences differ from the case where minimal information as measured by the MI between θ and data is built in to their subjective priors. This is, we believe, the main practical benefit of our work.

Method choice and the complexity of learning reference priors From a technical perspective, learning (good) RPs proved a relatively difficult task even with the chosen generative models and (deep learning) heuristics employed: RPs are not necessarily proper, which means that estimating them via generative models introduces inevitable biases and approximation errors. Further, we have on occasion seen that multiple runs of the same method can produce approximated RPs that differ substantially. This may reflect the fact that multiple (local) optima may exist; as discussed in Section 5.2.1, metrics such as classifier two-sample tests may be useful for identifying this if it occurs.

With respect to choosing between methods, a number of factors are to be considered. VLB methods seemed faster and more stable, but do not produce precise estimates of MI. Further, VLB methods involve only one density estimation task that of estimating the prior density while GED involves two learning both the prior and posterior density and for this reason VLB can be less computationally expensive and more accurate when in settings with complex posteriors. This is reflected to some degree in our experiments with the SLCP model in Section 5.2.1; however, the fact that GED has nonetheless produced a good approximate RP even in this case demonstrates that this may not be as limiting a factor. Additionally, we have seen that while GED can produce

reasonable estimates, training can also be unstable, exhibiting sensitivity to learning rates. However, GED methods can be applied to non-differentiable simulators, while currently the VLB methods we consider are only applicable to simulators for which derivatives of the model output with respect to the input parameters can be constructed through, e.g., reparameterisation tricks (see, e.g., Jang et al., 2016). Further, through GED, posterior samples can be generated rapidly via a forward pass through the posterior network, while for VLB posterior samples must be generated via MCMC. Thus, since MCMC can be more time consuming, GED methods may be preferred when fast posterior sampling is also required. In general, however, we recommend applying both methods where this can be done, since it may in general be difficult to predict beforehand which approach will perform best for any given problem.

Limitations Our work naturally suffers limitations. In this paper, we focus our attention to scenarios in which all components of the parameter vector θ are of interest to the modeller, with no nuisance parameters present. Nuisance parameters introduce additional complications, and constructing RPs in their presence requires a more nuanced treatment (Bernardo, 1979; Berger & Bernardo, 1992). Finally, RPs are only one possible approach to conducting objective Bayesian inference, and other considerations can lead to alternative objective priors in Bayesian analysis (see, e.g., Consonni et al., 2018). The present paper is a first step towards enabling objective Bayesian inference for implicit simulation models; developing likelihood-free methods for estimating other classes of objective priors would also constitute interesting future work.

Learning Likelihood-Free Reference Priors

Acknowledgements

NB, DJO, JD, AC, and MW acknowledge funding from a UKRI AI World Leading Researcher Fellowship awarded to Wooldridge (grant EP/W002949/1). MW and AC also acknowledge funding from Trustworthy AI - Integrating Learning, Optimisation and Reasoning (TAILOR), a project funded by the European Union Horizon2020 research and innovation program under Grant Agreement 952215.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning and simulation modelling. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14 20, 1972.

Barber, D. and Agakov, F. The IM algorithm: a variational approach to Information Maximization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pp. 201 208, 2003.

Beaumont, M. A. Approximate Bayesian computation. Annual review of statistics and its application, 6:379 403, 2019.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, pp. 531 540. PMLR, 2018.

Bengio, Y., L eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. ar Xiv preprint ar Xiv:1308.3432, 2013.

Berger, J. O. and Bernardo, J. M. On the development of the reference prior method. Bayesian statistics, 4(4):35 60, 1992.

Berger, J. O. and Yang, R.-Y. Noninformative priors and Bayesian testing for the AR (1) model. Econometric Theory, 10(3-4):461 482, 1994.

Berger, J. O., Bernardo, J. M., and Mendoza, M. On priors that maximize expected information. 1988.

Berger, J. O., Bernardo, J. M., and Sun, D. The formal definition of reference priors. The Annals of Statistics, 37 (2):905 938, 2009.

Bernardo, J. M. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(2):113 128, 1979.

Bernardo, J. M. Non-informative priors do not exist. Journal of Statistical Planning and Inference, 65(1):159 189, 1997.

Blahut, R. Computation of channel capacity and ratedistortion functions. IEEE transactions on Information Theory, 18(4):460 473, 1972.

Blondel, M. and Roulet, V. The Elements of Differentiable Programming. ar Xiv preprint ar Xiv:2403.14606, 2024.

Bonachela, J. A., Hinrichsen, H., and Munoz, M. A. Entropy estimates of small data sets. Journal of Physics A: Mathematical and Theoretical, 41(20):202001, 2008.

Borkar, V. S. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291 294, 1997.

Clarke, B. S. and Barron, A. R. Jeffreys prior is asymptotically least favorable under entropy risk. Journal of Statistical planning and Inference, 41(1):37 60, 1994.

Consonni, G., Fouskakis, D., Liseo, B., and Ntzoufras, I. Prior Distributions for Objective Bayesian Analysis. Bayesian Analysis, 13(2):627 679, 2018.

Diggle, P. J. and Gratton, R. J. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society: Series B (Methodological), 46 (2):193 212, 1984.

Donsker, M. D. and Varadhan, S. R. S. Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics, 28(1):1 47, 1975.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in neural information processing systems, 32, 2019.

Dyer, J., Cannon, P., Farmer, J. D., and Schmon, S. M. Black-box Bayesian inference for agent-based models. Journal of Economic Dynamics and Control, 161:104827, 2024a.

Dyer, J., Cannon, P., and Schmon, S. M. Approximate Bayesian Computation with Path Signatures. In Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, volume 244 of Proceedings of Machine Learning Research, pp. 1207 1231. PMLR, 15 19 Jul 2024b.

Learning Likelihood-Free Reference Priors

Fearnhead, P. and Prangle, D. Constructing summary statistics for approximate Bayesian computation: semiautomatic approximate Bayesian computation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(3):419 474, 2012.

Gao, Y., Ramesh, R., and Chaudhari, P. Deep reference priors: What is the best way to pretrain a model? In International Conference on Machine Learning, pp. 7036 7051. PMLR, 2022.

Greenberg, D. S., Nonnenmacher, M., and Macke, J. H. Automatic posterior transformation for likelihood-free inference. 36th International Conference on Machine Learning, pp. 4288 4304, 2019.

Grimmett, G. and Stirzaker, D. Probability and random processes. Oxford university press, 2001.

Hermans, J., Begy, V., and Louppe, G. Likelihood-free MCMC with amortized approximate ratio estimators. In International Conference on Machine Learning, pp. 4239 4248. PMLR, 2020.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. ar Xiv preprint ar Xiv:1611.01144, 2016.

Jarne Ornia, D., Dyer, J., Bishop, N. G., Calinescu, A., and Wooldridge, M. J. Model exploration through marginal likelihood entropy maximisation. In Neur IPS 2024 Workshop on Data-driven and Differentiable Simulations, Surrogates, and Solvers, 2024.

Kerr, C. C., Stuart, R. M., Mistry, D., Abeysuriya, R. G., Rosenfeld, K., Hart, G. R., N u nez, R. C., Cohen, J. A., Selvaraj, P., Hagedorn, B., et al. Covasim: an agent-based model of COVID-19 dynamics and interventions. PLOS Computational Biology, 17(7):e1009149, 2021.

Kingma, D. P. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Kozachenko, L. F. and Leonenko, N. N. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9 16, 1987.

Kraskov, A., St ogbauer, H., and Grassberger, P. Estimating mutual information. Physical Review E Statistical, Nonlinear, and Soft Matter Physics, 69(6):066138, 2004.

Lafferty, J. and Wasserman, L. Iterative Markov chain Monte Carlo computation of reference priors and minimax risk. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 293 300, 2001.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3744 3753. PMLR, 2019.

Letizia, N. A., Novello, N., and Tonello, A. M. Variational f-divergence and derangements for discriminative mutual information estimation. ar Xiv preprint ar Xiv:2305.20025, 2023.

Lueckmann, J.-M., Boelts, J., Greenberg, D., Goncalves, P., and Macke, J. Benchmarking simulation-based inference. In International conference on artificial intelligence and statistics, pp. 343 351. PMLR, 2021.

Mc Allester, D. and Stratos, K. Formal limitations on the measurement of mutual information. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 875 884. PMLR, 2020.

Miller, B. K., Weniger, C., and Forr e, P. Contrastive neural ratio estimation. Advances in Neural Information Processing Systems, 35:3262 3278, 2022.

Nalisnick, E. and Smyth, P. Learning approximately objective priors. ar Xiv preprint ar Xiv:1704.01168, 2017.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018.

Papamakarios, G. and Murray, I. Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in neural information processing systems, 29, 2016.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30, 2017.

Papamakarios, G., Sterratt, D., and Murray, I. Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows. In Proceedings of the Twenty Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 837 848. PMLR, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., K opf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Py Torch: An Imperative Style, High-Performance Deep Learning Library, 2019.

Learning Likelihood-Free Reference Priors

Pichler, G., Colombo, P. J. A., Boudiaf, M., Koliander, G., and Piantanida, P. A differential entropy estimator for training neural networks. In International Conference on Machine Learning, pp. 17691 17715. PMLR, 2022.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5171 5180. PMLR, 2019.

Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. Population growth of human y chromosomes: a study of y chromosome microsatellites. Molecular biology and evolution, 16(12):1791 1798, 1999.

Quera-Bofarull, A., Dyer, J., Calinescu, A., Farmer, J. D., and Wooldridge, M. Black BIRDS: Black-Box Inference fo R Differentiable Simulators. Journal of Open Source Software, 8(89), 2023.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530 1538. PMLR, 2015.

Song, J. and Ermon, S. Understanding the limitations of variational mutual information estimators. In International Conference on Learning Representations, 2020.

Thomas, O., Dutta, R., Corander, J., Kaski, S., and Gutmann, M. U. Likelihood-free inference by ratio estimation. ar Xiv preprint ar Xiv:1611.10242, 2016.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000.

Watts, D. J. and Strogatz, S. H. Collective dynamics of small-world networks. nature, 393(6684):440 442, 1998.

Wiese, S., Kaszowska-Mojsa, J., Dyer, J., Moran, J., Pangallo, M., Lafond, F., Muellbauer, J., Calinescu, A., and Farmer, J. D. Forecasting macroeconomic dynamics using a calibrated data-driven agent-based model. ar Xiv preprint ar Xiv:2409.18760, 2024.

Wood, S. N. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102 1104, 2010.

Yang, R. and Berger, J. O. A catalog of noninformative priors, volume 1018. Institute of Statistics and Decision Sciences, Duke University Durham, NC, USA, 1996.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, volume 30, 2017.

Zhang, Z. Discrete noninformative priors. Yale University, 1994.

Learning Likelihood-Free Reference Priors

A. Further Background

A.1. Proper and Improper Priors

An improper prior π on space Θ is one for which the quantity Z

Θ π(θ)dθ (22)

diverges (see, e.g., Berger et al., 2009). In contrast, a prior π is said to be proper if (22) is finite, such that π corresponds to (or can be appropriately normalised to correspond to) a probability measure. While not corresponding to probability distributions in their own right, improper priors are sometimes still used in Bayesian analysis if the posterior distribution they induce via Bayes theorem corresponds to a probability distribution; that is, if the posteriors they generate are proper.

A.2. Relationships between Mutual Information and Entropies

We provide some basic relationships between MI and entropies for the unfamiliar reader. With MI under prior π defined as

Iπ(x1:n, θ) := E θ π x1:n pθ

log πx1:n(θ)

we can separate out the logarithms to obtain

Iπ(x1:n, θ) = E θ π x1:n pθ [log πx1:n(θ) log π(θ)] (24)

= E θ π x1:n pθ [log πx1:n(θ)] E θ π x1:n pθ [log π(θ)] (linearity of expectation) (25)

= Ex1:n mπ θ πx1:n [ log πx1:n(θ)] Eθ π [ log π(θ)] (26)

= Ex1:n mπ Hπx1:n[θ]

+ Hπ[θ], (27)

revealing the relationship stated in (11). Finally, the relationship observed in (12) is obtained by recognising that

Ex1:n mπ Hπx1:n[θ] = E θ π x1:n pθ [log πx1:n(θ)] (28)

= E θ π x1:n pθ

log hπ(x1:n, θ)

(Bayes theorem) (29)

= E θ π x1:n pθ [log hπ(x1:n, θ)] E θ π x1:n pθ [log mπ(x1:n)] (30)

= E(x1:n,θ) hπ [ log hπ(x1:n, θ)] + Ex1:n mπ [ log mπ(x1:n)] (31)

= Hhπ [x1:n, θ] + Hmπ[x1:n]. (32)

B. Experimental Details

In this section, we provide additional details regarding each of our experiments including the neural architectures used to implement variational families and mutual information estimators. We also provide a more in depth overview of our training procedures.

B.1. Variational Prior Families

To describe the variational family Π in each of our experiments we use normalizing flows. Each flow is composed of a sequence of blocks. Each block is comprised of a masked affine autoregressive layer (Papamakarios et al., 2017) and an LU permute block (Durkan et al., 2019). If required, we use a sigmoid flow layer at the end of the flow to ensure that the flow s output are bounded to lie within the range support by the simulator of study. Details about the sigmoid flow layer are presented in Section B.2.

B.2. Sigmoid Flow Layer

The variational family Π must be constrained to produce parameters within the range supported by the simulator. As previously mentioned, this is achieved via a sigmoid flow layer. Let mi and Mi respectively define the minimum and

Learning Likelihood-Free Reference Priors

Algorithm 1 Flow Pretraining Procedure pretrain

Hyperparameters: Learning rate β, Epochs R, Pretrain epochs Rpre Input: Variational prior πϕ t = 1 for t Rpre do

Sample {θ(b)}B b=1 πB ϕ Compute loss Lϕ = 1

B PB b=1 log πϕ(θ(b)). Update prior parameters ϕ ADAMβ(ϕ, µ, ϕ ˆLϕ) end for

maximum permissible values of the parameter θi. Given an output z Rd from the previous flow layer, the sigmoid flow layer performs the following forward transformation:

ˆθi = mi + (Mi mi)σ(zi), i = 1, 2, . . . , d, (33)

where σ(zi) is the logistic sigmoid function. Note that θi is now guaranteed to lie in the range [mi, Mi]. The corresponding inverse (or backwards) transformation is given by:

zi = log (ˆθi mi) log (Mi ˆθi). (34)

Meanwhile, the log-determinants of the forward and backwards transformations can be readily computed:

log(Mi mi) + log σ(zi) + log (1 σ(zi)) ,

log(Mi mi) + log σ(zi) + log (1 σ(zi)) .

B.3. Notation for Network Architectures

Throughout the remainder of the Appendix, we will use the following notation to describe network architectures. We use MLP( ) to refer to fully connected multilayer perceptrons (MLPs). For example, MLP(10, 16, 32, 64, 1) refers to a three layer MLP with input dimension 10, hidden layer dimensions 16, 32, and 64, and final output dimension 1. Likewise, we use LSTMh( ) to refer to a stacked Long Short-Term Memory units (LSTM) with hidden state dimension h. For example, LSTM8(4, 2) denotes a stacked LSTM unit with input dimension 4 that is comprised of two LSTMs each with hidden dimension 8. Similarly, we use FLOW( ) to refer to a normalizing flow. For example, FLOW(4, 10, 16) refers to a normalizing flow with input dimension 4 composed of 10 blocks (as described in Section B.1) each with 16 hidden neurons. Enclosed brackets are used to denote concatenated network architectures For example (LSTM8(4, 1), MLP(8, 16, 1)) denotes a network which passes a time-series of 4-dimensional inputs through an LSTM, before passing the 8-dimensional final hidden state of the LSTM to an MLP, which produces a final 1-dimensional output.

Similarly, we use SET( , ) to describe the set encoder architecture proposed by (Zaheer et al., 2017). More specifically, SET(ρ, ϕ) first applies the network ϕ to each individual element of a set {xi}n i=1 before passing their sum to the network ρ, which produces the final output. As shown by Zaheer et al. (2017), such encoders are capable of representing any permutation-invariant function for sets of fixed size. As a result, set-encoders are well-suited to encoding the independently and identically distributed outputs {x(b) 1:n} generated by a simulator under parameter θ(b). We also use TRANSFORMER to refer to the transformer-based set encoder architecture proposed by (Lee et al., 2019). The transformer used is the same across all experiments and consists of two ISAB blocks and one PMA block, each with four attention heads and four inducing points.

B.4. Pretraining Procedure for Variational Lower Bound Methods

In all of our training procedures, we first pretrain the variational prior to maximise entropy over the parameter space. Experimentally, we found that this helped prevent the reference prior from getting stuck in local optima, and reduced the

Learning Likelihood-Free Reference Priors

Algorithm 2 Training Procedure with Variational Lower Bounds

Hyperparameters: Batch size B, interval l, Learning rates α, β, epochs R, pretrain epochs Rpre Inputs: Variational prior πϕ, Critic Tµ,φ, Simulator pθ ϕ pretrain(ϕ, β, Rpre) t = 1 for t R do

Sample {θ(b)}B b=1 πB ϕ Simulate x(b) 1:n pn θ(b) for b = 1, . . . , B.

Dϕ {(θ(b), x(b) 1:n)}B b=1 Compute ˆIφ,µ(Dϕ) via SMILE or Info NCE Update critic parameters µ, φ ADAMα((φ, µ), φ,µ ˆIφ,µ(Dϕ)) if t (mod l) = 0 then

Update prior parameters ϕ ADAMβ(ϕ, ϕ ˆIϕ(Dϕ)) end if t t + 1 end for

total number of epochs required for training to converge. As we use normalizing flows to implement the variational family Π in each of our experiments, the density of the variational prior can be easily evaluated. We exploit this fact to compute a plug-in estimate of the entropy and pretrain the prior using stochastic gradient descent. This pretraining loop is described in Algorithm 1.

B.5. Training Procedure for Variational Lower Bound Methods

Here, we outline the training procedure for learning a reference prior using a variational lower bound, such as SMILE or Info NCE. The general procedure is outlined in Algorithm 2. On each epoch a batch of B samples {θ(b)}B b=1 πϕ are sampled from the variational prior. Each θ(b) is run on the simulator n times to generate input-output pairs of the form Dϕ = {(θ(b), x(b) 1:n)}B b=1.

These pairs are used to estimate a variational lower bound on the mutual information ˆIφ,µ(Dϕ), which relies on a critic form Tφ = fφ(sφ( ), ) to score input-output pairs. We refer to as sφ as the encoder as it is responsible for encoding outputs into lower-dimension representations, whilst we refer to fφ as the score network as it is responsible producing a scalar value scoring a given parameter and its corresponding encoded output. In other words, for any input-output pair (θ, x1:n) the score is computed as Tµ,φ(θ, x1:n) = fµ(sφ(x1:n), θ).

As mentioned in the main body, we use two variational lower bounds in our experiments; Info NCE and SMILE. The Info NCE objective is given by

ˆIφ,µ(Dϕ) = 1

b=1 log Tµ,φ(x(b) 1:n, θ(b))

a =b Tµ,φ(x(b) 1:n, θ(a)) , (36)

while the SMILE objective is given by

ˆIφ,µ(Dϕ) = 1

b=1 Tµ,φ(x(b) 1:n, θ(b)) log 1 B(B 1)

a =b clip(e Tµ,φ(x(b) 1:n,θ(b)), e τ, eτ), (37)

where clip denotes the clipping function clip(a, b, c) = min(max(a, b), c). Given ˆIφ,µ(Dϕ), we then update ϕ, φ and µ using a standard gradient ascent rule such as Adam (Kingma, 2014) using ϕ,φ,µ ˆIφ,µ(Dϕ). Note that the parameters ϕ of the variational reference prior are only updated every l epochs. This provides the critic Tµ,φ time to adjust so that the variational reference prior has a good lower bound to train against. Additionally, for some complex simulators, we normalize the gradient to unit length before performing a gradient update.

The neural architectures used for each experiment are display in Table 3, whilst the training hyperparameters used are described in Table 4. The variational prior and the critic are updated with learning rates α and β respectively.

Learning Likelihood-Free Reference Priors

Table 3. Neural Architectures used in each SMILE and Info NCE experiment.

SIMULATOR VARIATIONAL FAMILY ENCODER SCORE NETWORK

GAUSSIAN SCALE FLOW(1, 4, 16) TRANSFORMER MLP(9, 64, 64, 64, 1) EXPONENTIAL FLOW(1, 4, 8) TRANSFORMER MLP(2, 32, 1) TRIANGULAR FLOW(1, 5, 16) SET(MLP(1, 16, 16, 1), MLP(1, 8, 8, 1)) MLP(2, 16, 16, 1) AR(1) FLOW(1, 6, 16) (LSTM16(1, 2), MLP(16, 1)) MLP(2, 8, 8, 1) SLCP FLOW(5, 6, 8) SET(MLP(100, 64, 32), MLP(32, 64, 32)) MLP(37, 64, 1) G-AND-K FLOW(4, 8, 16) MLP(k + 1, 128, 1) SIR FLOW(2, 8, 16) SET((LSTM8(3, 2), MLP(8, 8, 16, 8)), MLP(8, 8, 8, 8)) MLP(10, 16, 8, 1)

Table 4. Hyperparameter settings for Info NCE and SMILE experiments.

SIMULATOR RPRE EPOCHS GRADIENT NORM l B n τ α β

GAUSSIAN SCALE 200 3000 YES 1 256 50 10 5 10 3 5 10 4

EXPONENTIAL 200 2000 YES 1 256 50 5 10 3 5 10 4

TRIANGULAR 200 2000 NO 1 256 50 10 5 10 3 5 10 4

AR(1) 200 2500 YES 3 256 1 5 5 10 3 5 10 3

SLCP 1000 1000 YES 1 256 100 100 5 10 3 5 10 4

G-AND-K 1000 1000 YES 5 256 100 5 1 10 3 1 10 3

SIR 250 2000 NO 10 64 1 5 5 10 3 5 10 4

The following sections include details on our implementation of the GED method described in Section 4.1.

Architectures The GED method requires learning an estimator ˆπψ x1:n for the conditional distribution p(θ | x1:n). For consistency, we use the same general architecture5 for this estimator as for the proposal π, but we use a conditional normal distribution as the base distribution, and use an encoder sφ : Xn W that maps the sampled data x1:n to some latent variable w. As a result, the base distribution q0 for the normalizing flow ˆπψ x1:n is a conditional Gaussian denoted as q0(z | sφ(x1:n)). The way to apply this conditional is by having sφ(x1:n) R2 d where d is the number of parameters in the model. Then, q0(z | sφ(x1:n)) N(sφ(x1:n)1:d | sφ(x1:n)d:2d).

In other words, the encoder is trained to output the mean and the variance of the base flow distribution.

B.6.1. REFERENCE PRIOR TRAINING

The main training loop follows Algorithm 4. We set α β to induce a two time-scale learning dynamic. Both update steps are computed through an Adam optimizer (Kingma, 2014). As in section B.5, each parameter is simulated n times to ensure that the asymptotic posterior can be well approximated. The architectures and hyperparameters used are presented in Tables 5 and 6. The same architectures were used for both the prior and the conditional density estimator per experiment.

B.6.2. REFERENCE PRIOR TRAINING WITHOUT SIMULATOR GRADIENTS

For each experiment discussed in the main body, GED was used in conjunction with pathwise simulator gradients in order to provide a more like-for-like comparison with VLB methods. Next, we show how GED can be applied to arbitrary simulators by providing an an estimator for network gradients that does not rely on differentiability of the simulator.

To begin, we restate the objective function for GED:

Iψ,ϕ(Dϕ) = Eθ πϕ[ log πϕ(θ)] Eθ πϕEx pθ [ log πψ(θ | x)] . (38)

The derivative of the first term with respect to ϕ is easy to compute. Since the variational prior family has reparameterisable

5While we assume that the prior and posterior estimators belong to the same variational family in our experiments, in the interest of simplicity, this is not an essential aspect of this method, and in general they may belong to different variational families.

Learning Likelihood-Free Reference Priors

Algorithm 3 Flow Pretraining Procedure pretrain-conditional

Hyperparameters: Learning rate α, Pretrain epochs Rpre, Batch size B Input: Variational prior πϕ, Density Estimator ˆπψ

t = 1 for t Rpre do

Sample {θ(b), x(b) 1:n}B b=1 hπϕ Compute loss Lψ = 1

B PB b=1 log ˆπψ(θ(b) | sφ(x(b) 1:n)). Update estimator ψ ADAMβ(ψ, ϕ ˆLψ) end for

Algorithm 4 Training for GED

Hyperparameters: Batch size B, Interval l, Learning rates α, β, epochs R, pretrain epochs Rpre Input: Variational prior πϕ, Simulator pθ, encoder sφ, density estimator ˆπψ x1:n ϕ pretrain(ϕ, β, Rpre) ψ pretrain-conditional(ϕ, α, Rpre) t = 1 for t R do

Sample θ(b) π, repeat each sample k times (b = 1, . . . , B k). Simulate x(b) 1:n, θ(b) hπϕ, b = 1, . . . , B k. for 1 i l do

Update ψ ADAMα

ψ, log ˆπψ x(b) 1:n(θ(b))}

end for Update ϕ ADAMβ ϕ, ˆHπ[θ] ˆHψ πx1:n[θ]

t t + 1 end for

sampling paths, we have θ = gϕ(z) where gϕ is a differentiable function (such as flow) and z t is a random sample from a base distribution t independent of ϕ. Using the Law of the Unconscious Statistician (Grimmett & Stirzaker, 2001), we obtain

ϕEθ πϕ[ log πϕ(θ)] = ϕEz t[ log πϕ(gϕ(z))] (39)

= Ez t[ ϕ log πϕ(gϕ(z))] (40)

n=1 ϕ log πϕ(gϕ(z(n))). (41)

Meanwhile, the derivative of the second term with respect to ϕ can be estimated as follows:

ϕEθ πϕEx pθ [ log πψ(θ | x)] = Eθ πϕ [ ϕ log πϕ(θ) Ex pθ [ log πψ(θ | x)]] (42)

= Eθ πϕEx pθ [ ϕ log πϕ(θ) log πψ(θ | x)] (43)

n=1 ϕ log πϕ(θ(n)) log πψ(θ(n) | x(n)) (44)

where x(n), θ(n) are drawn jointly by first sampling θ(n) from πϕ, and then forward-simulating. In practice, Equation (44) can be evaluated (Paszke et al., 2019) by applying a .stop_gradient() method (e.g., .detach() in Py Torch ) to θ(n) after it is sampled from the prior. Summarising, we may compute the full gradient in Py Torch as follows:

h log πϕ(θ(n)) + log πϕ(θ(n).detach()) log πψ(θ(n).detach() | x(n).detach()) i . (45)

Figure 6 shows the results of applying the above described gradient-free GED method to three of the models studied. As it can be seen quality of the learned priors differs slightly when compared to the differentiable models, but is still representative

Learning Likelihood-Free Reference Priors

Table 5. Neural Architectures Used in Each GED experiment.

SIMULATOR VARIATIONAL FAMILY ENCODER

GAUSSIAN SCALE FLOW(1, 8, 16) SET(ID, MLP(64, 128, 2)) EXPONENTIAL FLOW(1, 4, 16) SET(ID, MLP(64, 128, 2)) TRIANGULAR FLOW(1, 8, 16) SET(ID, MLP(64, 128, 2)) AR(1) FLOW(1, 8, 16) SET((LSTM32(1, 2), MLP(32, 64)), MLP(64, 128, 2)) SLCP FLOW(1, 8, 16) SET(MLP(100, 64, 64), MLP(64, 128, 2)) G-AND-K FLOW(4, 8, 16) MLP(8, 16, 8) SIR FLOW(3, 8, 16) SET(LSTM8(3, 2), MLP(8, 16, 6))

Table 6. Hyperparameter settings for GED.

SIMULATOR RPRE EPOCHS GRADIENT NORM l B n α β

GAUSSIAN SCALE 100 1000 YES 2 256 1 5 10 3 1 10 4

EXPONENTIAL 100 1000 YES 2 128 8 5 10 3 1 10 4

TRIANGULAR 100 1000 YES 2 256 8 5 10 3 1 10 4

AR(1) 100 2000 YES 2 128 1 5 10 3 1 10 4

SLCP 1000 3000 YES 4 256 100 5 10 3 1 10 4

G-AND-K 100 2000 YES 4 256 128 5 10 3 1 10 4

SIR 100 2000 YES 4 64 1 5 10 3 5 10 4

of the true underlying reference prior. As a qualitative note, training reference priors without model gradients seem to yield higher variance learning processes, which (in this case) was compensated by slightly increasing the batch sizes and learning rates.

B.7. Entropy Estimation for Models with Tractable Likelihood functions

For simulators with tractable likelihood functions, we are able to obtain Monte Carlo estimates for the MI between x1:n generated from the model at θ and θ π using the decomposition

I(x1:n, θ) = Hmπ[x1:n] Eθ πHpθ[x1:n]. (46)

Using the the Law of the Unconscious Statistician (Grimmett & Stirzaker, 2001) and the model s likelihood function, the first term in Equation (46) may be estimated as follows:

Hmπ[x1:n] = Ex1:n mπ [ log Eθ π [pθ (x1:n)]] (47)

r=1 pθ(r)(x(b) 1:n)

, θ(r) π, x(b) 1:n mπ (48)

This provides a (biased) estimator for Hmπ[x1:n]. The second term can be estimated this straightforwardly as follows:

Eθ πHpθ[x1:n] = Eθ πEx1:n pθ [ log pθ(x1:n)] (49)

b=1 log pθ(b)(x(b) 1:n), x(b) 1:n, θ(b) hπ. (50)

C. Further Details on Complex Experiments

C.1. Simple likelihood, complex posterior with distractors

Our implementation of SLCP with distractors follows that detailed in Appendix T.4 of (Lueckmann et al., 2021). In particular, distractors are sampled in from a mixture of t-distributions

(δi)100 i=9 1

j=1 t2(µj, Σj), (51)

Learning Likelihood-Free Reference Priors

(a) Gaussian Scale Model (b) Exponential Model (c) Triangular Model

Figure 6. Reference prior learned for each model, with and without gradients, for the GED method.

where µi N(0, 152I), Σi j,k N(0, 9) if j > k, Σj,j = 3ea, where a N(0, 1), and Σi j,k = 0 otherwise.

C.2. The SIR agent-based model

We use an SIR simulator similar in nature to the model supplied as part of the Black BIRDS software package (Quera Bofarull et al., 2023). In order to make the model differentiable, we make use of surrogate Gumbel-Softmax gradients (Jang et al., 2016) to backpropagate through Bernoulli random variables used to determine the discrete transitions between agent states. Each simulation involves simulating 200 agents for T = 50 time steps on a Watts-Strogatz random network (Watts & Strogatz, 1998), initialised with 10 edges per node and a rewiring probability of 0.1.

C.3. The g-and-k model

The g-and-k model appears frequently as a benchmark case study for SBI methods (see, e.g., Fearnhead & Prangle, 2012). Inference is challenging for this model due to its ability to produce a wide range of data distributions from relatively few parameters. The generative process is as follows: for zt N(0, 1) and parameters θ = (a, b, g, k) Θ = [0, 5]4, output xt R is generated as

xt = a + b 1 + c1 exp( gzt)

1 + exp( gzt)

1 + z2 t k zt., (52)

where it is customary to take c = 0.8 as fixed. In this case, the true (n-)reference prior isn t available. However, we can expect two features to be present in the g-and-k reference prior based on the roles the parameters play in the data-generating process, (52): since a plays the role of a location parameter, we can expect the reference prior to be flat in a (Yang & Berger, 1996); in contrast, since b is a scale parameter determining the magnitude of the contribution from the second term in (52), we might expect the reference prior to decay as approximately 1/b (Yang & Berger, 1996).

The RPs obtained by GED, SMILE and Info NCE for the same random seed are plotted in Figure 7. We emphasise that each method failed to generate consistent prior estimates across random seeds, suggesting that the asymptotic RP associated with the g-and-k model is particularly difficult to estimate. We conjecture that this is due to the heavy tails of the g-and-k distribution. The development of RP estimators that perform consistently on the g-and-k model forms an interesting open challenge.

Learning Likelihood-Free Reference Priors

Figure 7. Learned RPs for the g-and-k model.