Published in Transactions on Machine Learning Research (02/2024)

Hierarchical Neural Simulation-Based Inference Over Event Ensembles

Lukas Heinrich l.heinrich@tum.de Technical University of Munich
Siddharth Mishra-Sharma smsharma@mit.edu MIT, Harvard University, IAIFI
Chris Pollard christopher.pollard@warwick.ac.uk University of Warwick
Philipp Windischhofer windischhofer@uchicago.edu University of Chicago

Reviewed on OpenReview: https://openreview.net/forum?id=Jy2IgzjoFH

When analyzing real-world data, it is common to work with event ensembles, which comprise sets of observations that collectively constrain the parameters of an underlying model of interest. Such models often have a hierarchical structure, where local parameters impact individual events and global parameters influence the entire dataset. We introduce practical approaches for frequentist and Bayesian dataset-wide probabilistic inference in cases where the likelihood is intractable, but simulations can be realized via a hierarchical forward model. We construct neural estimators for the likelihood(-ratio) or posterior and show that explicitly accounting for the model's hierarchical structure can lead to significantly tighter parameter constraints. We ground our discussion using case studies from the physical sciences, focusing on examples from particle physics and cosmology.

1 Introduction

Datasets composed of multiple samples are ubiquitous in scientific and, more broadly, real-world data analysis tasks. These datasets are typically governed by underlying models that exhibit a rich hierarchical structure, with local parameters shaping individual events while global parameters exert influence across the entire dataset. This layered structure, if appropriately utilized, can greatly augment the efficiency and effectiveness of the inference process.
The complexity of scientific models and the high dimensionality of datasets have led to a recent surge of interest in implicitly-specified models, where the likelihood function is intractable but simulations can be realized via mechanistic forward modeling. The paradigm of simulation-based inference (SBI), augmented using tools from machine learning and differentiable optimization more broadly, has emerged as a powerful approach for performing inference in such scenarios (Cranmer et al., 2020). However, the majority of existing methods for simulation-based inference are designed for learning from individual data points and do not fully capitalize on the hierarchical structure of the data-generating process. To address these gaps, we propose a set of novel approaches that augment existing simulation-based inference techniques, in both frequentist and Bayesian paradigms, with the goal of exploiting the hierarchical structure of the governing models. We contextualize our discussion using case studies from the physical sciences, with a particular focus on particle physics (particle collider data) and astrophysics (strong gravitational lensing images).

In the present work we make several methodological contributions that are necessary for optimal deployment of simulation-based inference methods on large-scale scientific datasets. Our primary contributions are the following:

- We substantiate, theoretically as well as empirically, the fact that hierarchy-aware inference in many implicit models with a hierarchical structure requires a dataset-wide approach, contrasted with the more common paradigm of combining implicit likelihood or posterior estimators associated with individual observations. By "hierarchy-aware inference", we refer to building a (still approximate) posterior or likelihood-ratio estimator that nevertheless takes into account the full (in our case, hierarchical) structure of the assumed forward model.
- We systematically derive conditions under which such a hierarchical approach is necessary for correct inference, depending on how the model parameters are partitioned into local and global parameters of interest and nuisance parameters;
- We introduce frequentist as well as Bayesian methods for dataset-wide learning that can be used to perform simulation-based inference over event ensembles and can deal with datasets of varying cardinality. We connect our insights to several common use cases in the physical sciences and show how popular simulation-based inference paradigms can be adapted for optimal dataset-wide learning. In particular, we introduce the first approach for end-to-end frequentist simulation-based inference targeting hierarchical set-valued data;
- We show that our machine learning-based inference methods are generically substantially faster than traditional approaches, such as Markov Chain Monte Carlo (MCMC) methods, even when the likelihood is tractable, while giving consistent results. They also allow performing inference in streaming mode where, e.g., posterior estimates are efficiently updated in real time as new observations are made, without having to re-analyze the entire updated dataset.

The paper is organized as follows. In Sec. 2 we provide a background discussion of hierarchical frequentist as well as Bayesian inference, and describe the class of forward models we target. We discuss related work in the context of this background in Sec. 3. In Sec. 4 we describe the methods and architectures used in this paper. Section 5 presents several case studies, including a simple toy example, a particle physics example, and an example from astrophysics, in increasing order of complexity. We conclude in Sec. 6.

2 Theory and background

We use the following notation throughout: x_i refers to an observation associated with an individual event, θ are the global (dataset-wide) parameters, and z_i are the local (per-event) parameters.
Ensembles of either observations or variables are denoted with curly braces, {x} ≡ {x_i}_{i=1}^N. We will also differentiate nuisance parameters, which are parameters associated with the data-generating process that we are generally not interested in inferring; these will be treated through either profiling (defined later) or marginalization. Nuisance parameters will be denoted by ν, or θ_ν and z_ν when we wish to distinguish between global and local nuisance parameters.

2.1 Hierarchical models

The joint probability distribution of a set of events with cardinality N, with global parameters of interest θ, global nuisance parameters θ_ν, as well as local parameters z_i, can be written as:

$$ p(\{x\} \mid \{z\}, \theta, \theta_\nu) = \prod_{i=1}^{N} p(x_i \mid z_i, \theta, \theta_\nu) \,. \qquad (1) $$

In many scientific applications, the number of events in a dataset may not be known a priori. This could occur if events are observed sequentially (such as gravitational wave events observed over time), if we aim to apply the same model to datasets with different event counts (e.g., astronomical observations of sky patches containing different numbers of stars/galaxies), or if the rate of events is itself a model parameter that may be interesting to measure (e.g., an interaction cross section at a particle collider experiment). In these cases, we can generalize the joint probability to an extended model:

$$ p(\{x\} \mid \{z\}, \theta, \theta_\nu) = p(N \mid \theta) \prod_{i=1}^{N} p(x_i \mid z_i, \theta, \theta_\nu) \,. \qquad (2) $$

Here, p(N | θ) represents the probability of observing N events given the global parameters θ. Often, p(N | θ) is a Poisson distribution, and the full model is known as a marked Poisson process. Our proposed techniques apply to both situations, and we present example applications described by equation 1 as well as equation 2 below.
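The generative structure of equation 2 can be sketched as a short simulator. This is a minimal illustration, not one of the paper's concrete forward models: the Gaussian choices for p(z | θ) and p(x | z, θ) and the parameter names `rate` and `mu_z` are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(theta, rng):
    """Draw one event ensemble from a toy hierarchical forward model.

    `theta` bundles the global parameters; the Gaussian choices for
    p(z | theta) and p(x | z, theta) are illustrative assumptions.
    """
    N = rng.poisson(theta["rate"])              # N ~ p(N | theta)
    z = rng.normal(theta["mu_z"], 1.0, size=N)  # local parameters z_i ~ p(z | theta)
    x = rng.normal(z, 0.5)                      # observations x_i ~ p(x | z_i, theta)
    return z, x

z, x = sample_dataset({"rate": 50, "mu_z": 1.0}, rng)
```

Each call produces a dataset of random cardinality, with one local parameter per event; only the global parameters are shared across the ensemble.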
2.2 Bayesian inference

In the Bayesian paradigm, we introduce a prior p({z}, θ, θ_ν) = p(θ, θ_ν) ∏_i p(z_i | θ, θ_ν) and wish to target the posterior distribution over the global as well as local parameters of interest (and marginalizations thereof) given a dataset. In special cases, it is possible to combine event-level inferences to construct the dataset-wide inference. For example, in the non-extended case one can construct the full dataset-wide posterior from the event-level posteriors p(θ, θ_ν, z_i | x_i),

$$ p(\theta, \theta_\nu, \{z\} \mid \{x\}) = \frac{1}{p(\theta, \theta_\nu)^{N-1}} \prod_{i=1}^{N} p(\theta, \theta_\nu, z_i \mid x_i) \,, \qquad (3) $$

where the prior factor in the denominator ensures that the prior density is not counted multiple times. Similarly, if one has access to the per-event posteriors marginalized over local parameters z_i, and the target quantity is the dataset-wide posterior marginalized over local nuisance parameters, a combination is possible due to the independence of the events and the assumption that the various priors also factorize:

$$ p(\theta \mid \{x\}) = \int d\{z_\nu\}\, p(\theta, \{z_\nu\} \mid \{x\}) \propto p(\theta) \prod_{i=1}^{N} \int dz_{\nu,i}\, p(x_i \mid \theta, z_{\nu,i})\, p(z_{\nu,i} \mid \theta) \propto \frac{1}{p(\theta)^{N-1}} \prod_{i=1}^{N} p(\theta \mid x_i) \,. \qquad (4) $$

However, in the general case such combinations do not hold; e.g., it is not possible to combine per-event posteriors marginalized over global nuisance parameters, p(θ | x_i) = ∫ dθ_ν p(θ, θ_ν | x_i), to construct the dataset-wide posterior p(θ | {x}) marginalized over global and local nuisance parameters. Therefore, a general solution for marginalized dataset-wide posteriors necessitates a dataset-wide approach to inference. This can be done either by inferring global parameters at a dataset-wide level (in particular if the dimensionality of the resulting posterior is sufficiently wieldy), or by performing dataset-level marginalization.

2.3 Frequentist inference

In the frequentist setting, inference on parameters is typically split into procedures for point estimation and interval estimation.
A popular choice for a point estimator is the maximum-likelihood estimator (MLE), due to its favorable asymptotic property of being the minimum-variance unbiased estimator. Interval estimation and hypothesis testing are based on a subjective choice of a test statistic t_{θ,θ_ν}(x) and a desired confidence level:

$$ \hat{\theta}, \hat{\theta}_\nu = \operatorname{argmin}_{\theta, \theta_\nu} \left[ -\log L_{\{x\}}(\theta, \theta_\nu) \right]; \qquad I^{\alpha}_{\theta,\theta_\nu} = \left\{ \theta \mid t_{\theta,\theta_\nu}(\{x\}) < t^{\alpha}_{\theta,\theta_\nu} \right\} \,, \qquad (5) $$

where L_{x}(θ, θ_ν) is a function proportional to the likelihood p({x} | θ, θ_ν), and t^α_{θ,θ_ν} is the value of the quantile function at 1 − α cumulative probability, serving as the threshold value for the interval. Here, α indicates the size of a hypothesis test, with a typical value being α = 0.05. As in the Bayesian case, one is often interested in an inference that only considers the parameters of interest θ and removes any dependence on nuisance parameters θ_ν. The frequentist analogue to (partial) marginalization is (partial) optimization (also referred to as "profiling"). For interval estimation and hypothesis testing, the test statistic is thus modified to read

$$ t_\theta(\{x\}) = -2 \log \frac{p(\{x\} \mid \theta, \hat{\hat{\theta}}_\nu)}{p(\{x\} \mid \hat{\theta}, \hat{\theta}_\nu)} \,, \qquad (6) $$

where $\hat{\hat{\theta}}_\nu = \hat{\hat{\theta}}_\nu(\{x\}, \theta)$ is the optimal value for θ_ν when θ is fixed to a particular value. Similarly to the Bayesian case, it is in general not possible to reconstruct the dataset-wide nuisance-free inferences from per-event nuisance-free inferences: in general, the dataset-wide profile likelihood ratio cannot be constructed from per-event profile likelihood ratios.

2.4 Simulation-based inference

When the likelihood is not tractable but simulations can be realized via forward modeling, neural simulation-based inference methods can be leveraged to perform inference. In cases where event-level inference quantities can be combined to yield a correct dataset-wide quantity, neural posterior or likelihood-ratio estimators trained on individual events can be combined.
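A valid event-level combination can be checked numerically in a tractable conjugate case. The sketch below (a 1-D Gaussian likelihood with a Gaussian prior; all scales are assumed for illustration) verifies on a grid that the product of per-event posteriors, divided by the prior raised to the power N − 1, reproduces the direct dataset-wide posterior, as in equations 3 and 4:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_grid = np.linspace(-5.0, 5.0, 2001)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0            # likelihood and prior scales (assumed)
x = rng.normal(1.0, sigma, size=8)            # a small event ensemble

def normal_pdf(v, mu, sd):
    return np.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

prior = normal_pdf(theta_grid, mu0, sigma0)
d_theta = theta_grid[1] - theta_grid[0]

# Per-event posteriors p(theta | x_i), each normalized on the grid.
per_event = np.array([normal_pdf(xi, theta_grid, sigma) * prior for xi in x])
per_event /= per_event.sum(axis=1, keepdims=True) * d_theta

# Combination rule: product of per-event posteriors over the prior^(N-1).
combined = np.prod(per_event, axis=0) / prior ** (len(x) - 1)
combined /= combined.sum() * d_theta

# Direct dataset-wide posterior for comparison.
direct = prior * np.prod([normal_pdf(xi, theta_grid, sigma) for xi in x], axis=0)
direct /= direct.sum() * d_theta
```

After normalization the two grids agree to numerical precision, since the combination merely rearranges the same factors of likelihoods and a single prior.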
For example, marginal posteriors p(θ, θ_ν | x_i) can be trained and combined using equation 4. Alternatively, a per-event likelihood-ratio estimator parameterized by all parameters, including nuisance parameters, can be learned (Cranmer et al., 2015; Nachman & Thaler, 2021), yielding a neural estimate of the per-event likelihood ratio $s_\phi(x_i, \theta, \theta_\nu) \approx p(x_i \mid \theta, \theta_\nu) / p(x_i \mid \theta_0, \theta_{\nu,0})$. These can be used to build a dataset-wide likelihood ratio $\prod_i p(x_i \mid \theta, \theta_\nu) / p(x_i \mid \theta_0, \theta_{\nu,0})$ that can serve as input to a dataset-wide statistical analysis, yielding posteriors or likelihood ratios which can then be marginalized or profiled over to obtain dataset-level quantities. While this simplifies the training procedure, as the networks only perform inference for single events, a disadvantage of this approach is that the downstream statistical treatment can be very costly if the number of nuisance parameters θ_ν is large, so that the amortization gains from the neural estimates are not fully exploited. Given these nuances, methods for set-wide simulation-based inference necessitate architectures which can efficiently model joint posteriors or likelihood ratios over global and local parameters of interest, for sets of varying cardinality, without requiring explicit parameterization over a potentially large number of nuisance parameters (which would otherwise be necessary for hierarchy-aware inference with event-level estimators). In the following, we describe two such architectures for amortized dataset-wide inferences.

3 Related work

The methods presented in this paper build upon previous studies across several disciplines. A likelihood-free algorithm for obtaining frequentist profile likelihood ratios corresponding to individual observations was previously presented in Heinrich (2022). Also in the context of frequentist inference, Nachman & Thaler (2021) explored dataset-wide inference in the context of particle collider studies, without distinguishing nuisance parameters and parameters of interest.
Hierarchical inference for implicit models has been studied using Gaussian posterior density estimators in Wagner-Carena et al. (2021; 2022), where again individual (per-event) posteriors are learned and combined to obtain estimates for the global parameters. Agrawal & Domke (2021) considered amortized variational inference for a simple class of hierarchical models. While similar in spirit, we consider hierarchical inference in the language of modern scientific simulation-based inference methods, and discuss subtleties associated with treating classes of variables in different hierarchies (local and global) as nuisance parameters. Hierarchical Neural Posterior Estimation (Rodrigues et al., 2021) is a recently-introduced simulation-based inference method closely related to the deep set-based Bayesian inference model used here. We complement this approach by introducing frequentist analogues for dataset-wide learning; by including amortized estimators that can target datasets of varying sizes; and by comparing the proposed approach to more traditional methods in terms of computational inference time. Finally, Geffner et al. (2022) introduced a method for efficient set-wide learning by combining score estimators targeting a subset of events. Our method is complementary, focusing on the unexplored setting of hierarchical models and bridging applications to domain sciences.

4 Methods

Our goal is to target the graphical models described by equation 1 and equation 2 in a simulation-based inference setting.
We wish to construct amortized estimators for the likelihood ratio or posterior corresponding to these models such that they (1) are sensitive to the full hierarchical structure of the forward model, i.e., take a dataset-wide view, treating local and global parameters separately, (2) distinguish between nuisance parameters and parameters of interest at various hierarchy levels, and (3) can process datasets of varying cardinality. While previous related works, discussed in Sec. 3, have considered subsets of these requirements, our goal is to target all three, with a particular eye towards scientific applications where event ensembles are common. We consider the following architectures, which satisfy these desiderata.

Deep set-based model. Here we consider a simple deep sets-inspired (Zaheer et al., 2017) architecture, where per-event encoded features s^glob_ϕ(x_i) are aggregated via a permutation-invariant pooling ⊕ and passed through a decoder network g_ψ that outputs the parameters of a variational posterior distribution for the global parameters, q^glob_φ(θ | {x}) (Papamakarios & Murray, 2016). At the same time, local per-event embeddings s^loc_ϕ(x_i) are used to condition a density estimator for the local parameters, q^loc_φ′(z | {x}, θ), including a conditional dependence on the global parameters θ. In practice, the local and global embeddings are obtained by chunking a feature vector obtained with s_ϕ. For the variational ansatz we consider either a multivariate normal distribution or a normalizing flow (Rezende & Mohamed, 2015), with the decoder outputting either the mean and covariance of the normal distribution or the conditioning context for a normalizing flow. The (negative) sum of the local and global posterior log-densities is used as the minimization objective,

$$ \mathcal{L} = -\log q^{\mathrm{glob}}_{\varphi}\Big(\theta \,\Big|\, \bigoplus_{i=1}^{N} s^{\mathrm{glob}}_{\phi}(x_i)\Big) - \sum_{i=1}^{N} \log q^{\mathrm{loc}}_{\varphi'}\big(z_i \mid s^{\mathrm{loc}}_{\phi}(x_i), \theta\big) \,. \qquad (7) $$

We derive and further motivate this loss function in App. B.
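The structure of the deep set forward pass and the objective of equation 7 can be sketched in a few lines. This is a minimal numpy illustration with untrained random weights and Gaussian variational heads; the network sizes and the mean-pooling choice are assumptions for demonstration, not the paper's trained configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def init(d_in, d_hid, d_out, rng):
    """Random weights for a tiny two-layer MLP (untrained; illustration only)."""
    return (rng.normal(0, 0.1, (d_in, d_hid)), np.zeros(d_hid),
            rng.normal(0, 0.1, (d_hid, d_out)), np.zeros(d_out))

def mlp(params, h):
    W1, b1, W2, b2 = params
    return np.tanh(h @ W1 + b1) @ W2 + b2

d_x, d_emb, d_theta = 5, 8, 2
enc = init(d_x, 16, 2 * d_emb, rng)       # shared per-event encoder s_phi
dec = init(d_emb, 16, 2 * d_theta, rng)   # global decoder g_psi -> (mean, log-std)
loc = init(d_emb + d_theta, 16, 2, rng)   # local head -> (mean, log-std) per z_i

def gauss_logpdf(v, mu, log_sig):
    return float(np.sum(-0.5 * ((v - mu) / np.exp(log_sig)) ** 2
                        - log_sig - 0.5 * np.log(2 * np.pi)))

def loss(x_set, theta, z):
    emb = mlp(enc, x_set)                            # per-event features
    s_glob, s_loc = emb[:, :d_emb], emb[:, d_emb:]   # chunked global/local embeddings
    pooled = s_glob.mean(axis=0)                     # permutation-invariant pooling
    mu_g, log_sig_g = np.split(mlp(dec, pooled), 2)
    lp_glob = gauss_logpdf(theta, mu_g, log_sig_g)
    ctx = np.concatenate([s_loc, np.tile(theta, (len(x_set), 1))], axis=1)
    out = mlp(loc, ctx)
    lp_loc = sum(gauss_logpdf(z[i], out[i, :1], out[i, 1:]) for i in range(len(x_set)))
    return -(lp_glob + lp_loc)                       # negative log-density objective

x_set = rng.normal(size=(7, d_x))
theta = rng.normal(size=d_theta)
z = rng.normal(size=(7, 1))
val = loss(x_set, theta, z)
```

Because the global term only sees the pooled embedding and the local term sums over events, the loss is invariant under a joint permutation of the events and their local parameters.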
For models of the form of equation 2, the cardinality of the input set is drawn as N ~ p(N | θ) during training. The density estimator parameters {φ, φ′} as well as the embedding network and decoder parameters ϕ and ψ are optimized simultaneously. We note that this is similar to the set-up used in Rodrigues et al. (2021). Fig. 1 shows a schematic illustration of this deep set-based architecture.

Transformer-based model. As an alternative to deep set-based aggregation with varying cardinality at training time, we consider a decoder-only transformer (Vaswani et al., 2017; Phuong & Hutter, 2022) which takes feature embeddings from individual events and produces cumulative posterior embeddings using blocks of self-attention and dense layers, applying a causal mask to ensure that output embeddings are only sensitive to preceding events in the sequence. The embeddings are then used to predict the posterior mean and covariance of the variational Gaussian ansatz. A potential advantage of the transformer approach is that, assuming a uniform distribution on the cardinality N, we do not need to vary the cardinality of the input set during training. Since the loss consists of a sum of log-likelihood terms corresponding to cumulative posteriors, one for each cardinality N up to some specified maximum value, all cardinalities are considered together:

$$ \mathcal{L} = -\sum_{N} \log q^{\mathrm{glob}}_{\varphi}\big(\theta \mid \{x\}_{i=1}^{N}\big) - \sum_{i} \log q^{\mathrm{loc}}_{\varphi'}\big(z_i \mid x_i, \theta\big) \,, $$

leaving the data-point encodings implicit. Networks that used LSTM cells to iteratively update a posterior embedding based on incoming data embeddings were also considered, but their convergence properties were noticeably poorer than those of the deep set- and transformer-inspired architectures described above.

5 Experiments

We describe several case studies of dataset-wide learning, ranging from illustrative toy experiments to more prototypical examples representing problems in particle physics and astrophysics. We refer to App.
A for additional details on the experiments, including details on training. All experiments are implemented using PyTorch (Paszke et al., 2019).

[Figure 1: Schematic illustration of the deep set-based architecture used in this work, showing the observations x_i, the per-event embedding network, the per-event (local) parameters z_i and event-level (local) posteriors, the set-level network, the global parameters, the dataset-level (global) posterior, and the likelihood-ratio test statistic of the frequentist setting. The red lines/arrows show the path used only in the frequentist setting for training a global test-statistic estimator while profiling over global nuisance parameters.]

5.1 Simple multi-variate normal likelihood

The forward model. To verify the ability of our method to recover the true posterior distribution for sets of varying cardinality, we consider a simple multivariate normal likelihood with known covariance matrix Σ,

$$ p(\{x\} \mid \theta, \Sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \theta, \Sigma) \,. $$

The model parameter is the mean vector θ, and µ_0, Σ_0 are the hyperparameters of the multivariate normal prior, p(θ) = N(µ_0, Σ_0). In this case, the posterior distribution p(θ | {x}) is also a multivariate normal (the prior and posterior are conjugate), with mean µ_post and covariance Σ_post given by

$$ \mu_{\mathrm{post}} = \left(\Sigma_0^{-1} + N\,\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + N\,\Sigma^{-1}\bar{x}\right), \qquad \Sigma_{\mathrm{post}} = \left(\Sigma_0^{-1} + N\,\Sigma^{-1}\right)^{-1}, $$

where x̄ is the sample mean of the x_i and N is the total cardinality of the dataset. We choose Σ = diag(2, 4, 6), with each sample consisting of 5 draws from the multivariate normal distribution; each individual data point then consists of 15 features.

Inference. We train the deep set and transformer architectures described in Sec. 4 on 50,000 samples drawn from this likelihood with prior p(θ) = N(0, 3) and a maximum sequence length of N_max = 200.
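The conjugate update used as ground truth here can be checked numerically. The sketch below assumes Σ = diag(2, 4, 6) from the text and an isotropic N(0, 3·I) prior (the isotropic form is an assumption); with N events the posterior covariance shrinks approximately as Σ/N and the posterior mean approaches the sample mean:

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.diag([2.0, 4.0, 6.0])
Sigma0 = 3.0 * np.eye(3)               # assumed isotropic prior covariance
mu0 = np.zeros(3)
theta_true = np.array([1.0, -1.0, 0.5])  # illustrative true mean vector

def posterior(x):
    """Conjugate Gaussian update for N i.i.d. draws x ~ N(theta, Sigma)."""
    N = len(x)
    prec = np.linalg.inv(Sigma0) + N * np.linalg.inv(Sigma)
    Sigma_post = np.linalg.inv(prec)
    mu_post = Sigma_post @ (np.linalg.inv(Sigma0) @ mu0
                            + N * np.linalg.inv(Sigma) @ x.mean(axis=0))
    return mu_post, Sigma_post

x = rng.multivariate_normal(theta_true, Sigma, size=200)
mu_post, Sigma_post = posterior(x)
```

For N = 200 the prior contribution is already subdominant, which is the regime probed by the longest sequences in this experiment.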
The cardinality of the training set is randomly varied as N ~ Unif(1, N_max). Further details on the training procedure are provided in App. A.

Results. The model is then tested on 500 new sequences, and the distribution (median and middle-68% containment) of the inferred standard deviation σ for each of the 3 parameters is shown in Fig. 2 (left: deep set; middle: transformer), compared to the true expected scaling of the parameters (dashed lines). We show σ_1, σ_2, and σ_3, which are the diagonal entries of Σ_post, the covariance of the posterior on the target mean parameters θ. The deep set model typically gives more faithful posterior widths compared to the transformer, potentially due to the relatively simple nature of the problem combined with the more data-hungry nature of the transformer architecture. Given this, we restrict our experiments in subsequent sections to the deep set-based architecture, noting that the transformer approach may nevertheless yield advantages depending on the specific structure of the forward model. The right plot shows the evolution of the posterior for a specific sequence using the deep set model, illustrating convergence of the posterior mass around the true point as more data points are included. We note that although this is a simple, analytically tractable example, the final posterior is obtained by analyzing a dataset of dimensionality 3000, which is far from a trivial task.

Figure 2: The median and middle-68% containment of the inferred posterior width over 500 test samples as a function of the size of the dataset. Results for the deep set (left) and transformer (middle) architectures are shown.
The expected scaling (dashed lines) is observed in all cases, but the deep set architecture shows a narrower spread in outcomes. The right plot shows the evolution of the posterior for a specific sequence using the deep set model, illustrating convergence of the posterior mass around the true point.

5.2 Mixture models in particle physics: frequentist treatment

We demonstrate dataset-level profiling by adapting the approach described in Heinrich (2022) and compare the trained test statistic to the true profile likelihood ratio (not to be confused with the likelihood-to-evidence ratio (Miller et al., 2022)).

The forward model. In this example we consider a situation that is common in both particle physics and astrophysics: the observed data often originate from multiple disparate physics processes of varying intensity (for example, a signal process and a background process), such that the overall model is a mixture of the per-process models,

$$ p(\{x\} \mid \theta, \theta_\nu) = \mathrm{Pois}(N \mid \lambda(\theta, \theta_\nu)) \prod_{i=1}^{N} \big( c_s\, p_s(x_i \mid \theta, \theta_\nu) + (1 - c_s)\, p_b(x_i \mid \theta, \theta_\nu) \big) \,, \qquad (8) $$

with c_s = c_s(θ, θ_ν) being the relative signal strength and p_s and p_b the per-event densities for signal and background, respectively. Fast inference for such models is especially important in settings that perform real-time monitoring of the quality of the data stream delivered by the experiment. The observed per-event densities are typically not tractable, but high-fidelity simulators exist. In this work we focus on Gaussian mixture components, since the empirical distributions typically considered in the domain use-cases feature multi-modal densities (e.g., in searches for "bump" features). To demonstrate the simulation-based frequentist inference, we consider the special case where p_s and p_b are normal distributions N(µ, σ) with fixed hyperparameters µ_{s/b}, σ_{s/b}, but the component coefficients are a function of both the parameter of interest (the "signal strength") θ as well as the nuisance parameter (i.e.
the "background strength"): c_s = θ/(θ + θ_ν). The overall expected rate is then λ(θ, θ_ν) = θ + θ_ν.

Inference. The model is again based on the deep set-inspired architecture described in Sec. 4, with a small variation: after aggregating the per-event embeddings, these are concatenated with the parameter of interest θ in order to obtain a parameterized neural network classifier one-to-one with the desired test statistic (Cranmer et al., 2015), s_ϕ({x}, θ) ≈ t_θ({x}). Following the procedure in Heinrich (2022), we aim to train a test statistic with best average power across alternatives irrespective of the nuisance parameters, and show that the result is a good approximation of the profile likelihood ratio, i.e., the dataset-wide test statistic typically used for interval estimation and hypothesis testing in frequentist inference:

$$ s_\phi(\{x\}, \theta) \approx t_\theta(\{x\}) = -2 \log \frac{p(\{x\} \mid \theta, \hat{\hat{\theta}}_\nu)}{p(\{x\} \mid \hat{\theta}, \hat{\theta}_\nu)} \,. \qquad (9) $$

Details on the training procedure as well as hyperparameter choices are provided in App. A.

Figure 3: (Left) True profile likelihood ratio for the mixture model scenario in Sec. 5.2 (bottom) compared to the neural test statistic (top). (Center) Scatter plot of the neural test statistic against the profile likelihood ratio, showing the bijective relationship between the two. (Right) Maximum-likelihood estimate obtained from the neural test statistic, θ̂_ML, compared with the true value θ̂_pLR.

Results. As the ground truth is available in this model, we can compare the performance of the learned test statistic directly. In Fig. 3 (left) we show the ground-truth profile likelihood ratio t_pLR as a function of the parameter of interest θ for 1000 datasets {x} ~ p({x} | θ, θ_ν) with an average cardinality of 110 events. The profile likelihoods have the typical parabolic structure, with the minimum at the maximum-likelihood estimate $\hat{\theta}_{\mathrm{pLR}} = \operatorname{argmin}_\theta\, t^{\mathrm{pLR}}_{\{x\}}(\theta)$.
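The ground-truth side of this comparison, i.e., the explicit profiling that the neural test statistic replaces, can be sketched as follows. The Gaussian hyperparameter values below are illustrative assumptions (the paper's concrete choices are in its App. A); the parameterization c_s = θ/(θ + θ_ν), λ = θ + θ_ν follows the text:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import norm, poisson

# Gaussian component hyperparameters (illustrative assumptions).
mu_s, sigma_s, mu_b, sigma_b = 2.0, 0.3, 0.0, 1.5
rng = np.random.default_rng(6)

def sample(theta, theta_nu, rng):
    """Marked Poisson mixture with c_s = theta / (theta + theta_nu)."""
    N = rng.poisson(theta + theta_nu)
    is_sig = rng.random(N) < theta / (theta + theta_nu)
    return np.where(is_sig, rng.normal(mu_s, sigma_s, N), rng.normal(mu_b, sigma_b, N))

def nll(x, theta, theta_nu):
    """Dataset-wide negative log-likelihood of the extended mixture model."""
    lam = theta + theta_nu
    c_s = theta / lam
    dens = c_s * norm.pdf(x, mu_s, sigma_s) + (1.0 - c_s) * norm.pdf(x, mu_b, sigma_b)
    return -poisson.logpmf(len(x), lam) - np.log(dens).sum()

x = sample(30.0, 80.0, rng)   # average cardinality 110, as in the text

def t(theta):
    """Profile likelihood ratio test statistic, via explicit optimization."""
    prof = minimize_scalar(lambda nu: nll(x, theta, nu),
                           bounds=(1.0, 300.0), method="bounded").fun
    glob = minimize(lambda p: nll(x, p[0], p[1]), x0=[30.0, 80.0],
                    bounds=[(1.0, 200.0), (1.0, 300.0)]).fun
    return 2.0 * (prof - glob)
```

Each evaluation of `t` requires fresh optimizations over the nuisance parameter, which is the per-dataset cost that the amortized neural estimator avoids.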
The neural test statistic t_ML exhibits similar features; however, as it is the output of a classification task, it is constrained to the interval [0, 1]. The relationship between the neural and profile likelihood ratio test statistics is shown in Fig. 3 (center), where a clear bijective mapping is recognizable. The network therefore learns an equivalent test statistic in terms of inferential power. Similar to the ground-truth case, the best-fit parameter can be found by minimizing the test statistic, $\hat{\theta}_{\mathrm{ML}} = \operatorname{argmin}_\theta\, t^{\mathrm{ML}}_{\{x\}}(\theta)$, to produce a fully likelihood-free point estimate. In Fig. 3 (right) the two point estimates can be seen to correspond well to each other. One of the major advantages of the neural test statistic is its inference time: while explicit profiling based on either a known or learned test statistic L_x(θ, θ_ν) requires multiple optimizations for each data instance, the neural network statistic allows for a fast, vectorized computation in a single forward pass. We observe the neural network-based inference for 1000 datasets to be over two orders of magnitude faster than the standard approach, i.e., explicit construction of the profile likelihood ratio and computation of θ̂, θ̂_ν, and $\hat{\hat{\theta}}_\nu$ via optimization.

5.3 Mixture models in particle physics: Bayesian inference for a narrow resonance

The forward model. We also demonstrate a Bayesian dataset-wide inference for the mixture model case. Here the parameters of the model affect the mean of the signal component and the mixture coefficient, whereas the signal width and all background parameters are fixed hyperparameters. Using the notation in equation 8, we have

$$ p_s(x_i \mid \theta_\nu) = \mathcal{N}(x_i \mid \theta_\nu, \sigma_s); \qquad p_b(x_i) = \mathcal{N}(x_i \mid \mu_b, \sigma_b); \qquad c_s(\theta) = \theta \,, \qquad (10) $$

i.e., the parameter of interest θ is the relative signal strength and the location of the signal in the observed feature space is a nuisance parameter. In typical applications in particle physics, the signal is often very narrow, whereas the background distribution is more diffuse.
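For this low-dimensional model, a brute-force grid evaluation of the nuisance-marginalized posterior p(θ | {x}) can serve as a reference in the same spirit as the MCMC cross-check. The hyperparameters, flat priors, and parameter values below are illustrative assumptions (the paper's concrete choices are in its App. A):

```python
import numpy as np
from scipy.stats import norm

# Hyperparameters, priors, and true parameter values are assumptions for illustration.
sigma_s, mu_b, sigma_b, N0 = 0.3, 0.0, 1.5, 100
theta_true, nu_true = 0.2, 2.0

rng = np.random.default_rng(7)
is_sig = rng.random(N0) < theta_true
x = np.where(is_sig, rng.normal(nu_true, sigma_s, N0), rng.normal(mu_b, sigma_b, N0))

theta_grid = np.linspace(0.01, 0.99, 197)   # signal fraction theta
nu_grid = np.linspace(-1.0, 3.0, 201)       # signal location theta_nu (nuisance)
T, V = np.meshgrid(theta_grid, nu_grid, indexing="ij")

# log p({x} | theta, theta_nu) on the grid, following the mixture of equation 10.
logp = np.zeros_like(T)
for xi in x:
    logp += np.log(T * norm.pdf(xi, V, sigma_s) + (1.0 - T) * norm.pdf(xi, mu_b, sigma_b))

# Flat priors on the grid; marginalize the nuisance direction to get p(theta | {x}).
post = np.exp(logp - logp.max())
post_theta = post.sum(axis=1)
d_theta = theta_grid[1] - theta_grid[0]
post_theta /= post_theta.sum() * d_theta
theta_mean = np.sum(theta_grid * post_theta) * d_theta
```

Grid evaluation is exact up to discretization here, but its cost grows exponentially with the number of nuisance parameters, which motivates the amortized set-wide estimator.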
All parameter choices and the priors p(θ) and p(θ_ν) are specified in App. A. Instead of parameterizing the overall Poisson rate λ, here we use a fixed batch of events of size N_0, corresponding to the situation one might encounter in a real-time inference task. In summary, the dataset-wide joint probability is:

$$ p(\{x\} \mid \theta, \theta_\nu)\, p(\theta, \theta_\nu) = p(\theta)\, p(\theta_\nu) \prod_{i=1}^{N_0} \big[ c_s(\theta)\, p_s(x_i \mid \theta_\nu) + (1 - c_s(\theta))\, p_b(x_i) \big] \,. \qquad (11) $$

Figure 4: Signal and background densities p_s and p_b in a simplified particle physics analysis for two different true values of the nuisance parameter θ_ν (left), additionally showing sample datasets {x} of size N_0 = 100. The median mean µ_θ (center) and median standard deviation σ_θ (right) of the posterior p(θ | {x}), obtained with different inference methods (deep set; MCMC on x; MCMC on s_nom; MCMC on s_marg) for different true nuisance parameter values θ_ν,true. The median is computed on ensembles of 400 datasets for each parameter point.

Inference. We wish to extract the posterior p(θ | {x}) characterizing the prevalence of the signal process given a set of N_0 events, marginalizing over θ_ν. The mixture fraction θ is a fundamental property of nature; tracing its evolution over time is thus an excellent indicator of the health of the experimental apparatus. We use a deep set network to perform fast, amortized inference in this setting, trained as described in Sec. 4 directly on the set of observations {x}. Using the likelihood of equation 11, the posterior is also accessible through MCMC inference (as described in App. A), which can serve as a point of comparison. In many cases, the individual events x are too high-dimensional for direct inference to be possible.
A fundamental problem in the analysis of collider data thus consists in the construction of a low-dimensional summary statistic x → s(x), designed to retain the relevant information contained in the data and subsequently used in the inference step. It is common practice in contemporary work to use the signal-to-background density ratio snom(x; θν,nom) = ps(x | θν,nom)/pb(x), evaluated at a particular nominal nuisance parameter value θν,nom. This observable is known to be a sufficient statistic for θ under the hypothesis θν = θν,nom (Neyman & Pearson, 1933). Another popular choice is to neglect the hierarchical structure of the data-generating process; in the Bayesian paradigm this involves marginalizing over the nuisance parameters at a per-event level and using smarg(x) = ∫ dθν ps(x | θν) p(θν)/pb(x) as the summary statistic. Both quantities are known not to be information-preserving in general. Our example admits analytic likelihoods also for snom and smarg, allowing their performance to be evaluated through the MCMC procedure described above.

Results The center and right panels in Fig. 4 compare the neural posterior estimate (black lines) to MCMC posterior estimates based on x (blue markers). The width of the neural posterior closely matches the MCMC results, demonstrating the ability of the deep set to extract the entire information about the parameter θ contained in the data. Also shown are MCMC posterior estimates for snom (red markers) and smarg (green markers). As expected, these generate wider posteriors1 and, hence, weaker parameter constraints, illustrating the importance of a proper treatment of global nuisance parameters. Using the fully parameterized density ratio ps(x | θν)/pb(x) for hierarchy-aware inference is, as mentioned in Sec. 2.3, complicated by the fact that many hundreds of nuisance parameters are often present in contemporary particle physics problems (The ATLAS Collaboration, 2022; The CMS Collaboration, 2022).
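Both summary statistics are straightforward to evaluate in the Gaussian mixture example; a minimal sketch (function names are ours, and smarg is approximated here by a Monte Carlo average over the nuisance prior rather than an analytic integral):

```python
import numpy as np
from scipy.stats import norm

SIGMA_S, MU_B, SIGMA_B = 0.1, 0.0, 1.0   # App. A hyperparameter choices

def s_nom(x, theta_nu_nom):
    """Nominal ratio p_s(x | theta_nu_nom) / p_b(x); sufficient for theta only
    under the hypothesis theta_nu = theta_nu_nom."""
    return norm.pdf(x, theta_nu_nom, SIGMA_S) / norm.pdf(x, MU_B, SIGMA_B)

def s_marg(x, n_mc=10_000, seed=0):
    """Per-event marginalized ratio: the integral over the nuisance prior
    No(1, 2) is approximated by a Monte Carlo average over prior draws."""
    nu = np.random.default_rng(seed).normal(1.0, 2.0, n_mc)
    numerator = norm.pdf(np.asarray(x)[..., None], nu, SIGMA_S).mean(axis=-1)
    return numerator / norm.pdf(x, MU_B, SIGMA_B)
```

Since both map each event to a scalar independently of the rest of the dataset, neither can encode the dataset-wide correlations induced by a shared nuisance parameter.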
However, the nuisance parameters inform the deep set through their presence in the training dataset and do not require any explicit parameterization of the network, suggesting a much better scaling behavior. While we do not explore this direction further here, we note that the event embedding sglob_ϕ(x) learned by the deep set also plays the role of an information-preserving summary statistic that can serve as input to traditional (non-amortized) inference pipelines.

1 In our simple example, smarg(x) is closely related to snom(x; θν,nom) for θν,nom ≈ 0.3, leading to a sharp improvement in sensitivity for this model point. This is an artifact that is not expected to occur in more complicated settings.

5.4 Astrophysics example: Strong gravitational lensing

We test our procedure for hierarchical simulation-based inference on a more complicated problem from the domain of astrophysics, where the likelihood associated with the forward model is intractable: the characterization of the distribution of dark matter clumps (subhalos) in galaxies, using images of galaxies gravitationally lensed by the clumps as well as by a larger smooth mass distribution. The model contains both per-event (i.e., local) and dataset-wide (i.e., global) parameters. Estimating global lensing parameters with more sophisticated forward models was studied in, e.g., Anau Montel et al. (2023); Montel & Weniger (2022); Coogan et al. (2022); Wagner-Carena et al. (2022); Brehmer et al. (2019); we note that our aim here is instead to showcase the ability to do hierarchical simulation-based inference over local and global parameters with an amortized set-wide estimator.

Figure 5: Illustrative samples from the lensing model.
The rows show two different choices of the local (per-event) parameters, while the different columns show variations of the global (set-wide) parameters for a fixed choice of local parameters. The sample-to-sample variation induced by the global parameters, which control the abundance of the subhalo population in the lens, can be seen. The scatter points show the location of individual subhalos in each image.

The forward model The substructure mass function, which describes the mass distribution of clumps, is parameterized as a power law, dn/dmsub = α msub^β, where the normalization α parameterizes the overall abundance of subhalos and β is the spectral index. θglob = {α, β} are the global parameters in the hierarchical model. The number of subhalos follows a Poisson distribution, Nsub ∼ Pois(µsub), where the expected number of subhalos is µsub = ∫ dmsub (dn/dmsub). In addition, we have per-image (local) parameters θloc = {θE, Δx, Δy, q}, where θE is the Einstein radius characterizing the overall scale of the lensing ring, {Δx, Δy} is the offset vector between the centers of the background source and the lens galaxy, and q ∈ [0, 1] is the axis ratio, with 1 corresponding to a spherical lens. The subhalo radial positions are drawn from a uniform distribution close to the Einstein radius, {rsub}_{i=1}^{Nsub} ∼ Unif(θE − 0.2, θE + 0.2), and the azimuthal angles are drawn uniformly from the interval [0, 2π]. Given a realization of the masses and positions of the clumps {xsub, ysub, msub}_{i=1}^{Nsub} (which otherwise act as latent variables, collectively denoted {zsub}), the expected image µimage can be obtained using a gravitational lensing simulator (Brehmer et al., 2019). The actual image is then a Poisson realization of this expectation, x ∼ Pois(µimage). We choose our simulated lensed images to be 64 × 64 pixels in size; see App. A for more details on the lensing forward model. The per-image likelihood is intractable because of the high dimensionality of the latent space:
p(x | θloc, θglob) = Σ_{Nsub} ∫ d^{Nsub}{zsub} Pois(x | µimage) p(µimage | {zsub}^{Nsub}, θloc) Pois(Nsub | µsub). (12)

Some illustrative samples from the lensing forward model are shown in Fig. 5, showing variation with local (per-event) and global (set-wide) parameters across rows, and different realizations of the subhalo latent parameters (mass and location) along columns. The scatter points show the location of individual sub-clumps in each image.

Inference We would like to infer the posterior over the local as well as global parameters, p(θ, z | {x}), given an ensemble of images. As discussed in Sec. 2, building per-image posterior estimators ignores the joint influence of the global parameters over the entire dataset, potentially reducing the identifiability of the local parameters. If a combined posterior for the global parameters is desired, combining per-image estimators into a global posterior can also be error-prone, as per-observation errors accumulate, especially if flexible density estimators (e.g., normalizing flows) are used, as we do here. We train a hierarchical neural posterior estimator with datasets containing up to 25 lensed images, using a ResNet-18 (He et al., 2016) CNN as the per-event feature extractor. App. A describes further details related to model training.

Results Figure 6 shows exemplary posterior inference results on one of the local parameters, the Einstein radius θE (left), and one of the global parameters, the amplitude of the clump mass function α (right), on test datasets. The red lines show posteriors using the hierarchical estimator, while the blue dotted lines show results using a per-image estimator. We can see that the posterior mass concentrates around the true value (vertical dashed lines) as more images are analyzed. On the other hand, the local parameter values are very similar to those obtained without simultaneously constraining the dataset-wide global parameters.
This is because, in this case, the sub-clumps have a minimal (sub-percent level) effect on the lensing image, and the degeneracy between the local and global parameters is weak.

Figure 6: Example results of inference on local (left) as well as global (right) gravitational lensing parameters using the hierarchical simulation-based inference model, with the posterior on the global parameter shown for various cardinalities as up to 25 images are analyzed. Concentration of the posterior mass around the true global parameter value (vertical dashed lines) is seen as more lensed images are analyzed, and the posterior distributions on the local parameters are seen to be qualitatively consistent with the true parameter values used for simulation.

Statistical coverage We perform a test of posterior statistical coverage (described in detail in Hermans et al. (2021)) for the gravitational lensing example, serving as a validation check of the learned posterior estimator across cardinalities. Using the same notation as in the main text, let θ be the global parameter of interest and {x} be a dataset, and denote our learned posterior density estimator by ˆp(θ | {x}). For a confidence level 1 − α, the expected coverage probability is

E_{(θ,{x}) ∼ p(θ,{x})} [ 1[ θ ∈ Θˆp(θ|{x})(1 − α) ] ], (13)

where Θˆp(θ|{x})(1 − α) denotes the 1 − α highest posterior density interval (HPDI) of the estimator ˆp(θ | {x}), and 1[·] is an indicator function mapping samples that fall within the HPDI to one.
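For a posterior estimator with a Gaussian head, the highest-density interval is centered on the posterior mean, so the expected coverage defined above reduces to counting how often the truth falls within mean ± z·std. The sketch below is ours and assumes such a Gaussian estimator rather than the normalizing flows used in the lensing example:

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(theta_true, post_mean, post_std, alpha=0.32):
    """Fraction of test datasets whose true parameter lies inside the
    1 - alpha highest-posterior-density interval; for a Gaussian posterior
    this is the central interval mean +/- z * std."""
    z = norm.ppf(1.0 - alpha / 2.0)
    inside = np.abs(theta_true - post_mean) <= z * post_std
    return inside.mean()

# A perfectly calibrated estimator should reproduce the nominal level 1 - alpha.
rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, 5000)      # posterior means over 5000 test datasets
sd = np.full(5000, 0.3)              # posterior standard deviations
theta = rng.normal(mu, sd)           # "true" parameters, calibrated by construction
cov = empirical_coverage(theta, mu, sd, alpha=0.32)
```

Repeating this check over a grid of confidence levels traces out the calibration curve compared against the diagonal in Fig. 7.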
Given N datasets sampled from the joint distribution, (θi, {x}i) ∼ p(θ, {x}), the empirical expected coverage for the posterior estimator ˆp(θ | {x}) is

(1/N) Σ_{i=1}^{N} 1[ θi ∈ Θˆp(θ|{x}i)(1 − α) ]. (14)

The nominal expected coverage is the expected coverage when ˆp(θ | {x}) = p(θ | {x}) and is equal to the confidence level 1 − α. In Fig. 7 we show the results of a test of statistical coverage for the gravitational lensing example, plotting empirical against nominal coverage probability for the same posterior estimator evaluated over different cardinalities (corresponding to the different lines shown), using 100 test samples. If an estimator produces perfectly calibrated posteriors, its empirical and nominal expected coverage probabilities match (dashed diagonal line). The amortized posteriors obtained are shown to be well-calibrated across different cardinalities.

Figure 7: Test of statistical coverage for the gravitational lensing example, showing empirical against nominal coverage probability. If an estimator produces perfectly calibrated posteriors, its empirical expected coverage probability and nominal expected coverage probability match (dashed diagonal line). The amortized posteriors obtained are shown to be well-calibrated across different cardinalities (corresponding to the different lines).

6 Conclusions

We studied the problem of simulation-based inference over event ensembles in the presence of hierarchical structure in the underlying forward model. In this setting, which is common in the physical sciences, hierarchy-aware parameter inference requires a dataset-wide approach.
For this purpose, we introduced neural estimators for likelihood ratios and posteriors, illustrated their use for Bayesian as well as frequentist analyses in situations typical of astro- and particle physics, and substantiated correct inference performance in these scenarios. Our methods do not require explicit parameterization in terms of marginalized or profiled nuisance parameters, making them complementary to existing techniques for dataset-wide inference. Once trained, they provide fully amortized inference over datasets of varying cardinality and offer significant speedups over traditional frequentist and Bayesian methods, making them suitable also for high-throughput applications.

Limitations Our methods apply to situations where the cardinality N of the observed dataset follows an arbitrary, but fixed, distribution p(N | θ) (cf. equation 2). It is currently an open question to what extent estimators trained on p1(N | θ) achieve statistically sound performance when applied to datasets corresponding to a different distribution p2(N | θ), encoding, for example, the accumulation of additional data. We leave this extension for future investigation, but note that insights from recent work describing Bayesian updating as a dynamical system (Berman et al., 2022; Chen et al., 2019) may offer a promising way to address this important problem.

Acknowledgments We made extensive use of the Einops (Rogozhnikov, 2022), JAX (Bradbury et al., 2018), Jupyter (Kluyver et al., 2016), Matplotlib (Hunter, 2007), nflows (Durkan et al., 2020), NumPy (Harris et al., 2020), PyMC5 (Salvatier et al., 2016), PyTorch (Paszke et al., 2019), PyTorch Lightning (Falcon et al., 2020), and SciPy (Virtanen et al., 2020) packages. LH is supported by the Excellence Cluster ORIGINS, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC-2094-390783311.
SM is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics, under Contract Number DE-SC0012567. CP acknowledges support through STFC consolidated grant ST/W000571/1. PW is supported by the Grainger Fellowship at the University of Chicago. This work was initiated at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. We thank the Munich Institute for Astro-, Particle and BioPhysics (MIAPbP) for their hospitality during the completion of this project.

References

Abhinav Agrawal and Justin Domke. Amortized variational inference for simple hierarchical models. Advances in Neural Information Processing Systems, 34:21388-21399, 2021.

Noemi Anau Montel, Adam Coogan, Camila Correa, Konstantin Karchev, and Christoph Weniger. Estimating the warm dark matter mass from strong lensing images with truncated marginal neural ratio estimation. Monthly Notices of the Royal Astronomical Society, 518(2):2746-2760, 2023.

David S Berman, Jonathan J Heckman, and Marc Klinger. On the dynamics of inference and learning. arXiv preprint arXiv:2204.12939, 2022.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Johann Brehmer, Siddharth Mishra-Sharma, Joeri Hermans, Gilles Louppe, and Kyle Cranmer. Mining for dark matter substructure: Inferring subhalo population properties from strong lenses with machine learning. The Astrophysical Journal, 886(1):49, 2019.

Xinshi Chen, Hanjun Dai, and Le Song. Particle flow Bayes' rule.
In International Conference on Machine Learning, pp. 1022-1031. PMLR, 2019.

Adam Coogan, Noemi Anau Montel, Konstantin Karchev, Meiert W Grootes, Francesco Nattino, and Christoph Weniger. One never walks alone: the effect of the perturber population on subhalo measurements in strong gravitational lenses. arXiv preprint arXiv:2209.09918, 2022.

Kyle Cranmer, Juan Pavez, and Gilles Louppe. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169, 2015.

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055-30062, 2020.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. nflows: normalizing flows in PyTorch, November 2020. URL https://doi.org/10.5281/zenodo.4296287.

William Falcon et al. PyTorchLightning/pytorch-lightning: 0.7.6 release, May 2020. URL https://doi.org/10.5281/zenodo.3828935.

Tomas Geffner, George Papamakarios, and Andriy Mnih. Score modeling for simulation-based inference. arXiv preprint arXiv:2209.14249, 2022.

Charles R. Harris et al. Array programming with NumPy. Nature, 585(7825):357-362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Lukas Heinrich. Learning optimal test statistics in the presence of nuisance parameters. arXiv preprint arXiv:2203.13079, 2022.

Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, Volodimir Begy, and Gilles Louppe. A trust crisis in simulation-based inference? Your posterior approximations can be unfaithful. arXiv preprint arXiv:2110.06581, 2021.

J. D. Hunter. Matplotlib: A 2D graphics environment.
Computing in Science & Engineering, 9(3):90-95, 2007.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Thomas Kluyver et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB, 2016.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Benjamin K Miller, Christoph Weniger, and Patrick Forré. Contrastive neural ratio estimation. Advances in Neural Information Processing Systems, 35:3262-3278, 2022.

Noemi Anau Montel and Christoph Weniger. Detection is truncation: studying source populations with truncated marginal neural ratio estimation. arXiv preprint arXiv:2211.04291, 2022.

Benjamin Nachman and Jesse Thaler. Learning from many collider events at once. Physical Review D, 103(11):116013, 2021.

J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289-337, 1933.

George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems, 29, 2016.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 30, 2017.

Adam Paszke et al. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238, 2022.
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530-1538. PMLR, 2015.

Pedro Rodrigues, Thomas Moreau, Gilles Louppe, and Alexandre Gramfort. HNPE: Leveraging global parameters for neural posterior estimation. Advances in Neural Information Processing Systems, 34:13432-13443, 2021.

Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with Einstein-like notation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oapKSVM2bcj.

J Salvatier, TV Wiecki, and C Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.

The ATLAS Collaboration. A detailed map of Higgs boson interactions by the ATLAS experiment ten years after the discovery. Nature, 607:52-59, 2022.

The CMS Collaboration. A portrait of the Higgs boson by the CMS experiment ten years after the discovery. Nature, 607:60-68, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Pauli Virtanen et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.

Sebastian Wagner-Carena, Ji Won Park, Simon Birrer, Philip J Marshall, Aaron Roodman, Risa H Wechsler, LSST Dark Energy Science Collaboration, et al. Hierarchical inference with Bayesian neural networks: An application to strong gravitational lensing. The Astrophysical Journal, 909(2):187, 2021.

Sebastian Wagner-Carena, Jelle Aalbers, Simon Birrer, Ethan O Nadler, Elise Darragh-Ford, Philip J Marshall, and Risa H Wechsler. From images to dark matter: End-to-end inference of substructure from hundreds of strong gravitational lenses.
arXiv preprint arXiv:2203.00690, 2022.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

A Additional details on experiments

A.1 Details of "Simple multi-variate normal likelihood" experiments

The deep set as well as transformer multivariate normal posterior estimators are trained on 50,000 sequences {x} with a batch size of 128, withholding 10% of the samples for validation. Sequences consist of up to 200 elements, with each element having 15 features (3 multivariate normal dimensions with 5 draws per dimension). The means across the 3 dimensions of the multivariate normal are the parameters of interest θ, and we target a posterior distribution over them. A 3-layer MLP with ReLU activations and hidden size 512 is used as the per-event encoder sglob_ϕ, which generates 256-dimensional per-event embeddings that are either aggregated (in the deep set model) or fed through transformer layers. An analogous decoder or prediction network gψ then outputs the means and log-standard deviations, which are used along with the ground-truth values to train the Gaussian density estimator qφ(θ | {x}). The estimators are trained using the AdamW (Loshchilov & Hutter, 2019; Kingma & Ba, 2014) optimizer with initial learning rate 3 × 10^−4 and cosine annealing over 100 epochs. The checkpoint corresponding to the lowest validation loss is used. Evaluation is performed on 500 new test samples, evaluating the posterior estimator for sequences with cardinality varied from 1 to 200 for each test sequence.
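The deep set architecture described above can be sketched as a permutation-invariant forward pass. The sketch below is illustrative only: it uses random (untrained) weights, smaller layer sizes than those quoted, and NumPy instead of the PyTorch stack used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes, rng):
    # Random (untrained) weights for a small ReLU MLP, a stand-in for training.
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)       # ReLU on hidden layers
    return h

def deep_set_posterior(encoder, decoder, events):
    """Encode each event, mean-pool the embeddings (permutation-invariant),
    then decode to the mean and log-std of a Gaussian posterior over theta."""
    pooled = mlp(encoder, events).mean(axis=0)
    mean, log_std = np.split(mlp(decoder, pooled), 2)
    return mean, log_std

encoder = mlp_params([15, 64, 64, 32], rng)   # 15 features -> 32-dim embedding
decoder = mlp_params([32, 64, 6], rng)        # 3 means + 3 log-stds for theta
events = rng.normal(size=(200, 15))
mean, log_std = deep_set_posterior(encoder, decoder, events)
```

The mean pooling is what makes the estimator invariant to the ordering of the events and applicable to sets of any cardinality.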
A.2 Details of "Mixture models in particle physics: frequentist treatment" experiments

Hyperparameters The hyperparameters are chosen such that at the nominal parameter values θ = θν = 1 the total number of expected background events is nb = 100 and the total number of expected signal events is ns = 10. The hyperparameters of the Gaussian mixture components are µs = 7, σs = 2 and µb = 0, σb = 3.

Minibatch sampling The training algorithm for learning the test statistic for frequentist use requires choosing a weighting function in the (θ, θν)-space. For each minibatch, the (y = 0)-labeled instances are sampled from the simulator {x} ∼ p({x} | θ, θν) at (θ = θ0, θν = θν,0). The parameter of interest is chosen randomly from a uniform distribution, θ0 ∼ Unif(3, 7), and a random nuisance parameter value is sampled as θν,0 ∼ Unif(0.5, 2.0). The (y = 1)-labeled instances are sampled from (θ = θ0 ± Δ, θν = θν,∗), where Δ ∼ Unif(0.5, 2.0) and θν,∗ ∼ Unif(0.5, 2.0). Overall, each minibatch has N = 100 positive and N = 100 negative instances, and the former are split equally between the two θ = θ0 ± Δ cases.

Training For each minibatch sampled for θ = θ0, the instances {x}i are evaluated with the deep set conditioning parameter set to θ = θ0, and the loss is evaluated as a standard binary cross-entropy loss averaged over the minibatch. The network parameters are optimized with the Adam optimizer with a learning rate of λelem = 10^−4 for the per-element network parameters and λset-wide = 10^−3 for the set-wide network parameters. Training proceeds for 30,000 steps.

A.3 Details of "Mixture models in particle physics: Bayesian inference for a narrow resonance" experiments

Choice of distributions and priors The model hyperparameters are chosen to be σs = 0.1 for the signal and µb = 0, σb = 1 for the background.
The mixture coefficient θ is subject to a uniform prior, θ ∼ Unif(0, 1), and the nuisance parameter θν has a Gaussian prior, p(θν) = No(θν | µ = 1, σ = 2).

Training of the deep set The networks implementing the per-event encoder sglob_ϕ and the decoder qglob_φ comprise two dense layers with 128 hidden units and GELU activation functions; the event encoding consists of 64 features. 500k example datasets are generated from the prior distributions over θν and θ; these datasets are used to train for 300 epochs with the AdamW optimizer, a batch size of 256, and an initial learning rate of 10^−3, with cosine annealing over the 300 epochs. The evaluation is performed on independent datasets not used during training.

Markov Chain Monte Carlo We use the implementation of the NUTS algorithm available in PyMC5 (Salvatier et al., 2016). For each inference run, two Markov chains of 50k samples each are sampled. The first 12k samples in each chain are discarded and not used for the estimation of the posterior.

A.4 Details of "Astrophysics example: Strong gravitational lensing" experiments

We generate 5000 ensembles (each drawn from the same global parameters θglob = {α, β}), each containing 25 lensed images with varying local parameters θloc = {θE, Δx, Δy, q}. We use a ResNet-18 (He et al., 2016) convolutional neural network as the per-event encoder sglob_ϕ to extract a 128-dimensional embedding vector from each image; half of the features are concatenated with the true global parameters and used to condition masked autoregressive normalizing flows (Papamakarios et al., 2017; Rezende & Mohamed, 2015) qloc_φ characterizing the local parameters for each image. The other half of the features are averaged, concatenated with the (randomly varying at training time) cardinality of the image dataset, and passed through a 3-layer MLP decoder, subsequently conditioning a normalizing flow qglob_φ that models the global parameters of interest.
The model is trained with a batch size of 16 (each batch containing 16 × 25 = 400 images), using the AdamW optimizer with a cosine-annealed learning rate starting at 3 × 10^−4, for up to 100 epochs with early stopping; the model with the best validation loss is used for evaluation. The sum of the global- and local-normalizing-flow log-likelihoods is used as the optimization objective.

B Motivation for loss function for joint posterior inference

We motivate the loss function used in equation 7, which factorizes the joint posterior over the local and global parameters zi and θ into a global posterior given the set {x} and a product of posteriors over the local parameters, additionally conditioned on the global parameters:

L = −log qglob_φ( θ | Σ_{i=1}^{N} sglob_ϕ(xi) ) − Σ_{i=1}^{N} log qloc_φ( zi | sloc_ϕ(xi), θ ). (15)

Given a hierarchical model with global parameters θ and local parameters zi, and associated observed data {x}, we aim to demonstrate the factorization

p(θ, {zi} | {x}) = p(θ | {x}) ∏_i p(zi | θ, xi). (16)

We proceed in two steps, the second of which is a key assumption given the nature of common hierarchical models. (a) By the chain rule, the joint posterior factorizes into a marginal over the global parameters and a conditional over the local parameters,

p(θ, {zi} | {x}) = p(θ | {x}) p({zi} | θ, {x}), (17)

and (b) each local parameter zi is conditionally independent of any other zj given θ and its respective observation xi, which gives

p({zi} | θ, {x}) = ∏_i p(zi | θ, xi). (18)

Substituting the result from (b) into the expression from (a), we obtain the targeted factorized form,

p(θ, {zi} | {x}) = p(θ | {x}) ∏_i p(zi | θ, xi), (19)

justifying the loss function used in the hierarchical model.
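A minimal sketch of the loss in equation 15 for Gaussian heads follows; the linear heads, dimensions, and names here are illustrative stand-ins for the trained encoder and flow networks:

```python
import numpy as np

def gaussian_logpdf(x, mean, log_std):
    # Elementwise diagonal-Gaussian log-density.
    return (-0.5 * ((x - mean) / np.exp(log_std)) ** 2
            - log_std - 0.5 * np.log(2.0 * np.pi))

def linear_head(w, b):
    # Maps a conditioning vector to (mean, log_std) of a diagonal Gaussian.
    return lambda h: np.split(h @ w + b, 2)

def hierarchical_loss(theta, z, glob_emb, loc_emb, glob_head, loc_head):
    """Negative log joint posterior of equation 15: one global term conditioned
    on the pooled per-event embeddings, plus per-event local terms conditioned
    on each event's embedding and the global parameters."""
    mean_g, log_std_g = glob_head(glob_emb.sum(axis=0))   # q_glob(theta | sum_i s(x_i))
    loss = -gaussian_logpdf(theta, mean_g, log_std_g).sum()
    for zi, ei in zip(z, loc_emb):                        # q_loc(z_i | s(x_i), theta)
        mean_l, log_std_l = loc_head(np.concatenate([ei, theta]))
        loss -= gaussian_logpdf(zi, mean_l, log_std_l).sum()
    return loss

rng = np.random.default_rng(0)
N, E, DT, DZ = 10, 8, 2, 3       # events, embedding dim, dim(theta), dim(z)
g_head = linear_head(0.1 * rng.normal(size=(E, 2 * DT)), np.zeros(2 * DT))
l_head = linear_head(0.1 * rng.normal(size=(E + DT, 2 * DZ)), np.zeros(2 * DZ))
loss = hierarchical_loss(rng.normal(size=DT), rng.normal(size=(N, DZ)),
                         rng.normal(size=(N, E)), rng.normal(size=(N, E)),
                         g_head, l_head)
```

Minimizing this objective in expectation over simulated (θ, {zi}, {x}) triples drives the two heads toward the factorized posterior of equation 19.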