# bayesian_data_selection__5fca2f95.pdf

Journal of Machine Learning Research 24 (2023) 1-72 Submitted 9/21; Revised 9/22; Published 1/23

Bayesian Data Selection

Eli N. Weinstein ew2760@columbia.edu Data Science Institute Columbia University New York, NY 10027, USA

Jeﬀrey W. Miller jwmiller@hsph.harvard.edu Department of Biostatistics Harvard T.H. Chan School of Public Health Boston, MA 02115, USA

Editor: Mingyuan Zhou

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the data selection problem: ﬁnding a lower-dimensional statistic such as a subset of variables that is well ﬁt by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining background components of the data, and perform standard Bayesian model selection for the choice of statistic. However, ﬁtting a nonparametric model to high-dimensional data tends to be highly ineﬃcient, statistically and computationally. We propose a novel score for performing data selection, the Stein volume criterion (SVC) , that does not require ﬁtting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation. Keywords: Bayesian nonparametrics, Bayesian theory, consistency, misspeciﬁcation, Stein discrepancy

. Work conducted while at Harvard University.

2023 Eli N. Weinstein and Jeﬀrey W. Miller.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1067.html.

Weinstein and Miller

1. Introduction

Scientists often seek to understand complex phenomena by developing working models for various special cases and subsets. Thus, when faced with a large complex data set, a natural question to ask is where and when a given working model applies. We formalize this question statistically by saying that given a high-dimensional data set, we want to identify a lowerdimensional statistic such as a subset of variables that follows a parametric model of interest (the working model). We refer to this problem as data selection , in counterpoint to model selection, since it requires selecting the aspect of the data to which a given model applies. For example, early studies of single-cell RNA expression showed that the expression of individual genes was often bistable, which suggests that the system of cellular gene expression might be described with the theory of interacting bistable systems, or spin glasses, with each gene a separate spin and each cell a separate observation. While it seems implausible that such a model would hold in full generality, it is quite possible that there are subsets of genes for which the spin glass model is a reasonable approximation to reality. Finding such subsets of genes is a data selection problem. In general, a good data selection method would enable one to (a) discover interesting phenomena in complex data sets, (b) identify precisely where naive application of the working model to the full data set goes wrong, and (c) evaluate the robustness of inferences made with the working model. Perhaps the most natural Bayesian approach to data selection is to employ a semiparametric joint model, using the parametric model of interest for the low-dimensional statistic (the foreground ) and using a ﬂexible nonparametric model to explain all other aspects of the data (the background ). Then, to infer where the foreground model applies, one would perform standard Bayesian model selection across diﬀerent choices of the foreground statistic. However, this is computationally challenging due to the need to integrate over the nonparametric model for each choice of foreground statistic, making this approach quite diﬃcult in practice. A natural frequentist approach to data selection would be to perform a goodness-of-ﬁt test for each choice of foreground statistic. However, this still requires specifying an alternative hypothesis, even if the alternative is nonparametric, and ensuring comparability between alternatives used for diﬀerent choices of foreground statistics is nontrivial. Moreover, developing goodness-of-ﬁt tests for composite hypotheses or hierarchical models is often diﬃcult in practice. In this article, we propose a new score for both data selection and model selection that is similar to the marginal likelihood of a semi-parametric model but does not require one to specify a background model, let alone integrate over it. The basic idea is to employ a generalized marginal likelihood where we replace the foreground model likelihood by an exponentiated divergence with nice properties, and replace the background model s marginal likelihood with a simple volume correction factor. For the choice of divergence, we use a kernelized Stein discrepancy (KSD) since it enables us to provide statistical guarantees and is easy to estimate compared to other divergences for instance, the Kullback Leibler divergence involves a problematic entropy term that cannot simply be dropped. The background model volume correction arises roughly as follows: if the background model is well-speciﬁed, then asymptotically, its divergence from the empirical distribution converges to zero and all that remains of the background model s contribution is the volume of its eﬀective parameter

Bayesian Data Selection

space. Consequently, it is not necessary to specify the background model, only its eﬀective dimension. To facilitate computation further, we develop a Laplace approximation for the foreground model s contribution to our proposed score.

This article makes a number of novel contributions. We introduce the data selection problem in broad generality, and provide a thorough asymptotic analysis. We propose a novel model/data selection score, which we refer to as the Stein volume criterion, that takes the form of a generalized marginal likelihood using a KSD. We provide new theoretical results for this generalized marginal likelihood and its associated posterior, complementing and building upon recent work on the frequentist properties of minimum KSD estimators (Barp et al., 2019). Finally, we provide ﬁrst-of-a-kind empirical data selection analyses with two models that are frequently used in single-cell RNA sequencing analysis.

The article is organized as follows. In Section 2, we introduce the data selection problem and our proposed method. In Section 3 we study the asymptotic properties of Bayesian data selection methods and compare to model selection. Section 4 provides a review of related work and Section 5 illustrates the method on a toy example. In Section 6, we prove (a) consistency results for both data selection and model selection, (b) a Laplace approximation for the proposed score, and (c) a Bernstein von Mises theorem for the corresponding generalized posterior. In Section 7, we apply our method to probabilistic principal components analysis (p PCA), assess its performance in simulations, and demonstrate it on single-cell RNA sequencing (sc RNAseq) data. In Section 8, we apply our method to a spin glass model of gene expression, also demonstrated on an sc RNAseq data set. Section 9 concludes with a brief discussion.

Suppose the data X(1), . . . , X(N) X are independent and identically distributed (i.i.d.), where X Rd. Suppose the true data-generating distribution P0 has density p0(x) with respect to Lebesgue measure, and let {q(x|θ) : θ Θ} be a parametric model of interest, where Θ Rm. We are interested in evaluating this model when applied to a projection of the data onto a subspace, XF X (the foreground space). Speciﬁcally, let XF := V X be a linear projection of a datapoint X X onto XF, where V is a matrix with orthonormal columns which deﬁnes the foreground space. Let q(x F|θ) denote the distribution of XF when X q(x|θ), and likewise, let p0(x F) be the distribution of XF when X p0(x). Even when the complete model q(x|θ) is misspeciﬁed with respect to p0(x), it may be that q(x F|θ) is well-speciﬁed with respect to p0(x F); see Figure 1 for a toy example. In such cases, the parametric model is only partially misspeciﬁed speciﬁcally, it is misspeciﬁed on the background space XB, deﬁned as the orthogonal complement of XF (that is, the set of all vectors that are orthogonal to every vector in XF).

Our goal is to ﬁnd subspaces XF of the data space X for which the model q(x F|θ) is correctly speciﬁed. We are not seeking a subset of datapoints, but rather a projection of all the datapoints.

A natural Bayesian solution would be to replace the background component of the assumed model, q(x B|x F, θ), with a more ﬂexible component q(x B|x F, φB) that is guaranteed to be well-speciﬁed with respect to p0(x B|x F), such as a nonparametric model. The resulting

Weinstein and Miller

3 2 1 0 1 2 3 X1

(a) An example for which a bivariate normal model is partially misspeciﬁed. Basis vectors for XF (foreground) and XB (background) are blue and red, respectively.

3 2 1 0 1 2 3 X

(b) A univariate normal model is wellspeciﬁed for the data projection onto XF.

3 2 1 0 1 2 3 X

(c) A univariate normal model is misspeciﬁed for the data projection onto XB.

Figure 1: A simple example illustrating the data selection problem.

joint model, which we refer to as the augmented model , is then

θ π(θ), X(i) F | θ q(x F | θ),

φB πB(φB), X(i) B | X(i) F , φB q(x B | X(i) F , φB) (1)

independently for i {1, . . . , N}. In other words, the pairs (X(1) F , X(1) B ), . . . , (X(N) F , X(N) B ) are i.i.d. given θ and φB, with the foreground projections X(i) F drawn from the parametric model of interest, and the background projections X(i) B drawn from the ﬂexible background model. The standard Bayesian approach to infer XF would be to put a prior on the choice of foreground space XF, and compute the posterior over the choice of XF. Computing this posterior boils down to computing the Bayes factor q(X(1:N)|F)/ q(X(1:N)|F ) for any given pair of foregrounds F and F , where q(X(1:N)|F) denotes the marginal likelihood of F under the augmented model, that is, q(X(1:N)|F) = R R q(X(1:N) F |θ) q(X(1:N) B |X(1:N) F , φB)π(θ)πB(φB)dθdφB. However, in general, it is diﬃcult to ﬁnd a background model that (a) is guaranteed to be well-speciﬁed with respect to p0(x B|x F) and (b) can be integrated over in a computationally tractable way to obtain the posterior on the choice of F. Our proposed method, which we introduce next, sidesteps these diﬃculties while still exhibiting similar guarantees.

2.1 Proposed score for data selection and model selection

In this section, we propose a model/data selection score that is simpler to compute than the marginal likelihood of the augmented model and has similar theoretical guarantees. This score takes the form of a generalized marginal likelihood with a normalized kernelized Stein discrepancy (nksd) estimate taking the place of the log likelihood. Speciﬁcally, our

Bayesian Data Selection

proposed model/data selection score, termed the Stein volume criterion (SVC), is

m B/2 Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ)dθ (2)

where the temperature T > 0 is a hyperparameter and m B is the eﬀective dimension of the background model parameter space. [ nksd( ) is an empirical estimate of the nksd (Equations 4 and 5), and measures the mismatch between the data and the model over the foreground subspace. There are three key properties of [ nksd that distinguish it from other estimators of other divergences. First, it estimates the divergence directly, not just up to a data-dependent constant; this is essential for data selection consistency (Section 3.1). For instance, putting the log likelihood in place of N

T [ nksd in Equation 2 fails to provide data selection consistency since it implicitly involves comparing the foreground entropy under P0. Second, [ nksd converges at a O(1/N) convergence rate when the model is correct; this is essential for nested data selection consistency (Section 3.2). In contrast, even if the foreground entropy under P0 is known exactly, using a Monte Carlo estimate of the Kullback Leibler divergence in place of N

T [ nksd fails since the convergence rate is only O(1/

N). Third, the NKSD exhibits subsystem independence (Section 6.1), which ensures that the SVC is comparable between foreground spaces of diﬀerent dimension. We are unaware of any other divergence estimator with all three of these key properties. The integral in Equation 2 can be approximated using techniques discussed in Section 2.3. The hyperparameter T can be calibrated by comparing the coverage of the standard Bayesian posterior to the coverage of the nksd generalized posterior (Section A.1). The (2π/N)m B/2 factor penalizes higher-complexity background models. In general, we allow m B to grow with N, particularly when the background model is nonparametric. Crucially, the likelihood of the background model does not appear in our proposed score, sidestepping the need to ﬁt or even specify the background model indeed, the only place that the background model enters into the SVC is through m B. Thus, rather than specify a background model and then derive m B, one can simply specify an appropriate value of m B. Reasonable choices of m B can be derived by considering the asymptotic behavior of a Pitman-Yor process mixture model, a common nonparametric model that is a natural choice for a background model. A Pitman-Yor process mixture model with discount parameter α (0, 1), concentration parameter ν > α, and D-dimensional component parameters will asymptotically have expected eﬀective dimension

m B D Γ(ν + 1)

αΓ(ν + α)Nα (3)

under the prior, where a N b N means that a N/b N 1 as N and Γ( ) is the gamma function (Pitman, 2002, 3.3). As a default, we recommend setting m B = c B r B

N, where r B is the dimension of XB and c B is a constant chosen to match Equation 3 with α = 1/2. The

N scaling is particularly nice in terms of asymptotic guarantees; see Section 3.2. The SVC uses a novel, normalized version of the ksd between densities p(x) and q(x):

nksd(p(x) q(x)) := EX,Y p (sq(X) sp(X)) (sq(Y ) sp(Y ))k(X, Y )

EX,Y p[k(X, Y )] (4)

Weinstein and Miller

where k(x, y) R is an integrally strictly positive deﬁnite kernel, sq(x) := x log q(x), and sp(x) := x log p(x); see Section 6.1 for details. The numerator corresponds to the standard ksd (Liu et al., 2016). The denominator, which is strictly positive and independent of q(x), is a normalization factor that we have introduced to make the divergence comparable across spaces of diﬀerent dimension. See Section A.2 for kernel recommendations. Extending the technique of Liu et al. (2016), we propose to estimate the normalized KSD using U-statistics:

[ nksd(p(x) q(x)) =

P i =j u(X(i), X(j)) P i =j k(X(i), X(j)) (5)

where X(i) p(x) i.i.d., the sums are over all i, j {1, . . . , N} such that i = j, and

u(x, y) := sq(x) sq(y)k(x, y) + sq(x) yk(x, y) + sq(y) xk(x, y) + trace( x y k(x, y)).

Importantly, Equation 5 does not require knowledge of sp(x), which is unknown in practice.

2.2 Comparison with the standard marginal likelihood

It is instructive to compare our proposed model/data selection score, the Stein volume criterion, to the standard marginal likelihood q(X(1:N)|F). In particular, we show that the SVC approximates a generalized version of the marginal likelihood. To see this, ﬁrst deﬁne H := R p0(x) log p0(x)dx, the entropy of the complete data distribution, and note that if H were somehow known, then the Kullback-Leibler (kl) divergence between the augmented model and the data distribution could be approximated as

c kl(p0(x) q(x F|θ) q(x B|x F, φB)) := 1

i=1 log q(X(i) F |θ) q(X(i) B |X(i) F , φB) H.

Since multiplying the marginal likelihoods by a ﬁxed constant does not aﬀect the Bayes factors, the following expression could be used instead of the marginal likelihood q(X(1:N)|F) to decide among foreground subspaces:

q(X(1:N)|F)

exp( NH) = Z Z exp N c kl(p0(x) q(x F|θ) q(x B|x F, φB)) π(θ)πB(φB)dθdφB. (6)

Now, consider a generalized marginal likelihood where the nksd replaces the kl:

K := Z Z exp N 1

T [ nksd p0(x) q(x F|θ) q(x B|x F, φB) π(θ)πB(φB)dθdφB. (7)

We refer to K as the nksd marginal likelihood of the augmented model. Intuitively, we expect it to behave similarly to the standard marginal likelihood, except that it quantiﬁes the divergence between the model and data distributions using the nksd instead of the kl. However, a key advantage of the nksd marginal likelihood is that it admits a simple approximation via the SVC when the background model is well-speciﬁed, unlike the standard marginal likelihood. For instance, if the foreground and background are independent, that

Bayesian Data Selection

is, p0(x) = p0(x F)p0(x B) and q(x B|x F, φB) = q(x B|φB), then the theory in Section 6 can be extended to the full augmented model to show that

log K log K

P0 N 1, (8)

where K is the SVC (Equation 2). Thus, the SVC approximates the nksd marginal likelihood of the augmented model, suggesting that the SVC may be a convenient alternative to the standard marginal likelihood. Formally, Section 3 shows that the SVC exhibits consistency properties similar to the standard marginal likelihood, even when p0(x) = p0(x F)p0(x B).

2.3 Computation

Next, we discuss methods for computing the SVC including exact solutions, Laplace/BIC approximation, variational approximation, and comparing many possible choices of F. An attractive feature of the SVC is that, unlike the fully Bayesian augmented model, the computation time required does not grow with the background dimension m B.

2.3.1 Exact solution for exponential families

When the foreground model is an exponential family, the SVC can be computed analytically. Speciﬁcally, in Section A.3, we show if q(x F|θ) = λ(x F) exp(θ t(x F) κ(θ)), then

[ nksd(p0(x F) q(x F|θ)) = θ A θ + B θ + C (9)

where A, B, and C depend on the data X(1:N) but not on θ. Therefore, we can compute the SVC in closed form by choosing a multivariate Gaussian for the prior π(θ) in Equation 2; see Section A.3.

2.3.2 Laplace and BIC approximations

The Laplace approximation is a widely-used technique for computing marginal likelihoods. In Theorem 9, we establish regularity conditions under which a Laplace approximation to the SVC is justiﬁed by being asymptotically correct. The resulting approximation is

T [ nksd(p0(x F) q(x F|θN)) π(θN)

T 2 θ [ nksd(p0(x F) q(x F|θN))|1/2

(m F+m B)/2 (10)

where θN := argminθ [ nksd(p0(x F) q(x F|θ)) is the point at which the estimated nksd is minimized, the minimum Stein discrepancy estimator as deﬁned by Barp et al. (2019). Here, θN is simply used to help compute the approximation and does not depend on π(θ), which can be any prior that is continuous and positive at the limiting value of θN. We can also make a rougher approximation, analogous to the Bayesian information criterion (BIC), which does not require one to compute second derivatives of [ nksd:

T [ nksd(p0(x F) q(x F|θN)) 2π

(m F+m B)/2 . (11)

Weinstein and Miller

This approximation is easy to compute, given a minimum Stein discrepancy estimator θN. Like the SVC, it satisﬁes all of our consistency desiderata (Section B). However, we expect it to perform worse than the SVC when there is not yet enough data for the nksd posterior to be highly concentrated, that is, when a range of θ values can plausibly explain the data.

2.3.3 Comparing many foregrounds using approximate optima

Often, we would like to evaluate many possible subspaces XF when performing data selection. Even when using the Laplace or BIC approximation to the SVC, this can get computationally prohibitive since we need to re-optimize to ﬁnd θN for every F under consideration. Here, we propose a way to reduce this cost by making a fast linear approximation. Deﬁne ℓj(θ) := [ nksd(p0(x Fj) q(x Fj|θ)) for j {1, 2}. For w [0, 1], we can linearly interpolate θN(w) := argmin θ ℓ1(θ) + w(ℓ2(θ) ℓ1(θ)). (12)

Now, θN(0) and θN(1) are the minimum Stein discrepancy estimators for F1 and F2, respectively. Given θN(0), we can approximate θN(1) by applying the implicit function theorem and a ﬁrst-order Taylor expansion (Section A.4):

θN(1) θN(0) 2 θℓ1(θN(0)) 1 θℓ2(θN(0)). (13)

Note that the derivatives of ℓj are often easy to compute with automatic diﬀerentiation (Baydin et al., 2018). Note also that when we are comparing one foreground subspace, such as XF1 = X, to many other foreground subspaces XF2, the inverse Hessian 2 θℓ1(θN(0)) 1

only needs to be computed once. Thus, Equation 13 provides a fast approximate method for computing Laplace or BIC approximations to the SVC for a large number of candidate foregrounds F. We apply this technique in Section 7, where we ﬁnd that it performs well in simulation studies and in practice.

2.3.4 Variational approximation

Variational inference is a method for approximating both the posterior distribution and the marginal likelihood of a probabilistic model. Since the SVC takes the form of a generalized marginal likelihood, we can derive a variational approximation to the SVC. Let rζ(θ) be an approximating distribution parameterized by ζ. By Jensen s inequality, we have

log Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ)dθ

= log Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ) rζ(θ) rζ(θ)dθ

T [ nksd(p0(x F) q(x F|θ)) π(θ) rζ(θ)

T Erζ [ nksd(p0(x F) q(x F|θ)) + Erζ[log π(θ)] Erζ[log rζ(θ)].

Maximizing this lower bound with respect to the variational parameters ζ, and adding the background correction (m B/2) log(2π/N), provides an approximation to the log SVC. Note

Bayesian Data Selection

that this variational approximation falls within the framework of generalized variational inference proposed by Knoblauch et al. (2022). This variational approximation to the SVC is particularly useful when we are aiming to ﬁnd the best subspace XF among a very large number of candidates, since we can jointly optimize the variational parameters ζ and the choice of foreground subspace XF. Here, we do not necessarily need to evaluate the SVC for all foreground subspaces XF under consideration, and can instead rely on optimization methods to search for the best XF from among a large set of possibilities (see Section 8 for an example). Practically, we recommend using the local linear approximation in Section 2.3.3 when the goal is to compare SVC values among many not-too-diﬀerent foreground subspaces XF, and using the variational approximation when the goal is to ﬁnd one best XF from among a large and diverse set.

3. Data selection and model selection consistency

This section presents our consistency results when comparing two diﬀerent foreground subspaces (data selection) or two diﬀerent foreground models (model selection). The theory supporting these results is in Sections 6 and B. We consider four distinct properties that a procedure would ideally exhibit: data selection consistency, nested data selection consistency, model selection consistency, and nested model selection consistency; see Section 6.4 for precise deﬁnitions. We consider six possible model/data selection scores, and we establish which scores satisfy which properties; see Table 1. The SVC and the full marginal likelihood are the only two of the six scores that satisfy all four consistency properties. The intuition behind Bayesian model selection is often explained in terms of Occam s razor: a theory should be as simple as possible but no simpler. Data selection and nested data selection encapsulate a complementary intuition: a theory should explain as much of the data as possible but no more. In other words, when choosing between foreground spaces, a consistent data selection score will asymptotically prefer the highest-dimensional space on which the model is correctly speciﬁed. As in standard model selection, a practical concern in data selection is robustness. For instance, if the foreground model is even slightly misspeciﬁed on XF2, then the empty foreground XF1 = will be asymptotically preferred over XF2. Since the SVC takes the form of a generalized marginal likelihood, techniques for improving robustness with the standard marginal likelihood such as coarsened posteriors, power posteriors, and Bayes Bag could potentially be extended to address this issue (Miller and Dunson, 2019; Huggins and Miller, 2021). We leave exploration of such approaches to future work.

3.1 Data selection consistency

First, consider comparisons between diﬀerent choices of foreground, F1 and F2. When the model is correctly speciﬁed over F1 but not F2, we refer to asymptotic concentration on F1 as data selection consistency (and vice versa if F2 is correct but not F1). For the standard marginal likelihood of the augmented model, we have (see Section B.2)

1 N log q(X(1:N)|F1)

q(X(1:N)|F2)

P0 N kl(p0(x F2) q(x F2|θkl 2, )) kl(p0(x F1) q(x F1|θkl 1, )) (15)

Weinstein and Miller

Consistency property Score d.s. nested d.s. m.s. nested m.s. q(X(1:N)|F) full marginal likelihood K(a) foreground marg lik, background volume K(b) foreground marg NKSD K(c) foreground marg KL, background volume K(d) foreground NKSD, background volume K foreground marg NKSD, background volume

Table 1: Consistency properties satisﬁed by various model/data selection scores. Only the Stein volume criterion K and the full marginal likelihood q(X(1:N)|F) satisfy all four desiderata. (d.s. = data selection, m.s. = model selection, marg = marginal, lik = likelihood.)

where θkl j, := argmin kl(p0(x Fj) q(x Fj|θ)) for j {1, 2}, that is, θkl j, is the parameter value that minimizes the kl divergence between the projected data distribution p0(x Fj) and the projected model q(x Fj|θ). Thus, q(X(1:N)|Fj) asymptotically concentrates on the Fj on which the projected model can most closely match the data distribution in terms of kl. In Theorem 17, we show that under mild regularity conditions, the Stein volume criterion behaves precisely the same way but with the nksd in place of the kl:

P0 N 1 T nksd(p0(x F2) q(x F2|θnksd 2, )) 1

T nksd(p0(x F1) q(x F1|θnksd 1, )) (16)

where θnksd j, := argmin nksd(p0(x Fj) q(x Fj|θ)) for j {1, 2}. Therefore, q(X(1:N)|F) and K both yield data selection consistency. It is important here that the SVC uses a true divergence, rather than a divergence up to a data-dependent constant. If we instead used

m B/2 q(X(1:N) F ), (17)

which employs the foreground marginal likelihood q(X(1:N) F ) = R q(X(1:N) F |θ)π(θ)dθ and a background volume correction, we would get qualitatively diﬀerent behavior (Section B.2):

1 N log K(a) 1 K(a) 2

P0 N kl(p0(x F2) q(x F2|θkl 2, )) kl(p0(x F1) q(x F1|θkl 1, )) + HF2 HF1 (18)

where HFj := R p0(x Fj) log p0(x Fj)dx Fj is the entropy of p0(x Fj) for j {1, 2}. In short, the naive score K(a) is a bad choice: it decides between data subspaces based not just on how well the parametric foreground model performs, but also on the entropy of the data distribution in each space. As a result, K(a) does not exhibit data selection consistency.

3.2 Nested data selection consistency

When XF2 XF1, we refer to the problem of deciding between subspaces F1 and F2 as nested data selection, in counterpoint to nested model selection, where one model is a

Bayesian Data Selection

subset of another (Vuong, 1989). If the model q(x|θ) is well-speciﬁed over XF1, then it is guaranteed to be well-speciﬁed over any lower-dimensional sub-subspace XF2 XF1; in this case, we refer to asymptotic concentration on F1 as nested data selection consistency . In this situation, kl(p0(x Fj) q(x Fj|θkl j, )) and nksd(p0(x Fj), q(x Fj|θnksd j, )) are both zero for j {1, 2}, making it necessary to look at higher-order terms in Equations 15 and 16. In Section B.3, we show that if XF2 XF1, q(x|θ) is well-speciﬁed over XF1, the background models are well-speciﬁed, and their dimensions m B1 and m B2 are constant with respect to N, then 1 log N log q(X(1:N)|F1)

q(X(1:N)|F2)

P0 N 1 2(m F2 + m B2 m F1 m B1) (19)

where m Fj is the eﬀective dimension of the parameter space of q(x Fj|θ). In Theorem 17, we show that under mild regularity conditions, the SVC behaves the same way:

1 log N log K1

P0 N 1 2(m F2 + m B2 m F1 m B1). (20)

Thus, so long as m F2 +m B2 > m F1 +m B1 whenever XF2 XF1, the marginal likelihood and the SVC asymptotically concentrate on the larger foreground F1; hence, they both exhibit nested data selection consistency. This is a natural assumption since the background model is generally more ﬂexible on a per dimension basis than the foreground model. The volume correction (2π/N)m B/2 in the deﬁnition of the SVC is important for nested data selection consistency (Equation 20). An alternative score without that correction,

K(b) := Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ)dθ, (21)

exhibits data selection consistency (Equation 16 holds for K(b)), but not nested data selection consistency; see Sections B.2 and B.3. More subtly, the asymptotics of the SVC in the case of nested data selection also depend on the variance of U-statistics. To illustrate, consider a score that is similar to the SVC but uses c kl instead of [ nksd:

m B/2 Z exp N c kl(p0(x F) q(x F|θ)) π(θ)dθ (22)

where c kl(p0(x F) q(x F|θ)) := 1

N PN i=1 log q(X(i) F |θ) HF and HF is required to be known. The score K(c) exhibits data selection consistency, but not nested data selection consistency. The reason is that the error in estimating the kl is of order 1/

N by the central limit theorem, and this source of error dominates the log N term contributed by the volume correction; see Section B.3. Meanwhile, the error in estimating the nksd is of order 1/N when the model is well-speciﬁed, due to the rapid convergence rate of the U-statistic estimator. Thus, in the SVC, this source of error is dominated by the volume correction; see Theorem 12. The nested data selection results we have described so far assume m B does not depend on N, or at least m B2 m B1 does not depend on N (Theorem 17). However, in Section 2.1, we suggest setting m B = c B r B

N where c B is a constant and r B is the dimension of XB. With this choice, the asymptotics of the SVC for nested data selection become (Theorem 17)

N log N log K1

P0 N 1 2c B (r B2 r B1). (23)

Weinstein and Miller

Since r B1 < r B2 when XF2 XF1, the SVC concentrates on the larger foreground F1, yielding nested data selection consistency. Going beyond the well-speciﬁed case, Theorem 17 shows that Equation 23 holds when nksd(p0(x F1) q(x F1 | θnksd 1, )) = nksd(p0(x F2) q(x F2 | θnksd 2, )) = 0, that is, when the models are misspeciﬁed by the same amount as measured by the nksd. Equation 23 holds regardless of whether m F1 is equal to m F2.

3.3 Model selection and nested model selection consistency

Consider comparing diﬀerent foreground models q1(x F|θ1) and q2(x F|θ2) over the same subspace XF, while using the same background model. We say that a score exhibits model selection consistency if it concentrates on the correct model, when one of the models is correctly speciﬁed and the other is not. When the two models are nested and both are correct, a score exhibits nested model selection consistency if it concentrates on the simpler model. Like the standard marginal likelihood, the SVC exhibits both types of model selection consistency. The standard marginal likelihood satisﬁes (Section B.4)

1 N log q1(X(1:N)|F)

q2(X(1:N)|F)

P0 N kl(p0(x F) q2(x F|θkl 2, )) kl(p0(x F) q1(x F|θkl 1, )) (24)

where θkl j, := argmin kl(p0(x F) qj(x F|θj)) for j {1, 2}. Analogously, by Theorem 17,

P0 N 1 T nksd(p0(x F) q2(x F|θnksd 2, )) 1

T nksd(p0(x F) q1(x F|θnksd 1, )) (25)

where θnksd j, := argmin nksd(p0(x F) qj(x F|θj)) for j {1, 2}. Thus, for both scores, concentration occurs on the model that comes closer to the data distribution in terms of the corresponding divergence (kl or nksd). For nested model selection, suppose both foreground models are well-speciﬁed and m B1 = m B2. Letting m F,j be the parameter dimension of qj(x F|θj), we have (Section B.5)

1 log N log q1(X(1:N)|F)

q2(X(1:N)|F)

P0 N 1 2(m F,2 m F,1). (26)

In Theorem 17, we show that the SVC behaves identically:

1 log N log K1

P0 N 1 2(m F,2 m F,1). (27)

Here, a key role is played by the volume of the foreground parameter space, which quantiﬁes the foreground model complexity. The SVC accounts for this by integrating over foreground parameter space. Meanwhile, a naive alternative that ignores the foreground volume,

m B/2 exp N

T min θ [ nksd(p0(x F) q(x F|θ)) , (28)

exhibits model selection consistency (Equation 25 holds for K(d)) but not nested model selection consistency (Section B.5). The Laplace and BIC approximations to the SVC (Equations 10 and 11) explicitly correct for the foreground parameter volume without integrating.

Bayesian Data Selection

4. Related work

Projection pursuit methods are closely related to data selection in that they attempt to identify interesting subspaces of the data. However, projection pursuit uses certain prespeciﬁed objective functions to optimize over projections, whereas our method allows one to specify a model of interest (Huber, 1985). Another related line of research is on Bayesian goodness-of-ﬁt (GOF) tests, which compute the posterior probability that the data comes from a given parametric model versus a ﬂexible alternative such as a nonparametric model. Our setup diﬀers in that it aims to compare among diﬀerent semiparametric models. Nonetheless, in an eﬀort to address the GOF problem, a number of authors have developed nonparametric models with tractable marginals (Verdinelli and Wasserman, 1998; Berger and Guglielmi, 2001), and using these models as the background component in an augmented model could in theory solve data selection problems. In practice, however, such models can only be applied to one-dimensional or few-dimensional data spaces. In Section 7, we show that naively extending the method of Berger and Guglielmi (2001) to the multi-dimensional setting has fundamental limitations. There is a sizeable frequentist literature on GOF testing using discrepancies (Gretton et al., 2012; Barron, 1989; Gy orﬁand Van Der Meulen, 1991). Our proposed method builds directly on the KSD-based GOF test proposed by Liu et al. (2016) and Chwialkowski et al. (2016). However, using these methods to draw comparisons between diﬀerent foreground subspaces is non-trivial, since the set of alternative models considered by the GOF test, though nonparametric, will be diﬀerent over data spaces with diﬀerent dimensionality. Moreover, the Bayesian aspect of the SVC makes it more straightforward to integrate prior information and employ hierarchical models. In composite likelihood methods, instead of the standard likelihood, one uses the product of the conditional likelihoods of selected statistics (Lindsay, 1988; Varin et al., 2011). Composite likelihoods have seen widespread use, often for robustness or computational purposes. However, in composite likelihood methods, the choice of statistics is ﬁxed before performing inference. In contrast, in data selection the choice of statistics is a central quantity to be inferred. Relatedly, our work connects with the literature on robust Bayesian methods. Doksum and Lo (1990) propose conditioning on the value of an insuﬃcient statistic, rather than the complete data set, when performing inference; also see Lewis et al. (2021). However, making an appropriate choice of statistic requires one to know which aspects of the model are correct; in contrast, our procedure infers the choice of statistic. The nksd posterior also falls within the general class of Gibbs posteriors, which have been studied in the context of robustness, randomized estimators, and generalized belief updating (Zhang, 2006a,b; Jiang and Tanner, 2008; Bissiri et al., 2016; Jewson et al., 2018; Miller and Dunson, 2019). Our theoretical results also contribute to the emerging literature on Stein discrepancies (Anastasiou et al., 2021). Barp et al. (2019) recently proposed minimum kernelized Stein discrepancy estimators and established their consistency and asymptotic normality. In Section 6, we establish a Bayesian counterpart to these results, showing that the nksd posterior is asymptotically normal (in the sense of Bernstein von Mises) and admits a Laplace approximation. To prove this result, we rely on the recent work of Miller (2021) on the asymptotics of generalized posteriors. Since Barp et al. (2019) show that the kernelized

Weinstein and Miller

Stein discrepancy is related to the Hyv arinen divergence in that both are Stein discrepancies, our work bears an interesting relationship to that of Shao et al. (2018), who use a Bayesian version of the Hyv arinen divergence to perform model selection with improper priors. They derive a consistency result analogous to Equation 16, however, their model selection score takes the form of a prequential score, not a Gibbs marginal likelihood as in the SVC, and cannot be used for data selection. In independent recent work, Matsubara et al. (2022) propose a Gibbs posterior based on the KSD and derive a Bernstein-von Mises theorem similar to Theorem 9 using the results of Miller (2021). Their method is not motivated by the Bayesian data selection problem but rather by (1) inference for energy-based models with intractable normalizing constants and (2) robustness to ϵ-contamination. Their Bernstein-von Mises theorem diﬀers from ours in that it applies to a V-statistic estimator of the KSD rather than a U-statistic estimator of the NKSD. Our linear approximation to the minimum Stein discrepancy estimator (Section 2.3.3) is inspired by previous work on empirical inﬂuence functions and the Swiss Army inﬁnitesimal jackknife (Giordano et al., 2019; Koh and Liang, 2017). These previous methods similarly compute the linear response of an extremum estimator with respect to perturbations of the data set, but focus on the eﬀects of dropping datapoints rather than data subspaces.

5. Toy example

The purpose of this toy example is to illustrate the behavior of the Stein volume criterion, and compare it to some of the defective alternatives listed in Table 1, in a simple setting where all computations can be done analytically (Section A.3). In all of the following experiments, we simulated data from a bivariate normal distribution: X(1), . . . , X(N) i.i.d. N((0, 0) , Σ0). To set up the Stein volume criterion, we set T = 5 and we choose a radial basis function kernel, k(x, y) = exp( 1

2 x y 2 2), which factors across dimensions. We considered both data set size-independent values of m B (in particular, m B = 5 r B) and data set sizedependent values of m B (in particular, Equation 3 with α = 0.5, ν = 1, and D = 0.2, where fractional values of D correspond to shared parameters across components in the Pitman-Yor mixture model), obtaining very similar results in each case (shown in Figures 2 and 10, respectively). These choices of m B ensure that, except for at very small N, the background model has more parameters per data dimension than each of the foreground models considered below, which have just one. In particular, m B > 1 r B for all N (in the size-independent case) and for N 5 (in the size-dependent case).

5.1 Data selection consistency

First, we set Σ0 to be a diagonal matrix with entries (1, 1/2), that is, Σ0 = diag(1, 1/2), and for x R2, we consider the model

q(x|θ) = N(x | θ, I)

π(θ) = N(θ | (0, 0) , 10I) (29)

where I denotes the identity matrix. This parametric model is misspeciﬁed, owing to the incorrect choice of covariance matrix. We consider two choices of foreground subspace: the

Bayesian Data Selection

ﬁrst dimension (deﬁned by the projection matrix VF1 = (1, 0) ) or the second dimension (projection matrix VF2 = (0, 1) ). The model is only well-speciﬁed for F1 (not F2), so a successful data selection procedure would asymptotically select F1.

In Figure 2a, we see that the SVC correctly concentrates on F1 as the number of datapoints N increases, with the log SVC ratio growing linearly in N, as predicted by Equation 16. Meanwhile, the naive alternative score K(a) (Equation 17) fails since it depends on the foreground entropies, while K(b) (Equation 21) succeeds since the volume correction is negligible in this case; see Section 3.1 and Table 1.

5.2 Nested data selection consistency

Next, we examine the nested data selection case. We use the same model (Equation 29), but we set Σ0 = I so that the model is well-speciﬁed even without being projected. We compare the complete data space (XF1 = X, projection matrix VF1 = I) to the ﬁrst dimension alone (projection matrix VF1 = (1, 0) ). Nested data selection consistency demands that the higher-dimensional data space XF1 be preferred asymptotically, since the model is wellspeciﬁed for both XF1 and XF2. Figure 2b shows that this is indeed the case for the Stein volume criterion, with the log SVC ratio growing at a log N rate when m B is independent of N, as predicted by Equation 20. When m B depends on N via the Pitman-Yor expression, the log SVC ratio grows at a Nα log N rate (Figure 10b). Meanwhile, Figure 2b shows that K(a) and K(b) both fail to exhibit nested data selection consistency, in accordance with our theory (Section 3.2 and Table 1).

5.3 Model selection consistency (nested and non-nested)

Finally, we examine model selection and nested model selection consistency. We again set Σ0 = I. We ﬁrst compare the (well-speciﬁed) model q(x|θ) = N(x | θ, I) to the (misspeciﬁed) model q(x|θ) = N(x | θ, 2I), using the prior π(θ) = N(θ | (0, 0) , 10I) for both models. As shown in Figure 2c, the SVC correctly concentrates on the ﬁrst model, with the log SVC ratio growing linearly in N, as predicted by Equation 25. The same asymptotic behavior is exhibited by K(a), which is equivalent to the standard Bayesian marginal likelihood in this setting (Section 3.3). Finally, to check nested model selection consistency, we compare two well-speciﬁed nested models: q(x) = N(x | (0, 0) , I) and q(x|θ) = N(x | θ, I). Figure 2d shows that the SVC correctly selects the simpler model (that is, the model with smaller parameter dimension) and the log SVC ratio grows as log N (Equation 27). This, too, matches the behavior of the standard Bayesian marginal likelihood, seen in the plot of K(a).

In this section we describe our formal theoretical results. We start by studying the NKSD and then the NKSD posterior, before ﬁnally establishing data and model selection consistency for the SVC.

Weinstein and Miller

20 40 60 80 number of samples

log 1 log 2

Data Selection

20 40 60 80 number of samples

log 1 log 2

Nested Data Selection

25 50 75 100 125 150 175 200 number of samples

log 1 log 2

Model Selection

20 40 60 80 number of samples

log 1 log 2

Nested Model Selection

Figure 2: Behavior of the Stein volume criterion K, the foreground marginal likelihood with a background volume correction K(a), and the foreground marginal nksd K(b) on toy examples. Here, we set m B = 5 r B. The plots show the results for 5 randomly generated data sets (thin lines) and the average over 100 random data sets (bold lines).

6.1 Properties of the NKSD

Suppose X(1), . . . , X(N) are i.i.d. samples from a probability measure P on X Rd having density p(x) with respect to the Lebesgue measure. Let L1(P) denote the set of measurable functions f such that R f(x) p(x)dx < where is the Euclidean norm. We impose the following regularity conditions to use the nksd to compare P with another probability measure Q having density q(x) with respect to the Lebesgue measure; these are similar to conditions used for the standard ksd in previous work (Liu et al., 2016; Barp et al., 2019).

Bayesian Data Selection

Condition 1 (Restrictions on p and q) Assume sp(x) := x log p(x) and sq(x) := x log q(x) exist and are continuous for all x X, and assume X is connected and open. Further, assume sp, sq L1(P).

We refer to sp as the Stein score function of p. Note that existence of sp(x) implies p(x) > 0. Now, consider a kernel k : X X R. The kernel k is said to be integrally strictly positive deﬁnite if for any g : X R such that 0 < R

X |g(x)|dx < , we have R

X g(x)k(x, y)g(y)dxdy > 0. The kernel k is said to belong to the Stein class of P if R

X x(k(x, y)p(x))dx = 0 for all y X.

Condition 2 (Restrictions on k) Assume the kernel k is symmetric, bounded, integrally strictly positive deﬁnite, and belongs to the Stein class of P.

The following result shows that the nksd can be written in a way that does not involve sp; this is particularly useful for estimating the nksd when P is unknown.

Proposition 3 If Conditions 1 and 2 hold, then the nksd is ﬁnite and

nksd(p(x) q(x)) := EX,Y p[u(X, Y )]

EX,Y p[k(X, Y )] (30)

u(x, y) = sq(x) sq(y)k(x, y) + sq(x) yk(x, y) + sq(y) xk(x, y) + trace( x y k(x, y)). (31)

The proof is in Section C.1. Next, we show the nksd satisﬁes the properties of a divergence.

Proposition 4 If Conditions 1 and 2 hold, then

nksd(p(x) q(x)) 0, (32)

with equality if and only if p(x) = q(x) almost everywhere.

The proof is in Section C.1. Unlike the standard ksd, but like the kl divergence, the nksd exhibits subsystem independence (Caticha, 2004, 2011; Rezende, 2018): if two distributions P and Q have the same independence structure, then the total nksd separates into a sum of individual nksd terms. This is formalized in Proposition 6.

Condition 5 (Shared independence structure) Let x = (x 1 , x 2 ) be a decomposition of a vector x Rd into two subvectors, x1 and x2. Assume p(x) and q(x) factor as p(x) = p(x1)p(x2) and q(x) = q(x1)q(x2), and that the kernel k factors as k(x, y) = k1(x1, y1)k2(x2, y2) where k1 and k2 both satisfy Condition 2.

Proposition 6 (Subsystem independence) If Conditions 1, 2, and 5 hold, then

nksd(p(x) q(x)) = nksd(p(x1) q(x1)) + nksd(p(x2) q(x2)) (33)

where the ﬁrst term on the right-hand side uses kernel k1 and the second term uses k2.

See Section C.1 for the proof. Subsystem independence is powerful since it separates the problem of evaluating the foreground model from that of evaluating the background model. A modiﬁed version applies to the estimator [ nksd(p q) (Equation 5); see Proposition 20.

Weinstein and Miller

6.2 Bernstein von Mises theorem for the NKSD posterior

In this section, we establish asymptotic properties of the SVC and, more broadly, of its corresponding generalized posterior, which we refer to as the nksd posterior, deﬁned as

πN(θ) exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ). (34)

In particular, in Theorem 9, we show that the nksd posterior concentrates and is asymptotically normal, and we establish that the Laplace approximation to the SVC (Equation 10) is asymptotically correct. These results form a Bayesian counterpart to those of Barp et al. (2019), who establish the consistency and asymptotic normality of minimum ksd estimators. Thus, in both the frequentist and Bayesian contexts, we can replace the average log likelihood with the negative ksd and obtain similar key properties. Our results in this section do not depend on whether or not we are working with a foreground subspace, so we suppress the x F notation. Let Θ Rm, and let {Qθ : θ Θ} be a family of probability measures on X Rd

having densities qθ(x) with respect to Lebesgue measure. For notational convenience, we sometimes write q(x|θ) instead of qθ(x). Suppose the data X(1), . . . , X(N) are i.i.d. samples from some probability measure P0 on X having density p0(x) with respect to Lebesgue measure. To ensure the nksd satisﬁes the properties of a divergence for all qθ, and that convergence of [ nksd is uniform on compact subsets of Θ (Proposition 21), we require the following.

Condition 7 Assume Conditions 1 and 2 hold for p0, k, and qθ for all θ Θ. Further, assume that the kernel k has continuous and bounded partial derivatives up to and including second order, and k(x, y) > 0 for all x, y X.

Now we can set up the generalized posterior. First deﬁne

f N(θ) := 1

T [ nksd(p0(x) q(x|θ)) = 1

P i =j uθ(X(i), X(j)) P i =j k(X(i), X(j)) , (35)

where uθ(x, y) is the u(x, y) function from Equation 5 with qθ in place of q. For the case of N = 1, we deﬁne f1(θ) = 0 by convention. Note that Nf N(θ) plays the role of the log likelihood. Also deﬁne

T nksd(p0(x) q(x|θ)), (36)

Θ exp( Nf N(θ))π(θ)dθ,

z N exp( Nf N(θ))π(θ),

where π(θ) is a prior density on Θ. Note that πN(θ)dθ is the NKSD posterior and z N is the corresponding generalized marginal likelihood employed in the SVC. Denote the gradient and Hessian of f by f (θ) = θf(θ) and f (θ) = 2 θf(θ), respectively. To ensure that the nksd posterior is well deﬁned and has an isolated maximum, we assume the following condition.

Bayesian Data Selection

Condition 8 Suppose Θ Rm is a convex set and (a) Θ is compact or (b) Θ is open and f N is convex on Θ with probability 1 for all N. Assume z N < a.s. for all N. Assume f has a unique minimizer θ Θ, f (θ ) is invertible, π is continuous at θ , and π(θ ) > 0.

By Proposition 4, f has a unique minimizer whenever {Qθ : θ Θ} is well-speciﬁed and identiﬁable, that is, when Qθ = P0 for some θ and θ 7 Qθ is injective. In Theorem 9 below, we establish the following results: (1) the minimum [ nksd converges to the minimum nksd; (2) πN concentrates around the minimizer of the nksd; (3) the Laplace approximation to z N is asymptotically correct; and (4) πN is asymptotically normal in the sense of Bernstein von Mises. The primary regularity conditions we need for this theorem are restraints on the derivatives of sqθ with respect to θ (Condition 10). Our proof of Theorem 9 relies on the theory of generalized posteriors developed by Miller (2021). We use for the Euclidean Frobenius norms: for vectors A RD, A = (P i A2 i )1/2; for matrices A RD D, A = (P i,j A2 i,j)1/2; for tensors A RD D D, A = (P i,j,k A2 i,j,k)1/2; and so on.

Theorem 9 If Conditions 7, 8, and 10 hold, then there is a sequence θN θ a.s. such that:

1. f N(θN) f(θ ), f N(θN) = 0 for all N suﬃciently large, and f N(θN) f (θ ) a.s.,

2. letting Bϵ(θ ) := {θ Rm : θ θ < ϵ}, we have Z

Bϵ(θ ) πN(θ)dθ a.s. N 1 for all ϵ > 0, (37)

z N exp( Nf N(θN))π(θ )

| det f (θ )|1/2

almost surely, where a N b N means that a N/b N 1 as N , and

4. letting h N denote the density of

N(θ θN) when θ is sampled from πN, we have that h N converges to N(0, f (θ ) 1) in total variation, that is, Z

h N( θ) N( θ | 0, f (θ ) 1) d θ a.s. N 0. (39)

The proof is in Section C.2. We write 2 θsqθ to denote the tensor in Rd m m in which entry (i, j, k) is 2sqθ(x)i/ θj θk. Likewise, 3 θsqθ denotes the tensor in Rd m m m in which entry (i, j, k, ℓ) is 3sqθ(x)i/ θj θk θℓ. We write N to denote the set of natural numbers.

Condition 10 (Stein score regularity) Assume sqθ(x) has continuous third-order partial derivatives with respect to the entries of θ on Θ. Suppose that for any compact, convex subset C Θ, there exist continuous functions g0,C, g1,C L1(P0) such that for all θ C, x X,

sqθ(x) g0,C(x),

θsqθ(x) g1,C(x). (40)

Weinstein and Miller

Further, assume there is an open, convex, bounded set E Θ such that θ E, E Θ, and the sets n 1

i=1 2 θsqθ(X(i)) : N N, θ E o , (41)

i=1 3 θsqθ(X(i)) : N N, θ E o (42)

are bounded with probability 1.

Next, Theorem 11 shows that in the special case where qθ(x) is an exponential family, many of the conditions of Theorem 9 are automatically satisﬁed.

Theorem 11 Suppose {Qθ : θ Θ} is an exponential family with densities of the form qθ(x) = λ(x) exp(θ t(x) κ(θ)) for x X Rd. Assume Θ = {θ Rm : |κ(θ)| < }, and assume Θ is convex, open, and nonempty. Assume log λ(x) and t(x) are continuously diﬀerentiable on X, x log λ(x) and xt(x) are in L1(P0), and the rows of the Jacobian matrix xt(x) Rm d are linearly independent with positive probability under P0. Suppose Condition 7 holds, f has a unique minimizer θ Θ, the prior π is continuous at θ , and π(θ ) > 0. Then the assumptions of Theorem 9 are satisﬁed for all N suﬃciently large.

The proof is in Section C.2.

6.3 Asymptotics of the Stein volume criterion

The Laplace approximation to the SVC uses the estimate [ nksd and its minimizer θN, rather than the true nksd and its minimizer θ . To establish the consistency properties of the SVC, we need to understand the relationship between the two. To do so, we adapt a standard approach to performing such an analysis of the marginal likelihood, for instance, as in Theorem 1 of Dawid (2011).

Theorem 12 Assume the conditions of Theorem 9 hold, and assume sqθ and θ θ=θ sqθ are in L2(P0). Then as N ,

f N(θN) f N(θ ) = OP0(N 1). (43)

Further, if nksd(p0(x) q(x|θ )) > 0 then

f N(θ ) f(θ ) = OP0(N 1/2), (44)

whereas if nksd(p0(x) q(x|θ )) = 0 then

f N(θ ) f(θ ) = OP0(N 1). (45)

The proof is in Section C.3. Remarkably, Equation 45 shows that f N(θ ) converges to f(θ ) more rapidly when the model is well-speciﬁed, speciﬁcally, at a 1/N rate instead of 1/

N. This is unusual and is crucial for our results in Section 6.4. The standard log likelihood does not exhibit this rapid convergence; see Section B.1. This property of the nksd derives from similar properties exhibited by the standard ksd (Liu et al., 2016, Theorem 4.1). Combined with Theorem 9 (part 3), Theorem 12 implies that when the model is misspeciﬁed, the leading order term of log z N is Nf(θ ), whereas when the model is well-speciﬁed, the leading order term is 1

Bayesian Data Selection

6.4 Data and model selection consistency of the SVC

In this section, we establish the asymptotic consistency of the Stein volume criterion (SVC) when used for data selection, nested data selection, model selection, and nested model selection; see Theorem 17. This provides rigorous justiﬁcation for the claims in Section 3. These results are all in the context of pairwise comparisons between two models or two model projections, M1 and M2. Before proving the results, we formally deﬁne the consistency properties discussed in Section 3. Each property is deﬁned in terms of a pairwise score ρ(M1, M2), such as ρ(M1, M2) = log(K1/K2). For simplicity, we assume ρ(M1, M2) = ρ(M2, M1); this is satisﬁed for all of the cases we consider. Let dim( ) denote the dimension of a real space.

Deﬁnition 13 (Data selection consistency) For j {1, 2}, consider foreground model projections Mj := {q(x Fj|θ) : θ Θ}. We say that ρ satisﬁes data selection consistency if ρ(M1, M2) as N when M1 is well-speciﬁed with respect to p0(x F1) and M2 is misspeciﬁed with respect to p0(x F2).

Deﬁnition 14 (Nested data selection consistency) For j {1, 2}, consider foreground model projections Mj := {q(x Fj|θ) : θ Θ}. We say that ρ satisﬁes nested data selection consistency if ρ(M1, M2) as N when M1 is well-speciﬁed with respect to p0(x F1), XF2 XF1, and dim(XF2) < dim(XF1).

Deﬁnition 15 (Model selection consistency) For j {1, 2}, consider foreground models Mj := {qj(x F|θj) : θj Θj}. We say that ρ satisﬁes model selection consistency if ρ(M1, M2) as N when M1 is well-speciﬁed with respect to p0(x F) and M2 is misspeciﬁed.

Deﬁnition 16 (Nested model selection consistency) For j {1, 2}, consider foreground models Mj := {qj(x F|θj) : θj Θj}. We say that ρ satisﬁes nested model selection consistency if ρ(M1, M2) as N when M1 is well-speciﬁed with respect to p0(x F), M1 M2, and dim(Θ1) < dim(Θ2).

In each case, ρ may diverge almost surely ( strong consistency ) or in probability ( weak consistency ). Note that in Deﬁnitions 13 14, the diﬀerence between M1 and M2 is the choice of foreground data space F, whereas in Deﬁnitions 15 16, M1 and M2 are over the same foreground space but employ diﬀerent model spaces. In Theorem 17, we show that the SVC has the asymptotic properties outlined in Section 3. In combination with the subsystem independence properties of the NKSD (Propositions 6 and 20), Theorem 17 also leads to the conclusion that the SVC approximates the NKSD marginal likelihood of the augmented model (Equation 8). Our proof is similar in spirit to previous results for model selection with the standard marginal likelihood, notably those of Hong and Preston (2005) and Huggins and Miller (2021), but relies on the special properties of the nksd marginal likelihood in Theorem 12.

Theorem 17 For j {1, 2}, assume the conditions of Theorem 12 hold for model Mj deﬁned on XFj, with density qj(x Fj|θj) for θj Θj Rm Fj,j. Let Kj,N be the Stein volume criterion for Mj, with background model penalty m Bj = m Bj(N), and let θj, := argminθj nksd(p0(x Fj) qj(x Fj|θj)). Then:

Weinstein and Miller

1. If m Bj = o(N/ log N) for j {1, 2}, then

1 N log K1,N

P0 N 1 T nksd(p0(x F2) q2(x F2|θ2, )) 1

T nksd(p0(x F1) q1(x F1|θ1, )).

2. If nksd(p0(x F1) q1(x F1|θ1, )) = nksd(p0(x F2) q2(x F2|θ2, )) = 0 and m B2 m B1 does not depend on N, then

1 log N log K1,N

P0 N 1 2(m F2,2 + m B2 m F1,1 m B1).

3. If nksd(p0(x F1) q1(x F1|θ1, )) = nksd(p0(x F2) q2(x F2|θ2, )), m B1 = c B1

N, and m B2 = c B2

N, where c B1 and c B2 are positive and constant in N, then

N log N log K1,N

P0 N 1 2(c B2 c B1).

The proof is in Section C.4. In particular, assuming the conditions of Theorem 12, we obtain the following consistency results in terms of convergence in probability. Let Dj := nksd(p0(x Fj) qj(x Fj|θj, )) for j {1, 2}.

If m Bj = o(N/ log N) then the SVC exhibits data selection consistency and model selection consistency. This holds by Theorem 17 (part 1) since D2 > D1 = 0.

If m B1 = m B2 then the SVC exhibits nested model selection consistency. This holds by Theorem 17 (part 2) since D1 = D2 = 0, m B2 m B1 = 0, and m F2,2 > m F1,1.

Consider a nested data selection problem with XF2 XF1. If (A) m B2 m B1 does not depend on N and m F2,2+m B2 > m F1,1+m B1 or (B) m Bj = c Bj

N and c B2 > c B1 > 0, then the SVC exhibits nested data selection consistency. Cases A and B hold by Theorem 17 (parts 2 and 3, respectively) since D1 = D2 = 0.

7. Application: probabilistic PCA

Probabilistic principal components analysis (p PCA) is a commonly used tool for modeling and visualization. The basic idea is to model the data as linear combinations of k latent factors plus Gaussian noise. The inferred weights on the factors are frequently used to provide low-dimensional summaries of the data, while the factors themselves describe major axes of variation in the data. In practice, p PCA is often applied in settings where it is likely to be misspeciﬁed for instance, the weights are often clearly non-Gaussian. In this section, we show how data selection can be used to uncover sources of misspeciﬁcation and to analyze how this misspeciﬁcation aﬀects downstream inferences. The generative model used in p PCA is

Z(i) N(0, Ik),

X(i)|Z(i) N(HZ(i), v Id), (46)

Bayesian Data Selection

independently for i = 1, . . . , N, where Ik is the k-dimensional identity matrix, Z(i) Rk is the weight vector for datapoint i, H Rd k is the unknown matrix of latent factors, and v > 0 is the variance of the noise. To form a Laplace approximation for the Stein volume criterion, we follow the approach developed by Minka (2001) for the standard marginal likelihood. Speciﬁcally, we parameterize H as

H = U(L v Ik)1/2 (47)

where U is a d k matrix with orthonormal columns (that is, it lies on the Stiefel manifold) and L is a k k diagonal matrix. We use the priors suggested by Minka (2001),

U Uniform(U),

Lii Inverse Gamma(α/2, α/2),

v Inverse Gamma (α/2 + 1)(d k) 1, (α/2)(d k) ,

where U is the set of d k matrices with orthonormal columns and Lii is the ith diagonal entry of L. We set α = 0.1 in the following experiments, and we use pymanopt (Townsend et al., 2016) to optimize U over the Stiefel manifold (Section D).

7.1 Simulations

In simulations, we evaluate the ability of the SVC to detect partial misspeciﬁcation. We set d = 6, draw the ﬁrst four dimensions from a p PCA model with k = 2 and

1 0 1 1 0 1 1 1

and generate dimensions 5 and 6 in such a way that p PCA is misspeciﬁed. We consider two misspeciﬁed scenarios: scenario A (Figure 3a) is that

W (i) Bernoulli(0.5),

X(i) 5:6 | W (i) N (0, ΣW (i)) , (50)

where ΣW (i) = (0.05)W (i)I2. Scenario B (Figure 3d) is the same but with

1 ( 1)W (i)0.99 ( 1)W (i)0.99 1

Scenario B is more challenging because the marginals of the misspeciﬁed dimensions are still Gaussian, and thus, misspeciﬁcation only comes from the dependence between X5 and X6. As illustrated in Figures 3b and 3e, both kinds of misspeciﬁcation are very hard to see in the lower-dimensional latent representation of the data. Our method can be used to both (i) detect misspeciﬁed subsets of dimensions, and (ii) conversely, ﬁnd a maximal subset of dimensions for which the p PCA model provides a reasonable ﬁt to the data. We set T = 0.05 in the SVC, based on the calibration procedure

Weinstein and Miller

4 2 0 2 4 4

0 2 density

misspecified dimension 2

4 2 0 2 4 misspecified dimension 1

(a) Scenario A, misspeciﬁed dimensions.

4 3 2 1 0 1 2 3 4 latent z1

(b) Scenario A, p PCA latent space.

200 400 600 800 1000 number of samples

balanced accuracy

SVC Polya Tree

(c) Scenario A, accuracy in detecting misspeciﬁed dimensions.

4 2 0 2 4 4

misspecified dimension 2

4 2 0 2 4 misspecified dimension 1

(d) Scenario B, misspeciﬁed dimensions.

4 3 2 1 0 1 2 3 4 latent z1

(e) Scenario B, p PCA latent space.

500 1000 1500 2000 number of samples

balanced accuracy

SVC Polya Tree

(f) Scenario B, accuracy in detecting misspeciﬁed dimensions.

500 1000 1500 2000 number of samples

mean runtime (seconds)

SVC Polya Tree

(g) Mean runtime over 5 repeats.

Figure 3: Data selection in the probabilistic PCA model.

Bayesian Data Selection

in Section A.1 (Section D.3). We use the Pitman-Yor mixture model expression for the background model dimension (Equation 3), with α = 0.5, ν = 1, and D = 0.2. This value of D ensures that the number of background model parameters per data dimension is greater than the number of foreground model parameters per data dimension except for at very small N, since there are two foreground parameters for each additional data dimension in the p PCA model, and m B > 2 r B for N 20. We performed leave-one-out data selection, comparing the foreground space XF0 = X to foreground spaces XFj for j {1, . . . , d}, which exclude the jth dimension of the data. We computed the log SVC ratio log(Kj/K0) = log Kj log K0 using the BIC approximation to the SVC (Section 2.3.2) and the approximate optima technique (Section 2.3.3). We quantify the performance of the method in detecting misspeciﬁed dimensions in terms of the balanced accuracy, deﬁned as (TN/N + TP/P)/2 where TN is the number of true negatives (dimension by dimension), N is the number of negatives, TP is the number of true positives, and P is the number of positives. Experiments were repeated independently ﬁve times. Figures 3c and 3f show that as the sample size increases, the SVC correctly infers that dimensions 1 through 4 should be included and dimensions 5 and 6 should be excluded.

7.2 Comparison with a nonparametric background model

To benchmark our method, we compare with an alternative approach that uses an explicit augmented model. The P olya tree is a nonparametric model with a closed-form marginal likelihood that is tractable for one-dimensional data (Lavine, 1992). We deﬁne a ﬂexible background model by sampling each dimension j of the background space independently as

Xj Polya Tree(F, F, η), (52)

with the P olya tree constructed as by Berger and Guglielmi (2001) (Section D.4). We set F = N(0, 10), F = N(0, 10), and η = 1000 so that the model is weighted only very weakly towards the base distribution.

We performed data selection using the marginal likelihood of the P olya tree augmented model, computing the marginal of the p PCA foreground model using the approximation of Minka (2001). The accuracy results for data selection are in Figures 3c and 3f. On scenario A (Equation 50), the P olya tree augmented model requires signiﬁcantly more data to detect which dimensions are misspeciﬁed. On scenario B (Equation 51) the P olya tree augmented model fails entirely, preferring the full data space XF0 = X which includes all dimensions (Figure 3f). The reason is that the background model is misspeciﬁed due to the assumption of independent dimensions, and thus, the asymptotic data selection results (Equations 15 and 19) do not hold. This could be resolved by using a richer background model that allows for dependence between dimensions, however, computing the marginal likelihood under such a model would be computationally challenging. Even with the independence assumption, the P olya tree approach is already substantially slower than the SVC (Figure 3g).

Weinstein and Miller

7.3 Application to p PCA for single-cell RNA sequencing

Single-cell RNA sequencing (sc RNAseq) has emerged as a powerful technology for highthroughput characterization of individual cells. It provides a snapshot of the transcriptional state of each cell by measuring the number of RNA transcripts from each gene. PCA is widely used to study sc RNAseq data sets, both as a method for visualizing diﬀerent cell types in the data set and as a pre-processing technique, where the latent embedding is used for downstream tasks like clustering and lineage reconstruction (Qiu et al., 2017; van Dijk et al., 2018). We applied data selection to answer two practical questions in the application of probabilistic PCA to sc RNAseq data: (1) Where is the p PCA model misspeciﬁed? (2) How does partial misspeciﬁcation of the p PCA model aﬀect downstream inferences?

7.3.1 Model criticism

Our ﬁrst goal was to verify that the SVC provides reasonable inferences of partial model misspeciﬁcation in practice. We examined two diﬀerent sc RNAseq data sets, focusing for illustration on a data set from human peripheral blood mononuclear cells taken from a healthy donor, and pre-processed the data following standard procedures in the ﬁeld (Section D.5). We subsampled each data set to 200 genes (selected randomly from among the 2000 most highly expressed) and 2000 cells (selected randomly) for computational tractability, then mean-subtracted and standardized the variance of each gene, again following standard practice in the ﬁeld. The number of latent components k was set to 3, based on the procedure of Minka (2000). We performed leave-one-out data selection, comparing the foreground space XF0 := X to foreground spaces XFj that exclude the jth gene. We computed the log SVC ratio log Kj log K0 using the BIC approximation to the SVC (Section 2.3.2) and the approximate optima technique (Section 2.3.3). We used the same setting of T and of m B as was used in simulation, resulting in a background model complexity of m B = 20 r B for data sets of this size. Based on the SVC criterion, 162 out of 200 genes should be excluded from the foreground p PCA model, suggesting widespread partial misspeciﬁcation. Figure 4 compares the histogram of individual genes to their estimated density under the p PCA model inferred for XF0 = X. Those genes most favored to be excluded (namely, UBE2V2 and IRF8) show extreme violations of normality, in stark contrast to those genes most favored to be included (MT-CO1 and RPL6). Next, we compared the results of our data selection approach to a more conventional strategy for model criticism. Criticism of partially misspeciﬁed models can be challenging in practice because misspeciﬁcation of the model over some dimensions of the data can lead to substantial model-data mismatch in dimensions for which the model is indeed well-speciﬁed (Jacob et al., 2017). The standard approach to model criticism ﬁrst ﬁt a model, then identify aspects of the data that the model poorly explains can therefore be misleading if our aim is to determine how the model might be improved (e.g., in the context of Box s loop , Blei, 2014). In particular, standard approaches such as posterior predictive checks will be expected to overstate problems with components of the model that are well-speciﬁed and understate problems with components of the model that are misspeciﬁed. Bayesian data selection circumvents this issue by evaluating augmented models, which replace potentially misspeciﬁed components of the model by well-speciﬁed compo-

Bayesian Data Selection

4 2 0 2 4 6 x

empirical density

MT-CO1, rank 1

4 2 0 2 4 x

empirical density

RPL6, rank 2

0 1 2 3 4 5 x

empirical density

UBE2V2, rank 199

0 2 4 6 8 10 x

empirical density

IRF8, rank 200

Figure 4: (a,b) Histograms of gene expression (after pre-processing), i.e., X(1) j , . . . , X(N) j , for genes j selected to be included in the foreground space based on the log SVC ratio log Kj log K0. The estimated density under the p PCA model is shown in blue. (c,d) Histograms of example genes selected to be excluded. Higher ranks (in each title) correspond to larger log SVC ratios.

nents. To illustrate the diﬀerence between these approaches in practice, we compared the SVC to a closely analogous measurement of error for the full foreground model (inferred from XF0 = X),

log Ej log E0 := N

T [ nksd(p0(x Fj) q(x Fj|θ0,N)) + N

T [ nksd(p0(x) q(x|θ0,N)) (53)

where θ0,N := argmin [ nksd(p0(x) q(x|θ)) is the minimum nksd estimator for the foreground model when including all dimensions. This model criticism score evaluates the amount of model-data mismatch contributed by the subspace XBj when modeling all data dimensions with the foreground model. For comparison, the BIC approximation to the log

Weinstein and Miller

Figure 5: Scatterplot comparison and projected marginals of the leave-one-out log SVC ratio, log Kj log K0 (with m Bj = m F0 m Fj), and the conventional full model criticism score, log Ej log E0, for each gene.

SVC ratio is

log Kj log K0 N

T [ nksd(p0(x Fj) q(x Fj|θj,N) + N

T [ nksd(p0(x) q(x|θ0,N))

+ m Bj + m Fj m F0

where θj,N := argmin [ nksd(p0(x Fj) q(x Fj|θ)) is the minimum nksd estimator for the projected foreground model applied to the restricted data set, which we approximate as θ0,N plus the implicit function correction derived in Section 2.3.3. Figure 5 illustrates the diﬀerences between the conventional criticism approach (log Ej log E0) and the log SVC ratio on an sc RNAseq data set. To enable direct comparison of the two methods, we focus on the lower order terms of Equation 54, that is, we set m Bj = m F0 m Fj. We see that the amount of error contributed by XBj, as judged by the SVC, is often substantially higher than the amount indicated by the conventional criticism approach, implying that the conventional criticism approach understates the problems caused by individual genes and, conversely, overstates the problems with the rest of the model. Using the SVC instead of a standard criticism approach can also help clarify trends in where the proposed model fails. A prominent concern in sc RNAseq data analysis is the common occurrence of cells that show exactly zero expression of a certain gene (Pierson and Yau, 2015; Hicks et al., 2018). We found a Spearman correlation of ρ = 0.89 between the conventional criticism log Ej log E0 for a gene j and the fraction of cells with zero expression of that gene j, suggesting that this is an important source of model-data mismatch in this sc RNAseq data set, but not necessarily the only source (Figure 6a). However,

Bayesian Data Selection

0.0 0.2 0.4 0.6 0.8 fraction of cells with zero expression

log j log 0

0.0 0.2 0.4 0.6 0.8 fraction of cells with zero expression

log j log 0

0.0 0.2 0.4 0.6 0.8 fraction of cells with zero expression

log j log 0

0.0 0.2 0.4 0.6 0.8 fraction of cells with zero expression

log j log 0

Figure 6: (a) Comparison of the conventional criticism score, for each gene j, and the fraction of cells that show zero expression of that gene j in the raw data. Spearman ρ = 0.89, p < 0.01. (b) Same as (a) but with the log SVC ratio. Spearman ρ = 0.98, p < 0.01. In orange are genes that would be included when using a background model with c B = 20 and in blue are genes that would be excluded. (c) Same as (a) for a data set taken from a MALT lymphoma (Section D.5). Spearman ρ = 0.81, p < 0.01. (d) Same as (b) for the MALT lymphoma data set. Spearman ρ = 0.99, p < 0.01.

the log SVC ratio yields a Spearman correlation of ρ = 0.98, suggesting instead that the amount of model-data mismatch can be entirely explained by the fraction of cells with zero expression (Figure 6b). These observations are repeatable across diﬀerent sc RNAseq data sets (Figure 6c, 6d).

7.3.2 Evaluating robustness

Data selection can also be used to evaluate the robustness of the foreground model to partial model misspeciﬁcation. This is particularly relevant for p PCA on sc RNAseq data, since the inferred latent embeddings of each cell are often used for downstream tasks such

Weinstein and Miller

as clustering, lineage reconstruction, and so on. Misspeciﬁcation may produce spurious conclusions, or alternatively, misspeciﬁcation may be due to structure in the data that is scientiﬁcally interesting. To understand how partial misspeciﬁcation of the p PCA model aﬀects the latent representation of cells (and thus, downstream inferences), we performed data selection with a sequence of background model complexities c B, where m B = c B r B (Figure 7a). We inferred the p PCA parameters based only on genes that the SVC selects to include in the foreground subspace. Figures 7e-7b visualize how the latent representation changes as c B grows and fewer genes are selected. We can observe the representation morphing into a standard normal distribution, as we would expect in the case where the p PCA model is well-speciﬁed. However, the relative spatial organization of cells in the latent space remains fairly stable, suggesting that this aspect of the latent embedding is robust to partial misspeciﬁcation. We can conclude that, at least in this example, misspeciﬁcation strongly contributes to the non-Gaussian shape of the latent representation of the data set, but not to the distinction between subpopulations.

8. Application: Glass model of gene regulation

A central goal in the study of gene expression is to discover how individual genes regulate one another other s expression. Early studies of single cell gene expression noted the prevalence of genes that were bistable in their expression level (Shalek et al., 2013; Singer et al., 2014). This suggests a simple physical analogy: if individual gene expression is a two-state system, we might study gene regulation with the theory of interacting two-state systems, namely spin glasses. We can consider for instance a standard model of this type in which each cell i is described by a vector of spins zi = (zi1, . . . , zid) drawn from an Ising model, specifying whether each gene j {1, . . . , d} is on or oﬀ . In reality, gene expression lies on a continuum, so we use a continuous relaxation of the Ising model and parameterize each spin using a logistic function, setting zij1(xij, µ, τ) = 1/(1 + exp( τ(xij µ))) and zij2(xij, µ, τ) = 1 zij1(xij, µ, τ). Here, xij is the observed expression level of gene j in cell i, the unknown parameter µ controls the threshold for whether the expression of a gene is on (such that zij (1, 0) ) or oﬀ (such that zij (0, 1) ), and the unknown parameter τ > 0 controls the sharpness of the threshold. The complete model is then given by

X(i) p(xi|H, J, µ,τ) := 1 ZH,J,µ,τ exp X

j H j zij(xij, τ, µ) + X

j >j z ij(xij, τ, µ)Jjj zij (xij , τ, µ)

where ZH,J,µ,τ is the unknown normalizing constant of the model, and the vectors Hj R2

and matrices Jjj R2 2 are unknown parameters. This model is motivated by experimental observations and is closely related to RNAseq analysis methods that have been successfully applied in the past (Friedman et al., 2000; Friedman, 2004; Ding and Peng, 2005; Chen et al., 2015; Banerjee et al., 2008; Duvenaud et al., 2008; Liu et al., 2009; Huynh-Thu et al., 2010; Moignard et al., 2015; Matsumoto et al., 2017). However, from a biological perspective we can expect that serious problems may occur when applying the model naively to an sc RNAseq data set. Genes need not exhibit bistable expression: it is straightforward in theory to write down models of gene regulation that do not have just one

Bayesian Data Selection

0 25 50 75 100 125 150 log j log 0

number of genes

c = 10 c = 20 c = 40 c = 60

Figure 7: (a) Histogram of log SVC ratios log Kj log K0 for all 200 genes in the data set (with m Bj = m F0 m Fj). Dotted lines show the value of the volume correction term in the SVC for diﬀerent choices of background model complexity c B; for each choice, genes with log Kj log K0 values above the dotted line would be excluded from the foreground subspace based on the SVC. (b) Posterior mean of the ﬁrst two latent variables (z1 and z2), with the p PCA model applied to the genes selected with a background model complexity of c B = 10 (keeping 23 genes in the foreground). (c-e) Same as (b), but with c B = 20 (keeping 38 genes), c B = 40 (keeping 87 genes) and c B = 60 (keeping all 200 genes). In (a)-(d), the points are colored using the z1 value when c B = 60.

or two steady states gene expression may fall on a continuum, or oscillate, or have three stable states and many alternative patterns have been well-documented empirically (Alon, 2019). Interactions between genes may also be more complex than the model assumes, involving for instance three-way dependencies between genes. All of these biological concerns can potentially produce severe violations of the proposed two-state glass model s assumptions. Data selection provides a method for discovering where the proposed model applies.

Applying standard Bayesian inference to the glass model is intractable, since the normalizing constant is unknown (it is an energy-based model). However, the normalizing

Weinstein and Miller

constant does not aﬀect the SVC, so we can still perform data selection. We used the variational approximation to the SVC in Section 2.3.4. We placed a Gaussian prior on H and a Laplace prior on each entry of J to encourage sparsity in the pairwise gene interactions; we also used Gaussian priors for µ and τ after applying an appropriate transform to remove constraints (Section E.1). Following the logic of stochastic variational inference, we optimized the SVC variational approximation using minibatches of the data and a reparameterization gradient estimator (Hoﬀman et al., 2013; Kingma and Welling, 2014; Kucukelbir et al., 2017). We also simultaneously stochastically optimized the set of genes included in the foreground subspace, using Leave-One-Out REINFORCE (Kool et al., 2019; Dimitriev and Zhou, 2021) to estimate log-odds gradients. We implemented the model and inference strategy within the probabilistic programming language Pyro by deﬁning a new distribution with log probability given by the negative NKSD (Bingham et al., 2019). Pyro provides automated, GPU-accelerated stochastic variational inference, requiring less than an hour for inference on data sets with thousands of cells. See Section E.1 for more details on these inference procedures. We examined three sc RNAseq data sets, taken from (i) peripheral blood monocytes (PBMCs) from a healthy donor (2,428 cells), (ii) a MALT lymphoma (7,570 cells), and (iii) mouse neurons (10,658 cells) (Section E.2). We preprocessed the data following standard protocols and focused on 200 high expression, high variability genes in each data set, based on the metric of Gigante et al. (2020). We set T = 0.05 as in Section 7, and used the Pitman-Yor expression for m B (Equation 3) with α = 0.5, ν = 1 and D = 100. This value of D ensures that the number of background model parameters per data dimension is larger than the number of foreground model parameters per data dimension except for at very small N; in particular, there are 798 foreground model parameter dimensions associated with each data dimension (from the 199 interactions Jjj that each gene has with each other gene, plus the contribution of Hj), and m B > 798 r B for N 13. Our data selection procedure selects 65 genes (32.5%) in the PBMC data set, 0 genes in the neuron data set, and 187 genes (93.5%) in the MALT data set; note that for a lower value of m B, in particular using D = 10, no genes are selected in the MALT data set. These results suggest substantial partial misspeciﬁcation in the PBMC and neuron data sets, and more moderate partial misspeciﬁcation in the MALT data set. We investigated the biological information captured by the foreground model on the MALT data set. In particular, we looked at the approximate NKSD posterior for the selected 187 genes, and compared it to the approximate NKSD posterior for the model when applied to all 200 genes. (Note that, since the glass model lacks a tractable normalizing constant, we cannot compare standard Bayesian posteriors.) Figure 8 shows, for a subset of selected genes, the posterior mean of the interaction energy Ejj := Jjj 21 + Jjj 12 Jjj 22 Jjj 11, that is, the total diﬀerence in energy between two genes being in the same state versus in opposite states. We focused on strong interactions with | Ejj | > 1, corresponding to just 5% of all possible gene-gene interactions (Figure 12). One foreground gene with especially large loading onto the top principal component of the E matrix is CD37 (Figure 8). In B-cell lymphomas, of which MALT lymphoma is an example, CD37 loss is known to be associated with decreased patient survival (Xu Monette et al., 2016). Further, previous studies have observed that CD37 loss leads to high NF-κB pathway activation (Xu-Monette et al., 2016). Consistent with this observation,

Bayesian Data Selection

CD37 HSPA1A HLA-DQA1

LY9 HLA-DRA HLA-DPB1 HLA-DQB1

EZR CD69 FOSB DNAJB1 MTRNR2L12

GAPDH HLA-DPA1

CD55 NFKBIA

JUN RPLP0 PPP1R15A

KLF2 CREM ACTG1 AC058791.1

ID3 BCL2A1 GPR183

IER2 TNFAIP3

RPL4 RPL13A

Figure 8: Posterior mean interaction energies Ejj := Jjj 21 + Jjj 12 Jjj 22 Jjj 11 for a subset of the selected genes. For visualization purposes, weak interactions (| Ejj | 1) are set to zero, and genes with less than 10 total strong connections are not shown. Genes are sorted based on their (signed) projection onto the top principal component of the E matrix.

the estimated interaction energies in our model suggest that decreasing CD37 will lead to higher expression of REL, an NF-κB transcription factor ( ECD37,REL = 2.5), decreased expression of NKFBIA, an NF-κB inhibitor ( ECD37,NKFBIA = 3.6), and higher expression of BCL2A1, a downstream target of the NF-κB pathway ( ECD37,BCL2A1 = 2.1). Separately, a knockout study of Cd37 in B-cell lymphoma in mice does not show Ig M expression (de Winde et al., 2016), consistent with our model ( ECD37,IGHM = 8.2). The same study does show MHC-II expression, and our model predicts the same result, for

Weinstein and Miller

4 2 0 2 4 pre-selection Ejj

post-selection Ejj

Figure 9: Comparison of posterior mean interaction energies Ejj for a model applied to all 200 genes (pre-data selection) to those learned from a model applied to the selected foreground subspace (post-data selection). Each point corresponds to a pairwise interaction between two of the selected 187 genes.

HLA-DQ in particular ( ECD37,HLA-DQA1 = 5.0, ECD37,HLA-DQB1 = 3.7). These results suggest that the data selection procedure can successfully ﬁnd systems of interacting genes that can plausibly be modeled as a spin glass, and which, in this case, are relevant for cancer. To investigate whether data selection provided a beneﬁt in this analysis, we compare with the results obtained by applying the foreground model to the full data set of all 200 genes. All but one of the interactions listed above have | E| < 1 in the full foreground model, and three have opposite signs ( ECD37,NFKBIA = +0.7, ECD37,IGHM = +0.0, ECD37,HLA-DQB1 = 0.6); see Figure 13. Across all 187 selected genes, we ﬁnd only a moderate correlation between the interaction energies estimated when using the full foreground model compared with the data selection-based model (Spearman s rho = 0.30, p < 0.01; Figure 9). These results show that using data selection can lead to substantially diﬀerent, and arguably more biologically plausible, downstream conclusions as compared to naive application of the foreground model to the full data set. As a simple alternative, one might wonder whether genes that are poorly ﬁt by the model could be identiﬁed simply by looking their posterior uncertainty under the full foreground model. This simple approach does not work well, however, since it is possible for parameters to have low uncertainty even when the model poorly describes the data. Indeed, we found that examining uncertainty in the glass model does not lead to the same conclusions as performing data selection: the genes excluded by our data selection procedure are not the ones with the highest uncertainty in their interactions (as measured by the mean posterior standard deviation of Ejj under the NKSD posterior), though they do have above average

Bayesian Data Selection

uncertainty (Figure 14a). Instead, the genes excluded by our data selection procedure are the ones with the highest fraction of cells with zero expression, violating the assumptions of the foreground model (Figure 14b). These results show how data selection provides a sound, computationally tractable approach to criticizing and evaluating complex Bayesian models.

9. Discussion

Statistical modeling is often described as an iterative process, where we design models, infer hidden parameters, critique model performance, and then use what we have learned from the critique to design new models and repeat the process (Gelman et al., 2013). This process has been called Box s loop (Blei, 2014). From one perspective, data selection oﬀers a new criticism approach. It goes beyond posterior predictive checks and related methods by changing the model itself, replacing potentially misspeciﬁed components with a ﬂexible background model. This has important practical consequences: since misspeciﬁcation can distort estimates of model parameters in unpredictable ways, predictive checks are likely to indicate mismatch between the model and the data across the entire space X even when the proposed parametric model is only partially misspeciﬁed. Our method, by contrast, reveals precisely those subspaces of X where model-data mismatch occurs. From another perspective, data selection is outside the design-infer-critique loop. An underlying assumption of Box s loop is that scientists want to model the entire data set. As data sets get larger, and measurements get more extensive, this desire has led to more and more complex (and often diﬃcult to interpret) models. In experimental science, however, scientists have often followed the opposite trajectory: faced with a complicated natural phenomenon, they attempt to isolate a simpler example of the phenomenon for close study. Data selection oﬀers one approach to formalizing this intuitive idea in the context of statistical analysis: we can propose a simple parametric model and then isolate a piece of the whole data set a subspace XF to which this model applies. When working with large, complicated data sets, this provides a method of searching for simpler phenomena that are hypothesized to exist. There are several directions for future work and improvement upon our proposed data selection approach. First, we have focused in our applied examples on discovering subsets of data dimensions. However, our theoretical results show that one can perform data selection on linear subspaces in general; for instance, in the context of sc RNAseq, we might ﬁnd that a model can describe a certain set of linear gene expression programs. Even more generally, one might be interested in discovering nonlinear features of the data that the model can explain such as a set of nonlinear gene expression programs and this would require extending our approach, perhaps by (1) applying a nonlinear volume-preserving map to the data, and then (2) performing standard linear data selection. Second, we have focused on choosing one best XF from among a ﬁnite set of possibilities. A future direction is to provide rigorous asymptotic guarantees when there are inﬁnitely many possible choices of XF, such as the set of all linear subspaces of X. Another future direction is to provide uncertainty quantiﬁcation of XF, rather than just point estimation. Here, it is important to consider the uncertainty due to having ﬁnite data as well as non-identiﬁability, since there may exist multiple optimal values of XF; for instance, this

Weinstein and Miller

can occur if the model is well speciﬁed over marginals of the data but not over the joint distribution of the data. Third, in many applications, researchers will be interested in inferring the parameters θ of the foreground model when applied to the selected subspace XF. On ﬁnite data, it is conceivable that foreground subspaces XF that are more likely to be selected are also more likely to have certain values of θ, which could create a postdata selection bias in conclusions about θ, analogous to the bias that occurs in post-selection inference (Yekutieli, 2012). The data selection problem does not ﬁt neatly in the framework of post-selection inference, however, so further investigation will be required to understand if, when, and to what extent such bias occurs. Finally, in comparison to the augmented model marginal likelihood, the SVC makes diﬀerent judgments as to what types of model-data mismatch are important. The nksd and the kl divergence are quite diﬀerent and do not, in general, coincide or tightly bound one another, so a model-data mismatch that looks big to one divergence may not look big to the other, and vice versa (Matsubara et al., 2022). The preference of the nksd for certain types of errors is not essential to achieving consistent data selection and nested data selection, but is very relevant to the practical use and interpretation of the SVC. One could use another divergence instead of the nksd in the deﬁnition of the SVC, and this would typically be expected to yield consistent model selection and nested model selection (Appendix B.1 and Miller, 2021), however, consistent data selection and nested data selection are more challenging, and depend on a combination of special properties that our nksd estimator possesses (Section 3). Developing data selection approaches with diﬀerent modeldata mismatch preferences, therefore, remains an open challenge. In summary, Bayesian data selection is a rich area for future work.

Acknowledgments

The authors wish to thank Jonathan Huggins, Pierre Jacob, Andre Nguyen, Elizabeth Wood, and the anonymous reviewers for helpful discussions and suggestions. We would like to thank Debora S. Marks in particular for suggesting the use of a Potts model in RNAseq analysis. E.N.W. was supported by the Fannie and John Hertz Fellowship. J.W.M. is supported by the National Institutes of Health grant 5R01CA240299-02.

Bayesian Data Selection

Appendix A. Methods details

A.1 Calibrating T

The SVC contains a hyperparameter T > 0. To choose an appropriate value of T, we aim, roughly, to match the coverage of the generalized posterior

πsvc N (θ)dθ = 1

T [ nksd(p0(x)) q(x|θ)) π(θ)dθ

to the coverage of the standard Bayesian posterior

πkl N (θ)dθ = 1 q(X(1:N)) exp N X

i=1 log q(X(i)|θ) π(θ)dθ

when the model is well-speciﬁed. Let θ be the true parameter value, such that p0(x) = q(x|θ ) almost everywhere. Let Gkl(θ) := 2 θEX p0[ log q(X|θ)] and let θkl N := argmax PN i=1 log q(X(i)|θ) be the maximum likelihood estimator. Let hkl N be the density of

N(θ θkl N ) when θ πkl N . Under regularity conditions (Miller, 2021), according to the Bernstein von Mises theorem, hkl N converges to a normal distribution in total variation, Z

hkl N (x) N x | 0, Gkl(θ ) 1 dx a.s. N 0.

According to Theorem 9, the generalized posterior associated with the SVC has analogous behavior. Let Gsvc(θ) := 2 θ 1 T nksd(p0(x) q(x|θ)) and let θsvc N := argmin [ nksd(p0(x) q(x | θ)). Let hsvc N be the density of

N(θ θsvc N ) when θ πsvc N . Then by Theorem 9, hsvc N converges to a normal distribution in total variation, Z

hsvc N (x) N x | 0, Gsvc(θ ) 1 dx a.s. N 0.

For the uncertainty in each posterior to be roughly the same order of magnitude, we want

det Gkl(θ ) det Gsvc(θ ),

or equivalently,

det 2 θ θ=θ nksd(p0(x) q(x|θ))

det 2 θ θ=θ EX p0[ log q(X|θ)]

To choose a single T value, we simulate true parameters from the prior, generate data from each simulated true parameter, and take the median of the estimated T values. That is, we use the median ˆT across samples drawn as

X(i) iid q(x|θ )

| det 2 θ θ=θ [ nksd(p0(x) q(x|θ)) |

| det 2 θ θ=θ 1 N PN i=1 log q(X(i)|θ) |

In practice, we ﬁnd that the order of magnitude of ˆT is stable across samples θ from π(θ). See Section D.3 for an example.

Weinstein and Miller

A.2 Kernel recommendations

To obtain subsystem independence (Proposition 6), we suggest using a kernel that factors across subspaces, such that k(X, Y ) = k F(XF, YF)k B(XB, YB) where k F and k B are integrally strictly positive deﬁnite kernels. In the applications in Sections 7 and 8, we use the following kernel.

Deﬁnition 18 The factored inverse multiquadric (IMQ) kernel is deﬁned as

c2 + (xi yi)2 β/d

for x, y Rd, where β [ 1/2, 0) and c > 0.

Note that this kernel factors across any subset of dimensions, that is, if S {1, . . . , d} and Sc = {1, . . . , d} \ S, then we can write k(x, y) = k S(x S, y S)k Sc(x Sc, y Sc). Thus, if the foreground subspace XF is the span of a subset of the standard basis, such that x F = V x = x S for some S {1, . . . , d}, then it follows that k factors as k(x, y) = k F(x F, y F)k B(x B, y B). Along with this observation, the next result shows that the factored IMQ satisﬁes the conditions of Theorem 9 that pertain to k alone.

Proposition 19 The factored IMQ kernel is symmetric, positive, bounded, integrally strictly positive deﬁnite, and has continuous and bounded partial derivatives up to order 2.

Proof It is clear that k(x, y) = k(y, x) and k(x, y) > 0. Next, we show that k has continuous and bounded partial derivatives up to order 2. Note that we can write k(x, y) = Qd i=1 ψ(xi yi) where ψ(r) = (c2 + r2)β/d for r R. Diﬀerentiating, we have

d 2r c2 + r2 ψ(r)

d 2 c2 + r2 ψ(r).

Since r2 0 and β < 0, |ψ(r)| c2β/d for all r R. Further, it is straightforward to verify that |ψ (r)| and |ψ (r)| are bounded on R by using the fact that |r|/(c2 + r2) 1/(2c). By the chain rule, it follows that for all i, j, the functions k(x, y), | k/ xi|, and | 2k/ xi yj| are bounded. Thus, we conclude that k, k , and 2k are bounded. Finally, we show that k is integrally strictly positive deﬁnite. First, for any d, for x, y Rd, the function (x, y) 7 (c2 + x y 2 2)β/d is an integrally strictly positive deﬁnite kernel (see, for example, Section 3.1 of Sriperumbudur et al., 2010); we refer to this as the standard IMQ kernel. Since the factored IMQ is a product of one-dimensional standard IMQ kernels, it deﬁnes a kernel on Rd (Lemma 4.6 of Steinwart and Christmann, 2008) and is positive deﬁnite (Theorem 4.16 of Steinwart and Christmann, 2008). By Bochner s theorem (Theorem 3 of Sriperumbudur et al., 2010), a continuous positive deﬁnite kernel can be expressed in terms of the Fourier transform of a ﬁnite nonnegative Borel measure.

Bayesian Data Selection

In particular, applying Bochner s theorem to ψ(r), we have

k(x, y) = Ψ(x y) :=

i=1 ψ(xi yi) =

1(xi yi)ωi dΛ0(ωi)

1(x y) ω dΛ(ω)

by Fubini s theorem, where Λ0 is the ﬁnite nonnegative Borel measure on R associated with ψ(r) and Λ = Λ0 Λ0 is the resulting product measure on Rd. Applying Bochner s theorem in the other direction, we see that Ψ is a positive deﬁnite function. Moreover, since the standard IMQ kernel is characteristic (Theorem 7 of Sriperumbudur et al., 2010), it follows that the support of Λ0 is R (Theorem 9 of Sriperumbudur et al., 2010), and thus the support of Λ is Rd. This implies that the factored IMQ kernel k is characteristic (Theorem 9 of Sriperumbudur et al., 2010) and, since k is also translation invariant, k must be integrally strictly positive deﬁnite (Section 3.4 of Sriperumbudur et al., 2011).

Our choice of the factored IMQ kernel is motivated by the analysis of Gorham and Mackey (2017), which suggests that the standard IMQ is a good default choice for the kernelized Stein discrepancy, particularly when working with distributions that are (roughly speaking) very spread out. In particular, it is straightforward to show that the factored IMQ kernel, like the standard IMQ kernel, meets the conditions of Theorem 3.2 of Huggins and Mackey (2018). However, we do not pursue further the question of whether the nksd with the factored IMQ detects convergence and non-convergence since our statistical setting is diﬀerent from that of Gorham and Mackey (2017), and we are assuming the data consists of i.i.d. samples from some underlying distribution rather than correlated samples from an MCMC chain which may or may not converge.

A.3 Exact solution for exponential families

Here, we show that when q(x|θ) is an exponential family, the estimated nksd has the form

[ nksd(p0(x) q(x|θ)) = θ A θ + B θ + C (56)

where A, B, and C depend on the data but not on θ. Since qθ(x) = q(x|θ) = λ(x) exp(θ t(x) κ(θ)), we have sqθ(x) = x log λ(x) + ( xt(x)) θ where ( xt(x))ij = ti/ xj. Thus, we can write

uθ(x, y) (57)

: = sqθ(x) sqθ(y)k(x, y) + sqθ(x) yk(x, y) + sqθ(y) xk(x, y) + trace( x y k(x, y))

= θ [( xt(x))( yt(y)) k(x, y)]θ

+ [( x log λ(x)) ( yt(y)) k(x, y) + ( y log λ(y)) ( xt(x)) k(x, y)

+ ( xk(x, y)) ( yt(y)) + ( yk(x, y)) ( xt(x)) ]θ

+ [( x log λ(x)) ( y log λ(y))k(x, y) + ( y log λ(y)) ( xk(x, y))

+ ( x log λ(x)) ( yk(x, y)) + trace( x y k(x, y))]. (58)

Weinstein and Miller

Then the estimated nksd takes the form in Equation 56 if we choose

A := 1 P i =j k(X(i), X(j))

i =j xt(X(i)) xt(X(j)) k(X(i), X(j))

B := 1 P i =j k(X(i), X(j))

( x log λ(X(i))) xt(X(j)) k(X(i), X(j))

+ ( x log λ(X(j))) xt(X(i)) k(X(i), X(j))

+( xk(X(i), X(j))) xt(X(j)) + ( yk(X(i), X(j))) xt(X(i))

C := 1 P i =j k(X(i), X(j))

( x log λ(X(i))) ( x log λ(X(j)))k(X(i), X(j))

+ ( x log λ(X(j))) xk(X(i), X(j))

+( x log λ(X(i))) yk(X(i), X(j)) + trace( x y k(X(i), X(j))) .

If the prior on θ is N(µ, Σ0), then the SVC is

m B/2 (2π) m F/2(det Σ0) 1/2

T [θ A θ + B θ + C] exp 1

2(θ µ) Σ 1 0 (θ µ) dθ

m B/2 (2π) m F/2(det Σ0) 1/2

T A + Σ 1 0 θ + N

T B + µ Σ 1 0 θ N

2µ Σ 1 0 µ dθ

m B/2 (det Σ0) 1/2 det 2N

T A + Σ 1 0 1/2

T B + µ Σ 1 0 2N

T A + Σ 1 0 1 N

T B + µ Σ 1 0 N

2µ Σ 1 0 µ .

Meanwhile, if q(x|θ) = N(θ, Σ) where Σ is a ﬁxed covariance matrix, then we have x log λ(x) = Σ 1x and xt(x) = Σ 1.

A.4 Comparing many foregrounds using approximate optima

Here, we justify the technique described in Section 2.3.3. As in Section 2.3.3, deﬁne ℓj(θ) = [ nksd(p0(x Fj) q(x Fj|θ)) for j {1, 2}, and let θN(w) = argminθ L(w, θ) where

L(w, θ) := ℓ1(θ) + w(ℓ2(θ) ℓ1(θ))

for w [0, 1]. We assume that the conditions of Theorem 9 are met, over both XF1 and XF2. Since ( L/ θi)(w, θN(w)) = 0, we have

θi (w, θN(w)) = 2L w θi (w, θN(w)) + X

2L θi θj (w, θN(w))

Bayesian Data Selection

or equivalently, in matrix/vector notation,

0 = w( θL(w, θN(w))) = θ w L(w, θN) + 2 θL(w, θN) w(θN(w)).

Rearranging, we have

wθN(w) = 2 θL(w, θN) 1 θ w L(w, θN).

At w = 0 we ﬁnd, plugging back in the deﬁnition of L,

wθN(0) = 2 θℓ1(θN(0)) 1( θℓ2(θN(0)) θℓ1(θN(0)))

= 2 θℓ1(θN(0)) 1 θℓ2(θN(0)).

Applying a ﬁrst-order Taylor series expansion gives us θN(1) θN(0) + wθN(0), which yields Equation 13.

Appendix B. Asymptotics of the alternative selection criteria

Theorem 17 shows that the SVC exhibits all four types of consistency: data selection, nested data selection, model selection, and nested model selection. In this section, we establish the consistency properties of the alternative criteria considered in Section 3.

We ﬁrst review the asymptotics of the standard marginal likelihood, discussed in depth by Dawid (2011) and Hong and Preston (2005), for example. Deﬁne

fkl N (θ) := 1

i=1 log q(X(i)|θ), θkl N := argmin θ fkl N (θ),

fkl(θ) := EX p0[log q(X|θ)], θkl := argmin θ fkl(θ).

Let m be the dimension of the parameter space. Under suitable regularity conditions (Miller, 2021), the Laplace approximation to the marginal likelihood is

q(X(1:N)) = Z q(X(1:N)|θ)π(θ)dθ exp Nfkl N (θkl N ) π(θkl ) det 2 θ fkl(θkl ) 1/2

almost surely, where a N b N indicates that a N/b N 1 as N . We can rewrite this as

log q(X(1:N)) + N(fkl N (θkl N ) fkl N (θkl ))

+ N(fkl N (θkl ) fkl(θkl )) + Nfkl(θkl )

2 log N log

π(θkl )(2π)m/2 det 2 θfkl(θkl ) 1/2

Weinstein and Miller

20 40 60 80 number of samples

log 1 log 2

Data Selection

20 40 60 80 number of samples

log 1 log 2

Nested Data Selection

25 50 75 100 125 150 175 200 number of samples

log 1 log 2

Model Selection

20 40 60 80 number of samples

log 1 log 2

Nested Model Selection

Figure 10: Behavior of the Stein volume criterion K, the foreground marginal likelihood with a background volume correction K(a), and the foreground marginal nksd K(b) on toy examples. The plots show the results for 5 randomly generated data sets (thin lines) and the average over 100 random data sets (bold lines). Here, unlike Figure 2, the Pitman-Yor expression for m B is used (Equation 3), with α = 0.5, ν = 1, and D = 0.2.

As shown by Dawid (2011) and Hong and Preston (2005), under regularity conditions,

N(fkl N (θkl N ) fkl N (θkl )) = OP0(1)

N(fkl N (θkl ) fkl(θkl )) = OP0(

Nfkl(θkl ) = OP0(N)

π(θkl )(2π)m/2 det 2 θfkl(θkl ) 1/2

Bayesian Data Selection

The nksd marginal likelihood has a similar decomposition. Following Section 6, deﬁne

fnksd N (θ) := 1

T [ nksd(p0(x) q(x|θ)), θnksd N := argmin θ fnksd N (θ),

fnksd(θ) := 1

T nksd(p0(x) q(x|θ)), θnksd := argmin θ fnksd(θ).

As shown in Theorem 9,

z N := Z exp( Nfnksd N (θ))π(θ)dθ exp( Nfnksd N (θnksd N ))π(θnksd ) det 2 θfnksd(θnksd ) 1/2

almost surely as N . As above, we can rewrite this as

log z N + N(fnksd N (θnksd N ) fnksd N (θnksd ))

+ N(fnksd N (θnksd ) fnksd(θnksd )) + Nfnksd(θnksd )

2 log N log

π(θnksd )(2π)m/2 det 2 θfnksd(θnksd ) 1/2

By Theorem 12, we have

N(fnksd N (θnksd N ) fnksd N (θnksd )) = OP0(1),

N(fnksd N (θnksd ) fnksd(θnksd )) = OP0(

Nfnksd(θnksd ) = OP0(N),

π(θnksd )(2π)m/2 det 2 θfnksd(θnksd ) 1/2

and further, when the model is well-speciﬁed, such that nksd(p0(x) q(x|θnksd )) = 0,

N(fnksd N (θnksd ) fnksd(θnksd )) = OP0(1). (64)

For ease of reference, here are the various scores that we consider for model/data selection. Marginal likelihood of the augmented model (foreground+background):

q(X(1:N)|F) = Z Z q(X(1:N) F |θ) q(X(1:N) B |X(1:N) F , φB)π(θ)πB(φB)dθdφB.

Foreground marginal nksd, background volume correction (a.k.a. the SVC):

m B/2 Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ)dθ.

Foreground marginal likelihood, background volume correction:

m B/2 q(X(1:N) F ).

Weinstein and Miller

Foreground marginal nksd:

K(b) := Z exp N

T [ nksd(p0(x F) q(x F|θ)) π(θ)dθ.

Foreground marginal kl, background volume correction:

m B/2 Z exp N c kl(p0(x F) q(x F|θ)) π(θ)dθ.

Foreground nksd, background volume correction:

m B/2 exp N

T min θ [ nksd(p0(x F) q(x F|θ)) .

Foreground nksd, foreground and background volume correction (a.k.a. BIC for SVC)

(m F+m B)/2 exp N

T min θ [ nksd(p0(x F) q(x F|θ)) .

B.2 Data selection

Assume m Bj = o(N/ log N) for j {1, 2}. By Equations 60 and 61,

1 N log K(a) 1 K(a) 2

P0 N EX p0[ log q(XF2|θkl 2, )] EX p0[ log q(XF1|θkl 1, )] (65)

= kl(p0(x F2) q(x F2|θkl 2, )) + HF2 kl(p0(x F1) q(x F1|θkl 1, )) HF1,

so K(a) does not satisfy data selection consistency. The SVC satisﬁes data selection consistency by Theorem 17 (part 1). We show that the other scores also satisfy data selection consistency. Since K(b) = (2π/N) m B/2K where K is the SVC, by Theorem 17 (part 1),

1 N log K(b) 1 K(b) 2

P0 N 1 T nksd(p0(x F2) q(x F2|θnksd 2, )) 1

T nksd(p0(x F1) q(x F1|θnksd 1, )). (66)

By Equation 65 and the fact that K(c) = exp(NHF)K(a), we have

1 N log K(c) 1 K(c) 2

P0 N kl(p0(x F2) q(x F2|θkl 2, )) kl(p0(x F1) q(x F1|θkl 1, )). (67)

Since K(d) = (2π/N)m B/2 exp( Nfnksd N (θnksd N )), then by Equation 63,

1 N log K(d) 1 K(d) 2

P0 N 1 T nksd(p0(x F2) q(x F2|θnksd 2, )) 1

T nksd(p0(x F1) q(x F1|θnksd 1, )). (68)

Similarly, since KBIC = (2π/N)m F/2K(d),

1 N log KBIC 1 KBIC 2

P0 N 1 T nksd(p0(x F2) q(x F2|θnksd 2, )) 1

T nksd(p0(x F1) q(x F1|θnksd 1, )). (69)

Bayesian Data Selection

These methods therefore satisfy data selection consistency. For the marginal likelihood of the augmented model, suppose m B1 and m B2 do not depend on N. Then by Equation 60,

1 N log q(X(1:N)|F1)

q(X(1:N)|F2)

P0 N EXF2 p0[ log q(XF2|θkl 2, )] + EX p0[ log q(XB2|XF2, φkl 2, )]

EXF1 p0[ log q(XF1|θkl 1, )] EX p0[ log q(XB1|XF1, φkl 1, )]

We can rewrite this in terms of the KL divergence. First note the decomposition,

H = Z p0(x) log p0(x)dx = Z p0(x Fj) log p0(x Fj)dx Fj Z p0(x) log p0(x Bj|x Fj)dx

for j {1, 2}. Adding and subtracting the entropy H in Equation 70, and using the fact that the background model is well-speciﬁed,

1 N log q(X(1:N)|F1)

q(X(1:N)|F2)

P0 N kl(p0(x F2) q(x F2|θkl 2, )) + kl(p0(x B2|x F2) q(x B2|x F2, φkl 2, ))

kl(p0(x F1) q(x F1|θkl 1, )) kl(p0(x B1|x F1) q(x B1|x F1, φkl 1, ))

= kl(p0(x F2) q(x F2|θkl 2, )) kl(p0(x F1) q(x F1|θkl 1, )). (71)

B.3 Nested data selection

In nested data selection, we are concerned with situations in which XF2 XF1 and the model is well-speciﬁed over both XF1 and XF2. Assume further that m B2 m B1 does not depend on N. First, consider K(d) and KBIC. Since K(d) = (2π/N)m B/2 exp( Nfnksd N (θnksd N )) and by Theorem 12, fnksd N (θnksd N ) = OP0(1/N), we have

1 log N log K(d) 1 K(d) 2

P0 N m B2 m B1

Likewise, since KBIC = (2π/N)m F/2K(d), it follows that

1 log N log KBIC 1 KBIC 2

P0 N m F2 + m B2 m F1 m B1

As in Section 6.4, it is natural to assume m B2 > m B1 and m F2 + m B2 > m F1 + m B1, in which case these criteria satisfy nested data selection consistency. None of K(a), K(b), and K(c) are guaranteed to satisfy nested data selection consistency, because the contribution of background model complexity is negligible or nonexistent. To see this, note that assuming m Bj = o(N/ log N), by Equation 65 we have

1 N log K(a) 1 K(a) 2

P0 N HF2 HF1. (74)

Meanwhile, since K(b) = (2π/N) m B/2K then by Theorem 17 (part 2),

1 log N log K(b) 1 K(b) 2

P0 N m F2 m F1

Weinstein and Miller

Since XF2 XF1, we have m F2 m F1 except perhaps in highly contrived scenarios. If

m F2 < m F1 then Equation 75 shows that log(K(b) 1 /K(b) 2 ) P0 . On the other hand, if m F2 = m F1, then by Equations 62 and 63, log(K(b) 1 /K(b) 2 ) = OP0(1), so it is not possi-

ble to have log(K(b) 1 /K(b) 2 ) P0 . Therefore, K(b) does not satisfy nested data selection consistency. Since K(c) = e NHFK(a) = e NHF(2π/N)m B/2q(X(1:N) F ), then by Equations 60 and 61,

N log K(c) 1 K(c) 2 =

i=1 log p0(X(i) F1)

p0(X(i) F2) E log p0(XF1)

+ OP0(N 1/2 log N). (76)

If σ2 := VP0(log p0(XF1)/p0(XF2)) is positive and ﬁnite, then by the central limit theorem

and Slutsky s theorem, N 1/2 log(K(c) 1 /K(c) 2 ) D N(0, σ2). Thus, K(c) randomly selects F1 or F2 with equal probability, and therefore, it does not satisfy nested data selection consistency. For the marginal likelihood of the augmented model, suppose m B1 and m B2 do not depend on N. The marginal likelihood achieves nested data selection consistency because the augmented models are both well-speciﬁed and describe the complete data space X; this guarantees that the OP0(

N) terms in the marginal likelihood decomposition cancel. Speciﬁcally, p0(x) = q(x | θkl j, , φkl j, , Fj) for j {1, 2}, and thus, by Equations 60 and 61 applied to the augmented model,

1 log N log q(X(1:N)|F1)

q(X(1:N)|F2)

P0 N m F2 + m B2 m F1 m B1

Nested data selection consistency follows assuming m F2 + m B2 > m F1 + m B1 as before. This can be contrasted with Equation 76, where although both foreground models are well-speciﬁed, they describe diﬀerent data (X(1:N) F1 versus X(1:N) F2 ), so the OP0(

N) terms remain.

B.4 Model selection

All of the criteria we consider satisfy model selection consistency. To see this, we apply the same asymptotic analysis as used for data selection in Section B.2, under the same conditions on m B, obtaining

1 N log q1(X(1:N)|F)

q2(X(1:N)|F)

P0 N kl(p0(x F) q2(x F|θkl 2, )) kl(p0(x F) q1(x F|θkl 1, )), (78)

1 N log K(a) 1 K(a) 2

P0 N kl(p0(x F) q2(x F|θkl 2, )) kl(p0(x F) q1(x F|θkl 1, )), (79)

1 N log K(b) 1 K(b) 2

P0 N 1 T nksd(p0(x F) q2(x F|θnksd 2, )) 1

T nksd(p0(x F) q1(x F|θnksd 1, )), (80)

1 N log K(c) 1 K(c) 2

P0 N kl(p0(x F) q2(x F|θkl 2, )) kl(p0(x F) q1(x F|θkl 1, )), (81)

Bayesian Data Selection

1 N log K(d) 1 K(d) 2

P0 N 1 T nksd(p0(x F) q2(x F|θnksd 2, )) 1

T nksd(p0(x F) q1(x F|θnksd 1, )), (82)

1 N log KBIC 1 KBIC 2

P0 N 1 T nksd(p0(x F) q2(x F|θnksd 2, )) 1

T nksd(p0(x F) q1(x F|θnksd 1, )). (83)

Note that in contrast to the data selection case, K(a) satisﬁes model selection consistency since the entropy terms HFj cancel due to the fact that F is ﬁxed. We can think of this as a consequence of the kl divergence s subsystem independence; if we are just interested in modeling a ﬁxed foreground space, there is no problem considering the foreground marginal likelihood alone (Caticha, 2004, 2011; Rezende, 2018).

B.5 Nested model selection

In nested model selection, since both models are well-speciﬁed, we have qj(x F|θkl j, ) = p0(x F) = qj(x F|θnksd j, ) for j {1, 2}. Thus, the estimated divergences cancel:

[ nksd(p0(x F) q1(x F|θnksd 1, )) = [ nksd(p0(x F) q2(x F|θnksd 2, )),

i=1 log q1(X(i) F |θkl 1, ) =

i=1 log q2(X(i) F |θkl 2, ),

c kl(p0(x F) q1(x F|θkl 1, )) = c kl(p0(x F) q2(x F|θkl 2, )).

Using this along with Equations 60 64, under the same conditions on m B as in Section B.2,

1 log N log q1(X(1:N)|F)

q2(X(1:N)|F)

P0 N m F,2 m F,1

1 log N log K(a) 1 K(a) 2

P0 N m F,2 m F,1

1 log N log K(b) 1 K(b) 2

P0 N m F,2 m F,1

1 log N log K(c) 1 K(c) 2

P0 N m F,2 m F,1

log K(d) 1 K(d) 2 = OP0(1), (88)

1 log N log KBIC 1 KBIC 2

P0 N m F,2 m F,1

where we are using the assumption that the background model is the same in the two augmented models q1 and q2 and so m B,1 = m B,2. Only K(d) fails to satisfy nested model selection consistency.

Weinstein and Miller

Appendix C. Proofs

C.1 Proofs of NKSD properties

Proof of Proposition 3 By assumption, the kernel is bounded, say |k(x, y)| B, and sp, sq L1(P). Thus, by the Cauchy Schwarz inequality,

X (sq(x) sp(x)) (sq(y) sp(y))k(x, y)p(x)p(y)dxdy

X sq(x) sp(x) p(x)dx 2 < .

Since the kernel is integrally strictly positive deﬁnite and |k(x, y)| B,

X k(x, y)p(x)p(y)dxdy B < . (90)

Thus, the nksd is ﬁnite. Equation 30 follows from Theorem 3.6 of Liu et al. (2016).

Proof of Proposition 4 The denominator of the nksd is positive since k is integrally strictly positive deﬁnite. Deﬁning δ(x) = sq(x) sp(x), the numerator of the nksd is

X δ(x) δ(y)k(x, y)p(x)p(y)dxdy =

X δi(x)δi(y)k(x, y)p(x)p(y)dxdy. (91)

If δi(x)p(x) = 0 almost everywhere with respect to Lebesgue measure on X, then the ith term on the right-hand side is zero. Meanwhile, if δi(x)p(x) is not a.e. zero, then R

X |δi(x)|p(x)dx > 0, and hence, the ith term is positive since k is integrally strictly positive deﬁnite and δi L1(P) by assumption. Hence, the nksd is nonnegative, and equals zero if and only if δ(x)p(x) = 0 almost everywhere.

Suppose δ(x)p(x) = 0 almost everywhere. Since p(x) > 0 on X by assumption, this implies sp(x) = sq(x) a.e., and in fact, sp(x) = sq(x) for all x X by continuity. Since X is open and connected, then by the gradient theorem (that is, the fundamental theorem of calculus for line integrals), p(x) q(x), and hence, p(x) = q(x) on X. Conversely, if p(x) = q(x) almost everywhere, then δ(x)p(x) = 0 almost everywhere.

Proof of Proposition 6 Deﬁne

δ1(x1) := x1 log q(x) x1 log p(x) = x1 log q(x1) x1 log p(x1)

δ2(x2) := x2 log q(x) x2 log p(x) = x2 log q(x2) x2 log p(x2).

Bayesian Data Selection

Let X, Y p(x) independently. Note that E[k1(X1, Y1)] > 0 and E[k2(X2, Y2)] > 0 since k1 and k2 are integrally strictly positive deﬁnite by assumption. Therefore,

nksd(p(x) q(x)) = E[( x log q(X) x log p(X)) ( x log q(Y ) x log p(Y ))k(X, Y )]

E[k(X, Y )]

= E[δ1(X1) δ1(Y1)k1(X1, Y1)]E[k2(X2, Y2)]

E[k1(X1, Y1)]E[k2(X2, Y2)] + E[δ2(X2) δ2(Y2)k2(X2, Y2)]E[k1(X1, Y1)]

E[k1(X1, Y1)]E[k2(X2, Y2)]

= E[δ1(X1) δ1(Y1)k1(X1, Y1)]

E[k1(X1, Y1)] + E[δ2(X2) δ2(Y2)k2(X2, Y2)]

E[k2(X2, Y2)] = nksd(p(x1) q(x1)) + nksd(p(x2) q(x2)).

The following modiﬁed version applies to the estimator [ nksd(p q) (Equation 5).

Proposition 20

[ nksd(p(x) q(x)) = nksd(p(x1) q(x1)) + nksd(p(x2) q(x2)) (92)

nksd(p(x1) q(x1)) :=

P i =j u1(X(i) 1 , X(j) 1 )k2(X(i) 2 , X(j) 2 ) P i =j k1(X(i) 1 , X(j) 1 )k2(X(i) 2 , X(j) 2 )

u1(x1, y1) :=sq(x1) sq(y1)k1(x1, y1) + sq(x1) y1k1(x1, y1) + sq(y1) x1k1(x1, y1)

+ trace( x1 y1k1(x1, y1))

sq(x1) := x1 log q(x1),

and vice versa for nksd(p(x2) q(x2)) with the roles of 1 and 2 swapped.

Proof Recall the deﬁnition of [ nksd(p(x) q(x)) in Equation 5. Note that x1k(x, y) = k2(x2, y2) x1k1(x1, y1) and x1 log q(x) = x1 log q(x1). Examining u(x, y) term-by-term,

x log q(x) y log q(y)k(x, y) = x1 log q(x1) y1 log q(y1)k1(x1, y1) k2(x2, y2)

+ x2 log q(x2) y2 log q(y2)k2(x2, y2) k1(x1, y1),

x log q(x) yk(x, y) =[ x1 log q(x1) y1k1(x1, y1)]k2(x2, y2)

+ [ x2 log q(x2) y2k2(x2, y2)]k1(x1, y1),

xk(x, y) y log q(y) =[ x1k1(x1, y1) y1 log q(y1)]k2(x2, y2),

+ [ x2k2(x2, y2) y2 log q(y2)]k1(x1, y1)

trace( x y k(x, y)) = trace( x1 y1k1(x1, y1))k2(x2, y2),

+ trace( x2 y2k2(x2, y2))k1(x1, y1).

Thus, deﬁning u1 and u2 as in Proposition 20, we have

u(x, y) = u1(x1, y1)k2(x2, y2) + u2(x2, y2)k1(x1, y1),

k(x, y) = k1(x1, y1)k2(x2, y2).

Weinstein and Miller

The result follows.

To interpret Proposition 20, note that

EX,Y p[u1(X1, Y1)k2(X2, Y2)] EX,Y p[k1(X1, Y1)k2(X2, Y2)] = EX1,Y1 p(x1)[u1(X1, Y1)]

EX1,Y1 p(x1)[k1(X1, Y1)] = nksd(p(x1) q(x1)),

so nksd(p(x1) q(x1)) is an estimator of nksd(p(x1) q(x1)), and likewise for nksd(p(x2) q(x2)).

C.2 Proof of Theorems 9 and 11

Our proofs in this section build on the proof of Theorem 3 of Barp et al. (2019).

Proposition 21 Under the assumptions of Theorem 9, for any compact convex C Θ,

sup θ C |f N(θ) f(θ)| a.s. 0. (93)

Proof First, we establish almost sure convergence for the denominator of f N(θ). Since k is assumed to be bounded and to have bounded derivatives up to order two, we can choose B < such that B |k| + xk + x y k . In particular, the expected value of the kernel is ﬁnite: Z

X |k(x, y)|P0(dx)P0(dy) B < . (94)

By the strong law of large numbers for U-statistics (Theorem 5.4A of Serﬂing, 2009),

i =j k(X(i), X(j)) a.s. N

X k(x, y)P0(dx)P0(dy). (95)

Note that the limit is positive since k(x, y) > 0 for all x, y X. For the numerator, we establish bounds on uθ and θuθ. Let C Θ be compact and convex. By Equation 5, for all θ C and all x, y X,

|uθ(x, y)| |sqθ(x) sqθ(y)k(x, y)| + |sqθ(x) yk(x, y)|

+ |sqθ(y) xk(x, y)| + | trace( x y k(x, y))|

sqθ(x) sqθ(y) B + sqθ(x) B + sqθ(y) B + Bd

g0,C(x)g0,C(y)B + g0,C(x)B + g0,C(y)B + Bd

=: h0,C(x, y).

Similarly, for all θ C and all x, y X,

θuθ(x, y) θ(sqθ(x) sqθ(y))k(x, y) + θ(sqθ(x) yk(x, y))

+ θ(sqθ(y) xk(x, y)) + θ trace( x y k(x, y))

g0,C(x)g1,C(y)B + g0,C(y)g1,C(x)B + g1,C(x)B + g1,C(y)B

=: h1,C(x, y).

Note that h0,C and h1,C are continuous and belong to L1(P0 P0).

Bayesian Data Selection

Let S1 S2 X be a sequence of compact sets such that M=1SM = X. Note that this implies M=1SM SM = X X. Suppose for the moment that, for each M, the following collections of functions are equicontinuous on C: (A) (θ 7 uθ(x, y) : x, y SM) and (B) θ 7 R uθ(x, y)P0(dy) : x SM . Assuming this, Theorem 1 of Yeo and Johnson (2001) shows that

i =j uθ(X(i), X(j)) Z

X uθ(x, y)P0(dx)P0(dy) a.s. N 0, (98)

and that θ 7 R

X uθ(x, y)P0(dx)P0(dy) is continuous. (Note that although Yeo and Johnson (2001) assume X = R, their proof goes through without further modiﬁcation for any nonempty X Rd.) Combining Equations 95 and 98, we have

supθ C 1 N(N 1) P i =j uθ(X(i), X(j)) R R uθ(x, y)P0(dx)P0(dy)

1 N(N 1) P i =j k(X(i), X(j))

Thus, it follows that supθ C |f N(θ) f(θ)| 0 a.s. by Equations 95 and 96. To complete the proof, we must show that (A) and (B) are equicontinuous on C. (A) Since θ 7 uθ(x, y) is diﬀerentiable on C, then by the mean value theorem, we have that for all θ1, θ2 C and all x, y SM,

|uθ1(x, y) uθ2(x, y)| θ|θ= θ uθ(x, y) θ1 θ2

h1,C(x, y) θ1 θ2

sup x,y SM h1,C(x, y) θ1 θ2 <

where θ = γθ1 + (1 γ)θ2 for some γ [0, 1]. Here, the second inequality holds since θ C by the convexity of C, and the supremum is ﬁnite because a continuous function on a compact set attains its maximum. Therefore, (θ 7 uθ(x, y) : x, y SM) is equicontinuous on C. (B) To see that θ 7 R uθ(x, y)P0(dy) : x SM is equicontinuous on C, ﬁrst note that Z |uθ(x, y)|P0(dy) Z h0,C(x, y)P0(dy) < .

Further, due to Equations 96 and 97, we can apply the Leibniz integral rule (Folland, 1999, Theorem 2.27) and ﬁnd that θ R uθ(x, y)P0(dy) exists and is equal to R θuθ(x, y)P0(dy). Now we apply the mean value theorem and the same reasoning as before to ﬁnd that for all θ1, θ2 C and all x SM, R uθ1(x, y)P0(dy) R uθ2(x, y)P0(dy) θ|θ= θ R uθ(x, y)P0(dy) θ1 θ2

θ1 θ2 Z θ|θ= θ uθ(x, y) P0(dy)

θ1 θ2 sup x SM

Z h1,C(x, y)P0(dy) <

Weinstein and Miller

where θ = γθ1 + (1 γ)θ2 for some γ [0, 1]. The supremum is ﬁnite since x 7 R h1,C(x, y)P0(dy) is continuous, which can easily be seen by plugging in the deﬁnition of h1,C. Therefore, θ 7 R uθ(x, y)P0(dy) : x SM is equicontinuous on C.

Proposition 22 Under the assumptions of Theorem 9, (f N : N N) is uniformly bounded on E.

Proof First, for any x, y X, if we deﬁne g(θ) = sqθ(x) and h(θ) = sqθ(y) then uθ = (g h)k + g ( yk) + h ( xk) + trace( x y k). By diﬀerentiating, applying Minkowski s inequality to the resulting sum of tensors, and applying the Cauchy Schwarz inequality to each term, we have

3 θuθ(x, y) 3g h k + 3 2g h k + 3 g 2h k + g 3h k

+ 3g yk + 3h xk .

Using the symmetry of the kernel to combine like terms, this yields that X

i =j 3 θuθ(X(i), X(j))

2 3 θsqθ(X(i)) sqθ(X(j)) B + 6 2 θsqθ(X(i)) θsqθ(X(j)) B + 2 3 θsqθ(X(i)) B

where B < such that B |k| + xk + x y k . Since f N(θ) = 0 when N = 1 by deﬁnition, we can assume without loss of generality that N 2, so 1 N 1 = 1

N (1 + 1 N 1) 2/N. Since each term is non-negative, we can add in the i = j terms, 1 N(N 1)

i =j 3 θuθ(X(i), X(j))

2 3 θsqθ(X(i)) sqθ(X(j)) + 6 2 θsqθ(X(i)) θsqθ(X(j)) + 2 3 θsqθ(X(i))

i 3 θsqθ(X(i)) 1

j sqθ(X(j)) (99)

i 2 θsqθ(X(i)) 1

j θsqθ(X(j))

i 3 θsqθ(X(i)) .

By assumption, 1

N P i 2 θsqθ(X(i)) : N N, θ E is bounded with probability 1, and similarly for 1

N P i 3 θsqθ(X(i)) : N N, θ E . We show the same for 1

N P i sqθ(X(i)) and 1 N P i θsqθ(X(i)) . By Equation 40, we have Z sup θ E sqθ(x) P0(dx) Z g0, E(x)P0(dx) < .

Bayesian Data Selection

Hence, by Theorem 1.3.3 of Ghosh and Ramamoorthi (2003), 1 N P i sqθ(X(i)) converges uniformly on E, almost surely. In particular, 1

N P i sqθ(X(i)) is uniformly bounded on E, almost surely. The same argument holds for 1 N P i θsqθ(X(i)) using g1, E(x). Therefore, by Equation 99, it follows that 1 N(N 1) P i =j 3 θuθ(X(i), X(j)) is uniformly bounded on

E. Since k is positive by assumption, 1 N(N 1) P i =j k(X(i), X(j)) > 0 for all N 2 and by

Equations 94 and 95, 1 N(N 1) P i =j k(X(i), X(j)) converges a.s. to a ﬁnite quantity greater than 0. We conclude that almost surely,

f N (θ) = 1

1 N(N 1) P i =j 3 θuθ(X(i), X(j))

1 N(N 1) P i =j k(X(i), X(j))

is uniformly bounded on E, for N {2, 3, . . .}. Recall that for N = 1, f N(θ) = 0 by deﬁnition. Therefore, almost surely, (f N : N N) is uniformly bounded on E.

Proof of Theorem 9 We show that the conditions of Theorem 3.2 of Miller (2021) are met, from which the conclusions of this theorem follow immediately. By Condition 10 and Equation 35, f N has continuous third-order partial derivatives on Θ. Let E be the set from Condition 10. With probability 1, f N f uniformly on E (by Proposition 21 with C = E) and (f N ) is uniformly bounded on E (by Proposition 22). Note that f is ﬁnite on Θ by Proposition 3. Thus, by Theorem 3.4 of Miller (2021), f and f exist on E and f N f uniformly on E with probability 1. Since θ is a minimizer of f and θ E, we know that f (θ ) = 0 and f (θ ) is positive semideﬁnite; thus, f (θ ) is positive deﬁnite since it is invertible by assumption. Case (a): Now, consider the case where Θ is compact. Then almost surely, f N f uniformly on Θ by Proposition 21 with C = Θ. Since θ is a unique minimizer of f, we have f(θ) > f(θ ) for all θ Θ \ {θ }. Let H E be an open set such that θ H and H E. We show that lim inf N infθ Θ\ H f N(θ) > f(θ ). Since Θ \ H is compact,

inf θ Θ\ H f(θ) f(θ ) =: ϵ > 0.

By uniform convergence, with probability 1, there exists N such that for all N > N, supθ Θ |f N (θ) f(θ)| ϵ/2, and thus,

inf θ Θ\ H f N (θ) inf θ Θ\ H f(θ) ϵ/2 = f(θ ) + ϵ/2.

Hence, lim inf N infθ Θ\ H f N(θ) > f(θ ) almost surely. Applying Theorem 3.2 of Miller (2021), the conclusion of the theorem follows. Note that f N(θN) f (θ ) a.s. since θN θ and f N f uniformly on E. Case (b): Alternatively, consider the case where Θ is open and f N is convex on Θ. For all θ Θ, with probability 1, f N(θ) f(θ) (by Proposition 21 with C = {θ}). However, we need to show that with probability 1, for all θ Θ, f N(θ) f(θ). We follow the argument in the proof of Theorem 6.3 of Miller (2021). Let W be a countable dense subset of Θ. Since W is countable, with probability 1, for all θ W, f N(θ) f(θ). Since f N is convex, then with probability 1, for all θ Θ, the limit f(θ) := lim N f N(θ) exists and is ﬁnite,

Weinstein and Miller

and f is convex (Theorem 10.8 of Rockafellar, 1970). Since f N is convex and f(θ) is ﬁnite, f(θ) is also convex. Since f and f are convex, they are also continuous (Theorem 10.1 of Rockafellar, 1970). Continuous functions that agree on a dense subset of points must be equal. Thus, with probability 1, for all θ Θ, f N(θ) f(θ). Applying Theorem 3.2 of Miller (2021), the conclusion of the theorem follows.

Proof of Theorem 11 Our proof builds on Appendix D.3 of Barp et al. (2019), which establishes a central limit theorem for the ksd when the model is an exponential family. The outline of the proof is as follows. First, we establish bounds on sqθ and its derivatives, using the assumed bounds on xt(x) and x log λ(x). Second, we establish that f (θ) is positive deﬁnite and independent of θ, and that f N(θ) converges to it almost surely; from this, we conclude that f (θ ) is invertible and f N(θ) is convex. These results rely on the convergence properties of U-statistics and on Sylvester s criterion. The assumption that log λ(x) is continuously diﬀerentiable on X implies that λ(x) > 0 for x X. Since qθ(x) = λ(x) exp(θ t(x) κ(θ)), we have

sqθ(x) = x log λ(x) + ( xt(x)) θ

θsqθ(x) = ( xt(x)) Rd m

2 θsqθ(x) = 0 Rd m m

where ( xt(x))ij = ti/ xj. Thus, sqθ(x) has continuous third-order partial derivatives with respect to θ, and Equations 41 and 42 are trivially satisﬁed. Equation 40 holds for all compact C Θ since x log λ(x) and xt(x) are continuous functions in L1(P0) and

sqθ(x) = x log λ(x) + ( xt(x)) θ x log λ(x) + xt(x) θ ,

θsqθ(x) = xt(x) .

Hence, Condition 10 holds. By Equation 36 and Proposition 3,

T nksd(p0(x) q(x|θ)) = 1 TK

X uθ(x, y)P0(dx)P0(dy) (100)

where K := R R k(x, y)P0(dx)P0(dy). By Equation 57,

uθ(x, y) = θ B2(x, y)θ + B1(x, y) θ + B0(x, y) (101)

B2(x, y) = ( xt(x))( yt(y)) k(x, y),

B1(x, y) = ( yt(y))( x log λ(x))k(x, y) + ( xt(x))( y log λ(y))k(x, y)

+ ( yt(y))( xk(x, y)) + ( xt(x))( yk(x, y)),

B0(x, y) = ( x log λ(x)) ( y log λ(y))k(x, y) + ( y log λ(y)) ( xk(x, y))

+ ( x log λ(x)) ( yk(x, y)) + trace( x y k(x, y)).

By Condition 7, |k(x, y)|, xk(x, y) , and x y k(x, y) are bounded by a constant, say, B < . Thus, it is straightforward to check that B2, B1, and B0 belong to L1(P0 P0) since

Bayesian Data Selection

xt(x) and x log λ(x) are in L1(P0). Further, 0 < K < since 0 < k(x, y) B < by assumption. Thus,

f(θ) = 1 TK

Z Z θ B2(x, y)θ + B1(x, y) θ + B0(x, y) P0(dx)P0(dy) R.

Since k is symmetric, B2(x, y) = B2(y, x). Hence, θ(θ B2(x, y)θ) = (B2(x, y) + B2(y, x))θ, so by Fubini s theorem,

f (θ) = 1 TK

Z Z 2B2(x, y)θ + B1(x, y) P0(dx)P0(dy) Rm,

f (θ) = 2 TK

Z Z B2(x, y)P0(dx)P0(dy) Rm m.

Here, diﬀerentiating under the integral sign is justiﬁed simply by linearity of the expectation. Note that f (θ) is a symmetric matrix since B2(x, y) = B2(y, x). Next, to show f (θ) is positive deﬁnite, let v Rm \ {0}. By assumption, the rows of xt(x) are linearly independent with positive probability under P0. Thus, there is a set E X such that P0(E) > 0 and ( xt(x)) v = 0 for all x E. Deﬁne g(x) = ( xt(x)) v p0(x) Rd. Then R

X |gi(x)|dx > 0 for at least one i, and R

X |gi(x)|dx v R

X xt(x) p0(x)dx < for all i. Thus,

v f (θ)v = 2 TK

Z Z g(x) g(y)k(x, y)dxdy = 2 TK

Z Z gi(x)gi(y)k(x, y)dxdy > 0

since k is integrally strictly positive deﬁnite. Therefore, f (θ) is positive deﬁnite. In particular, f (θ ) is invertible. Finally, we show that with probability 1, for all N suﬃciently large, f N(θ) is convex. By Equations 35 and 101,

P i =j θ B2(X(i), X(j))θ + B1(X(i), X(j)) θ + B0(X(i), X(j))

P i =j k(X(i), X(j)) .

P i =j B2(X(i), X(j)) P i =j k(X(i), X(j)) .

By the strong law of large numbers for U-statistics (Theorem 5.4A of Serﬂing, 2009), we have f N(θ) f (θ) almost surely, since R

X B2(x, y) P0(dx)P0(dy) < and 0 < K < . For a symmetric matrix A, let λ (A) denote the smallest eigenvalue. Since λ (A) is a continuous function of the entries of A, we have λ (f N(θ)) λ (f (θ)) a.s. as N . Thus, with probability 1, for all N suﬃciently large, f N(θ) is positive deﬁnite, and hence, f N is convex. Further, for such N, since f N is a quadratic function with positive deﬁnite Hessian, we have MN := infθ Θ f N(θ) > and z N = R

Θ exp( Nf N(θ))π(θ)dθ exp( NMN) < .

Weinstein and Miller

C.3 Proof of Theorem 12

To establish Theorem 12, we use the properties of U-statistics described in Chapter 5.5 of Serﬂing (2009). When the data distribution matches the model distribution, [ nksd converges more quickly than when it does not match; this same property was used by Liu et al. (2016) to develop a goodness-of-ﬁt test based on the ksd. Proof We ﬁrst study the asymptotics of f N(θ ). Denoting θ θ=θ uθ by θuθ for brevity,

f N(θ ) = 1

1 N(N 1) P i =j θuθ (X(i), X(j))

1 N(N 1) P i =j k(X(i), X(j)) .

The denominator converges a.s. to a ﬁnite positive constant, as in the proof of Proposition 21. It is straightforward to verify that EX,Y P0[ θuθ (X, Y ) 2] < since sqθ and θ θ=θ sqθ are in L2(P0) by assumption. By Theorems 5.5.1A and 5.5.2 of Serﬂing (2009),

i =j θuθ (X(i), X(j)) EX,Y P0[ θuθ (X, Y )] = OP0(N 1/2).

Further, by the Leibniz integral rule (Folland, 1999, Theorem 2.27),

EX,Y P0[ θuθ (X, Y )] = θ θ=θ EX,Y P0[uθ(X, Y )] = T EX,Y P0[k(X, Y )]f (θ ) = 0,

using the fact that f (θ ) = 0 since θ is a minimizer of f. Thus,

f N(θ ) = OP0(N 1/2). (102)

Next, we examine the convergence of θN to θ . For all N suﬃciently large, f N(θN) = 0 by Theorem 9 (part 1), and thus, by Taylor s theorem,

0 = f N(θN) = f N(θ ) + f N(θ+ N)(θN θ ),

where θ+ N is on the line between θN and θ . As in the proof of Theorem 9, f N f

uniformly on the set E deﬁned in Condition 10. Thus, since f N is continuous on E and θ+ N θ , f N(θ+ N) a.s. N f (θ ). (103)

In particular, f N(θ+ N) is invertible for all N suﬃciently large, since f (θ ) is invertible by assumption. Hence, θN θ = f N(θ+ N) 1f N(θ ), (104)

and therefore, by Equation 102,

θN θ f N(θ+ N) 1 f N(θ ) = OP0(N 1/2). (105)

This result matches Theorem 4 in Barp et al. (2019). By Taylor s theorem,

f N(θ ) f N(θN) = f N(θN) (θ θN) + 1

2(θ θN) f N(θ++ N )(θ θN)

Bayesian Data Selection

2(θ θN) f N(θ++ N )(θ θN)

for all N suﬃciently large, where θ++ N is on the line between θN and θ . Therefore, using the same reasoning as for Equations 103 and 105,

|f N(θ ) f N(θN)| 1

2 f N(θ++ N ) θ θN 2 = OP0(N 1). (106)

This proves the ﬁrst part of the theorem (Equation 43). Next, consider f N(θ ) f(θ ). Recall that

f N(θ ) = 1

1 N(N 1) P i =j uθ (X(i), X(j))

1 N(N 1) P i =j k(X(i), X(j)) .

It is straightforward to verify that EX,Y P0[|uθ (X, Y )|2] < since sqθ is in L2(P0). By Theorems 5.5.1A and 5.5.2 of Serﬂing (2009), 1 N(N 1)

i =j uθ (X(i), X(j)) EX,Y P0[uθ (X, Y )] = OP0(N 1/2).

Similarly, since k is bounded, 1 N(N 1)

i =j k(X(i), X(j)) EX,Y P0[k(X, Y )] = OP0(N 1/2).

It is straightforward to check that the second part of the theorem (Equation 44) follows. For the third part, our argument follows that of the proof of Theorem 4.1 of Liu et al. (2016). Suppose nksd(p0(x) q(x|θ )) = 0, and note that P0(x) = Qθ (x) by Proposition 4. Given a diﬀerentiable function g : Rd Rd, deﬁne x g(x) := Pd i=1 gi(x)/ xi. Then

EX P0[uθ (X, y)] = sp0(y) Z

( xp0(x))k(x, y) + p0(x)( xk(x, y)) dx

( xp0(x)) yk(x, y) + p0(x)( x yk(x, y)) dx

X x p0(x)k(x, y) dx + Z

X x y(p0(x)k(x, y))dx. (107)

The ﬁrst term on the right-hand side of Equation 107 is zero since, by assumption, k is in the Stein class of P0 (Condition 2). The second term is also zero since, by the Leibniz integral rule (Folland, 1999, Theorem 2.27), R y x(p0(x)k(x, y))dx = y R x(p0(x)k(x, y))dx, which again equals zero because k is in the Stein class of P0. Therefore, EX P0[uθ (X, y)] = 0 for all y X, and in particular, the variance of this expression is also zero: VY P0[EX P0[uθ (X, Y )]] = 0. By Theorem 5.5.2 of Serﬂing (2009), it follows that 1 N(N 1)

i =j uθ (X(i), X(j)) = OP0(N 1) (108)

since EX,Y P0[uθ (X, Y )] = 0. Although Serﬂing (2009) requires VX,Y P0[uθ (X, Y )] > 0, Equation 108 holds trivially if VX,Y P0[uθ (X, Y )] = 0. As before, since the denominator of f N(θ ) converges a.s. to a ﬁnite positive constant, we have that f N(θ ) = OP0(N 1). Equation 45 follows since f(θ ) = 0 when nksd(p0(x) q(x|θ )) = 0.

Weinstein and Miller

C.4 Proof of Theorem 17

Proof Applying Theorem 9 (part 3) to each foreground model j {1, 2}, we have

log zj,N + Nfj,N(θj,N) log π(θj, ) + log | det f j (θj, )|1/2 1

2m Fj,j log(2π/N) a.s. N 0.

Since Kj,N = (2π/N)m Bj /2zj,N, this implies

log Kj,N + Nfj,N(θj,N) 1

2(m Fj,j + m Bj) log(2π/N) + Cj a.s. N 0

where Cj is a constant that does not depend on N. Hence,

K2,N + N(f1,N(θ1,N) f2,N(θ2,N))

2(m F1,1 + m B1 m F2,2 m B2) log(2π/N) + C1 C2 a.s. N 0. (109)

By Theorem 12, fj,N(θj,N) P0 fj(θj, ), and therefore,

1 N log K1,N

K2,N + f1(θ1, ) f2(θ2, ) P0 N 0.

Plugging in the deﬁnition of fj (Equation 36), this proves part 1 of the theorem. For part 2, suppose f1(θ1, ) = f2(θ2, ) = 0 and m B2 m B1 does not depend on N. Then by Theorem 12, fj,N(θj,N) = OP0(N 1). Using this in Equation 109, we have

1 log N log K1,N

2(m F1,1 + m B1 m F2,2 m B2) P0 N 0. (110)

For part 3, suppose f1(θ1, ) = f2(θ2, ) and m Bj = c Bj

N. Then by Theorem 12, fj,N(θj,N) = fj(θj, ) + OP0(N 1/2). Using this in Equation 109, we have

N log N log K1,N

2(c B1 c B2) P0 N 0. (111)

Appendix D. Additional probabilistic PCA details

D.1 Optimizing the NKSD

Computing the Laplace or BIC approximation to the SVC requires ﬁnding the minimizer of [ nksd(p0(x) q(x|θ)) with respect to θ. In this section, we describe how components of the nksd can be pre-computed to speed up this optimization process. The generative model used for p PCA can be rewritten using the properties of multivariate normal distributions as X N(0, HH + v Id). (112)

Bayesian Data Selection

The Stein score function for the p PCA model is then

sqθ(x) = x log q(x|H, v) = (HH + v Id) 1x.

Deﬁne the matrices

Kij := I(i = j) k(X(i), X(j)),

i=1 I(i = j) k

xb (X(i), X(j)),

where I(E) is the indicator function, which equals 1 when E is true and is 0 otherwise. Deﬁne the scalars

b=1 I(i = j) 2k xb yb (X(i), X(j)).

Letting X RN d be the data matrix, the NKSD can be written as

[ nksd(p0(x) q(x|H, v)) = 1

trace(X KX(HH + v Id) 1(HH + v Id) 1)

2 trace(X K(HH + v Id) 1) + K ,

where we have used the fact that the kernel is symmetric. The terms X KX and X K are the only ones that include sums over the entire data set; these can be pre-computed, before optimizing the parameters H and v. To compute the matrix inversion (HH +v Id) 1 we follow the strategy of Minka (2001),

(HH + v Id) 1 v 1Id = (HH + v Id) 1(Id v 1(HH + v Id))

= (HH + v Id) 1HH v 1

= (U(L v Ik)U + v Id) 1U(L v Ik)U v 1.

Thus, applying the Woodbury matrix identity and using Id U = U = UIk Ik = UIk U U,

(HH + v Id) 1 v 1Id = v 1Id v 2U (L v Ik) 1 + v 1 1U U(L v Ik)U v 1

= U[v 1Ik v 2((L v) 1 + v 1) 1](L v Ik)U v 1

= UL 1(L v Ik)U v 1

= U(L 1 v 1Ik)U .

Therefore, (HH + v Id) 1 = U(L 1 v 1Ik)U + v 1Id.

Weinstein and Miller

Computing L 1 is trivial since the matrix is diagonal. Returning to the nksd we have

[ nksd(p0(x) q(x|U, L, v))

trace X KX[U(L 1 v 1Ik)2U + 2v 1U(L 1 v 1Ik)U + v 2Id]

2 trace X K[U(L 1 v 1Ik)U + v 1Id] + K

trace U X KXU(L 1 v 1Ik)2

+ trace U [2v 1X KX 2X K]U(L 1 v 1Ik)

+ v 1 trace v 1X KX 2X K + K .

We optimized U, L and v using the trust region method implemented in pymanopt (Townsend et al., 2016).

D.2 Data selection with the SVC

We used the approximate optimum technique in Section 2.3.3 to estimate the SVC for diﬀerent foreground subspaces. Following Section A.2, we used the factored IMQ kernel with β = 0.5 and c = 1. We focused on foreground subspaces that correspond to subsets of the data dimensions. More speciﬁcally, recall that XF = V X; then, we impose the restriction that each column of V is a standard basis vector e(b) Rd, where e(b) b = 1 and e(b) b = 0 for b = b. A subspace XF is then characterized by the set of included dimensions SF {1, . . . , d}. The marginal distribution of the model q(x F|H, v) is now straightforward to compute based on Equation 112 and the properties of multivariate normals:

XF N(0, HSFH SF + v I|SF|)

where HSF is the submatrix consisting of rows of H indexed by SF, and |SF| is the size of the set SF. In the projected model, some of the parameters are nuisance variables with no contribution to the likelihood. Since the dimension of a d k matrix on the Stiefel manifold is dk k(k + 1)/2, the total dimension of the foreground model (including contributions from parameters U, L and v) is m F = |SF|k k(k + 1)/2 + k + 1, assuming |SF| k. Code is available at https://github.com/EWeinstein/data-selection.

D.3 Calibration

The T hyperparameter was calibrated as in Section A.1. In detail, we sampled 10 independent true parameter values from the prior, with α = 1 and d = 6. (We used a slightly less disperse prior than during inference, where we set α = 0.1, to avoid numerical instabilities in the ˆT estimate.) Then, for each of the true parameter values, we simulated N = 2000 datapoints. For each simulated true parameter value, we tracked the trend in the ˆT estimator (Equation 55) with increasing N (Figure 11). The median estimated T value at N = 2000 was 0.052 across the 10 runs.

Bayesian Data Selection

Figure 11: Estimated T for increasing number of data samples, for 10 independent parameter samples from the prior. The median value at N = 2000 is ˆT = 0.052.

D.4 P olya tree model

In this section, we describe the P olya tree model (Ferguson, 1974; Mauldin et al., 1992; Lavine, 1992) following the construction of Berger and Guglielmi (2001). Let ϵn := (ϵ1, . . . , ϵn) denote a vector of length n, where each ϵj {0, 1}. Each ϵn vector indexes an interval in R, given by

B ϵn := F 1 Pn j=1 ϵj/2j , F 1 Pn j=1 ϵj/2j + 1/2n i ,

where F 1 is the inverse c.d.f. of some probability distribution. For all n {0, 1, 2, . . .} and all ϵn {0, 1}n, let Y ϵn Beta(ξ ϵn0, ξ ϵn1),

where the ξ s are hyperparameters. We say that a random variable X R is distributed according to a P olya tree model if

P(X B ϵn) =

j=1 (Y ϵj 1)I(ϵj=0)(1 Y ϵj 1)I(ϵj=1),

where I(E) is the indicator function, which equals 1 when E is true and is 0 otherwise. We follow Berger and Guglielmi (2001) and use

µ(B ϵn) := F F 1 Pn j=1 ϵj/2j + 1/2n F F 1 Pn j=1 ϵj/2j ,

ρ( ϵn) := 1

f( F 1(Pn j=1 ϵj/2j + 1/2n+1))

ξ ϵn0 := ρ( ϵn)

µ(B ϵn0) µ(B ϵn1),

ξ ϵn1 := ρ( ϵn)

µ(B ϵn1) µ(B ϵn0),

Weinstein and Miller

where F and f are the c.d.f. and p.d.f. respectively of some probability distribution, and η > 0 is a scale hyperparameter. We denote this complete model as X Polya Tree(F, F, η).

D.5 Data sets and preprocessing

We downloaded two publicly available data sets. The ﬁrst data set was from human peripheral blood mononuclear cells (PBMCs), available at: https://support. 10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k. This is a standard data set used in the tutorials for Seurat (Stuart et al., 2019) and Scanpy (Wolf et al., 2018), for example. The second was taken from a dissociated extranodal marginal zone B-cell tumor, speciﬁcally a mucosa-associated lymphoid tissue (MALT) tumor: https://support.10xgenomics.com/single-cell-gene-expression/ datasets/3.0.0/malt_10k_protein_v3. We pre-processed the data using Scprep (Gigante et al., 2020), following its example: we normalized the total expression of each cell to match the median total expression in the data set, to account for variability in library size, and then square-root transformed the resulting normalized counts.

Appendix E. Additional glass model details

E.1 Glass model inference

We place a standard normal prior on each entry of Hj and a Laplace prior on each entry of Jjj with scale 0.1 to encourage sparsity. To enforce that µ 0 (since sc RNAseq counts are nonnegative) and τ > 0, we place priors on a transformed version of these parameters, as follows:

µ = log(1 + exp( µ))

τ = log(1 + exp( τ)) + 1.

For posterior inference, we employ a mean-ﬁeld variational approximation: independent normal distributions for the entries of Hj, normal distributions for µ and τ, and Laplace distributions for each entry of Jjj . We use the factored IMQ kernel for the NKSD, with β = 0.5 and c = 1. To optimize the variational approximation (Equation 14), we construct stochastic estimates of its gradient. At each optimization step, the expectation Erζ [ nksd(p0(x F) q(x F|θ))

is estimated using a minibatch of 200 randomly selected datapoints and a single sample from the variational approximation rζ. The rest of the variational inference algorithm follows standard practice in stochastic variational inference, as implemented in Pyro: automatic diﬀerentiation to compute gradients, reparameterization estimators for Monte Carlo expectations over the variational distribution, and the Adam optimizer (Kingma and Ba, 2015; Bingham et al., 2019). We also used stochastic optimization to perform data selection, as follows. Let I = (I1, . . . , Id) be an indicator variable that speciﬁes for each gene j whether it is included in the foreground subspace (Ij = 1) or not (Ij = 0). We place a distribution on I such that

Bayesian Data Selection

0 2500 5000 7500 10000 12500 15000 17500 interaction rank

Figure 12: Posterior mean interaction energies Ejj for all selected genes, sorted. Dotted lines show the thresholds for strong interactions (set by visual inspection).

Ij Bernoulli(1/(1 + exp( φj))) for j = 1, . . . , d independently. Then, to perform data selection over all possible subsets of genes, we optimize

argmaxφ E(K(I) | φ) (113)

where the expectation is taken with respect to I, where K(I) is the (estimated) SVC when genes with Ij = 1 are included in the foreground space, and φ = (φ1, . . . , φd) Rd is a vector of log-odds. This stochastic approach to discrete optimization has been used extensively in reinforcement learning and related ﬁelds. We use the Leave-One-Out REINFORCE (LOORF) estimator as described in Section 2.1 of Dimitriev and Zhou (2021) to estimate gradients of φ, using 8 samples per step. We interleave updates to the variational approximation and to φ, using the Adam optimizer with step size 0.01 for each. We ran the procedure with 4 random initial seeds, taking the result with the largest ﬁnal estimated SVC. We halt optimization using the stopping rule proposed in Grathwohl et al. (2020), stopping when the estimated mean minus the estimated variance of the SVC begins to decrease, based on the average over 2000 steps. Code is available at https://github.com/EWeinstein/data-selection.

E.2 Data sets and preprocessing

In addition to the two data sets in D.5, we also explored a data set of E18 mouse neurons: https://support.10xgenomics.com/single-cell-gene-expression/datasets/ 3.0.0/neuron_10k_v3. We preprocessed each data set using Scprep (Gigante et al., 2020) in the same way as in Section D.5. After preprocessing, we used the top 200 most highly expressed genes from

Weinstein and Miller

CD37 HSPA1A HLA-DQA1

LY9 HLA-DRA HLA-DPB1 HLA-DQB1

EZR CD69 FOSB DNAJB1 MTRNR2L12

GAPDH HLA-DPA1

CD55 NFKBIA

JUN RPLP0 PPP1R15A

KLF2 CREM ACTG1 AC058791.1

ID3 BCL2A1 GPR183

IER2 TNFAIP3

RPL4 RPL13A

Figure 13: Posterior mean interaction energies Ejj for the glass model applied to all 200 genes in the MALT data set (rather than the selected 187). Genes shown are the same as in Figure 8, for visual comparison.

among the top 500 most variable genes, according to the Scprep variability score. We log transform the counts, that is we deﬁne xij = log(1 + cij) where cij is the expression count for gene j in cell i.

Bayesian Data Selection

mean posterior std. of Ejj

fraction of cells with zero expression

Figure 14: Comparison of the 187 selected genes and 13 excluded genes using data selection. (a) Violin plot of σj over all excluded and selected genes j, respectively, when applying the model to all 200 genes, where σj is the mean posterior standard deviation of the interaction energies Ejj for gene j, that is, σj := 1 d 1 P j =j std( Ejj | data). (b) Violin plot of fj over all excluded and selected genes j, respectively, where fj is the fraction of cells with count equal to zero for gene j. The data selection procedure excluded all genes with more than 85% zeros and selected all genes with fewer than 85% zeros.

Weinstein and Miller

Uri Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits. CRC Press, July 2019.

Andreas Anastasiou, Alessandro Barp, Fran cois-Xavier Briol, Bruno Ebner, Robert E Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, Lester Mackey, Chris J Oates, Gesine Reinert, and Yvik Swan. Stein s method meets statistics: A review of some recent developments. ar Xiv preprint ar Xiv:2105.03481, May 2021.

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485 516, 2008.

Alessandro Barp, Francois-Xavier Briol, Andrew B Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. ar Xiv preprint ar Xiv:1906.08283, June 2019.

Andrew R Barron. Uniformly powerful goodness of ﬁt tests. The Annals of Statistics, 17 (1):107 124, 1989.

Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeﬀrey Mark Siskind. Automatic diﬀerentiation in machine learning: A survey. Journal of Machine Learning Research, 18(153), 2018.

James O Berger and Alessandra Guglielmi. Bayesian and conditional frequentist testing of a parametric model versus nonparametric alternatives. Journal of the American Statistical Association, 96(453):174 184, 2001.

Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 20(28):1 6, 2019.

Pier G Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. J. R. Stat. Soc. Series B Stat. Methodol., 78(5):1103 1130, November 2016.

David M Blei. Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1(1):203 232, 2014.

Ariel Caticha. Relative entropy and inductive inference. AIP Conference Proceedings, 707 (1):75 96, 2004.

Ariel Caticha. Entropic inference. AIP Conference Proceedings, 1305(1):20 29, 2011.

Haifen Chen, Jing Guo, Shital K Mishra, Paul Robson, Mahesan Niranjan, and Jie Zheng. Single-cell transcriptional analysis to uncover regulatory circuits driving cell fate decisions in early mouse development. Bioinformatics, 31(7):1060 1066, April 2015.

Bayesian Data Selection

Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of ﬁt. In International Conference on Machine Learning (ICML), pages 2606 2615, 2016.

A Philip Dawid. Posterior model probabilities. In Prasanta S Bandyopadhyay and Malcolm R Forster, editors, Philosophy of Statistics, volume 7, pages 607 630. North-Holland, Amsterdam, January 2011.

Charlotte M de Winde, Sharon Veenbergen, Ken H Young, Zijun Y Xu-Monette, Xiao-Xiao Wang, Yi Xia, Kausar J Jabbar, Michiel van den Brand, Alie van der Schaaf, Suraya Elfrink, Inge S van Houdt, Marion J Gijbels, Fons A J van de Loo, Miranda B Bennink, Konnie M Hebeda, Patricia J T A Groenen, J Han van Krieken, Carl G Figdor, and Annemiek B van Spriel. Tetraspanin CD37 protects against the development of B cell lymphoma. The Journal of Clinical Investigation, 126(2):653 666, February 2016.

Alek Dimitriev and Mingyuan Zhou. ARMS: Antithetic-REINFORCE-Multi-Sample gradient for binary variables. In International Conference on Machine Learning (ICML), 2021.

Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(2):185 205, April 2005.

Kjell A Doksum and Albert Y Lo. Consistent and robust Bayes procedures for location based on partial information. The Annals of Statistics, 18(1):443 453, 1990.

David Duvenaud, Daniel Eaton, Kevin Murphy, and Mark Schmidt. Causal learning without DAGs. In Neur IPS workshop on causality, 2008.

Thomas S Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, 2(4):615 629, July 1974.

Gerald B Folland. Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, 1999.

Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303 (5659):799 805, February 2004.

Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe er. Using bayesian networks to analyze expression data. J. Comput. Biol., 7(3-4):601 620, 2000.

Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.

J K Ghosh and R V Ramamoorthi. Bayesian Nonparametrics. Series in Statistics. Springer, 2003.

Scott Gigante, Daniel Burkhardt, Daniel Dager, Jay Stanley, and Alexander Tong. scprep. https://github.com/Krishnaswamy Lab/scprep, 2020.

Weinstein and Miller

Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A Swiss Army inﬁnitesimal jackknife. In International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pages 1139 1147. PMLR, 2019.

Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), pages 1292 1301, Sydney, NSW, Australia, 2017.

Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik Jacobsen, David Duvenaud, and Richard Zemel. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In International Conference on Machine Learning (ICML), 2020.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723 773, 2012.

L aszl o Gy orﬁand Edward C Van Der Meulen. A consistent goodness of ﬁt test based on the total variation distance. In George Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 631 645. Springer Netherlands, Dordrecht, 1991.

Stephanie C Hicks, F William Townes, Mingxiang Teng, and Rafael A Irizarry. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 19(4):562 578, October 2018.

Matthew D Hoﬀman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303 1347, 2013.

Han Hong and Bruce Preston. Nonnested model selection criteria. 2005.

Peter J Huber. Projection pursuit. The Annals of Statistics, 13(2):435 475, 1985.

Jonathan H Huggins and Lester Mackey. Random feature stein discrepancies. In Advances in Neural Information Processing Systems (Neur IPS), 2018.

Jonathan H Huggins and Jeﬀrey W Miller. Reproducible model selection using bagged posteriors. ar Xiv preprint ar Xiv:2007.14845, 2021.

Vˆan Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using tree-based methods. PLo S One, 5(9), September 2010.

Pierre E Jacob, Lawrence M Murray, Chris C Holmes, and Christian P Robert. Better together? Statistical learning in models made of modules. ar Xiv preprint ar Xiv:1708.08719, 2017.

Jack Jewson, Jim Q Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.

Wenxin Jiang and Martin A Tanner. Gibbs posterior for variable selection in highdimensional classiﬁcation and data mining. The Annals of Statistics, 36(5):2207 2231, 2008.

Bayesian Data Selection

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), April 2014.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on Bayes rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1 109, 2022.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via inﬂuence functions. In International Conference on Machine Learning (ICML), 2017.

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free. In ICLR Workshop: Deep Reinforcement Learning Meets Structured Prediction, 2019.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic diﬀerentiation variational inference. Journal of Machine Learning Research, 18:1 45, January 2017.

Michael Lavine. Some aspects of Polya tree distributions for statistical modelling. The Annals of Statistics, 20(3):1222 1235, 1992.

John R Lewis, Steven N Mac Eachern, and Yoonkyung Lee. Bayesian restricted likelihood methods: Conditioning on insuﬃcient statistics in Bayesian regression. Bayesian Analysis, 1(1):1 38, 2021.

Bruce G Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221 239, 1988.

Han Liu, John Laﬀerty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295 2328, 2009.

Qiang Liu, Jason D Lee, and Michael Jordan. A kernelized Stein discrepancy for goodnessof-ﬁt tests. In International Conference on Machine Learning, volume 33, pages 276 284, 2016.

Takuo Matsubara, Jeremias Knoblauch, Fran cois-Xavier Briol, and Chris J. Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(3):997 1022, 2022.

Hirotaka Matsumoto, Hisanori Kiryu, Chikara Furusawa, Minoru S H Ko, Shigeru B H Ko, Norio Gouda, Tetsutaro Hayashi, and Itoshi Nikaido. SCODE: an eﬃcient regulatory network inference algorithm from single-cell RNA-Seq during diﬀerentiation. Bioinformatics, 33(15):2314 2321, August 2017.

R Daniel Mauldin, William D Sudderth, and S C Williams. Polya trees and random distributions. The Annals of Statistics, 20(3):1203 1221, 1992.

Weinstein and Miller

Jeﬀrey W Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(168):1 53, 2021.

Jeﬀrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113 1125, 2019.

Thomas Minka. Old and new matrix algebra useful for statistics, 2000.

Thomas P Minka. Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems (Neur IPS), pages 598 604, 2001.

Victoria Moignard, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka, Adam C Wilkinson, Florian Buettner, Iain C Macaulay, Wajid Jawaid, Evangelia Diamanti, Shin-Ichi Nishikawa, Nir Piterman, Valerie Kouskoﬀ, Fabian J Theis, Jasmin Fisher, and Berthold G ottgens. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nature Biotechnology, 33(3): 269 276, March 2015.

Emma Pierson and Christopher Yau. ZIFA: Dimensionality reduction for zero-inﬂated single-cell gene expression analysis. Genome Biology, 16:241, November 2015.

Jim Pitman. Combinatorial stochastic processes. Technical Report 621, Dept of Statistics, UC Berkeley, 2002.

Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods, 14(10):979, 2017.

Danilo Jimenez Rezende. Short notes on divergence measures. July 2018.

R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

Robert J Serﬂing. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, September 2009.

Alex K Shalek, Rahul Satija, Xian Adiconis, Rona S Gertner, Jellert T Gaublomme, Raktima Raychowdhury, Schraga Schwartz, Nir Yosef, Christine Malboeuf, Diana Lu, John J Trombetta, Dave Gennert, Andreas Gnirke, Alon Goren, Nir Hacohen, Joshua Z Levin, Hongkun Park, and Aviv Regev. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature, 498(7453):236 240, June 2013.

Stephane Shao, Pierre E Jacob, Jie Ding, and Vahid Tarokh. Bayesian model comparison with the Hyv arinen score: Computation and consistency. Journal of the American Statistical Association, pages 1 24, September 2018.

Zakary S Singer, John Yong, Julia Tischler, Jamie A Hackett, Alphan Altinok, M Azim Surani, Long Cai, and Michael B Elowitz. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Molecular Cell, 55(2):319 331, July 2014.

Bayesian Data Selection

Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch olkopf, and Gert R G Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517 1561, 2010.

Bharath K Sriperumbudur, Kenji Fukumizu, and Gert R G Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389 2410, 2011.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, September 2008.

Tim Stuart, Andrew Butler, Paul Hoﬀman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, 3rd, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888 1902.e21, June 2019.

James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic diﬀerentiation. Journal of Machine Learning Research, 17(1):4755 4759, 2016.

David van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J Carr, Cassandra Burdziak, Kevin R Moon, Christine L Chaﬀer, Diwakar Pattabiraman, Brian Bierie, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, and Dana Pe er. Recovering gene interactions from single-cell data using data diﬀusion. Cell, 174(3):716 729.e27, July 2018.

Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Statistica Sinica, 21(1):5 42, January 2011.

Isabella Verdinelli and Larry Wasserman. Bayesian goodness-of-ﬁt testing using inﬁnitedimensional exponential families. The Annals of Statistics, 26(4):1215 1241, August 1998.

Quang H Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, 57(2):307 333, 1989.

F Alexander Wolf, Philipp Angerer, and Fabian J Theis. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1):15, February 2018.

Zijun Y Xu-Monette, Ling Li, John C Byrd, Kausar J Jabbar, Ganiraju C Manyam, Charlotte Maria de Winde, Michiel van den Brand, Alexandar Tzankov, Carlo Visco, Jing Wang, Karen Dybkaer, April Chiu, Attilio Orazi, Youli Zu, Govind Bhagat, Kristy L Richards, Eric D Hsi, William W L Choi, Jooryung Huh, Maurilio Ponzoni, Andr es J M Ferreri, Michael B Møller, Ben M Parsons, Jane N Winter, Michael Wang, Frederick B Hagemeister, Miguel A Piris, J Han van Krieken, L Jeﬀrey Medeiros, Yong Li, Annemiek B van Spriel, and Ken H Young. Assessment of CD37 B-cell antigen and cell of origin signiﬁcantly improves risk prediction in diﬀuse large B-cell lymphoma. Blood, 128 (26):3083 3100, December 2016.

Daniel Yekutieli. Adjusted Bayesian inference for selected parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):515 541, 2012.

Weinstein and Miller

In-Kwon Yeo and Richard A Johnson. A uniform strong law of large numbers for U-statistics with application to transforming to near symmetry. Statistics & Probability Letters, 51 (1):63 69, 2001.

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307 1321, 2006a.

Tong Zhang. From ϵ-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180 2210, 2006b.