Journal of Machine Learning Research 25 (2024) 1-35    Submitted 4/23; Revised 2/24; Published 3/24

Data Thinning for Convolution-Closed Distributions

Anna Neufeld (aneufeld@fredhutch.org), Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
Ameer Dharamshi (adharams@uw.edu), Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
Lucy L. Gao (lucy.gao@ubc.ca), Department of Statistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
Daniela Witten (dwitten@uw.edu), Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA

Editor: Jie Peng

Abstract

We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.

Keywords: cross validation, sample splitting, clustering, principal component analysis

1. Introduction

As scientists fit increasingly complex models to their data, there is an ever-growing need for out-of-the-box methods that can be used to validate these models. In many settings, the most natural option is sample splitting, in which the n observations in a data set are split into a training set, used to fit a model, and a test set, used to validate it (Hastie et al., 2009). Sample splitting can also be applied to conduct inference after model selection (Rinaldo et al., 2019). Sample splitting is flexible and intuitive, and is a vital tool for any practicing data analyst.

However, in settings where there is one parameter of interest per observation, or the parameter of interest is a function of the n observations, sample splitting cannot be applied. For example, when estimating a low-rank approximation to a matrix, there is one parameter of interest (a latent variable coordinate) for each of the n rows in the matrix. Similarly, in fixed-covariate regression under model misspecification, the target parameter depends on the specific n observations included in the data set (Buja et al., 2019). Finally, there may be settings in which we wish to draw observation-specific inferences about each of our n observations; sample splitting does not allow this. In this paper, we consider an alternative to sample splitting that splits a single observation X into independent parts that follow the same distribution as X.
Crucially, the fact that we split a single observation avoids the pitfalls of sample splitting in the situations mentioned above: we will split every observation in the data to obtain a training set and a test set that each involve all n observations.

The concept of splitting a single observation as an alternative to sample splitting has been explored in several recent papers (Rasines and Young, 2022; Leiner et al., 2023; Oliveira et al., 2021, 2022; Neufeld et al., 2022). We can split X ∼ N(µ, σ²) with known σ² into two independent Gaussian random variables (Rasines and Young, 2022; Leiner et al., 2023; Oliveira et al., 2021), and X ∼ Poisson(λ) into two independent Poisson random variables (Neufeld et al., 2022; Leiner et al., 2023; Oliveira et al., 2022). However, outside of these two distributions, no proposals are available to split a random variable into independent parts that follow the same distribution as the original random variable. Leiner et al. (2023) propose data fission, a general-purpose approach to decompose X into two parts, X^(1) and X^(2), such that (i) X^(1) and X^(2) can together be used to reconstruct X, and (ii) the joint distribution of (X^(1), X^(2)) is tractable. However, outside of the special cases of the Gaussian and Poisson distributions, the proposal of Leiner et al. (2023) leads to X^(1) and X^(2) that (a) are not independent, and (b) whose distributions may or may not resemble that of X. As a consequence of both (a) and (b), unless the data are Gaussian- or Poisson-distributed, data fission does not serve as a direct alternative to sample splitting, as out-of-the-box tools can no longer be used for validation or inference. We elaborate on these points in Section 5.

In this paper, we propose data thinning, a recipe for decomposing a single observation X into two parts, X^(1) and X^(2), such that (i) X = X^(1) + X^(2), (ii) X^(1) and X^(2) are independent, and (iii) X^(1) and X^(2) follow the same distribution as X, up to a (known) scaling of a parameter. Critically, properties (ii) and (iii) guarantee that this decomposition is straightforward to use in applied settings. For instance, to evaluate the suitability of a model for X, we can fit it to X^(1) (since it follows the same distribution as X), and can validate it using X^(2) (since it also follows the same distribution, and furthermore is independent of X^(1)). Our recipe can be applied to any distribution that is convolution-closed (Joe, 1996): this includes the multivariate Gaussian, Poisson, negative binomial, gamma, binomial, and multinomial distributions, among others. Thus, our work drastically expands the set of distributions that can be split into independent parts, and provides a unified lens through which to view seemingly unrelated approaches. Furthermore, data thinning can be used to decompose X into more than two independent random variables. We illustrate our proposal with the following example, which shows that a gamma random variable can be thinned into M independent gamma random variables.

Example 1 (Gamma decomposition into M components, data thinning) Suppose that X ∼ Gamma(α, β), where β is unknown. We take (X^(1), . . . , X^(M)) = X·Z, where Z ∼ Dirichlet(α/M, . . . , α/M). Then X^(1), . . . , X^(M) are mutually independent, they sum to X, and each is marginally drawn from a Gamma(α/M, β) distribution. In other words, data thinning allows us to decompose a Gamma(α, β) random variable, for which β is unknown, into M independent gamma random variables, X^(1), . . . , X^(M). Therefore, for each m, fitting a model to X − X^(m) and validating it using X^(m) is straightforward; a simulation sketch of this decomposition is given below.
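The following is a minimal sketch of Example 1 in base R, assuming only that the shape parameter α is known. The Dirichlet draw is implemented by normalizing independent gamma draws, and the function name gamma_thin is illustrative rather than taken from any released software.

```r
# Minimal sketch of Example 1 in base R, assuming the shape parameter alpha is known.
# The Dirichlet draw is obtained by normalizing independent gamma draws; the function
# name gamma_thin is illustrative, not from any released software.
gamma_thin <- function(x, alpha, M) {
  g <- rgamma(M, shape = alpha / M, rate = 1)   # Z ~ Dirichlet(alpha/M, ..., alpha/M)
  x * (g / sum(g))                              # (X^(1), ..., X^(M)), summing to x
}

set.seed(1)
alpha <- 7; beta <- 5; M <- 4
x <- rgamma(10000, shape = alpha, rate = beta)
folds <- t(sapply(x, gamma_thin, alpha = alpha, M = M))
all.equal(rowSums(folds), x)   # each row of folds sums to the original observation
colMeans(folds)                # each approx (alpha/M)/beta = 0.35: folds are Gamma(alpha/M, beta)
```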
The rest of this paper is organized as follows. In Section 2, we briefly review the class of convolution-closed distributions, and introduce a thinning procedure to split a single random variable into two independent random variables, each of which follows the same distribution as the original random variable (up to a scaling of the parameter(s)). We extend this thinning procedure to split a single random variable into an arbitrary number of independent random variables in Section 3. We elaborate on the comparison between sample splitting and data thinning in Section 4, and in Section 5 we elaborate on the comparison between data fission (Leiner et al., 2023) and data thinning. In Section 6, we focus on validating the results of clustering and low-rank matrix approximations. These are two settings in which the usual cross-validation via sample splitting approach cannot be directly applied (see, e.g., Owen and Perry, 2009; Fu and Perry, 2020), but data thinning provides a simple alternative. An application to single-cell RNA-sequencing data is in Section 7. We close with a discussion in Section 8. All proofs are in the appendix.

2. The Data Thinning Proposal

2.1 A Review of Convolution-Closed Distributions

We begin by defining a convolution-closed distribution (Joe, 1996; Jørgensen and Song, 1998).

Definition 1 (Convolution-closed) Let F_λ denote a distribution indexed by a parameter λ in parameter space Λ. Let X′ ∼ F_λ1 and X″ ∼ F_λ2 with X′ independent of X″. If X′ + X″ ∼ F_{λ1+λ2} whenever λ1 + λ2 ∈ Λ, then F_λ is convolution-closed in the parameter λ.

Many well-known distributions are convolution-closed. While the Poisson(λ) distribution is convolution-closed in its single parameter λ and the N(µ, σ²) distribution is convolution-closed in the two-dimensional parameter (µ, σ²), other distributions, such as the gamma, are convolution-closed in just one parameter with the other parameter(s) held fixed. Table 1 provides details about some well-known convolution-closed distributions.

The following definition provides a useful property of most convolution-closed distributions.

Definition 2 (Linear expectation property) Let F_λ denote a distribution indexed by λ ∈ Λ. We say that it satisfies the linear expectation property if, for X ∼ F_λ, E[X] is a linear function of λ.

Remark 3 (Most convolution-closed distributions satisfy the linear expectation property) Let F_λ be a convolution-closed distribution whose first moment exists for all λ ∈ Λ. By definition, if X′ ∼ F_λ1 and X″ ∼ F_λ2 are independent and λ1 + λ2 ∈ Λ, then X′ + X″ ∼ F_{λ1+λ2}. By properties of the expected value, E[X′ + X″] = E[X′] + E[X″]. Thus, the expectation is additive in λ, and so, outside of contrived counterexamples, F_λ satisfies the linear expectation property. The linear expectation property is satisfied for all distributions in Table 1.

Remark 4 (Not all convolution-closed distributions satisfy the linear expectation property) The Cauchy(µ, γ) distribution is convolution-closed in the two-dimensional parameter (µ, γ), but does not satisfy the linear expectation property because E[X] does not exist.

For a convolution-closed distribution F_λ, suppose that X′ ∼ F_λ1 and X″ ∼ F_λ2 with X′ independent of X″. Let G_{λ1,λ2,x} denote the conditional distribution of X′ | X′ + X″ = x. The density of the distribution G_{λ1,λ2,x} can be written down for any F_λ with a known density function (Jørgensen, 1992).
Furthermore, it turns out that G_{λ1,λ2,x} has a simple closed form for several of the well-known distributions from Table 1; see Table 2. For example, if F_λ is the Poisson(λ) distribution, then G_{λ1,λ2,x} is the Binomial(x, λ1/(λ1 + λ2)) distribution.

Table 1: A partial list of convolution-closed distributions. The last two rows contain multivariate distributions. The results in each row are easily verifiable. The generalized Poisson and Tweedie distributions are written in their additive exponential dispersion family parameterization; see Jørgensen and Song (1998) for details.

- X ∼ Poisson(λ), where E[X] = λ and Var(X) = λ. Convolution-closed in λ.
- X ∼ N(µ, σ²), where E[X] = µ and Var(X) = σ². Convolution-closed in (µ, σ²).
- X ∼ Negative Binomial(r, p), where E[X] = r(1−p)/p and Var(X) = r(1−p)/p². Convolution-closed in r if p is fixed.
- X ∼ Gamma(α, β), where E[X] = α/β and Var(X) = α/β². Convolution-closed in α if β is fixed.
- X ∼ Binomial(r, p), where E[X] = rp and Var(X) = rp(1−p). Convolution-closed in r if p is fixed.
- X ∼ Inverse Gaussian(µw, λw²), with E[X] = µw and Var(X) = µ³w/λ. Convolution-closed in w if µ and λ are fixed.
- X ∼ Generalized Poisson(λ, θ); see Jørgensen and Song (1998) for the parameterization. Convolution-closed in λ if θ is fixed.
- X ∼ Tweedie_p(λ, θ); see Jørgensen and Song (1998) for the parameterization. Convolution-closed in λ if θ and p are fixed.
- X ∼ N_k(µ, Σ), with E[X] = µ and Var(X) = Σ. Convolution-closed in (µ, Σ).
- X ∼ Multinomial_k(r, p), with E[X] = rp and Var(X) = r(diag(p) − pp^T). Convolution-closed in r if p is fixed.

2.2 Data Thinning

Recall from Section 2.1 that G_{λ1,λ2,x} is the conditional distribution of X′ | X′ + X″ = x, where X′ ∼ F_λ1 and X″ ∼ F_λ2 with X′ independent of X″. We now introduce our proposal.

Algorithm 1: Data thinning
Input: A realization x of X ∼ F_λ, where F_λ is convolution-closed in λ with parameter space Λ. A scalar ϵ ∈ (0, 1) such that ϵλ ∈ Λ and (1−ϵ)λ ∈ Λ.
1. Draw X^(1) | X = x ∼ G_{ϵλ,(1−ϵ)λ,x}.
2. Let X^(2) = X − X^(1).
Output: (X^(1), X^(2)).

We now introduce our main theorem, which is motivated by a proposal by Joe (1996) to construct autoregressive time series processes with known marginal distributions.

Theorem 5 Suppose that we apply Algorithm 1 to a realization x of X ∼ F_λ. Then, the following results hold: (i) X^(1) ∼ F_ϵλ and X^(2) ∼ F_(1−ϵ)λ; (ii) X^(1) and X^(2) are independent; (iii) if F_λ satisfies the linear expectation property (Definition 2), then E[X^(1)] = ϵ E[X] and E[X^(2)] = (1−ϵ) E[X].

Theorem 5 is proven in Appendix A.1. The intuition for parts (i) and (ii) is as follows: if X ∼ F_λ, then X could have arisen as the sum of two independent random variables X′ ∼ F_λ1 and X″ ∼ F_λ2, with λ1 + λ2 = λ. Algorithm 1 works backwards to undo this sum by generating X^(1) and X^(2) that follow the same distribution as X′ and X″. Part (iii) follows from Definition 2. As we will see in Section 4.1, ϵ ∈ (0, 1) is a tuning parameter that governs a tradeoff between how much information is in X^(1) as opposed to X^(2).

Theorem 5 guarantees that the decomposition provided by Algorithm 1 satisfies the goals given in Section 1: namely X = X^(1) + X^(2), X^(1) and X^(2) are independent, and X^(1) and X^(2) follow the same distribution as X, up to a (known) scaling of a parameter. Table 2 summarizes the data thinning proposal for several well-known distributions. However, Algorithm 1 and Theorem 5 apply well beyond the set of distributions considered in Table 2.
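To make Algorithm 1 concrete, here is a hedged sketch of the Poisson case (the first row of Table 2, below) in base R; the function name poisson_thin is illustrative, not from any released software.

```r
# Sketch of Algorithm 1 for the Poisson case: given x from X ~ Poisson(lambda),
# draw X^(1) | X = x ~ Binomial(x, eps) and set X^(2) = x - X^(1).
poisson_thin <- function(x, eps) {
  x1 <- rbinom(length(x), size = x, prob = eps)
  list(train = x1, test = x - x1)
}

set.seed(1)
x <- rpois(100000, lambda = 10)
folds <- poisson_thin(x, eps = 0.3)
c(mean(folds$train), mean(folds$test))   # approx 3 and 7: Poisson(eps*lambda) and Poisson((1-eps)*lambda)
cor(folds$train, folds$test)             # approx 0: the two folds are independent
```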
Remark 6 In Table 2, we focus on distributions where the conditional distribution G_{λ1,λ2,x} has a recognizable form. For distributions where this is not the case, standard numerical sampling algorithms can be used to generate X^(1) and X^(2), so long as the conditional distribution can be expressed up to a normalizing constant.

Remark 7 Some of the decompositions presented in Table 2 require knowledge of an additional parameter that is not of primary interest. For example, thinning the N(µ, σ²) distribution requires knowledge of σ² (see also: Rasines and Young, 2022; Leiner et al., 2023). In Section 2.3, we explore the implications of performing data thinning in the presence of an unknown nuisance parameter.

Remark 8 Table 2 indicates that thinning the Binomial(r, p) distribution or the Multinomial(r, p) distribution requires that ϵr take on an integer value. This is because these distributions are not infinitely divisible (Joe, 1996). This restriction becomes more limiting in the extension to multiple folds given in Section 3, and prevents us from thinning the Bernoulli or categorical distributions.

Remark 9 The thinning recipe for the Gaussian can be shown to be equivalent, up to a simple rescaling by ϵ, to a procedure for splitting a Gaussian random variable with known variance that has been used in several recent papers (Rasines and Young, 2022; Leiner et al., 2023; Oliveira et al., 2021; Tian and Taylor, 2018; Tian, 2020). We derive this equivalence in Section 5.

We now give an example of an application where data thinning is useful in practice.

Example 2 (Model validation using data thinning) Suppose we observe X_ij for i = 1, . . . , n and j = 1, . . . , d, where either each X_ij is drawn independently from a univariate convolution-closed distribution that satisfies the linear expectation property from Definition 2, or else each row (X_i1, . . . , X_id)^T is drawn independently from a multivariate convolution-closed distribution that satisfies the linear expectation property. We wish to evaluate µ̂(X) as an estimator for E[X]. Computing a loss function such as mean squared error between µ̂(X) and X is unsatisfactory, since the loss function will take on a small value if we overfit the mean. Instead, we apply Algorithm 1 with ϵ ∈ (0, 1) to either each element or each row in X, such that each element X_ij is thinned into X^(1)_ij and X^(2)_ij. We compute µ̂(X^(1)), which is an estimator of E[X^(1)] = ϵ E[X] (Theorem 5, part (iii)). We then compute a loss function between µ̂(X^(1)) and X^(2). Since X^(1) is independent of X^(2) (Theorem 5, part (ii)), the loss function will not take on small values due to overfitting.

In Example 2, if ϵ = 0.5, then E[X^(1)] = E[X^(2)] = 0.5 E[X]. Thus, µ̂(X^(1)) is an estimator of E[X^(2)], and so devising a suitable loss function is straightforward. If ϵ ≠ 0.5, then µ̂(X^(1)) is a plug-in estimator of (ϵ/(1−ϵ)) E[X^(2)] (Theorem 5). The following example shows how a loss function that takes into account this factor of ϵ can be constructed in practice.

Example 3 (Example 2 with mean squared error loss) Suppose we wish to use mean squared error to define a loss function between µ̂(X^(1)) and X^(2) in Example 2. Since E[X^(2)] = ((1−ϵ)/ϵ) E[X^(1)], we compute the loss as
\[
\frac{1}{nd} \sum_{i=1}^{n} \sum_{j=1}^{d} \left( X^{(2)}_{ij} - \frac{1-\epsilon}{\epsilon}\, \hat{\mu}(X^{(1)})_{ij} \right)^2,
\]
where the factor of (1−ϵ)/ϵ turns an estimate of E[X^(1)] into an estimate of E[X^(2)]. We discuss the choice of ϵ in Section 4.1. A sketch of this loss computation is given below.
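As one illustration of Example 3, the following sketch computes the rescaled mean squared error for Poisson data thinned with Algorithm 1. The estimator here (a matrix of column means) and the function name thinned_mse are purely illustrative.

```r
# Sketch of the loss in Example 3; mu_hat_train is any estimate of E[X^(1)] = eps * E[X]
# computed from the training fold.
thinned_mse <- function(x_test, mu_hat_train, eps) {
  # Rescale the estimate of E[X^(1)] into an estimate of E[X^(2)] = (1 - eps) * E[X].
  mean((x_test - ((1 - eps) / eps) * mu_hat_train)^2)
}

# Example usage with Poisson thinning (eps = 0.6) and a column-mean estimator:
set.seed(1)
X  <- matrix(rpois(250 * 20, lambda = 8), nrow = 250)
X1 <- matrix(rbinom(length(X), size = X, prob = 0.6), nrow = nrow(X))  # Algorithm 1, eps = 0.6
X2 <- X - X1
mu_hat_train <- matrix(rep(colMeans(X1), each = nrow(X)), nrow = nrow(X))
thinned_mse(X2, mu_hat_train, eps = 0.6)
```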
Table 2: Details of data thinning for several well-known distributions, using the parameterizations given in Table 1. While the exponential distribution itself is not convolution-closed in its single parameter, recognizing it as a special case of the gamma distribution with known α = 1 yields a decomposition. In all cases, the distribution of X^(2) matches that of X^(1), with ϵ replaced by 1−ϵ, and X^(1) is independent of X^(2).

- Poisson(λ): draw X^(1) | X = x ∼ Binomial(x, ϵ). Then X^(1) ∼ Poisson(ϵλ).
- N(µ, σ²): draw X^(1) | X = x ∼ N(ϵx, ϵ(1−ϵ)σ²). Then X^(1) ∼ N(ϵµ, ϵσ²). σ² must be known.
- Negative Binomial(r, p): draw X^(1) | X = x ∼ Beta-Binomial(x, ϵr, (1−ϵ)r). Then X^(1) ∼ Negative Binomial(ϵr, p). r must be known.
- Gamma(α, β): draw Z ∼ Beta(ϵα, (1−ϵ)α) and let X^(1) = xZ. Then X^(1) ∼ Gamma(ϵα, β). α must be known.
- Exponential(λ): draw Z ∼ Beta(ϵ, 1−ϵ) and let X^(1) = xZ. Then X^(1) ∼ Gamma(ϵ, λ).
- Binomial(r, p): draw X^(1) | X = x ∼ Hypergeometric(ϵr, (1−ϵ)r, x). Then X^(1) ∼ Binomial(ϵr, p). r must be known and ϵr must be an integer.
- N_k(µ, Σ): draw X^(1) | X = x ∼ N(ϵx, ϵ(1−ϵ)Σ). Then X^(1) ∼ N_k(ϵµ, ϵΣ). Σ must be known.
- Multinomial_k(r, p): draw X^(1) | X = x ∼ Multivariate Hypergeometric(x_1, x_2, . . . , x_k, ϵr). Then X^(1) ∼ Multinomial_k(ϵr, p). r must be known and ϵr must be an integer.

2.3 Effect of Unknown Nuisance Parameters

For several of the distributions in Table 2, data thinning requires knowledge of a nuisance parameter. For example, thinning a N(µ, σ²) distribution requires knowledge of σ². We now consider what happens when we perform data thinning on Gaussian data using an incorrect value of the variance. We refer to this incorrect value as σ̃².

Proposition 10 Suppose that we observe x from X ∼ N(µ, σ²). We draw X^(1) | X = x ∼ N(ϵx, ϵ(1−ϵ)σ̃²), for some σ̃² that is not a function of x, and let X^(2) = X − X^(1). Then: (i) X^(1) ∼ N(ϵµ, ϵ²σ² + ϵ(1−ϵ)σ̃²), (ii) X^(2) ∼ N((1−ϵ)µ, (1−ϵ)²σ² + ϵ(1−ϵ)σ̃²), and (iii) cov(X^(1), X^(2)) = ϵ(1−ϵ)(σ² − σ̃²).

Part (iii) of Proposition 10 indicates that if we apply data thinning with too little noise (σ̃² < σ²), then X^(1) and X^(2) are positively correlated. On the other hand, if we apply data thinning with too much noise (σ̃² > σ²), then X^(1) and X^(2) are negatively correlated. Similar results hold for the negative binomial distribution and the gamma distribution.

Proposition 11 Suppose that we observe x from X ∼ Negative Binomial(r, p). We draw X^(1) | X = x ∼ Beta-Binomial(x, ϵr̃, (1−ϵ)r̃) for some r̃ that is not a function of x, and let X^(2) = X − X^(1). Then
\[
\mathrm{cov}\!\left(X^{(1)}, X^{(2)}\right) = \epsilon(1-\epsilon)\,\frac{r(1-p)^2}{p^2}\cdot\frac{\tilde r - r}{\tilde r + 1}.
\]

Proposition 12 Suppose that we observe x from X ∼ Gamma(α, β). We let X^(1) = xZ, where Z ∼ Beta(ϵα̃, (1−ϵ)α̃) for some α̃ that is not a function of x. We let X^(2) = X − X^(1). Then
\[
\mathrm{cov}\!\left(X^{(1)}, X^{(2)}\right) = \epsilon(1-\epsilon)\,\frac{\alpha(\tilde\alpha - \alpha)}{\beta^2(\tilde\alpha + 1)}.
\]

Propositions 10–12 are proven in Appendix A.2. Figure 1 verifies these results empirically. The results in this section assume that σ̃², r̃, and α̃ are not a function of x. In practice, one might estimate the unknown parameters σ², r, and α using additional data.

Figure 1: Left: We generate 100,000 realizations of X ∼ N(7, 5). For 50 values of σ̃², we thin X into X^(1) and X^(2) using σ̃² instead of σ² = 5. Center: We generate 100,000 realizations of X ∼ Negative Binomial(7, 0.7). For 50 values of r̃, we thin X into X^(1) and X^(2) using r̃ instead of r = 7. Right: We generate 100,000 realizations of X ∼ Gamma(7, 5). For 50 values of α̃, we thin X into X^(1) and X^(2) using α̃ instead of α = 7. All: In each panel, for each value of the nuisance parameter, we display the empirical correlation between X^(1) and X^(2) (red dots), along with the theoretical correlation suggested by Propositions 10–12 (blue lines). In all cases, we use ϵ = 0.44 for thinning.
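The following sketch, under the same simulation design as the left panel of Figure 1 but for a single value of the nuisance parameter, checks Proposition 10 empirically.

```r
# Empirical check of Proposition 10: thin Gaussian data using a wrong variance and compare
# the correlation of the two folds with the value implied by the proposition.
set.seed(1)
n <- 100000; mu <- 7; sigma2 <- 5; sigma2_wrong <- 8; eps <- 0.44

x  <- rnorm(n, mean = mu, sd = sqrt(sigma2))
x1 <- rnorm(n, mean = eps * x, sd = sqrt(eps * (1 - eps) * sigma2_wrong))
x2 <- x - x1

cor(x1, x2)  # negative, since sigma2_wrong > sigma2 (too much noise injected)
# Theoretical correlation implied by Proposition 10:
cov_theory <- eps * (1 - eps) * (sigma2 - sigma2_wrong)
var1 <- eps^2 * sigma2 + eps * (1 - eps) * sigma2_wrong
var2 <- (1 - eps)^2 * sigma2 + eps * (1 - eps) * sigma2_wrong
cov_theory / sqrt(var1 * var2)
```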
3. Multifold Data Thinning

Data thinning involves decomposing X into X^(1) and X^(2), which each have the same distribution as X (up to a known parameter scaling). It can be applied recursively to create M independent data folds, X^(1), . . . , X^(M), that sum to X, as in the following example.

Example 4 (Recursive thinning of the normal distribution) Let x denote a realization of X ∼ N(µ, σ²). Given ϵ_1, ϵ_2, ϵ_3 ∈ (0, 1) with ϵ_1 + ϵ_2 + ϵ_3 = 1, we first draw X^(1) | X = x ∼ N(ϵ_1 x, ϵ_1(1−ϵ_1)σ²), and let X^(2,3) = X − X^(1). By Theorem 5, X^(1) and X^(2,3) are independent, with X^(1) ∼ N(ϵ_1µ, ϵ_1σ²) and X^(2,3) ∼ N((1−ϵ_1)µ, (1−ϵ_1)σ²). We next draw
\[
X^{(2)} \mid X^{(2,3)} \sim N\!\left(\frac{\epsilon_2}{1-\epsilon_1}X^{(2,3)},\; \frac{\epsilon_2}{1-\epsilon_1}\left(1-\frac{\epsilon_2}{1-\epsilon_1}\right)(1-\epsilon_1)\sigma^2\right),
\]
and let X^(3) = X − X^(1) − X^(2). By Theorem 5, X^(2) and X^(3) are independent, with X^(2) ∼ N(ϵ_2µ, ϵ_2σ²) and X^(3) ∼ N(ϵ_3µ, ϵ_3σ²). Furthermore, since (X^(2), X^(3)) is a function of X^(2,3), the pair (X^(2), X^(3)) remains independent of X^(1). Thus, X^(1), X^(2), and X^(3) are mutually independent, with X^(m) ∼ N(ϵ_mµ, ϵ_mσ²) for m = 1, 2, 3.

While Example 4 can be extended to create M > 3 folds, this recursive approach can be cumbersome. In Example 1 of Section 1, we saw that, for the gamma distribution, there is a simple way to create multiple folds without recursion. We will now provide a general form of this result. Let G_{λ1,λ2,...,λM,x} denote the joint distribution of (X_1, . . . , X_M) | X_1 + X_2 + . . . + X_M = x, where X_m ∼ind. F_λm for m = 1, . . . , M, and where F_λ is a convolution-closed distribution. The following algorithm and theorem mimic Algorithm 1 and Theorem 5.

Algorithm 2: Multifold data thinning
Input: A realization x of X ∼ F_λ, where F_λ is a convolution-closed distribution with parameter space Λ. Scalars ϵ_1, . . . , ϵ_M ∈ (0, 1) such that Σ_{m=1}^M ϵ_m = 1 and ϵ_mλ ∈ Λ for m = 1, . . . , M.
1. Draw (X^(1), . . . , X^(M)) | X = x ∼ G_{ϵ_1λ, ϵ_2λ, . . . , ϵ_Mλ, x}.
Output: (X^(1), . . . , X^(M)).

Table 3: Details of how to perform multifold data thinning (Algorithm 2) for several common univariate distributions, where ϵ = (ϵ_1, . . . , ϵ_M)^T. In the decomposition of the binomial distribution, ϵ_m r must be an integer. Each row can be verified using properties of these distributions.

- Poisson(λ): (X^(1), . . . , X^(M)) | X = x ∼ Multinomial(x, ϵ_1, . . . , ϵ_M). Then X^(m) ∼ Poisson(ϵ_mλ).
- N(µ, σ²): (X^(1), . . . , X^(M)) | X = x ∼ N(xϵ, σ²diag(ϵ) − σ²ϵϵ^T). Then X^(m) ∼ N(ϵ_mµ, ϵ_mσ²).
- Negative Binomial(r, p): (X^(1), . . . , X^(M)) | X = x ∼ Dirichlet-Multinomial(x, ϵ_1r, . . . , ϵ_Mr). Then X^(m) ∼ Negative Binomial(ϵ_mr, p).
- Gamma(α, β): draw Z ∼ Dirichlet(ϵ_1α, . . . , ϵ_Mα) and let (X^(1), . . . , X^(M)) = xZ. Then X^(m) ∼ Gamma(ϵ_mα, β).
- Exponential(λ): draw Z ∼ Dirichlet(ϵ_1, . . . , ϵ_M) and let (X^(1), . . . , X^(M)) = xZ. Then X^(m) ∼ Gamma(ϵ_m, λ).
- Binomial(r, p): (X^(1), . . . , X^(M)) | X = x ∼ Multivariate Hypergeometric(ϵ_1r, . . . , ϵ_Mr, x). Then X^(m) ∼ Binomial(ϵ_mr, p).

Theorem 13 Suppose we apply Algorithm 2 to a realization x of X ∼ F_λ, for a convolution-closed distribution F_λ. Then, the following results hold: (i) X^(m) ∼ F_ϵmλ for m = 1, . . . , M; (ii) X^(1), . . . , X^(M) are mutually independent; (iii) X^(1) + X^(2) + · · · + X^(M) = X; and (iv) if F_λ satisfies the linear expectation property (Definition 2), then E[X^(m)] = ϵ_m E[X] for m = 1, . . . , M.

The proof of Theorem 13 is included in Appendix A.1, and is a straightforward extension of that of Theorem 5. The intuition for parts (i)–(iii) is as follows: we know that X ∼ F_λ could have arisen as the sum of M mutually independent random variables X_1, . . . , X_M such that X_m ∼ F_ϵmλ. If we draw (X^(1), . . . , X^(M)) | X = x ∼ G_{ϵ_1λ, ϵ_2λ, . . . , ϵ_Mλ, x}, then the joint distribution of (X^(1), . . . , X^(M)) equals the joint distribution of (X_1, . . . , X_M), i.e. it is the joint distribution of M independent random variables with distributions F_ϵ1λ, . . . , F_ϵMλ. Part (iv) follows directly from Definition 2.
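As an illustration of Algorithm 2, the following base R sketch thins Poisson observations into M = 3 folds using the multinomial conditional distribution from the first row of Table 3; the function name poisson_multifold is illustrative.

```r
# Sketch of Algorithm 2 for the Poisson row of Table 3: thin each count into M folds
# with a single multinomial draw.
poisson_multifold <- function(x, eps) {
  stopifnot(abs(sum(eps) - 1) < 1e-8)
  t(sapply(x, function(xi) rmultinom(1, size = xi, prob = eps)))  # one row per observation
}

set.seed(1)
x <- rpois(50000, lambda = 12)
folds <- poisson_multifold(x, eps = c(0.2, 0.3, 0.5))
colMeans(folds)                  # approx 2.4, 3.6, 6.0: fold m is Poisson(eps_m * lambda)
all(rowSums(folds) == x)         # folds sum to the original counts
cor(folds)                       # off-diagonal entries approx 0: folds are mutually independent
```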
We now revisit the case of the Gaussian distribution from Example 4.

Example 5 (Multifold thinning of the normal distribution) Let X ∼ N(µ, σ²) and let ϵ_1, ϵ_2, ϵ_3 > 0 with ϵ_1 + ϵ_2 + ϵ_3 = 1. To generate M = 3 independent folds of the data, we draw
\[
\begin{pmatrix} X^{(1)} \\ X^{(2)} \\ X^{(3)} \end{pmatrix} \Bigg|\; X = x \;\sim\; N\!\left(
\begin{pmatrix} \epsilon_1 x \\ \epsilon_2 x \\ \epsilon_3 x \end{pmatrix},\;
\begin{pmatrix}
\epsilon_1(1-\epsilon_1)\sigma^2 & -\epsilon_1\epsilon_2\sigma^2 & -\epsilon_1\epsilon_3\sigma^2 \\
-\epsilon_1\epsilon_2\sigma^2 & \epsilon_2(1-\epsilon_2)\sigma^2 & -\epsilon_2\epsilon_3\sigma^2 \\
-\epsilon_1\epsilon_3\sigma^2 & -\epsilon_2\epsilon_3\sigma^2 & \epsilon_3(1-\epsilon_3)\sigma^2
\end{pmatrix}
\right).
\]
One can verify that this multivariate normal corresponds to G_{ϵ_1λ,ϵ_2λ,ϵ_3λ,x}. By Theorem 13, X^(1), X^(2), and X^(3) are independent and X^(m) ∼ N(ϵ_mµ, ϵ_mσ²) for m = 1, 2, 3. This distribution G_{ϵ_1λ,ϵ_2λ,ϵ_3λ,x} is a degenerate multivariate normal distribution, which enforces the constraint that the realized values of X^(1), X^(2), and X^(3) sum to x.

Table 3 reveals that G_{ϵ_1λ,ϵ_2λ,...,ϵ_Mλ,x} in Algorithm 2 has a very simple form for every univariate distribution in Table 2. We omit the multivariate distributions to avoid cumbersome notation. Once again, in cases where the conditional distribution is not a recognizable distribution, if its density is known up to a normalizing constant we can generate X^(1), . . . , X^(M) using sampling techniques. The role of the parameters ϵ_1, . . . , ϵ_M in Algorithm 2 is discussed in Section 4.1. We now consider the following extension of Example 2.

Example 6 (Cross validation using multifold thinning) In the setting of Example 2, we apply Algorithm 2 with parameters ϵ_1, . . . , ϵ_M to either each element or each row of X, such that each element X_ij is thinned into X^(1)_ij, . . . , X^(M)_ij. Then, for m = 1, . . . , M, we first define X^(−m) := X − X^(m). We obtain µ̂(X^(−m)), which is an estimator of E[X^(−m)] = (1 − ϵ_m) E[X]. We then compute a loss function between µ̂(X^(−m)) and X^(m). For example, as in Example 3, we can compute the mean squared error between (ϵ_m/(1 − ϵ_m)) µ̂(X^(−m)) and X^(m). We evaluate the estimator µ̂(·) by averaging the loss across folds.

The advantage of multifold thinning (Example 6) over single-fold thinning with ϵ = 1/M (Example 2) is reduction of the variance of the loss function via averaging. We will demonstrate the practical advantages of multifold thinning in Section 6.

4. Comparing Data Thinning and Sample Splitting

In comparing data thinning and sample splitting in a particular setting, there are two considerations. First, we must figure out if each method is applicable. Then, in settings where both are applicable, we must figure out if one method is preferable.

Data thinning requires an assumption that each entry in our data set is drawn from a specific (but possibly different) convolution-closed distribution. Sample splitting requires no such parametric assumption. On the other hand, we cannot apply sample splitting if our task requires estimating a parameter for every individual observation or drawing conclusions about specific observations. For example, suppose that we wish to evaluate the performance of a clustering algorithm. After sample splitting, clustering the observations in the training set does not yield cluster assignments for the observations in the test set, and thus there is nothing to evaluate on the test set (see, e.g., Gao et al., 2022; Fu and Perry, 2020). Similarly, it is not clear how to use sample splitting to validate a low-rank matrix approximation, for which a latent coordinate must be estimated for each of the n observations (see, e.g., Owen and Perry, 2009).
We focus on these examples, where sample splitting is not an option, in Sections 6 and 7. In Section 4.1, we show that the parameters ϵ_1, . . . , ϵ_M in multifold data thinning (Algorithm 2) control the allocation of information across folds. In Section 4.2, we use this information allocation result to theoretically argue that sample splitting and data thinning achieve similar performance, but that data thinning may be preferable in settings where the observations are not identically distributed. In particular, we see that sample splitting is not an attractive choice for fixed-covariate regression in the presence of high-leverage points. In Section 4.3, we empirically compare data thinning and sample splitting in a setting where both are applicable.

4.1 Role of the Parameters ϵ_1, . . . , ϵ_M in Multifold Data Thinning

In Algorithm 2, the parameters ϵ_1, . . . , ϵ_M determine how the information in a random variable X about an unknown parameter is allocated across folds of data.

Theorem 14 Suppose that we thin a random variable X using Algorithm 2 with parameters ϵ_1, . . . , ϵ_M to obtain X^(1), . . . , X^(M). Let I_X(θ) denote the Fisher information contained in X about an unknown parameter θ, i.e. a parameter in the distribution of X that does not appear in G_{ϵ_1λ,...,ϵ_Mλ,x}. Assume that this Fisher information exists. Then I_{X^(m)}(θ) = ϵ_m I_X(θ) for m = 1, . . . , M.

Remark 15 In the context of Theorem 14, the parameter θ may or may not be a function of the convolution-closed parameter λ, but it must be a parameter that is unknown during the thinning process.

Below, we list a few examples where Theorem 14 applies and where it does not.

- Poisson distribution: Let X ∼ Poisson(λ), and suppose we thin X to obtain X^(m) ∼ Poisson(ϵ_mλ). As λ is unknown during the thinning process, Theorem 14 says that I_{X^(m)}(λ) = ϵ_m I_X(λ). We can easily verify this by direct calculation: I_X(λ) = 1/λ and I_{X^(m)}(λ) = ϵ_m/λ.
- Binomial distribution: Let X ∼ Binomial(r, p), and suppose that we thin X to obtain X^(m) ∼ Binomial(ϵ_mr, p). Then I_X(p) = r/(p(1 − p)), and direct calculation verifies that I_{X^(m)}(p) = ϵ_m I_X(p). On the other hand, Theorem 14 makes no claims about the parameter r, since r must be known during thinning.
- Gaussian distribution: Let X ∼ N(µ, σ²), and suppose that we thin X to obtain X^(m) ∼ N(ϵ_mµ, ϵ_mσ²). Then direct computation verifies that I_{X^(m)}(µ) = ϵ_m I_X(µ). However, Theorem 14 makes no claims about the parameter σ², since it must be known during thinning.

Theorem 14 implies that when choosing ϵ_1, . . . , ϵ_M for Algorithm 2, or when choosing ϵ for Algorithm 1, one should consider how much information to devote to the training task (i.e. model fitting) as opposed to the testing task (i.e. model evaluation). In Example 2, as ϵ increases, the quality of the estimator µ̂(X^(1)) increases, but the information available for computing the loss between this estimator and X^(2) decreases. Recall that applying Algorithm 1 to X with parameter ϵ = 1/M yields X^(1) and X^(2) with the same distributions as X^(m) and X − X^(m) when we apply Algorithm 2 to X with ϵ_1 = . . . = ϵ_M = 1/M. Thus, the information allocation between X^(m) and X − X^(m) after multifold thinning is the same as the information allocation between X^(1) and X^(2) after two-fold thinning with ϵ = 1/M. The advantage of using multiple folds for model validation (i.e. Example 6 rather than Example 2) comes from the reduction in variance that results from averaging the loss function across folds.
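As a quick numerical check of the Poisson information split described above, the following sketch compares the variance of the maximum likelihood estimator of λ computed from X^(1) (namely X^(1)/ϵ) with that of the estimator computed from X; the ratio should be roughly 1/ϵ, consistent with I_{X^(1)}(λ) = ϵ I_X(λ).

```r
# The MLE of lambda from X^(1) ~ Poisson(eps * lambda) is X^(1)/eps; its variance should be
# about 1/eps times the variance of the MLE based on the full observation X.
set.seed(1)
lambda <- 10; eps <- 0.25; reps <- 200000
x  <- rpois(reps, lambda)
x1 <- rbinom(reps, size = x, prob = eps)   # Algorithm 1 for the Poisson
var(x1 / eps) / var(x)                     # approx 1/eps = 4
```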
4.2 Theoretical Comparison to Sample Splitting

In this section, we focus on the case of two-fold thinning (Algorithm 1) for simplicity. Understanding the way in which the parameter ϵ in Algorithm 1 partitions information about parameters of interest helps us draw direct connections between data thinning (with parameter ϵ) and sample splitting (with ϵ denoting the proportion of observations assigned to the training set).

Corollary 16 Suppose that we observe n independent and identically distributed (iid) random variables X = (X_1, . . . , X_n), where X_i ∼ F_λ. Assume that ϵn is an integer and that ϵλ ∈ Λ. Consider the following two methods for splitting X into independent training and test sets.

- Sample splitting: Assign a specific set of ϵn observations to the training set, denoted X^train_ss, and the remaining (1−ϵ)n observations to the test set, denoted X^test_ss.
- Data thinning: For i = 1, . . . , n, thin X_i into X^(1)_i and X^(2)_i by applying Algorithm 1 with parameter ϵ. Let X^train_dt = (X^(1)_1, . . . , X^(1)_n) be the training set and let X^test_dt = (X^(2)_1, . . . , X^(2)_n) be the test set.

Let θ be an unknown parameter of interest, as in Theorem 14. Then I_{X^train_ss}(θ) = I_{X^train_dt}(θ) = ϵ I_X(θ), and I_{X^test_ss}(θ) = I_{X^test_dt}(θ) = (1−ϵ) I_X(θ).

In Corollary 16, I_{X^train_ss}(θ) takes on the same value for any specific allocation of datapoints to the training set. This means that if we instead randomly allocate data points to the training set, as is typical in practice, the information split remains identical to data thinning. Consequently, we expect sample splitting and data thinning to have similar performance regarding inference on unknown parameters in settings where both are options and where the datapoints are independent and identically distributed. A similar point was made by Leiner et al. (2023), who view their data fission technique as a continuous analog of sample splitting, since the requirement for sample splitting that ϵn be an integer limits the choice of ϵ, especially when n is small.

We next consider a setting where the observations are independent but not identically distributed, and thus the difference between data thinning and sample splitting is more pronounced.

Example 7 (Fixed-covariate regression) Suppose that we observe a fixed set of covariates Z_1, . . . , Z_n. Suppose that X_i ∼ind. N(βZ_i, σ²) for i = 1, . . . , n, and let X = (X_1, . . . , X_n). In this setting, I_{X_i}(β) = Z_i²/σ², meaning that observations with larger values of Z_i² (high-leverage points) contain more information about β, the unknown parameter of interest. Let I_{X^train_ss}(β) and I_{X^train_dt}(β) be defined as in Corollary 16, where the unknown parameter of interest is the slope β. Then
\[
I_{X^{train}_{dt}}(\beta) = \sum_{i=1}^{n} I_{X^{(1)}_i}(\beta) = \sum_{i=1}^{n} \frac{\epsilon Z_i^2}{\sigma^2} = \epsilon I_X(\beta),
\qquad
I_{X^{train}_{ss}}(\beta) = \sum_{i \in \mathrm{train}} I_{X_i}(\beta) = \sum_{i \in \mathrm{train}} \frac{Z_i^2}{\sigma^2},
\]
where train denotes the specific indices of the observations assigned to the training set. Thus, while data thinning always allocates a fraction ϵ of the Fisher information to the training set, the information allocation of sample splitting depends on the specific assignment of observations to the training set.

Example 7 shows that I_{X^train_ss}(β) need not equal I_{X^train_dt}(β) for a particular assignment of observations to the training set. However, when we perform sample splitting, we typically randomly allocate datapoints to the training set. Under such a procedure, I_{X^train_ss}(θ) from Example 7 becomes a random variable that depends on the particular split of the data.
If all permissible random splits of the data are equally likely, then E[I_{X^train_ss}(θ)] = I_{X^train_dt}(θ), where the expected value is taken over all permissible splits of the data. The same equality holds for the test set, i.e. E[I_{X^test_ss}(θ)] = I_{X^test_dt}(θ). Despite this equivalence in the expected information allocation, Proposition 1 from Rasines and Young (2022) provides clever insight that tells us why we might prefer the non-random information allocation of data thinning in this setting. Remark 17 instantiates the general result of Rasines and Young (2022) in the simple setting of Example 7.

Remark 17 (Effect on confidence interval width) In the context of Example 7, suppose that our ultimate goal involves forming a confidence interval for β using the test set. If we are using a maximum-likelihood estimator for β, then the widths of the confidence intervals computed on the test set under sample splitting and data thinning are proportional, respectively, to the square roots of 1/I_{X^test_ss}(β) and 1/I_{X^test_dt}(β). Jensen's inequality yields the following result:
\[
E\!\left[\frac{1}{I_{X^{test}_{ss}}(\beta)}\right] \;\geq\; \frac{1}{E\!\left[I_{X^{test}_{ss}}(\beta)\right]} \;=\; \frac{1}{I_{X^{test}_{dt}}(\beta)}.
\]
Thus, the confidence intervals for β computed using sample splitting will be wider, on average, than those computed using data thinning in the setting where our observations are not identically distributed. A natural corollary is that sample splitting will achieve lower power than data thinning in this setting, as we will see in Section 4.3.

4.3 Empirical Comparison to Sample Splitting

We now show empirically that sample splitting and data thinning achieve comparable performance in settings where both are applicable and where the observations are independent and identically distributed. We let n = 100 and p = 20, and we generate 10,000 realizations of Z ∈ R^{n×p}, where the entries of Z are drawn independently from N(0, 1). For each realization, we then generate X | Z ∼ N_n(Zβ, I_n), where β_1 = β_2 = · · · = β_5 = β and β_6 = β_7 = · · · = β_20 = 0. Our goal is to perform model selection to identify the covariates with non-zero entries in β, and then to form confidence intervals for the coefficients of the selected covariates. As we cannot naively use the entire data set (Z, X) to do both model selection and inference, we consider the following approach.

Step 1: Split the data into a training set and a test set.
Step 2: Perform forward stepwise regression on the training set to select a model that includes some subset of the p = 20 covariates. We use the R function step with its default settings.
Step 3: Re-fit the selected model using the test set and report the standard confidence intervals for each selected coefficient.

We carry out this process using two different methods.

- Sample splitting: Randomly generate a set train ⊂ {1, . . . , n} with |train| = ϵn. Let {(Z_i, X_i) : i ∈ train} be the training set and let {(Z_i, X_i) : i ∈ {1, . . . , n} \ train} be the test set.
- Data thinning: Apply Algorithm 1 to each X_i for i = 1, . . . , n. Let {(Z_i, X^(1)_i) : i ∈ {1, . . . , n}} be the training set and let {(Z_i, X^(2)_i) : i ∈ {1, . . . , n}} be the test set.

We carry out each method using ϵ = 0.2, ϵ = 0.5, and ϵ = 0.8. For each method, we consider (i) detection: the proportion of data sets for which Z_3 appears in the selected model, and (ii) power: the proportion of data sets for which the confidence interval for β_3 does not include 0, among those where Z_3 appeared in the selected model. These metrics focus on one of the important covariates, Z_3, but the results are similar for each of the important covariates (i.e. Z_j for j ∈ {1, . . . , 5}). A sketch of the data thinning arm of a single replicate of this simulation is given below.
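The following is a hedged sketch of a single replicate of the data thinning arm of this simulation, assuming the known error variance σ² = 1 from the simulation design; the signal strength used for the first five coefficients is illustrative.

```r
# Single replicate of the data thinning arm of the Section 4.3 simulation.
set.seed(1)
n <- 100; p <- 20; eps <- 0.5
beta <- c(rep(1, 5), rep(0, p - 5))     # illustrative signal strength for the first 5 covariates
Z <- matrix(rnorm(n * p), n, p)
x <- as.numeric(Z %*% beta + rnorm(n))

# Step 1: thin each response into a training part and a test part (Gaussian row of Table 2).
x_train <- rnorm(n, mean = eps * x, sd = sqrt(eps * (1 - eps) * 1))
x_test  <- x - x_train

# Step 2: forward stepwise selection on the training set.
train_df <- data.frame(y = x_train, Z)
upper <- reformulate(setdiff(names(train_df), "y"), response = "y")
selected <- step(lm(y ~ 1, data = train_df), scope = upper, direction = "forward", trace = 0)

# Step 3: re-fit the selected model on the independent test set and form confidence intervals.
test_df <- data.frame(y = x_test, Z)
confint(lm(formula(selected), data = test_df))
```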
The results are shown in the top row of Figure 2. As expected, data thinning and sample splitting achieve nearly identical results for these two metrics, with ϵ governing a tradeoff between detection and power.

We then repeat the experiment, but we let n = 42 and we let Z be a fixed matrix that contains a single row whose elements are drawn from N(5, 1) rather than N(0, 1). This corresponds to having a single high-leverage observation in the data set that contains most of the information about β. We only consider ϵ ≥ 0.5, since in this setting where n ≈ 2p, using a smaller value of ϵ would complicate our ability to fit a linear regression model on the training set. As shown in the bottom row of Figure 2, data thinning outperforms sample splitting in this setting, because for any particular split, sample splitting either leaves very little information in the training set or very little in the test set. The power results confirm the insight from Remark 17. The general finding that data thinning outperforms sample splitting in this setting mirrors findings from Leiner et al. (2023) and Rasines and Young (2022), which is not surprising in light of Remark 9, which states that the Gaussian data thinning proposal is equivalent (up to a simple rescaling) to the Gaussian randomization or fission proposals of these prior papers. Thus, the similar conclusions are not a coincidence, and the properties of data thinning seen in this setting can also be interpreted as properties of (Gaussian) data fission.

Figure 2: Comparison of data thinning and sample splitting, using the detection and power metrics defined in Section 4.3. The top row shows the results of the large n setting where the observations are independent and identically distributed (iid), and thus data thinning and sample splitting achieve nearly identical results across values of ϵ. The bottom row shows the results in the small n setting, in which the observations are not identically distributed (non-iid) and the presence of a high-leverage point causes data thinning to outperform sample splitting.

5. Comparing Data Thinning and Data Fission

As mentioned in Section 1, the data fission proposal of Leiner et al. (2023) provides an alternate set of strategies to decompose a single realization X into X^(1) and X^(2). In this section, we compare and contrast the two approaches.

5.1 Independent Decompositions

Leiner et al. (2023) provide a strategy for obtaining independent X^(1) and X^(2) only in the case where X is Poisson or Gaussian. In the case of the Poisson distribution, the proposal of Leiner et al. (2023) coincides exactly with the proposal obtained from Algorithm 1 in this paper. This proposal follows from a classical property of the Poisson distribution (see, e.g., Durrett, 2019, Section 3.7.2), and has recently been applied in contexts related to that of this paper by Oliveira et al. (2022); Sarkar and Stephens (2021); Gerard (2020); Chen et al. (2021) and Neufeld et al. (2022). In the case of the Gaussian distribution, the proposal of Leiner et al. (2023) has also been used by Tian and Taylor (2018), Oliveira et al. (2021), and Rasines and Young (2022), among others. It does not follow directly from Algorithm 1 in this paper, since X ≠ X^(1) + X^(2). However, in Example 8, we show the proposal of Leiner et al.
(2023) is a simple rescaling of the proposal in this paper.

Example 8 (Comparison of two Gaussian decompositions) Consider the task of splitting the N(µ, σ²) distribution into two independent normally-distributed random variables, with σ² known. The data thinning proposal is given in Table 2, and leads to X^(1) ∼ N(ϵµ, ϵσ²) and X^(2) ∼ N((1−ϵ)µ, (1−ϵ)σ²), where X^(1) + X^(2) = X. The data fission proposal is as follows: given a value of τ > 0, we draw Z ∼ N(0, σ²), and then let X′ = X + τZ and X″ = X − (1/τ)Z. Then, X′ ∼ N(µ, (1 + τ²)σ²) and X″ ∼ N(µ, (1 + 1/τ²)σ²), with X′ independent of X″. Under this decomposition, X′ + X″ ≠ X, but
\[
\frac{1}{1+\tau^2}X' + \frac{\tau^2}{1+\tau^2}X'' = \frac{1}{1+\tau^2}(X + \tau Z) + \frac{\tau^2}{1+\tau^2}\left(X - \frac{1}{\tau}Z\right) = X.
\]
We can easily verify that, if we let ϵ = 1/(1+τ²), then the random variables (1/(1+τ²))X′ and (τ²/(1+τ²))X″ obtained via data fission have the same distributions (both marginally and conditional on X = x) as X^(1) and X^(2) obtained via data thinning. For example, we can easily verify that (1/(1+τ²))X′ ∼ N((1/(1+τ²))µ, (1/(1+τ²))σ²) = N(ϵµ, ϵσ²). Thus, the two decompositions are identical, up to a scaling of X^(1) and X^(2) by a (known) constant.

The main idea of Example 8 extends to the decomposition of the multivariate normal given in Table 2 of this paper, and the corresponding decomposition from Leiner et al. (2023).

5.2 Non-independent Decompositions

With the exception of the Gaussian and Poisson distributions, the decompositions of Leiner et al. (2023) do not yield X^(1) and X^(2) that are independent. Instead, the goal of Leiner et al. (2023) is to obtain a decomposition such that the distributions of X^(1) and X^(2) | X^(1) are tractable. While in principle we can fit a model to X^(1) and validate it using the conditional distribution of X^(2) | X^(1), we will see in this section that this can be difficult to carry out in practice. In particular, we note the following drawbacks of the non-independent decompositions of Leiner et al. (2023).

(1) The distribution of X^(1), and the conditional distribution of X^(2) | X^(1), need not resemble the distribution of X. Thus, if the goal is to evaluate a potential model for X, it is not always clear what model to fit to X^(1). We will illustrate this drawback in Example 9.

(2) The parameters of interest are entangled in the conditional distribution of X^(2) | X^(1). We will illustrate this issue in Example 10.

(3) The tuning parameter that governs the information trade-off between X^(1) and X^(2) can be hard to interpret. For instance, in the case of the gamma decomposition in Example 9, the tuning parameter is B ∈ {1, 2, . . .}, but in the case of the negative binomial distribution in Example 10, it is ϵ ∈ (0, 1). These both contrast with Example 8, where the tuning parameter was τ > 0. Whereas our Theorem 14 guarantees a specific, pre-specified allocation of Fisher information between X^(1) and X^(2) after data thinning, the allocation of Fisher information between X^(1) and X^(2) | X^(1) after data fission can depend on the value of unknown parameters. We will illustrate this final point in Example 9.

(4) The roles of X^(1) and X^(2) cannot be interchanged. For example, in the decomposition of the Bernoulli(θ) distribution given in Leiner et al. (2023), the distributions of X^(1) and X^(2) | X^(1) each contain information about θ. However, the distribution of X^(1) | X^(2) contains no information about θ. Furthermore, while Remark 1 in Leiner et al.
(2023) provides a strategy for obtaining multiple folds of training data for the data fission decompositions that are constructed using the conjugate prior strategy, these folds are not marginally independent of one another (they are conditionally independent given X). Beyond these specific decompositions, Leiner et al. (2023) do not provide a clear strategy for extending their decompositions to the case of multiple folds. Thus, it is not clear in general how to use data fission decompositions to carry out cross validation.

To illustrate points (1) and (3), we consider the gamma distribution.

Example 9 (Gamma decomposition, data fission approach) Suppose X ∼ Gamma(α, β). For a tuning parameter B ∈ {1, 2, . . .}, Leiner et al. (2023) propose drawing Z = (Z_1, . . . , Z_B), where Z_i ∼ind. Poisson(X), and thus each Z_i marginally follows a Negative Binomial(α, β/(β + 1)) distribution, and the Z_i are independent conditional on X. Take X^(1) = Z and X^(2) = X. Then, the conditional distribution of X^(2) | X^(1) is Gamma(α + Σ_{i=1}^B Z_i, β + B). Furthermore, while I_X(β) = α/β², direct calculation yields that I_{X^(2)|X^(1)}(β) = α/(β² + βB). This stands in notable contrast to Example 1 in Section 1, in which data thinning provides independent (and gamma-distributed) random variables. Given that X^(1) does not resemble X, it is not clear how to apply this decomposition in the setting of Example 12 from Section 6, where we would like to apply a clustering algorithm to X^(1) to estimate the true cluster structure of X. Unless B = 1, X^(1) and X do not even have the same dimensions, which makes it difficult to know what type of clustering algorithm to apply or how to interpret the results.

Furthermore, the proportion of Fisher information about β in X that is allocated to the test set X^(2) | X^(1) in Example 9 depends not only on the tuning parameter B, but also on the value of the unknown parameter β. This is in contrast to data thinning, where Theorem 14 guarantees that this proportion will always be equal to the tuning parameter ϵ_2.

To illustrate point (2), we revisit Example 2 to see a concrete example in which data thinning is straightforward but the proposal of Leiner et al. (2023) is difficult to use in practice.

Example 10 (A comparison of negative binomial decompositions) We observe X_ij ∼ Negative Binomial(r_ij, p_ij) for i = 1, . . . , n and j = 1, . . . , p. Our goal is to evaluate a function µ̂(X) as an estimator for E[X], where E[X]_ij = r_ij(1 − p_ij)/p_ij.

From Table 2, data thinning requires r_ij to be known, and yields X^(1)_ij ∼ Negative Binomial(ϵr_ij, p_ij) and X^(2)_ij ∼ Negative Binomial((1−ϵ)r_ij, p_ij), with X^(1) independent of X^(2), E[X^(1)] = ϵE[X], and E[X^(2)] = (1−ϵ)E[X]. Thus, as in Example 2 and Example 3, µ̂(X^(1)) is an estimator for ϵE[X], and we can evaluate µ̂(X^(1)) by computing the mean squared error between X^(2) and ((1−ϵ)/ϵ)µ̂(X^(1)).

For ϵ ∈ (0, 1), the data fission proposal of Leiner et al. (2023) draws X^(1)_ij | X_ij ∼ Binomial(X_ij, ϵ) and sets X^(2) = X − X^(1). Under this decomposition, X^(1)_ij ∼ Negative Binomial(r_ij, p_ij/(p_ij + ϵ(1 − p_ij))), and so E[X^(1)] = ϵE[X]. Although marginally E[X^(2)] = (1−ϵ)E[X], as X^(1) and X^(2) are not independent, we cannot simply use mean squared error loss between X^(2) and ((1−ϵ)/ϵ)µ̂(X^(1)) to evaluate the estimator. Instead, we must construct a loss function that evaluates µ̂(X^(1)) as an estimator of E[X] in the conditional distribution of X^(2) | X^(1), which is given by X^(2)_ij | X^(1)_ij ∼ Negative Binomial(r_ij + X^(1)_ij, p_ij + ϵ − p_ijϵ).
As
\[
E\!\left[X^{(2)}_{ij} \mid X^{(1)}_{ij}\right] = \left(r_{ij} + X^{(1)}_{ij}\right)\frac{1 - p_{ij} - \epsilon + \epsilon p_{ij}}{p_{ij} + \epsilon - \epsilon p_{ij}}
\]
is not a simple function of E[X], this is not a straightforward task. At a minimum, it involves disentangling the roles of the parameters r_ij and p_ij. Furthermore, while at first glance it might appear that an advantage of data fission over data thinning is that the former does not require knowledge of r_ij to obtain X^(1)_ij and X^(2)_ij, evaluating an estimator of E[X] using this conditional distribution will require knowing or accurately estimating the nuisance parameters r_ij, due to the aforementioned parameter entanglement issue.

Similar issues arise for other decompositions given in Leiner et al. (2023). For example, the data fission decomposition of the binomial distribution yields a complicated unnamed distribution for X^(2) | X^(1), which would be very difficult to use in the context of Example 11.

6. Simulation Study

6.1 Simulation Setup

In this section, we focus on the application of data thinning to cross-validation in two settings. We contrast its performance to naive approaches that use the same data to both fit and validate the models. Specifically, we consider Examples 11 and 12. In each of these examples, sample splitting is not a viable option, as the parameters of interest have dimensions equal to the number of observations. Furthermore, as pointed out in Section 5.2, applying data fission in these settings is not straightforward.

Example 11 (Choosing the number of principal components on binomial data) We generate data with n = 250 observations and d = 100 dimensions. Specifically, for i = 1, . . . , n and j = 1, . . . , d, we generate X_ij ∼ind. Binomial(r, p_ij), where r = 100 and p is an unknown n × d matrix of probabilities. We construct logit(p) as a rank-K* = 10 matrix with singular values 5, 6, . . . , 14. Additional details are provided in Section D. Our goal is to estimate K*.

Example 12 (Choosing the number of clusters on gamma data) We generate data sets X ∈ R^{n×d} such that there are 100 observations in each of K* clusters, for a total of n = 100K* observations. Our objective is to estimate K*. We let X_ij ∼ind. Gamma(λ, θ_{c_i,j}) for i = 1, . . . , n and j = 1, . . . , d, where c_i ∈ {1, 2, . . . , K*} indexes the true cluster membership of the ith observation. The shape parameter λ is a known constant common across all clusters and all dimensions, whereas the rate parameter θ is an unknown K* × d matrix such that each cluster has its own d-dimensional rate parameter. We generate data under two regimes: (1) a small d, small K* regime in which d = 2 and K* = 4, and (2) a large d, large K* regime in which d = 100 and K* = 10. The values of λ and θ are provided in Section D. A sample small d, small K* data set is presented in Figure 3, alongside the output of data thinning with ϵ = 0.5.

Figure 3: Left: A simulated data set in the d = 2, K* = 4 setting described in Example 12. Center/Right: The result of data thinning with ϵ = 0.5.

6.2 Methods

We use Algorithm 3 to select the number of principal components in binomial data, as in Example 11, using data thinning.

Algorithm 3: Evaluating binomial principal components with negative log-likelihood loss
Input: A positive integer K, a matrix X ∈ Z^{n×d}_{[0,r]} where X_ij ∼ind. Binomial(r, p_ij), and positive scalars ϵ^(train) and ϵ^(test) = 1 − ϵ^(train) such that ϵ^(train)r, ϵ^(test)r ∈ Z_{>0}.
1. Apply data thinning to X to obtain X^(train) and X^(test), where X^(train)_ij ∼ind. Binomial(ϵ^(train)r, p_ij) and X^(test)_ij ∼ind. Binomial(ϵ^(test)r, p_ij).
2. Compute the singular value decomposition of the log-odds of X^(train): Û D̂ V̂^T = logit{(X^(train) + 0.001)/(ϵ^(train)r + 0.002)}. The pseudo-counts prevent taking the logit of 0 or 1.
3. Construct the rank-K approximation of X^(train): p^(K) := expit(Û_{1:K} D̂_{1:K} V̂^T_{1:K}).
4. Compute the negative log-likelihood loss on X^(test): −Σ_{i=1}^n Σ_{j=1}^d log f(X^(test)_ij | ϵ^(test)r, p^(K)_ij), where f(· | r, p) is the density function for the Binomial(r, p) distribution.

We use Algorithm 4 to select the number of clusters in gamma data, as in Example 12, using data thinning.

Algorithm 4: Evaluating gamma clusters with negative log-likelihood loss
Input: A positive integer K, and X ∈ R^{n×d}_{>0} where X_ij ∼ind. Gamma(λ, θ_{c_i,j}). Here, θ ∈ (0, ∞)^{K*×d}, where θ_{c_i,j} is the true but unknown rate parameter for the c_i-th cluster in the jth dimension, c_i ∈ {1, 2, . . . , K*}, and λ is the known shape parameter. Also, positive scalars ϵ^(train) and ϵ^(test) = 1 − ϵ^(train).
1. Apply data thinning to X to obtain X^(train) and X^(test), where X^(train)_ij ∼ind. Gamma(ϵ^(train)λ, θ_{c_i,j}) and X^(test)_ij ∼ind. Gamma(ϵ^(test)λ, θ_{c_i,j}).
2. Run K-means on X^(train) to estimate K clusters. Denote the cluster assignment of the ith observation as ĉ_i.
3. Within each cluster, estimate the parameters using X^(train) (Ye and Chen, 2017; Louzada et al., 2019). Let λ̂^(K) and θ̂^(K) denote the K × d estimated parameter matrices.
4. Compute the loss on X^(test) as −Σ_{i=1}^n Σ_{j=1}^d log f(X^(test)_ij | λ̂^(K)_{ĉ_i,j} ϵ^(test)/ϵ^(train), θ̂^(K)_{ĉ_i,j}), where f(· | λ, θ) is the density function for the Gamma(λ, θ) distribution.

We apply Algorithms 3 and 4 in three different ways. First, we apply them without modification, with ϵ^(train) = 0.5 and ϵ^(train) = 0.8. Next, we slightly modify these algorithms by replacing step 1 with multifold thinning (Algorithm 2) with M = 5 and ϵ_1 = · · · = ϵ_M = 0.2. For m = 1, . . . , M, we then perform steps 2–4 using X^(train) = X − X^(m) with ϵ^(train) = (M−1)/M, and X^(test) = X^(m) with ϵ^(test) = 1/M. We then average the loss functions obtained across the M applications of step 4. Finally, we consider a naive method that re-uses data: we skip step 1, and simply take X^(train) = X^(test) = X in steps 2–4 and ϵ^(train) = ϵ^(test) = 1 in step 4.

Our goal is to select the value of K that minimizes the loss function. Because data thinning produces independent training and test sets, we expect that the data thinning approaches will produce U-shaped loss curves, as a function of K. By contrast, in the naive approach, the full data X is used both to fit the model and to compute the loss functions in Algorithms 3 and 4, resulting in monotonically decreasing loss curves as a function of K. Other loss functions can be used in lieu of the negative log-likelihood loss in Algorithms 3 and 4. In Appendix E, we extend Algorithms 3 and 4 to the case of mean squared error loss, and show similar results. A code sketch of Algorithm 3 is given below.
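Below is a hedged sketch of Algorithm 3 with single-fold thinning and the negative log-likelihood loss, written in base R; the function name binom_pca_loss is illustrative, and the sketch assumes that ϵ^(train)r and ϵ^(test)r are integers, as required by the binomial thinning row of Table 2.

```r
# Illustrative sketch of Algorithm 3; assumes X contains Binomial(r, p_ij) counts and that
# eps_train * r and eps_test * r are integers.
binom_pca_loss <- function(X, r, K, eps_train = 0.8) {
  eps_test <- 1 - eps_train
  # Step 1: binomial thinning via the hypergeometric row of Table 2, elementwise.
  Xtrain <- matrix(rhyper(length(X), m = eps_train * r, n = eps_test * r, k = as.vector(X)),
                   nrow = nrow(X))
  Xtest <- X - Xtrain
  # Step 2: SVD of the empirical log-odds of the training fold (pseudo-counts avoid logit of 0 or 1).
  logit <- function(p) log(p / (1 - p))
  expit <- function(z) 1 / (1 + exp(-z))
  s <- svd(logit((Xtrain + 0.001) / (eps_train * r + 0.002)))
  # Step 3: rank-K estimate of the probability matrix.
  p_hat <- expit(s$u[, 1:K, drop = FALSE] %*% diag(s$d[1:K], K, K) %*% t(s$v[, 1:K, drop = FALSE]))
  # Step 4: negative log-likelihood of the independent test fold.
  -sum(dbinom(Xtest, size = eps_test * r, prob = p_hat, log = TRUE))
}

# Usage: evaluate K = 1, ..., 15 on a data set simulated as in Example 11 and pick the minimizer.
# losses <- sapply(1:15, function(K) binom_pca_loss(X, r = 100, K = K))
```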
6.3 Results

Figure 4 displays the loss function for all three simulation settings as a function of K; results have been averaged over 2,000 simulated data sets and rescaled to the [0, 1] interval for ease of comparison. The values of K with the lowest average loss function are circled on the plots. As expected, the data thinning approaches in Figure 4 exhibit sharp minimum values, as opposed to the monotonically decreasing curves produced by the naive method. The data thinning approaches correctly select the true value of K = K* in all three settings, except for data thinning with ϵ^(train) = 0.5 in the binomial principal components setting. In that case, the low value of ϵ^(train) allocates too much information to the test set, resulting in inadequate signal from the weakest principal components in the training set. Selecting a larger value of ϵ^(train) remedies this issue, as seen with ϵ^(train) = 0.8.

Figure 4: The negative log-likelihood loss averaged over 2,000 simulated data sets, as a function of K, for the naive method (purple), data thinning with ϵ^(train) = 0.5 (red), data thinning with ϵ^(train) = 0.8 (blue), and multifold thinning with M = 5 folds (green). Each curve has been rescaled to take on values between 0 and 1, for ease of comparison. The minimum loss values for each method are circled, and K* is indicated by the vertical black line.

We further investigate the role of ϵ^(train) by repeating the simulation study using different values of ϵ^(train) for single-fold data thinning. In Figure 5, we plot the proportion of simulations that select the correct value of K (i.e. the proportion of simulations in which the loss function is minimized at K = K*) in each of the three settings, as a function of ϵ^(train). We find that in the gamma clustering simulations, lower values of ϵ^(train) are adequate. However, settings with weaker signal, such as the binomial principal components example, require larger values of ϵ^(train) to identify the true latent structure. In all settings, as ϵ^(train) approaches 1, performance begins to decay. This is a consequence of inadequate information remaining in the test set under large values of ϵ^(train), and is consistent with the discussion of Section 4.1. These findings suggest that in practice, the optimal value of ϵ^(train) is context-dependent.

Figure 5: The proportion of simulations for which data thinning selects the true value of K* with the negative log-likelihood loss, as a function of ϵ^(train), for the simulation study described in Section 6.1. The optimal value of ϵ^(train) depends on the problem at hand.

Finally, we examine the benefits of multifold data thinning over single-fold data thinning. Figure 6 displays histograms of the number of simulations that select each value of K. Here we only include data thinning with ϵ^(train) = 0.8 and multifold thinning with M = 5, so that both methods use the same allocation of information between training and test sets. We see that multifold thinning generally selects the correct value of K more often than single-fold data thinning, mirroring the improvement of M-fold cross-validation using sample splitting over single-fold sample splitting in supervised settings. However, in the large gamma setting, the signal is strong enough that multifold thinning does not provide a benefit over single-fold thinning.

Figure 6: The proportion of simulated data sets in which each candidate value of K is selected, with the negative log-likelihood loss, under data thinning with ϵ^(train) = 0.8 (blue) and multifold thinning with M = 5 (green), for each of the simulation settings described in Section 6.1. The true value of K* is indicated by the vertical black line. Multifold thinning tends to select the true value of K* more often than single-fold thinning.
Selecting the Number of Principal Components in Gene Expression Data

In this section, we revisit an analysis of a data set from a single-cell RNA sequencing experiment conducted on a set of peripheral blood mononuclear cells. The data set is freely available from 10X Genomics, and was previously analyzed in the Guided Clustering Tutorial vignette (Hoffman et al., 2022) for the popular R package Seurat (Hao et al., 2021; Stuart et al., 2019; Satija et al., 2015). The data set $X$ is a sparse matrix of non-negative integers, representing counts from 32,738 genes in each of 2,700 cells. We consider applying principal components analysis to learn a low-dimensional representation of the data. In the Seurat vignette, filtering, normalization, log-transformation, feature selection, centering, and scaling are applied to the data, yielding a transformed matrix $Y \in \mathbb{R}^{2638 \times 2000}$. Details are provided in Section F.1. Finally, the singular value decomposition of $Y$ is computed, such that $Y = UDV^T$. Here we let $U_k$ represent the $k$th column of the matrix $U$, and let $U_{1:K} D_{1:K} V_{1:K}^T$ represent the rank-$K$ approximation of $Y$. Our goal is to select the number of dimensions to use in this low-rank approximation. In the Seurat vignette, the authors rely on heuristic solutions, such as looking for an elbow in the plot of the standard deviation of $U_K D_{KK}$ as a function of $K$; see Figure 7(a) (James et al., 2013). Based on the elbow plot, the authors suggest retaining around 7 principal components. Other heuristic approaches suggest as many as 12 principal components.

Before introducing the data thinning solution, we introduce a squared-error-based formulation that is mathematically equivalent to the traditional elbow plot (see Section F.2), but will facilitate a direct comparison with data thinning. For $K = 1, \ldots, 20$, we compute the sum of squared errors between the matrix $Y$ and its rank-$K$ approximation: $\left\| Y - U_{1:K} D_{1:K} V_{1:K}^T \right\|_F^2$. Because the low-rank approximation $U_{1:K} D_{1:K} V_{1:K}^T$ is computed using $Y$, this loss function monotonically decreases with $K$. A heuristic solution for deciding how many principal components to retain involves looking for the point at which the slope of the curve in Figure 7(b) begins to flatten. While this appears to happen around 5–7 principal components, which is consistent with the finding from Figure 7(a), the exact number of principal components to retain is still unclear.

We now show that data thinning provides a principled approach for estimating the number of principal components. Single-cell RNA-sequencing data are often modeled as independent Poisson random variables (Wang et al., 2018; Sarkar and Stephens, 2021). Thus, we assume that $X_{ij} \sim \mathrm{Poisson}(\Lambda_{ij})$. Starting with the raw data matrix $X \in \mathbb{Z}_{\geq 0}^{2700 \times 32738}$, we perform Poisson data thinning with $\epsilon = 0.5$ to obtain a training set $X^{(1)}$ and a test set $X^{(2)}$, which are independent if the Poisson assumption holds. Furthermore, as $\epsilon = 0.5$, they are identically distributed. We then carry out the data processing described in Section F.1 on $X^{(1)}$ to obtain $Y^{(1)} \in \mathbb{R}^{2638 \times 2000}$. We obtain $Y^{(2)} \in \mathbb{R}^{2638 \times 2000}$ by applying the same data processing steps to $X^{(2)}$, but retaining only the features that were selected on $X^{(1)}$, so that the rows and columns of $Y^{(1)}$ and $Y^{(2)}$ correspond to the same genes and cells. Details are in Section F.1. We compute the singular value decomposition on the training set, $Y^{(1)} = U^{(1)} D^{(1)} (V^{(1)})^T$.
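The Poisson thinning step itself is simple to implement. The sketch below is a minimal illustration under the stated Poisson assumption, using a small dense toy matrix; in the actual analysis, $X$ is the sparse $2{,}700 \times 32{,}738$ count matrix, and each fold is then passed through the preprocessing of Section F.1 before the singular value decomposition is computed.

```r
## Minimal sketch of Poisson data thinning (assumes X_ij ~ Poisson(Lambda_ij)):
## draw X1_ij | X_ij ~ Binomial(X_ij, eps) and set X2 = X - X1. The two folds are
## then independent, with X1_ij ~ Poisson(eps * Lambda_ij) and
## X2_ij ~ Poisson((1 - eps) * Lambda_ij).
poisson_thin <- function(X, eps = 0.5) {
  X1 <- matrix(rbinom(length(X), size = as.vector(X), prob = eps),
               nrow(X), ncol(X), dimnames = dimnames(X))
  list(train = X1, test = X - X1)
}

# Toy usage; in this section, the thinned folds are subsequently preprocessed
# as in Section F.1 before computing the SVD on the training fold.
set.seed(1)
X <- matrix(rpois(20, lambda = 5), 4, 5)
folds <- poisson_thin(X, eps = 0.5)
all(folds$train + folds$test == X)   # TRUE: the folds sum to the original counts
```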
For a range of values of $K$, we then compute the sum of squared errors between $Y^{(2)}$ and $U^{(1)}_{1:K} D^{(1)}_{1:K} (V^{(1)}_{1:K})^T$:

$\left\| Y^{(2)} - U^{(1)}_{1:K} D^{(1)}_{1:K} (V^{(1)}_{1:K})^T \right\|_F^2 . \qquad (1)$

The results are shown in Figure 7(c). As we are not computing and evaluating the singular value decomposition using the same data, the plot of $K$ vs. the loss function is not monotonically decreasing in $K$. Instead, it reaches a clear minimum at $K = 7$, suggesting that the rank-7 approximation provides the best fit to the observed data.

While our analysis thus far assumes a Poisson model for single-cell RNA sequencing data, there is evidence that a negative binomial model may be preferable in some settings (Choudhary and Satija, 2022). Thus, we modified our analysis to use negative binomial thinning instead of Poisson thinning. Details are provided in Appendix F.3, and the results are shown in Figure 7(d). This modified analysis yields similar results: it suggests that we retain $K = 5$ principal components. There are two potential reasons for the discrepancy between $K = 5$ and $K = 7$.

1. Suppose that we perform data thinning under the Poisson assumption, but the data are in fact overdispersed relative to the Poisson distribution. Then there will be positive correlation between $X^{(1)}$ and $X^{(2)}$. This could cause Poisson thinning to overestimate the number of principal components to retain in the low-rank approximation of $X$.

2. The negative binomial thinning procedure injects more noise into $X^{(1)}$ than does the Poisson thinning procedure. This could cause negative binomial thinning to miss out on true signal in the data, and thus to underestimate the number of principal components.

While either of these explanations is plausible, it is comforting to see that the two methods provide similar estimates for $K$. The take-home message is that, for a given distributional assumption, data thinning provides a simple and non-heuristic way to select the number of principal components.

Remark 18 (Choice of $\epsilon$) If we had chosen $\epsilon \neq 0.5$, then while $E[X^{(2)}] = \frac{1-\epsilon}{\epsilon} E[X^{(1)}]$, the relationship between $E[Y^{(1)}]$ and $E[Y^{(2)}]$ would depend on the details of the data processing described in Section F.1, and the loss function in (1) would need to be modified accordingly.

Figure 7: Results for the data analysis in Section 7. (a) An elbow plot of the standard deviation of the principal components, which reproduces the plot given in the Seurat guided clustering tutorial. (b) Due to the relationship between the sum of squared errors and the standard deviation of the principal components (see Section F.2), looking for an elbow in (a) is equivalent to looking for an elbow in (b). (c) We perform Poisson data thinning to obtain a training set (used to compute the singular value decomposition) and a test set (used to compute the sum of squared errors). The sum of squared errors is minimized at 7 principal components. (d) The negative binomial data thinning version of (c), which shows a minimum in the loss function at 5 principal components.

8. Discussion

We have proposed data thinning, a new way to decompose a single observation into two or more independent observations that sum to yield the original. This proposal applies to a very broad class of distributions. Furthermore, we have compared data thinning to sample splitting, and have shown that the former is applicable in many cases in which the latter is not, and may be preferable even when both are applicable.
In Section 2.3, we considered the impact of using the incorrect value of a nuisance parameter when performing data thinning, but we did not consider what happens when the nuisance parameter is estimated using the data itself. In future work, we will consider the theoretical and empirical implications of performing data thinning with an estimated nuisance parameter. Furthermore, we focused on convolution-closed distributions and thus used additive decompositions where X = X(1) + X(2). For distributions with bounded support, such as the beta distribution, non-additive decompositions are needed. We leave such decompositions to future work. An R package implementing data thinning and scripts to reproduce the results in this paper are available at https://anna-neufeld.github.io/datathin/. Acknowledgments This material is based upon research supported in part by the Office of Naval Research (award number N000142312589), the National Science Foundation (award Number 2322920), the Simons Foundation (Simons Investigator Award in Mathematical Modeling of Living Systems), the National Institutes of Health (R01 EB026908, R01 GM123993, and R01 DA047869), and the Keck Data Thinning for Convolution-Closed Distributions Foundation. Lucy Gao and Ameer Dharamshi were supported in part by the Natural Sciences and Engineering Research Council of Canada. Neufeld, Dharamshi, Gao, and Witten Appendix A. Proofs from Section 2 A.1 Proof of Theorem 5 Proof This proof is due to Joe (1996) and Jørgensen and Song (1998), but has been adapted to fit our notation. Let x, our observed data, be a realization of a random variable X Fλ. Let ϵ (0, 1) be chosen such that ϵλ and (1 ϵ)λ are in the parameter space Λ. We draw X(1) | X = x Gϵλ,(1 ϵ)λ,x, where this notation was defined in Section 2.1, and let X(2) = X X(1). Separately, let X Fϵλ and X F(1 ϵ)λ be independent, and let Y = X + X . As Fλ is a convolution-closed distribution, it follows that Y Fλ and thus we know that Y has the same marginal distribution as X. We first argue that the joint distribution of (X(1), X) is the same as the joint distribution of (X , Y ). The conditional distribution of X(1) | X is the same as the conditional distribution of X | Y by definition of the distribution Gϵλ,(1 ϵ)λ,x. Furthermore, we have already noted that the marginal distributions of X and Y are the same. Thus, the joint distribution of (X(1), X) is the same as the joint distribution of (X , Y ). As X(2) is deterministic given X(1) and X, the joint distribution of (X(1), X(2)) is the same as the joint distribution of (X(1), X). Similarly, the joint distribution of (X , Y ) is the same as the joint distribution of (X , X ). Thus, the joint distribution of (X(1), X(2)) is the same as the joint distribution of (X , X ). As the joint distribution of X and X is known to be the product of independent distributions Fϵλ and F(1 ϵ)λ, this completes the proof of parts (i) and (ii) of Theorem 5. The final statement of Theorem 5 follows directly from Definition 2. A.2 Proof of Proposition 10 Proof To prove (i), note that since X(1) | X = x is normally distributed and X is normally distributed, a well-known property of the normal distribution tells us that the marginal distribution of X(1) is normal. We then use the law of total expectation and the law of total variance to compute its mean and variance. E[X(1)] = E[E[X(1) | X]] = E[ϵX] = ϵµ Var(X(1)) = Var E h X(1) | X i + E h Var X(1) | X i = Var (ϵX) + E ϵ(1 ϵ) σ2 = ϵ2σ2 + ϵ(1 ϵ) σ2. 
To prove (ii), note that the difference between two normally distributed variables (X and X(1)) is normal. Then note that E[X(2)] = E[X] E[X(1)] = µ ϵµ = (1 ϵ)µ. Var(X(2)) = Var E h X(2) | X i + E h Var X(2) | X i = Var E h X X(1) | X i + E h Var X X(1) | X i = Var ((1 ϵ)X) + E h Var X(1) | X i = (1 ϵ)2σ2 + ϵ(1 ϵ) σ2, Data Thinning for Convolution-Closed Distributions which completes the proof of (ii). Finally, to prove (iii), note that 2 Cov(X(1), X(2)) = Var(X) Var(X(1)) Var(X(2)) = σ2 ϵ2σ2 ϵ(1 ϵ) σ2 (1 ϵ)2σ2 ϵ(1 ϵ) σ2 = 2ϵ(1 ϵ) σ2 σ2 . A.3 Proof of Proposition 11 Proof Recall that if A Beta Binomial (r, α, β), then E[A] = rα α+β and Var(A) = rαβ(α+β+r) (α+β)2(α+β+1). Then we can derive the marginal variance of X(1) using the law of total variance and the fact that X(1) | X Beta Binomial(X, ϵ r, (1 ϵ) r). Var(X(1)) = E[Var(X(1) | X)] + Var(E[X(1) | X]) = E Xϵ(1 ϵ)( r + X) r + 1 r E [X] + E X2 + ϵ2 Var (X) r + 1 r E [X] + Var(X) + E[X]2 + ϵ2 Var (X) . Next note that Var(X(2) | X) = Var(X X(1) | X) = Var(X(1) | X) and that E[X(2) | X] = (1 ϵ)X. Thus, we arrive at: Var(X(2)) = ϵ(1 ϵ) r + 1 r E [X] + Var(X) + E[X]2 + (1 ϵ)2 Var (X) . To derive the covariance, note that: 2 Cov(X(1), X(2)) = Var(X) Var(X(1)) Var(X(2)) = Var(X) 2ϵ(1 ϵ) r + 1 r E [X] + Var(X) + E[X]2 ϵ2 + (1 ϵ)2 Var (X) p2 + r2 (1 p)2 + 2ϵ(1 ϵ)r1 p = 2ϵ(1 ϵ)r(1 p)2 Neufeld, Dharamshi, Gao, and Witten A.4 Proof of Proposition 12 Proof The proof structure is identical to those of Proposition 10 and Proposition 11. We start by recalling that if A Gamma(α, β), then E[X] = α β and Var(X) = α Var(X(1)) = E h Var X(1) | X i + Var E h X(1) | X i = E [Var (XZ | X)] + Var (E [XZ | X]) = E X2 Var (Z) + Var (X E [Z]) = E X2 ϵ(1 ϵ) α + 1 Var(X) + E[X]2 + ϵ2 Var (X) . Similarly, we note that Var(X(2) | X) = Var(X X(1) | X) = Var(X(1) | X), while E(X(2) | X) = (1 ϵ)X. This allows us to do a similar derivation and arrive at: Var(X(2)) = E h Var X(2) | X i + Var E h X(2) | X i α + 1 Var(X) + E[X]2 + (1 ϵ)2 Var (X) . 2 Cov(X(1), X(2)) = Var(X) Var(X(1)) Var(X(2)) = Var(X) 2ϵ(1 ϵ) α + 1 Var(X) + E[X]2 ϵ2 + 1 2ϵ + ϵ2 Var(X) α + 1 Var(X) + E[X]2 + 2ϵ(1 ϵ) Var(X) + 2ϵ(1 ϵ) α = 2ϵ(1 ϵ) α Appendix B. Proof of Theorem 13 Proof The proof is nearly identical to that of Theorem 5. It extends ideas from Jørgensen and Song (1998) and Joe (1996) to the setting of multiple folds. Let x, our observed data, be a realization of random variable X Fλ. Let ϵ1, . . . , ϵM be chosen such that PM m=1 ϵm = 1, ϵm > 0, and ϵmλ is in the parameter space Λ for m = 1, . . . , M. Suppose we draw X(1), . . . , X(M) | X = x Gϵ1λ,ϵ2λ,...,ϵMλ,x, where Gϵ1λ,ϵ2λ,...,ϵMλ,x was defined in Section 3. Separately, let X1, X2, . . . , XM be mutually independent random variables, where Xm Fϵmλ, and let Y = PM m=1 Xm. As Fλ is a convolution-closed distribution, we know that Y Fλ and thus Y has the same marginal distribution as X. Data Thinning for Convolution-Closed Distributions The conditional joint distribution of (X(1), X(2), . . . , X(M)) | X is the same as the conditional joint distribution of (X1, X2, . . . , XM) | Y , by definition of the distribution Gϵ1λ,...,ϵMλ,x. Furthermore, we have already seen that the marginal distributions of X and Y are the same. Thus, the marginal joint distribution of (X(1), X(2), . . . , X(M)) is the same as the marginal joint distribution of (X1, X2, . . . , XM). Furthermore, by construction, the marginal joint distribution of (X1, X2, . . . , XM) is that of M mutually independent random variables, where Xm Fϵmλ. 
This concludes the proof of Theorem 13, parts 1-3. Part 4 of Theorem 13 follows directly from Definition 2. Appendix C. Proof of Theorem 14 To prove Theorem 14, we rely on Lemma 19 and Lemma 20. Lemma 19 Suppose that we thin a random variable X Fλ using Algorithm 2 with ϵ1 = . . . = ϵM = 1 M to obtain X(1), . . . , X(M). Let IX(θ) denote the Fisher information contained in X about an unknown parameter θ (assume that this Fisher information exists). Then the Fisher information contained in X(m) for m = 1, . . . , M about θ, denoted IX(m)(θ), is equal to 1 M IX(θ). Proof We began with a random variable X and we constructed X(1), . . . , X(m) without knowledge of θ. We cannot create information from nothing. Thus, if I(X(1),...,X(m))(θ) denotes the information about θ in the joint distribution of X(1), . . . , X(m) , then I(X(1),...,X(m))(θ) IX(θ). (2) To prove (2) formally, we can use the chain rule property of Fisher information to write I(X(1),...,X(m),X)(θ) in two ways: I(X(1),...,X(m),X)(θ) = I(X(1),...,X(m))(θ) + IX|(X(1),...,X(m))(θ) = I(X(1),...,X(m))|X(θ) + IX(θ). Since θ is unknown during the thinning process, the distribution X(1), . . . , X(m) | X must not involve θ, so I(X(1),...,X(m))|X(θ) = 0. This implies that I(X(1),...,X(m))(θ) + IX|(X(1),...,X(m))(θ) = IX(θ), and since Fisher information is non-negative, (2) follows. Separately, since X = X(1) + . . . + X(m), we can reconstruct X from X(1), . . . , X(m) , and so a fundamental property of Fisher information tells us that IX(θ) = IPM m=1 X(m)(θ) I(X(1),...,X(m))(θ). (3) Combining (2) and (3), we have that IX(θ) = I(X(1),...,X(m))(θ). Finally, as X(1), . . . , X(M) are independent and identically distributed, IX(θ) = I(X(1),...,X(M))(θ) = m=1 IX(m)(θ) = MIX(1)(θ). Neufeld, Dharamshi, Gao, and Witten It follows immediately that IX(1)(θ) = 1 M IX(θ). Similarly, IX(m)(θ) = 1 M IX(θ) for m = 2, . . . , M. Lemma 20 In the setting of Lemma 19, the Fisher information contained in PK m=1 X(m) for K < M, denoted IPK m=1 X(m)(θ), is equal to K Proof The convolution-closed property of Fλ says that PK m=1 X(m) F K M λ. As PK m=1 X(m) is a function of X(1), . . . , X(K), we have that I(X(1),...,X(K))(θ) IPK m=1 X(m)(θ). We can construct a random variable Y F K M λ with the same distribution as PK m=1 X(m) by applying Algorithm 2 to X with ϵ1 = K/M and ϵ2 = 1 K/M and calling the first fold of data Y . We then can thin Y with ϵ1 = . . . = ϵK = 1 K to obtain Y (1), . . . , Y (K), whose joint and marginal distributions are equal to X(1), . . . , X(K) (independent random variables that follow F 1 M λ). This allows us to rewrite the inequality above as I(Y (1),...,Y (K))(θ) = I(X(1),...,X(K))(θ) IPK m=1 X(m)(θ) = IY (θ). (4) By the same logic that was given in the proof of Lemma 19, we know that we can produce I(Y (1),...,Y (K)) from Y without knowing θ. Thus, Y (1), . . . , Y (K) cannot contain more information about θ than Y . Thus, we have that I(Y (1),...,Y (K))(θ) = I(X(1),...,X(K))(θ) IPK m=1 X(m)(θ) = IY (θ), (5) where the last equality holds since two random variables with the same distribution contain the same amount of information about θ. Combining the inequalities in (4) and (5) yields IPK m=1 X(m)(θ) = I(X(1),...,X(K))(θ). Finally, we note that IPK m=1 X(m)(θ) = I(X(1),...,X(K))(θ) = m=1 IX(m)(θ) = K where the second equality follows from independence, and the third from Lemma 19. This completes the proof. We now proceed with the proof of Theorem 14. Proof For simplicity, assume that ϵm is a rational number for m = 1, . . . , M. 
Then, we can rewrite ϵ1, . . . , ϵM as K1 M , K2 M , . . . , KM M for integers M , K1, . . . , KM. Further assume that 1 M λ is in the parameter space Λ for our model. Then, for m = 1, . . . , M, X(m) has the same distribution as PKm j=1 X(j), where X(1), . . . , X(M ) are random variables that we would obtain if we thinned X into M equally sized folds. By Lemma 20, IX(m)(θ) = IPKm j=1 X(j)(θ) = Km M IX(θ) = ϵm IX(θ). Data Thinning for Convolution-Closed Distributions Remark 21 The proof of Theorem 14 assumes that (i) ϵm is rational for m = 1, . . . , M, and that (ii) 1 M λ Λ for the common denominator M . Among the distributions in Table 2, only the Binomial(r, p) and the Multinomial(r, p) have relevant restrictions on the parameter space Λ. To be able to thin one of these distributions with parameter ϵm, we must have that ϵmr is a positive integer. Thus, ϵm = Zm r for some integer Zm, i.e. (i) is satisfied. Adopting the notation of the proof of Theorem 14, we note that Km = Zm and that M = r = λ. Thus, the requirement that λ M is in the parameter space (i.e. that it is a positive integer) follows immediately, since λ M = 1. Appendix D. Simulation Study Supporting Details In this section, we provide additional details about the simulation studies described in Section 6.1. For Example 11, in which we select the number of principal components for binomial data, we use the following setup. For K = 10, we compute θ = UDV T where U is a n K random orthogonal matrix, D is a K K diagonal matrix with diagonal elements equal to 5, 6, . . . , 14, and V is a d K random orthogonal matrix. Then, pij = exp (θij) 1+exp (θij) for i = 1, . . . , n and j = 1, . . . , d. For Example 12, in which we select the number of clusters in gamma-distributed data, we use the following setup. In the small d, small K clustering setting described in Example 12, observations from each cluster are generated as Xij ind Gamma(λ, θci,j) where λ = 20, 0.5 5 5 0.5 10 10 0.5 0.5 and ci {1, 2, 3, 4} is the true cluster membership for the ith observation. In the large d, large K clustering setting described in Example 12, observations from each cluster are generated as Xij ind Gamma(λ, θci,j) where λ = 2, the K d matrix θ is constructed such that for j = 1, . . . , d and k = 1, . . . , K , ( 0.1 if k 9 and 10k 9 j 10k + 10, 1 otherwise, and ci {1, 2, . . . , 10} is the true cluster membership for the ith observation. Appendix E. Simulation with Mean Squared Error Loss Function E.1 Methods As an alternative to the negative log-likelihood loss used in Section 6, here we consider applying a mean squared error loss. To do this, we simply replace the negative log likelihood from Step 4 of Algorithms 3 and 4 with the mean squared error, defined as X(test) ij ϵ(test)rp(K) ij 2 (6) in the case of Algorithm 3 and X(test) ij ϵ(test) ϵ(train) ˆµ(K) ˆci,j Neufeld, Dharamshi, Gao, and Witten Figure 8: The mean squared error loss averaged over 2,000 simulated data sets, as a function of K, for the naive method (purple), data thinning with ϵ(train) = 0.5 (red), data thinning with ϵ(train) = 0.8 (blue), and multifold thinning with 5 folds (green). Each curve has been rescaled to take on values between 0 and 1, for ease of comparison. The minimum loss values for each method are circled, and K is indicated by the vertical black line. in the case of Algorithm 4. 
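As a concrete illustration, the following sketch writes out the two mean squared error losses that replace step 4 of Algorithms 3 and 4. The arguments p_hat (the rank-$K$ fitted probability matrix from Algorithm 3, step 3) and mu_hat_per_obs (an $n \times d$ matrix whose $i$th row is the fitted training-fold cluster mean for observation $i$, assembled from the estimates in Algorithm 4, step 3) are placeholders for quantities computed earlier in those algorithms, not new notation from the paper.

```r
## Sketches of the mean squared error losses replacing step 4 of Algorithms 3 and 4.
## `p_hat` (n x d fitted rank-K probabilities) and `mu_hat_per_obs` (n x d, each row
## the fitted training-fold cluster mean for that observation) are placeholders.
mse_binomial <- function(X_test, p_hat, eps_test, r) {
  sum((X_test - eps_test * r * p_hat)^2)                    # the loss in (6), for Algorithm 3
}
mse_gamma <- function(X_test, mu_hat_per_obs, eps_train, eps_test) {
  sum((X_test - (eps_test / eps_train) * mu_hat_per_obs)^2) # the analogue for Algorithm 4
}
```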
After replacing the loss functions, these algorithms can be applied directly to obtain simulation results for data thinning, and with slight modification to obtain results for multi-fold data thinning and the naive method, as described in Section 4.2. E.2 Results In Figure 8, we plot the average mean squared error curves, as a function of K. As with the negative log-likelihood loss, data thinning approaches produce curves with sharp minimum values at or near K = K , as opposed to the naive method s monotonically-decreasing curves. In Figure 9, we plot the proportion of simulations that select the correct value of K using the mean squared error loss, as a function of ϵ(train). Results are largely similar to Figure 5. Finally, we compare multi-fold to single-fold thinning, under the mean squared error loss, in Figure 10. As in Figure 6, we find that multi-fold thinning tends to select the correct value of K more often than single-fold thinning. Appendix F. Details for the Real Data Analysis in Section 7 F.1 Preprocessing We first explain in detail the preprocessing done to the matrix X in the Seurat tutorial. (1) Initial data filtering: We initially filter the data such that only cells with between 200 and 2500 total counts remain (with fewer than 5% of the counts coming from from mitochondrial genes) and only genes that are expressed in at least 200 cells remain. This reduces the size of X from 2, 700 32, 738 to 2, 638 13, 714 Data Thinning for Convolution-Closed Distributions Figure 9: The proportion of simulations for which data thinning selects the true value of K with the mean squared error loss, as a function of ϵ(train), for the simulation study described in Section 6.1. The optimal value of ϵ(train) depends on the problem at hand. Figure 10: The proportion of simulated data sets in which each candidate value of K is selected, with the mean squared error loss, under data thinning with ϵ(train) = 0.8 (blue) and multifold thinning with M = 5 (green), for each of the simulation settings described in Section 6.1. The true value of K is indicated by the vertical black line. Multifold thinning tends to select the true value of K more often than single-fold thinning. Neufeld, Dharamshi, Gao, and Witten (2) Log normalization: Next, the data are normalized and log transformed, such that Xij P13714 t=1 Xit 10, 000 + 1 (3) Feature selection: Following this transformation, the top 2000 highly variable genes are selected using the function Find Variable Features from the Seurat package. The goal of the function is to find a subset of features with high cell-to-cell variation after accounting for the inherent mean-variance relationship, as these are most likely to be interesting in downstream analysis, and it implements methodology from Stuart et al. (2019). (4) Centering and scaling: Finally, the columns of the subsetted matrix Y R2638 2000 are centered and scaled to obtain the matrix Y . After these preprocessing steps, the principal components of Y are computed. We now explain the preprocessing for X(1) and X(2) that we use for our Poisson data thinning alternative to the Seurat tutorial. We follow the same four steps as above, but we are careful to specify what we do on the training set X(1) as opposed to the test set X(2). (1) Initial data filtering: We perform the initial data filtering from Step (1) above on X(1). We then subset X(2) to include the same genes and cells as those in X(1). After this step, X(1) and X(2) are both in Z2638 13258 0 . 
(2) Log normalization: We normalize and log-transform both X(1) and X(2), such that: Y (1) ij = log X(1) ij P13258 t=1 X(1) it 10, 000 + 1 , Y (2) ij = log X(2) ij P13258 t=1 X(2) it 10, 000 + 1 We note that these random variables are still independent and identically distributed under our Poisson assumption. (3) Feature selection: We then apply the Seurat function Find Variable Features to the matrix Y (1) to select the top 2000 highly variable genes (Stuart et al., 2019). We subset both Y (1) and Y (2) to contain only these genes, such that Y (1), Y (2) R2638 2000. (4) Centering and scaling: We center and scale the columns of the subsetted Y (1) to obtain Y (1). We also center and scale the columns of the subsetted Y (2) to obtain Y (2). After these preprocessing steps, the principal components of Y (1) are computed, and the loss function is computed using Y (2). The negative binomial data thinning analysis, presented in Figure 7(d), follows exactly the same steps as above, but the Poisson data thinning step is replaced with negative binomial data thinning. Details are given in Appendix F.3. F.2 Equivalence Between Standard Deviation and Sum of Squared Errors We now explain the identity that makes Figure 7(a) and Figure 7(b) mathematically equivalent. In Section 7, we defined SSEK( Y ) = Y U1:KD1:KV T 1:K 2 Data Thinning for Convolution-Closed Distributions We see that: Y U1:KD1:KV T 1:K 2 F = Y 2 2 2trace Y T U1:KD1:KV T 1:K + trace V1:KDT 1:KUT 1:KU1:KD1:KV T 1:K = Y 2 F trace DT 1:KD1:K = Y 2 Thus, if we compute SSEK( Y ) for K = 1, . . . , 20, since Y 2 F is fixed, we can easily obtain the values of PK j=1 D2 jj for K = 1, . . . , 20. By taking the differences between these values for K and K + 1, we obtain DKK for K = 1, . . . , 20. Finally, we note that the standard deviation of the Kth principal component (UKDKK) can be written as (DKKUK)T (DKKUK) = q Since the standard deviation of the Kth principal component (plotted in Figure 7(a)) can be obtained directly from the sums of squared error plotted in Figure 7(b), we say that the two plots are mathematically equivalent. F.3 Negative Binomial Thinning For the negative binomial analysis in Section 7, we assume that Xij Neg Bin(µij, bj), where here we adopt the parameterization such that E[Xij] = µij and Var(Xij) = µij + µ2 ij bj . We refer to bj as the overdispersion parameter, and we note that it is equivalent to the size parameter r in the parameterization of the negative binomial distribution used in the rest of this paper (e.g. Table 2 and Proposition 11). As in Choudhary and Satija (2022), we assume that the overdispersion parameters bj are gene-specific, but not cell specific. In order to use negative binomial thinning (Table 2), rather than Poisson thinning, to obtain X(1) and X(2), we need to estimate the gene-specific overdispersion parameters bj. We do this following the methodology of Hafemeister and Satija (2019), implemented in the R package sctransform, which assumes a smooth relationship between the average expression for each gene and the overdispersion parameters for each gene. As these estimates are computed using the data X, the results in Proposition 11 do not apply exactly. However, the results in Proposition 11 heuristically tell us that negative binomial thinning should perform reasonably well if these estimated overdispersion parameters are close to the true overdispersion parameters. We refer interested readers to Neufeld et al. 
(2023) for more information on negative binomial data thinning in this context, as well as simulations showing that negative binomial data thinning performs well with estimated overdispersion parameters. We leave the matter of theoretical guarantees on the performance of data thinning with estimated nuisance parameters to future work. Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, and Linda Zhao. Models as approximations ii: A model-free theory of parametric regression. Statistical Science, 34(4):545 565, 2019. Fan Chen, Sebastien Roch, Karl Rohe, and Shuqi Yu. Estimating graph dimension with crossvalidated eigenvalues. ar Xiv preprint ar Xiv:2108.03336, 2021. Neufeld, Dharamshi, Gao, and Witten Saket Choudhary and Rahul Satija. Comparison and evaluation of statistical error models for sc RNA-seq. Genome Biology, 23(1):1 20, 2022. Rick Durrett. Probability: theory and examples, volume 49. Cambridge University Press, 2019. Wei Fu and Patrick O Perry. Estimating the number of clusters using cross-validation. Journal of Computational and Graphical Statistics, 29(1):162 173, 2020. Lucy L Gao, Jacob Bien, and Daniela Witten. Selective inference for hierarchical clustering. Journal of the American Statistical Association, pages 1 11, 2022. David Gerard. Data-based RNA-Seq simulations by binomial thinning. BMC Bioinformatics, 21 (1):1 14, 2020. Christoph Hafemeister and Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome biology, 20(1):296, 2019. Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M Mauck, Shiwei Zheng, Andrew Butler, Maddie J Lee, Aaron J Wilk, Charlotte Darby, Michael Zager, et al. Integrated analysis of multimodal single-cell data. Cell, 184(13):3573 3587, 2021. Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009. P. Hoffman et al. Seurat guided clustering tutorial. https://satijalab.org/seurat/ articles/pbmc3k tutorial.html, 2022. Accessed: 09-12-2022. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013. Harry Joe. Time series models with univariate margins in the convolution-closed infinitely divisible class. Journal of Applied Probability, 33(3):664 677, 1996. Bent Jørgensen. Exponential dispersion models and extensions: A review. International Statistical Review/Revue Internationale de Statistique, pages 5 20, 1992. Bent Jørgensen and Peter Xue-Kun Song. Stationary time series models with exponential dispersion model margins. Journal of Applied Probability, 35(1):78 92, 1998. James Leiner, Boyan Duan, Larry Wasserman, and Aaditya Ramdas. Data fission: splitting a single data point. Journal of the American Statistical Association, (just-accepted):1 22, 2023. Francisco Louzada, Pedro L. Ramos, and Eduardo Ramos. A note on bias of closed-form estimators for the gamma distribution derived from likelihood equations. The American Statistician, 73(2): 195 199, 2019. Anna Neufeld, Lucy L Gao, Joshua Popp, Alexis Battle, and Daniela Witten. Inference after latent variable estimation for single-cell RNA sequencing data. Biostatistics, page kxac047, 2022. ISSN 1465-4644. Anna Neufeld, Joshua Popp, Lucy L Gao, Alexis Battle, and Daniela Witten. Negative binomial count splitting for single-cell RNA sequencing data. 
ar Xiv preprint ar Xiv:2307.12985, 2023. Data Thinning for Convolution-Closed Distributions Natalia L Oliveira, Jing Lei, and Ryan J Tibshirani. Unbiased risk estimation in the normal means problem via coupled bootstrap techniques. ar Xiv preprint ar Xiv:2111.09447, 2021. Natalia L Oliveira, Jing Lei, and Ryan J Tibshirani. Coupled bootstrap test error estimation for Poisson variables. ar Xiv preprint ar Xiv:2212.01943, 2022. Art B Owen and Patrick O Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization. The Annals of Applied Statistics, 3(2):564 594, 2009. Daniel G Rasines and G Alastair Young. Splitting strategies for post-selection inference. Biometrika, 2022. ISSN 1464-3510. Alessandro Rinaldo, Larry Wasserman, and Max G Sell. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. The Annals of Statistics, 47(6):3438 3469, 2019. Abhishek Sarkar and Matthew Stephens. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nature Genetics, 53(6):770 777, 2021. Rahul Satija, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33:495 502, 2015. Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, et al. Comprehensive integration of single-cell data. Cell, 177:1888 1902, 2019. Xiaoying Tian. Prediction error after model search. The Annals of Statistics, 48(2):763 784, 2020. Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679 710, 2018. Jingshu Wang, Mo Huang, Eduardo Torre, Hannah Dueck, Sydney Shaffer, John Murray, et al. Gene expression distribution deconvolution in single-cell RNA sequencing. Proceedings of the National Academy of Sciences, 115(28):E6437 E6446, 2018. Zhi-Sheng Ye and Nan Chen. Closed-form estimators for the gamma distribution derived from likelihood equations. The American Statistician, 71(2):177 181, 2017.