# learning_exponential_families_from_truncated_samples__b28e7362.pdf

Learning Exponential Families from Truncated Samples

Jane H. Lee Department of Computer Science Yale University jane.h.lee@yale.edu

Andre Wibisono Department of Computer Science Yale University andre.wibisono@yale.edu

Manolis Zampetakis Department of Computer Science Yale University emmanouil.zampetakis@yale.edu

Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are truncated, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution [10] and for linear regression with Gaussian noise [11, 14, 37]. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that extrapolation is possible for a much larger class of distributions while it maintains a polynomial sample and time complexity on average. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler than the recent algorithms of [10, 26, 11, 14, 37]. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data.

1 Introduction

In many statistical estimation and inference problems, we have access to only a limited part of the data that would be necessary for the classical statistical methods to work, which motivates the development of statistical methods that are resilient to missing data [29]. Truncation [32, 8] is a fundamental and frequent type of missing data and arises when samples that lie outside a subset of the support are not observed and their count is also not observed. Statistical estimation from truncated samples is the focus of the field of truncated statistics, which was developed since the beginning of the twentieth century starting with the work of Galton [19], Pearson and Lee [34, 35], and Fisher [16]. Truncated statistics is widely applicable in Econometrics and many other theoretical and applied fields [32].

A recent line of work establishes the first sample optimal and computationally efficient methods for fundamental statistical estimation problems from truncated samples [10, 26, 11, 13, 14, 24, 37]. All the aforementioned works though heavily rely on the Gaussianity of the distribution of data or the Gaussianity of the noise in regression problems. Gaussianity is an idealized assumption and the question of generalizing truncated statistics beyond Gaussianity has been explored in many existing works, e.g., [1, 22, 38]. The only results in this regime though are for single dimensional problems and truncations that can be described as intervals.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

Figure 1: Visualizing the density and truncation set for 2-dimensional exponential distribution. The plot on the left is the density of p(x1, x2) exp( x1

2 ). On the right is a contour plot of the truncation set [0, 2] [0, 2] under p(x1, x2). Note that E[X1] = 5 and E[X2] = 2, so the truncation set excludes the mean in one direction and includes it only as a boundary point in the other.

In this work we provide statistically and computationally efficient methods for estimating the parameters of exponential families from truncated samples. Our results generalize the recent work of [10] and are the first to provide an estimation algorithm for this problem for a general class of exponential families and for a general class of truncation biases.

Exponential families are one of the most influential type of distribution classes since they include many fundamental distributions such as normal, exponential, beta, gamma, chi-squared, and Weibull distributions. They were first introduced by Fisher [17] and later generalized by [9, 27, 36]. The estimation of the parameters of exponential families over continuous domains is the subject of many classical and recent results; starting from the work of Fisher [17] until the recent results of [25, 41]. This line of work has also found applications in many areas of statistics including causal inference [40]. Our work contributes in this line of research as well since we show how to estimate exponential families even when we only have access to truncated samples.

1.1 Our Results

For every distribution p, we define the truncated version p S as the distribution with density that satisfies p S(x) 1{x S} p(x). We consider distributions pθ with support X Rm parameterized by a vector of parameters θ Rk that can be written in the following form:

pθ(x) h(x) exp(θ T(x))

where h and T are known functions and T is called the sufficient statistics of the exponential family. Our goal is to estimate the vector of parameters θ up to ϵ error in ℓ2 distance using only samples from p S θ when we have oracle access to S, i.e., there is an oracle that for every x X can answer whether x S or not.

Without any further assumptions estimation from truncated samples is impossible as shown in [10]. In this paper we use the following assumptions to get our estimation guarantees:

Assumption I: We assume a lower bound and an upper bound on the variance of the sufficient statistics in any direction. This assumption is used in estimation of the parameters of exponential families even without truncated samples, e.g., Assumption 4.1 in [41]. Assumption II: We assume that the exponential families contain log-concave distributions. We use this assumption in two places: (1) to show our extrapolation result, our distributions must satisfy an anti-concentration property; this is a property that is heavily utilized even in estimating a Gaussian distribution [10], (2) we require access to a sampling oracle for the underlying distributions; this sampling oracle can be implemented efficiently using Langevin dynamics if we have log-concavity. Assumption III: We assume that the sufficient statistics T are bounded degree polynomials. This assumption combined with log-concavity provides the anti-concentration property that we

need. Alternatively, we can assume that the sufficient statistics belongs to a function class that satisfies anti-concentration but for simplicity of exposition we focus on the case where T(x) is a polynomial.

Our main result is the following. Informal Theorem 1.1 (See Theorem 3.1 for the formal statement). Under Assumptions I, II, III, and given samples n (at least e O(k/ϵ2) for some ϵ) from p S θ , where S is a measurable set to which we have oracle access, there exists an estimation algorithm with (expected) running time poly(m, k, n) that outputs bθ such that with probability at least 99%, it holds that bθ θ < ϵ.

Our main result has the following important implications:

We show that our assumptions are satisfied from exponential distributions, Weibull distributions, continuous Bernoulli, continuous Poisson, Gaussian distributions, and generalized linear models. Hence, our result implies an efficient method for estimation from truncated samples for all these distribution classes. Another interesting corollary of our result is that we can combine it with the ideas of [12] and get a general method for learning log-concave distributions from truncated samples. In particular, assume that we want to learn from truncated samples a distribution that can be written in the form p(x) exp( f(x)) where f is a convex function. Now under mild assumptions, we can replace f(x) with a finite Taylor approximation, i.e., we have that f(x) P

i aiti(x) for some polynomials ti(x). Then, using our method we can estimate the parameters ai and output an estimation of p. In the context of sampling, the gradient of the log-likelihood is the score function that is needed to run Langevin dynamics (e.g., see [30, 7]). Our result also says that we are able to sample from the original distribution, given that we only observe truncated samples.

Technical Contributions. Our algorithm is projected stochastic gradient descent (PSGD) on the negative log-likelihood function and is the same algorithm that is used in many of the recent works in truncated statistics. As in the previous work though, the main challenge is to show that PSGD will converge in our general setting. The previous analysis of the convergence of PSGD was based on exact properties of the Gaussian distribution. When we move away from Gaussianity every step of the convergence analysis becomes much more technical and we need to make sure to only use properties that generalize to exponential families beyond Gaussian distributions. In particular, certain results such as that of Lemma 3.6 and its related quantities in C.1 can be generalized even beyond exponential families, holding for any density.

1.2 Related Work

Our most related literature is the recent series of works on truncated statistics which includes the following results: estimation of multivariate normal distributions [10], linear regression with Gaussian noise [10, 26, 11, 13, 14, 37], estimation of product distributions over the hypercube [18], non-parametric density estimation [12]. All of these works heavily rely on properties of the Gaussian distributions, or product distributions over the hypercube, or their dependence in the number of dimensions is not efficient, e.g., [12]. In our work we identify the properties of exponential families that are only required to get the efficient estimation results and we show that linear dependence on the dimension is achievable in settings that are more general than the Gaussian case.

Another related work is that of [30] that solves parameter estimation of a truncated density given samples through the score matching technique. To derive a tractable objective, we need appropriate boundary conditions which are not satisfied by truncated densities, but [30] instead uses a modified weighted Fisher distance given that the truncation set S is a Lipschitz domain (a type of open and connected set). On the other hand, our work assumes no particular structure about S and hence our results are more general and applicable in much more complicated settings for exponential families.

2 Preliminaries

Notation. Lowercase bold letters will denote real-valued vectors, e.g., x Rm, and uppercase bold letters will denote matrices with real values, e.g., A Rn m. For a random vector x ρ,

Cov[x] = Cov[x, x] = E[(x E[x])(x E[x]) ] is its covariance matrix, and Var(x) is the trace of the covariance matrix (a scalar value). Depending on whether it is clear from context, Cov and Var may include subscripts to indicate the distribution ρ. The notation B(c, R) is the Euclidean ball centered at c with radius R > 0.

Exponential Families. Let x X Rm. We are interested in a class of densities which have the form, pθ(x) = h(x) exp(θ T(x) A(θ)), where h : Rm 7 R+ is the base or carrier measure, θ Θ with Θ = {θ Rk : A(θ) < } is the natural parameter space, T : Rm 7 Rk is the sufficient statistic for θ, and A(θ) = log Z(θ) is the log-partition function, where Z(θ) = R pθ(x)dx.

A regular exponential family is one where Θ is an open set. It is minimal if the θ and T(x) are each linearly independent. Any non-minimal family can be made minimal by appropriate reparametrization. In any regular exponential family, A(θ) is convex. It is strictly convex if the representation is minimal. Exponential families have several nice properties (e.g., see Theorem 1 of [4]), among which are that A(θ) = Epθ[T(x)] and 2A(θ) = Covpθ[T(x)].

Truncated Distributions. Let ρ be a probability distribution on Rm. We represent ρ as a probability density function with respect to the Lebesgue measure dx on Rm. Let S Rm be such that ρ(S) = α for some α (0, 1]. Let ρS := ρ( | S) be the conditional distribution of x ρ given that x S. Concretely, the density of ρS is

ρS(x) = ρ(x) 1{x S}

For exponential families, we have the truncated density p S θ (x) is:

p S θ(x) = pθ(x) R

S pθ(x)dx1{x S} = h(x) exp(θ T(x)) R

S h(x) exp(θ T(x))dx 1{x S}.

See Figure 1 for an illustration.

Sub-Exponential Distributions. Although the term sub-exponential has been overloaded (e.g., [21] v.s.[44]), the definition we will use describes a class of distributions whose tails decay at least as fast as an exponential, but with potentially heavier tails than Gaussians [44].

There are several equivalent characterizations of sub-exponential random variables (e.g., see Prop. 2.7.1 of [44]), one of which uses the moment generating function. Definition (Sub-exponential random variable). A centered, real-valued random variable X SE(ν2, β) is sub-exponential with parameters ν2, β > 0 if

E[eλX] e ν2λ2

2 , λ : |λ| < 1/β.

Membership Oracle of a Set. Let S Rm. A membership oracle is an efficient procedure which computes 1{x S}.

3 Projected Stochastic Gradient Descent Algorithm

Problem Setup. We are given truncated samples {xi}n i=1, with each xi p S θ , where pθ (S) = α > 0. Without knowledge of the truncation set S beyond access to a membership oracle, can one recover θ and thus pθ efficiently?

We answer this question positively, under the following assumptions: Assumption A1 (Strong Convexity, Smoothness of Non-truncated Negative Log-Likelihood over Θ).

λI Covz pθ[T(z), T(z)] LI θ Θ, for some λ, L > 0. Here, we ve abused notation for Θ which can be a subset of the entire natural parameter space. As mentioned earlier, this is always at least convex for exponential families and strictly convex in minimal representation. Thus the negative log-likelihood (of the non-truncated density) can be made strongly convex and smooth by restricting the natural parameter space appropriately.

Assumption A2 (Log-Concave Density). The density pθ(x) is log-concave in x. Assumption A3 (Sufficient Statistics T(x) is polynomial in x). T(x) Rk has components which are polynomial in x, with degree at most d.

Assumptions A2 and A3 allow us to use the anti-concentration result needed for Lemma 3.2 which is heavily utilized even in the Gaussian case. While A2 also allows for efficient sampling via Langevin dynamics, the latter is only used in Lemma 3.2. Refer back to 1.1 for discussion of these assumptions.

Main Result. Theorem 3.1 (Main). Given membership oracle access to a measurable set S whose measure is some constant α (0, 1] under an unknown exponential family distribution pθ which satisfies A1, A2, A3, and given samples x1, . . . , xn from pθ that are truncated to this set, there exists an expected polynomial-time algorithm that recovers an estimate bθ. That is, for any ϵ > 0 the algorithm

Uses an expected e O(k/ϵ2) truncated samples and queries to the membership oracle,

Runs in expected poly(m, k, 1/ϵ) time.

Produces an estimate bθ such that with probability at least 99%,

In order the solve this problem, we need to define an objective whose optimum is θ and we need to be able to recover it uniquely. To use maximum likelihood estimation (or minimize the negative log-likelihood), we have to be able to compute gradients which depend on the truncation set S, which we cannot do directly without more knowledge about S. However, we can sample unbiased estimates of the gradient, as long we have non-trivial mass on S at a current parameter estimate (otherwise the truncated likelihood function at that parameter is not well-defined and rejection sampling would take infinite time). To address all of these issues, the organization of this section is as follows:

Section 3.1 establishes that after truncation, the negative log-likelihood remains strongly convex and smooth (in θ) over a subset of parameters which have non-trivial mass on the truncation set. In Section 3.3, we show that while we do not know the truncation set, we can solve the non-truncated MLE problem with truncated samples to find an initial parameter θ0 which assigns non-trivial mass to the truncation set. Then given this θ0, in Section 3.2 we show that we can construct a set of parameters K which all assign non-trivial mass to the truncation set (and contains the true parameter θ ). In Section 3.4, we use results from the previous sections to prove that we can efficiently recover the true parameter θ using a stochastic gradient descent procedure minimizing the truncated negative log-likelihood, which projects to the parameter space K.

3.1 Strong Convexity and Smoothness of Truncated Negative Log-Likelihood

Without truncation, recovering the true parameter θ for any parameterized distribution given samples is a classical problem solved by maximizing the likelihood (or minimizing its negation). Here, we state the main objective we will minimize through a stochastic gradient descent procedure as well as the properties of this objective that will allow us to recover θ . Define:

ℓ(θ) := Ex p S θ log p S θ(x)

θℓ(θ) = Ez p S θ [T(z)] Ex p S θ [T(x)]

2 θℓ(θ) = Covz p S θ [T(z), T(z)]

Note that since the Hessian is a covariance matrix which is at least PSD, this objective is always convex in θ. Thus θ is a minimizer since it satisfies the first-order optimality condition. (These calculations can be found in Appendix A.1.) However, if the objective is too flat, we may not be able to recover θ even after sufficiently reducing the objective value. For this, we prove that if the original non-truncated covariance has bounded eigenvalues, the truncated one does as well under A1, A2, and A3 at parameters which assign non-trivial mass to S.

Lemma 3.2 (Preservation of Strong Convexity under Truncation). Assume the lower bound in A1, A2, A3. If pθ(S) > 0, then

Covz p S θ [T(z), T(z)] 1

where C is a universal constant guaranteed by Theorem 8 of [5] and d is the maximum degree of T(x). See proof in Appendix A.2 which follows that of [10]. Lemma 3.3 (Preservation of Smoothness under Truncation). Assume the upper bound in A1. Suppose pθ(S) > 0, then

Covz p S θ [T(z), T(z)] 1 pθ(S)LI.

See proof in Appendix A.3. The proof is simple and can be done similarly to the previous lemma.

Thus, we have shown that as long as we optimize over a parameter space where every θ assigns non-trivial mass to the truncation set, our objective is both strongly convex and smooth. The following sections will help us determine and then construct this set given samples. Remark. Note that the upper bound increased and the lower bound decreased for the eigenvalues of the truncated covariance matrix. Even in a one-dimensional simple case like N(0, 1), it is easy to construct examples of both increasing the shrinking the variance given freedom to place mass anywhere under N(0, 1). Thus, it may be natural that the eigenvalue range expands after truncation. Remark. Lemma 3.2 and the log-likelihood calculations are direct generalizations of prior work in the Gaussian case, where we can recover the results of [10] by noting that the re-parameterization of Gaussian parameters (µ, Σ) as ν = Σ 1µ and T = Σ 1 is the natural parameterization of multivariate Gaussian distributions in exponential family form (up to constants). The sufficient statistics here T(x) [x, xx ] has components which are polynomial in x with degree at most 2, and plugging in d = 2 to Lemma 3.2 recovers Lemma 4 in [10]. Appendix B includes other examples beyond Gaussians which satisfy A1, A2, and A3.

3.2 Parameter Space with Non-Trivial Mass on Truncation Set

The prior section established that the strong convexity and smoothness parameter of the truncated objective is controlled by the mass that pθ assigns to the truncation set S for any given θ. In this section, we will prove lower bounds on the mass that pθ assigns to the truncation set, given that the parameter distance to the optimum θ θ is bounded. Lemma 3.4 (Lower bound for mass on truncation set under smoothness given parameter distance). Assume A1. Let θ, θ Θ. Then for two distributions from the same exponential family

pθ(S) pθ (S)2 exp 3L

Proof is provided in Appendix C.3, and only needs smoothness. Thus, we can lower bound the mass that a parameter θ assigns to S given its distance θ θ from θ which is assumed to have pθ (S) = α.

Thus, to make use of this property we want to establish a procedure to initialize a parameter θ0 such that its distance to the optimum θ is bounded. Then during the optimization procedure we will make sure to make progress toward θ . The following is the result we will be able to prove after proving some results between truncated and non-truncated quantities in the following Section 3.3. Corollary 3.5 (Parameter space with non-trivial mass on truncation set). Given θ0 such that Epθ0[T(x)] = T where T = 1 n Pn i=1 T(xi) is the empirical mean sufficient statistics given our samples {xi}n i=1 for each xi p S θ , if we define

K = B θ0, ϵS + O(log 1/α)

then pθ(S) α2 exp 6L

λ2 (ϵS + O(log 1/α))2 > 0 holds θ K (with probability at least 1 δ for n Ω(log(1/δ))). Furthermore, θ K (as long as θ Θ satisfies conditions of Claim 1).

This result will follow from Lemma 3.4, Claim 1, Lemma 3.7, and Lemma 3.9 in the next section.

3.3 Initialization with Empirical Samples and Non-truncated MLE

Given samples from the truncated density p S θ , one may first try to solve the non-truncated empirical MLE problem to find a parameter θ0 without truncation and hope that it is good enough. (We will show that it is.) In order to understand how good this initial guess is, we need to establish some relationships between the truncated and non-truncated density. Lemma 3.6 (Truncated vs. Non-truncated Mean Sufficient Statistics for General Densities). Let ρ be a probability distribution on Rd (not necessarily from an exponential family). Let S Rd with ρ(S) > 0. Then

EρS[x] Eρ[x]

Proof of this lemma and several related quantities for general truncated densities is in Appendix C.1. In low dimensions, this variance term may effectively be a constant; however, in high-dimensional settings this term can grow with dimension (which is undesirable if we want an efficient algorithm). Given more assumptions about the density, we can get better dimension-free bounds which generalize the results from the Gaussian case. Claim 1. Let θ Θ such that θ + 1

β u Θ for some β > 0 for all unit vectors u and such that Assumption A1 holds for pθ from an exponential family. Then X := u (T(x) Ex pθ[T(x)]) is SE(L, β). (Proof provided in Appendix C.2.) Remark. This claim allows us to have concentration of the empirical mean sufficient statistics to its population mean for the non-truncated distribution. The prior work [10] analyzed the mean and covariance of a Gaussian separately, but for instance to establish bounds on distances between the truncated and non-truncated mean parameter, it made use of Gaussian concentration inequalities (which are tighter than sub-exponential ones). The relationship between the truncated density and the non-truncated one will allow us to say that the empirical truncated mean sufficient statistics is also somewhat close to the non-truncated population mean. Lemma 3.7 (Concentration of Empirical vs. Non-truncated Mean Sufficient Statistics). Suppose θ satisfies the conditions of Claim 1 and pθ (S) = α > 0. Let T = 1 n Pn i=1 T(xi) be the empirical mean sufficient statistics given our samples {xi}n i=1 each xi p S θ . Let ϵS > 0. For

n Ω 2β ϵS log 1

δ , with probability at least 1 δ,

T Epθ [T(x)] ϵS + O(log 1/α).

It should be noted that it suffices to consider ϵS as some independent constant (and not related to the ϵ accuracy parameter in the main theorem). However, by taking n at least e O(k/ϵ2) as stated in the main theorem, this will be small. See proof in Appendix C.4. Remark. At a high level, the truncated samples can be thought of as O(n/α) samples from the non-truncated distribution (keeping only those in S), and each are not too far (depending on how much mass the set S has under the non-truncated distribution) from the non-truncated mean due to concentration. Note that the O(log 1/α) term will never disappear even as we increase n, which quantifies the inherent bias that the truncated mean sufficient statistics will have with respect to the non-truncated one (and is large if the mass α is small). From this, we can also say something about the population mean sufficient statistics, one on the truncated distribution and one on the non-truncated. Corollary 3.8 (Truncated vs. Non-truncated Mean Sufficient Statistics). Let θ satisfy the conditions of Claim 1 and pθ(S) > 0. Then

Ep S θ [T(x)] Ex pθ[T(x)] O(log 1/pθ(S)).

The proof follows from the preceding lemma, replacing α with pθ(S) and taking n . Compare this to the Gaussian case [10], where the mean and truncated means were bounded as µ µS O( p

log 1/pθ(S)) and separately the covariances were bounded as Σ 1/2ΣSΣ 1/2 I F O(log 1/pθ(S)). Note the smaller O( p

log 1/pθ(S)) quantity due to tighter Gaussian concentration vs. the sub-exponential rate.

Once we have bounds on the norm of the difference between the truncated and non-truncated mean sufficient statistics, we can bound distance in parameter space. The following completes this.

Lemma 3.9 (Non-truncated MLE Solution Distance to θ ). Suppose θ satisfies the conditions of Claim 1 and pθ (S) = α > 0. Let θ0 be such that Epθ0[T(x)] = T where T = 1

n Pn i=1 T(xi) given

each xi p S θ . Let ϵS > 0 and n > Ω 2β ϵS log(1/δ) . Then w.p. at least 1 δ,

λ(O(log 1/α) + ϵS).

Proof. Define ℓ untr(θ) := Ex pθ0[ log pθ(x)]. Its gradient and Hessian calculations can be done similarly to ℓ(θ), the truncated version, but with S = X the full support of the distribution.

Since Ez pθ [T(z)] Ex pθ0[T(x)] = ℓ(θ )untr is the gradient of the untruncated negative log-likelihood whose optimum is at θ0, by A1 this gives

ℓ untr(θ ) ℓ untr(θ0) | {z } 0

λ θ0 θ θ0 θ 1

λ ℓ untr(θ )

where the result follows from the fact that Ez pθ0[T(z)] = T and Ez pθ [T(z)] T O(log 1/α + ϵS) w.p. 1 δ from Lemma 3.7.

Note that this result combined with Lemma 3.4 gives Corollary 3.9.

3.4 Analysis of Projected Stochastic Gradient Descent Algorithm

Now we have all the tools we need to analyze the main algorithm. For ease of notation, define d(α) := ϵS + O(log 1/α) which is a constant that depends on α. The following describes the projected stochastic gradient descent algorithm referenced by Theorem 3.1.

Algorithm 1 Projected SGD Algorithm Given Truncated Samples

Given {xi}n i=1, each xi p S θ Initial θ0 Rk s.t. Ez pθ0[T(z)] = T, where T = 1

i T(xi). for i = 0, . . . , N do

vi = Sample Gradient(xi, θi) θi+1 θi ηvi Project θi+1 onto K = B(θ0, d(α)

λ ) Θ. end for Return θT

Algorithm 2 Sample Gradient

Input: x, θ while True do

Sample z pθ if 1{z S} via membership oracle then

Return T(z) T(x) end if end while

Given the results from the previous sections, we can now prove the main result. The analysis is based on that of Chapter 5 (Theorem 5.7) of [20] which we modify and state below: Theorem 3.10 (SGD Convergence). Let f be a λ-strongly convex function. Let θ arg minθ K f(θ). Consider the sequence {θt}N t=1 generated by SGD (Algorithm 3) and {vt}N t=1 the sequence random vectors satisfying E[vt | θt] = f(θt) and E[ vt 2 | θt] < ρ2 for all t, with a constant step size η satisfying 0 < η < 1

λ. It follows that for t 0,

E θt θ 2 (1 2ηλ)t θ0 θ 2 + η

The proof is adapted and reproduced in Appendix D.1 for completeness. To apply the above theorem we need to take care of statistical problems:

(i) strong convexity f is a strongly convex function over K a convex set

(ii) smoothness f is a Lipschitz-smooth function over K

(iii) feasibility of optimal solution θ K.

(iv) bounded variance step for all t, E[ vt 2 | θt] < ρ2 for some ρ2

and algorithmic ones:

(a) initial feasible point efficiently compute an initial feasible point θ0 (b) unbiased gradient estimation efficiently sample an unbiased estimate of f (= ℓ)

(c) efficient projection efficiently project to parameter space K

Statistical problems. Firstly, (iii) is assumed. (i-ii) is addressed by Lemmas 3.2, 3.3, 3.6, 3.9, 3.7, 3.4. In particular, we can initialize with θ0 such that θ0 θ d(α)

λ by Lemmas 3.6, 3.9, 3.7, with probability at least 1 δ. Given this θ0, we can construct K = B(θ0, d(α)

λ ) Θ which has the property that

λd(α), θ K.

Thus by Lemma 3.4, we will also have

pθ(S) α2 exp 6κ

λ (d(α))2 > 0, θ K

where κ = L/λ is the condition number. Since we are projecting to K in which all parameters have non-trivial mass, our objective remains strongly convex. In particular, our objective ℓ(θ) has

λSI 2ℓ(θ) LSI, θ K,

where λS = 1

α2 exp( 6 κ

λ (d(α))2) 4Cdeg

2deg λ and LS = exp(6 κ

α2 L are some constants which

depend on α, λ, L, and the maximum degree, deg, of the sufficient statistics. It remains to address (iv), which is done by the following lemma. Lemma 3.11 (Bounded variance step). Let vi denote the output of Sample Gradient(xi, θi) at any iteration i [N]. Provided that ES pθ[T(x)] Epθ[T(x)] O(log 1/pθ(S)) for all θ K,

E[ vi | θi 2] k LS + k L + (1 + 2κ)2(O(log 1/pθi(S)))2.

Proof. At any iteration i (arbitrary),

E[ vi 2 | θi] = E(z,x) p S θi p S θ T(z) T(x) 2

= E(z,x) p S θi p S θ T(z) 2 2 T(z), T(x) + T(x) 2

= Tr (Cov[T(z)]) + (E[ T(z) ])2 + Tr (Cov[T(x)]) + (E[ T(x) ])2 2 E[T(z)], E[T(x)]

= Tr (Cov[T(z)]) + Tr (Cov[T(x)]) + Ep S θi[T(z)] Ep S θ [T(x)] 2

k LS + k L + (1 + 2κ)2(O(log 1/pθi(S)))2

In the last step, we ve used the fact that

Ep S θi[T(z)] Ep S θ [T(x)] Ep S θi[T(z)] Epθi[T(z)] + Epθi[T(z)] Ep S θ [T(z)]

= Ep S θi[T(z)] Epθi[T(z)] + Epθi[T(z)] Epθ0[T(z)]

O(log 1/pθi(S)) + (2L/λ)(O(log 1/pθi(S)) + ϵS)

by assumption, smoothness, Cor. 3.8 and Lemma 3.9.

Algorithmic problems. For the algorithmic problems, by Cor. 3.8 and Lemmas 3.9, 3.7, we can address (a) by solving the empirical MLE problem with no truncation. Given that we can efficiently sample exactly (or approximately see Appendix D.2) from the non-truncated pθ for any θ, we can sample unbiased gradients via Algorithm 2 with expected O(1/pθi(S)) = O exp(6κ (d(α))2)

samples at each step t to address (b). Point (c) can be done efficiently, since our parameter space is a simple intersection of Euclidean balls if we choose Θ to be a Euclidean ball that sits inside the whole parameter space which contains θ .

Let D(k, L, λ, α) = k(LS+L)+(1+2κ)2(6κ(d(α))2 2 log α)2

λ2 Sϵ2 . Putting everything together, to get

E θN θ 2 ϵ2, the number of iterations and samples should be

N max D(k, L, λ, α), 1

provided that η = min{ λSϵ2

2ρ2 , 1 λS }, applying Lemma D.1 to the bound from Theorem 3.10 with

λS , C = λS, µ = 2λS, and ρ2 = k LS + k L + (1 + 2κ)2(O(log 1/pθ(S)))2 by Lemma 3.11.

Further, Lemma 3.4 guarantees O(log 1/pθ(S)) O log exp(6κ (d(α))2)

α2 for all θ K.

In probability, we get P( θN θ 2 3ϵ2) 1/3 by Markov s inequality. Then we can amplify the probability of success to 1 δ by repeating the procedure from scratch log 1/δ times, as in [10]. Given a polynomial time (poly(m, k, 1/ϵ)) algorithm AS to sample from pθ for all θ, each

iteration takes expected O(1/pθi(S)) = O exp(6 κ

α2 times the running time of AS plus the projection step (which is also efficient). This completes the result of Theorem 3.1. Remark. In the Gaussian case, the sample complexity was given in terms of m, the dimension of the x and was stated in [10] as e O(m2/ϵ2). For multivariate Gaussian, the dimension k of θ is the dimension of [x, xx ] (vectorized) which is m + m2, thus we can recover the previous result.

Numerical example. While a lot of the analyses shown can seem complicated, they give some guarantees for a rather simple algorithm: simply initialize your parameters using MLE as if there is no truncation at all, then run projected stochastic gradient descent using rejection sampling. We do not even need to describe the truncation set S as long as we can query whether a point is inside S or not efficiently. As a proof of concept for the use of this algorithm, we ve implemented our simple projected SGD algorithm to learn the parameters of multivariate exponential distributions, given truncated samples. To save space in the main body for exposition of the theoretical results, we ve included the results and details in Appendix E.

4 Discussion

To our knowledge, this work is the first which develops a computationally and statistically efficient algorithm for learning from samples truncated to very general sets S of the form in [10] in high dimensions whose distribution does not rely on Gaussianity. This work also has interesting implications for learning general log-concave distributions through applying the Taylor theorem to the log density as in [12] and for sampling (e.g., as in [30]). Through generalizing the previous Gaussian results to general exponential families, we also extract the broader abstract properties (e.g., concentration and anti-concentration) of distributions for which the previously proposed projected stochastic gradient descent procedure still applies. It would be interesting to understand how much these results can be generalized to densities beyond exponential families, or even to extend the current work presented from expected polynomial running time to deterministic polynomial running time. We hope that our work provides the foundation for future work in this direction.

5 Acknowledgements

We thank Tim Kunisky for insightful discussions during the preparation of this work. Jane Lee was supported by a GFSD fellowship sponsored by the U.S. National Security Agency (NSA).

[1] MA Beg. Optimal tests and estimators for truncated exponential families. Metrika, 29(1):103 113, 1982.

[2] Daniel Bernoulli. Essai d une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l inoculation pour la prévenir. Histoire de l Acad., Roy. Sci.(Paris) avec Mem, pages 1 45, 1760.

[3] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected langevin monte carlo. Discrete & Computational Geometry, 59, 06 2018.

[4] Róbert Busa-Fekete, Dimitris Fotakis, Balázs Szörényi, and Manolis Zampetakis. Optimal learning of mallows block model. Ar Xiv, abs/1906.01009, 2019.

[5] A Carbery and J Wright. Distributional and l-q norm inequalities for polynomials over convex bodies in r-n. Mathematical research letters, 8(3):233 248, May 2001.

[6] Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, and Manolis Zampetakis. What makes a good fisherman? linear regression under self-selection bias, 2022.

[7] Sinho Chewi. Log-concave sampling. Book draft available at https://chewisinho. github. io, 2022.

[8] A Clifford Cohen. Truncated and censored samples: theory and applications. CRC press, 2016.

[9] Georges Darmois. Sur les lois de probabilitéa estimation exhaustive. CR Acad. Sci. Paris, 260(1265):85, 1935.

[10] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 639 649, Los Alamitos, CA, USA, oct 2018. IEEE Computer Society.

[11] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 955 960. PMLR, 25 28 Jun 2019.

[12] Constantinos Daskalakis, Vasilis Kontonis, Christos Tzamos, and Emmanouil Zampetakis. A statistical taylor theorem and extrapolation of truncated densities. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1395 1398. PMLR, 15 19 Aug 2021.

[13] Constantinos Daskalakis, Dhruv Rohatgi, and Emmanouil Zampetakis. Truncated linear regression in high dimensions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 10338 10347. Curran Associates, Inc., 2020.

[14] Constantinos Daskalakis, Patroklos Stefanou, Rui Yao, and Emmanouil Zampetakis. Efficient truncated linear regression with unknown noise variance. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 1952 1963. Curran Associates, Inc., 2021.

[15] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1 42, 2019.

[16] RA Fisher. Properties and applications of Hh functions. Mathematical tables, 1:815 852, 1931.

[17] Ronald Aylmer Fisher. Two new properties of mathematical likelihood. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 144(852):285 307, 1934.

[18] Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos. Efficient parameter estimation of truncated boolean product distributions. In Conference on Learning Theory, pages 1586 1600. PMLR, 2020.

[19] Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379387):310 315, 1897.

[20] Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods, 2023.

[21] Charles M. Goldie and Claudia Klüppelberg. Subexponential Distributions, page 435 459. Birkhauser Boston Inc., USA, 1998.

[22] Patrick M Hannon and Ram C Dahiya. Estimation of parameters for the truncated exponential distribution. Communications in Statistics-Theory and Methods, 28(11):2591 2612, 1999.

[23] Andrii Ilienko. Continuous counterparts of poisson and binomial distributions and their properties. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae. Sectio Computatorica, 39, 03 2013.

[24] Andrew Ilyas, Emmanouil Zampetakis, and Constantinos Daskalakis. A theoretical and practical framework for regression and classification from truncated samples. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4463 4473. PMLR, 26 28 Aug 2020.

[25] Sham Kakade, Ohad Shamir, Karthik Sindharan, and Ambuj Tewari. Learning exponential families in high-dimensions: Strong convexity and sparsity. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 381 388, Chia Laguna Resort, Sardinia, Italy, 13 15 May 2010. PMLR.

[26] V. Kontonis, C. Tzamos, and M. Zampetakis. Efficient truncated statistics with unknown truncation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1578 1595, Los Alamitos, CA, USA, nov 2019. IEEE Computer Society.

[27] Bernard Osgood Koopman. On distributions admitting a sufficient statistic. Transactions of the American Mathematical society, 39(3):399 409, 1936.

[28] Alice Lee. Table of the Gaussian Tail Functions; When the Tail is Larger than the Body. Biometrika, 10(2/3):208 214, 1914.

[29] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.

[30] Song Liu, Takafumi Kanamori, and Daniel J. Williams. Estimating density models with truncation boundaries using score matching. Journal of Machine Learning Research, 23(186):1 38, 2022.

[31] Gabriel Loaiza-Ganem and John P. Cunningham. The continuous bernoulli: fixing a pervasive error in variational autoencoders. Ar Xiv, abs/1907.06845, 2019.

[32] Gangadharrao S Maddala. Limited-dependent and qualitative variables in econometrics. Number 3. Cambridge university press, 1986.

[33] Oren Mangoubi and Nisheeth Vishnoi. Sampling from log-concave distributions with infinitydistance guarantees. Advances in Neural Information Processing Systems, 35:12633 12646, 2022.

[34] Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2 7, 1902.

[35] Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59 68, 1908.

[36] Edwin James George Pitman. Sufficient statistics and intrinsic accuracy. In Mathematical Proceedings of the cambridge Philosophical society, volume 32, pages 567 579. Cambridge University Press, 1936.

[37] Orestis Plevrakis. Learning from censored and dependent data: The case of linear dynamics. Co RR, abs/2104.05087, 2021.

[38] Mathias Raschke. Inference for the truncated exponential distribution. Stochastic environmental research and risk assessment, 26(1):127 138, 2012.

[39] Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: A review. Statistics Surveys, 8(none):45 114, 2014.

[40] Abhin Shah, Raaz Dwivedi, Devavrat Shah, and Gregory W Wornell. On counterfactual inference with unobserved confounding. ar Xiv preprint ar Xiv:2211.08209, 2022.

[41] Abhin Shah, Devavrat Shah, and Gregory Wornell. A computationally efficient method for learning exponential family distributions. Advances in neural information processing systems, 34:15841 15854, 2021.

[42] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.

[43] John W Tukey. Sufficiency, truncation and selection. The Annals of Mathematical Statistics, pages 309 311, 1949.

[44] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

A Proofs and Calculations Regarding the Objective

A.1 The Truncated Negative Expected Log-Likelihood Function

The negative log-likelihood that x S is a sample of p S θ(x) is

ℓ(θ, x) | {z } log p S θ (x)

:= log h(x) θ T(x) + log Z

S h(x ) exp(θ T(x ))dx .

Its gradient w.r.t. θ is

ℓ(θ, x) = T(x) +

S T(x )h(x ) exp(θ T(x ))dx

S h(x ) exp(θ T(x ))dx

S T(x )h(x ) exp(θ T(x ) A(θ))dx

S h(x ) exp(θ T(x ) A(θ))dx

= T(x) + Ez p S θ [T(z)]

The Hessian is

2ℓ(θ) = ( R

S T(x)T(x) h(x) exp(θ T(x) A(θ))dx)

S h(x) exp(θ T(x) A(θ))dx)

S T(x)h(x) exp(θ T(x) A(θ))dx)

S h(x) exp(θ T(x) A(θ))dx)

S T(x)h(x) exp(θ T(x) A(θ))dx)

S h(x) exp(θ T(x) A(θ))dx)

= Covx p S θ [T(x), T(x)]

We can similarly define the population negative log-likelihood as

ℓ(θ) := Ex p S θ h log h(x) θ T(x) i + log Z

S h(x) exp(θ T(x))dx),

ℓ(θ) = Ex p S θ [ T(x)] + Ex p S θ [T(x)] ,

2ℓ(θ) = 2ℓ(θ)

A.2 Proof of Lemma 3.2

Proof. Define the following quantities:

R = Ex pθ h (T(x) Ex pθ[T(x)]) (T(x) Ex pθ[T(x)]) i

T(x) Ex p S θ [T(x)] T(x) Ex p S θ [T(x)]

R = Ex p S θ

T(x) Ex p S θ [T(x)] T(x) Ex p S θ [T(x)]

Claim 2. R R . (Proof in Appendix A.4.)

Now, let ξ Rk with ξ 2 2 = 1 arbitrary. Then

ξ R ξ = ξ Ex pθ (T(x) Ex pθ[T(x)]) (T(x) Ex pθ[T(x)]) ξ = Ex pθ [pξ(x)]

ξ R ξ = Ex pθ p ξ(x)

ξ Rξ = Ex p S θ p ξ(x)

where pξ(x), p ξ(x) are polynomials of degree at most 2d whose coefficients depend on ξ (under A3). Furthermore, note that for any ξ Rk, pξ(x) 0 and p ξ(x) 0 (due to the rank one matrix inside the expectation being PSD).

First, since R R ξ R ξ ξ R ξ, we have

Ez pθ p ξ(z) Ez pθ [pξ(z)] λ.

Now define the set A := {x : p ξ(x) γ} for γ = β 4Cd 2d λ where pθ(S) = β > 0. Theorem 8 of [5] says

pθ(A) Cqγ1/(2d) Ez pθ h p ξ(z) iq/2d 1/q q=2d = 2Cdγ1/(2d)

Ez pθ p ξ(z)

1/(2d) 2Cd γ1/(2d)

λ1/(2d) = β

Now we can split Ez p S θ

h p ξ(z) i into the part on S A and S Ac. Note that if pθ(S) = β and

2 , this implies pθ(S Ac) β

pθ(S Ac) pθ(S) + pθ(Ac) pθ(S Ac) β + 1 β

Ez p S A θ p ξ(z) +Ez p S Ac θ p ξ(z) pθ(S A)

pθ(S) 0+pθ(S Ac)

2γ Ez p S θ p ξ(z) 1

and the claim follows.

A.3 Proof of Lemma 3.3

Proof. Similar to the proof of the previous lemma, define the following quantities:

R = Ex pθ h (T(x) Ex pθ[T(x)]) (T(x) Ex pθ[T(x)]) i

R = Ex p S θ

h (T(x) Ex pθ[T(x)]) (T(x) Ex pθ[T(x)]) i

R = Ex p S θ

T(x) Ex p S θ [T(x)] T(x) Ex p S θ [T(x)]

Claim 3. It holds that R R. (Similar proof to Claim 2.)

Let ξ Rk with ξ 2 2 = 1 arbitrary. Then

ξ R ξ = Ex pθ[fξ(x)]

ξ R ξ = Ex p S θ [fξ(x)]

ξ Rξ = Ex p S θ [f ξ(x)]

where fξ(x), f ξ(x) are some functions which depend on x and ξ (e.g., polynomials of degree at most 2d under A3). By the previous claim, we also have

Ex p S θ [fξ(x)] Ex p S θ [f ξ(x)].

Ex p S θ [fξ(x)] = Z

X p S θ(x) fξ(x)dx = Z

1 pθ(S)pθ(x) fξ(x) 1{x S}dx 1 pθ(S)

Z pθ(x)fξ(x)dx. | {z } =Ex pθ [fξ(x)]

Since λI R LI by A1, it holds that ξ R ξ = Ex pθ[fξ(x)] L, thus the following inequalities hold:

ξ Rξ = Ex p S θ [f ξ(x)] Ex p S θ [fξ(x)] 1 pθ(S)L.

A.4 Proof of Claim 2

We will prove a general claim which should take care of both claims in Lemmas 3.2 and 3.3.

Claim 4. Let x ρ be a random vector with mean µ. Let b be another vector such that b = µ. Then

Covx ρ[x, x] = Ex ρ[(x µ)(x µ) ] = Ex ρ[(x b)(x b) ] (b µ)(b µ) .

E[(x µ)(x µ) ]

= E[(x b + b µ)(x b + b µ) ]

= E[(x b)(x b) ] + E[(x b)(b µ) ] | {z } =( 1) E[(b µ)(b µ) ]

+ E[(b µ)(x b) ] | {z } =( 1) E[(b µ)(b µ) ]

+ E[(b µ)(b µ) ] | {z } =E[(b µ)(b µ) ]

= E[(x b)(x b) ] E[(b µ)(b µ) ]

As a corollary, since the second term is a rank-1 matrix (thus PSD), we have that E[(x b)(x b) ] E[(x µ)(x µ) ].

B Examples of Other Distributions which Satisfy Assumptions

Example 1 (Exponential Distribution). The exponential distribution density can be written

pλ(x) = λ exp( λx) = exp( λx + log(λ)),

defined on x R+ which is a convex set and for λ > 0. In natural form, it is

pθ(x) = exp(θx + log( θ))),

defined for θ < 0. Note that

T(x) = x is a polynomial in x.

This is log-linear in x (so log-concave in x).

Variance of the sufficient statistic is simply the variance, which is 1/θ2 > 0 for any θ < 0. If we restrict θ in a bounded set, the negative log-likelihood will be strongly convex and smooth in θ.

Example 2 (Weibull Distribution with known shape k). The Weibull distribution with known shape k > 0 has density

pλ(x) = exp((k 1) log x + 1

xk + log k k log λ)

defined on x R+ and λ > 0. We can re-parameterize this in terms of θ = 1

λk with θ < 0 as

pθ(x) = xk 1 exp(θ xk + log k + log( θ)).

T(x) = xk is polynomial in x.

pθ(x) is log-concave in x if k > 1 (where recall x R+ and θ < 0).

The variance of the sufficient statistic can also be found by taking the second derivative of A(θ) = log k log( θ) w.r.t. θ, which is also 1/θ2 > 0.

Example 3 (Continuous Bernoulli). The continuous Bernoulli density [31] can be written

pλ(x) = exp

log λ 1 λ log 1 2λ (1 λ) log 1 λ

with support x [0, 1] and λ (0, 1). We can re-parameterize this in terms of θ = log λ 1 λ with θ [0, ) so

pθ(x) = exp θx log eθ 1

T(x) = x is polynomial in x.

pθ(x) is log-linear in x (so log-concave).

The variance of sufficient statistic is simply the variance again, which is given by

( 1/12 if λ = 1/2 (λ 1)λ (1 2λ)2 + 1 (2tanh 1(1 2λ))2 otherwise

This is strictly positive and bounded for all values of λ (thus all values of θ). Example 4 (Continuous Poisson). A continuous version of the Poisson distribution (although there can be others [23]) can be written

pλ(x) = 1 Z(λ) e λλx

with support x [0, ) and λ (0, ). We can write this with θ = log λ so

pθ(x) = 1 Γ(x + 1) exp(θx A(θ)).

T(x) = x is polynomial in x.

pθ(x) is log-concave in x for x R+.

In λ parameters, the mean of this distribution is λ through usual calculations (e.g., similar to those of the Gamma distribution). Note: we can absorb the e λ term into the partition function.

E[X] = 1 Z(λ)

x Γ(x)dx Γ(x + 1) = x Γ(x)

Γ(x) dx Partition function, change var. z = x 1

Similarly, we should be able to show the variance is λ as usual. In θ space, this means the variance is exp(θ) for θ R which is always positive. Again, we can make it bounded by restricting θ to some set. Example 5 (Multivariate Gaussian). The multivariate Gaussian also satisfies all of these properties. Recall that the sufficient statistics of the multivariate Gaussian has

T(x) = [x, xx ] is a polynomial in the components of x with degree at most 2 (where the xx term can be thought of as the vector after standard vectorization).

The multivariate Gaussian density is strongly log-concave.

The covariance matrix (of the sufficient statistics) has a complicated form, which the authors of [10] have analyzed the lower bound for, e.g., in their Claims 1 and 2. As before, we can restrict our parameter space to ensure upper bounds. Example 6 (Generalized Linear Models). This example is the same as the one given in [25] for generalized linear models. It is restated here for completeness.

Consider when we have some covariance, response pair (X, Y ) drawn from some distribution D. Suppose that we have a family of distributions P( | θ; X) such that, for each X, it is an exponential family with sufficient statistic ty,X

P(y | θ; X) = h(y) exp ( θ, ty,X A(θ, X)) .

We can consider a one-dimensional exponential family qν with parameterization ν = θ, X , then

P(y | θ; X) = h(y) exp (y θ, X log Z( θ, X ))

where we see that ty,X = y X and the log partition function A(θ, X) = log Z( θ, X ). When qν is Bernoulli family or unit variance Gaussian family, this corresponds to logistic regression or least squares regression, respectively.

We can appropriately generalize this to beyond linear models (e.g., polynomials) provided that we can keep the distribution log-concave.

Comment on A3. We mentioned in the main paper that this assumption combined with logconcavity provides the anti-concentration property that we need for Lemma 3.2. We assume it for simplicity of exposition, but it should be noted that as long as we have the type of anti-concentration property to control how much the covariance can shrink under truncation, we do not necessarily need T(x) to be polynomial. However, we ve provided examples of exponential families which already satisfy this above (and there are potentially more which can be addressed by this framework that do not have polynomial sufficient statistics but nonetheless exhibit similar anti-concentration properties).

C Proofs Relating Truncated and Non-Truncated Quantities

C.1 General Truncated Densities

Let ρ be a probability distribution on Rd. Let S Rd be such that ρ(S) = α for some α (0, 1]. Let ρS := ρ( | S) be the conditional distribution of x ρ given that x S.

ρS(x) = ρ(x) 1{x S}

Note that the relative density is ρS(x)

ρ(x) = 1{x S}

Then we can compute that the Rényi divergence is a constant for any order 1 q .

KL(ρS ρ) = EρS log ρS

= EρS log 1 ρ(S)

χ2(ρS ρ) = EρS ρS

1 = 1 ρ(S) 1 = 1

Rq(ρS ρ) = 1 q 1 log EρS

= 1 q 1 log 1 ρ(S)q 1 = log 1 ρ(S) = log 1

R (ρS ρ) = log sup x ρS(x)

ρ(x) = log 1 ρ(S) = log 1

Note R2(ρS ρ) = log(1 + χ2(ρS ρ)).

We recall the following general estimates.

Lemma C.1. For any probability distributions ρ, π (such that the quantities below are finite):

1. Eρ[x] Eπ[x] p

2. |Eρ[ x 2] Eπ[ x 2]| p

3. |Varρ(x) Varπ(x)| p

(χ2(ρ π) + 1)2 1 p

2Eπ[ x Eπ[x] 4].

Proof. The first two claims are immediate by Cauchy-Schwarz. For the third one, recall we can write

Varρ(x) = 1

2Eρ 2[ x y 2].

Then by applying part (1) to ρ 2 and (π) 2, we get

|Varρ(x) Varπ(x)| 1

χ2(ρ 2 π 2) p

Eπ 2[ x y 4]

(χ2(ρ π) + 1)2 1 p

2Eπ[ x Eπ[x] 4] + 6Eπ[ x Eπ[x] 2]2

(χ2(ρ π) + 1)2 1 p

8Eπ[ x Eπ[x] 4].

For our application, we have the following. Given a probability distribution ρ on Rd, we let µ(ρ) = Eρ[x] be its mean, and for k N,

Mk(ρ) := Eρ[ x µ(ρ) k]1/k.

So for example we have M2(ρ) = p

Varρ(x). We also have Mk(ρ) Mℓ(ρ) if k ℓ.

Lemma C.2. Let ρ be a probability distribution on Rd. Let S Rd with ρ(S) = α (0, 1]. Then

1. EρS[x] Eρ[x] q

2. |VarρS(x) Varρ(x)|

In particular, if α (0, 1] is such that 1 α2 1 + c2M2(ρ)4

2M4(ρ)4 for some 0 c < 1, then

VarρS(x) (1 c)Varρ(x).

Note that the constraint on α above implies 1 α2 3

2/3. But if M2(ρ) M4(ρ), then 1 α will be very small.

Recall also that under some conditions, e.g. if ρ is log-concave, then we have the reverse bound that

M2(ρ) C2,4M4(ρ)

for a universal constant C2,4, so the constraint above is not too restrictive, as it allows 1 α of constant size.

C.2 Exponential Families with Strongly Convex and Smooth Log-Partition Functions are Sub-Exponential

Let θ Θ such that θ + 1

β u Θ for some β > 0 for all unit vectors u and such that Assumption A1 holds for pθ. Then X := u (T(x) Ex pθ[T(x)]) is SE(L, β).

Proof. WLOG, consider pθ in the transformed space x 7 T(x) so that

pθ(t) = h(t) exp(θ t A(θ))dt,

where θ Θ and A(θ) = log(Z(θ)) = log R

T (X) h(t) exp(θ t)dt is the log-partition function.

Note that 2A(θ) = Covt pθ(t)[t] = Covx pθ(x)[T(x)], and by A1, A(θ) is a λ-strongly convex and L-smooth function in θ.

To show that pθ(t) is sub-exponential with parameters (ν2, β) we need to show that its moment generating function satisfies E[eγu (t µ)] eγ2ν2/2, where µ = Epθ[t], u is a unit vector, for |γ| < 1/β.

E[eγu (t µ)] = Z eγu t γu µ h(t)eθ t A(θ)dt

= exp( γu µ)

Z h(t) exp((γu + θ) t)dt

= Z(γu + θ)

Z(θ) exp( γu µ)

The inequality we need to show is equivalent to proving

E[eγu (t µ)] eγ2ν2/2

Z(θ) e γu µ eγ2ν2/2

Z(θ) eγu µ eγ2ν2/2

A(γu + θ) A(θ) γu µ + γ2ν2

2 Since A(θ) is L-smooth, we have that

A(γu + θ) A(θ) A(θ) | {z } =µ , γu + L

2 γu 2 = γu µ + γ2L

where we ve used the property of exponential families that the gradient of the log partition function is the mean sufficient statistic. Now we can see that the appropriate parameter for ν2 is L and γ must be small enough so that γu + θ Θ, i.e., |γ| < 1

β for some β > 0. This is possible if θ is bounded away from the boundary of Θ.

Remark. In the above, we only needed to use that pθ is an exponential family distribution and that its log-partition function A(θ) is smooth. It is also possible to show that pθ has exponentially decreasing tails (in quantities involving x rather than T(x)) if it is log-concave in x (assumption A2), e.g., by [39].

C.3 Proof of Lemma 3.4

Let pθ(x) = h(x) exp( θ, T(x) A(θ)) and A: Θ R is the log-partition function:

X h(x) exp( θ, T(x) )dx.

Lemma C.3. For any q > 1, θ, θ Θ:

q = exp (q 1)A(θ) q A(θ ) + A qθ (q 1)θ .

X h(x) exp ( θ, T(x) A(θ)) exp q θ θ, T(x) q A(θ ) + q A(θ) dx

= exp (q 1)A(θ) q A(θ ) + A qθ (q 1)θ .

Lemma C.4. Assume A is convex and L-smooth on Θ. For any S X, and θ, θ Θ:

pθ(S) pθ (S)2 exp 3L

Proof. By Cauchy-Schwarz,

pθ (S)2 = Epθ

= pθ(S) exp A(θ) 2A(θ ) + A 2θ θ .

Since A is convex and L-smooth,

A(θ) A(θ ) + A(θ), θ θ

A(2θ θ) A(θ ) + A(θ ), θ θ + L

A(θ) 2A(θ ) + A 2θ θ A(θ ) A(θ), θ θ + L

Compare this to the Gaussian case (e.g., see H.8 of [37]) where this was pθ(S)

2 log 1/α 1

2r2 for θ θ < r.

C.4 Proof of Lemma 3.7

n Pn i=1 T(xi) be the empirical mean sufficient statistics given our samples {xi}n i=1 each xi p S θ .

Let ϵS > 0. For n Ω 2β ϵS log 1

T Epθ [T(x)] ϵS + O(log 1/α)

with probability at least 1 δ.

Proof. Let ν = Ex pθ [T(x)].

For any event A, we have that

Pp S θ [A] = Z 1{ω A}dp S θ (ω) = 1

Z 1{ω A}1{ω S}dpθ (ω) 1

and for the product measure with n independent components PΠi [n]p S θ [A] 1

α n PΠi [n]pθ [A]. So we can bound the probability of events on p S θ with those on pθ . In particular, by Claim 1 and by the composition property of independent sub-exponential random variables, we have that

i T(xi) ν ! t

for any unit vector u

i T(xi) ν t

To translate this to the probability of the same event on p S θ , note that 1

δ exp n log 1/α t

which holds when t = 2β log 1/α + 1

n log 1/δ . Thus for n > 2β ϵS log 1/δ samples from the truncated p S θ we have that with probability at least 1 δ, the quantity T ν 2β(log 1/α)+ϵS.

D Additional Proofs for Algorithm Analysis

Algorithm 3 Stochastic Gradient Descent

Initialize some θ0 K. for iteration t = 1, 2, . . . , T do

Compute vt such that E[vt | θt] = f(θt) eθt+1 θt ηvt eθt+1 = ΠK(eθt+1) (Project onto K) end for Return θT

D.1 SGD Algorithm and its Analysis

Although the setting of Theorem 5.7 of [20] is when the objective is a sum of many functions, the proof and its result can be easily adapted to our setting.

Theorem. Let f be a λ-strongly convex function. Let θ arg minθ K f(θ). Consider the sequence {θt}N t=1 generated by SGD (Algorithm 3) and {vt}N t=1 the sequence random vectors satisfying E[vt | θt] = f(θt) and E[ vt 2 | θt] < ρ2 for all t, with a constant step size η satisfying 0 < η < 1

λ. It follows that for t 0,

E θt θ 2 (1 2ηλ)t θ0 θ 2 + η

Proof. At any iteration i, eθi+1 = θi ηvi eθi+1 θ = θi θ ηvi (1)

eθi+1 θ 2 = θi θ 2 2η vi, θi θ + η2 vi 2

where the last line comes from multiplying the line (1) with the transpose of the same equation on either side. After projecting to the set K to obtain θi+1 = arg minθ K eθi+1 θ 2 and given that θ K, we have that eθi+1 θ 2 θi+1 θ 2, so θi+1 θ 2 θi θ 2 2η vi, θi θ + η2 vi 2 (2)

Fixing θi in the ith iteration and taking the conditional expectation in (2) gives E[ θi+1 θ 2 | θi] θi θ 2 2η f(θi), θi θ + η2E[ vt 2 | θi]

(1 2ηλ) θi θ 2 + η2E[ vt 2 | θi] where the last line is due to strong convexity, f(θi), θi θ λ θi θ 2. By taking iterated expectations and recursively applying the above, we get that

E[ θT θ 2] (1 2ηλ)T θ0 θ 2 + η2ρ2 T 1 X

i=0 (1 2ηλ)i

(1 2ηλ)T θ0 θ 2 + η2 1

= (1 2ηλ)T θ0 θ 2 + η

where in the second line we used that PT 1 i=0 (1 2ηλ)i = 1 (1 2ηλ)T

1 (1 2ηλ) < 2 2ηλ provided η < 1/λ.

We can derive the complexity (number of iterations) to get E[ θT θ 2] < ϵ using the following Lemma from [20]. Lemma D.1 (Lemma A.2 of [20]). Consider the recurrence given by αk (1 ηµ)tα0 + Aη, where µ > 0, and A, C 0 are given constants and η < 1/C. If

Note that to get bounds on θT θ rather than θT θ 2, we can solve the number of iterations we need to get ϵ2 on the right hand side, and we will get the number of iterations for θT θ < ϵ. Then the resulting complexity bounds will replace 1/ϵ with 1/ϵ2.

D.2 Approximate Sampling of Non-Truncated Distribution

Many analyses of stochastic gradient descent assume unbiased directions at every iteration of the algorithm, but since we need to be able to sample from pθi multiple times at each iteration i until we get a sample in S, our directions are only unbiased if we can indeed sample exactly from pθi each time for all i.

However if pθi is complicated, exact sampling can be difficult or take too long. Since we assume that pθ is log-concave in x for all θ Θ, we can at least approximately sample from it efficiently via Langevin Monte Carlo, MALA, or other algorithms with convergence guarantees for log-concave densities. Lemma D.2 (SGD Analysis with Biased Directions). Let f be a λ-strongly convex function. Let θ arg minθ K B(θ ,D) f(θ). Consider the sequence {θt}N t=1 generated by SGD but with {evt}N t=1 the sequence random vectors satisfying E[evt | θt] = bt f(θt) with bt < B and E[ evt 2 | θt] < ρ 2 for all t, with a constant step size η satisfying 0 < η < 1

λ. It follows that for t 0,

E θt θ 2 (1 2ηλ)t θ0 θ 2 + η

Proof. Define bt := E[bt | θt] f(θt) the bias for each t 0. The analysis of Theorem 3.10 can be applied generically to get Eq. (2): θi+1 θ 2 θi θ 2 2η evi, θi θ + η2 evi 2

Now when taking the conditional expectation, we get E[ θi+1 θ 2 | θt] θi θ 2 2η bt, θi θ 2η f(θi), θi θ η2E[ evi 2 | θi]

(1 2ηλ) θi θ 2 + η2ρ 2 + 2η bi θi θ Taking iterated expectations and recursively applying this gives now

E[ θT θ 2] (1 2ηλ)T θ0 θ 2 +

i=0 (1 2ηλ)i η2ρ 2 + 2η bi θi θ

(1 2ηλ)T θ0 θ 2 + η2ρ 2 + 2ηBD

= (1 2ηλ)T θ0 θ 2 + ηρ 2

λ where the second line holds if bi B, θi θ < D, i (which holds under the assumptions).

Note that in our Algorithm 1, D here is simply 2

λd(α) by construction. We can also control B through the following.

Bounding the bias. Fix some t. Let evt := T(z) T(x) where z ep S θi and x p S θ . Then we can write bt = Ez ep S θi[T(z)] Ex p S θ [T(x)] + Ez p S θi[T(z)] Ex p S θ [T(x)]

= Ez ep S θi[T(z)] Ez p S θi[T(z)]

X T(x) (ep S θi(x) p S θi(x))dx (3)

If we know that T(x) is bounded over S, we can upper-bound this given the TV distance between ep S θi and p S θi: bt sup x S T(x) ep S θi p S θi T V .

Since we assume that Epθi[T(x)] is finite, it should be the case that T(x) is bounded over its support except potentially on some negligible sets. In that case, we can replace T(x) with e T(x) which replaces those potentially infinite values on negligible sets with 0 and the integral expression in (3) would be equal to one which uses e T(x) instead of T(x), and the bound on its norm holds given that e T(x) is bounded.

Otherwise we can use bounds from Lemma C.1 to bound this as

χ2(ep S θi||p S θi) q

Varp S θi(T(x))

if we have control over the chi-square divergence (see Section C.1 for definitions).

If we know that T(x) is a 1-Lipschitz, real-valued function (e.g., when T(x) = x), we can use the dual representation of W1 distance to bound this as

S T(x)dep S θi Z

S T(x)dp S θi sup f F1Lip

Z f(x)dep S θi Z f(x)dp S θi = W1(ep S θi, p S θi).

Proposition D.3 (Bounds on truncated total variation, given bounds on non-truncated). Suppose epθi pθi T V ϵT V for some ϵT V > 0. Then

ep S θi p S θi T V ϵ2 T V pθi(S) ϵT V .

Proof. First, given that epθi pθi T V ϵT V , we have by one characterization of the total variation distance (the supremum of the difference in mass over all measurable sets) epθi(S) pθi(S) ϵT V . Now for the truncated densities,

ep S θi p S θi T V = 1

Z |ep S θi(x) ps θi(x)|dx

Z 1{x S} epθi(x) epθi(S) pθi(x)

ϵT V pθi(S) ϵT V 1

Z |epθi(x) pθi(x)|dx

ϵ2 T V pθi(S) ϵT V

Efficient, Approximate Sampling. There exist several results in sampling which give bounds in TV distance in polynomial time (mixing time bounds) for log-concave distributions, e.g., [15], [3], [33] but which also usually require that the log density is also smooth (in x, not θ). There are also proofs for Langevin Monte Carlo when the log density is convex and Lipschtiz, not necessarily smooth (e.g., see Chapter 4 of [7]), or under LSI (which is implied by strong log-concavity) with convergence in Renyi divergence (e.g., Chapter 5 of [7]). We can also use the proximal sampler to achieve convergence in KL divergence under log-concavity (e.g., Chapter 8.4 of [7]), which by Pinsker s inequality can bound the TV distance.

Proposition D.4 (Bounded variance step with bias). If E[ vi | θi] ρ2 where vi = T(z) T(x) with z pp S θi and x p S θ ,then E[ evi | θi] ρ 2 for evi = T(ez) T(x) with ez epp S θi and

x p S θ , where ρ 2 = Varep S θi(T(z)) Varp S θi(T(z)) + ρ2 + B2,

where bi = Eez ep S θi[T(ez)] Ez p S θi[T(z)] < B.

Proof. As in the proof of the exact sampling version, we can write

E[ evi | θi] = Varep S θi(T(z)) + Varp S θ (T(x)) + Eep S θi[T(z)] Ep S θ [T(x)] 2

Varep S θi(T(z)) + Varp S θ (T(x)) + Ep S θi[T(z)] Ep S θ [T(x)] 2 + bi 2

= Varep S θi(T(z)) Varp S θi(T(z))

+ Varp S θi(T(z)) + Varp S θ (T(x)) + Ep S θi[T(z)] Ep S θ [T(x)] 2 | {z } ρ2

+ bi 2 | {z } B2

Varep S θi(T(z)) Varp S θi(T(z)) + ρ2 + B2

We can bound the difference Varep S θi(T(z)) Varp S θi(T(z)) if have bounds on Varep S θi(T(z)), through bounds like in Lemma C.1 if we can say something about the chi-square divergence, or through similar arguments to the bias bound if we assume some bounds on T(x) 2 over its support.

E Numerical Example

To illustrate how the algorithm performs in different dimensions, we implemented our algorithm for 2-, 5-, 10-, and 20-dimensional exponential distributions. In all cases, the truncation set is the (hyper-)cube [0, 2]d. We chose true parameters in all cases which resulted in an initial error at most 2.5. In all cases, we use 1500 iterations and step size 0.01, each repeated 10 times. In the end, all have (average) L2 error at most 0.15. For stability (and to bypass repeating the algorithm multiple times as stated in the analysis), we instead calculated gradients using the average of 10 samples which was sufficient to have stable training results. See Figure 2.

Figure 2: Learning 2-, 5-, 10-, and 20-dimensional truncated exponential distributions. In all cases, the truncation set is the (hyper-)cube [0, 2]d.

The wall clock time to finish all 1500 iterations of training for 5-, 10-, and 20-dimensions was 42.9 2.2, 49.1 6.5, and 61.2 3.0 seconds, respectively. We can see that the running time is not doubling with the doubling of dimensions.