# oracle_efficient_truncated_statistics__0df8f548.pdf

Published as a conference paper at ICLR 2025

ORACLE EFFICIENT TRUNCATED STATISTICS

Konstantinos Karatapanis1,2, Vasilis Kontonis3, Christos Tzamos1,4

1 Archimedes, Athena Research Center, Greece 2 National Technical University of Athens 3 University of Texas at Austin 4 University of Athens k.karatapanis@athenarc.gr, vasilis@cs.utexas.edu, chtzamos@di.uoa.gr

We study the problem of learning from truncated samples: instead of observing samples from some underlying population p , we observe only the examples that fall in some survival set S Rd whose probability mass (measured with respect to p ) is at least α. Assuming membership oracle access to the truncation set S, prior works obtained algorithms for the case where p is Gaussian Daskalakis et al. (2018) or more generally an exponential family with strongly convex likelihood Lee et al. (2023) albeit with a super-polynomial dependency on the (inverse) survival mass 1/α both in terms of runtime and in number of oracle calls to the set S. In this work we design a new learning method with runtime and query complexity polynomial in 1/α. Our result significantly improves over the prior works by focusing on efficiently solving the underlying optimization problem using a general purpose optimization algorithm with minimal assumptions.

1 INTRODUCTION

We study the problem of inference from truncated samples: assuming an underlying population p over data in Rd the learner has access to samples from the conditional measure p S over some survival set S Rd. This conditioning or truncation of the underlying population data may be the outcome of censorship, imperfect data collection processes, measurement errors, user preferences, etc. Inference from truncated data is a central problem in statistics going back to fundamental works in the beginning of the previous century Galton (1898); Pearson (1902b); Pearson & Lee (1908b). Due to its practical importance, more recently, the problem of inference from truncated data has seen significant developments with a focus on providing computationally efficient estimation algorithms Daskalakis et al. (2018); Kontonis et al. (2019); Ilyas et al. (2020); Daskalakis et al. (2020b;a; 2021); Lee et al. (2023). In this work we consider the problem of fitting a parametric model to truncated data given oracle access to the survival set S (i.e., the set of examples that are not truncated). In this model Daskalakis et al. (2018) was the first work that gave a computationally and statistically efficient algorithm for learning truncated Gaussian distributions with membership oracle access to the truncation set. More recently, the work Lee et al. (2023) extended those results and gave efficient algorithms when the underlying population follows an exponential family. In this work we study learning truncated exponential families with computationally and statistically efficient algorithms that perform well even when the fraction of observed data is very small (compared to the full population).

Learning truncated exponential families An exponential family pθ, parameterized by some k-dimensional vector θ Θ is a density of the form pθ(x) = h(x) exp(θ T(x) A(θ)), where h(x) : Rd 7 [0, ) is a weight function, T(x) : Rd 7 Rk is the sufficient statistic, A(θ) = log R h(x) exp(θ T(x))dx is the appropriate normalization constant (aka the logpartition function). Given an exponential family pθ and a survival set S, we define the corresponding truncated exponential family as p S θ (x) = pθ(x)1S(x)/ R

S pθ(x)dx, where 1S(x) denotes the indicator function of whether x belongs in S. Given samples from some truncated exponential family p S θ the goal of the learner is to identify an approximation to the (unknown) parameter θ . Given some target accuracy ϵ > 0, the approximation guarantees for the learned θ can either be in some distance over the parameter space, i.e., θ θ ϵ (parameter recovery) or over distributions, e.g., that the total variation distance between pθ and pθ is at most ϵ. The goal is to find a good candidate parameter θ with as few (i) samples from p S θ ; (ii) oracle calls to 1S(x) and (iii) computational resources as possible.

Published as a conference paper at ICLR 2025

The fraction of observed data or equivalently the probability that a sample from the underlying population pθ is truncated, captures the difficulty (both statistical and computational) of inference from truncated data. More precisely, we define α = pθ (S) to be the probability assigned to the survival set by the unknown exponential family pθ . While previous works Daskalakis et al. (2018); Lee et al. (2023) on learning truncated exponential families have polynomial dependencies on the parameter dimension k and the accuracy parameter ϵ, their dependency on the survival probability α is far from optimal. In particular, their runtime, sample complexity, and oracle calls all depend super-polynomially (i.e., (1/α)poly(log(1/α))) to the survival probability α. In this work our focus will be to provide algorithms that achieve dependencies on the dimension k and accuracy ϵ but also on the survival mass α.

Is there a computationally efficient algorithm that can learn truncated exponential families with a polynomial dependence on the survival mass 1/α? Can we get state of art dependencies on dimension and accuracy parameters at the same time?

Aiming for algorithms with low sample-complexity is perhaps the most well-studied task from a statistical perspective and is of great importance for virtually every inference task. We remark that in the context of truncated statistics getting oracle efficient algorithms is also important as the membership oracle may correspond to a costly or complex mechanism (e.g., we may need to find candidates to complete a new questionaire, perform a new physical experiment, etc.). Our main contribution is a positive answer to the above question by providing a new analysis of the truncated negative log likelihood objective.

1.1 OUR CONTRIBUTIONS

We first formally define the class of exponential family distributions that we consider in our work. Similarly to Lee et al. (2023), we assume that the non-truncated negative log-likelihood of the exponential family is strongly convex and smooth as a function of the parameter θ. Given some exponential family pθ parameterized by θ Rk, the negative log-likelihood (NLL) over a population distribution q is defined as follows:

L(θ) = E x q[log pθ(x)] . (1)

In what follows we will refer to L as non-truncated . Moreover, for a truncated exponential family p S θ we define the following truncated version of the NLL objective as:

LS(θ) = E x q[log p S θ (x)] = L(θ) + log pθ(S) . (2)

Definition 1.1 (Strongly convex and smooth exponential families). We assume that L(θ) is λstrongly convex and L-smooth as a function of θ, i.e.,

λI Covx pθ[T(x), T(x)] LI .

Remark 1.2. Lee et al. (2023) further places strong structural assumptions on pθ. In particular, that the underlying non-truncated exponential family is log-concave (i.e., its log-density is concave as a function of x) and also that the sufficient statistic T(x) is a polynomial function of x. As we will see our main learning result is quite general by providing a weaker learning guarantee (than parameter recovery) and therefore requires no such assumptions a priori. However, given those assumptions we can obtain similar guarantees to the prior works Daskalakis et al. (2018); Lee et al. (2023).

We remark that, as observed in Lee et al. (2023) the above assumption is quite general and captures many well-studied exponential families such as Gaussians, Exponentials, Weibull, (Continuous) Poisson, etc. Since our focus in this work is learning (as opposed to sampling from) exponential families, we treat the sample generation process as a black-box and define the following sampling oracle that returns samples from a given exponential family or truncated exponential family. For the truncated case, we crucially require that the learner must give the oracle some pθ that assigns non-trivial mass to the survival set S. This is to ensure that our sampling oracle can be readily implemented via rejection sampling without a large number of rejected samples (that would in turn require a large number of calls to the membership oracle to S). For example, assume that the survival set is S = [0, 1] and

Published as a conference paper at ICLR 2025

the ground-truth exponential family is a normal distribution (not necessarily with unit variance) that assigns α mass to the set S. The learner can request samples from unit variance normal distributions whose means are roughly within O p

log(1/α) distance from 0 but not, for example, from a

normal centered at 1/α which puts exponentially small (i.e.,2 Ω(1/α2)) mass to S. Definition 1.3 (Sample Oracles). Given as input an exponential family pθ F, the (non-truncated) sample oracle returns a sample x pθ. Moreover, given as input an exponential family pθ F, the β-truncated sample oracle over a survival set S returns a sample x p S θ , provided that pθ(S) β. In case this constraint does not hold the behavior of the oracle is undefined. Remark 1.4 (Implementing a β-Truncated Sample Oracle). Given a membership oracle to the set S and a sampler for the non-truncated exponential family pθ, one can directly implement a β-truncated sample oracle by rejecting samples until one falls in S.

We now present our main result on learning truncated exponential families that requires roughly k ϵ2 poly(log(1/α)) samples from the unknown truncated density p S θ and makes k

ϵ2 poly(log(1/α)) sample-oracle calls to an α-truncated sample oracle of Definition 1.3 to learn a truncated density close to the observed truncated distribution in Kullback-Leibler (KL) divergence. Theorem 1.5 (Truncated Exponential Families). Suppose F is an exponential family whose negative log-likelihood (NLL), L(θ), satisfies the conditions in Definition 1.1. Then, there exists an algorithm

that uses e O k

α samples from p S θ , makes e O k ϵ2 log 1

α 2 calls to an Ω(α) sample or-

acle (of Definition 1.3) runs in time poly(k, 1/ϵ, log( 1

α)) and computes an estimate ˆθ, such that KL(p S θ p S ˆθ ) ϵ with probability at least 99%.

We remark that the fact that our algorithm only requires an α-truncated sample oracle is what enables us to achieve a polynomial number of oracle calls to the membership oracle to S and significantly improve over the prior works Daskalakis et al. (2018); Lee et al. (2023) (that required super-polynomial oracle calls and samples). For instances where truncated sampling can be done more efficiently than rejection sample we even get logarithmic dependence on the mass of the survival set. Observe that the algorithm of Theorem 1.5 returns a hypothesis pθ whose corresponding truncated density p S θ is close in KL divergence. Typically one asks for the learned density pθ to be close to the target non-truncated distribution p θ, that is to extrapolate and match the density beyond the region of observation given by the survival set S. We phrased our result in terms of KL divergence of the truncated distributions to keep the required assumptions on the exponential families minimal. In particular, by making the assumptions that render the parameter θ identifiable (as in the prior works) one can readily translate the guarantee of our result to parameter recovery or (learning in KL or total variation) with respect to the target non-truncated density. We next present some new results for well-studied special cases of Definition 1.1 that we obtain as corollaries of Theorem 1.5. We first present our result for learning truncated Gaussian distributions. A truncated Gaussian density is defined as N(x; µ, Σ, S) 1S(x)N(x; µ, Σ) where µ Rd, is its mean and Σ Rd d is its covariance. Corollary 1.6 (Gaussians). Let q = N(µ , Σ , S) be a truncated normal density. There exists an algorithm that draws M = d2 ϵ2 poly(log(1/α)) samples from q, makes M poly(1/α) oracle calls to the set indicator 1S( ), runs in poly(M, d, 1/α) and learns parameters ˆµ, ˆΣ such that d TV (N(µ, Σ), N(ˆµ, ˆΣ)) ϵ.

Our next corollary is on the generalization of log-concave exponential families that satify Definition 1.1 and furthermore have log-concave density and polynomial sufficient statistic (see Lee et al. (2023)). Corollary 1.7 (Log-Concave Exponential Families). Fix ϵ > 0. Let F be a class of log-concave exponential families that satisfy Definition 1.1 and their sufficient statistic T(x) of each pθ F is a polynomial of degree ℓthat maps Rd to Rk. Let q = p S θ be a truncated normal density in F. There exists an algorithm that draws M = k

ϵ2 poly(log(1/α)) samples from q, makes M poly(1/α) oracle calls to the set indicator 1S( ), runs in poly(M, d, 1/α), and learns a parameter ˆθ such that ˆθ θ 2 ϵ and d TV (pˆθ, pθ ) ϵ.

Published as a conference paper at ICLR 2025

Remark: We refer the reader to Appendix A for a detailed discussion comparing our results in Corollaries 1.6 and 1.7 with the corresponding results in the literature.

1.2 OUR TECHNICAL CONTRIBUTION

A common challenge in the line of work on truncated statistics is to ensure that during the optimization phase the parameters of the distribution remain within a region that assigns significantly large mass to the survival set. This is important for many reasons, one being to ensure that gradients remain bounded and another being to ensure that rejection sampling works. As such all prior works restrict the optimization in custom regions of parameters that depend on the class of distributions one is trying to learn. For every different setting, learning Gaussian distributions, product distributions on the hypercube, learning permutations and rankings, and log-concave exponential families, different algorithms have been provided that focus on the specific intricacies of the setting e,g. discrete domain, polynomial features and so on. Our main technical contribution is offering a unified viewpoint for all those settings as exponential families over an unrestricted domain. Our key technical tool is establishing a very simple region for optimization that ensures the probability of the survival set is bounded and is given directly by the non-truncated loss of the model. More specifically, given an initialization θ0, we define the following optimization problem that we aim to solve via projected stochastic gradient descent.

min θ Θ LS(θ) s.t. L(θ) L(θ0) + log 1

Given an initial point θ0 whose loss is sufficiently close to the minimum non-truncated loss minθ Θ L(θ), any point θ in the feasible region is guaranteed to assign at least Ω(α) mass to the survival set. Beyond our novel understanding of the optimization and sampling landscape, which enables significantly faster convergence, we introduce a much simpler algorithm that makes minimal assumptions about the setting and the distributions involved. Unlike previous works that simultaneously bound sample complexity, query complexity, and parameter closeness, we achieve a tighter analysis by decoupling these terms. Specifically, we demonstrate the following:

We show that the dependence on samples from the true distribution is limited to the estimation of the mean sufficient statistic, which requires only a few samples. This is possible due to the good concentration properties of the truncated distribution, which behaves as a subexponential distribution. We adopt an optimization perspective, focusing on minimizing the given objective as efficiently as possible, rather than directly recovering the underlying parameters. Our guarantees are provided in terms of the closeness between the true truncated density and the learned truncated density. Leveraging existing statistical results that relate the closeness of truncated densities to the closeness of parameters and non-truncated densities, we derive parameter recovery algorithms with polynomial dependence on the mass of the survival set.

1.3 RELATED WORK

The field of truncated statistics has a long history, which finds its roots in Bernoulli s analysis of smallpox morbidity and mortality data Bernoulli (1760), and the early works of Galton (1897); Pearson (1902a); Pearson & Lee (1908a); Fisher (1931). Thus, we cannot do it justice here and for an overview of the field we refer the reader to Schneider (1986); Cohen (2016); Balakrishnan & Cramer (2014). We also refer the reader to Tobin (1958); Amemiya (1973); Hausman & Wise (1977); Heckman (1979); Maddala (1986); Keane (1993); Hajivassiliou & Mc Fadden (1998), and their references, for further work in statistics and econometrics. There still remain numerous outstanding challenges, targeting density estimation, regression and classification tasks. Many recent works have focused on providing computationally efficient algorithms. There has been a large number of recent works dealing inference with truncated data from a Gaussian distribution Daskalakis et al. (2018); Kontonis et al. (2019), mixtures of Gaussians Nagarajan & Panageas (2019), linear regression Daskalakis et al. (2019); Ilyas et al. (2020); Daskalakis et al. (2020b), sparse Graphical models Bhattacharyya et al. (2021) or Boolean product distributions Fotakis et al. (2020), nonparametric estimation Daskalakis et al. (2021).

Published as a conference paper at ICLR 2025

The area of robust statistics Huber & Ronchetti (2009) is closely related to our work, as it also addresses the challenge of handling biased datasets with the goal of identifying the underlying data-generating distribution. Recently, there has been substantial theoretical progress in developing computationally efficient methods for robustly estimating high-dimensional distributions in the presence of arbitrary corruptions affecting a small ε fraction of the samples Diakonikolas et al. (2016); Charikar et al. (2017); Lai et al. (2016); Diakonikolas et al. (2017); Klivans et al. (2018); Diakonikolas et al. (2019). Our work contributes to this broader field of statistical learning, specifically focusing on cases where certain impediments or biases are present. One of the most notable works in this area is Lee et al. (2023), which introduced efficient algorithms for learning from truncated exponential families. In the following subsection, we provide a detailed comparison between their results and our improved approach.

1.3.1 COMPARISON TO LEE ET AL. (2023)

The work of Lee et al. (2023) presents a framework for learning from truncated samples. We outline the key improvements of our method below:

Oracle Efficiency: A significant part of our improvement arises from requiring only α-truncated sample oracle access (Definition 3.1) throughout our algorithm, whereas Lee et al. (2023) requires access to an a log(1/a)-truncated sample oracle. Furthermore, any additional improvement on our result is unattainable, as all parameters near θ necessarily require access to an α-truncated sample oracle.

Parameter Estimation: In Lee et al. (2023), convergence requires approximately a log(1/a) iterations, whereas our approach achieves convergence in polynomial time relative to 1/a. Both Lee et al. (2023) and our work, as shown in Corollary 1.7, have iteration counts that depend on poly(1/mass). However, our advantage stems from being able to identify regions of higher mass more effectively.

Mass and Sample Complexity: The guarantees provided in Lee et al. (2023) concern areas of a log(1/a) mass, which can lead to inefficiencies. Our approach, by contrast, offers sharper covariance estimates for the truncated distribution, leading to initialization points that are closer to the true truncated parameter θ S = Ex p S θ (T(x)). While Lee et al. (2023) s analysis requires remaining within a ball of radius log(1/a) around the true parameter θ , effective learning necessitates both proximity to θ within this radius and maintaining a low negative log-likelihood (NLL) L(θ). Had we used the covariance estimates from Lee et al. (2023), our sample complexity would be poly(1/a), but by employing our improved estimates, we achieve poly(log(1/a)), matching the sample complexity seen in Lee et al. (2023). Initialization and Convergence: While both methods use similar initialization points, our analysis demonstrates faster convergence. Specifically, we show that the initialization point in our method yields a lower starting value for L(θ), ensuring the algorithm focuses on high-mass regions more effectively. In contrast, Lee et al. (2023) does not capitalize on this aspect as robustly.

Summary of Differences:

Sharper Covariance Estimates: Our approach guarantees that initialization starts within high-mass areas by employing refined covariance estimates. Higher Mass Guarantees: We emphasize maintaining low NLL values, incrementally broadening the search radius with controlled increases in NLL, which is critical for efficient convergence. This perspective is not as effectively addressed in Lee et al. (2023).

2 PRELIMINARIES

A distribution belongs to the exponential family if its density is expressed as

pθ(x) = h(x) exp(θ T(x) A(θ)).

For a given set S, we denote by pθ(S) the probability that the measure pθ assigns to S. The function h(x), often referred to as the weight function, is also known by other terms in the literature, such as the carrier measure.

Published as a conference paper at ICLR 2025

The parameter space of this family is defined as Θ = {θ Rk : A(θ) < }, which is an open set. After potentially reparameterizing, it is assumed that θ and T(x) are linearly independent. This form of representation is known as the minimal representation. Given a minimal representation, the function A(θ) is convex. The following properties of an exponential family are standard:

1. The gradient of A(θ), A(θ), is the expected value of the sufficient statistic under the distribution: A(θ) = Ex pθ[T(x)]. 2. The Hessian of A(θ), 2A(θ), is the covariance of the sufficient statistic under the distribution: 2A(θ) = Covpθ[T(x)]. We include the definition of a univariate sub-exponential random variable, which can be used to define a corresponding class for the multivariate case by taking the supremum over the unit sphere. There are many equivalent definitions to choose from, but the most convenient for us is by using the moment generating function (MGF). Namely, Definition 2.1 (Moment Generating Function). The moment generating function (MGF) of a distribution D, denoted by MD(t), is defined as

MD(t) = E x D[etx],

provided this expectation exists. The function MD(t) is defined for all values of t in some interval containing t = 0.

We provide a definition of the sub-exponential distribution, which we will rely on throughout. This definition is similar to the one in Vershynin (2010), with a minor adjustment to account for potential restrictions that arise when scaling moves us outside the natural parameter space Θ of our model. Note that, aside from some scaling, the definition used here aligns with that in Lee et al. (2023). Definition 2.2 (Sub-exponential distribution). Let x be a univariate random variable with zero mean. The random variable x is said to belong to the class SE(K2 1, β) if the following condition holds: E[exp(λx)] exp(K2 1λ2), for all λ such that |λ| 1

The following proposition provides an equivalent definition for the sub-exponential class, adapted from Vershynin (2010) with slight modifications to suit our needs. However, since the proof remains unchanged, we omit it. Proposition 2.3 (Sub-exponential distribution). Let x be a univariate random variable with zero mean. Fix parameters Ki > 0 for i = 1, 2, and β > 0, such that x SE(K2 1, β). Then, the following two properties are equivalent. Additionally, the quantities max(K1, β) and K2, which appear in the properties, differ by a universal constant.

(a) The moment generating function (MGF) of x satisfies

E[exp(λx)] exp(K2 1λ2)

for all λ such that |λ| 1

β . (b) The moments of x satisfy (E [|x|p])1/p K2p for all p 1.

This class of functions is typically associated with a concentration inequality. The one we use, which can also be found in Vershynin (2010), is as follows: Fact 2.4 (Bernstein s Inequality). Let x1, . . . , x N be independent, identically distributed, meanzero, sub-exponential random variables belonging in the SE(K2 1, β). Then, for every t 0, we have

2 exp c N min t2

max2(K1, β), t max(K1, β)

where c > 0 is an absolute constant.

2.1 TRUNCATED STATISTICS

We begin by defining the notion of a truncated distribution. Namely, let p be a probability density function on Rd. For a given set S Rd, let p S be the conditional distribution of x p given that x S, i.e. p S θ (x) = p(x) 1S(x)

pθ(S) . Note that the relative density is p S(x)

p(x) = 1S(x)

Published as a conference paper at ICLR 2025

In the context of truncated statistics, we engage with an oracle, denoted by MS(x) = 1S(x), which represents the indicator function of the set S. This setup is prevalent in scenarios where direct access to the set S is unavailable, yet a mechanism exists to verify the membership of elements within S. This method allows for the indirect exploration and analysis of the set through conditional access provided by the oracle.

Non-truncated and Truncated NLL Given some exponential family pθ parameterized by θ Rk, the negative log-likelihood (NLL) over a population distribution q is defined as follows:

L(θ) = E x q[log pθ(x)] . (4)

In what follows we will refer to L as non-truncated . Moreover, for a truncated exponential family p S θ , such that the support of q is contained in S, we define the following truncated version of the NLL objective as: LS(θ) = E x q[log p S θ (x)] = L(θ) + log pθ(S) . (5)

The gradient and Hessian of LS( ) are given below.

θLS(θ) = Ex p S θ [T(x)] Ex q [T(x)] , 2 θLS(θ) = Covx p S θ [T(x)] .

3 LEARNING TRUNCATED DENSITIES

In this section we prove our main result, namely an efficient algorithm for learning truncated exponential families.

Definition 3.1 (Sample Oracles). Given as input an exponential family pθ F, the (non-truncated) sample oracle returns a sample x pθ. Moreover, given as input an exponential family pθ F, the β-truncated sample oracle over a survival set S returns a sample x p S θ , provided that pθ(S) β. In case this constraint does not hold the behavior of the oracle is undefined.

Remark 3.2 (Implementing a β-Truncated Sample Oracle). Given a membership oracle to the set S and a sampler for the non-truncated exponential family pθ, one can directly implement a β-truncated sample oracle by rejecting samples until one falls in S.

This oracle enables us to reduce the inference problem to an optimization that use the gradients as rewards.

Theorem 3.3 (Learning Truncated Exponential Families). Fix ϵ > 0. Let F be an exponential family with sufficient statistic T( ) such that for any pθ F, λI Covx pθ[T(x)] LI. Assume sample access to an unknown distribution pθ F truncated with a survival set S of mass α > 0, and an Ω(α2)-truncated sample oracle Q over S. There exists an algorithm that uses e O k

samples from p S θ , makes e O k ϵ2 log 1

α 2 calls to the oracle Q, runs in time poly(k, 1/ϵ, log( 1

and computes an estimate ˆθ, such that KL(p S θ p S ˆθ ) ϵ with probability at least 99%.

Roadmap to the Proof: To prove Theorem 3.3, we proceed as follows:

1. Setup and Definitions: Our learning algorithm outputs ˆθ by performing PSGD (Theorem 3.12) starting from an initial point θ0 over a domain D. The key properties are that θ D and, for all θ D, the bound Prx pθ[x S] cα 2 holds. 2. Main Argument: The parameter θ0 serves as a good approximation of Ex p S θ [T(x)]. This approximation is achieved through a collection of smoothness results, specifically Lemma 3.5, Lemma 3.11, and Lemma 3.7. This good initialization is crucial for defining D = {θ : L(θ) L(θ0) log(1/α)} with the necessary properties. As shown in Corollary 3.9, initialization from a point with sufficiently low loss L(θ0) L(θ0) + ϵ allows the application of Corollary 3.10, where θ0 is defined in Corollary 3.10. 3. PSGD Specifications: We apply Theorem 3.12 with f = LS. The smoothness parameter LS is L(1 + log2(1/α)) over the domain D, owing to the smoothness of the truncation (Lemma 3.5) and the sufficient mass within D. Similarly, as shown in Lemma 3.13, the gradient variance of LS is of order k L log6(1/α) on D.

Published as a conference paper at ICLR 2025

Using the same techniques as in Theorem 3.3, we can demonstrate that an Ω(α)-truncated sample oracle Q over S is feasible. Note that since the parameter θ assigns mass α to S, we cannot hope to achieve something more efficient than an Ω(α)-truncated sample oracle. The main idea for this improvement in Theorem 3.3 is to incrementally expand the exploration domain to avoid overshooting and inadvertently ending up in a low-mass region. We include the theorem statement below; for a detailed proof, along with the necessary apparatus, look at the appendix Appendix C. Theorem 3.4 (Learning Truncated Exponential Families ). Fix ϵ > 0. Let F be an exponential family with sufficient statistic T( ) such that for any pθ F, λI Covx pθ[T(x)] LI. Assume sample access to an unknown distribution pθ F truncated with a survival set S of mass α > 0, and an Ω(α)-truncated sample oracle Q over S. There exists an algorithm that uses e O k

samples from p S θ , makes e O k ϵ2 log 1

α 2 calls to the oracle Q, runs in time poly(k, 1/ϵ, log( 1

and computes an estimate ˆθ, such that KL(p S θ p S ˆθ ) ϵ with probability at least 99%.

3.1 FINDING A GOOD INITIALIZATION

To demonstrate that a good initialization can be found with e O k

α samples from p S θ , we leverage the concentration properties of the distribution p S θ . This is a threefold process, and our end goal is to find in which sub-exponential class x p S θ belongs. In order to accomplish that, we have

to bound Covx p S θ (T(X)), and in order to do that we need Ep S θ [T(x)] µ 2 log 1 pθ(S) . The next lemma will deal with those two latter stepping stones. Lemma 3.5 (Moment preservation after truncation). Assume that Ex pθ[T(x)] = µ,

Covx pθ(T(x)) LI, and pθ(S) > 0. Then, Ex p S θ [T(x)] µ 2 log 1 pθ(S) . Similarly, we

have the following covariance estimate: Covx p S θ (T(x)) O log2 1 pθ(S) + L I .

The previous lemma gives us a bound on how much the smoothness is affected by the truncation. Hence, tighter bounds guarantee needing fewer samples since the truncated distribution enjoys stronger concentration properties. The analysis is achieved by performing a worst-case analysis. Next, we see how truncation affects the sub-exponential property that the non-truncated distribution possesses. For the proof we refer to Appendix B. Lemma 3.6 (Truncated density is sub-exponential). Let F be an exponential family with sufficient statistic T( ) such that for any pθ F, we have λI Covx pθ[T(x)] LI. Let x follow the truncated distribution: p S θ (x) = pθ(x)1S(x)

pθ(S) , where pθ(x) = h S(x) exp(θ T(x) AS(θ)) and pθ(S) is the normalization constant. Then, the random variable T(x), under this truncated distribution, is sub-exponential, denoted SE(K2, β), with parameters K2 = (1 + log2(1/α))L as K appears in Definition 2.2, and β is the reciprocal of the largest radius r of the ball B(θ, r), centered at θ, that is contained in B(θ, r) Θ.

Now, we have developed all the tools necessary to find a sharp estimate of the samples needed for the empirical truncated distribution to approximate its mean. Lemma 3.7 (Empirical estimation of the truncated sufficient statistic). Fix ϵ > 0. Let pθ be some exponential family such that pθ (S) = α. Moreover, let θ S = Ex p S θ [T(x)] be the expected

sufficient statistic of the truncated exponential family p S θ . Moreover, let ˆθ = 1 N PN i=1 T(x(i)) be the corresponding empirical statistic for x(1), . . . , x(N) i.i.d. samples from p S θ . Using N = O k L

α samples, with probability at least 1 δ, it holds ˆθ θ S 2 < ϵ .

Proof. Our estimator is ˆθ = 1 N PN i=1 T(xi), where xi p S θ . We have shown that, for any unit vector u, Yi = u T T(xi) u T Exi p S θ [T(xi)] belongs in the sub-exponential class with K2 =

O L + log2 1

α We fix t = ϵ > 0, now by using Bernstein s inequality, as seen in Fact 2.4, we obtain

2 exp c N min t2

max2(K, β), t max(K, β)|

= 2 exp c Nϵ2

Published as a conference paper at ICLR 2025

Since the previous inequality holds for all u, and by the superposition property of sub-exponential random variables, we obtain the previously mentioned bound for the norm, namely:

p S θ ˆθ Exi p S θ (T(xi)) t 2 exp c Nϵ2

Since b, L is a constant independent of α, we have the following bound: max2(K, β) = O log2 1

α . Therefore, for N O k

δ , we get,

p S θ ˆθ Exi p S θ (T(xi)) t δ.

3.2 THE FEASIBLE REGION AROUND INITIALIZATION

To ensure feasiblity we will use the following simple observation.

Observation 3.8 (Mass along the sub-level sets). Let pθ be some exponential family and define θ to be a minimizer of the constrained objective {minθ LS(θ) : s.t. L(θ) L} for some L. For any θ such that L(θ) L, it holds Prθ[S] Prθ [S] exp(L(θ ) L(θ)).

Proof. This holds as LS(θ ) LS(θ) which directly gives the required bound.

Using Observation 3.8 we readily obtain the following corollary.

Corollary 3.9 (Mass monotonicity along sub-level thresholds). Let θ be the minimizer of the truncated NLL objective LS for some exponential family pθ. Denote by α = Prθ [S] the mass assigned to the survival set S. Define the following constrained optimization problem: {minθ LS(θ) : s.t. L(θ) L} and denote by θ1 its solution with L = L1 and by θ2 its solution of the same optimization problem with L = L2, where L2 > L1.

1. (Monotonicity) The mass assigned to S is decreasing as a function of L: Prθ1[S] Prθ2[S] α. 2. (Exponential decrease) The mass assigned to S by θ2 drops exponentially fast, i.e., it holds that Prθ2[S] Prθ1[S] e (L2 L1).

Proof. Notice that by definition of θ1, θ1 {minθ LS(θ) : s.t. L(θ) L2}. Hence, Prθ1[S] Prθ2[S] exp(L2 L1) Prθ2[S]. It only remains to show that Prθ2[S] α. Since, the {minθ LS(θ) : s.t. L(θ) L2} decreases as L2 increases, L(θ ) = {minθ LS(θ) : s.t. L(θ) < } = min{LS(θ) : s.t. L(θ) L(θ )}. Which means for θ that is the solution to a constrained optimization problem with lower L = L2, we have θ {LS(θ) : s.t. L(θ) L(θ )}. Hence, L L(θ ) and from the previous part this implies α = Prθ [S] Prθ [S]. Hence, we conclude.

Given the exponential decrease, i.e. Corollary 3.9 we obtain the following corollary. Which is ensuring that our initialization of PSGD is close in LLN distance, and that all our parameters inside our projection domain give poly(1/α) to the survival set S.

Corollary 3.10 (Feasibility of θ0). Suppose that θ0 is θ0 = Ex p S θ [T(x)]. Then, L(θ ) L(θ0) + log 1

α . Furthermore, if for any θ such that L(θ) L(θ0) + log 1

α + ϵ, then pθ(S) pθ0(S)αe ϵ α2e ϵ.

Proof. From Corollary 3.9 it holds that pθ0(S) > α. Hence, by the exponential decrease, provided by the same corollary, we find that L(θ ) L(θ0) + log 1

α . The second part is an immediate consequence of the Observation 3.8.

3.3 PSGD CONVERGENCE

We prove the following lemma show that the truncated densities are sub-exponential, see Appendix B.

Lemma 3.11 (Sub-exponential Property for the Truncated Distribution). Let x follow the truncated distribution: p S θ (x) = pθ(x)1S(x)

pθ(S) , where pθ(x) = h(x) exp(θ T(x) A(θ)) and Prx pθ[x S]

Published as a conference paper at ICLR 2025

is the normalization constant. Then the random variable T(x) under this truncated distribution is sub-exponential. More specifically, it belongs to the family SE(K2, β) where K2 = L + log2 1

α , β 1 = inf u =1 sup{γ : γu + θ Θ}.

The result for PSGD we will use is the following. Its proof is standard, see, .e.g, Shalev-Shwartz & Ben-David (2014) or Appendix B for a proof.

Theorem 3.12 (Convergence of Projected Stochastic Gradient Descent). Consider the Projected Stochastic Gradient Descent (PSGD) algorithm for minimizing a convex function f over a convex set K. Assume the following:

1. Lipschitz Continuous Gradients: There exists a constant M > 0 such that for all x and y, f(x) f(y) M x y . 2. Bounded Variance: The variance of the stochastic gradients is bounded, i.e., there exists a constant b2 such that E[ g(t) 2] b2, where g(t) = f(xt).

Then, for w = 1

T PT i=1 w(i) where the {w(i)} are generated by the PSGD algorithm with step size

ht = 1 Mt, it holds E [f( w)] f(w ) w(0) w 2

2MT + b2π2 12M 2T .

To apply Theorem 3.12 we first show that our unbiased gradients have bounded second moment. The proof can be found in Appendix B.

Lemma 3.13 (Bounded variances of stochastic gradients ). Let T(z) : Rd 7 Rk be the sufficient statistic of a truncated exponential family p S θ . We have that v = T(z) T(x), where z p S θ and x p S θ is an unbiased estimator of LS(θ). Moreover, v has bounded second moment, namely, E[ v 2] = Ez p S θ Ex p S θ T(z) T(x) 2 k L log6(1/α) .

3.4 THE PROOF OF THEOREM 3.3

We are now ready to prove our main result. Apply Theorem 3.12 (PSGD) to the optimization problem Equation (3) for f = LS and w(0) = θ0, with projecting region D = {θ : L(θ)

L(θ0) + log 1

α }. Here, θ0 = 1 N PN i=1 T(xi) for N O k L3

α . We also set

θ0 = Ex p S θ [T(x)]. Then, by applying Lemma 3.7 with ϵ = ϵ L, we obtain θ0 θ0 2 ϵ L. Observe now, that by the L smoothness of L and the fact that L(θ0) = 0, we have the bound L(θ) L r for all θ {θ : θ θ0 r}, hence |L(θ0) L(θ0)| ϵ.

Furthermore, by Corollary 3.10 we get θ D {θ : L(θ) L(θ0) log( 1

α) + ϵ}, and pθ(S) α2e ϵ for all θ D. By strong convexity of L(θ), we have that λ

2 θ0 θ 2 2 O log 1

Hence, Theorem 3.12 in our setup gives an upper bound C log(1/α) 2(L+log2(1/α))T + bπ2

12(L+log2(1/α))2T . So, for

ϵ 6C log2(1/α)(L+log2(1/α))2+π2

12(L+log2(1/α))2 , Theorem 3.12 gives us an θ such that E f(θ) f(θ ) ϵ.

From Markov s inequality, we get Pr f(θ) f(θ ) 3ϵ 1

3 We can easily amplify this probability by repeating this process independently and hence obtaining a sequence of θ1, θ2, . . . , θm and then choosing θ = argminθi f(θi). Set ˆθ := θ and observe that

Pr h f(ˆθ) f(θ ) 3ϵ i 1

3 m. Hence by choosing m log(δ)/ log(1/3), we obtain a ˆθ that

satisfies Pr h f(ˆθ) f(θ ) 3ϵ i δ. To accomplish this, however, we need access to the value

of LS(si) = L(si) + log(Prsi[S]). Since we have access to S only through its oracle, in order to calculate Prwi[S], we use concentration for a Bernoulli random variable. More specifically, by Hoeffding s inequality, we need O 1

α samples to estimate log (Prwi[S]), ϵ-close with probability at least 1 δ.

Therefore, we obtain a ˆθ with high probability that achieves high precision, namely LS(ˆθ) LS(θ ) < ϵ or equivalently KL(p S θ p S ˆθ ) < ϵ.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0, funded by the European Union under the Next Generation EU Program.

Takeshi Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pp. 997 1016, 1973.

N Balakrishnan and Erhard Cramer. The art of progressive censoring. Springer, 2014.

Daniel Bernoulli. Essai d une nouvelle analyse de la mortalit e caus ee par la petite v erole, et des avantages de l inoculation pour la pr evenir. Histoire de l Acad., Roy. Sci.(Paris) avec Mem, pp. 1 45, 1760.

Arnab Bhattacharyya, Rathin Desai, Sai Ganesh Nagarajan, and Ioannis Panageas. Efficient statistics for sparse graphical models from truncated samples. In International Conference on Artificial Intelligence and Statistics, pp. 1450 1458. PMLR, 2021.

M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proceedings of STOC 2017, pp. 47 60, 2017.

A Clifford Cohen. Truncated and censored samples: theory and applications. CRC press, 2016.

Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Efficient statistics, in high dimensions, from truncated samples. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 639 649, 2018. URL https: //api.semanticscholar.org/Corpus ID:52189557.

Constantinos Daskalakis, Themis Gouleakis, Chistos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. In the 32nd Annual Conference on Learning Theory (COLT), 2019.

Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. Ar Xiv, abs/2010.12000, 2020a. URL https://api.semanticscholar.org/Corpus ID:147702805.

Constantinos Daskalakis, Dhruv Rohatgi, and Emmanouil Zampetakis. Truncated linear regression in high dimensions. Advances in Neural Information Processing Systems, 33:10338 10347, 2020b.

Constantinos Daskalakis, Vasilis Kontonis, Christos Tzamos, and Emmanouil Zampetakis. A statistical taylor theorem and extrapolation of truncated densities. In Conference on learning theory, pp. 1395 1398. PMLR, 2021.

I. Diakonikolas, D. M. Kane, and A. Stewart. Robust learning of fixed-structure bayesian networks. Co RR, abs/1606.07384, 2016.

I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp. 999 1008, 2017.

I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 2019.

RA Fisher. Properties and applications of hh functions. Mathematical tables, 1:815 852, 1931.

Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos. Efficient parameter estimation of truncated boolean product distributions. In Conference on Learning Theory, pp. 1586 1600. PMLR, 2020.

Published as a conference paper at ICLR 2025

Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379-387): 310 315, 1897.

Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62:310 315, 1898. doi: 10.1098/rspl.1897.0115. URL http://doi.org/10.1098/rspl.1897.0115.

Vassilis A Hajivassiliou and Daniel L Mc Fadden. The method of simulated scores for the estimation of ldv models. Econometrica, pp. 863 896, 1998.

Jerry A Hausman and David A Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pp. 919 938, 1977.

James J Heckman. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pp. 153 161, 1979.

P. J. Huber and E. M. Ronchetti. Robust statistics. Wiley New York, 2009.

Andrew Ilyas, Emmanouil Zampetakis, and Constantinos Daskalakis. A theoretical and practical framework for regression and classification from truncated samples. In International Conference on Artificial Intelligence and Statistics, 2020. URL https://api.semanticscholar. org/Corpus ID:220095790.

Michael P Keane. Simulation estimation for panel data models with limited dependent variables. Elsevier, 1993.

A. R. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, pp. 1420 1430, 2018.

V. Kontonis, C. Tzamos, and M. Zampetakis. Efficient truncated statistics with unknown truncation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1578 1595. IEEE, 2019.

K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of FOCS 16, 2016.

Jane Lee, Andre Wibisono, and Manolis Zampetakis. Learning exponential families from truncated samples. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Pxc WJq O3qj.

Gangadharrao S Maddala. Limited-dependent and qualitative variables in econometrics. Cambridge university press, 1986.

Sai Ganesh Nagarajan and Ioannis Panageas. On the Analysis of EM for truncated mixtures of two Gaussians. In 31st International Conference on Algorithmic Learning Theory (ALT), pp. 955 960, 2019.

Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2 7, 1902a.

Karl Pearson. On the systematic fitting of curves to observations and measurments: Part ii. Biometrika, 2(1):1 23, 1902b.

Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59 68, 1908a.

Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59 68, 1908b.

Helmut Schneider. Truncated and censored samples from normal populations. Marcel Dekker, Inc., 1986.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014. ISBN 1107057132.

Published as a conference paper at ICLR 2025

James Tobin. Estimation of relationships for limited dependent variables. Econometrica: journal of the Econometric Society, pp. 24 36, 1958.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices, 2010.

A COMPARISON WITH PREVIOUS WORK

In this appendix, we present the main results from prior research on learning truncated distributions, specifically those Daskalakis et al. (2018) on truncated Gaussian distributions and Lee et al. (2023) on truncated exponential families. We then compare these results with ours to highlight improvements in oracle complexity and dependence on the truncation parameter α.

A.1 RESULTS FROM DASKALAKIS ET AL. (2018)

Daskalakis et al. (2018) investigated the problem of learning a Gaussian distribution truncated to a measurable set. Their main result can be summarized as follows: Theorem A.1 (Learning Truncated Gaussians Daskalakis et al. (2018)). Let q = N(µ , Σ , S) be a Gaussian distribution truncated to a measurable set S Rd, with mean µ Rd and covariance Σ Rd d. Set α = Px N(µ ,Σ )[x S]. Then, there exists an algorithm that:

Requires a number of samples M = d2

ϵ2 poly(log(1/α)) and makes M α poly(log(1/α)) oracle calls to the set indicator 1S( ). Runs in poly(M, d, α poly(log(1/α))).

The algorithm computes estimates ˆµ and ˆΣ such that, with high probability,

d TV (N(ˆµ, ˆΣ), N(µ , Σ )) < ϵ

The super-polynomial dependence on 1

α arises because the mass assigned to the truncation set S within the optimization domain used in their analysis becomes super-polynomially small.

A.2 RESULTS FROM LEE ET AL. (2023)

Lee et al. (2023) extended the analysis to general exponential families truncated to a measurable set S. Their main result is as follows: Theorem A.2 (Learning Truncated Exponential Families Lee et al. (2023)). Let F be a class of log-concave exponential families that satisfy Definition 1.1 and their sufficient statistic T(x) of each pθ F is a polynomial of degree ℓthat maps Rd to Rk. Suppose pθ is the target distribution, and q = p S θ is its truncation to a set S. Set α = Px pθ [x S]. Then, there exists an algorithm that:

Requires a number of samples M = k ϵ2 poly(log(1/α)) and makes M poly(α log(1/α)) oracle calls to the set indicator 1S( ). Runs in poly(M, k, α log(1/α)).

The algorithm outputs an estimate ˆθ such that, with high probability, d TV (pˆθ, pθ ) ϵ.

Again, the super-polynomial dependence on 1

α is due to the small probability mass assigned to S, which adversely impacts the optimization landscape.

A.3 COMPARISON AND DISCUSSION

The primary difference between these prior results and our work lies in the dependence on the truncation parameter α. In both Daskalakis et al. (2018) and Lee et al. (2023), the algorithms require a number of oracle calls, and computational time that are super-polynomial in 1

α. This is mainly because the probability mass assigned to the truncation set S as it becomes small it negatively affects the smoothness and strong convexity properties needed for efficient optimization. In contrast, our algorithms achieve similar estimation guarantees while requiring a number of calls to the indicator of the truncation set 1S that depends polynomially on 1

α. Specifically, we ensure that for all parameters θ in the domain D of our algorithm, the probability mass assigned to S satisfies

Px pθ[x S] poly(α).

Published as a conference paper at ICLR 2025

This lower bound prevents the mass from becoming negligibly small as α decreases. Consequently, the strong convexity constant of the truncated negative log-likelihood LS(θ) (Equation (3)) depends polynomially on α, enabling efficient optimization.

A.4 CONCLUSION

By carefully analyzing the distribution of the mass assigned to the truncation set S, we have developed algorithms that reduce the dependence on α from super-polynomial to polynomial in 1

α. This improvement makes our methods more efficient for small values of α by optimizing the computational landscape and reducing the complexity of implementing the required oracle.

B OMMITED PROOFS

B.1 THE PROOF OF LEMMA 3.11

The truncated exponential family distribution can be written as:

p S θ (x) = h S(x) exp(θ T(x) AS(θ)),

h S(x) = h(x)1S(x) is the modified base measure, zero outside the set S, AS(θ) = A(θ) + log pθ(S) is the modified log-partition function reflecting truncation. Define the truncated log-partition function:

AS(θ) = log Z

S h(x) exp(θ T(x)) dx.

The expected value and covariance under truncation are given by:

µS = AS(θ),

Covt p S θ (T(x)) = 2AS(θ).

Consider the moment generating function (MGF): To simplify the calculations deal with the pushforward measure induced by the mapping x T(x) and we denote the corresponding density as p S θ (t) = h S(t) exp(θt AS(θ)).

Et p S θ [eγu (t µS)] = e γu µS ZS(γu + θ)

and to establish subexponentiality, we need to establish an inequality of the form:

ZS(θ) e γu µS eγ2K2

for γ such that both θ + γu and θ + γu belong in our parameter space. Using the L + log2 1

α - smooth property of AS(θ) we find:

AS(γu + θ) AS(θ) γu µS + L + log2 1

Et p S θ [eγu (t µS)] = e γu µS ZS(γu + θ)

ZS(θ) exp γ2 L + log2 1

Therefore, T(x) is SE(K2, β) with

K2 = L + log2 1

, β 1 = min (sup{γ : γu + θ Θ}, sup{γ : γu + θ Θ}) .

Published as a conference paper at ICLR 2025

B.2 THE PROOF OF LEMMA 3.5

First, for convenience, suppose that pθ(S) = α. To establish the bound, we will show that for any vector u Rd, the following holds: Ex p S θ [u T T(x)] Ex pθ[u T T(x)] O log 1

Indeed, for all vectors w and all norms , there is a linear functional f such that f(w) = w . Since every linear functional is of the form w u T w, we can be sure that, after choosing the proper u, we can get a bound for any norm that may be of use. Define CR = {x : u T T(x) > R}. We change the measure in the first expectation, so we can compare them more readily, namely,

Ex p S θ [u T T(x)] Ex pθ[u T T(x)]

p u T T(x) Ex pθ u T T(x)

u T T(x)1CR

u T T(x)1CRc

u T T(x)1CR + Ex pθ

u T T(x)1CR

αEx pθ u T T(x)1CR + Ex pθ

αEx pθ u T T(x)1CR + R

It remains to work on the term Ex pθ u T T(x)1CR :

Ex pθ u T T(x)1CR = Ex pθ u T T(x)|CR pθ(CR)

= Ex pθ log(exp u T T(x)|CR) pθ(CR)

= log Ex pθ[exp u T T(x)|CR)] pθ(CR)

log Ex pθ[exp u T T(x)1CR]) 1 pθ(CR)

log Ex pθ[exp u T T(x)1CR]) 1 pθ(CR)

log Ex pθ[exp u T T(x)1CR]) pθ(CR) log (pθ(CR)) pθ(CR)

We switch our focus to the term log Ex pθ[exp u T T(x)1CR]

log Ex pθ[exp u T T(x)1CR)] 1

2 log Ex pθ exp(2u T T(x)) Ex pθ[1CR)]

2 log Ex pθ exp(2u T T(x)) log (Ex pθ[1CR)])

2 log Ex pθ exp(2u T T(x)) log (pθ(CR))

2 log exp 4L + 2u T E x pθ[T(x] log (pθ(CR))

2 4L + 2u T Ex pθ[T(x)] log (pθ(CR))

Published as a conference paper at ICLR 2025

For convenience, we set pθ(CR) = c(R), and by putting everything together, we obtain Ex p S θ [u T T(x)] Ex pθ[u T T(x)] 1

2α 4L + 2u T Epθ[T(x)] log (c(R)) c(R)

α log(c(R)) c(R) + R

Since u T T(x) is subexponential, we have that pθ(CR) C exp( c LR). Where the constant C = O (log Ex pθ[p(x)]), which is paid since p(x) is not centered. Substituting back into the original inequality, we get a bound of the form

α exp ( c L R) + R.

We used the big O notation to suppress any constants, which we may eliminate by paying only log(constant). Therefore, since the above holds true for all R, we may minimize it or choose an R that is satisfactory for our purposes. We may choose R = O(log(1/α)). Therefore,

α exp ( c L R) + R = O log 1

so we conclude the first part. For the second part we use a similar analysis. To establish the bound, we will bound the corresponding quadratic form of the matrix Covx p S θ (T(x)). It suffices to bound Ex p S θ u T T(x) T T (x)u Ex pθ u T T(x) T T (x)u .

To simplify the expression we set u T T(x) = p(x) where p(x) is a polynomial of degree k. Ex p S θ u T T(x) T T (x)u Ex pθ u T T(x) T T (x)u

= Ex p S θ p2 Ex pθ p2(x)

αEx pθ p21CR + R2

The term Ex pθ p2(x)1CR can be bounded using Cauchy-Schwarz as follows

Ex pθ p2(x)1CR 2 Ex pθ p4(x) Ex pθ [1CR] C exp( c LR),

where C is a constant. To obtain the constant C, we use Proposition 2.3 (b), which gives

Ex pθ p4 (4K2)4

Substituting back into the original inequality, we get a bound of the form

We used the big O notation to suppress any constants, which we may eliminate by paying only log(constant). Therefore, since the above holds true for all R, we may minimize it or choose an R that is satisfactory for our purposes. We may choose R = O(log2(1/α)). Therefore,

2 R + R2 = O log2 1

so we conclude.

Published as a conference paper at ICLR 2025

B.3 THE PROOF OF THEOREM 1.5

Proof. We aim to show that the sequence {w(t)} generated by the algorithm converges to an optimal point w . We use the non-expansiveness of the projection operator. Namely, the projection operator onto a convex set is non-expansive, meaning for any x and y, ΠK(x) ΠK(y) x y .

Using the convexity and smoothness properties, and the non-expansiveness of the projection operator, we can see the progress achieved after each iteration.

w(t) w 2 w(t 1) htg(t) w 2. Expanding the right-hand side, we get:

w(t) w 2 w(t 1) w 2 2ht g(t), w(t 1) w + h2 t g(t) 2.

Since f is convex, we have: f(w(t 1)) f(w ) f(w(t 1)), w(t 1) w . Taking expectations conditioned on w(t 1), and noting that E g(t) | w(t 1) = f(w(t 1)), we get: E h f(w(t 1)) i f(w ) E h f(w(t 1)), w(t 1) w i .

Combining everything together, taking expectations, and summing over t = 1 to T,

t=1 E h f(w(t 1)) i Tf(w )

t=1 E h f(w(t 1)), w(t 1) w i

t=1 E h g(t), w(t 1) w i

t=1 E h w(t 1) w 2 w(t) w 2i + h2 t E g(t) 2 .

Using the boundedness of E g(t) 2 b, and that ht = 1 Lt, we have:

t=1 h2 t g(t) 2 b

t=1 E h f(w(t 1)) i f(w ) w(0) w 2

Finally, from convexity we obtain

t=1 f(w(t 1))

t=1 E h f(w(t 1)) i ,

hence we conclude.

B.4 THE PROOF OF LEMMA 3.13

Let v denote the output of the rejection sampling procedure to find an unbiased estimate of the gradient. We have E[ v 2]

= E z p S θ E x p S θ

T(z) [T(x)] 2

= E z p S θ E x p S θ

T(z) 2 2T(z) T(x) + [T(x)] 2

= Tr(Cov[T(z)]) + (E[ T(z) ])2 + Tr(Cov[T(x)]) + (E[ T(x) ])2 2 E[T(z)], E[T(x)]

= Tr(Cov[T(z)]) + Tr(Cov[T(x)]) + Ex p S θi[T(z)] Ex p S θ [T(x)] 2

Published as a conference paper at ICLR 2025

k(L + log 1

α ) + k L + O λ 1(L + log2 1

Ex p S θi[T(z)] Ex p S θ [T(x)] = LS(θi) LS(θ )

O L + log2 1

O λ 1 L + log2 1

where we used Lemma 3.5 in combination with the assumption of strong convexity of L(θ) and the upper bound of the smoothness of LS(θ).

C PROOF FOR THEOREM 3.4

To minimize the number of queries to the oracle, our approach cautiously progresses towards the optimal parameter θ . Specifically, we incrementally explore increasingly larger convex sets, ensuring each set assigns sufficient probability mass to the set S. To this end, for a given threshold L, we formulate and solve the following constrained optimization problem:

θ = arg min θ LS(θ)

s.t. L(θ) L (6)

Proposition C.1 (Progress of the Truncated Loss along Sublevel Sets). Suppose θ m is the minimizer of the optimization problem {minθ LS(θ) : s.t. L(θ) L} for L := L + m, where m N. Moreover, let l be the smallest integer, such that θ is the solution of the same constrained minimization problem for L := L + l. Then, for m l 1

(a) LS(θ m) LS(θ m+1) LS(θ m) LS(θ )

l m (b) so, particularly, for all m, LS(θ m) LS(θ m+1) LS(θ m) LS(θ ) log( 1

Proof of Proposition C.1

Proof. Suppose u = θ θ m θ θ m 2 . Then, the functions

g(t) = LS(θ m + tu), w(t) = L(θ m + tu),

are convex. Furthermore, the functions g and w are decreasing and increasing, respectively, for t [0, θ θ m 2]. Denote by tk the points such that w(tk) = L + mb + kb, where k = 0, . . . , l k 1 and tl m = θ θ m 2. Since the function w is convex and increasing, the length of the line segments [tk 1, tk] is decreasing. Also, since g is convex and decreasing,

g(tk) g(tk 1) g(tk 1) g(tk 2), for m = 1, . . . , l.

Consequently,

LS(θ m) LS(θ ) = g(0) g( θ θ m 2)

i=1 (g(ti 1) g(ti))

(l m) (g(t0) g(t1))

(l m) LS(θ m) LS(θ m+1) .

Hence, we divide by l m and obtain the first part. Also, since l m log 1

α , we conclude.

Published as a conference paper at ICLR 2025

C.1 ALGORITHMS INVOLVED IN COMPUTING THE OUTPUT ˆθ WITH PROPERTIES FROM THEOREM 3.4

In this subsection, we include the algorithms used to obtain the output ˆθ, which satisfies the properties outlined in the statement of Theorem 3.4. The purpose is twofold: first, to simplify the exposition of the proof of Theorem 3.4; and second, to make the implementation of the procedures more accessible for reproduction.

The first Algorithm 1 generates the desired output, ˆθ, while incorporating all subsequent algorithms. The algorithm terminates at line 4 when the successive outputs for Equation (6), corresponding to the sublevel sets L1 and L2 (with L2 L1 = g), differ by no more than ϵ . The parameter ϵ is chosen to be small enough such that, at termination, we are within ϵ of the minimizer of LS(θ). This comparison is justified by Proposition C.1, which compares the successive distances to the distance from the global minimum.

Algorithm 1 Algorithm for Minimizing L-Constrained Function

1: for L {Lmin + 1, Lmin + 2, . . . , Lmin + log( 1

α) + 2 } do 2: Execute projected stochastic gradient descent Algorithm 2 {For sublevel set L} 3: Take the output θ L of Algorithm 2 and calculate LS (θ L) 4: if |LS θ L 1 LS (θ L) | ϵ then 5: Stop and return θ L 6: end if 7: end for

Next, we have the projected stochastic gradient descent (PSGD) algorithm that is performed at each sublevel set.

Algorithm 2 Projected SGD Algorithm Given Truncated Samples and Sublevel Set

Require: L, {xi}n i=1, each xi p S θ , Initial θ0 Rk, where θ0 = 1 n Pn i=1 T(xi), step size parameter L 1: for t = 1, . . . , N do 2: Sample x(t) from the data distribution 3: ht 1 Lt 4: g(t) Sample Gradient(θt, x(t)) 5: θt+1 θt htg(t)

6: Project θt+1 onto {θ : L(θ) L} 7: end for 8: return θN

Finally, we have the process where rejection sampling is performed.

Algorithm 3 Sample Gradient

Require: x, θ

1: while True do 2: Sample z pθ 3: if MS(z) = 1 where MS is the membership oracle then 4: return T(z) T(x) 5: end if 6: end while

C.2 THE PROOF OF THEOREM 3.4

Proof. The parameter θ that will satisfy the properties of the theorem will be the output of the Algorithm 1. However, before handling this, we first show that there is a way to amplify the probability

Published as a conference paper at ICLR 2025

of the output θ L as it appears on line 3, by running the PSGD independently and then choosing the best output among all the trials. Apply Theorem 3.12 (PSGD) to the optimization problem in Equation (6) for the sublevel sets L = Lmin+m+ϵS, where m [0, log 1

α +2], Lmin = L Ex p S θ [T(x)] , θ0 = 1

N PN i=1 T(xi),

and θ0 Ex p S θ [T(x)] 2 ϵS.

Choose ϵS = ϵ L, which holds for N O k L3

α . Given the L-smoothness of L

and the fact that L(Ex p S θ [T(x)]) = 0, we obtain the bound L(θ) L r for all θ {θ : θ Ex p S θ [T(x)] r}.

Therefore, with this choice of ϵS, we have L(θ0) Lmin ϵ, and thus the following bound holds:

w(0) w 2 = θ0 θ 2 2

λ (L(θ ) L(θ0)) O 1

Hence, Theorem 3.12 in our setup gives an upper bound:

C log2(1/α) 2(L + log(1/α))T + bπ2

12(L + log(1/α))2T

C log2(1/α) + 6(L + log(1/α))bπ2

12(L + log(1/α))2

Theorem 3.12 gives us a θ such that:

E f(θ) f(θ ) ϵ.

From Markov s inequality, we get:

Pr f(θ) f(θ ) 3ϵ 1

3 We can easily amplify this probability by repeating this process independently and hence obtaining a sequence of w1, w2, . . . , wm, and then choosing:

Since θ := w satisfies:

Pr f(θ) f(θ ) 3ϵ 1

by choosing m log(δ)/ log(1/3), we obtain an w that satisfies:

Pr f(θ) f(θ ) 3ϵ δ.

To accomplish this, however, we need access to the value of:

LS(wi) = L(wi) + log Prx pwi[x S] .

Since we have access to S only through its oracle, in order to calculate Prx pwi[x S], we use concentration for a Bernoulli random variable. More specifically, by Hoeffding s inequality, we need O 1

α samples to estimate log Prx pwi[x S] ϵ-close with probability at least 1 δ.

Therefore, for each sublevel set Lmin + m + ϵS, when we perform PSGD, we have an output θm such that, with high probability, it achieves high precision:

LS(θm) LS(θ m) < ϵ

where θ m is the solution to the optimization problem Equation (6) for the sublevel set Lmin+m+ϵS. Suppose Algorithm 1 is terminated when the if statement on line 4 is verified. Then, the θ L generated on line 3 satisfies LS(θ L 1) LS(θ L) ϵ . Fix ϵ > 0, and set ϵ = ϵ log(1/α). We claim that for this choice of ϵ , Algorithm 1 gives output θf with high probability, such that LS(θf) LS(θ ) ϵ.

Published as a conference paper at ICLR 2025

Observe that Algorithm 1 terminates. By Corollary 3.10, and we have

θ L(θ) Lmin log 1

L(θ) L(θ0) log 1

Consequently, there are at least two distinct sublevel sets within the range of Algorithm 1 that contain θ . Therefore, for each of these optimization problems, if we run PSGD with an accuracy of ϵ = ϵ

2 , the stopping criterion on line 4 of Algorithm 1 will be triggered, as both sublevel sets approximate the same minimizer with high probability and accuracy ϵ

2 . Now that we have established that Algorithm 1 terminates we distinguish two cases, namely θ is either in {θ : L(θ) L(θ0) m} or it is not. Suppose the first case, i.e. θ {θ : L(θ) L(θ0) m}. Since θ is the minimizer of the convex function LS(θ), and it is inside the optimization domain {θ : L(θ) L(θ0) m} it implies that θm the output of the PSGD satisfies LS(θm) LS(θ ) ϵ . We deal now with the other case, i.e. when θ / {θ : L(θ) L(θ0) m}. From Proposition C.1

LS(θ m) LS(θ m+1) LS(θ m) LS(θ )

And suppose we get θf = θm+1. From our assumption, recall that

|LS(θm+1) LS(θm)| ϵ ,

where ϵ is our threshold value for stopping Algorithm 1.

LS(θ m) LS(θ m+1) |LS(θm) LS(θm+1)| + 2ϵ log(1/α)

ϵ + 2ϵ log(1/α)

Therefore, by our choice of ϵ , ϵ = ϵ log(1/α), we get that

LS(θ m) LS(θ )

α 3ϵ log(1/α).

LS(θf) LS(θ ) LS(θ m) LS(θ ) 3ϵ.

In the previous case, when termination, of Algorithm 1, occurred at line 4, we obtained a suitable output by virtue of identifying a threshold value ϵ so small that if the difference in the values of the approximated minima between two successive sets was bounded by ϵ , then even if we had continued our search, our progress in terms of reducing the value of LS would have been negligible. Finally, it remains to show that throughout the process, we always maintain a linear mass in terms of α. This follows directly, since upon termination of Algorithm 1, the output parameter θ lies in the sublevel L = L(θ0) + m, which satisfies L L(θ ) + ϵS + 2. Therefore, by Observation 3.8, we conclude that all parameters θ accessed by Algorithm 1 possess at least αe ϵS 2 mass.