# Truncated Linear Regression in High Dimensions

Constantinos Daskalakis (MIT, costis@mit.edu), Dhruv Rohatgi (MIT, drohatgi@mit.edu), Manolis Zampetakis (MIT, mzampet@mit.edu)

As in standard linear regression, in truncated linear regression we are given access to observations $(A_i, y_i)_i$ whose dependent variable equals $y_i = A_i^T x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest and $\eta_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some truncation set $S \subseteq \mathbb{R}$. The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log n)/m})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) Daskalakis et al. [9], where no regularization is needed due to the low dimensionality of the data, and (2) Wainwright [27], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high dimensionality at the same time, we develop new techniques that not only generalize the existing ones but, we believe, are of independent interest.

## 1 Introduction

In the vanilla linear regression setting, we are given $m \geq n$ observations of the form $(A_i, y_i)$, where $A_i \in \mathbb{R}^n$, $y_i = A_i^T x^* + \eta_i$, $x^*$ is some unknown coefficient vector that we wish to recover, and $\eta_i$ is random noise, independent and identically distributed across observations $i$. Under favorable conditions on the $A_i$'s and the distribution of the noise, it is well known that $x^*$ can be recovered to within $\ell_2$ reconstruction error $O(\sqrt{n/m})$.

The classical model and its associated guarantees might, however, be inadequate to address many situations which frequently arise in both theory and practice. We focus on two common and widely studied deviations from the standard model. First, it is often the case that $m \ll n$, i.e. the number of observations is much smaller than the dimension of the unknown vector $x^*$. In this underdetermined regime, it is fairly clear that a non-trivial reconstruction of the underlying $x^*$ is impossible, since there are infinitely many $x \in \mathbb{R}^n$ such that $A_i^T x = A_i^T x^*$ for all $i = 1, \ldots, m$. To sidestep this impossibility, we must exploit additional structural properties that we might know $x^*$ satisfies. One such property might be sparsity, i.e. that $x^*$ has $k \ll n$ non-zero coordinates. Linear regression under sparsity assumptions has been widely studied, motivated by applications such as model selection and compressed sensing; see e.g. the celebrated works of [24, 6, 11, 27] on this topic. It is known, in particular, that a $k$-sparse $x^*$ can be recovered to within $\ell_2$ error $O(\sqrt{k \log n / m})$, when the $A_i$'s are drawn from the standard multivariate Normal, or satisfy other favorable conditions [27]. The recovery algorithm solves a least squares optimization problem with $\ell_1$ regularization, i.e. what is called LASSO optimization in Statistics, in order to reward sparsity.
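For the untruncated sparse setting just described, the following is a minimal numerical sketch of LASSO recovery using a plain proximal-gradient (ISTA) loop. The problem sizes, regularization level, and iteration count are illustrative choices of ours, not taken from [27].

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 200, 5                       # ambient dimension, sample size, sparsity
x_star = np.zeros(n)
x_star[rng.choice(n, size=k, replace=False)] = 1.0
A = rng.standard_normal((m, n))              # rows A_i ~ N(0, 1)^n, no truncation here
y = A @ x_star + rng.standard_normal(m)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 2.0 * np.sqrt(np.log(n) / m)           # illustrative choice, lambda ~ sqrt(log(n)/m)
L = np.linalg.norm(A, 2) ** 2 / m            # Lipschitz constant of the gradient of (1/2m)||Ax - y||^2
x = np.zeros(n)
for _ in range(500):                         # ISTA: gradient step on the smooth part, then prox
    grad = A.T @ (A @ x - y) / m
    x = soft_threshold(x - grad / L, lam / L)

print(np.linalg.norm(x - x_star))            # error should be on the order of sqrt(k*log(n)/m)
```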
Another common deviation from the standard model is the presence of truncation. Truncation occurs when the sample $(A_i, y_i)$ is not observed whenever $y_i$ falls outside of a subset $S \subseteq \mathbb{R}$. Truncation arises quite often in practice as a result of saturation of measurement devices, bad data collection practices, incorrect experimental design, and legal or privacy constraints which might preclude the use of some of the data. Truncation is known to affect linear regression in counter-intuitive ways, as illustrated in Fig. 1, which shows the linear fits obtained via least squares regression before and after truncating the data based on the value of the response variable. More broadly, it is well understood that naive statistical inference using truncated data commonly leads to bias. Accordingly, a long line of research in Statistics and Econometrics has strived to develop regression methods that are robust to truncation [25, 1, 15, 19, 16, 5, 14]. This line of work falls into the broader field of Truncated Statistics [23, 8, 2], which finds its roots in the early works of [3], [13], [20, 21], and [12].

Figure 1: Truncation in one-dimensional linear regression, along with the linear fit obtained via least squares regression before and after truncation.
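To make the bias caused by truncation concrete (in the spirit of Figure 1), the following small simulation, which is ours and purely illustrative, fits ordinary least squares once on all samples and once only on the samples whose response survives the truncation $y \geq 0$; the slope and intercept of the truncated fit are visibly biased.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5000
a = rng.standard_normal(m)                     # one-dimensional covariate
y = 2.0 * a + rng.standard_normal(m)           # true slope 2, intercept 0

def ols(a, y):
    """Least-squares fit of y ~ slope * a + intercept."""
    X = np.column_stack([a, np.ones_like(a)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

keep = y >= 0.0                                # truncation: only samples with y in S = [0, inf) are observed
print("full data fit:     ", ols(a, y))              # close to [2, 0]
print("truncated data fit:", ols(a[keep], y[keep]))  # noticeably biased
```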
Despite this voluminous work, computationally and statistically efficient methods for truncated linear regression have only recently been obtained in [9], where it was shown that, under favorable assumptions about the $A_i$'s, the truncation set $S$, and assuming the $\eta_i$'s are drawn from a Gaussian, the negative log-likelihood of the truncated sample can be optimized efficiently, and approximately recovers the true parameter vector with an $\ell_2$ reconstruction error of $O(\sqrt{n/m})$.

Our contribution. In this work, we solve the general problem, addressing both of the aforementioned challenges together. Namely, we provide efficient algorithms for the high-dimensional ($m \ll n$) truncated linear regression problem. This problem is very common, including in compressed sensing applications with measurement saturation, as studied e.g. in [10, 17]. Under standard conditions on the design matrix and the noise distribution (namely that the $A_i$'s and $\eta_i$'s are sampled from independent Gaussian distributions before truncation), and under mild assumptions on the truncation set $S$ (roughly, that it permits a constant fraction of the samples $y_i = A_i^T x^* + \eta_i$ to survive truncation), we show that the SGD algorithm on the truncated LASSO optimization program, our proposed adaptation of the standard LASSO optimization to accommodate truncation, is a computationally and statistically efficient method for recovering $x^*$, attaining an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log n)/m})$, where $k$ is the sparsity of $x^*$. We formally state the model and assumptions in Section 2, and our result in Section 3.

### 1.1 Overview of proofs and techniques

The problem that we solve in this paper encompasses the two difficulties of the problems considered in: (1) Wainwright [27], which tackles the problem of high-dimensional sparse linear regression with Gaussian noise, and (2) Daskalakis et al. [9], which tackles the problem of truncated linear regression. The tools developed in those papers do not suffice to solve our problem, since each difficulty interferes with the other. Hence, we introduce new ideas and develop interesting new tools that allow us to bridge the gap between [27] and [9]. We begin our overview with a brief description of the approaches of [27, 9], and subsequently outline the additional challenges that arise in our setting and how we address them.

Wainwright [27] uses as an estimator the solution of the regularized least squares program, also called LASSO, to handle the high dimensionality of the data. Wainwright then uses a primal-dual witness method to bound the number of samples that are needed in order for the solution of the LASSO program to be close to the true coefficient vector $x^*$. The computational task is not discussed in detail in Wainwright [27], since the objective function of the LASSO program is very simple and standard convex optimization tools can be used.

Daskalakis et al. [9] use as their estimator the solution to the log-likelihood maximization problem. In contrast to [27], their convex optimization problem takes a very complicated form due to the presence of truncation, which introduces an intractable log-partition function term in the log-likelihood. The main idea of Daskalakis et al. [9] to overcome this difficulty is identifying a convex set $D$ such that: (1) it contains the true coefficient vector $x^*$, (2) their objective function is strongly convex inside $D$, (3) inside $D$ there exists an efficient rejection sampling algorithm to compute unbiased estimates of the gradients of their objective function, (4) the norm of the stochastic gradients inside $D$ is bounded, and (5) projecting onto $D$ is efficient. These five properties are essentially what they need to prove that SGD with projection set $D$ converges quickly to a good estimate of $x^*$.

Our reconstruction algorithm is inspired by both [27] and [9]. We formulate our optimization program as the $\ell_1$-regularized version of the negative log-likelihood function in the truncated setting, which we call truncated LASSO. In particular, our objective contains an intractable log-partition function term. Our proof then consists of two parts. First, we show statistical recovery, i.e. we upper bound the number of samples that are needed for the solution of the truncated LASSO program to be close to the true coefficient vector $x^*$. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are the two seminal approaches that we mentioned: the primal-dual witness method for statistical recovery in high dimensions, and the projected SGD method for efficient maximum likelihood estimation in the presence of truncation.

Unfortunately, these two techniques are not a priori compatible to stitch together. Roughly speaking, the technique of [9] relies heavily on the very carefully chosen projection set $D$ in which the SGD is restricted, as we explained above. This projection set cannot be used in high dimensions because it effectively requires knowing the low-dimensional subspace in which the true solution lies. The projection set was the key to the aforementioned nice properties: strong convexity, efficient gradient estimation, and bounded gradients. In its absence, we need to deal with each of these issues individually. The primal-dual witness method of [27] cannot be applied directly in our setting either: in our case the gradient of the truncated LASSO does not have a nice closed form, and hence finding the correct way to construct the primal-dual witness requires a more delicate argument. Our proof manages to overcome all these issues.
In a nutshell, the architecture of our full proof is the following.

1. Optimality on the low-dimensional subspace. The first thing that we need to prove is that the optimum of the truncated LASSO program, when restricted to the low-dimensional subspace defined by the non-zero coordinates of $x^*$, is close to the true solution. This step of the proof was unnecessary in [9] due to the lack of regularization in their objective, and was trivial in [27] due to the simple loss function, i.e. regularized least squares.

2. Optimality on the high-dimensional space. We prove that the optimum of the truncated LASSO program in the low-dimensional subspace is also optimal for the whole space. This step is done using the primal-dual witness method as in [27]. However, in our case the expression of the gradient is much more complicated due to the very convoluted objective function. Hence, we find a more general way to prove this step that does not rely on the exact expression of the gradient.

These two steps of the proof suffice to upper bound the number of samples that we need to recover the coefficient vector $x^*$ via the truncated LASSO program. Next, we provide a computationally efficient method to solve the truncated LASSO program.

3. Initialization of SGD. The first step of our algorithm for solving the truncated LASSO is finding a good initial point for the SGD. This was unnecessary in [27] due to the simple objective, and in [9] due to the existence of the projection set $D$ (where efficient projection onto $D$ immediately gave an initial point). We propose the simple answer of bootstrapping: start with the solution of the $\ell_1$-regularized ordinary least squares program. This is a biased estimate, but we show it is good enough for initialization.

4. Projection of SGD. Next, we need to choose a projection set to make sure that projected SGD (PSGD) converges. The projection set chosen in [9] is not helpful in our case unless we a priori know the set of non-zero coordinates of $x^*$. Hence, we define a different, simpler set which admits efficient projection algorithms. As a necessary side effect, in contrast to [9], our set cannot guarantee many of the important properties that we need to prove fast convergence of SGD.

5. Lack of strong convexity and gradient estimation. Our different projection set cannot guarantee the strong convexity and efficient gradient estimation enjoyed in [9]. There are two problems here. First, we know that PSGD converges to a point with small loss, but why must that point be near the optimum? Since strong convexity fails in high dimensions, this is not clear. We provide a workaround to resolve this issue that can be applied to other regularized programs with stochastic access to the gradient function. Second, computing unbiased estimates of the gradient is now difficult. The prior work employed rejection sampling, but in our setting this may take exponential time. For this reason we provide a more explicit method for estimating the gradient much faster, whenever the truncation set is reasonably well-behaved.

An important tool that we leverage repeatedly in our analysis, and have not mentioned above, is a strong isometry property for our measurement matrix, which has truncated Gaussian rows. Similar properties have been explored in the compressed sensing literature for matrices with i.i.d. Gaussian and sub-Gaussian entries [26]. We refer to Section 5 for a more detailed overview of the proofs of our main results.
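As a quick empirical illustration of the isometry property mentioned above, the sketch below (our own; the truncation set $S = [0, \infty)$ and the dimensions are arbitrary choices) generates a design with truncated Gaussian rows, following the sampling process described in Section 2, and inspects the extreme singular values of random $k$-column submatrices of $A/\sqrt{m}$, which should stay within constant bounds.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 400, 200, 5
x_star = np.zeros(n)
x_star[:k] = 1.0 / np.sqrt(k)                 # an arbitrary k-sparse unit vector

rows = []
while len(rows) < m:                          # keep only samples whose response lands in S = [0, inf)
    a = rng.standard_normal(n)
    if a @ x_star + rng.standard_normal() >= 0.0:
        rows.append(a)
A = np.array(rows)

smallest, largest = [], []
for _ in range(200):                          # random k-column submatrices of A / sqrt(m)
    U = rng.choice(n, size=k, replace=False)
    s = np.linalg.svd(A[:, U] / np.sqrt(m), compute_uv=False)
    smallest.append(s[-1]); largest.append(s[0])
print(min(smallest), max(largest))            # empirically bounded away from 0 and from infinity
```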
Organization. Alphabetic sections (A, B, C, etc.) are located in the supplementary material.

## 2 High-dimensional truncated linear regression model

Notation. Let $Z \sim N(0, 1)$ refer to a standard normal random variable. For $t \in \mathbb{R}$ and measurable $S \subseteq \mathbb{R}$, let $Z_t \sim N(t, 1; S)$ refer to the truncated normal $(Z + t) \mid Z + t \in S$. Let $\mu_t = \mathbb{E}[Z_t]$. Let $\gamma_S(t) = \Pr[Z + t \in S]$. Additionally, for $a, x \in \mathbb{R}^n$ let $Z_{a,x}$ refer to $Z_{a^T x}$ (or $Z_{ax}$ if $a$ is a row vector), and let $\gamma_S(a, x)$ refer to $\gamma_S(a^T x)$. For a matrix $A \in \mathbb{R}^{m \times n}$, let $Z_{A,x} \in \mathbb{R}^m$ be the random vector with $(Z_{A,x})_j = Z_{A_j, x}$. For sets $I \subseteq [m]$ and $J \subseteq [n]$, let $A_{I,J}$ refer to the submatrix $[A_{i,j}]_{i \in I, j \in J}$. For $i \in [m]$ we treat the row $A_i$ as a row vector. In a slight abuse of notation, we will often write $A_U$ (or sometimes, $A_V$); this will always mean $A_{[m],U}$. By $A_U^T$ we mean $(A_{[m],U})^T$. For $x \in \mathbb{R}^n$, define $\mathrm{supp}(x)$ to be the set of indices $i \in [n]$ such that $x_i \neq 0$.

Let $x^* \in \mathbb{R}^n$ be the unknown parameter vector which we are trying to recover. We assume that it is $k$-sparse; that is, $\mathrm{supp}(x^*)$ has cardinality at most $k$. Let $S \subseteq \mathbb{R}$ be a measurable subset of the real line. The main focus of this paper is the setting of Gaussian noise: we assume that we are given $m$ truncated samples $(A_i, y_i)$ generated by the following process:

1. Pick $A_i \in \mathbb{R}^n$ according to the standard normal distribution $N(0, 1)^n$.
2. Sample $\eta_i \sim N(0, 1)$ and compute $y_i$ as
$$y_i = A_i x^* + \eta_i. \tag{1}$$
3. If $y_i \in S$, then return the sample $(A_i, y_i)$. Otherwise restart the process from step 1.

We also briefly discuss the setting of arbitrary noise, in which $\eta_i$ may be arbitrary and we are interested in approximations to $x^*$ whose guarantees are bounded in terms of $\|\eta\|_2$. Together, the $m$ samples define a pair $(A, y)$ where $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$. We make the following assumptions about the set $S$.

Assumption I (Constant Survival Probability). Taking the expectation over vectors $a \sim N(0, 1)^n$, we have $\mathbb{E}[\gamma_S(a, x^*)] \geq \alpha$ for a constant $\alpha > 0$.

Assumption II (Efficient Sampling). There is a $T(\gamma_S(t))$-time algorithm which takes input $t \in \mathbb{R}$ and produces an unbiased sample $z \sim N(t, 1; S)$.

We do not require that $T(\cdot)$ is a constant, but it will affect the efficiency of our algorithm. To be precise, our algorithm will make $\mathrm{poly}(n)$ queries to the sampling algorithm. As we explain in Lemma K.4 in Section K, if the set $S$ is a union of $r$ intervals $\bigcup_{i=1}^r [a_i, b_i]$, then Assumption II is satisfied with $T(\gamma_S(t)) = \mathrm{poly}(\log(1/\gamma_S(t)), r)$. We express the theorems below under the assumption that $S$ is a union of $r$ intervals, in which case the algorithms have polynomial running time, but all the statements below can be replaced with the more general Assumption II, with the running time changing from $\mathrm{poly}(n, r)$ to $\mathrm{poly}(n, T(e^{-m}/\alpha))$.
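For concreteness, the sketch below implements Process (1) together with an inverse-CDF sampler for $N(t, 1; S)$ when $S$ is a union of intervals, in the spirit of Assumption II. The helper names are ours, and the careful treatment of exponentially small survival probabilities in Lemma K.4 (needed for the formal running-time guarantee) is omitted.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def sample_truncated_normal(t, intervals):
    """Draw Z_t ~ N(t, 1; S) for S given as a list of intervals [(a_1, b_1), ..., (a_r, b_r)]."""
    a = np.array([lo for lo, hi in intervals])
    b = np.array([hi for lo, hi in intervals])
    lo_cdf, hi_cdf = norm.cdf(a - t), norm.cdf(b - t)
    masses = hi_cdf - lo_cdf                    # probability mass of N(t, 1) in each interval
    i = rng.choice(len(intervals), p=masses / masses.sum())
    u = rng.uniform(lo_cdf[i], hi_cdf[i])       # inverse-CDF sampling within the chosen interval
    return t + norm.ppf(u)

def truncated_samples(x_star, m, intervals):
    """Process (1): draw (A_i, y_i) and keep the sample only if y_i lands in S."""
    n = len(x_star)
    A, y = np.zeros((m, n)), np.zeros(m)
    for i in range(m):
        while True:
            a_i = rng.standard_normal(n)
            y_i = a_i @ x_star + rng.standard_normal()
            if any(lo <= y_i <= hi for lo, hi in intervals):
                A[i], y[i] = a_i, y_i
                break
    return A, y

# example usage with S = [-1, 0.5] union [2, inf); np.inf is handled by norm.cdf
A, y = truncated_samples(np.array([1.0, 0.0, -0.5]), m=100,
                         intervals=[(-1.0, 0.5), (2.0, np.inf)])
z = sample_truncated_normal(0.3, [(-1.0, 0.5), (2.0, np.inf)])
```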
## 3 Statistically and computationally efficient recovery

In this section we formally state our main results for recovery of a sparse high-dimensional coefficient vector from truncated linear regression samples. In Section 3.1, we present our result under the standard assumption that the error distribution is Gaussian, whereas in Section 3.2 we present our results for the case of adversarial error.

### 3.1 Gaussian noise

In the setting of Gaussian noise (before truncation), we prove the following theorem.

Theorem 3.1. Suppose that Assumption I holds, and that we have $m$ samples $(A_i, y_i)$ generated from Process (1), with $n \geq m \geq M k \log n$ for a sufficiently large constant $M$. Then, there is an algorithm which outputs $\bar{x}$ satisfying $\|\bar{x} - x^*\|_2 \leq O(\sqrt{(k \log n)/m})$ with probability $1 - 1/n - O(k \exp(-m^{1/5}))$. Furthermore, if the survival set $S$ is a union of $r$ intervals, the running time of our algorithm is $\mathrm{poly}(n, r)$.

From now on, we will use the term "with high probability" when the rate of decay is not of importance. This phrase will always mean with failure probability no worse than $O(1/n) + k \exp(-\Omega(m^{1/5}))$. Observe that even without the added difficulty of truncation (e.g. if $S = \mathbb{R}$), sparse linear regression requires $\Omega(k \log n)$ samples by known information-theoretic arguments [27]. Thus, our sample complexity is information-theoretically optimal.

In one sentence, the algorithm optimizes the $\ell_1$-regularized sample negative log-likelihood via projected SGD. The negative log-likelihood of $x \in \mathbb{R}^n$ for a single sample $(a, y)$ is
$$\mathrm{nll}(x; a, y) = \frac{1}{2}(a^T x - y)^2 + \log \int_S e^{-(a^T x - z)^2/2} \, dz.$$
Given multiple samples $(A, y)$, we can then write $\mathrm{nll}(x; A, y) = \frac{1}{m} \sum_{j=1}^m \mathrm{nll}(x; A_j, y_j)$. We also define the regularized negative log-likelihood $f : \mathbb{R}^n \to \mathbb{R}$ by $f(x) = \mathrm{nll}(x; A, y) + \lambda \|x\|_1$. We claim that optimizing the following program approximately recovers the true parameter vector $x^*$ with high probability, for sufficiently many samples and appropriate regularization $\lambda$:
$$\min_{x \in \mathbb{R}^n} \; \mathrm{nll}(x; A, y) + \lambda \|x\|_1. \tag{2}$$

The first step is to show that any solution to Program (2) will be near the true solution $x^*$. To this end, we prove the following proposition, which already shows that $O(k \log n)$ samples are sufficient to solve the problem of statistical recovery of $x^*$:

Proposition 3.2. Suppose that Assumption I holds. There are constants $\kappa$, $d$, and $\sigma$ (throughout this paper, constants may depend on $\alpha$) with the following property. Suppose that $m > \kappa k \log n$, and let $(A, y)$ be $m$ samples drawn from Process (1). Let $\hat{x}$ be any optimal solution to Program (2) with regularization constant $\lambda = \sigma \sqrt{(\log n)/m}$. Then $\|\hat{x} - x^*\|_2 \leq d \sqrt{(k \log n)/m}$ with high probability.

Then it remains to show that Program (2) can be solved efficiently. Theorem 3.1 follows immediately from the concatenation of Propositions 3.2 and 3.3.

Proposition 3.3. Suppose that Assumption I holds, let $(A, y)$ be $m$ samples drawn from Process (1), and let $\hat{x}$ be any optimal solution to Program (2). There exists a constant $M$ such that if $m \geq M k \log n$, then there is an algorithm which outputs $\bar{x} \in \mathbb{R}^n$ satisfying $\|\bar{x} - \hat{x}\|_2 \leq O(\sqrt{(k \log n)/m})$ with high probability. Furthermore, if the survival set $S$ is a union of $r$ intervals, the running time of our algorithm is $\mathrm{poly}(n, r)$.

We present a more detailed description of the algorithm that we use in Section 4.

### 3.2 Adversarial noise

In the setting of arbitrary noise, optimizing the negative log-likelihood no longer makes sense, and indeed our results from the setting of Gaussian noise no longer hold. However, we may apply results from compressed sensing which describe sufficient conditions on the measurement matrix for recovery to be possible in the face of arbitrary bounded error. We obtain the following theorem:

Theorem 3.4. Suppose that Assumption I holds and let $\epsilon > 0$. There are constants $c$ and $M$ such that if $m \geq M k \log n$, $\|A x^* - y\|_2 \leq \epsilon$, and $\hat{x}$ minimizes $\|x\|_1$ over the region $\{x \in \mathbb{R}^n : \|A x - y\|_2 \leq \epsilon\}$, then $\|\hat{x} - x^*\|_2 \leq c \epsilon / \sqrt{m}$.

The proof is a corollary of our result that $A$ satisfies the Restricted Isometry Property from [6] with high probability, even when we only observe truncated samples; see Corollary G.6 and the subsequent discussion in Section G. The remainder of the paper is dedicated to the case where the noise is Gaussian before truncation.
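When $S$ is a union of intervals $\bigcup_i [a_i, b_i]$, the log-partition term in $\mathrm{nll}$ can be evaluated through the standard Gaussian CDF $\Phi$, since $\int_S e^{-(u - z)^2/2} \, dz = \sqrt{2\pi} \sum_i \left(\Phi(b_i - u) - \Phi(a_i - u)\right)$ for $u = a^T x$. The sketch below (our own, numerically naive helper functions) evaluates the truncated LASSO objective of Program (2) directly; it is only meant to make the objective concrete, not to reflect the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def log_partition(u, intervals):
    """log of the integral over S of exp(-(u - z)^2 / 2) dz, for S a union of intervals."""
    mass = sum(norm.cdf(hi - u) - norm.cdf(lo - u) for lo, hi in intervals)
    return 0.5 * np.log(2 * np.pi) + np.log(mass)

def nll(x, A, y, intervals):
    """Average truncated negative log-likelihood nll(x; A, y)."""
    u = A @ x
    return np.mean(0.5 * (u - y) ** 2 + np.array([log_partition(ui, intervals) for ui in u]))

def truncated_lasso_objective(x, A, y, intervals, lam):
    """Objective of Program (2): nll plus the l1 penalty."""
    return nll(x, A, y, intervals) + lam * np.abs(x).sum()
```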
## 4 The efficient estimation algorithm

Define $E_r = \{x \in \mathbb{R}^n : \|Ax - y\|_2 \leq r \sqrt{m}\}$. To solve Program (2), our algorithm is Projected Stochastic Gradient Descent (PSGD) with projection set $E_r$, for an appropriate constant $r$ (specified in Lemma 5.4). We pick an initial feasible point by computing $x^{(0)} = \arg\min_{x \in E_r} \|x\|_1$. Subsequently, the algorithm performs $N$ updates, where $N = \mathrm{poly}(n)$. Define a random update to $x^{(t)} \in \mathbb{R}^n$ as follows. Pick $j \in [m]$ uniformly at random. Sample $z^{(t)} \sim Z_{A_j, x^{(t)}}$. Then set
$$v^{(t)} := A_j^T (z^{(t)} - y_j) + \lambda \, \mathrm{sign}(x^{(t)}), \qquad w^{(t)} := x^{(t)} - \frac{1}{\sqrt{nN}} v^{(t)}, \qquad x^{(t+1)} := \arg\min_{x \in E_r} \|x - w^{(t)}\|_2.$$
Finally, the algorithm outputs $\bar{x} = \frac{1}{N} \sum_{t=0}^{N-1} x^{(t)}$. See Section 5.2 for the motivation of this algorithm, and a proof sketch of correctness and efficiency. Section E contains a summary of the complete algorithm in pseudocode.

## 5 Overview of proofs and techniques

This section outlines our techniques. The first step, outlined in Section 5.1, is proving Proposition 3.2. The second step, outlined in Section 5.2, is proving Proposition 3.3, by showing that the algorithm described in Section 4 efficiently recovers an approximate solution to Program (2).

### 5.1 Statistical recovery

Our approach to proving Proposition 3.2 is the Primal-Dual Witness (PDW) method introduced in [27]. Namely, we are interested in showing that the solution of Program (2) is near $x^*$ with high probability. Let $U$ be the (unknown) support of the true parameter vector $x^*$. Define the following (hypothetical) program in which the solution space is restricted to vectors with support in $U$:
$$\arg\min_{x \in \mathbb{R}^n : \, \mathrm{supp}(x) \subseteq U} \; \mathrm{nll}(x; A, y) + \lambda \|x\|_1. \tag{3}$$
In the untruncated setting, the PDW method is to apply the subgradient optimality condition to the solution $\tilde{x}$ of this restricted program, which is by definition sparse. Proving that $\tilde{x}$ satisfies subgradient optimality for the original program implies that the original program has a sparse solution $\tilde{x}$, and under mild extra conditions $\tilde{x}$ is the unique solution. Thus, the original program recovers the true basis. We use the PDW method for a slightly different purpose; we apply subgradient optimality to $\tilde{x}$ to show that $\|\tilde{x} - x^*\|_2$ is small, and then use this to prove that $\tilde{x}$ solves the original program.

Truncation introduces its own challenges. While Program (2) is still convex [9], it is much more convoluted than ordinary least squares. In particular, the gradient and Hessian of the negative log-likelihood have the following form (see Section C for the proof).

Lemma 5.1. For all $(A, y)$, the gradient of the negative log-likelihood is $\nabla \, \mathrm{nll}(x; A, y) = \frac{1}{m} \sum_{j=1}^m A_j^T (\mathbb{E} Z_{A_j, x} - y_j)$. The Hessian is $H(x; A, y) = \frac{1}{m} \sum_{j=1}^m A_j^T A_j \, \mathrm{Var}(Z_{A_j, x})$.

We now state the two key facts which make the PDW method work. First, the solution $\tilde{x}$ to the restricted program must have a zero subgradient in all directions in $\mathbb{R}^U$. Second, if this subgradient can be extended to all of $\mathbb{R}^n$, then $\tilde{x}$ is optimal for the original program. Formally:

Lemma 5.2. Fix any $(A, y)$. Let $\tilde{x}$ be an optimal solution to Program (3).
(a) There is some $\tilde{z}_U \in \mathbb{R}^U$ such that $\|\tilde{z}_U\|_\infty \leq 1$ and $\lambda \tilde{z}_U = -\frac{1}{m} A_U^T (\mathbb{E} Z_{A, \tilde{x}} - y)$.
(b) Extend $\tilde{z}$ to $\mathbb{R}^n$ by defining $\lambda \tilde{z}_{U^c} = -\frac{1}{m} A_{U^c}^T (\mathbb{E} Z_{A, \tilde{x}} - y)$. If $\|\tilde{z}_{U^c}\|_\infty < 1$, and $A_U^T A_U$ is invertible, then $\tilde{x}$ is the unique optimal solution to Program (2).

See Section C for the proof. The utility of this lemma is in reducing Proposition 3.2 to showing that, with high probability over $(A, y)$, the following conditions both hold:

1. $\|\tilde{x} - x^*\|_2$ is small (Theorem A.7).
2. $\|\tilde{z}_{U^c}\|_\infty < 1$ (Theorem B.3) and $A_U^T A_U$ is invertible (corollary of Theorem G.1).

In Section A, we prove Condition (1) by dissecting the subgradient optimality condition. In Section B we then prove Condition (2) and complete the proof of Proposition 3.2.
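The quantities in Lemma 5.2 can be computed explicitly when $S$ is a union of intervals, since $\mathbb{E}[Z_t] = t + \left(\sum_i (\varphi(a_i - t) - \varphi(b_i - t))\right) / \left(\sum_i (\Phi(b_i - t) - \Phi(a_i - t))\right)$, with $\varphi$ and $\Phi$ the standard Gaussian density and CDF. The sketch below (helper names are ours; sign conventions follow our statement of Lemma 5.2, and this is an illustration rather than the paper's procedure) forms the candidate dual witness on $U^c$ for a given candidate solution and reports its $\ell_\infty$ norm.

```python
import numpy as np
from scipy.stats import norm

def truncated_mean(t, intervals):
    """E[Z_t] for Z_t ~ N(t, 1; S), with S a union of intervals (infinite endpoints allowed)."""
    a = np.array([lo for lo, hi in intervals])
    b = np.array([hi for lo, hi in intervals])
    num = (norm.pdf(a - t) - norm.pdf(b - t)).sum()
    den = (norm.cdf(b - t) - norm.cdf(a - t)).sum()
    return t + num / den

def dual_witness_sup_norm(x_cand, U, A, y, lam, intervals):
    """||z_{U^c}||_inf from Lemma 5.2(b), for a candidate solution x_cand of Program (3).

    A value < 1 (together with invertibility of A_U^T A_U) certifies that x_cand is the
    unique optimal solution of Program (2)."""
    m, n = A.shape
    means = np.array([truncated_mean(t, intervals) for t in A @ x_cand])
    z = -(A.T @ (means - y)) / (lam * m)      # lambda * z = -(1/m) A^T (E[Z_{A,x}] - y)
    Uc = np.setdiff1d(np.arange(n), U)
    return np.abs(z[Uc]).max()
```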
### 5.2 Computational recovery

To prove Proposition 3.3, we need to show how to efficiently solve Program (2), i.e. minimize $f(x) = \mathrm{nll}(x; A, y) + \lambda \|x\|_1$. The gradient of $\mathrm{nll}(x; A, y)$ does not have a closed form, but it can be written cleanly as an expectation:
$$\nabla \, \mathrm{nll}(x; A, y) = \frac{1}{m} \sum_{j=1}^m A_j^T (\mathbb{E} Z_{A_j, x} - y_j).$$
Let us assume that $Z_{A_j, x}$ can be sampled efficiently (Assumption II). Then we may hope to optimize the convex function $f(x)$ by stochastic gradient descent. But problematically, in our high-dimensional setting $f$ is nowhere strongly convex. Convex optimization results which do not require strong convexity come with several strings attached:

Theorem 5.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function achieving its optimum at $x^\dagger \in \mathbb{R}^n$. Let $x^{(0)}, x^{(1)}, \ldots, x^{(N)}$ be a sequence of random vectors in $\mathbb{R}^n$ such that $x^{(i+1)} = x^{(i)} - \eta v^{(i)}$, where $\mathbb{E}[v^{(i)} \mid x^{(i)}] \in \partial f(x^{(i)})$. Set $\bar{x} = \frac{1}{N} \sum_{i=1}^N x^{(i)}$. Then
$$\mathbb{E}[f(\bar{x})] - f(x^\dagger) \leq (\eta N)^{-1} \, \mathbb{E}\|x^{(0)} - x^\dagger\|_2^2 + \frac{\eta}{N} \sum_{i=1}^N \mathbb{E}\|v^{(i)}\|_2^2.$$

In particular, to use this result to show that SGD converges to a good estimate of $\hat{x}$ in a polynomial number of iterations, we need to solve three technical problems:

1. We need to efficiently find an initial point $x^{(0)}$ with bounded distance from $\hat{x}$.
2. The gradient does not have bounded norm for arbitrary $x \in \mathbb{R}^n$. Thus, we need to pick a projection set in which the bound holds, and project to this set at each SGD iteration. If $\hat{x}$ lies in the projection set, then the guarantees of Theorem 5.3 still hold (see Theorem H.1).
3. Since $f$ is not strongly convex, we need some other way of converting the bound on $f(\bar{x}) - f(\hat{x})$ into a bound on $\|\bar{x} - \hat{x}\|_2$.

As defined in Section 4, our chosen projection set is $E_r = \{x \in \mathbb{R}^n : \|Ax - y\|_2 \leq r \sqrt{m}\}$, for an appropriate constant $r > 0$. To pick an initial point in $E_r$, we solve the program $x^{(0)} = \arg\min_{x \in E_r} \|x\|_1$. This estimate is biased due to the truncation, but the key point is that, by classical results from compressed sensing, it has bounded distance from $x^*$ (and therefore from $\hat{x}$). The algorithm then consists of projected stochastic gradient descent with projection set $E_r$. To bound the number of update steps required for the algorithm to converge to a good estimate of $\hat{x}$, we need to solve several statistical problems (which are direct consequences of the assumptions in Theorem 5.3).

Properties of $E_r$. First, $\hat{x}$ must be feasible and a bounded distance from the initial point (due to the high dimensionality, $E_r$ is unbounded, so this is not immediate). The following lemmas formalize this; see Sections J.11 and J.12 for the respective proofs. Lemma 5.4 specifies the choice of $r$.

Lemma 5.4. With high probability, $\hat{x} \in E_r$ for an appropriate constant $r > 0$.

Lemma 5.5. With high probability, $\|x^{(0)} - \hat{x}\|_2 \leq O(1)$.

Second, the SGD updates at points within $E_r$ must be unbiased estimates of the gradient, with bounded square-norm in expectation. The following lemma shows that the updates $v^{(t)}$ defined in Section 4 satisfy this property. See Section J.13 for the proof.

Lemma 5.6. With high probability over $A$, the following statement holds. Let $0 \leq t < N$. Then $\mathbb{E}[v^{(t)} \mid x^{(t)}] \in \partial f(x^{(t)})$, and $\mathbb{E}[\|v^{(t)}\|_2^2] \leq O(n)$.
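Before turning to the third problem, here is a compact sketch of the PSGD loop of Section 4. It is a simplification under several assumptions of ours: the truncation set is hard-coded to $S = [0, \infty)$ with a naive rejection sampler standing in for Assumption II, the projection onto $E_r$ is computed by a Lagrangian bisection, the initial point is a crude stand-in for the $\ell_1$-minimization initialization, and the step size is an illustrative choice rather than the one from the analysis.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_truncated(t, threshold=0.0):
    """Naive rejection sampler for N(t, 1; S) with S = [threshold, inf);
    fine for a sketch, but not the efficient sampler of Assumption II / Section F."""
    while True:
        z = t + rng.standard_normal()
        if z >= threshold:
            return z

def project_Er(x, A, y, r):
    """Euclidean projection onto E_r = {x : ||Ax - y||_2 <= r * sqrt(m)}, via bisection
    on the Lagrange multiplier of the constraint."""
    m, n = A.shape
    target = r * np.sqrt(m)
    if np.linalg.norm(A @ x - y) <= target:
        return x
    AtA, Aty = A.T @ A, A.T @ y
    z_of = lambda mu: np.linalg.solve(np.eye(n) + mu * AtA, x + mu * Aty)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(A @ z_of(hi) - y) > target:    # grow the multiplier until feasible
        hi *= 2.0
    for _ in range(60):                                 # bisection: the residual is monotone in mu
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if np.linalg.norm(A @ z_of(mid) - y) > target else (lo, mid)
    return z_of(hi)

def psgd(A, y, lam, r, N=2000, threshold=0.0):
    """PSGD on the truncated LASSO objective, following Section 4 (simplified)."""
    m, n = A.shape
    x = project_Er(np.zeros(n), A, y, r)     # crude stand-in for argmin ||x||_1 over E_r
    eta = 1.0 / np.sqrt(n * N)               # illustrative step size
    avg = np.zeros(n)
    for _ in range(N):
        j = rng.integers(m)                                  # pick a sample uniformly at random
        z = sample_truncated(A[j] @ x, threshold)            # z ~ Z_{A_j, x}
        v = A[j] * (z - y[j]) + lam * np.sign(x)             # stochastic (sub)gradient of f
        x = project_Er(x - eta * v, A, y, r)
        avg += x / N
    return avg                               # average of the iterates
```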
Addressing the lack of strong convexity. Third, we need to show that the algorithm converges in parameter space and not just in loss. That is, from the preceding bounds it follows that the algorithm produces an estimate $x$ for which $f(x) - f(\hat{x})$ is small, but we want to show that $\|x - \hat{x}\|_2$ is small as well. Note that $f$ is not strongly convex even in $E_r$, due to the high dimension, so we need a more careful approach. In the subspace $\mathbb{R}^U$, $f$ is indeed strongly convex near $\hat{x}$, as shown in the following lemma (proof in Section J.14):

Lemma 5.7. There is a constant $\zeta$ such that, with high probability over $A$, $f(x) - f(\hat{x}) \geq \frac{\zeta}{2} \|x - \hat{x}\|_2^2$ for all $x \in \mathbb{R}^n$ with $\mathrm{supp}(x) \subseteq U$ and $\|x - \hat{x}\|_2 \leq 1$.

But we need a bound on all of $\mathbb{R}^n$, since the SGD iterates in general may be dense vectors. The idea is to prove a lower bound on $f(x) - f(\hat{x})$ for $x$ near $\hat{x}$, and then use convexity to scale the bound linearly in $\|x - \hat{x}\|_2$. The above lemma provides a lower bound for $x$ near $\hat{x}$ if $\mathrm{supp}(x) \subseteq U$, and we need to show that adding an $\mathbb{R}^{U^c}$-component to $x$ increases $f$ proportionally. This is precisely the content of Theorem B.4. Putting these pieces together, we obtain the following lemma. See Section J.15 for the full proof.

Lemma 5.8. There are constants $c_f > 0$ and $c_f'$ such that, with high probability over $A$, the following holds. Let $x \in \mathbb{R}^n$. If $f(x) - f(\hat{x}) \leq c_f (\log n)/m^3$, then $\|x - \hat{x}\|_2^2 \leq c_f' \frac{m}{\log n} \left(f(x) - f(\hat{x})\right)$.

Convergence of PSGD. It now follows from the above lemmas and Theorem 5.3 that the PSGD algorithm, as outlined above and described in Section 4, converges to a good approximation of $\hat{x}$ in a polynomial number of updates. The following theorem formalizes the guarantee. See Section J.16 for the proof.

Theorem 5.9. With high probability over $(A, y)$ and over the execution of the algorithm, we get $\|\bar{x} - \hat{x}\|_2 \leq O(\sqrt{(k \log n)/m})$.

Efficient implementation. Finally, in Section F we prove that initialization and each update step are efficient. Efficient gradient estimation in the projection set (i.e. sampling $Z_{A_j, x}$) does not follow from the prior work, since our projection set is by necessity laxer than that of the prior work [9]. So we replace their rejection sampling procedure with a novel approximate sampling procedure, under mild assumptions about the truncation set. Together with the convergence bound claimed in Theorem 5.9, these prove Proposition 3.3.

Proof of Proposition 3.3. The correctness guarantee follows from Theorem 5.9. For the efficiency guarantee, note that the algorithm performs initialization and then $N = \mathrm{poly}(n)$ update steps. By Section F, the initialization takes $\mathrm{poly}(n)$ time, and each update step takes $\mathrm{poly}(n) + T(e^{-\Theta(m)})$. This implies the desired bounds on overall time complexity.

## Broader impact

Our contribution is mathematical and computational in nature. We show that truncated linear regression is computationally and statistically efficiently solvable even in the sparse, high-dimensional setting. Our advances in statistical inference with truncation present opportunities for decreasing the bias of models that are trained on biased data. Biased data is a real and significant problem in applications of Machine Learning, as well as in broader data science applications and scientific reasoning. There are many types of bias, and their source and type might be known or unknown. In this paper, we focus on bias coming from systematic data truncation, which is an important type of bias, as discussed in the introduction. In extending prior work on this topic to high dimensions, we address another growing challenge, namely the dimensionality of the data which must now be processed. As usual, in interpreting our methodological advancements and using our algorithms in practice, we must recognize that the theory is only as good as the model: we make several assumptions (e.g. Gaussian measurements and noise), and the practitioner must judge whether these assumptions are justified in their application.

## Acknowledgments

This research was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), by the DARPA award HR00111990021, by the MIT Undergraduate Research Opportunities Program, and by a Google PhD Fellowship.
## References

[1] Takeshi Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pages 997–1016, 1973.

[2] N. Balakrishnan and Erhard Cramer. The art of progressive censoring. Springer, 2014.

[3] Daniel Bernoulli. Essai d'une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l'inoculation pour la prévenir. Histoire de l'Acad. Roy. Sci. (Paris) avec Mém., pages 1–45, 1760.

[4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[5] Richard Breen et al. Regression models: Censored, sample selected, or truncated data, volume 111. Sage, 1996.

[6] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[7] Sylvain Chevillard. The functions erf and erfc computed with arbitrary precision and explicit error bounds. Information and Computation, 216:72–95, 2012.

[8] A. Clifford Cohen. Truncated and censored samples: theory and applications. CRC Press, 2016.

[9] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. In the 32nd Conference on Learning Theory (COLT), 2019.

[10] Mark A. Davenport, Jason N. Laska, Petros T. Boufounos, and Richard G. Baraniuk. A simple proof that random matrices are democratic. arXiv preprint arXiv:0911.0736, 2009.

[11] David L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.

[12] R. A. Fisher. Properties and applications of Hh functions. Mathematical Tables, 1:815–852, 1931.

[13] Francis Galton. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379–387):310–315, 1897.

[14] Vassilis A. Hajivassiliou and Daniel L. McFadden. The method of simulated scores for the estimation of LDV models. Econometrica, pages 863–896, 1998.

[15] Jerry A. Hausman and David A. Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pages 919–938, 1977.

[16] Michael P. Keane. 20 Simulation estimation for panel data models with limited dependent variables. 1993.

[17] Jason N. Laska. Democracy in action: Quantization, saturation, and compressive sensing. PhD thesis, 2010.

[18] Tengyu Ma and Avi Wigderson. Sum-of-squares lower bounds for sparse PCA. In Advances in Neural Information Processing Systems, pages 1612–1620, 2015.

[19] Gangadharrao S. Maddala. Limited-dependent and qualitative variables in econometrics. Number 3. Cambridge University Press, 1986.

[20] Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2–7, 1902.

[21] Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.

[22] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576–1602. World Scientific, 2010.

[23] Helmut Schneider. Truncated and censored samples from normal populations. Marcel Dekker, Inc., 1986.
[24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[25] James Tobin. Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, pages 24–36, 1958.

[26] Vladislav Voroninski and Zhiqiang Xu. A strong restricted isometry property, with an application to phaseless compressed sensing. Applied and Computational Harmonic Analysis, 40(2):386–395, 2016.

[27] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.