# Linear regression without correspondence

Daniel Hsu, Columbia University, New York, NY. djhsu@cs.columbia.edu
Kevin Shi, Columbia University, New York, NY. kshi@cs.columbia.edu
Xiaorui Sun, Microsoft Research, Redmond, WA. xiaoruisun@cs.columbia.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d. draws from a standard multivariate normal distribution, an efficient algorithm based on lattice basis reduction is shown to exactly recover the unknown linear function in arbitrary dimension. Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.

1 Introduction

Consider the problem of recovering an unknown vector $\bar w \in \mathbb{R}^d$ from noisy linear measurements when the correspondence between the measurement vectors and the measurements themselves is unknown. The measurement vectors (i.e., covariates) from $\mathbb{R}^d$ are denoted by $x_1, x_2, \dots, x_n$; for each $i \in [n] := \{1, 2, \dots, n\}$, the $i$-th measurement (i.e., response) $y_i$ is obtained using $x_{\bar\pi(i)}$:

$$y_i = \bar w^\top x_{\bar\pi(i)} + \varepsilon_i, \quad i \in [n]. \qquad (1)$$

Above, $\bar\pi$ is an unknown permutation on $[n]$, and the $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are unknown measurement errors. This problem, which has been called unlabeled sensing [22], linear regression with an unknown permutation [18], and linear regression with shuffled labels [1], arises in many settings; see the aforementioned references for more details. In short, sensing limitations may create ambiguity in, or even completely lose, the ordering of measurements. The problem is also interesting because the missing correspondence turns an otherwise well-understood problem into one with very different computational and statistical properties.

Prior works. Unnikrishnan et al. [22] study conditions on the measurement vectors that permit recovery of any target vector $\bar w$ under noiseless measurements. They show that when the entries of the $x_i$ are drawn i.i.d. from a continuous distribution and $n \ge 2d$, then almost surely, every vector $\bar w \in \mathbb{R}^d$ is uniquely determined by noiseless correspondence-free measurements as in (1). (Under noisy measurements, it is shown that $\bar w$ can be recovered when an appropriate signal-to-noise ratio tends to infinity.) It is also shown that $n \ge 2d$ is necessary for such a guarantee that holds for all vectors $\bar w \in \mathbb{R}^d$.

Pananjady et al. [18] study statistical and computational limits on recovering the unknown permutation $\bar\pi$. On the statistical front, they consider necessary and sufficient conditions on the signal-to-noise ratio $\mathrm{SNR} := \|\bar w\|_2^2/\sigma^2$ when the measurement errors $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from the normal distribution $\mathrm{N}(0, \sigma^2)$ and the measurement vectors $(x_i)_{i=1}^n$ are i.i.d. draws from the standard multivariate normal distribution $\mathrm{N}(0, I_d)$. Roughly speaking, exact recovery of $\bar\pi$ is possible via maximum likelihood when $\mathrm{SNR} \ge n^c$ for some absolute constant $c > 0$, and approximate recovery is impossible for any method when $\mathrm{SNR} \le n^{c'}$ for some other absolute constant $c' > 0$.
On the computational front, they show that the least squares problem

$$\min_{\pi, \, w \in \mathbb{R}^d} \ \sum_{i=1}^n \left( y_i - w^\top x_{\pi(i)} \right)^2, \qquad (2)$$

where the minimum is over permutations $\pi$ on $[n]$ and vectors $w \in \mathbb{R}^d$ (and which is equivalent to the maximum likelihood problem), given arbitrary $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $y_1, y_2, \dots, y_n \in \mathbb{R}$, is NP-hard when $d = \Omega(n)$,¹ but admits a polynomial-time algorithm (in fact, an $O(n \log n)$-time algorithm based on sorting) when $d = 1$.

Abid et al. [1] observe that the maximum likelihood estimator can be inconsistent for estimating $\bar w$ in certain settings (including the normal setting of Pananjady et al. [18], with SNR fixed but $n \to \infty$). One of the alternative estimators they suggest is consistent under additional assumptions in dimension $d = 1$. Elhami et al. [8] give an $O(dn^{d+1})$-time algorithm that, in dimension $d = 2$, is guaranteed to approximately recover $\bar w$ when the measurement vectors are chosen in a very particular way from the unit circle and the measurement errors are uniformly bounded.

Contributions. We make progress on both computational and statistical aspects of the problem.

1. We give an approximation algorithm for the least squares problem from (2) that, for any given $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0,1)$, returns a solution with objective value at most $1+\epsilon$ times that of the minimum in time $(n/\epsilon)^{O(d)}$. This is a fully polynomial-time approximation scheme for any constant dimension.

2. We give an algorithm that exactly recovers $\bar w$ in the measurement model from (1), under the assumption that there are no measurement errors and the covariates $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$. The algorithm, which is based on a reduction to a lattice problem and employs the lattice basis reduction algorithm of Lenstra et al. [16], runs in $\mathrm{poly}(n, d)$ time when the covariate vectors $(x_i)_{i=1}^n$ and target vector $\bar w$ are appropriately quantized. This result may also be regarded as a "for each"-type guarantee for exactly recovering a fixed vector $\bar w$, which complements the "for all"-type results of Unnikrishnan et al. [22] concerning the number of measurement vectors needed for recovering all possible vectors.

3. We show that in the measurement model from (1), where the measurement errors are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$ and the covariate vectors are i.i.d. draws from $\mathrm{N}(0, I_d)$, no algorithm can approximately recover $\bar w$ unless $\mathrm{SNR} \ge C \min\{1, d/\log\log(n)\}$ for some absolute constant $C > 0$. We also show that when the covariate vectors are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, then approximate recovery is impossible unless $\mathrm{SNR} \ge C'$ for some other absolute constant $C' > 0$.

Our algorithms are not meant for practical deployment, but instead are intended to shed light on the computational difficulty of the least squares problem and the average-case recovery problem. Indeed, note that a naïve brute-force search over permutations requires time $\Omega(n!) = n^{\Omega(n)}$, and the only other previous algorithms (already discussed above) were restricted to $d = 1$ [18] or only had some form of approximation guarantee when $d = 2$ [8]. We are not aware of previous algorithms for the average-case problem in general dimension $d$.²

Our lower bounds on SNR stand in contrast to what is achievable in the classical linear regression model (where the covariate/response correspondence is known): in that model, the SNR requirement for approximately recovering $\bar w$ scales as $d/n$, and hence the problem becomes easier with $n$. The lack of correspondence thus drastically changes the difficulty of the problem.
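To make the problem setup concrete, the following minimal sketch (our own illustration, not part of the paper's algorithms; the names `make_shuffled_data` and `objective` are ours) generates data from the measurement model (1) with Gaussian covariates and evaluates the least squares objective (2) for a candidate pair $(w, \pi)$.

```python
import numpy as np

def make_shuffled_data(n, d, sigma, rng):
    """Draw data from model (1): y_i = <w_bar, x_{pi_bar(i)}> + eps_i."""
    w_bar = rng.standard_normal(d)
    X = rng.standard_normal((n, d))          # covariates x_1, ..., x_n
    pi_bar = rng.permutation(n)              # unknown permutation pi_bar
    eps = sigma * rng.standard_normal(n)     # measurement errors
    y = X[pi_bar] @ w_bar + eps              # y_i = w_bar^T x_{pi_bar(i)} + eps_i
    return X, y, w_bar, pi_bar

def objective(X, y, w, pi):
    """Least squares objective (2) for a candidate (w, pi)."""
    return np.sum((y - X[pi] @ w) ** 2)

rng = np.random.default_rng(0)
X, y, w_bar, pi_bar = make_shuffled_data(n=50, d=3, sigma=0.1, rng=rng)
print(objective(X, y, w_bar, pi_bar))         # small: equals sum of eps_i^2
print(objective(X, y, w_bar, np.arange(50)))  # identity permutation: typically much larger
```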
2 Approximation algorithm for the least squares problem

In this section, we consider the least squares problem from Equation (2). The inputs are an arbitrary matrix $X = [x_1 | x_2 | \cdots | x_n]^\top \in \mathbb{R}^{n \times d}$ and an arbitrary vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$, and the goal is to find a vector $w \in \mathbb{R}^d$ and a permutation matrix $\Pi \in \mathcal{P}_n$ (where $\mathcal{P}_n$ denotes the set of $n \times n$ permutation matrices³) to minimize $\|\Pi X w - y\|_2^2$. This problem is NP-hard in the case where $d = \Omega(n)$ [18] (see also Appendix A). We give an approximation scheme that, for any $\epsilon \in (0,1)$, returns a $(1+\epsilon)$-approximation in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$, where $k := \mathrm{rank}(X) \le \min\{n, d\}$.

We assume without loss of generality that $X \in \mathbb{R}^{n \times k}$ and $X^\top X = I_k$. This is because we can always replace $X$ with its matrix of left singular vectors $U \in \mathbb{R}^{n \times k}$, obtained via a (thin) singular value decomposition $X = U \Sigma V^\top$, where $U^\top U = V^\top V = I_k$ and $\Sigma \succ 0$ is diagonal. A solution $(w, \Pi)$ for $(U, y)$ has the same cost as the solution $(V \Sigma^{-1} w, \Pi)$ for $(X, y)$, and a solution $(w, \Pi)$ for $(X, y)$ has the same cost as the solution $(\Sigma V^\top w, \Pi)$ for $(U, y)$.

¹ Pananjady et al. [18] prove that PARTITION reduces to the problem of deciding if the optimal value of (2) is zero or non-zero. Note that PARTITION is weakly, but not strongly, NP-hard: it admits a pseudo-polynomial-time algorithm [10, Section 4.2]. In Appendix A, we prove that the least squares problem is strongly NP-hard by reduction from 3-PARTITION (which is strongly NP-complete [10, Section 4.2.2]).

² A recent algorithm of Pananjady et al. [19] exploits a similar average-case setting but only for a somewhat easier variant of the problem where more information about the unknown correspondence is provided.

³ Each permutation matrix $\Pi \in \mathcal{P}_n$ corresponds to a permutation $\pi$ on $[n]$; the $(i,j)$-th entry of $\Pi$ is one if $\pi(i) = j$ and is zero otherwise.

2.1 Algorithm

Our approximation algorithm, shown as Algorithm 1, uses a careful enumeration to beat the naïve brute-force running time of $\Omega(|\mathcal{P}_n|) = \Omega(n!)$. It uses as a subroutine a Row Sampling algorithm of Boutsidis et al. [5] (described in Appendix B), which has the following property.

Theorem 1 (Specialization of Theorem 12 in [5]). There is an algorithm (Row Sampling) that, given any matrix $A \in \mathbb{R}^{n \times k}$ with $n \ge k$, returns in $\mathrm{poly}(n, k)$ time a matrix $S \in \mathbb{R}^{r \times n}$ with $r = 4k$ such that the following hold.
1. Every row of $S$ has at most one non-zero entry.
2. For every $b \in \mathbb{R}^n$, every $w_0 \in \arg\min_{w \in \mathbb{R}^k} \|S(Aw - b)\|_2^2$ satisfies $\|Aw_0 - b\|_2^2 \le c \cdot \min_{w \in \mathbb{R}^k} \|Aw - b\|_2^2$ for $c = 1 + 4(1 + \sqrt{n/(4k)})^2 = O(n/k)$.

Algorithm 1 Approximation algorithm for the least squares problem
Input: covariate matrix $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n \times k}$; response vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$; approximation parameter $\epsilon \in (0,1)$.
Assume: $X^\top X = I_k$.
Output: weight vector $\hat w \in \mathbb{R}^k$ and permutation matrix $\hat\Pi \in \mathcal{P}_n$.
1: Run the Row Sampling algorithm with input matrix $X$ to obtain a matrix $S \in \mathbb{R}^{r \times n}$ with $r = 4k$.
2: Let $B$ be the set of vectors $b = (b_1, b_2, \dots, b_n)^\top \in \mathbb{R}^n$ satisfying the following: for each $i \in [n]$, if the $i$-th column of $S$ is all zeros, then $b_i = 0$; otherwise, $b_i \in \{y_1, y_2, \dots, y_n\}$.
3: Let $c := 1 + 4(1 + \sqrt{n/(4k)})^2$.
4: for each $b \in B$ do
5: Compute $w_b \in \arg\min_{w \in \mathbb{R}^k} \|S(Xw - b)\|_2^2$, and let $r_b := \min_{\Pi \in \mathcal{P}_n} \|\Pi X w_b - y\|_2$.
6: Construct a $(\sqrt{\epsilon/c}\, r_b)$-net $N_b$ for the Euclidean ball of radius $\sqrt{c}\, r_b$ around $w_b$, so that for each $v \in \mathbb{R}^k$ with $\|v - w_b\|_2 \le \sqrt{c}\, r_b$, there exists $v' \in N_b$ such that $\|v - v'\|_2 \le \sqrt{\epsilon/c}\, r_b$.
7: end for
8: return $\hat w \in \arg\min_{w \in \bigcup_{b \in B} N_b} \min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2$ and $\hat\Pi \in \arg\min_{\Pi \in \mathcal{P}_n} \|\Pi X \hat w - y\|_2$.

The matrix $S$ returned by Row Sampling determines a (weighted) subset of $O(k)$ rows of $A$ such that solving an (ordinary) least squares problem (with any right-hand side $b$) on this subset of rows and corresponding right-hand side entries yields an $O(n/k)$-approximation to the least squares problem over all rows and right-hand side entries.
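To illustrate how the sketching matrix $S$ from Theorem 1 is used in step 5 of Algorithm 1, here is a minimal sketch that substitutes plain uniform row subsampling for the actual Row Sampling procedure of [5] (so it carries no approximation guarantee; it only shows the mechanics of solving the sketched least squares problem). The function name `sketched_least_squares` is our own.

```python
import numpy as np

def sketched_least_squares(X, b, r, rng):
    """Solve min_w ||S(Xw - b)||_2^2 for a row-selection-style sketch S.

    Here S selects r rows uniformly at random; the Row Sampling algorithm
    of [5] would instead choose (and weight) rows deterministically so that
    the solution is a c = O(n/k)-approximation for every right-hand side b.
    """
    n = X.shape[0]
    rows = rng.choice(n, size=r, replace=False)   # indices of nonzero columns of S
    w, *_ = np.linalg.lstsq(X[rows], b[rows], rcond=None)
    return w, rows

rng = np.random.default_rng(0)
n, k = 200, 5
X = np.linalg.qr(rng.standard_normal((n, k)))[0]   # X with orthonormal columns
b = rng.standard_normal(n)
w_sketch, rows = sketched_least_squares(X, b, r=4 * k, rng=rng)
w_full = X.T @ b                                   # exact minimizer since X^T X = I_k
print(np.sum((X @ w_sketch - b) ** 2), np.sum((X @ w_full - b) ** 2))
```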
Row Sampling does not directly apply to our problem because (i) it does not minimize over permutations of the right-hand side, and (ii) the approximation factor is too large. However, we are able to use it to narrow the search space in our problem.

An alternative to Row Sampling is to simply enumerate all subsets of $k$ rows of $X$. This is justified by a recent result of Dereziński and Warmuth [7], which shows that for any right-hand side $b \in \mathbb{R}^n$, using volume sampling [3] to choose a row-selection matrix $S \in \{0,1\}^{k \times n}$ (where each row has one non-zero entry) gives a similar guarantee as that of Row Sampling, except with the $O(n/k)$ factor replaced by $k+1$ in expectation.

2.2 Analysis

The approximation guarantee of Algorithm 1 is given in the following theorem.

Theorem 2. Algorithm 1 returns $\hat w \in \mathbb{R}^k$ and $\hat\Pi \in \mathcal{P}_n$ satisfying
$$\|\hat\Pi X \hat w - y\|_2^2 \ \le\ (1+\epsilon) \cdot \min_{w \in \mathbb{R}^k, \, \Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2^2 .$$

Proof. Let $\mathrm{opt} := \min_{w, \Pi} \|\Pi X w - y\|_2^2$ be the optimal cost, and let $(w^\star, \Pi^\star)$ denote a solution achieving this cost. The optimality implies that $w^\star$ satisfies the normal equations $X^\top X w^\star = X^\top {\Pi^\star}^\top y$. Observe that there exists a vector $b^\star \in B$ satisfying $S b^\star = S {\Pi^\star}^\top y$. By Theorem 1 and the normal equations, the vector $w_{b^\star}$ and cost value $r_{b^\star}$ satisfy
$$r_{b^\star}^2 \ \le\ \|X(w_{b^\star} - w^\star)\|_2^2 + \mathrm{opt} \ \le\ c \cdot \mathrm{opt} .$$
Moreover, since $X^\top X = I_k$, we have $\|w_{b^\star} - w^\star\|_2 = \|X(w_{b^\star} - w^\star)\|_2 \le \sqrt{(c-1)\,\mathrm{opt}} \le \sqrt{c}\, r_{b^\star}$. By construction of $N_{b^\star}$, there exists $w \in N_{b^\star}$ satisfying $\|w - w^\star\|_2^2 = \|X(w - w^\star)\|_2^2 \le \epsilon\, r_{b^\star}^2/c \le \epsilon \cdot \mathrm{opt}$. For this $w$, the normal equations imply
$$\min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2^2 \ \le\ \|\Pi^\star X w - y\|_2^2 \ =\ \|X(w - w^\star)\|_2^2 + \mathrm{opt} \ \le\ (1+\epsilon) \cdot \mathrm{opt} .$$
Therefore, the solution returned by Algorithm 1 has cost no more than $(1+\epsilon) \cdot \mathrm{opt}$.

By the results of Pananjady et al. [18] for maximum likelihood estimation, our algorithm enjoys recovery guarantees for $\bar w$ and $\bar\pi$ when the data come from the Gaussian measurement model (1). However, the approximation guarantee also holds for worst-case inputs without generative assumptions.

Running time. We now consider the running time of Algorithm 1. There is the initial cost for the singular value decomposition (as discussed at the beginning of the section) and for Row Sampling; both of these take $\mathrm{poly}(n, d)$ time. For the rest of the algorithm, we need to consider the size of $B$ and the size of the net $N_b$ for each $b \in B$. First, we have $|B| \le n^r = n^{O(k)}$, since $S$ has only $4k$ rows and each row has at most a single non-zero entry. Next, for each $b \in B$, we construct the $\delta$-net $N_b$ (for $\delta := \sqrt{\epsilon/c}\, r_b$) by constructing a $\delta/\sqrt{k}$-net for the $\ell_\infty$-ball of radius $\sqrt{c}\, r_b$ centered at $w_b$ (using an appropriate axis-aligned grid). This has size $|N_b| \le (4c^2 k/\epsilon)^{k/2} = (n/\epsilon)^{O(k)}$. Finally, each $\arg\min_{w \in \mathbb{R}^k}$ computation takes $O(nk^2)$ time, and each $(\arg)\min_{\Pi \in \mathcal{P}_n}$ computation takes $O(nk + n \log n)$ time [18] (also see Appendix B). So, the overall running time is $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$.
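The inner step used repeatedly above, computing $\min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2$ for a fixed $w$, reduces to sorting: the sum of squared differences between a permutation of $Xw$ and $y$ is minimized by matching the two vectors in sorted order (this is the $O(nk + n\log n)$-time step attributed to [18] above). A minimal sketch, with our own function name `best_permutation`:

```python
import numpy as np

def best_permutation(X, w, y):
    """Return an index array pi minimizing ||Pi X w - y||_2, plus the cost.

    Matching the sorted entries of a = Xw with the sorted entries of y
    minimizes the sum of squared differences over all permutations.
    """
    a = X @ w
    order_a = np.argsort(a)            # a[order_a] is sorted
    order_y = np.argsort(y)            # y[order_y] is sorted
    pi = np.empty(len(y), dtype=int)   # pi[i] = j matches row j of X to y_i
    pi[order_y] = order_a
    cost = np.sqrt(np.sum((a[pi] - y) ** 2))
    return pi, cost

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
w = rng.standard_normal(3)
y = (X @ w)[rng.permutation(10)]       # noiseless, shuffled responses
pi, cost = best_permutation(X, w, y)
print(cost)                            # ~0: the true w realigns exactly
```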
3 Exact recovery algorithm in the noiseless Gaussian setting

To counter the intractability of the least squares problem (2) confronted in Section 2, it is natural to explore distributional assumptions that may lead to faster algorithms. In this section, we consider the noiseless measurement model in which the $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$ (as in [18]). We give an algorithm that exactly recovers $\bar w$ with high probability when $n \ge d+1$. The algorithm runs in $\mathrm{poly}(n, d)$ time when $(x_i)_{i=1}^n$ and $\bar w$ are appropriately quantized.

It will be notationally simpler to consider $n+1$ covariate vectors and responses:

$$y_i = \bar w^\top x_{\bar\pi(i)}, \quad i = 0, 1, \dots, n. \qquad (3)$$

Here, $(x_i)_{i=0}^n$ are $n+1$ i.i.d. draws from $\mathrm{N}(0, I_d)$, the unknown permutation $\bar\pi$ is over $\{0, 1, \dots, n\}$, and the requirement of at least $d+1$ measurements is expressed as $n \ge d$.

In fact, we shall consider a variant of the problem in which we are given one of the values of the unknown permutation $\bar\pi$. Without loss of generality, assume we are given that $\bar\pi(0) = 0$. Solving this variant of the problem suffices because there are only $n+1$ possible values of $\bar\pi(0)$: we can try them all, incurring just a factor $n+1$ in the computation time. So henceforth, we just consider $\bar\pi$ as an unknown permutation on $[n]$.

Algorithm 2 Find permutation
Input: covariate vectors $x_0, x_1, x_2, \dots, x_n$ in $\mathbb{R}^d$; response values $y_0, y_1, y_2, \dots, y_n$ in $\mathbb{R}$; confidence parameter $\delta \in (0,1)$; lattice parameter $\beta > 0$.
Assume: there exist $\bar w \in \mathbb{R}^d$ and a permutation $\bar\pi$ on $[n]$ such that $y_i = \bar w^\top x_{\bar\pi(i)}$ for each $i \in [n]$, and $y_0 = \bar w^\top x_0$.
Output: permutation $\hat\pi$ on $[n]$, or failure.
1: Let $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n \times d}$, and let its pseudoinverse be $X^\dagger = [\tilde x_1 | \tilde x_2 | \cdots | \tilde x_n] \in \mathbb{R}^{d \times n}$.
2: Create a Subset Sum instance with $n^2$ source numbers $c_{i,j} := y_i \, \tilde x_j^\top x_0$ for $(i,j) \in [n] \times [n]$ and target sum $y_0$.
3: Run Algorithm 3 on this Subset Sum instance with lattice parameter $\beta$.
4: if Algorithm 3 returns a solution $S \subseteq [n] \times [n]$ then
5: return any permutation $\hat\pi$ on $[n]$ such that $\hat\pi(i) = j$ implies $(i,j) \in S$.
6: else
7: return failure.
8: end if

Algorithm 3 Lagarias and Odlyzko [12] subset sum algorithm
Input: source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$; target sum $t \in \mathbb{R}$; lattice parameter $\beta > 0$.
Output: subset $\hat S \subseteq I$, or failure.
1: Construct the lattice basis $B \in \mathbb{R}^{(|I|+2) \times (|I|+1)}$ consisting of the identity matrix $I_{|I|+1}$ with an appended bottom row whose first entry is $\beta t$ and whose remaining entries are $-\beta c_i$ for $i \in I$.
2: Run basis reduction [e.g., 16] to find a non-zero lattice vector $v$ of length at most $2^{|I|/2} \lambda_1(B)$.
3: if $v = z \cdot (1, \chi_{\hat S}^\top, 0)^\top$ for some $z \in \mathbb{Z}$, where $\chi_{\hat S} \in \{0,1\}^I$ is the characteristic vector of some $\hat S \subseteq I$, then
4: return $\hat S$.
5: else
6: return failure.
7: end if

3.1 Algorithm

Our algorithm, shown as Algorithm 2, is based on a reduction to the Subset Sum problem. An instance of Subset Sum is specified by an unordered collection of source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$ and a target sum $t \in \mathbb{R}$. The goal is to find a subset $S \subseteq I$ such that $\sum_{i \in S} c_i = t$. Although Subset Sum is NP-hard in the worst case, it is tractable for certain structured instances [12, 9]. We prove that Algorithm 2 constructs such an instance with high probability. A similar algorithm based on such a reduction was recently used by Andoni et al. [2] for a different but related problem.

Algorithm 2 proceeds by (i) solving a Subset Sum instance based on the covariate vectors and response values (using Algorithm 3), and (ii) constructing a permutation $\hat\pi$ on $[n]$ based on the solution to the Subset Sum instance. With the permutation $\hat\pi$ in hand, we (try to) find a solution $w \in \mathbb{R}^d$ to the system of linear equations $y_i = w^\top x_{\hat\pi(i)}$ for $i \in [n]$. If $\hat\pi = \bar\pi$, then there is a unique such solution almost surely.
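The following sketch (our own illustration, with hypothetical names `subset_sum_instance` and `lagarias_odlyzko_basis`) builds the Subset Sum instance of step 2 of Algorithm 2 and the lattice basis of Algorithm 3, and numerically checks that the subset induced by the true permutation sums to the target $y_0$ (the content of Lemma 1 below). It does not run lattice basis reduction; that step would use an LLL implementation such as the ones available in fpylll or SageMath.

```python
import numpy as np

def subset_sum_instance(xs, ys):
    """Source numbers c[i, j] = y_i * <x_tilde_j, x_0> and target y_0 (Algorithm 2, step 2)."""
    x0, X = xs[0], np.asarray(xs[1:])           # X has rows x_1, ..., x_n
    X_pinv = np.linalg.pinv(X)                  # columns are x_tilde_1, ..., x_tilde_n
    c = np.outer(ys[1:], X_pinv.T @ x0)         # c[i-1, j-1] = y_i * x_tilde_j^T x_0
    return c, ys[0]

def lagarias_odlyzko_basis(c_flat, t, beta):
    """Basis B of Algorithm 3: identity on top, bottom row (beta*t, -beta*c_i)."""
    m = len(c_flat)
    bottom = np.concatenate(([beta * t], -beta * c_flat))
    return np.vstack([np.eye(m + 1), bottom])   # shape (m + 2, m + 1)

rng = np.random.default_rng(0)
n, d = 6, 3
w_bar = rng.standard_normal(d)
xs = rng.standard_normal((n + 1, d))
pi_bar = np.concatenate(([0], 1 + rng.permutation(n)))   # pi_bar(0) = 0
ys = np.array([w_bar @ xs[pi_bar[i]] for i in range(n + 1)])

c, t = subset_sum_instance(xs, ys)
true_subset_sum = sum(c[i - 1, pi_bar[i] - 1] for i in range(1, n + 1))
print(np.isclose(true_subset_sum, t))            # True: Lemma 1's identity

B = lagarias_odlyzko_basis(c.ravel(), t, beta=1.0)
chi = np.zeros(n * n)
for i in range(1, n + 1):                        # characteristic vector of S_pi
    chi[(i - 1) * n + (pi_bar[i] - 1)] = 1.0
target_vec = B @ np.concatenate(([1.0], chi))    # should equal (1, chi, 0)
print(np.isclose(target_vec[-1], 0.0))           # last coordinate vanishes
```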
3.2 Analysis

The following theorem is the main recovery guarantee for Algorithm 2.

Theorem 3. Pick any $\delta \in (0,1)$. Suppose $(x_i)_{i=0}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$, the $(y_i)_{i=0}^n$ follow the noiseless measurement model from (3) for some $\bar w \in \mathbb{R}^d$ and permutation $\bar\pi$ on $[n]$ (with $\bar\pi(0) = 0$), and $n \ge d$. Furthermore, suppose Algorithm 2 is run with inputs $(x_i)_{i=0}^n$, $(y_i)_{i=0}^n$, $\delta$, and $\beta$, where $\beta \ge 2^{n^2/2}/\varepsilon$ and $\varepsilon$ is as defined in Equation (8). With probability at least $1-\delta$, Algorithm 2 returns $\hat\pi = \bar\pi$.

Remark 1. The value of $\varepsilon$ from Equation (8) is directly proportional to $\|\bar w\|_2$, and Algorithm 2 requires a lower bound on $\varepsilon$ (in the setting of the lattice parameter $\beta$). Hence, it suffices to determine a lower bound on $\|\bar w\|_2$. Such a bound can be obtained from the measurement values: a standard tail bound (Lemma 6 in Appendix C) shows that, with high probability, $\sum_{i=1}^n |y_i|/(2n)$ is a lower bound on $\|\bar w\|_2$, and is within a constant factor of it as well.

Remark 2. Algorithm 2 strongly exploits the assumption of noiseless measurements, which is expected given the SNR lower bounds of Pananjady et al. [18] for recovering $\bar\pi$. The algorithm is, however, very brittle and very likely fails in the presence of noise.

Remark 3. The recovery result does not contradict the results of Unnikrishnan et al. [22], which show that a collection of $2d$ measurement vectors is necessary for recovering all $w$, even in the noiseless measurement model of (3). Indeed, our result shows that for a fixed $\bar w \in \mathbb{R}^d$, with high probability $d+1$ measurements in the model of (3) suffice to permit exact recovery of $\bar w$; but this same set of measurement vectors (when $d+1 < 2d$) will fail for some other $w'$.

The proof of Theorem 3 is based on the following theorem, essentially due to Lagarias and Odlyzko [12] and Frieze [9], concerning certain structured instances of Subset Sum that can be solved using the lattice basis reduction algorithm of Lenstra et al. [16]. Given a basis $B = [b_1|b_2|\cdots|b_k] \in \mathbb{R}^{m \times k}$ for the lattice
$$\mathcal{L}(B) := \Big\{ \textstyle\sum_{i=1}^k z_i b_i : z_1, z_2, \dots, z_k \in \mathbb{Z} \Big\},$$
this algorithm can be used to find a non-zero lattice vector $v \in \mathcal{L}(B) \setminus \{0\}$ whose length is at most $2^{(k-1)/2}$ times that of the shortest non-zero vector in the lattice,
$$\lambda_1(B) := \min_{v \in \mathcal{L}(B) \setminus \{0\}} \|v\|_2 .$$

Theorem 4 ([12, 9]). Suppose the Subset Sum instance specified by source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$ and target sum $t \in \mathbb{R}$ satisfies the following properties.
1. There is a subset $S^\star \subseteq I$ such that $\sum_{i \in S^\star} c_i = t$.
2. Define $R := 2^{|I|/2}\sqrt{|S^\star|+1}$ and $\mathcal{Z}_R := \{(z_0, z) \in \mathbb{Z} \times \mathbb{Z}^I : 0 < z_0^2 + \sum_{i \in I} z_i^2 \le R^2\}$. There exists $\varepsilon > 0$ such that $|z_0 t - \sum_{i \in I} z_i c_i| \ge \varepsilon$ for each $(z_0, z) \in \mathcal{Z}_R$ that is not an integer multiple of $(1, \chi^\star)$, where $\chi^\star \in \{0,1\}^I$ is the characteristic vector of $S^\star$.
Let $B$ be the lattice basis constructed by Algorithm 3, and assume $\beta \ge 2^{|I|/2}/\varepsilon$. Then every non-zero vector in the lattice $\mathcal{L}(B)$ with length at most $2^{|I|/2}$ times the length of the shortest non-zero vector in $\mathcal{L}(B)$ is an integer multiple of the vector $(1, \chi_{S^\star}, 0)$, and the basis reduction algorithm of Lenstra et al. [16] returns such a non-zero vector.

The Subset Sum instance constructed in Algorithm 2 has $n^2$ source numbers $\{c_{i,j} : (i,j) \in [n] \times [n]\}$ and target sum $y_0$. We need to show that it satisfies the two conditions of Theorem 4. Let $S_\pi := \{(i,j) : \bar\pi(i) = j\} \subseteq [n] \times [n]$, and let $\bar\Pi = (\bar\Pi_{i,j})_{(i,j) \in [n] \times [n]} \in \mathcal{P}_n$ be the permutation matrix with $\bar\Pi_{i,j} := \mathbb{1}\{\bar\pi(i) = j\}$ for all $(i,j) \in [n] \times [n]$. Note that $\bar\Pi$ is the characteristic vector for $S_\pi$. Define $R := 2^{n^2/2}\sqrt{n+1}$ and
$$\mathcal{Z}_R := \big\{ (z_0, Z) \in \mathbb{Z} \times \mathbb{Z}^{n \times n} : 0 < z_0^2 + \|Z\|_F^2 \le R^2 \big\} .$$
A crude bound shows that $|\mathcal{Z}_R| \le 2^{O(n^4)}$. The following lemma establishes the first required property in Theorem 4.

Lemma 1. The random matrix $X$ has rank $d$ almost surely, and the subset $S_\pi$ satisfies $y_0 = \sum_{(i,j) \in S_\pi} c_{i,j}$.

Proof. That $X$ has rank $d$ almost surely follows from the fact that $n \ge d$ and the probability density of $X$ is supported on all of $\mathbb{R}^{n \times d}$. This implies that $X^\dagger X = \sum_{j=1}^n \tilde x_j x_j^\top = I_d$, and therefore
$$y_0 = \bar w^\top x_0 = (X^\dagger X \bar w)^\top x_0 = \sum_{j=1}^n (x_j^\top \bar w)(\tilde x_j^\top x_0) = \sum_{(i,j) \in [n] \times [n]} y_i \, \tilde x_j^\top x_0 \, \mathbb{1}\{\bar\pi(i) = j\} = \sum_{(i,j) \in [n] \times [n]} c_{i,j} \, \mathbb{1}\{\bar\pi(i) = j\} .$$

The next lemma establishes the second required property in Theorem 4. Here, we use the fact that the Frobenius norm $\|z_0 \bar\Pi - Z\|_F$ is at least one whenever $(z_0, Z) \in \mathbb{Z} \times \mathbb{Z}^{n \times n}$ is not an integer multiple of $(1, \bar\Pi)$.

Lemma 2. Pick any $\gamma, \gamma' > 0$ such that $3\gamma|\mathcal{Z}_R| + \gamma' < 1$. With probability at least $1 - 3\gamma|\mathcal{Z}_R| - \gamma'$, every $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j) \in [n] \times [n]}$ that is not an integer multiple of $(1, \bar\Pi)$ satisfies
$$\Big| z_0 y_0 - \sum_{(i,j) \in [n] \times [n]} Z_{i,j} c_{i,j} \Big| \ \ge\ \varepsilon_0(\gamma, \gamma', n, d) \, \|\bar w\|_2 ,$$
where $\varepsilon_0(\gamma, \gamma', n, d) > 0$ is an explicit function of its arguments given by the proof below.
Proof. By Lemma 1, the matrix $\bar\Pi$ satisfies $y_0 = \sum_{(i,j)} \bar\Pi_{i,j} c_{i,j}$. Fix any $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j) \in [n] \times [n]}$ that is not an integer multiple of $(1, \bar\Pi)$. Then
$$z_0 y_0 - \sum_{(i,j)} Z_{i,j} c_{i,j} = \sum_{(i,j)} (z_0 \bar\Pi_{i,j} - Z_{i,j}) \, y_i \, \tilde x_j^\top x_0 .$$
Using matrix and vector notation, and using $y = \bar\Pi X \bar w$, this can be written compactly as the inner product $x_0^\top X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w$. Since $x_0 \sim \mathrm{N}(0, I_d)$ and is independent of $X$, the conditional distribution of this inner product given $X$ is normal with mean zero and standard deviation $\|X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2$. By Lemma 7 (in Appendix C), with probability at least $1 - \gamma$, the inner product has magnitude at least a $\gamma/4$ fraction of this standard deviation. Observe that $X^\dagger = (X^\top X)^{-1} X^\top$ since $X$ has rank $d$ by Lemma 1, so
$$\|X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2 \ \ge\ \frac{\|(z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2}{\sigma_{\max}(X)} ,$$
where $\sigma_{\max}(X)$ denotes the largest singular value of $X$. By Lemma 4 (in Appendix C), with probability at least $1 - \gamma'$, $\sigma_{\max}(X) = O(\sqrt{n})$. And by Lemma 9 (in Appendix C), with probability at least $1 - 2\gamma$,
$$\|(z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2 \ \ge\ \Big(\frac{\gamma}{8n}\Big)^{1 + 1/(d-1)} \|(z_0 \bar\Pi - Z)^\top \bar\Pi\|_F \, \|\bar w\|_2 .$$
Since $\bar\Pi$ is orthogonal, we have $\|(z_0 \bar\Pi - Z)^\top \bar\Pi\|_F = \|z_0 \bar\Pi - Z\|_F \ge 1$. Combining these bounds and taking a union bound over all $(z_0, Z) \in \mathcal{Z}_R$ proves the claim.

Proof of Theorem 3. Lemma 1 and Lemma 2 (with $\gamma' := \delta/2$ and $\gamma := \delta/(6|\mathcal{Z}_R|)$) together imply that with probability at least $1-\delta$, the source numbers $\{c_{i,j} : (i,j) \in [n] \times [n]\}$ and target sum $y_0$ satisfy the conditions of Theorem 4 with
$$S^\star := \{(i,j) \in [n] \times [n] : \bar\pi(i) = j\}, \qquad \varepsilon := \varepsilon_0\big(\delta/(6|\mathcal{Z}_R|), \, \delta/2, \, n, \, d\big) \, \|\bar w\|_2 \ \ge\ \frac{\|\bar w\|_2}{2^{\mathrm{poly}(n, \log(1/\delta))}} . \qquad (8)$$
Thus, in this event, Algorithm 3 (with $\beta$ satisfying $\beta \ge 2^{n^2/2}/\varepsilon$) returns $\hat S = S^\star$, which uniquely determines the permutation $\hat\pi = \bar\pi$ returned by Algorithm 2.

Running time. The basis reduction algorithm of Lenstra et al. [16] is iterative, with each iteration primarily consisting of Gram-Schmidt orthogonalization and another efficient linear algebraic process called size reduction. The total number of iterations required is polynomial in $k$ and in $\log(\max_{i \in [k]} \|b_i\|_2 / \lambda_1(B))$. In our case, $k = n^2 + 1$ and $\lambda_1(B) = \sqrt{n+1}$; and by Lemma 10 (in Appendix C), each of the basis vectors constructed has squared length at most $1 + \beta^2 \, \mathrm{poly}(d, \log(n), 1/\delta) \, \|\bar w\|_2^2$. Using the tight setting of $\beta$ required in Theorem 3, this gives a $\mathrm{poly}(n, d, \log(1/\delta))$ bound on the total number of iterations as well as on the total running time. However, the basis reduction algorithm requires both arithmetic and rounding operations, which are typically only available for finite-precision rational inputs. Therefore, a formal running time analysis would require the idealized real-valued covariate vectors $(x_i)_{i=0}^n$ and unknown target vector $\bar w$ to be quantized to finite-precision values. This is doable, and is similar to using a discretized Gaussian distribution for the distribution of the covariate vectors (and assuming $\bar w$ is a vector of finite-precision values), but leads to a messier analysis incomparable to the setup of previous works. Nevertheless, it would be desirable to find a different algorithm that avoids lattice basis reduction and still works with just $d+1$ measurements.

4 Lower bounds on signal-to-noise for approximate recovery

In this section, we consider the measurement model from (1), where $(x_i)_{i=1}^n$ are i.i.d. draws from either $\mathrm{N}(0, I_d)$ or the uniform distribution on $[-1/2, 1/2]^d$, and $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$. We establish lower bounds on the signal-to-noise ratio, $\mathrm{SNR} = \|\bar w\|_2^2/\sigma^2$, required by any estimator $\hat w = \hat w((x_i, y_i)_{i=1}^n)$ to approximately recover $\bar w$ in expectation. The estimators may have a priori knowledge of the values of $\|\bar w\|_2$ and $\sigma^2$.
Theorem 5. Assume $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$.
1. There is an absolute constant $C > 0$ such that the following holds. If $n \ge 3$, $d \ge 22$, the $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$, the $(y_i)_{i=1}^n$ follow the measurement model from (1), and $\mathrm{SNR} \le C \min\{1, d/\log\log(n)\}$, then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that $\mathbb{E}\|\hat w - \bar w\|_2 \ge \frac{1}{24}\|\bar w\|_2$.
2. If the $(x_i)_{i=1}^n$ are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, the $(y_i)_{i=1}^n$ follow the measurement model from (1), and $\mathrm{SNR} \le 2$, then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that $\mathbb{E}\|\hat w - \bar w\|_2$ is at least a constant fraction of $\|\bar w\|_2$.

Note that in the classical linear regression model, where $y_i = \bar w^\top x_i + \varepsilon_i$ for $i \in [n]$, the maximum likelihood estimator $\hat w_{\mathrm{mle}}$ satisfies $\mathbb{E}\|\hat w_{\mathrm{mle}} - \bar w\|_2 \le C\sigma\sqrt{d/n}$, where $C > 0$ is an absolute constant. Therefore, the SNR requirement to approximately recover $\bar w$ up to (say) Euclidean distance $\|\bar w\|_2/24$ is $\mathrm{SNR} \ge 24^2 C^2 d/n$. Compared to this setting, Theorem 5 implies that with the measurement model of (1), the SNR requirement (as a function of $n$) is substantially higher: $d/\log\log(n)$ in the normal covariate case, or a constant not even decreasing with $n$ in the uniform covariate case.

For the normal covariate case, Pananjady et al. [18] show that if $n > d$ and $\mathrm{SNR} \ge n^c$ (for a suitable absolute constant $c > 0$, under mild additional conditions relating $n$ and $d$), then the maximum likelihood estimator $(\hat w_{\mathrm{mle}}, \hat\pi_{\mathrm{mle}})$ (i.e., any minimizer of (2)) satisfies $\hat\pi_{\mathrm{mle}} = \bar\pi$ with probability at least $1 - c'n^{-2}$. (Here, $c > 0$ and $c' > 0$ are absolute constants.) It is straightforward to see that, on the same event, we have $\|\hat w_{\mathrm{mle}} - \bar w\|_2 \le C\sigma\sqrt{d/n}$ for some absolute constant $C > 0$. Therefore, the necessary and sufficient conditions on SNR for approximate recovery of $\bar w$ lie between $C'd/\log\log(n)$ and $n^{C''}$ (for absolute constants $C', C'' > 0$). Narrowing this range remains an interesting open problem.

A sketch of the proof in the normal covariate case is as follows. Without loss of generality, we restrict attention to the case where $\bar w$ is a unit vector. We construct a $1/\sqrt{2}$-packing $U$ of the unit sphere in $\mathbb{R}^d$; the target $\bar w$ will be chosen from this set. Observe that for any distinct $u, u' \in U$, each of $(x_i^\top u)_{i=1}^n$ and $(x_i^\top u')_{i=1}^n$ is an i.i.d. sample from $\mathrm{N}(0,1)$ of size $n$; we prove that they therefore determine empirical distributions that are close to each other in Wasserstein-2 distance with high probability. We then prove that, conditional on this event, the resulting distributions of $(y_i)_{i=1}^n$ under $\bar w = u$ and under $\bar w = u'$ (for any pair $u, u' \in U$) are close in Kullback-Leibler divergence. Hence, by (a generalization of) Fano's inequality [see, e.g., 11], no estimator can determine the correct $u \in U$ with high probability. The proof for the uniform case is similar, using $U = \{e_1, -e_1\}$ where $e_1 = (1, 0, \dots, 0)^\top$. The full proof of Theorem 5 is given in Appendix D.
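To make the key step of this sketch concrete: in one dimension, the Wasserstein-2 distance between two empirical distributions over the same number of points is computed by sorting. The following sketch (our own illustration; `empirical_w2` is not from the paper) shows that the empirical distributions of $(x_i^\top u)_{i=1}^n$ and $(x_i^\top u')_{i=1}^n$ are typically much closer to each other than the two index-aligned sequences are, which is what the missing correspondence exploits.

```python
import numpy as np

def empirical_w2(a, b):
    """Wasserstein-2 distance between the empirical distributions of two
    equal-length samples: match the sorted values (optimal 1-d coupling)."""
    a, b = np.sort(a), np.sort(b)
    return np.sqrt(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
u = np.eye(d)[0]
u_prime = np.eye(d)[1]                      # another unit vector, far from u
proj_u, proj_up = X @ u, X @ u_prime        # two N(0,1) samples of size n
print(empirical_w2(proj_u, proj_up))               # small: distributions nearly coincide
print(np.sqrt(np.mean((proj_u - proj_up) ** 2)))   # ~ sqrt(2): aligned sequences differ
```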
Acknowledgments

We are grateful to Ashwin Pananjady, Michał Dereziński, and Manfred Warmuth for helpful discussions. DH was supported in part by NSF awards DMR-1534910 and IIS-1563785, a Bloomberg Data Science Research Grant, and a Sloan Research Fellowship. XS was supported in part by a grant from the Simons Foundation (#320173 to Xiaorui Sun). This work was done in part while DH and KS were research visitors and XS was a research fellow at the Simons Institute for the Theory of Computing.

References

[1] Abubakar Abid, Ada Poon, and James Zou. Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342, 2017.
[2] Alexandr Andoni, Daniel Hsu, Kevin Shi, and Xiaorui Sun. Correspondence retrieval. In Conference on Learning Theory, 2017.
[3] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464-1499, 2013.
[4] Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint, 2014.
[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880-6892, 2013.
[6] Kenneth R. Davidson and Stanislaw J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, volume 1, pages 317-366, 2001.
[7] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. arXiv preprint arXiv:1705.06908, 2017.
[8] Golnooshsadat Elhami, Adam James Scholefield, Benjamin Bejar Haro, and Martin Vetterli. Unlabeled sensing: Reconstruction algorithm and theoretical guarantees. In Proceedings of the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[9] Alan M. Frieze. On the Lagarias-Odlyzko algorithm for the subset sum problem. SIAM Journal on Computing, 15(2):536-539, 1986.
[10] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[11] Te Sun Han and Sergio Verdú. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247-1251, 1994.
[12] Jeffrey C. Lagarias and Andrew M. Odlyzko. Solving low-density subset sum problems. Journal of the ACM, 32(1):229-246, 1985.
[13] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.
[14] Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38-53, 1973.
[15] Michel Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
[16] Arjen Klaas Lenstra, Hendrik Willem Lenstra, and László Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261(4):515-534, 1982.
[17] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007.
[18] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Linear regression with an unknown permutation: Statistical and computational limits. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 417-424, 2016.
[19] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Denoising linear models with permuted data. arXiv preprint arXiv:1704.07461, 2017.
[20] Rolf-Dieter Reiss. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Science & Business Media, 2012.
[21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990, 2010.
[22] Jayakrishnan Unnikrishnan, Saeid Haghighatshoar, and Martin Vetterli. Unlabeled sensing with random linear measurements. arXiv preprint arXiv:1512.00115, 2015.
[23] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.
[24] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423-435. Springer, 1997.