# Linear regression without correspondence

Daniel Hsu, Columbia University, New York, NY. djhsu@cs.columbia.edu
Kevin Shi, Columbia University, New York, NY. kshi@cs.columbia.edu
Xiaorui Sun, Microsoft Research, Redmond, WA. xiaoruisun@cs.columbia.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d. draws from a standard multivariate normal distribution, an efficient algorithm based on lattice basis reduction is shown to exactly recover the unknown linear function in arbitrary dimension. Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.

1 Introduction

Consider the problem of recovering an unknown vector $\bar w \in \mathbb{R}^d$ from noisy linear measurements when the correspondence between the measurement vectors and the measurements themselves is unknown. The measurement vectors (i.e., covariates) from $\mathbb{R}^d$ are denoted by $x_1, x_2, \dots, x_n$; for each $i \in [n] := \{1, 2, \dots, n\}$, the $i$-th measurement (i.e., response) $y_i$ is obtained using $x_{\bar\pi(i)}$:

$$y_i = \bar w^\top x_{\bar\pi(i)} + \varepsilon_i, \quad i \in [n]. \qquad (1)$$

Above, $\bar\pi$ is an unknown permutation on $[n]$, and the $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are unknown measurement errors. This problem, which has been called unlabeled sensing [22], linear regression with an unknown permutation [18], and linear regression with shuffled labels [1], arises in many settings; see the aforementioned references for more details. In short, sensing limitations may create ambiguity in, or even completely lose, the ordering of measurements. The problem is also interesting because the missing correspondence turns an otherwise well-understood problem into one with very different computational and statistical properties.

Prior works. Unnikrishnan et al. [22] study conditions on the measurement vectors that permit recovery of any target vector $\bar w$ under noiseless measurements. They show that when the entries of the $x_i$ are drawn i.i.d. from a continuous distribution and $n \ge 2d$, then almost surely, every vector $\bar w \in \mathbb{R}^d$ is uniquely determined by noiseless correspondence-free measurements as in (1). (Under noisy measurements, it is shown that $\bar w$ can be recovered when an appropriate signal-to-noise ratio tends to infinity.) It is also shown that $n \ge 2d$ is necessary for such a guarantee that holds for all vectors $\bar w \in \mathbb{R}^d$.

Pananjady et al. [18] study statistical and computational limits on recovering the unknown permutation $\bar\pi$. On the statistical front, they consider necessary and sufficient conditions on the signal-to-noise ratio $\mathrm{SNR} := \|\bar w\|_2^2/\sigma^2$ when the measurement errors $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from the normal distribution $\mathrm{N}(0, \sigma^2)$ and the measurement vectors $(x_i)_{i=1}^n$ are i.i.d. draws from the standard multivariate normal distribution $\mathrm{N}(0, I_d)$. Roughly speaking, exact recovery of $\bar\pi$ is possible via maximum likelihood when $\mathrm{SNR} \ge n^c$ for some absolute constant $c > 0$, and approximate recovery is impossible for any method when $\mathrm{SNR} \le n^{c'}$ for some other absolute constant $c' > 0$.
On the computational front, they show that the least squares problem

$$\min_{\pi, \, w \in \mathbb{R}^d} \ \sum_{i=1}^n \left( y_i - w^\top x_{\pi(i)} \right)^2, \qquad (2)$$

where the minimum is over permutations $\pi$ on $[n]$ and vectors $w \in \mathbb{R}^d$ (and which is equivalent to the maximum likelihood problem), given arbitrary $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $y_1, y_2, \dots, y_n \in \mathbb{R}$, is NP-hard when $d = \Omega(n)$,¹ but admits a polynomial-time algorithm (in fact, an $O(n \log n)$-time algorithm based on sorting) when $d = 1$.

Abid et al. [1] observe that the maximum likelihood estimator can be inconsistent for estimating $\bar w$ in certain settings (including the normal setting of Pananjady et al. [18], with SNR fixed but $n \to \infty$). One of the alternative estimators they suggest is consistent under additional assumptions in dimension $d = 1$. Elhami et al. [8] give an $O(dn^{d+1})$-time algorithm that, in dimension $d = 2$, is guaranteed to approximately recover $\bar w$ when the measurement vectors are chosen in a very particular way from the unit circle and the measurement errors are uniformly bounded.

Contributions. We make progress on both computational and statistical aspects of the problem.

1. We give an approximation algorithm for the least squares problem from (2) that, for any given $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0,1)$, returns a solution with objective value at most $1+\epsilon$ times that of the minimum in time $(n/\epsilon)^{O(d)}$. This is a fully polynomial-time approximation scheme for any constant dimension.

2. We give an algorithm that exactly recovers $\bar w$ in the measurement model from (1), under the assumption that there are no measurement errors and the covariates $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$. The algorithm, which is based on a reduction to a lattice problem and employs the lattice basis reduction algorithm of Lenstra et al. [16], runs in $\mathrm{poly}(n, d)$ time when the covariate vectors $(x_i)_{i=1}^n$ and target vector $\bar w$ are appropriately quantized. This result may also be regarded as a "for each"-type guarantee for exactly recovering a fixed vector $\bar w$, which complements the "for all"-type results of Unnikrishnan et al. [22] concerning the number of measurement vectors needed for recovering all possible vectors.

3. We show that in the measurement model from (1), where the measurement errors are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$ and the covariate vectors are i.i.d. draws from $\mathrm{N}(0, I_d)$, no algorithm can approximately recover $\bar w$ unless $\mathrm{SNR} \ge C \min\{1, d/\log\log(n)\}$ for some absolute constant $C > 0$. We also show that when the covariate vectors are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, then approximate recovery is impossible unless $\mathrm{SNR} \ge C'$ for some other absolute constant $C' > 0$.

Our algorithms are not meant for practical deployment, but instead are intended to shed light on the computational difficulty of the least squares problem and the average-case recovery problem. Indeed, note that a naïve brute-force search over permutations requires time $\Omega(n!) = n^{\Omega(n)}$, and the only other previous algorithms (already discussed above) were restricted to $d = 1$ [18] or only had some form of approximation guarantee when $d = 2$ [8]. We are not aware of previous algorithms for the average-case problem in general dimension $d$.²

Our lower bounds on SNR stand in contrast to what is achievable in the classical linear regression model (where the covariate/response correspondence is known): in that model, the SNR requirement for approximately recovering $\bar w$ scales as $d/n$, and hence the problem becomes easier with $n$. The lack of correspondence thus drastically changes the difficulty of the problem.
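To make the problem setup concrete, the following minimal sketch (our own illustration, not part of the paper's algorithms; the names `make_shuffled_data` and `objective` are ours) generates data from the measurement model (1) with Gaussian covariates and evaluates the least squares objective (2) for a candidate pair $(w, \pi)$.

```python
import numpy as np

def make_shuffled_data(n, d, sigma, rng):
    """Draw data from model (1): y_i = <w_bar, x_{pi_bar(i)}> + eps_i."""
    w_bar = rng.standard_normal(d)
    X = rng.standard_normal((n, d))          # covariates x_1, ..., x_n
    pi_bar = rng.permutation(n)              # unknown permutation pi_bar
    eps = sigma * rng.standard_normal(n)     # measurement errors
    y = X[pi_bar] @ w_bar + eps              # y_i = w_bar^T x_{pi_bar(i)} + eps_i
    return X, y, w_bar, pi_bar

def objective(X, y, w, pi):
    """Least squares objective (2) for a candidate (w, pi)."""
    return np.sum((y - X[pi] @ w) ** 2)

rng = np.random.default_rng(0)
X, y, w_bar, pi_bar = make_shuffled_data(n=50, d=3, sigma=0.1, rng=rng)
print(objective(X, y, w_bar, pi_bar))         # small: equals sum of eps_i^2
print(objective(X, y, w_bar, np.arange(50)))  # identity permutation: typically much larger
```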
2 Approximation algorithm for the least squares problem

In this section, we consider the least squares problem from Equation (2). The inputs are an arbitrary matrix $X = [x_1 | x_2 | \cdots | x_n]^\top \in \mathbb{R}^{n \times d}$ and an arbitrary vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$, and the goal is to find a vector $w \in \mathbb{R}^d$ and a permutation matrix $\Pi \in \mathcal{P}_n$ (where $\mathcal{P}_n$ denotes the set of $n \times n$ permutation matrices³) to minimize $\|\Pi X w - y\|_2^2$. This problem is NP-hard in the case where $d = \Omega(n)$ [18] (see also Appendix A). We give an approximation scheme that, for any $\epsilon \in (0,1)$, returns a $(1+\epsilon)$-approximation in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$, where $k := \mathrm{rank}(X) \le \min\{n, d\}$.

We assume without loss of generality that $X \in \mathbb{R}^{n \times k}$ and $X^\top X = I_k$. This is because we can always replace $X$ with its matrix of left singular vectors $U \in \mathbb{R}^{n \times k}$, obtained via a (thin) singular value decomposition $X = U \Sigma V^\top$, where $U^\top U = V^\top V = I_k$ and $\Sigma \succ 0$ is diagonal. A solution $(w, \Pi)$ for $(U, y)$ has the same cost as the solution $(V \Sigma^{-1} w, \Pi)$ for $(X, y)$, and a solution $(w, \Pi)$ for $(X, y)$ has the same cost as the solution $(\Sigma V^\top w, \Pi)$ for $(U, y)$.

¹ Pananjady et al. [18] prove that PARTITION reduces to the problem of deciding if the optimal value of (2) is zero or non-zero. Note that PARTITION is weakly, but not strongly, NP-hard: it admits a pseudo-polynomial-time algorithm [10, Section 4.2]. In Appendix A, we prove that the least squares problem is strongly NP-hard by reduction from 3-PARTITION (which is strongly NP-complete [10, Section 4.2.2]).

² A recent algorithm of Pananjady et al. [19] exploits a similar average-case setting but only for a somewhat easier variant of the problem where more information about the unknown correspondence is provided.

³ Each permutation matrix $\Pi \in \mathcal{P}_n$ corresponds to a permutation $\pi$ on $[n]$; the $(i,j)$-th entry of $\Pi$ is one if $\pi(i) = j$ and is zero otherwise.

2.1 Algorithm

Our approximation algorithm, shown as Algorithm 1, uses a careful enumeration to beat the naïve brute-force running time of $\Omega(|\mathcal{P}_n|) = \Omega(n!)$. It uses as a subroutine a Row Sampling algorithm of Boutsidis et al. [5] (described in Appendix B), which has the following property.

Theorem 1 (Specialization of Theorem 12 in [5]). There is an algorithm (Row Sampling) that, given any matrix $A \in \mathbb{R}^{n \times k}$ with $n \ge k$, returns in $\mathrm{poly}(n, k)$ time a matrix $S \in \mathbb{R}^{r \times n}$ with $r = 4k$ such that the following hold.
1. Every row of $S$ has at most one non-zero entry.
2. For every $b \in \mathbb{R}^n$, every $w_0 \in \arg\min_{w \in \mathbb{R}^k} \|S(Aw - b)\|_2^2$ satisfies $\|Aw_0 - b\|_2^2 \le c \cdot \min_{w \in \mathbb{R}^k} \|Aw - b\|_2^2$ for $c = 1 + 4(1 + \sqrt{n/(4k)})^2 = O(n/k)$.

Algorithm 1 Approximation algorithm for the least squares problem
Input: covariate matrix $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n \times k}$; response vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$; approximation parameter $\epsilon \in (0,1)$.
Assume: $X^\top X = I_k$.
Output: weight vector $\hat w \in \mathbb{R}^k$ and permutation matrix $\hat\Pi \in \mathcal{P}_n$.
1: Run the Row Sampling algorithm with input matrix $X$ to obtain a matrix $S \in \mathbb{R}^{r \times n}$ with $r = 4k$.
2: Let $B$ be the set of vectors $b = (b_1, b_2, \dots, b_n)^\top \in \mathbb{R}^n$ satisfying the following: for each $i \in [n]$, if the $i$-th column of $S$ is all zeros, then $b_i = 0$; otherwise, $b_i \in \{y_1, y_2, \dots, y_n\}$.
3: Let $c := 1 + 4(1 + \sqrt{n/(4k)})^2$.
4: for each $b \in B$ do
5: Compute $w_b \in \arg\min_{w \in \mathbb{R}^k} \|S(Xw - b)\|_2^2$, and let $r_b := \min_{\Pi \in \mathcal{P}_n} \|\Pi X w_b - y\|_2$.
6: Construct a $(\sqrt{\epsilon/c}\, r_b)$-net $N_b$ for the Euclidean ball of radius $\sqrt{c}\, r_b$ around $w_b$, so that for each $v \in \mathbb{R}^k$ with $\|v - w_b\|_2 \le \sqrt{c}\, r_b$, there exists $v' \in N_b$ such that $\|v - v'\|_2 \le \sqrt{\epsilon/c}\, r_b$.
7: end for
8: return $\hat w \in \arg\min_{w \in \bigcup_{b \in B} N_b} \min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2$ and $\hat\Pi \in \arg\min_{\Pi \in \mathcal{P}_n} \|\Pi X \hat w - y\|_2$.

The matrix $S$ returned by Row Sampling determines a (weighted) subset of $O(k)$ rows of $A$ such that solving an (ordinary) least squares problem (with any right-hand side $b$) on this subset of rows and corresponding right-hand side entries yields an $O(n/k)$-approximation to the least squares problem over all rows and right-hand side entries.
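To illustrate how the sketching matrix $S$ from Theorem 1 is used in step 5 of Algorithm 1, here is a minimal sketch that substitutes plain uniform row subsampling for the actual Row Sampling procedure of [5] (so it carries no approximation guarantee; it only shows the mechanics of solving the sketched least squares problem). The function name `sketched_least_squares` is our own.

```python
import numpy as np

def sketched_least_squares(X, b, r, rng):
    """Solve min_w ||S(Xw - b)||_2^2 for a row-selection-style sketch S.

    Here S selects r rows uniformly at random; the Row Sampling algorithm
    of [5] would instead choose (and weight) rows deterministically so that
    the solution is a c = O(n/k)-approximation for every right-hand side b.
    """
    n = X.shape[0]
    rows = rng.choice(n, size=r, replace=False)   # indices of nonzero columns of S
    w, *_ = np.linalg.lstsq(X[rows], b[rows], rcond=None)
    return w, rows

rng = np.random.default_rng(0)
n, k = 200, 5
X = np.linalg.qr(rng.standard_normal((n, k)))[0]   # X with orthonormal columns
b = rng.standard_normal(n)
w_sketch, rows = sketched_least_squares(X, b, r=4 * k, rng=rng)
w_full = X.T @ b                                   # exact minimizer since X^T X = I_k
print(np.sum((X @ w_sketch - b) ** 2), np.sum((X @ w_full - b) ** 2))
```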
Row Sampling does not directly apply to our problem because (i) it does not minimize over permutations of the right-hand side, and (ii) the approximation factor is too large. However, we are able to use it to narrow the search space in our problem.

An alternative to Row Sampling is to simply enumerate all subsets of $k$ rows of $X$. This is justified by a recent result of Dereziński and Warmuth [7], which shows that for any right-hand side $b \in \mathbb{R}^n$, using volume sampling [3] to choose a row-selection matrix $S \in \{0,1\}^{k \times n}$ (where each row has one non-zero entry) gives a similar guarantee as that of Row Sampling, except with the $O(n/k)$ factor replaced by $k+1$ in expectation.

2.2 Analysis

The approximation guarantee of Algorithm 1 is given in the following theorem.

Theorem 2. Algorithm 1 returns $\hat w \in \mathbb{R}^k$ and $\hat\Pi \in \mathcal{P}_n$ satisfying
$$\|\hat\Pi X \hat w - y\|_2^2 \ \le\ (1+\epsilon) \cdot \min_{w \in \mathbb{R}^k, \, \Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2^2 .$$

Proof. Let $\mathrm{opt} := \min_{w, \Pi} \|\Pi X w - y\|_2^2$ be the optimal cost, and let $(w^\star, \Pi^\star)$ denote a solution achieving this cost. The optimality implies that $w^\star$ satisfies the normal equations $X^\top X w^\star = X^\top {\Pi^\star}^\top y$. Observe that there exists a vector $b^\star \in B$ satisfying $S b^\star = S {\Pi^\star}^\top y$. By Theorem 1 and the normal equations, the vector $w_{b^\star}$ and cost value $r_{b^\star}$ satisfy
$$r_{b^\star}^2 \ \le\ \|X(w_{b^\star} - w^\star)\|_2^2 + \mathrm{opt} \ \le\ c \cdot \mathrm{opt} .$$
Moreover, since $X^\top X = I_k$, we have $\|w_{b^\star} - w^\star\|_2 = \|X(w_{b^\star} - w^\star)\|_2 \le \sqrt{(c-1)\,\mathrm{opt}} \le \sqrt{c}\, r_{b^\star}$. By construction of $N_{b^\star}$, there exists $w \in N_{b^\star}$ satisfying $\|w - w^\star\|_2^2 = \|X(w - w^\star)\|_2^2 \le \epsilon\, r_{b^\star}^2/c \le \epsilon \cdot \mathrm{opt}$. For this $w$, the normal equations imply
$$\min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2^2 \ \le\ \|\Pi^\star X w - y\|_2^2 \ =\ \|X(w - w^\star)\|_2^2 + \mathrm{opt} \ \le\ (1+\epsilon) \cdot \mathrm{opt} .$$
Therefore, the solution returned by Algorithm 1 has cost no more than $(1+\epsilon) \cdot \mathrm{opt}$.

By the results of Pananjady et al. [18] for maximum likelihood estimation, our algorithm enjoys recovery guarantees for $\bar w$ and $\bar\pi$ when the data come from the Gaussian measurement model (1). However, the approximation guarantee also holds for worst-case inputs without generative assumptions.

Running time. We now consider the running time of Algorithm 1. There is the initial cost for the singular value decomposition (as discussed at the beginning of the section) and for Row Sampling; both of these take $\mathrm{poly}(n, d)$ time. For the rest of the algorithm, we need to consider the size of $B$ and the size of the net $N_b$ for each $b \in B$. First, we have $|B| \le n^r = n^{O(k)}$, since $S$ has only $4k$ rows and each row has at most a single non-zero entry. Next, for each $b \in B$, we construct the $\delta$-net $N_b$ (for $\delta := \sqrt{\epsilon/c}\, r_b$) by constructing a $\delta/\sqrt{k}$-net for the $\ell_\infty$-ball of radius $\sqrt{c}\, r_b$ centered at $w_b$ (using an appropriate axis-aligned grid). This has size $|N_b| \le (4c^2 k/\epsilon)^{k/2} = (n/\epsilon)^{O(k)}$. Finally, each $\arg\min_{w \in \mathbb{R}^k}$ computation takes $O(nk^2)$ time, and each $(\arg)\min_{\Pi \in \mathcal{P}_n}$ computation takes $O(nk + n \log n)$ time [18] (also see Appendix B). So, the overall running time is $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$.
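The inner step used repeatedly above, computing $\min_{\Pi \in \mathcal{P}_n} \|\Pi X w - y\|_2$ for a fixed $w$, reduces to sorting: the sum of squared differences between a permutation of $Xw$ and $y$ is minimized by matching the two vectors in sorted order (this is the $O(nk + n\log n)$-time step attributed to [18] above). A minimal sketch, with our own function name `best_permutation`:

```python
import numpy as np

def best_permutation(X, w, y):
    """Return an index array pi minimizing ||Pi X w - y||_2, plus the cost.

    Matching the sorted entries of a = Xw with the sorted entries of y
    minimizes the sum of squared differences over all permutations.
    """
    a = X @ w
    order_a = np.argsort(a)            # a[order_a] is sorted
    order_y = np.argsort(y)            # y[order_y] is sorted
    pi = np.empty(len(y), dtype=int)   # pi[i] = j matches row j of X to y_i
    pi[order_y] = order_a
    cost = np.sqrt(np.sum((a[pi] - y) ** 2))
    return pi, cost

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
w = rng.standard_normal(3)
y = (X @ w)[rng.permutation(10)]       # noiseless, shuffled responses
pi, cost = best_permutation(X, w, y)
print(cost)                            # ~0: the true w realigns exactly
```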
3 Exact recovery algorithm in the noiseless Gaussian setting

To counter the intractability of the least squares problem (2) confronted in Section 2, it is natural to explore distributional assumptions that may lead to faster algorithms. In this section, we consider the noiseless measurement model in which the $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$ (as in [18]). We give an algorithm that exactly recovers $\bar w$ with high probability when $n \ge d+1$. The algorithm runs in $\mathrm{poly}(n, d)$ time when $(x_i)_{i=1}^n$ and $\bar w$ are appropriately quantized.

It will be notationally simpler to consider $n+1$ covariate vectors and responses:

$$y_i = \bar w^\top x_{\bar\pi(i)}, \quad i = 0, 1, \dots, n. \qquad (3)$$

Here, $(x_i)_{i=0}^n$ are $n+1$ i.i.d. draws from $\mathrm{N}(0, I_d)$, the unknown permutation $\bar\pi$ is over $\{0, 1, \dots, n\}$, and the requirement of at least $d+1$ measurements is expressed as $n \ge d$.

In fact, we shall consider a variant of the problem in which we are given one of the values of the unknown permutation $\bar\pi$. Without loss of generality, assume we are given that $\bar\pi(0) = 0$. Solving this variant of the problem suffices because there are only $n+1$ possible values of $\bar\pi(0)$: we can try them all, incurring just a factor $n+1$ in the computation time. So henceforth, we just consider $\bar\pi$ as an unknown permutation on $[n]$.

Algorithm 2 Find permutation
Input: covariate vectors $x_0, x_1, x_2, \dots, x_n$ in $\mathbb{R}^d$; response values $y_0, y_1, y_2, \dots, y_n$ in $\mathbb{R}$; confidence parameter $\delta \in (0,1)$; lattice parameter $\beta > 0$.
Assume: there exist $\bar w \in \mathbb{R}^d$ and a permutation $\bar\pi$ on $[n]$ such that $y_i = \bar w^\top x_{\bar\pi(i)}$ for each $i \in [n]$, and $y_0 = \bar w^\top x_0$.
Output: permutation $\hat\pi$ on $[n]$, or failure.
1: Let $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n \times d}$, and let its pseudoinverse be $X^\dagger = [\tilde x_1 | \tilde x_2 | \cdots | \tilde x_n] \in \mathbb{R}^{d \times n}$.
2: Create a Subset Sum instance with $n^2$ source numbers $c_{i,j} := y_i \, \tilde x_j^\top x_0$ for $(i,j) \in [n] \times [n]$ and target sum $y_0$.
3: Run Algorithm 3 on this Subset Sum instance with lattice parameter $\beta$.
4: if Algorithm 3 returns a solution $S \subseteq [n] \times [n]$ then
5: return any permutation $\hat\pi$ on $[n]$ such that $\hat\pi(i) = j$ implies $(i,j) \in S$.
6: else
7: return failure.
8: end if

Algorithm 3 Lagarias and Odlyzko [12] subset sum algorithm
Input: source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$; target sum $t \in \mathbb{R}$; lattice parameter $\beta > 0$.
Output: subset $\hat S \subseteq I$, or failure.
1: Construct the lattice basis $B \in \mathbb{R}^{(|I|+2) \times (|I|+1)}$ consisting of the identity matrix $I_{|I|+1}$ with an appended bottom row whose first entry is $\beta t$ and whose remaining entries are $-\beta c_i$ for $i \in I$.
2: Run basis reduction [e.g., 16] to find a non-zero lattice vector $v$ of length at most $2^{|I|/2} \lambda_1(B)$.
3: if $v = z \cdot (1, \chi_{\hat S}^\top, 0)^\top$ for some $z \in \mathbb{Z}$, where $\chi_{\hat S} \in \{0,1\}^I$ is the characteristic vector of some $\hat S \subseteq I$, then
4: return $\hat S$.
5: else
6: return failure.
7: end if

3.1 Algorithm

Our algorithm, shown as Algorithm 2, is based on a reduction to the Subset Sum problem. An instance of Subset Sum is specified by an unordered collection of source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$ and a target sum $t \in \mathbb{R}$. The goal is to find a subset $S \subseteq I$ such that $\sum_{i \in S} c_i = t$. Although Subset Sum is NP-hard in the worst case, it is tractable for certain structured instances [12, 9]. We prove that Algorithm 2 constructs such an instance with high probability. A similar algorithm based on such a reduction was recently used by Andoni et al. [2] for a different but related problem.

Algorithm 2 proceeds by (i) solving a Subset Sum instance based on the covariate vectors and response values (using Algorithm 3), and (ii) constructing a permutation $\hat\pi$ on $[n]$ based on the solution to the Subset Sum instance. With the permutation $\hat\pi$ in hand, we (try to) find a solution $w \in \mathbb{R}^d$ to the system of linear equations $y_i = w^\top x_{\hat\pi(i)}$ for $i \in [n]$. If $\hat\pi = \bar\pi$, then there is a unique such solution almost surely.
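The following sketch (our own illustration, with hypothetical names `subset_sum_instance` and `lagarias_odlyzko_basis`) builds the Subset Sum instance of step 2 of Algorithm 2 and the lattice basis of Algorithm 3, and numerically checks that the subset induced by the true permutation sums to the target $y_0$ (the content of Lemma 1 below). It does not run lattice basis reduction; that step would use an LLL implementation such as the ones available in fpylll or SageMath.

```python
import numpy as np

def subset_sum_instance(xs, ys):
    """Source numbers c[i, j] = y_i * <x_tilde_j, x_0> and target y_0 (Algorithm 2, step 2)."""
    x0, X = xs[0], np.asarray(xs[1:])           # X has rows x_1, ..., x_n
    X_pinv = np.linalg.pinv(X)                  # columns are x_tilde_1, ..., x_tilde_n
    c = np.outer(ys[1:], X_pinv.T @ x0)         # c[i-1, j-1] = y_i * x_tilde_j^T x_0
    return c, ys[0]

def lagarias_odlyzko_basis(c_flat, t, beta):
    """Basis B of Algorithm 3: identity on top, bottom row (beta*t, -beta*c_i)."""
    m = len(c_flat)
    bottom = np.concatenate(([beta * t], -beta * c_flat))
    return np.vstack([np.eye(m + 1), bottom])   # shape (m + 2, m + 1)

rng = np.random.default_rng(0)
n, d = 6, 3
w_bar = rng.standard_normal(d)
xs = rng.standard_normal((n + 1, d))
pi_bar = np.concatenate(([0], 1 + rng.permutation(n)))   # pi_bar(0) = 0
ys = np.array([w_bar @ xs[pi_bar[i]] for i in range(n + 1)])

c, t = subset_sum_instance(xs, ys)
true_subset_sum = sum(c[i - 1, pi_bar[i] - 1] for i in range(1, n + 1))
print(np.isclose(true_subset_sum, t))            # True: Lemma 1's identity

B = lagarias_odlyzko_basis(c.ravel(), t, beta=1.0)
chi = np.zeros(n * n)
for i in range(1, n + 1):                        # characteristic vector of S_pi
    chi[(i - 1) * n + (pi_bar[i] - 1)] = 1.0
target_vec = B @ np.concatenate(([1.0], chi))    # should equal (1, chi, 0)
print(np.isclose(target_vec[-1], 0.0))           # last coordinate vanishes
```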
3.2 Analysis

The following theorem is the main recovery guarantee for Algorithm 2.

Theorem 3. Pick any $\delta \in (0,1)$. Suppose $(x_i)_{i=0}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$, the $(y_i)_{i=0}^n$ follow the noiseless measurement model from (3) for some $\bar w \in \mathbb{R}^d$ and permutation $\bar\pi$ on $[n]$ (with $\bar\pi(0) = 0$), and $n \ge d$. Furthermore, suppose Algorithm 2 is run with inputs $(x_i)_{i=0}^n$, $(y_i)_{i=0}^n$, $\delta$, and $\beta$, where $\beta \ge 2^{n^2/2}/\varepsilon$ and $\varepsilon$ is as defined in Equation (8). With probability at least $1-\delta$, Algorithm 2 returns $\hat\pi = \bar\pi$.

Remark 1. The value of $\varepsilon$ from Equation (8) is directly proportional to $\|\bar w\|_2$, and Algorithm 2 requires a lower bound on $\varepsilon$ (in the setting of the lattice parameter $\beta$). Hence, it suffices to determine a lower bound on $\|\bar w\|_2$. Such a bound can be obtained from the measurement values: a standard tail bound (Lemma 6 in Appendix C) shows that, with high probability, $\sum_{i=1}^n |y_i|/(2n)$ is a lower bound on $\|\bar w\|_2$, and is within a constant factor of it as well.

Remark 2. Algorithm 2 strongly exploits the assumption of noiseless measurements, which is expected given the SNR lower bounds of Pananjady et al. [18] for recovering $\bar\pi$. The algorithm is, however, very brittle and very likely fails in the presence of noise.

Remark 3. The recovery result does not contradict the results of Unnikrishnan et al. [22], which show that a collection of $2d$ measurement vectors is necessary for recovering all $w$, even in the noiseless measurement model of (3). Indeed, our result shows that for a fixed $\bar w \in \mathbb{R}^d$, with high probability $d+1$ measurements in the model of (3) suffice to permit exact recovery of $\bar w$; but this same set of measurement vectors (when $d+1 < 2d$) will fail for some other $w'$.

The proof of Theorem 3 is based on the following theorem, essentially due to Lagarias and Odlyzko [12] and Frieze [9], concerning certain structured instances of Subset Sum that can be solved using the lattice basis reduction algorithm of Lenstra et al. [16]. Given a basis $B = [b_1|b_2|\cdots|b_k] \in \mathbb{R}^{m \times k}$ for the lattice
$$\mathcal{L}(B) := \Big\{ \textstyle\sum_{i=1}^k z_i b_i : z_1, z_2, \dots, z_k \in \mathbb{Z} \Big\},$$
this algorithm can be used to find a non-zero lattice vector $v \in \mathcal{L}(B) \setminus \{0\}$ whose length is at most $2^{(k-1)/2}$ times that of the shortest non-zero vector in the lattice,
$$\lambda_1(B) := \min_{v \in \mathcal{L}(B) \setminus \{0\}} \|v\|_2 .$$

Theorem 4 ([12, 9]). Suppose the Subset Sum instance specified by source numbers $\{c_i\}_{i \in I} \subset \mathbb{R}$ and target sum $t \in \mathbb{R}$ satisfies the following properties.
1. There is a subset $S^\star \subseteq I$ such that $\sum_{i \in S^\star} c_i = t$.
2. Define $R := 2^{|I|/2}\sqrt{|S^\star|+1}$ and $\mathcal{Z}_R := \{(z_0, z) \in \mathbb{Z} \times \mathbb{Z}^I : 0 < z_0^2 + \sum_{i \in I} z_i^2 \le R^2\}$. There exists $\varepsilon > 0$ such that $|z_0 t - \sum_{i \in I} z_i c_i| \ge \varepsilon$ for each $(z_0, z) \in \mathcal{Z}_R$ that is not an integer multiple of $(1, \chi^\star)$, where $\chi^\star \in \{0,1\}^I$ is the characteristic vector of $S^\star$.
Let $B$ be the lattice basis constructed by Algorithm 3, and assume $\beta \ge 2^{|I|/2}/\varepsilon$. Then every non-zero vector in the lattice $\mathcal{L}(B)$ with length at most $2^{|I|/2}$ times the length of the shortest non-zero vector in $\mathcal{L}(B)$ is an integer multiple of the vector $(1, \chi_{S^\star}, 0)$, and the basis reduction algorithm of Lenstra et al. [16] returns such a non-zero vector.

The Subset Sum instance constructed in Algorithm 2 has $n^2$ source numbers $\{c_{i,j} : (i,j) \in [n] \times [n]\}$ and target sum $y_0$. We need to show that it satisfies the two conditions of Theorem 4. Let $S_\pi := \{(i,j) : \bar\pi(i) = j\} \subseteq [n] \times [n]$, and let $\bar\Pi = (\bar\Pi_{i,j})_{(i,j) \in [n] \times [n]} \in \mathcal{P}_n$ be the permutation matrix with $\bar\Pi_{i,j} := \mathbb{1}\{\bar\pi(i) = j\}$ for all $(i,j) \in [n] \times [n]$. Note that $\bar\Pi$ is the characteristic vector for $S_\pi$. Define $R := 2^{n^2/2}\sqrt{n+1}$ and
$$\mathcal{Z}_R := \big\{ (z_0, Z) \in \mathbb{Z} \times \mathbb{Z}^{n \times n} : 0 < z_0^2 + \|Z\|_F^2 \le R^2 \big\} .$$
A crude bound shows that $|\mathcal{Z}_R| \le 2^{O(n^4)}$. The following lemma establishes the first required property in Theorem 4.

Lemma 1. The random matrix $X$ has rank $d$ almost surely, and the subset $S_\pi$ satisfies $y_0 = \sum_{(i,j) \in S_\pi} c_{i,j}$.

Proof. That $X$ has rank $d$ almost surely follows from the fact that $n \ge d$ and the probability density of $X$ is supported on all of $\mathbb{R}^{n \times d}$. This implies that $X^\dagger X = \sum_{j=1}^n \tilde x_j x_j^\top = I_d$, and therefore
$$y_0 = \bar w^\top x_0 = (X^\dagger X \bar w)^\top x_0 = \sum_{j=1}^n (x_j^\top \bar w)(\tilde x_j^\top x_0) = \sum_{(i,j) \in [n] \times [n]} y_i \, \tilde x_j^\top x_0 \, \mathbb{1}\{\bar\pi(i) = j\} = \sum_{(i,j) \in [n] \times [n]} c_{i,j} \, \mathbb{1}\{\bar\pi(i) = j\} .$$

The next lemma establishes the second required property in Theorem 4. Here, we use the fact that the Frobenius norm $\|z_0 \bar\Pi - Z\|_F$ is at least one whenever $(z_0, Z) \in \mathbb{Z} \times \mathbb{Z}^{n \times n}$ is not an integer multiple of $(1, \bar\Pi)$.

Lemma 2. Pick any $\gamma, \gamma' > 0$ such that $3\gamma|\mathcal{Z}_R| + \gamma' < 1$. With probability at least $1 - 3\gamma|\mathcal{Z}_R| - \gamma'$, every $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j) \in [n] \times [n]}$ that is not an integer multiple of $(1, \bar\Pi)$ satisfies
$$\Big| z_0 y_0 - \sum_{(i,j) \in [n] \times [n]} Z_{i,j} c_{i,j} \Big| \ \ge\ \varepsilon_0(\gamma, \gamma', n, d) \, \|\bar w\|_2 ,$$
where $\varepsilon_0(\gamma, \gamma', n, d) > 0$ is an explicit function of its arguments given by the proof below.
Proof. By Lemma 1, the matrix $\bar\Pi$ satisfies $y_0 = \sum_{(i,j)} \bar\Pi_{i,j} c_{i,j}$. Fix any $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j) \in [n] \times [n]}$ that is not an integer multiple of $(1, \bar\Pi)$. Then
$$z_0 y_0 - \sum_{(i,j)} Z_{i,j} c_{i,j} = \sum_{(i,j)} (z_0 \bar\Pi_{i,j} - Z_{i,j}) \, y_i \, \tilde x_j^\top x_0 .$$
Using matrix and vector notation, and using $y = \bar\Pi X \bar w$, this can be written compactly as the inner product $x_0^\top X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w$. Since $x_0 \sim \mathrm{N}(0, I_d)$ and is independent of $X$, the conditional distribution of this inner product given $X$ is normal with mean zero and standard deviation $\|X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2$. By Lemma 7 (in Appendix C), with probability at least $1 - \gamma$, the inner product has magnitude at least a $\gamma/4$ fraction of this standard deviation. Observe that $X^\dagger = (X^\top X)^{-1} X^\top$ since $X$ has rank $d$ by Lemma 1, so
$$\|X^\dagger (z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2 \ \ge\ \frac{\|(z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2}{\sigma_{\max}(X)} ,$$
where $\sigma_{\max}(X)$ denotes the largest singular value of $X$. By Lemma 4 (in Appendix C), with probability at least $1 - \gamma'$, $\sigma_{\max}(X) = O(\sqrt{n})$. And by Lemma 9 (in Appendix C), with probability at least $1 - 2\gamma$,
$$\|(z_0 \bar\Pi - Z)^\top \bar\Pi X \bar w\|_2 \ \ge\ \Big(\frac{\gamma}{8n}\Big)^{1 + 1/(d-1)} \|(z_0 \bar\Pi - Z)^\top \bar\Pi\|_F \, \|\bar w\|_2 .$$
Since $\bar\Pi$ is orthogonal, we have $\|(z_0 \bar\Pi - Z)^\top \bar\Pi\|_F = \|z_0 \bar\Pi - Z\|_F \ge 1$. Combining these bounds and taking a union bound over all $(z_0, Z) \in \mathcal{Z}_R$ proves the claim.

Proof of Theorem 3. Lemma 1 and Lemma 2 (with $\gamma' := \delta/2$ and $\gamma := \delta/(6|\mathcal{Z}_R|)$) together imply that with probability at least $1-\delta$, the source numbers $\{c_{i,j} : (i,j) \in [n] \times [n]\}$ and target sum $y_0$ satisfy the conditions of Theorem 4 with
$$S^\star := \{(i,j) \in [n] \times [n] : \bar\pi(i) = j\}, \qquad \varepsilon := \varepsilon_0\big(\delta/(6|\mathcal{Z}_R|), \, \delta/2, \, n, \, d\big) \, \|\bar w\|_2 \ \ge\ \frac{\|\bar w\|_2}{2^{\mathrm{poly}(n, \log(1/\delta))}} . \qquad (8)$$
Thus, in this event, Algorithm 3 (with $\beta$ satisfying $\beta \ge 2^{n^2/2}/\varepsilon$) returns $\hat S = S^\star$, which uniquely determines the permutation $\hat\pi = \bar\pi$ returned by Algorithm 2.

Running time. The basis reduction algorithm of Lenstra et al. [16] is iterative, with each iteration primarily consisting of Gram-Schmidt orthogonalization and another efficient linear algebraic process called size reduction. The total number of iterations required is polynomial in $k$ and in $\log(\max_{i \in [k]} \|b_i\|_2 / \lambda_1(B))$. In our case, $k = n^2 + 1$ and $\lambda_1(B) = \sqrt{n+1}$; and by Lemma 10 (in Appendix C), each of the basis vectors constructed has squared length at most $1 + \beta^2 \, \mathrm{poly}(d, \log(n), 1/\delta) \, \|\bar w\|_2^2$. Using the tight setting of $\beta$ required in Theorem 3, this gives a $\mathrm{poly}(n, d, \log(1/\delta))$ bound on the total number of iterations as well as on the total running time. However, the basis reduction algorithm requires both arithmetic and rounding operations, which are typically only available for finite-precision rational inputs. Therefore, a formal running time analysis would require the idealized real-valued covariate vectors $(x_i)_{i=0}^n$ and unknown target vector $\bar w$ to be quantized to finite-precision values. This is doable, and is similar to using a discretized Gaussian distribution for the distribution of the covariate vectors (and assuming $\bar w$ is a vector of finite-precision values), but leads to a messier analysis incomparable to the setup of previous works. Nevertheless, it would be desirable to find a different algorithm that avoids lattice basis reduction and still works with just $d+1$ measurements.

4 Lower bounds on signal-to-noise for approximate recovery

In this section, we consider the measurement model from (1), where $(x_i)_{i=1}^n$ are i.i.d. draws from either $\mathrm{N}(0, I_d)$ or the uniform distribution on $[-1/2, 1/2]^d$, and $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$. We establish lower bounds on the signal-to-noise ratio, $\mathrm{SNR} = \|\bar w\|_2^2/\sigma^2$, required by any estimator $\hat w = \hat w((x_i, y_i)_{i=1}^n)$ to approximately recover $\bar w$ in expectation. The estimators may have a priori knowledge of the values of $\|\bar w\|_2$ and $\sigma^2$.
Theorem 5. Assume $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, \sigma^2)$.
1. There is an absolute constant $C > 0$ such that the following holds. If $n \ge 3$, $d \ge 22$, the $(x_i)_{i=1}^n$ are i.i.d. draws from $\mathrm{N}(0, I_d)$, the $(y_i)_{i=1}^n$ follow the measurement model from (1), and $\mathrm{SNR} \le C \min\{1, d/\log\log(n)\}$, then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that $\mathbb{E}\|\hat w - \bar w\|_2 \ge \frac{1}{24}\|\bar w\|_2$.
2. If the $(x_i)_{i=1}^n$ are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, the $(y_i)_{i=1}^n$ follow the measurement model from (1), and $\mathrm{SNR} \le 2$, then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that $\mathbb{E}\|\hat w - \bar w\|_2$ is at least a constant fraction of $\|\bar w\|_2$.

Note that in the classical linear regression model, where $y_i = \bar w^\top x_i + \varepsilon_i$ for $i \in [n]$, the maximum likelihood estimator $\hat w_{\mathrm{mle}}$ satisfies $\mathbb{E}\|\hat w_{\mathrm{mle}} - \bar w\|_2 \le C\sigma\sqrt{d/n}$, where $C > 0$ is an absolute constant. Therefore, the SNR requirement to approximately recover $\bar w$ up to (say) Euclidean distance $\|\bar w\|_2/24$ is $\mathrm{SNR} \ge 24^2 C^2 d/n$. Compared to this setting, Theorem 5 implies that with the measurement model of (1), the SNR requirement (as a function of $n$) is substantially higher: $d/\log\log(n)$ in the normal covariate case, or a constant not even decreasing with $n$ in the uniform covariate case.

For the normal covariate case, Pananjady et al. [18] show that if $n > d$ and $\mathrm{SNR} \ge n^c$ (for a suitable absolute constant $c > 0$, under mild additional conditions relating $n$ and $d$), then the maximum likelihood estimator $(\hat w_{\mathrm{mle}}, \hat\pi_{\mathrm{mle}})$ (i.e., any minimizer of (2)) satisfies $\hat\pi_{\mathrm{mle}} = \bar\pi$ with probability at least $1 - c'n^{-2}$. (Here, $c > 0$ and $c' > 0$ are absolute constants.) It is straightforward to see that, on the same event, we have $\|\hat w_{\mathrm{mle}} - \bar w\|_2 \le C\sigma\sqrt{d/n}$ for some absolute constant $C > 0$. Therefore, the necessary and sufficient conditions on SNR for approximate recovery of $\bar w$ lie between $C'd/\log\log(n)$ and $n^{C''}$ (for absolute constants $C', C'' > 0$). Narrowing this range remains an interesting open problem.

A sketch of the proof in the normal covariate case is as follows. Without loss of generality, we restrict attention to the case where $\bar w$ is a unit vector. We construct a $1/\sqrt{2}$-packing $U$ of the unit sphere in $\mathbb{R}^d$; the target $\bar w$ will be chosen from this set. Observe that for any distinct $u, u' \in U$, each of $(x_i^\top u)_{i=1}^n$ and $(x_i^\top u')_{i=1}^n$ is an i.i.d. sample from $\mathrm{N}(0,1)$ of size $n$; we prove that they therefore determine empirical distributions that are close to each other in Wasserstein-2 distance with high probability. We then prove that, conditional on this event, the resulting distributions of $(y_i)_{i=1}^n$ under $\bar w = u$ and under $\bar w = u'$ (for any pair $u, u' \in U$) are close in Kullback-Leibler divergence. Hence, by (a generalization of) Fano's inequality [see, e.g., 11], no estimator can determine the correct $u \in U$ with high probability. The proof for the uniform case is similar, using $U = \{e_1, -e_1\}$ where $e_1 = (1, 0, \dots, 0)^\top$. The full proof of Theorem 5 is given in Appendix D.
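To make the key step of this sketch concrete: in one dimension, the Wasserstein-2 distance between two empirical distributions over the same number of points is computed by sorting. The following sketch (our own illustration; `empirical_w2` is not from the paper) shows that the empirical distributions of $(x_i^\top u)_{i=1}^n$ and $(x_i^\top u')_{i=1}^n$ are typically much closer to each other than the two index-aligned sequences are, which is what the missing correspondence exploits.

```python
import numpy as np

def empirical_w2(a, b):
    """Wasserstein-2 distance between the empirical distributions of two
    equal-length samples: match the sorted values (optimal 1-d coupling)."""
    a, b = np.sort(a), np.sort(b)
    return np.sqrt(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
u = np.eye(d)[0]
u_prime = np.eye(d)[1]                      # another unit vector, far from u
proj_u, proj_up = X @ u, X @ u_prime        # two N(0,1) samples of size n
print(empirical_w2(proj_u, proj_up))               # small: distributions nearly coincide
print(np.sqrt(np.mean((proj_u - proj_up) ** 2)))   # ~ sqrt(2): aligned sequences differ
```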
Acknowledgments

We are grateful to Ashwin Pananjady, Michał Dereziński, and Manfred Warmuth for helpful discussions. DH was supported in part by NSF awards DMR-1534910 and IIS-1563785, a Bloomberg Data Science Research Grant, and a Sloan Research Fellowship. XS was supported in part by a grant from the Simons Foundation (#320173 to Xiaorui Sun). This work was done in part while DH and KS were research visitors and XS was a research fellow at the Simons Institute for the Theory of Computing.

References

[1] Abubakar Abid, Ada Poon, and James Zou. Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342, 2017.
[2] Alexandr Andoni, Daniel Hsu, Kevin Shi, and Xiaorui Sun. Correspondence retrieval. In Conference on Learning Theory, 2017.
[3] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464-1499, 2013.
[4] Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint, 2014.
[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880-6892, 2013.
[6] Kenneth R. Davidson and Stanislaw J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, volume 1, pages 317-366, 2001.
[7] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. arXiv preprint arXiv:1705.06908, 2017.
[8] Golnooshsadat Elhami, Adam James Scholefield, Benjamin Bejar Haro, and Martin Vetterli. Unlabeled sensing: Reconstruction algorithm and theoretical guarantees. In Proceedings of the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[9] Alan M. Frieze. On the Lagarias-Odlyzko algorithm for the subset sum problem. SIAM Journal on Computing, 15(2):536-539, 1986.
[10] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[11] Te Sun Han and Sergio Verdú. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247-1251, 1994.
[12] Jeffrey C. Lagarias and Andrew M. Odlyzko. Solving low-density subset sum problems. Journal of the ACM, 32(1):229-246, 1985.
[13] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.
[14] Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38-53, 1973.
[15] Michel Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
[16] Arjen Klaas Lenstra, Hendrik Willem Lenstra, and László Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261(4):515-534, 1982.
[17] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007.
[18] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Linear regression with an unknown permutation: Statistical and computational limits. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 417-424, 2016.
[19] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Denoising linear models with permuted data. arXiv preprint arXiv:1704.07461, 2017.
[20] Rolf-Dieter Reiss. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Science & Business Media, 2012.
[21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990, 2010.
[22] Jayakrishnan Unnikrishnan, Saeid Haghighatshoar, and Martin Vetterli. Unlabeled sensing with random linear measurements. arXiv preprint arXiv:1512.00115, 2015.
[23] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.
[24] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423-435. Springer, 1997.