Published as a conference paper at ICLR 2022

FAST REGRESSION FOR STRUCTURED INPUTS

Raphael A. Meyer, New York University, ram900@nyu.edu
Cameron Musco, University of Massachusetts Amherst, cmusco@cs.umass.edu
Christopher Musco, New York University, cmusco@nyu.edu
David P. Woodruff, Carnegie Mellon University, dwoodruf@andrew.cmu.edu
Samson Zhou, Carnegie Mellon University, samsonzhou@gmail.com

ABSTRACT

We study the ℓp regression problem, which requires finding $x \in \mathbb{R}^d$ that minimizes $\|Ax - b\|_p$ for a matrix $A \in \mathbb{R}^{n \times d}$ and response vector $b \in \mathbb{R}^n$. There has been recent interest in developing subsampling methods for this problem that can outperform standard techniques when n is very large. However, all known subsampling approaches have run time that depends exponentially on p, typically $d^{O(p)}$, which can be prohibitively expensive. We improve on this work by showing that for a large class of common structured matrices, such as combinations of low-rank matrices, sparse matrices, and Vandermonde matrices, there are subsampling-based methods for ℓp regression that depend polynomially on p. For example, we give an algorithm for ℓp regression on Vandermonde matrices that runs in time $O(n \log^3 n + (dp^2)^{0.5+\omega} \cdot \mathrm{polylog}(n))$, where ω is the exponent of matrix multiplication. The polynomial dependence on p crucially allows our algorithms to extend naturally to efficient algorithms for ℓ∞ regression, via approximation of ℓ∞ by $\ell_{O(\log n)}$. Of practical interest, we also develop a new subsampling algorithm for ℓp regression for arbitrary matrices, which is simpler than previous approaches for p ≥ 4.

1 INTRODUCTION

Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal of linear regression is to find a vector $x \in \mathbb{R}^d$ such that Ax is as close as possible to b. In approximate ℓp linear regression in particular, we seek to find $\hat{x} \in \mathbb{R}^d$ such that, for some approximation parameter ε > 0,
$$\|A\hat{x} - b\|_p \le (1 + \varepsilon) \min_{x \in \mathbb{R}^d} \|Ax - b\|_p.$$
Here, for a vector $y \in \mathbb{R}^n$, $\|y\|_p = \left(\sum_{i=1}^n |y_i|^p\right)^{1/p}$. ℓp regression is central in statistical data analysis, and has numerous applications in machine learning, data science, and applied mathematics (Friedman et al., 2001; Chatterjee & Hadi, 2006).

There are a number of algorithmic approaches to solving ℓp regression. For example, we can directly apply iterative methods like gradient descent or stochastic gradient descent. Alternatively, we can use iteratively reweighted least squares, which reduces the regression problem to solving poly(d) linear systems (Adil et al., 2019b;a). Both of the above approaches require repeated passes over the matrix A, so while their runtimes are typically linear in nd, or more generally in the time to multiply the matrix A by a vector, that factor is multiplied by other parameters, such as the number of iterations to convergence.

An alternative approach, which can lead to a faster running time when n is large, is to apply sketch-and-solve methods. This approach begins with an inexpensive subsampling step, which selects a subset of rows in A to produce a smaller matrix M with $\mathrm{poly}(d, \frac{1}{\varepsilon}, \log n) \ll n$ rows. M can be written as M = SA, where S is a row sampling and rescaling matrix. The goal is for $\|SAx - Sb\|_p$ to be a good approximation to $\|Ax - b\|_p$ for all $x \in \mathbb{R}^d$. If this is the case, then an approximate solution to the original ℓp regression problem can be obtained by solving the subsampled problem, which has smaller size, thus allowing for more efficient computation.
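To ground the discussion, the following is a minimal illustrative sketch (not the implementation used in this paper) of the iteratively reweighted least squares approach mentioned above. Note that every iteration touches all n rows of A; that per-iteration cost is exactly what the subsampling methods studied in this paper avoid. The damping factor and tolerance below are our own choices, not part of any cited algorithm.

```python
import numpy as np

def irls_lp(A, b, p, iters=100, eps=1e-8):
    """Sketch of iteratively reweighted least squares for min_x ||Ax - b||_p with p >= 2.

    Each iteration solves one weighted least-squares problem; the damped update is a
    simple stabilizer, not the refined scheme of Adil et al. (2019a;b).
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # start from the l_2 solution
    for _ in range(iters):
        r = A @ x - b
        w = np.maximum(np.abs(r), eps) ** (p - 2)     # reweighting with weights |r_i|^{p-2}
        Aw = A * w[:, None]                           # rows of A scaled by the weights
        x_ls = np.linalg.solve(A.T @ Aw, Aw.T @ b)    # weighted normal equations
        x = x + (x_ls - x) / (p - 1)                  # damped step for stability when p > 2
    return x

# Tiny usage example.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
b = A @ rng.standard_normal(5) + rng.standard_normal(500)
x4 = irls_lp(A, b, p=4)
print(np.linalg.norm(A @ x4 - b, ord=4))
```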
The standard approach to subsampling for ℓp regression is to sample rows with probability proportional to their so-called ℓp Lewis weights (Cohen & Peng, 2015). Unfortunately, for general inputs A, ℓp Lewis weight sampling requires $O(d^{\max(1,p/2)})$ rows, and it can be shown that no subsampling method can take fewer than $d^{\Omega(p)}$ rows (Li et al., 2021). This means that sampling is only helpful in the limited regime where $n \gg d^{p/2}$.

However, there are many applications in which the matrix A has additional structure, which can be leveraged to design more efficient algorithms. For example, Vandermonde matrices are used in the polynomial regression problem, which has been studied for over 200 years (Gergonne, 1974) and has applications to machine learning (Kalai et al., 2008), applied statistics (Mac Neill, 1978), and computer graphics (Pratt, 1987). The goal is to fit a signal, which is measured at time points $t_1, \ldots, t_n$, using a degree-d polynomial. This problem can be formulated as ℓp regression with a Vandermonde feature matrix A, whose i-th row $a_i$ is of the form $[1, t_i, t_i^2, \ldots, t_i^{d-1}]$. Regression problems with Vandermonde matrices also arise in the settings of Fourier-constrained function fitting (Avron et al., 2019) and Toeplitz covariance estimation (Eldar et al., 2020).

Notably, (Shi & Woodruff, 2019) leverages the structure of Vandermonde matrices to build a subsampled matrix M from a given Vandermonde matrix A more quickly than would be possible for a general input. Their method does not change how many rows are in the matrix M, so the overall algorithm still incurs an exponential dependence on p. So while the approach is a helpful improvement for Vandermonde regression when p is small, it leaves an infeasible runtime for important problems like ℓ∞ regression, which can be approximated by ℓp regression for p = O(log n).

1.1 OUR CONTRIBUTIONS

We first show that for ℓp regression on Vandermonde matrices, it is possible to reduce the size of the subsampled matrix M to depend polynomially instead of exponentially on p.

Theorem 1.1. Given ε ∈ (0, 1) and p ≥ 1, a Vandermonde matrix $A \in \mathbb{R}^{n \times d}$, and $b \in \mathbb{R}^n$, there exists an algorithm that uses $O\!\left(n \log^3 n + d^{0.5+\omega} \cdot \mathrm{poly}(\tfrac{1}{\varepsilon}, p, \log n)\right)$ time to compute a sampling matrix $S \in \mathbb{R}^{m \times n}$ with $m = O\!\left(\tfrac{p^2 d}{\varepsilon^3} \log^2 n\right)$ rows so that with high probability, for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax - b\|_p \le \|SAx - Sb\|_p \le (1 + \varepsilon)\|Ax - b\|_p,$$
and then to return a vector $\hat{x} \in \mathbb{R}^d$ such that $\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$.

The best known previous work required at least $\Omega(n \log^2 n + d^{p/2}\,\mathrm{poly}(1/\varepsilon))$ time (Shi & Woodruff, 2019). This has an exponential dependence on p, while our algorithm has just a polynomial dependence. Building on Theorem 1.1, we observe that to obtain a (1 + ε)-approximation to the fundamental problem of ℓ∞ polynomial regression, it suffices to consider ℓp regression for $p = O(\tfrac{\log n}{\varepsilon})$. Since our results have polynomial rather than exponential dependence on p, we thus obtain the first subsampling guarantees for Vandermonde ℓ∞ regression.

Theorem 1.2. Given ε ∈ (0, 1), a Vandermonde matrix $A \in \mathbb{R}^{n \times d}$, and $b \in \mathbb{R}^n$, there exists an algorithm that uses $O\!\left(n \log^3 n + d^{0.5+\omega} \cdot \mathrm{poly}(\tfrac{1}{\varepsilon}, \log n)\right)$ time to compute a sampling matrix $S \in \mathbb{R}^{m \times n}$ with $m = O\!\left(\tfrac{d}{\varepsilon^5} \log^4 n\right)$ rows such that with high probability, for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax - b\|_\infty \le \|SAx - Sb\|_\infty \le (1 + \varepsilon)\|Ax - b\|_\infty,$$
and then to return a vector $\hat{x} \in \mathbb{R}^d$ such that $\|A\hat{x} - b\|_\infty \le (1 + \varepsilon)\min_{x \in \mathbb{R}^d} \|Ax - b\|_\infty$.

To the best of our knowledge, this is the first known dimensionality reduction for ℓ∞ regression with provable guarantees for any (nontrivial) input matrix.
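The reduction behind Theorem 1.2 rests on the standard fact, stated later as Lemma 2.4, that for $p = \Theta(\frac{\log n}{\varepsilon})$ the ℓp norm approximates the ℓ∞ norm within a (1 + ε) factor. A quick numeric check of that fact (an illustrative sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.1
# Smallest p with n^{1/p} <= 1 + eps, i.e., p = O(log(n) / eps).
p = int(np.ceil(np.log(n) / np.log(1 + eps)))

x = rng.standard_normal(n)
linf = np.max(np.abs(x))
# Normalize by the max entry before raising to the p-th power to avoid overflow.
lp = np.sum(np.abs(x / linf) ** p) ** (1 / p) * linf

# ||x||_inf <= ||x||_p <= n^{1/p} ||x||_inf <= (1 + eps) ||x||_inf
print(linf <= lp <= (1 + eps) * linf)  # True
```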
We summarize these results in Table 1. Our second contribution is to show that improved sampling bounds for ℓp regression can be extended to a broad class of inputs, beyond Vandermonde matrices. We introduce the following definition to capture the true dimension of regression problems for structured input matrices.

Definition 1.3 (Rank of Regression Problem). Given an integer p ≥ 1 and a matrix $A \in \mathbb{R}^{n \times d}$, suppose there exists a matrix $M \in \mathbb{R}^{n \times t}$ and a fixed function $f : \mathbb{R}^d \to \mathbb{R}^t$ so that for all $x \in \mathbb{R}^d$, $|\langle a_i, x\rangle|^p = |\langle m_i, f(x)\rangle|$. Then we call the minimal such t the rank of the ℓp regression problem.

| Rows Sampled, ℓp Regression | Rows Sampled, ℓ∞ Regression | Reference |
|---|---|---|
| $d^p \cdot \mathrm{poly}(\log n, \frac{1}{\varepsilon})$ | n | (Avron et al., 2013) |
| $d^p \cdot \mathrm{poly}(\log n, \frac{1}{\varepsilon})$ | n | (Shi & Woodruff, 2019) |
| $dp^2 \cdot \mathrm{poly}(\log n, \frac{1}{\varepsilon})$ (Theorem 1.1) | $d \cdot \mathrm{poly}(\log n, \frac{1}{\varepsilon})$ (Theorem 1.2) | Our Results |

Table 1: Sample complexity for regression on Vandermonde matrices

Theorems 1.1 and 1.2 rely on the following key structural property that we prove about the p-fold tensor product of rows of the Vandermonde matrix.

Lemma 1.4. For integer p ≥ 1, the rank of the ℓp regression problem on a Vandermonde matrix $A \in \mathbb{R}^{n \times d}$ is O(dp).

Lemma 1.4 implies that the ℓp loss function for a row of a Vandermonde matrix can be expressed as a linear combination of O(dp) variables, even though the entries of the measurement vector b can be arbitrary. By comparison, the ℓp loss function $|\langle a_i, x\rangle - b_i|^p$ for a row $a_i$ of a general matrix A can only be expressed as a linear combination of $O(p \cdot d^p)$ variables, corresponding to each of the $d^k$ k-wise products of coordinates of x, for each $k \in [p]$. As a corollary of Lemma 1.4, Theorems 1.1 and 1.2 obtain a small coreset for ℓp regression (as well as ℓ∞ regression) on a Vandermonde matrix, which can thus be used as a preconditioner for ℓp regression.

We generalize Lemma 1.4 to similar guarantees for ℓp regression on a matrix A that is the sum of a low-rank matrix K and an s-sparse matrix S that has at most s non-zero entries per row. Using the notion of the rank of the regression problem for such a matrix A, we obtain the following guarantee:

Theorem 1.5. Given ε ∈ (0, 1) and p ≥ 1, a rank-k matrix $K \in \mathbb{R}^{n \times d}$ and an s-sparse matrix $S \in \mathbb{R}^{n \times d}$ so that A := K + S, and $b \in \mathbb{R}^n$, there exists an algorithm that uses $O\!\left(nk^{\omega-1} + n \cdot \mathrm{poly}(2^p, d^s, k^p, s^p, \tfrac{1}{\varepsilon}, \log n)\right)$ time to compute a sampling matrix $T \in \mathbb{R}^{m \times n}$ containing $m = O\!\left(\tfrac{p\, d^s (k+s)^p (s+p)}{\varepsilon^3} \log^2(pn)\right)$ rows so that with high probability, for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax - b\|_p \le \|TAx - Tb\|_p \le (1 + \varepsilon)\|Ax - b\|_p,$$
and then to return a vector $\hat{x} \in \mathbb{R}^d$ such that $\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$. Further, if the low-rank factorization of K is given explicitly, then this runtime can be improved to $n \cdot \mathrm{poly}(2^p, d^s, k^p, s^p, \tfrac{1}{\varepsilon}, \log n)$.

Similarly, we obtain efficient guarantees for ℓp regression on a matrix A that is the sum of a Vandermonde matrix V and a sparse matrix S that has at most s non-zero entries per row.

Theorem 1.6. Given ε ∈ (0, 1) and p ≥ 1, a Vandermonde matrix $V \in \mathbb{R}^{n \times d}$ and an s-sparse matrix $S \in \mathbb{R}^{n \times d}$ such that A := V + S, and $b \in \mathbb{R}^n$, there exists an algorithm that uses $O\!\left(n \log^3 n + \mathrm{poly}(p^2 d, d^s, s^p, \log n, \tfrac{1}{\varepsilon})\right)$ time to compute a sampling matrix $T \in \mathbb{R}^{m \times n}$ with $m = O\!\left(\tfrac{p^2 d^{s+1} s^p (s+p)}{\varepsilon^3} \log^2(pn)\right)$ rows, so that with high probability, for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax - b\|_p \le \|TAx - Tb\|_p \le (1 + \varepsilon)\|Ax - b\|_p,$$
and then to return a vector $\hat{x} \in \mathbb{R}^d$ such that $\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$.

Surprisingly, our methods even yield practical algorithms for general matrices with absolutely no structure.
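To make Definition 1.3 concrete, the sketch below numerically checks the identity underlying the $O(d^p)$ bound for unstructured rows: for any row a and integer p, $\langle a, x\rangle^p = \langle a^{\otimes p}, x^{\otimes p}\rangle$, so $|\langle a, x\rangle|^p$ is a linear function of the $d^p$ entries of the p-fold tensor power of x. This is our own minimal NumPy illustration, with a hypothetical helper name.

```python
import numpy as np

def tensor_power(v, p):
    """p-fold Kronecker (tensor) product of a vector, flattened to length len(v)**p."""
    out = np.array([1.0])
    for _ in range(p):
        out = np.kron(out, v)
    return out

rng = np.random.default_rng(0)
d, p = 4, 3
a = rng.standard_normal(d)   # one row of a general matrix A
x = rng.standard_normal(d)   # candidate regression solution

lhs = np.dot(a, x) ** p                               # <a, x>^p
rhs = np.dot(tensor_power(a, p), tensor_power(x, p))  # <a^{tensor p}, x^{tensor p}>, d^p terms
print(np.isclose(lhs, rhs))  # True: the loss is linear in the d^p tensor coordinates
```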
Although the rank of the ℓp regression problem for general matrices is $O(d^p)$, we still obtain the optimal sample complexity (see (Li et al., 2021) for a lower bound) of roughly $O(d^{p/2})$. Furthermore, an advantage of our approach is that we only need to perform ℓq Lewis weight sampling for $q \in [1, 4]$, and there are known efficient iterative methods for computing the ℓq Lewis weights for $q \in [1, 4]$; e.g., see Section 2.1 of this paper or Section 3 in (Cohen & Peng, 2015). By contrast, previous methods relied on computing general ℓp Lewis weights for p > 4, which, prior to the recent work of (Fazel et al., 2021), required solving a large convex program, e.g., through semidefinite programming; see Section 4 in (Cohen & Peng, 2015).

Finally, we experimentally validate our theory on synthetic data. In particular, we consider matrices and response vectors that are motivated by existing lower bounds for subsampling and sketching methods. For Vandermonde regression, our experiments demonstrate that the number of rows needed is polynomial in p, reinforcing how structured matrices outperform the worst-case bound. For unstructured matrix ℓp regression, we demonstrate that our ℓq Lewis weight subsampling scheme is effective and accurate.

1.2 OVERVIEW OF OUR TECHNIQUES

Our algorithmic contributions rely on two key observations, which we describe below. For the sake of presentation, we assume p is an integer in this overview, though we handle arbitrary real values of p in the subsequent algorithms and analyses.

Reduced rank of the ℓp regression problem on structured inputs. The first main ingredient is the simple yet powerful observation that the rank of the ℓp regression problem on structured inputs such as Vandermonde matrices $A \in \mathbb{R}^{n \times d}$ need not be the $O(d^p)$ rank of the ℓp regression problem on general matrices. For a general matrix, $\langle a_i, x\rangle^p$ for an integer p can be rewritten as a linear combination $\sum_j \alpha_j y_j$ of $O(d^p)$ terms, where each coefficient $\alpha_j$ is a p-wise product of entries of the row vector $a_i \in \mathbb{R}^d$ and, similarly, each $y_j$ is a p-wise product of entries of the vector $x \in \mathbb{R}^d$. However, when A is a Vandermonde matrix, the j-th entry of $a_i$ is simply $A_{i,2}^{j-1}$. Thus $\langle a_i, x\rangle^p$ can be written as a linear combination of $1, A_{i,2}, (A_{i,2})^2, \ldots, (A_{i,2})^{p(d-1)}$ with coefficients that depend only on x, and so we can express $\langle a_i, x\rangle^p$ as a linear combination of O(dp) variables rather than $O(d^p)$ variables.

We can similarly show that the rank of the ℓp regression problem on a rank-k matrix A is $O(k^p)$ by writing each row $a_i$ as a linear combination of basis vectors $v_1, \ldots, v_k$. Then, using the Hadamard product–Kronecker product mixed-product property, we can rewrite $\langle a_i, x\rangle^p$ as a linear combination of p-wise products of the variables $\langle v_1, x\rangle, \ldots, \langle v_k, x\rangle$, i.e., a linear combination of $O(k^p)$ variables rather than $O(d^p)$ variables. It follows that the rank of the ℓp regression problem on a matrix A whose rows have at most s non-zero entries is $O(d^s s^p)$, by noting that (1) there are $\binom{d}{s} = O(d^s)$ sparsity patterns and (2) for a fixed sparsity pattern, the rank of the induced matrix is at most s, which induces a linear combination of $O(s^p)$ variables for the ℓp regression problem. These decomposition techniques can also be generalized to show that the rank of the ℓp regression problem is low on matrices A such that A = K + V, A = K + S, or A = V + S, where K is a low-rank matrix, V is a Vandermonde matrix, and S is a sparse matrix.
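The Vandermonde case of this observation is easy to verify numerically: for a row $a_i = [1, t_i, \ldots, t_i^{d-1}]$, the quantity $\langle a_i, x\rangle^p$ is a univariate polynomial of degree $p(d-1)$ in $t_i$, so it is a linear function of the $p(d-1)+1 = O(dp)$ powers of $t_i$, with coefficients that depend only on x. A minimal NumPy check (our own illustration, not code from the paper):

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
d, p = 5, 4
t = rng.standard_normal()   # the node t_i defining the Vandermonde row [1, t, ..., t^{d-1}]
x = rng.standard_normal(d)  # regression coefficients

# <a_i, x> = sum_j x_j t^j is the polynomial with coefficient vector x evaluated at t,
# so <a_i, x>^p has coefficient vector polypow(x, p), of degree p*(d-1).
coeffs = P.polypow(x, p)                          # p*(d-1) + 1 = O(dp) coefficients, depend only on x
extended_row = t ** np.arange(p * (d - 1) + 1)    # extended Vandermonde row [1, t, ..., t^{p(d-1)}]

lhs = np.dot(t ** np.arange(d), x) ** p
rhs = np.dot(extended_row, coeffs)
print(np.isclose(lhs, rhs))   # True: the loss depends on only O(dp) "variables"
```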
However, we remark that although the rank of the ℓp regression problem on structured inputs can be low, this by itself does not yield an algorithm for ℓp regression. This is because the loss function is $|\langle a_i, x\rangle - b_i|^p$ rather than $\langle a_i, x\rangle^p$, and there can be n possible different values of $b_i$ across all $i \in [n]$.

Rounding and truncating the measurement vector: a novel algorithmic technique. Thus, the second main ingredient is manipulating the measurement vector $b \in \mathbb{R}^n$ so that the loss function can utilize the low-rank property of the ℓp regression problem on structured inputs. A natural approach would be to round the entries of b, say to the nearest power of (1 + ε). Unfortunately, such a rounding approach would roughly preserve $\|b\|_p$ up to a multiplicative (1 + ε) factor, but it would not preserve $\|Ax - b\|_p$. For example, suppose $\langle a_i, x\rangle = b_i = N$ for some arbitrarily large value N. Then $|\langle a_i, x\rangle - b_i|^p = 0$, but if $b_i$ were rounded to $(1+\varepsilon)N$, we would have $|\langle a_i, x\rangle - b_i|^p = \varepsilon^p N^p$, which can be arbitrarily large. The lesson from this counterexample is that when $b_i$ is significantly larger than $\langle a_i, x\rangle - b_i$, any rounding technique can be arbitrarily bad, because $\|b\|_p$ can be significantly larger than $\mathrm{OPT} := \min_{x \in \mathbb{R}^d} \|Ax - b\|_p$. On the other hand, if $\|b\|_p$ is a constant factor approximation to OPT, then the previous counterexample cannot happen, because either (1) $b_i$ is large relative to $\|b\|_p$ and $\langle a_i, x\rangle$ cannot be too close to $b_i$, so that rounding $b_i$ will not significantly affect the difference $\langle a_i, x\rangle - b_i$, or (2) $b_i$ is small relative to $\|b\|_p$, and so any rounding of $b_i$ will not affect the contribution of the i-th row of A to the overall loss.

Thus our task is reduced to manipulating the input so that $\|b\|_p = O(\mathrm{OPT})$. To that end, we note that if $\|A\tilde{x} - b\|_p = O(\mathrm{OPT})$ for a vector $\tilde{x} \in \mathbb{R}^d$, then by a triangle inequality argument, the residual vector $b' = b - A\tilde{x} \in \mathbb{R}^n$ satisfies $\|b'\|_p = O(\mathrm{OPT})$. Namely, we show that to find such a vector $\tilde{x}$, it suffices by the triangle inequality to find a subspace embedding for A, i.e., a matrix $M \in \mathbb{R}^{m \times d}$ with $m \ll n$ such that
$$(1 - C)\|Ax\|_p \le \|Mx\|_p \le (1 + C)\|Ax\|_p$$
for some constant C ∈ (0, 1). Typically, such a matrix M can be found by sampling the rows of A according to their ℓp Lewis weights to generate a matrix with $m = O(d^{\max(1,p/2)})$ rows. However, using our observation that the rank of the ℓp regression problem on structured inputs can be low, we can sample O(dp) rows of A with probabilities proportional to their ℓq Lewis weights, where $q = p/2^r$ for an integer r such that $2^r \le p < 2^{r+1}$. Crucially, we can find such a vector $\tilde{x}$ without reading the coordinates of b, since we only need to read the rows of A to perform Lewis weight sampling in this phase. Thus we can efficiently find a residual vector b' such that $\|b'\|_p = O(\mathrm{OPT})$.

Partitioning the matrix into groups. We then round the entries of b' to obtain a vector b'' with at most $\ell := O(\tfrac{\log n}{\varepsilon})$ unique values, truncating to zero all entries whose magnitude is smaller than a $\tfrac{1}{\mathrm{poly}(n)}$ fraction of the largest entry. We now solve the ℓp regression problem $\min_{x \in \mathbb{R}^d} \|Ax - b''\|_p$ by partitioning the rows of A into ℓ groups $G_1, \ldots, G_\ell$ based on their corresponding values of b''. The main point is that all rows in each group $G_k$ have the same entry $t_k$ in b''. Thus we can again observe that $|\langle a_i, x\rangle - t_k|^p$ can be written as a linear combination of $O(p^2 d)$ variables, and therefore use ℓq Lewis weight sampling (for the same q) to reduce each group $G_k$ down to $O(\tfrac{p^2 d}{\varepsilon^2}\log d)$ rows.
Since there are $\ell = O(\tfrac{\log n}{\varepsilon})$ groups, roughly $O(\tfrac{p^2 d}{\varepsilon^3}\log^2 n)$ total rows have been sampled across all the groups. It follows from the decomposition of the ℓp loss function across each group that the matrix T formed by these rows is a coreset for the ℓp regression problem. Therefore, by solving the ℓp regression problem on T, which has significantly smaller dimension than the original input matrix A, we obtain a vector $\hat{x}$ with the desired property that
$$\|A\hat{x} - b\|_p \le (1 + O(\varepsilon)) \min_{x \in \mathbb{R}^d} \|Ax - b\|_p.$$

Practical ℓp regression for arbitrary matrices. We now describe a practical procedure for ℓp regression on general matrices that avoids the need for convex programming to approximate the ℓp Lewis weights for p > 4. We first pick an integer r such that $p/2^r \in [2, 4)$. By the same structural argument as in Lemma 1.4, we can tensor each row $a_i$ of A with itself $2^r$ times, thus obtaining an extended matrix of size $n \times d'$, where $d' = d^{2^r}$, independent of the entries in b. We then perform $\ell_{p/2^r}$ Lewis weight sampling on the extended matrix, using the iterative method in Figure 1 since $p/2^r < 4$, rather than solving a convex program as required for ℓp Lewis weight sampling with p > 4. Since $p/2^r \in [2, 4)$, it follows from Theorem 2.3 that Lewis weight sampling requires roughly $(d')^{p/2^{r+1}}$ rows. Thus we obtain a matrix with $O\big((d^{2^r})^{p/2^{r+1}}\big) = O(d^{p/2})$ rows, from which we can compute a residual vector b' such that $\|b'\|_p = O(\mathrm{OPT})$, where $\mathrm{OPT} := \min_{x \in \mathbb{R}^d}\|Ax - b\|_p$. We then round and truncate the entries of b' using the subroutine Round_Trunc to obtain a vector b'', which allows us to partition the rows of A and the entries of b'' into $O(\tfrac{\log n}{\varepsilon})$ groups. Due to the constant value of b'' within each group, we can again perform $\ell_{p/2^r}$ Lewis weight sampling on an extended matrix using an iterative method, and finally solve the ℓp regression problem on the rows that have been sampled. We give our algorithm in full in Algorithm 3.

2 ℓp REGRESSION FOR VANDERMONDE MATRICES

2.1 PRELIMINARIES ON LEWIS WEIGHT SAMPLING

There are a number of known sampling distributions for dimensionality reduction for the ℓp regression problem, such as the ℓp leverage scores, e.g., (Cormode et al., 2018; Dasgupta et al., 2008), and the ℓp sensitivities, e.g., (Clarkson et al., 2019; Braverman et al., 2020; 2021; Musco et al., 2021); in this paper we focus on the ℓp Lewis weights, e.g., (Cohen & Peng, 2015; Durfee et al., 2018; Chen & Derezinski, 2021; Parulekar et al., 2021).

Definition 2.1 (ℓp Lewis Weights, (Cohen & Peng, 2015)). Given a matrix $A \in \mathbb{R}^{n \times d}$ and p ≥ 1, the ℓp Lewis weights $w_1(A), \ldots, w_n(A)$ are the unique quantities that satisfy
$$(w_i(A))^{2/p} = a_i^\top (A^\top W^{1 - 2/p} A)^{-1} a_i$$
for all $i \in [n]$, where $W \in \mathbb{R}^{n \times n}$ is the diagonal matrix with $W_{i,i} = w_i(A)$ for all $i \in [n]$.

Definition 2.2 (Lewis Weight Sampling). For an input matrix $A \in \mathbb{R}^{n \times d}$ and m samples, let the sampling matrix $S \in \mathbb{R}^{m \times n}$ be generated by independently setting each row of S to be the i-th standard basis vector multiplied by $\left(\tfrac{d}{m\, w_i(A)}\right)^{1/p}$, with probability $\tfrac{w_i(A)}{d}$.

Theorem 2.3 (ℓp Subspace Embedding from Lewis Weight Sampling, (Cohen & Peng, 2015)). For any $A \in \mathbb{R}^{n \times d}$, let the matrix $S \in \mathbb{R}^{m \times n}$ be generated from Lewis weight sampling with $m = O\!\left(\tfrac{d^{\max(1,p/2)} \log(d/\delta)\log(1/\varepsilon)}{\varepsilon^\gamma}\right)$, where γ = 2 for p ∈ [1, 2] and γ = 5 for p > 2. Then with probability at least 1 − δ, we have that simultaneously for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax\|_p \le \|SAx\|_p \le (1 + \varepsilon)\|Ax\|_p.$$

Since the Lewis weights are implicitly defined, it may not be clear how to compute them exactly.
In fact, it suffices to compute constant factor approximations to the Lewis weights. (Cohen & Peng, 2015) show that for p < 4, there exists a simple iterative approach to compute a constant factor approximation to the ℓp Lewis weights in input sparsity time, which we present in Figure 1.

w = Lewis_Iterate(A, p, β, w)
(1) For i = 1, . . . , n:
  (a) Let $\tau_i$ be a constant factor approximation to $a_i^\top (A^\top W^{1-2/p} A)^{-1} a_i$
  (b) $w_i \leftarrow (\tau_i)^{p/2}$
(2) Return w

w = Approx_Lewis_Weights(A, p, β, T)
(1) Initialize $w_i = 1$ for all $i \in [n]$.
(2) For t = 1, . . . , T:
  (a) $w \leftarrow$ Lewis_Iterate(A, p, β, w)
(3) Return w

Fig. 1: Iterative algorithm for approximate ℓp Lewis weights with p < 4.

At a high level, the correctness of Figure 1 follows from Banach's fixed point theorem and the fact that the subroutine Lewis_Iterate is a contraction mapping, because |2/p − 1| < 1 for p < 4 (Cohen & Peng, 2015). However, for p ≥ 4, Figure 1 no longer works. Instead, prior to the recent work of (Fazel et al., 2021), approximating ℓp Lewis weights for p > 4 seemed to require solving the convex program
$$Q = \arg\max_{M} \det(M), \quad \text{subject to } \sum_i (a_i^\top M a_i)^{p/2} \le d,\; M \succeq 0,$$
and setting $w_i = (a_i^\top Q a_i)^{p/2}$. Unfortunately, this is often infeasible in practice, and we could not obtain empirical results using it. Therefore, a nice advantage of our algorithms, both for structured and unstructured matrices, is that we only use ℓq Lewis weight sampling for $q \in [1, 4)$, even if p ≥ 4, whereas previous algorithms required using ℓp Lewis weight sampling for p > 4.

2.2 ALGORITHM AND ANALYSIS

We first describe the general framework of our algorithm for efficient ℓp regression: given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, the goal is to approximately compute $\mathrm{OPT} := \min_{x \in \mathbb{R}^d}\|Ax - b\|_p$. Recall that in order to apply our structural results, we first require a measurement vector with a small number of distinct entries. We obtain it by first finding a constant factor approximation, i.e., $\tilde{x} \in \mathbb{R}^d$ such that $\|A\tilde{x} - b\|_p \le O(\mathrm{OPT})$. We could ℓp Lewis weight sample the rows of A to do this, but the time to solve the subsequent polynomial regression problem would have an exponential dependence on p, due to Theorem 2.3. Instead, we use our structural properties to implicitly create a matrix M from A with fewer than $d^p$ columns and then $\ell_{p/2^r}$ Lewis weight sample the rows of A, where r is the integer that satisfies $2^r \le p < 2^{r+1}$; we then solve the ℓp regression problem on the sampled rows of A and entries of b to obtain $\tilde{x}$. We set b' to be the residual vector $b - A\tilde{x}$, so that $\|b'\|_p = O(\mathrm{OPT})$. We then use the procedure Round_Trunc, i.e., Algorithm 1, which sets the entries of b' that are smallest in magnitude to zero and rounds the remaining entries of b' to the nearest power of (1 + ε).

Algorithm 1 Round_Trunc: Round and truncate coordinates of input vector b
Input: Vector $b \in \mathbb{R}^n$, accuracy parameter ε > 0
Output: Vector $x \in \mathbb{R}^n$ that is a rounded and truncated version of b
1: $M \leftarrow \max_{i \in [n]} |b_i|$
2: for i = 1 to i = n do
3:   $p_i \leftarrow \lceil \log_{1+\varepsilon} |b_i| \rceil$
4:   $x_i \leftarrow (1 + \varepsilon)^{p_i} \cdot \mathrm{sign}(b_i)$
5:   if $|x_i| \le \frac{M}{n^5}$ then
6:     $x_i \leftarrow 0$
7: return x

Since $\|b'\|_p = O(\mathrm{OPT})$, it then follows by the triangle inequality that it suffices to approximately solve the ℓp regression problem on the vector b'' instead. We partition the rows of A into groups $G_1, \ldots, G_\ell$ based on the values of b'' and perform $\ell_{p/2^r}$ Lewis weight sampling again on an implicit matrix that we create from the rows of each group. Here we leverage our theory about the rank of the ℓp regression problem for structured inputs; a minimal implementation sketch of the Lewis weight iteration from Figure 1 appears below.
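The following NumPy sketch is our own illustration of the iterative scheme in Figure 1, not the implementation used in the paper: it computes the quantities $a_i^\top(A^\top W^{1-2/p}A)^{-1}a_i$ exactly rather than via constant factor approximations, runs a fixed number of iterations, and then samples rows with probabilities proportional to the resulting weights.

```python
import numpy as np

def approx_lewis_weights(A, p, T=30):
    """Fixed-point iteration for approximate l_p Lewis weights, p < 4 (cf. Figure 1).

    Each step sets w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}.
    """
    n, d = A.shape
    w = np.ones(n)
    for _ in range(T):
        W_pow = w ** (1.0 - 2.0 / p)                       # diagonal of W^{1-2/p}
        G = A.T @ (W_pow[:, None] * A)                     # A^T W^{1-2/p} A, a d x d matrix
        tau = np.einsum('ij,ij->i', A @ np.linalg.inv(G), A)
        w = np.maximum(tau, 1e-300) ** (p / 2.0)           # guard against exact zeros
    return w

def lewis_weight_sample(A, b, p, m, seed=0):
    """Sample and rescale m rows of (A, b) with probabilities proportional to the weights."""
    rng = np.random.default_rng(seed)
    w = approx_lewis_weights(A, p)
    prob = w / w.sum()
    idx = rng.choice(len(prob), size=m, replace=True, p=prob)
    scale = (1.0 / (m * prob[idx])) ** (1.0 / p)           # rescaling for the l_p norm
    return scale[:, None] * A[idx], scale * b[idx]
```

In practice one would replace the exact matrix inverse with a constant factor approximation of the generalized leverage scores (e.g., via sketching), which is what makes the iteration run in input sparsity time.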
It then suffices to solve the ℓp regression problem on the matrix formed by the sampled rows across all the groups.

Algorithm 2 Faster regression for Vandermonde matrices
Input: Vandermonde matrix $A \in \mathbb{R}^{n \times d}$, measurement vector b, accuracy parameter ε > 0
Output: $\hat{x} \in \mathbb{R}^d$ with $\|A\hat{x} - b\|_p \le (1 + O(\varepsilon)) \min_{x \in \mathbb{R}^d} \|Ax - b\|_p$
1: $r \leftarrow \lfloor \log_2 p \rfloor$, so that $2^r \le p < 2^{r+1}$.
2: Extend A to a Vandermonde matrix M with dimensions $n \times d'$, where $d' = 2^r(d - 1) + 1$.
3: Use $\ell_{p/2^r}$ Lewis weight sampling on M to find a set S of $O(d' \log d')$ indices in $[n] = \{1, 2, \ldots, n\}$, and corresponding rescaling factors.
4: Let A' be the corresponding submatrix of A with indices in S, scaled accordingly.
5: Compute $\tilde{x}$, a 5-approximate minimizer of the subsampled problem formed by A' and the correspondingly sampled and rescaled entries of b.
6: $b' \leftarrow b - A\tilde{x}$, $b'' \leftarrow$ Round_Trunc(b'), $\ell \leftarrow O(\frac{\log n}{\varepsilon})$.
7: Partition the rows of A into groups $G_1, \ldots, G_\ell$, each containing all rows with the same value of b''.
8: Let $G'_k$ be the submatrix of $G_k$ extended to $d'' = 2^{2r}(d - 1) + 1$ columns.
9: Let $t_k$ be the coordinate of b'' corresponding to $G_k$ for each $k \in [\ell]$.
10: Use $\ell_{p/2^r}$ Lewis weight sampling on $[G'_k; t_k]$ to find a set $S'_k$ of $O(\frac{d''}{\varepsilon^2}\log d'')$ indices in [n] and rescaling factors.
11: Let $T_k$ be the corresponding sampling and rescaling matrix for $S'_k$.
12: $T \leftarrow [T_1; \ldots; T_\ell]$
13: Compute $\hat{x}$, a $(1 + \varepsilon)$-approximate minimizer of $\min_{x \in \mathbb{R}^d} \|TAx - Tb''\|_p$.
14: return $\hat{x}$

We show in the supplementary material that Algorithm 2 satisfies the guarantees of Theorem 1.1. Moreover, because Theorem 1.1 has polynomial rather than exponential dependence on p, we obtain the first known sublinear size coreset for the important problem of ℓ∞ regression. We use the following structural property.

Lemma 2.4. Let $x \in \mathbb{R}^n$ and $p = \Omega(\frac{\log n}{\varepsilon})$. Then $\|x\|_\infty \le \|x\|_p \le (1 + \varepsilon)\|x\|_\infty$.

Lemma 2.4 implies that to solve ℓ∞ regression, we can instead solve ℓp regression for $p = \Theta(\frac{\log n}{\varepsilon})$. Then Theorem 1.2 follows from the fact that even for $p = \Omega(\frac{\log n}{\varepsilon})$, the matrix T in Algorithm 2 satisfies $(1 - \varepsilon)\|TAx - Tb\|_p \le \|Ax - b\|_p \le (1 + \varepsilon)\|TAx - Tb\|_p$ for all $x \in \mathbb{R}^d$. Hence, we can solve the ℓp regression problem on the smaller matrix T to solve the ℓ∞ regression on A. The results of Theorem 1.1 can be further extended to matrices with block Vandermonde structure.

Corollary 2.5. Given ε ∈ (0, 1) and p ≥ 1, $A \in \mathbb{R}^{n \times dq}$, and $b \in \mathbb{R}^n$, suppose $A = [A_1 | \ldots | A_q]$ for Vandermonde matrices $A_1, \ldots, A_q \in \mathbb{R}^{n \times d}$. Then there exists an algorithm that, with high probability, returns a vector $\hat{x} \in \mathbb{R}^{dq}$ such that
$$\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x \in \mathbb{R}^{dq}}\|Ax - b\|_p,$$
using $O\!\left(T(A)(dp)^{q-1}\log n + \mathrm{poly}((dp)^q \log n, 1/\varepsilon)\right)$ time, where T(A) is the runtime of multiplying the matrix A by an arbitrary vector. For q = 2, this can be further optimized to $O\!\left(n d^{\omega_2/2 - 1} + \mathrm{poly}((dp)^2, \log n, 1/\varepsilon)\right)$ time, where $O(n^{\omega_2})$ is the time to multiply an $n \times n$ matrix with an $n \times n^2$ matrix, so that $\omega_2 \in [3, 4]$.

We obtain similar algorithms for noisy low-rank matrices and noisy Vandermonde matrices, i.e., Theorem 1.5 and Theorem 1.6. We defer the presentation and justification of these results to the supplementary material.

Algorithm 3 Faster ℓp regression for general matrices
Input: Matrix $A \in \mathbb{R}^{n \times d}$, measurement vector b, accuracy parameter ε > 0, parameter p ≥ 4
Output: $\hat{x} \in \mathbb{R}^d$ with $\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x \in \mathbb{R}^d}\|Ax - b\|_p$
1: $r \leftarrow \lfloor \log_2 p \rfloor - 1$, so that $2^{r+1} \le p < 2^{r+2}$.
2: Extend A to a matrix M with dimension $n \times O(d^{2^r})$ so that each row $M_i$ is the $2^r$-fold tensor product of $a_i$, reshaped into a length-$d^{2^r}$ vector.
3: Use $\ell_{p/2^r}$ Lewis weight sampling on M to find a set S of $O(d^{p/2}\log d)$ indices in [n] and rescaling factors.
4: Let A' be the corresponding submatrix of A with indices in S, scaled accordingly.
5: Compute $\tilde{x}$, a 5-approximate minimizer of the subsampled problem formed by A' and the correspondingly sampled and rescaled entries of b.
6: $b' \leftarrow b - A\tilde{x}$, $b'' \leftarrow$ Round_Trunc(b'), $\ell \leftarrow O(\frac{\log n}{\varepsilon})$.
7: Partition the rows of A into groups $G_1, \ldots, G_\ell$, each containing all rows with the same value of b''.
8: Let $G'_k$ be the corresponding submatrix and $t_k$ be the coordinate of b'' corresponding to $G_k$ for each $k \in [\ell]$.
9: Use $\ell_{p/2^r}$ Lewis weight sampling on $[G'_k; t_k]$ to find a set $S'_k$ of $O(\frac{1}{\varepsilon^5} d^{p/2}\log d)$ indices in [n] and rescaling factors.
10: Let $T_k$ be the corresponding sampling and rescaling matrix for $S'_k$.
11: $T \leftarrow [T_1; \ldots; T_\ell]$
12: Compute $\hat{x}$, a $(1 + \varepsilon)$-approximate minimizer of $\min_{x \in \mathbb{R}^d}\|TAx - Tb''\|_p$.
13: return $\hat{x}$

Finally, we describe in Algorithm 3 a practical approach for ℓp regression for arbitrary matrices without requiring any structural assumptions.

Theorem 2.6. Algorithm 3 outputs a vector $\hat{x}$ such that $\|A\hat{x} - b\|_p \le (1 + \varepsilon)\min_{x\in\mathbb{R}^d}\|Ax - b\|_p$. The runtime of Algorithm 3 is $T(n, d^{p/2}) + \mathrm{poly}(d^{p/2}, \log n, \frac{1}{\varepsilon})$, where $T(n, d^{p/2})$ is the time to multiply a matrix of size $n \times d^{p/2}$ by a vector of length $d^{p/2}$.

Theorem 2.6 achieves the same optimal sample complexity of roughly $d^{p/2}$ as previous ℓp Lewis weight algorithms. However, the main advantage of Algorithm 3 is that it performs ℓq Lewis weight sampling for some $q \in [2, 4)$, which is quite efficient because we can use an iterative method rather than solving a convex program.

[Figure 2: two panels showing relative error ($\varepsilon_{\text{empirical}}$) for $\ell_{p/2^r}$ Lewis weight sampling; (a) Relative Error versus the norm parameter p, (b) Relative Error versus the subsample complexity m.]
Fig. 2: Comparison of relative error ($\varepsilon_{\text{empirical}}$) versus the ℓp parameter p and the sample complexity m on Vandermonde data. We ran both experiments 30 times and plot the median, 25th quartile, and 75th quartile for each value of p and m.

3 EMPIRICAL VERIFICATION

We provide empirical evidence to validate our core statistical claim about Vandermonde regression: that the relative error ε achieved by $\ell_{p/2^r}$ Lewis weight subsampling is polynomial in p, and not exponential in p. More precisely, we compute the gap between the error achieved by exact ℓp Vandermonde regression, with minimizer $x^*$, and the error achieved by the subsampled regression, with solution $\hat{x}$:
$$\varepsilon_{\text{empirical}} = \frac{\|V\hat{x} - b\|_p - \|Vx^* - b\|_p}{\|Vx^* - b\|_p}.$$
Our theory tells us that $(\varepsilon_{\text{empirical}})^3 \le O\!\left(\frac{dp^2}{m}\right)$, where m is the total number of subsampled rows. The prior work on unstructured matrices instead suggests $(\varepsilon_{\text{empirical}})^c \le O\!\left(\frac{d^{O(p)}}{m}\right)$ (Shi & Woodruff, 2019). So, to visually distinguish these two settings, we look at the logarithm of both sides:
$$\log(\varepsilon_{\text{empirical}}) \le O\!\left(\ln(p) + \ln\!\big(\tfrac{d}{m}\big)\right) \quad\text{or}\quad \log(\varepsilon_{\text{empirical}}) \le O\!\left(p\ln(d) + \ln\!\big(\tfrac{1}{m}\big)\right).$$
In particular, our work suggests a logarithmic dependence on p, while the prior work suggests a linear dependence on p. To validate our theory, we plot $\log(\varepsilon_{\text{empirical}})$ versus p and m on synthetic data. Specifically, we generate n = 25,000 i.i.d. N(0, 1) time samples to form a Vandermonde matrix V with d = 20 columns, then compute the polynomial $q(t) = t^{10}$ at each time sample, add $N(0, 10^{10})$ additive noise, and save the corresponding values in b. We then compute $\varepsilon_{\text{empirical}}$ for this ℓp regression problem. Notably, in order to compute $\hat{x}$, we omit the rounding procedure in our code, since the rounding is designed for worst-case inputs. Instead, we simply compute $\hat{x}$ by solving $\min_x \|\hat{V}x - \hat{b}\|_p$, where $\hat{V}$ and $\hat{b}$ are computed by sampling and rescaling V and b with the $\ell_{p/2^r}$ Lewis weights. Figure 2 shows the result of these experiments, which were run in Julia 1.6.1, on Windows 10 with an Intel i7-7700K CPU and 16GB RAM. In Figure 2a, we fix m = 1000 and vary p ∈ [2, 25]; a sketch of this data-generation and subsampling setup is given below.
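The following is a minimal sketch (ours, not the authors' Julia code) of the synthetic setup just described, at a reduced scale so that it runs quickly: it builds a Vandermonde matrix from Gaussian time samples, sets b to a noisy degree-10 monomial, subsamples rows, solves the subsampled ℓp problem with a generic optimizer, and reports $\varepsilon_{\text{empirical}}$. For brevity it uses uniform row sampling; swapping in Lewis weight probabilities follows the earlier sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, p, m = 2000, 20, 6, 400           # reduced n for a quick demo (the paper uses n = 25,000)

t = rng.standard_normal(n)
V = np.vander(t, d, increasing=True)    # Vandermonde matrix, rows [1, t_i, ..., t_i^{d-1}]
b = t ** 10 + rng.normal(scale=1e5, size=n)   # q(t) = t^10 plus noise of variance 1e10

def lp_solve(A, y, p):
    """Minimize ||Ax - y||_p with a generic smooth optimizer, warm-started at least squares."""
    x0 = np.linalg.lstsq(A, y, rcond=None)[0]
    obj = lambda x: np.linalg.norm(A @ x - y, ord=p)
    return minimize(obj, x0, method="L-BFGS-B").x

x_star = lp_solve(V, b, p)                      # "exact" full-data l_p solution
idx = rng.choice(n, size=m, replace=False)      # uniform subsample baseline
x_hat = lp_solve(V[idx], b[idx], p)

err = lambda x: np.linalg.norm(V @ x - b, ord=p)
eps_empirical = (err(x_hat) - err(x_star)) / err(x_star)
print(f"empirical relative error: {eps_empirical:.3f}")
```

With Lewis weight probabilities in place of uniform sampling, the measured error for the same m drops markedly, which is the comparison shown in Figure 2.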
The trendline of Lewis Weight sampling clearly better fits a logarithmic model, as opposed to a linear model. This reinforces our analysis by showing that the dependence on p is notably sub-exponential, beating the known bounds for ℓp subsampling on unstructured matrices. As a benchmark, we compare our Lewis Weight sampling method to uniform sampling. The noise in b is large enough that most rows of A have little information about the underlying polynomial q(t). Lewis weight sampling takes avoids these rows, while uniform sampling does not, explaining why uniform sampling is much weaker in Figure 2a. Further, in Figure 2b, we fix p = 6 and vary m [100, 3000], showing that Lewis Weight sampling outperforms uniform sampling across both m and p. In Appendix C, we demonstrate similar results for unstructured matrix regression, validating the analysis of Theorem 2.6. Published as a conference paper at ICLR 2022 ACKNOWLEDGEMENTS Cameron Musco was supported by NSF grants 2046235 and 1763618, and an Adobe Research grant. David P. Woodruff and Samson Zhou were supported by National Institute of Health grant 5401 HG 10798-2 and a Simons Investigator Award. Christopher Musco and Raphael Meyer were supported by NSF grant 2045590 and DOE Award DE-SC0022266. Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for lpnorm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1405 1424, 2019a. 1, 15 Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm for p-norm linear regression. In Advances in Neural Information Processing Systems 32, Neur IPS, pp. 14166 14177, 2019b. 1, 15 Haim Avron, Vikas Sindhwani, and David P. Woodruff. Sketching structured matrices for faster nonlinear regression. In 27th Annual Conference on Neural Information Processing Systems. Proceedings, pp. 2994 3002, 2013. 3 Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. A universal sampling method for reconstructing signals with simple fourier transforms. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC, pp. 1051 1063, 2019. 2 Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P. Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 517 528, 2020. 5, 22 Vladimir Braverman, Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, and Samson Zhou. Adversarial robustness of streaming algorithms through importance sampling. Co RR, abs/2106.14952, 2021. 5, 22 Samprit Chatterjee and Ali S Hadi. Regression analysis by example, volume 607. John Wiley and Sons, 2006. 1 Xue Chen and Michal Derezinski. Query complexity of least absolute deviation regression via robust uniform convergence. In Conference on Learning Theory, COLT, pp. 1144 1179, 2021. 5, 22 Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, and Supratim Shit. Streaming coresets for symmetric tensor factorization. In Proceedings of the 37th International Conference on Machine Learning, ICML, pp. 1855 1865, 2020. 22 Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC, pp. 81 90, 2013. 22 Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. 
The fast cauchy transform and faster robust linear regression. SIAM J. Comput., 45(3):763 810, 2016. 22 Kenneth L. Clarkson, Ruosong Wang, and David P. Woodruff. Dimensionality reduction for tukey regression. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 1262 1271, 2019. 5, 22 Michael B. Cohen and Richard Peng. lp row sampling by lewis weights. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC, pp. 183 192, 2015. 2, 3, 5, 6, 15, 21, 22 Graham Cormode, Charlie Dickens, and David P. Woodruff. Leveraging well-conditioned bases: Streaming and distributed summaries in minkowski p-norms. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 1048 1056, 2018. 5, 22 Published as a conference paper at ICLR 2022 Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for lp regression. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 932 941, 2008. 5, 22 David Durfee, Kevin A. Lai, and Saurabh Sawlani. l1 regression using lewis weights preconditioning and stochastic gradient descent. In Conference On Learning Theory, COLT, pp. 1626 1656, 2018. 5, 22 Yonina C. Eldar, Jerry Li, Cameron Musco, and Christopher Musco. Sample efficient toeplitz covariance estimation. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 378 397, 2020. 2 Maryam Fazel, Yin Tat Lee, Swati Padmanabhan, and Aaron Sidford. Computing lewis weights to high precision. Co RR, abs/2110.15563, 2021. 3, 6 Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The elements of statistical learning. Springer series in statistics New York, 2001. 1 JD Gergonne. The application of the method of least squares to the interpolation of sequences. Historia Mathematica, 1(4):439 447, 1974. 2 I Gohberg and V Olshevsky. Complexity of multiplication with vectors for structured matrices. Linear Algebra and Its Applications, 202:163 192, 1994. 15 Jerrad Hampton and Alireza Doostan. Coherence motivated sampling and convergence analysis of least squares polynomial chaos regression. Computer Methods in Applied Mechanics and Engineering, 290:73 97, 2015. 20 Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777 1805, 2008. 2, 20 Daniel Kane, Sushrut Karmalkar, and Eric Price. Robust polynomial regression up to the information theoretic limit. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 391 402, 2017. 20, 21 Kiran S. Kedlaya and Christopher Umans. Fast polynomial factorization and modular composition. SIAM J. Comput., 40(6):1767 1802, 2011. 17 Yi Li, Ruosong Wang, and David P. Woodruff. Tight bounds for the subspace sketch problem with applications. SIAM J. Comput., 50(4):1287 1335, 2021. 2, 3, 22 Ian B Mac Neill. Properties of sequences of partial sums of polynomial regression residuals with applications to tests for change of regression at unknown times. The Annals of Statistics, pp. 422 433, 1978. 2 Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC, pp. 91 100. ACM, 2013. 22 Raphael Meyer, Cameron Musco, Christopher Musco, David Woodruff, and Samson Zhou. Cheybshev sampling is universal for lp polynomial regression, 2021. 
20, 21, 22 Cameron Musco, Christopher Musco, David P. Woodruff, and Taisuke Yasuda. Active sampling for linear regression beyond the l2 norm. Co RR, abs/2111.04888, 2021. 5, 21, 22 Jelani Nelson and Huy L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pp. 117 126, 2013. 22 Michael N usken and Martin Ziegler. Fast multipoint evaluation of bivariate polynomials. In Algorithms - ESA, 12th Annual European Symposium, Proceedings, pp. 544 555, 2004. 17 Aditya Parulekar, Advait Parulekar, and Eric Price. l1 regression with lewis weights subsampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pp. 49:1 49:21, 2021. 5, 22 Published as a conference paper at ICLR 2022 Vaughan Pratt. Direct least-squares fitting of algebraic surfaces. ACM SIGGRAPH computer graphics, 21(4):145 152, 1987. 2, 20 Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher R e, and Atri Rudra. A two-pronged progress in structured dense matrix vector multiplication. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1060 1079, 2018. 17 Tam as Sarl os. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), Proceedings, pp. 143 152. IEEE Computer Society, 2006. 22 Xiaofei Shi and David P. Woodruff. Sublinear time numerical linear algebra for structured matrices. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4918 4925, 2019. 2, 3, 9, 13 Christian Sohler and David P. Woodruff. Subspace embeddings for the l1-norm with applications. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC, pp. 755 764, 2011. 22 Ruosong Wang and David P. Woodruff. Tight bounds for lp oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1825 1843, 2019. 22 David P. Woodruff and Qin Zhang. Subspace embeddings and lp-regression using exponential random variables. In COLT 2013 - The 26th Annual Conference on Learning Theory, pp. 546 567, 2013. 22 BROADER IMPACT This work is focused on improving the runtime of algorithms for ℓp regression on structured inputs, and can allow users to more efficiently solve regression problems on large datasets. The problem we study is abstracted away from specific ethical and unethical applications of ℓp regression. So, with the assumption that the research and data science communities will focus on ethical applications while avoiding unethical ones, we believe that the benefits of our algorithmic results outweigh the negative risks. A MISSING PROOFS FROM SECTION 2 In this section, we give the missing proofs from Section 2. A.1 ℓp REGESSION ON VANDERMONDE MATRICES We first prove a simple statement that shows a good solution to the ℓp regression problem on a subspace embedding SA is also a good solution to the original input matrix A. Lemma A.1. Let S be a sampling and rescaling matrix such that 11 12 SAx p Ax p 13 12 SAx p, E SAx p p = Ax p p for all x Rd. Let OPT = minx Rd Ax b p and let x Rd be any vector for which SAx Sb p 5OPT. Then with probability at least 0.79, Ax b p 12OPT. Proof. Let x be a minimizer of Ax b p so that OPT = Ax b p. We will prove the claim by contrapositive, so we first suppose that Ax b p 12OPT. Then by the triangle inequality, SAx Sb p SA(x x ) p SA Sb p. 
12 SAx p Ax p 13 12 SAx p for all x Rd, then SAx Sb p 11 12 A(x x ) p SA Sb p. Published as a conference paper at ICLR 2022 By the triangle inequality, SAx Sb p 11 12 ( Ax b p Ax b p) SA Sb p. Note that by Jensen s inequality, E[ SAx Sb p] OPT, so that by Markov s inequality, Pr [ SAx Sb p 5OPT] 1 Thus with probability at least 0.8, SAx Sb p 11 12 ( Ax b p Ax b p) 5OPT. Thus if Ax b p 12OPT, then SAx Sb p 11 12 (12OPT OPT) 5OPT > 5OPT, as desired. Before justifying the correctness of Algorithm 2, we first recall the following algorithm for efficient ℓp subspace embeddings. Theorem A.2. (Shi & Woodruff, 2019) Given ε (0, 1) and p [1, 2], a Vandermonde matrix A Rn d, and b Rd, let T(A) be the time it takes to perform matrix-vector multiplication, i.e., compute Av for an arbitrary v Rd. There exists an algorithm that uses O (T(A) log n + dq poly(1/ε)) time, where q = ω for p [1, 4) and q = p/2 + C for some fixed constant C > 0 for p > 4, and with high probability, returns x Rd such that A x b p (1 + ε) min x Rd Ax b p. We now justify the correctness of Algorithm 2. Lemma A.3. Given ε (0, 1) and p 1, a Vandermonde matrix A Rn d, and b Rd, then with high probability, Algorithm 2 returns a vector bx Rd such that Abx b p (1 + O (ε)) min x Rd Ax b p. Proof. Consider Algorithm 2 and let r be an integer so that 2r p < 2r+1. Note that X i [n] | ai, x |p = X i [n] |( ai, x )2r|p/2r. Since A is Vandermonde, then ( ai, x )2r = j=1 aj i,1xj 2r(d 1)+1 X j=1 aj 1 i,2 yj, where each yj is a fixed function of the coordinates of x. Notably, the fixed function is the same across all i [n]. Hence, the ℓ2r subspace embedding problem on an input Vandermonde matrix A Rn d can be reshaped as a constrained ℓ1 subspace embedding problem on an input Vandermonde matrix of size n (2r(d 1) + 1). Thus, for ℓp regression with p [2r, 2r+1), we have X i [n] | ai, x |p = X 2r(d 1)+1 X j=1 aj 1 i,2 yj which is a constrained ℓp/2r regression problem on an input Vandermonde matrix M of size n (2r(d 1) + 1). By Theorem A.2, we can use ℓp/2r Lewis weight sampling to find a matrix M such that 1 5 My p/2r p/2r M y p/2r p/2r 5 My p/2r Published as a conference paper at ICLR 2022 for all y R2r(d 1)+1 with high probability. Note that by the above argument, if we take the matrix A corresponding to the scaled rows of A that are sampled by M , then we also have 11 12 Ax p p A x p p 13 and thus 11 12 Ax p A x p 13 for all x Rd+1 with high probability. Thus by Lemma A.1, we can find a vector x such that A x b p 12 min x Rd Ax b p. We use the subroutine Round Trunc to create the vector b which is the vector with all entries of b rounded to the nearest power of (1 + ε), starting at the maximum entry of b in absolute value, and stopping after we are 1 poly(n) times that, and replacing all remaining entries with 0. By the triangle inequality, we have Ax b p Ax b p + b b p Ax b p + 12εOPT (1 + 12ε) Ax b p, for any x Rd+1. Note that b has discretized the values of b into ℓ= O log n ε possible values. We partition the rows of A into ℓgroups G1, . . . , Gℓ, based on the corresponding values of b . Suppose that for a group Gk, the corresponding values of b are all tk. Then we have Ax b p p = X i Gk | ai, x tk|p . Since A is Vandermonde, for each i Gk, ( ai, x tk)2r = j=1 aj 1 i,2 xj 22r(d 1)+1 X j=1 aj 1 i,2 tk,jyj, for some fixed values tk,1, tk,2, . . . that can be computed from tk, where again each yj can be a different function of the coordinates of x. In particular, there can be 2r choices for the exponent of tk. 
For a fixed choice of the exponent of tk, the exponent of ai,2 can range from 0 to 2r(d 1). Hence, the ℓ2r regression problem on a submatrix of an input Vandermonde matrix A Rn d with the same fixed measurement values, i.e., the corresponding coordinates of b are all the same, can be reshaped to a constrained ℓ1 regression problem on an input Vandermonde matrix with 22r(d 1)+1 columns times a (22r(d 1) + 1) (22r(d 1) + 1) diagonal matrix. Hence, by invoking Theorem A.2 to sample rows of M corresponding to their ℓp/2r Lewis weights in the submatrix induced by the rows of Gk, we obtain a sampling matrix Tk such that with high probability, (1 ε) Tk My Tkvk p/2r 2r(d 1)+1 X j=1 aj 1 i,2 tk,jyj (1+ε) Tk My Tkvk p/2r where vk corresponds to the vector b that is set to zero outside of coordinates whose values are tk. Note that by the above argument, then we also have (1 ε) Tk Ax Tkvk p p X i Gk | ai, x tk|p (1 + ε) Tk Ax Tkvk p p, with high probability. Summing over all k, we have k Tk Ax Tkvk p p X i Gk | ai, x tk|p (1 + ε) X k Tk Ax Tkvk p p, with high probability. Published as a conference paper at ICLR 2022 Let T be the sampling matrix so that T = T1 . . . Tℓ, so that P k Tk Ax Tkvk p p = TAx Tb p p. Observe that Ax b p p = P i Gk | ai, x tk|p. Hence, (1 ε) TAx Tb p p Ax b p p (1 + ε) TAx Tb p p and thus (1 ε) TAx Tb p Ax b p (1 + ε) TAx Tb p. Thus we can compute a vector bx Rd such that Abx b p (1 + O (ε)) min x Rd Ax b p. Before analyzing the time complexity of Algorithm 2, we first recall the following algorithms for Vandermonde matrix-vector multiplication and approximate ℓp regression. Theorem A.4 (Vandermonde Matrix-Vector Multiplication Runtime, e.g., Table 1 in (Gohberg & Olshevsky, 1994)). The runtime of computing Ax for a Vandermonde matrix A Rn d and a vector x Rd for d n is O n log2 n . Theorem A.5 (Approximate ℓp Regression Runtime). (Adil et al., 2019b) Given A Rn d and p [2, ), there exists an algorithm that makes O n log n ε calls to a linear system solver and computes a vector x such that A x b p (1 + ε) min x Rd Ax b p. We now analyze the runtime of Algorithm 2. Lemma A.6. Algorithm 2 runs in O n log3 n + d0.5+ω poly 1 ε, p, log n time. Proof. Observe that Algorithm 2 has three main bottlenecks for runtime. Since we only need to Lewis weight sample from the extended matrices, we do not need to explicitly form them, which would otherwise require Ω(ndp2) time just to list to entries. Hence the first bottleneck is performing the Lewis weight sampling procedure on the extended matrices. The second bottleneck is solving the ℓp regression problem on the final subsampled matrix. The only remaining procedure is rounding and truncating the coordinates of b Rn to form a vector b Rn using the procedure Round Trunc and then forming the groups G1, . . . , Gℓ, which clearly takes O (n) arithmetic operations combined. We thus analyze each of the three main runtime bottlenecks. First observe that the extended matrix M is a Vandermonde matrix with O dp2 columns. (Cohen & Peng, 2015) show that O (log n) matrix-vector multiplication operations can be done to compute approximate Lewis weights for the purposes of ℓp Lewis weight sampling. By Theorem A.4, each matrix-vector multiplication uses time O n log2 n . Hence computing the extended matrix M uses O n log3 n time. Similarly, the extended matrix for each group Gk is the product of a Vandermonde matrix with O dp2 columns and a diagonal matrix. 
Thus by Theorem A.4, the extended matrices for all the groups Gk can be formed using O n log3 n time in total. Since each Vandermonde matrix has O dp2 rows, then observe that each group Gk samples ε2 log d rows and there are ℓ= O 1 ε log n such groups k [ℓ]. Thus the resulting sub- sampled matrix has O dp2 ε3 log2 n rows for d n. To approximately solve the ℓp regression problem, Theorem A.5 notes that for p 2 and a subsampled matrix of size O dp2 ε3 log2 d , we require only O p d ε3/2 log n log d ε calls to a linear system solver. Moreover, on an iteration t of the ℓp regression algorithm of (Adil et al., 2019a) used in Theorem A.5, the linear system solves the equation xt (A RT A) 1A Rtb for a diagonal matrix RT . Each linear system solve can be done in dω poly 1 ε, p, log n time. Hence, the total time to approximately solve the ℓp regression problem is d0.5+ω poly 1 ε, p, log n . Therefore, the total runtime is O n log3 n + d0.5+ω poly 1 ε, p, log n . Published as a conference paper at ICLR 2022 Lemma 2.4. Let x Rn and p = Ω log n ε . Then x x p (1 + ε) x . Proof. For any vector x Rd, we have x = max i n |xi| i |xi|p !1/p n1/p max i n |xi|. Since (1 + ε)3/eps > e for all ε > 0, then (1 + ε)p > n for p = Ω log n ε . Therefore, n1/p max i n |xi| (1 + ε) max i n |xi| = (1 + ε) x . Corollary 2.5. Given ε (0, 1) and p 1, A Rn dq, and b Rd, suppose A = [A1| . . . |Aq] for Vandermonde matrices A1, . . . , Aq Rn d. Then there exists an algorithm that, with high probability, returns a vector bx Rd such that Abx b p (1 + ε) min x Rd Ax b p, using O T(A)(dp)q 1 log n + poly((dp)q log n, 1/ε) time, where T(A) is the runtime of multiplying the matrix A by an arbitrary vector. For q = 2, this can be further optimized to O ndω2/2 1 + poly((dp)2, log n, 1/ε) time, where O (nω2) is the time to multiply an n n matrix with an n n2 matrix, so that ω2 [3, 4]. Proof. Recall that a key part in the proof of Theorem 1.1 was to first the ℓ2r regression on a Vandermonde matrix with dimension Rn d to ℓ1 regression on a Vandermonde matrix with dimension Rn (2r(d 1)+1), where 2r p < 2r+1. For a matrix A with block Vandermonde structure, we can similarly write ( ai, x )2r = j=1 aj 1 i,2+(k 1)dxj+(k 1)d (2r(d 1))q+1 X k [q],P pj,k=2r apj,k i,2+(k 1)dyj, where again each yj is a fixed function of the coordinates of x. Thus we can reshape the ℓ2r regression problem on a matrix A with dimension Rn dq with block Vandermonde structure to an ℓ1 regression problem on a matrix A with dimension Rn (2r(d 1))q+1. Moreover, we can further reshape A into the concatenation of (dp)q 1 Vandermonde matrices, where each Vandermonde matrix has columns that are geometrically growing in ai,2 but are multiplied by all (dp)q 1 products Qq 1 k=1 apk i,2+kd, where pk [dp]. We can now use matrix-vector multiplication on each of the (dp)q 1 Vandermonde matrices. Thus by Theorem A.2, we can ℓ1 Lewis weight sample from the rows of the reshaped A, using O T(A)(dp)q 1 log n + (dp)ωq time. We can similarly write ( ai, x tk )2r for each tk among the discretized values of the updated b vector as a sum of (2r(d 1))q + 1 terms that are all products of powers of the bases ai,2, ai,2+d, . . . and a variables yj, as in the proof of Theorem 1.1. Thus we can partition the ℓ2r regression problem into ℓ= O log n ε instances of a constrained ℓ1 regression problem on (dp)q 1 Vandermonde matrices, each with at most 2r(d 1))q + 1 columns. To approximately solve the ℓp regression problem, we can thus sample rows by their ℓp/2r Lewis weights, as in Theorem 1.1. 
Since there are up to (dp)q 1 Vandermonde matrices, each with at most 2r(d 1))q + 1 columns, then by Theorem A.2, the total time required is O T(A)(dp)q 1 log n + (dp)ωq poly(log n, 1/ε) . Published as a conference paper at ICLR 2022 We remark that for the special case of q = 2, (Sa et al., 2018) noted an efficient bivariate matrix multiplication algorithm of (N usken & Ziegler, 2004; Kedlaya & Umans, 2011). Theorem A.7. (N usken & Ziegler, 2004; Kedlaya & Umans, 2011; Sa et al., 2018) Given a q-variate polynomial f(X1, . . . , Xq) such that each variable has degree at most d 1 and N = dq distinct points x(i) = (x(i)1, . . . , x(i)q) for i [N], there exists an algorithm that uses O dω2(q 1)/2+1 time to output the vector (f(x(1)), . . . , f(x(N))), where O (nω2) is the time to multiply an n n matrix with an n n2 matrix, so that ω2 [3, 4]. The case of a matrix-vector product for q = 2 corresponds to the evaluation of d2 points in Theorem A.7. Thus we need to repeat the algorithm in Theorem A.7 a total of n d2 times to handle all n rows in the input matrix. Since each instance of the algorithm uses O dw2/2+1 time, the total time for the matrix-vector product is O ndω2/2 1 , rather than the na ıve O (nd) time (recall that ω2 [3, 4]). Corollary A.8. Given ε (0, 1) and p 1, A Rn 2d, and b Rd, suppose A = [A1|A2] for Vandermonde matrices A1, A2 Rn d. Then there exists an algorithm that with high probability returns a vector bx Rd such that Abx b p (1 + ε) min x Rd Ax b p, using O ndω2/2 1 + poly((dp)2, log n, 1/ε) , where O (nω2) is the time to multiply an n n matrix with an n n2 matrix. A.2 ℓp REGESSION ON OTHER STRUCTURED INPUT Algorithm 4 Faster regression for noisy low-rank matrices Input: Rank k matrix K Rn d, matrix S Rn d with at most s non-zero entries per row, such that A = K + S, measurement vector b, accuracy parameter ε > 0 Output: bx R with Abx b p (1 + O (ε)) minx Rd Ax b p 1: r log p 2r p < 2r+1. 2: Extend A to a matrix M with dimension n O ds(k + s)2r so that each entry Mi,j is the coefficient of the j-th term in the tensor decomposition of a 2r i . 3: Use ℓp/2r-Lewis weight sampling on M to find a set S of O (d log d ) indices in [n] and rescaling factors, for d = O ds(k + s)2r . 4: Let A be the corresponding submatrix of A with indices in S and scaled accordingly. 5: Compute x 5 minx Rd A x b p. 6: b b A x, b Round Trunc(b ), ℓ O log n 7: Partition the rows of A into groups G1, . . . , Gℓ, each containing all rows with the same value of b 8: Let Gk be the corresponding submatrix and tk be the coordinate of b corresponding to Gk for each k [ℓ]. 9: Use ℓp/2r-Lewis weight sampling on [Gk; tk] to find a set S k of O d ε2 log d indices in [n] and rescaling factors, where d = O pds(k + s)2r . 10: Let Tk be the corresponding sampling and rescaling matrix for S k. 11: T [T1; . . . ; Tk] 12: Compute bx (1 + ε) minx Rd TAx Tb p. 13: return bx Lemma A.9. Given ε (0, 1) and p 1, a matrix A Rn d such that A = K + S for a rank k matrix K and an s-sparse matrix S, and b Rd, there exists an algorithm that with high probability, returns a vector bx Rd such that Abx b p (1 + ε) min x Rd Ax b p. Published as a conference paper at ICLR 2022 Proof. The proof is similar to Lemma A.3. We once again let r be an integer so that 2r p < 2r+1 and observe that X i [n] | ai, x |p = X i [n] |( ai, x )2r|p/2r = X i [n] |( ki + si, x )2r|p/2r. Since K is a low-rank matrix, then for all ki, we can write j=1 αi,jvj, for a fixed set of basis vectors v1, . . . , vk Rd. 
Hence we have i [n] | ai, x |p = X j=1 αi,jvj, x Since S has sparsity s, then we can further write i [n] | ai, x |p = X j=1 βi,jeij + j=1 αi,jvj, x where e1, . . . , ed denote the elementary vectors. By the Hadamard Product-Kronecker Product mixed-product property, we have i [n] | ai, x |p = X (y1 . . . y2r) x (2r) p/2r , where each yk {αi,1v1, . . . , αi,kvk, βi,1ei1, . . . , βi,sei,s} for i [2r]. Thus for a fixed set of elementary vectors ei1, . . . , eis, there are (k + s)2r possible values for the tensor product (y1 . . . y2r). Since there are d s choices for the elementary vectors ei1, . . . , eis, then there are at most (Cds(k + s)2r) possible values for the tensor product for an absolute constant C > 0. Therefore, the ℓp subspace embedding problem on A Rn d can be reshaped as a constrained ℓp/2r subspace embedding problem on an input matrix of size n (Cds(k + s)2r). Hence for ℓp regression with p [2r, 2r+1), we have i [n] | ai, x |p = X 2r(d 1)+1 X which is a constrained ℓp/2r regression problem on a matrix M of size n (Cds(k + s)2r) whose entries can be determined from the decomposition of each row of A. Using ℓp/2r Lewis weight sampling, Theorem A.2 implies that we can find a matrix M such that 11 12 My p/2r p/2r M y p/2r for all y R(Cds(k+s)2r ) with high probability. By Lemma A.1, we can thus compute a vector x such that A x b p 12 min x Rd Ax b p. We again set b = b A x and define OPT = minx Rd Ax b p, so that b p = A x b p 12OPT. Let b = b A x and OPT = minx Rd Ax b p, so that b p = A x b p 12OPT. Published as a conference paper at ICLR 2022 Let b be the vector with all entries of b rounded to the nearest power of (1 + ε), starting at the maximum entry of b in absolute value, and stopping after we are 1 poly(n) of that and replacing all remaining entries with 0. By the triangle inequality, we have Ax b p Ax b p + b b p Ax b p + 12εOPT (1 + 12ε) Ax b p, for any x Rd+1. Since the coordinates of b can have ℓ= O log n ε possible distinct values, we can partition the rows of A into ℓgroups, G1, . . . , Gℓ, based on the corresponding values of b . Let tm be the corresponding value of b for all rows in a group Gm, so that Ax b p p = X i Gk | ai, x tm|p. By the above argument, we have for each i Gk, ( ai, x tm)2r = j=1 βi,jeij + j=1 αi,jvj, x 2r Cds(k+s)2r X j=1 Bi,jtm,jyj, where (1) Bi,j are entries of a matrix B with 2r Cds(k+s)2r columns that can be computed from A, (2) tm,1, tm,2, . . . are fixed values that can be computed from mk, and (3) each yj is a fixed function of the coordinates of x. Notably, B is the matrix formed by the concatenation of the coefficients of the decomposition of the α-fold tensor product of the row into the α-fold tensor products of the lowrank and sparse basis elements, for each α = 0, . . . , p. By comparison, the matrix M previously defined in this proof is only the decomposition for α = p. Hence, the ℓ2r regression problem on a submatrix of A Rn d the same coordinate of b , can be reshaped as a constrained ℓ1 regression problem on a matrix with 2r Cds(k + s)2r columns. The remainder of the proof follows from the same grouping argument as Theorem 1.1. We apply Theorem A.2 by sampling rows of B corresponding to their ℓp/2r Lewis weights in the submatrix induced by the rows of Gm, we obtain a matrix Tm such that with high probability, (1 ε) Tm By Tmvm p/2r 2r Cds(k+s)2r X j=1 Bi,jtm,jyj (1+ε) Tm By Tmvm p/2r where vm is the vector restricted to the coordinates of b that are equal to tm. 
Conditioning on the above inequality holding, it follows that
$$(1-\varepsilon)\,\|T_m A x - T_m v_m\|_p^p \;\le\; \sum_{i\in G_m} |\langle a_i, x\rangle - t_m|^p \;\le\; (1+\varepsilon)\,\|T_m A x - T_m v_m\|_p^p.$$
Therefore, summing over all $m \in [\ell]$, we have that with high probability,
$$(1-\varepsilon)\sum_{m\in[\ell]} \|T_m A x - T_m v_m\|_p^p \;\le\; \sum_{m\in[\ell]}\sum_{i\in G_m} |\langle a_i, x\rangle - t_m|^p \;\le\; (1+\varepsilon)\sum_{m\in[\ell]} \|T_m A x - T_m v_m\|_p^p.$$
For $T = [T_1; \ldots; T_\ell]$, we have $\sum_{m\in[\ell]} \|T_m A x - T_m v_m\|_p^p = \|TAx - Tb''\|_p^p$. Since $\|Ax - b''\|_p^p = \sum_{m\in[\ell]}\sum_{i\in G_m} |\langle a_i, x\rangle - t_m|^p$, it follows that
$$(1-\varepsilon)\,\|TAx - Tb''\|_p^p \;\le\; \|Ax - b''\|_p^p \;\le\; (1+\varepsilon)\,\|TAx - Tb''\|_p^p.$$
Therefore, for $p \ge 1$,
$$(1-\varepsilon)\,\|TAx - Tb''\|_p \;\le\; \|Ax - b''\|_p \;\le\; (1+\varepsilon)\,\|TAx - Tb''\|_p.$$
Because $\ell = O\big(\frac{\log n}{\varepsilon}\big)$ and $\sum_{m\in[\ell]} O(T(G_m)) = O(T(A))$, we can compute a vector $\widehat{x} \in \mathbb{R}^d$ such that
$$\|A\widehat{x} - b\|_p \le (1 + O(\varepsilon)) \min_{x\in\mathbb{R}^d} \|Ax - b\|_p.$$

Lemma A.10. Given the low-rank factorization of $K$, Algorithm 4 uses $n \cdot \mathrm{poly}\big(2^p, d^s, k^p, s^p, \frac{1}{\varepsilon}, \log n\big)$ time.

Proof. We analyze the runtime of Algorithm 4. First, note that we can compute the extended matrix in time $O\big(n(k+1)^p d^s s^p\big)$ in order to perform the $\ell_{p/2^r}$ Lewis weight sampling, where we recall that $r$ is the unique integer such that $2^r \le p < 2^{r+1}$. To perform Lewis weight sampling on the extended matrix, we require matrix-vector multiplication, which takes $O(nk)$ time for a rank-$k$ matrix and $O(ns)$ time for a matrix whose rows have at most $s$ non-zero entries. After the first round of Lewis weight sampling, we can round and truncate the coordinates of $b' \in \mathbb{R}^n$ to form a vector $b'' \in \mathbb{R}^n$ using the procedure Round_Trunc, and then form the groups $G_1,\ldots,G_\ell$; combined, this clearly takes $O(n)$ arithmetic operations. Once the groups are formed, we can compute the extended matrix in time $O\big(n(k+1)^p d^s s^p\big)$ and perform $\ell_{p/2^r}$ Lewis weight sampling on each group, which takes $O(nk + ns)$ total time across all groups. To approximately solve the resulting $\ell_p$ regression problem formed by the subsampled rows, we require $\mathrm{poly}\big(d^s, k^p, s^p, \frac{1}{\varepsilon}\big)$ time. Therefore, the total runtime is $n \cdot \mathrm{poly}\big(2^p, d^s, k^p, s^p, \frac{1}{\varepsilon}, \log n\big)$.

Theorem 1.5 then follows from Lemma A.9 and Lemma A.10. Theorem 1.6 is achieved through a similar analysis for a noisy Vandermonde matrix. In particular, it follows from the same proof structure as Lemma A.3 by showing that for $2^r \le p < 2^{r+1}$, the $\ell_{2^r}$ subspace embedding problem on $A = V + S$, for a given Vandermonde matrix $V \in \mathbb{R}^{n\times d}$ and a sparse matrix $S \in \mathbb{R}^{n\times d}$, can be reshaped as a constrained $\ell_1$ subspace embedding problem on an input Vandermonde matrix of size $n \times d'$, where $d' = (d^s)(s^p)(pd)$.

B APPLICATIONS TO POLYNOMIAL REGRESSION

In the polynomial regression problem, the goal is to find a degree-$d$ polynomial $\widehat{q}$ such that
$$\|\widehat{q}(t) - f(t)\|_p^p \le (1+\varepsilon) \min_{q:\deg(q)\le d} \|q(t) - f(t)\|_p^p,$$
where $\varepsilon > 0$ is an accuracy parameter given as input and $\|\cdot\|_p^p$ is the $\ell_p$ norm raised to the $p$-th power, $\|g\|_p^p = \int_{-1}^{1} |g(t)|^p\, dt$. The polynomial regression problem is fundamental in statistics, computational mathematics, machine learning, and beyond. It has been studied since as early as the 19th century, with the work of Legendre and Gauss on least squares polynomial regression, and has applications in learning half-spaces (Kalai et al., 2008), solving parametric PDEs (Hampton & Doostan, 2015), and surface reconstruction (Pratt, 1987).

Given the flexibility to choose query locations $x_1,\ldots,x_s$, we can view the polynomial regression problem as an active learning or experimental design problem. We would thus like to minimize the number of queries $s$, as a function of the approximation degree $d$, the norm $p$, and the accuracy parameter $\varepsilon$, needed to find $\widehat{q}$. Observe that $d+1$ queries are obviously necessary, and also that $d+1$ queries suffice when $f$ can be exactly fit by a degree-$d$ polynomial, via direct interpolation.
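To illustrate the exact-fit case, the following short Python sketch (our own illustration; the degree and query points are arbitrary choices) recovers a degree-$d$ polynomial from exactly $d+1$ queries by solving the $(d+1)\times(d+1)$ Vandermonde system.

import numpy as np

d = 4                                                      # illustrative degree
coeffs_true = np.random.default_rng(0).standard_normal(d + 1)
f = np.polynomial.polynomial.Polynomial(coeffs_true)       # f is exactly degree d

# Query f at d + 1 distinct points in [-1, 1].
t = np.linspace(-1.0, 1.0, d + 1)
b = f(t)

# Vandermonde system A[i, j] = t_i^j, so A @ coeffs = b has a unique solution.
A = np.vander(t, N=d + 1, increasing=True)
coeffs_recovered = np.linalg.solve(A, b)

assert np.allclose(coeffs_recovered, coeffs_true)

When $f$ is not exactly a degree-$d$ polynomial, interpolation alone gives no approximation guarantee, which is what motivates the sampling-based approach below.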
In general, however, when $\min_{q:\deg(q)\le d} \|q(t) - f(t)\|_p^p \neq 0$, we require $s > d + 1$ queries. Our Vandermonde $\ell_p$ regression results can be used to give the first result showing that, for all $p \ge 1$, $d \cdot \exp(O(p)) \cdot \mathrm{poly}\big(\frac{1}{\varepsilon}\big)$ queries suffice to obtain a $(1+\varepsilon)$-approximation to the best polynomial fit. We require the following structural theorem, which reduces the $\ell_p$ polynomial regression problem to solving $\ell_p$ regression on Vandermonde matrices:

Theorem B.1. (Kane et al., 2017; Meyer et al., 2021) Suppose $s_1,\ldots,s_{n_0}$ are drawn uniformly from $[-1,1]$. Let $A \in \mathbb{R}^{n_0\times(d+1)}$ be the associated Vandermonde matrix, so that $A_{i,j} = s_i^{j-1}$. Let $n_0 = \exp(O(p)) \cdot O\big(\frac{d^5}{\varepsilon^{2+2p}}\big)$ and let $b \in \mathbb{R}^{n_0}$ be the evaluations of $f$, so that $b_i = f(s_i)$. Then with probability $\frac{11}{12}$, the sketched solution $\widehat{x} = \operatorname{argmin}_x \|Ax - b\|_p$ satisfies
$$\|P_{\widehat{x}} - f\|_p^p \le (1+\varepsilon) \min_{x\in\mathbb{R}^{d+1}} \|P_x - f\|_p^p,$$
where $P_x$ denotes the degree-$d$ polynomial with coefficient vector $x$.

Theorem B.1 states that we can uniformly sample $\exp(O(p)) \cdot \mathrm{poly}\big(d, \frac{1}{\varepsilon}\big)$ points from $[-1,1]$. We can then form an $\ell_p$ regression problem by evaluating each of the polynomial basis functions at the sampled points to form the design matrix $A$, and by querying the underlying signal $f$ at the sampled points to form the measurement vector $b$. Theorem B.1 says that the optimal solution $\widehat{x} = \operatorname{argmin}_x \|Ax - b\|_p$ is a $(1+\varepsilon)$-approximation to the best-fit degree-$d$ polynomial.

We could naïvely solve the $\ell_p$ regression problem on the Vandermonde matrix $A$ and the measurement vector $b$ approximately using standard $\ell_p$ regression techniques such as Lewis weight sampling, which would result in a total query complexity of $\exp(O(p)) \cdot O\big(\frac{d^5}{\varepsilon^{2+2p}}\big)$. However, Lemma 1.4 implies that we can instead solve an $\ell_q$ regression problem on a Vandermonde matrix with $O(dp)$ columns for some $q \in [1,2]$. Crucially, for $q \in [1,2]$ there exist active $\ell_q$ regression algorithms that only require reading $O(d)\,\mathrm{poly}\big(\frac{1}{\varepsilon}\big)$ entries of $b$:

Theorem B.2. (Musco et al., 2021) Given $p \in [1,2]$ and an input matrix $A \in \mathbb{R}^{n\times d}$, there exists an algorithm that reads $O(d)\,\mathrm{poly}\big(\frac{1}{\varepsilon}\big)$ entries of $b$ and, with probability at least $0.99$, outputs $x'$ such that $\|Ax' - b\|_p \le (1+\varepsilon)\min_x \|Ax - b\|_p$.

Hence, by Theorem B.2, we can approximately solve the $\ell_q$ regression problem on a Vandermonde matrix with $O(dp)$ columns by reading $O(dp)\,\mathrm{poly}\big(\frac{1}{\varepsilon}\big)$ entries of $b$. By Lemma 1.4, the approximate solution is also a $(1+\varepsilon)$-approximation to the optimal $\ell_p$ regression on the Vandermonde matrix $A$. By Theorem B.1, the approximate solution then also forms the coefficient vector of a polynomial that is a $(1+\varepsilon)$-approximation to the polynomial $\ell_p$ regression problem. Therefore, we obtain the following guarantee for the polynomial regression problem:

Theorem B.3. For any degree $d$ and norm $p \ge 1$, there exists an algorithm that queries $f$ at $s = dp \cdot \mathrm{poly}\big(\log(dp), \frac{1}{\varepsilon}\big)$ points and outputs a degree-$d$ polynomial $\widehat{q}(t)$ such that
$$\|\widehat{q}(t) - f(t)\|_p^p \le (1+\varepsilon) \min_{q:\deg(q)\le d} \|q(t) - f(t)\|_p^p,$$
with probability at least $\frac{2}{3}$.

The previous best algorithm (Kane et al., 2017; Meyer et al., 2021) for $\ell_p$ polynomial regression sampled $O(d^5)$ points uniformly from $[-1,1]$ and then used standard $\ell_p$ regression algorithms that required reading the signal at all $O(d^5)$ sampled points, for a total query complexity of $O(d^5)$. By comparison, Theorem B.3 has only a linear dependence on $d$, due to the structural property of Lemma 1.4. Since $\Omega(d)$ queries are clearly necessary, our result settles the dependence on $d$ in the query complexity of $\ell_p$ polynomial regression for all $d$ and all $p \ge 1$.
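The following Python sketch illustrates the overall pipeline behind Theorems B.1-B.3 in a deliberately simplified form: sample query points uniformly from $[-1,1]$, form the Vandermonde design matrix, query $f$ at those points, and fit coefficients by $\ell_p$ regression. It is an illustration only; the sample size n0 is an arbitrary choice, and the $\ell_p$ solve uses a generic off-the-shelf optimizer rather than the Lewis weight subsampling and active $\ell_q$ regression steps that yield the query-complexity guarantees above.

import numpy as np
from scipy.optimize import minimize

def fit_polynomial_lp(f, d, p, n0, seed=0):
    """Simplified sketch: fit a degree-d polynomial to f on [-1, 1] in the l_p norm."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(-1.0, 1.0, size=n0)           # uniform query points (as in Theorem B.1)
    b = f(t)                                      # query the signal
    A = np.vander(t, N=d + 1, increasing=True)    # Vandermonde design matrix
    # Generic l_p solve; the paper's algorithm would subsample A and b instead.
    loss = lambda x: np.sum(np.abs(A @ x - b) ** p)
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]     # least-squares warm start
    res = minimize(loss, x0, method="L-BFGS-B")
    return np.polynomial.polynomial.Polynomial(res.x)

# Example: approximate a non-polynomial signal in the l_4 norm.
q_hat = fit_polynomial_lp(np.cos, d=6, p=4, n0=2000)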
C ADDITIONAL EXPERIMENTS

In this section, we experimentally verify that $\ell_{p/2^r}$ Lewis weight sampling works for unstructured matrices, taking a similar approach to Section 3 for $\ell_p$ regression. For this test, we fix $n = 25{,}000$, $d = 10$, and $p = 6$, while varying $m \in [1, 1000]$. We let
$$A = \begin{pmatrix} G_1 & 0 \\ 0 & G_2 \end{pmatrix},$$
where $G_1 \in \mathbb{R}^{100\times 6}$ and $G_2 \in \mathbb{R}^{24{,}900\times 4}$ are i.i.d. $N(0,1)$ matrices. To generate $b$, we sample a vector $x \in \mathbb{R}^{10}$ whose first $6$ entries are $N(0, 100^2)$ and whose remaining $4$ entries are $N(0,1)$, and let $b = Ax + z$, where $z \in \mathbb{R}^{25{,}000}$ has i.i.d. $N(0,1)$ entries. We generate this matrix $A$ and response vector $b$ just once and run $\ell_{p/2^r}$ Lewis weight sampling many times, so the variance in the plot comes only from the random sampling algorithms. Note that we again omit the rounding procedure on $b$. Figure 3 shows the result of this test, and we clearly see that the error shrinks quickly in $m$ for our algorithm.

This approach is much more practical than the prior Lewis weight approximation method for unstructured matrices when $p > 4$. That approach required solving a nonlinearly constrained SDP $O(\log n)$ times (Cohen & Peng, 2015), while our method requires only a Gaussian sketch matrix and the standard Lewis weight iteration, which converges very quickly.

Since $b$ is so large on its first $100$ entries, it is important for any subsampling algorithm to sample at least $6$ of the first $100$ rows. In expectation, uniform sampling picks none of these rows until $m \approx \frac{25{,}000}{100} = 250$, which is why uniform sampling fails to converge to a good solution for small $m$. Lewis weight sampling instead gives much higher priority to the first $100$ rows, avoiding this issue. This is why the gap between Lewis weight sampling and uniform sampling is so large in this experiment.

Fig. 3: Empirical relative error ($\varepsilon_{\mathrm{empirical}}$) versus subsample complexity $m$ for $\ell_{p/2^r}$ Lewis weight sampling. We ran the experiment 50 times and plot the median, 25th percentile, and 75th percentile for each value of $m$.

D ADDITIONAL RELATED WORK

As previously mentioned, subspace embeddings are common tools used to approximately solve $\ell_p$ regression. Given an input matrix $A \in \mathbb{R}^{n\times d}$, a subspace embedding is a matrix $M \in \mathbb{R}^{m\times d}$ with $m \ll n$ such that
$$(1-\varepsilon)\,\|Ax\|_p \le \|Mx\|_p \le (1+\varepsilon)\,\|Ax\|_p \quad \text{for all } x \in \mathbb{R}^d.$$
Thus, given an instance of $\ell_p$ regression, where the goal is to minimize $\|Ax - b\|_p$, we can set $B = [A; b]$, compute a subspace embedding $M$ for $B$, and then solve a constrained $\ell_p$ regression problem on the smaller matrix $M$. A subspace embedding of a matrix $A \in \mathbb{R}^{n\times d}$ can be formed by sampling rows of $A$ with probabilities proportional to their $\ell_p$ leverage scores, e.g., (Cormode et al., 2018; Dasgupta et al., 2008); their $\ell_p$ sensitivities, e.g., (Clarkson et al., 2019; Braverman et al., 2020; 2021; Musco et al., 2021); or their $\ell_p$ Lewis weights, e.g., (Cohen & Peng, 2015; Durfee et al., 2018; Chhaya et al., 2020; Chen & Derezinski, 2021; Parulekar et al., 2021; Meyer et al., 2021). In any of these cases, the resulting matrix $M$ contains a subset of rows of $A$, rescaled by a function of their sampling probability so as to give an unbiased estimate of the actual $\ell_p$ mass.

In addition to sampling methods, sketching is a common approach for constructing subspace embeddings. In these cases, the subspace embedding $M$ is formed by setting $M = RA$ for some (often random) matrix $R \in \mathbb{R}^{m\times n}$.
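As a minimal illustration of the oblivious sketching template $M = RA$ (our own sketch; the dimensions and sketch size are arbitrary choices, and no particular embedding guarantee is claimed for these parameters):

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20_000, 20, 400                  # illustrative sizes

A = rng.standard_normal((n, d))

# Dense Gaussian sketch: the classic sub-Gaussian construction for p = 2.
# R is drawn without looking at A at all, i.e., obliviously.
R = rng.standard_normal((m, n)) / np.sqrt(m)
M = R @ A                                  # M is m x d

# For p = 1, the analogous construction instead draws the entries of R from a
# Cauchy distribution (with a different estimator and rescaling; see Sohler &
# Woodruff, 2011), e.g., R = rng.standard_cauchy((m, n)).

x = rng.standard_normal(d)
print(np.linalg.norm(A @ x), np.linalg.norm(M @ x))   # close for suitably large m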
The advantage of sketching over sampling is that $R$ can sometimes be chosen obliviously to the structure of $A$, whereas the sampling probabilities for each of the above distributions ($\ell_p$ leverage scores, $\ell_p$ sensitivities, and $\ell_p$ Lewis weights) are data-dependent and thus require a pass over the matrix $A$. The sketching matrix can be generated from a family of random matrices whose entries are Cauchy random variables for $p = 1$, e.g., (Sohler & Woodruff, 2011; Meng & Mahoney, 2013; Clarkson et al., 2016), or sub-Gaussian random variables for $p = 2$, e.g., (Sarlós, 2006; Nelson & Nguyen, 2013; Clarkson & Woodruff, 2013). More generally, exponential random variables can be used for $p > 2$, though the number of rows $m$ in the resulting sketch matrix then has a polynomial dependence on $n$ (Woodruff & Zhang, 2013). A line of recent works has studied the tradeoffs between oblivious linear sketches, sampling-based algorithms, and other sketches for $\ell_p$ subspace embeddings (Wang & Woodruff, 2019; Li et al., 2021). In this paper, we focus on sampling-based algorithms for $\ell_p$ regression because sampling preserves structure when the input matrix $A$ itself is structured.
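As a concrete counterpart to the oblivious-sketching example above, the following Python sketch shows the generic sampling-and-rescaling construction shared by the sampling-based methods discussed in this appendix. The probabilities prob are a placeholder (a real implementation would derive them from $\ell_p$ leverage scores, sensitivities, or Lewis weights); the rescaling by $(1/\mathrm{prob}_i)^{1/p}$ is what makes $\|SAx - Sb\|_p^p$ an unbiased estimate of $\|Ax - b\|_p^p$.

import numpy as np

def lp_row_sample(A, b, prob, p, seed=0):
    """Keep row i independently with probability prob[i]; rescale kept rows so that
    ||S A x - S b||_p^p is an unbiased estimate of ||A x - b||_p^p."""
    rng = np.random.default_rng(seed)
    keep = rng.random(A.shape[0]) < prob
    scale = (1.0 / prob[keep]) ** (1.0 / p)
    return A[keep] * scale[:, None], b[keep] * scale

# Placeholder: uniform probabilities; leverage scores, sensitivities, or Lewis
# weights would be substituted here in a real algorithm.
rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 5))
b = A @ np.ones(5) + 0.1 * rng.standard_normal(10_000)
prob = np.full(A.shape[0], 0.05)
SA, Sb = lp_row_sample(A, b, prob, p=3)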