In-Database Regression in Input Sparsity Time

Rajesh Jayaram$^1$, Alireza Samadian$^2$, David P. Woodruff$^1$, Peng Ye$^3$

Abstract

Sketching is a powerful dimensionality reduction technique for accelerating algorithms for data analysis. A crucial step in sketching methods is to compute a subspace embedding (SE) for a large matrix $A \in \mathbb{R}^{N \times d}$. SEs are the primary tool for obtaining extremely efficient solutions for many linear-algebraic tasks, such as least squares regression and low rank approximation. Computing an SE often requires an explicit representation of $A$ and running time proportional to the size of $A$. However, if $A = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is the result of a database join query on several smaller tables $T_i \in \mathbb{R}^{n_i \times d_i}$, then this running time can be prohibitive, as $A$ itself can have as many as $O(n_1 n_2 \cdots n_m)$ rows. In this work, we design subspace embeddings for database joins which can be computed significantly faster than computing the join. For the case of a two-table join $A = T_1 \bowtie T_2$, we give input-sparsity algorithms for computing subspace embeddings, with running time bounded by the number of non-zero entries in $T_1, T_2$. This results in input-sparsity time algorithms for high accuracy regression, significantly improving upon the running time of prior FAQ-based methods for regression. We extend our results to arbitrary joins for the ridge regression problem, also considerably improving the running time of prior methods. Empirically, we apply our method to real datasets and show that it is significantly faster than existing algorithms.

$^1$Computer Science Department, Carnegie Mellon University, Pittsburgh PA, United States. $^2$Department of Computer Science, University of Pittsburgh, Pittsburgh PA, United States. $^3$Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. Correspondence to: David P. Woodruff.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Sketching is an important tool for dimensionality reduction, whereby one quickly reduces the size of a large-scale optimization problem while approximately preserving the solution space. One can then solve the lower-dimensional problem much more efficiently, and the sketching guarantee ensures that the resulting solution is approximately optimal for the original optimization problem. In this paper we focus on the notion of a subspace embedding (SE), and its applications to problems in databases. Formally, given a large matrix $A \in \mathbb{R}^{N \times d}$, an $\epsilon$-subspace embedding for $A$ is a matrix $SA$, where $S \in \mathbb{R}^{k \times N}$, with the property that $\|SAx\|_2 = (1 \pm \epsilon)\|Ax\|_2$ simultaneously for all $x \in \mathbb{R}^d$.

A prototypical example of how a subspace embedding can be applied to solve an optimization problem is linear regression, where one wants to solve $\min_x \|Ax - b\|_2$ for a tall matrix $A \in \mathbb{R}^{N \times d}$, where $N \gg d$, which contains many data points (rows). Instead of directly solving for $x$, which requires computing the covariance matrix of $A$ and would take $O(Nd^2)$ time for general $A$ (see Footnote 1), one can first compute a sketch $SA, Sb$ of the problem, where $S \in \mathbb{R}^{k \times N}$ is a random matrix which can be applied quickly to $A$ and $b$. If $[SA, Sb]$ is an $\epsilon$-subspace embedding for $[A, b]$, it follows that the solution $\hat{x}$ to the sketched problem $\min_x \|SAx - Sb\|_2$ has cost within a $(1 + \epsilon)$ factor of the optimal cost of the original regression problem. However, if $k \ll N$, then solving for $\hat{x}$ can now be accomplished much faster, in $O(kd^2)$ time.
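To make the sketch-and-solve recipe concrete, here is a minimal illustration (ours, not from the paper), using a dense Gaussian sketch as a stand-in for the fast sketching matrices discussed later; the sizes and data are synthetic.

```python
# A minimal sketch-and-solve example: solve min_x ||SAx - Sb||_2 on a much
# smaller problem and compare its cost on the original problem to the optimum.
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 20000, 15, 400                        # k is on the order of d/eps^2
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(N)

S = rng.standard_normal((k, N)) / np.sqrt(k)    # stand-in sketch; [SA, Sb] should be an eps-SE for [A, b]
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

cost = lambda x: np.linalg.norm(A @ x - b)
print("relative excess cost:", cost(x_sketch) / cost(x_exact) - 1.0)   # roughly O(eps)
```

With $k$ on the order of $d/\epsilon^2$, the reported excess cost is on the order of $\epsilon$, while the small sketched problem can be solved in $O(kd^2)$ time.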
SEs and similar tools for dimensionality reduction can also be used to speed up the running time of algorithms for $\ell_p$ regression, low rank approximation, and many other problems. We refer the reader to (Woodruff et al., 2014) for a survey of applications of sketching to randomized numerical linear algebra.

One potential downside of most standard subspace embeddings is that the time required to compute $SA$ often scales linearly with the input sparsity of $A$, meaning the number of non-zero entries of $A$, which we denote by $\mathrm{nnz}(A)$. This dependence on $\mathrm{nnz}(A)$ is in general necessary just to read all of the entries of $A$. However, in many applications the dataset $A$ is highly structured, and is itself the result of a query performed on a much smaller dataset.

(Footnote 1: This can be sped up to $O(Nd^{\omega-1})$ time in theory, where $\omega \approx 2.373$ is the exponent of matrix multiplication.)

A canonical and important example of this is a database join. This example is particularly important since datasets are often the result of a database join (Hellerstein et al., 2012). In fact, this use-case has motivated companies and teams such as RelationalAI (Rel) and Google's BigQuery ML (Big) to design databases that are capable of handling machine learning queries. Here, we have $m$ tables $T_1, T_2, \ldots, T_m$, where $T_i \in \mathbb{R}^{n_i \times d_i}$, and we can consider their join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m \in \mathbb{R}^{N \times d}$ over a set of columns. In general, the number $N$ of rows of $J$ can be as large as $n_1 n_2 \cdots n_m$, which far exceeds the actual input description size $\sum_i \mathrm{nnz}(T_i)$ of the problem. We note that it is possible to do various operations on a join in sublinear time using database algorithms developed for Functional Aggregation Queries (FAQ) (Abo Khamis et al., 2016), and indeed it is possible using so-called FAQ-based techniques (Abo Khamis et al., 2018) to compute the covariance matrix $J^T J$ in time $O(d^4 m n \log(n))$ for an acyclic join, where $n = \max(n_1, \ldots, n_m)$, after which one can solve least squares regression in $\mathrm{poly}(d)$ time. While this can be significantly faster than the $\Omega(N)$ time required to compute the actual join, it is still significantly larger than the input description size $\sum_i \mathrm{nnz}(T_i)$, which even if the tables are dense is at most $dmn$. When the tables are sparse, e.g., with $O(n)$ non-zero entries in each table, one could even hope for a running time close to $O(mn)$.

One could hope to achieve such running times with the help of subspace embeddings, by first reducing the data to a low-dimensional representation, and then solving regression exactly on the small representation. However, due to the lack of a clean algebraic structure for database joins, it is not clear how to apply a subspace embedding without first computing the join. Thus, a natural question is:

Is it possible to apply a subspace embedding to a join, without having to explicitly form the join?

We note that the lack of input-sparsity time algorithms for regression on joins is further exacerbated in the presence of categorical features. Indeed, it is common practice to convert categorical data to its so-called one-hot encoding before optimizing any statistical model. Such an encoding creates one column for each possible value of the categorical feature, and only a single such column is non-zero in each row. Thus, in the presence of categorical features, the tables in the join are extremely high dimensional and extremely sparse.
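To see the size blow-up concretely, the following toy computation (ours; hypothetical tables, using pandas) obtains the per-key block sizes and the number of rows $N$ of a two-table join directly from the factor tables, without materializing the join.

```python
# A minimal sketch (not the paper's code): for a two-table join J = T1 ⋈ T2 on a
# shared key column, the number of output rows is the sum over shared key values
# of (#rows of T1 with that key) * (#rows of T2 with that key).
import pandas as pd

# Hypothetical toy tables; "key" plays the role of the joined column(s) Q.
T1 = pd.DataFrame({"key": [1, 1, 2, 3], "f1": [0.5, 1.2, -0.3, 2.0]})
T2 = pd.DataFrame({"key": [1, 2, 2, 2], "f2": [7.0, 0.1, 0.2, 0.3]})

c1 = T1.groupby("key").size()        # per-key counts on T1's side
c2 = T2.groupby("key").size()        # per-key counts on T2's side
block_sizes = (c1 * c2).dropna()     # block size = product of the two counts, per shared key
N = int(block_sizes.sum())           # total number of rows of J, potentially up to n1*n2

# Sanity check by materializing the join (feasible only at toy scale).
J = T1.merge(T2, on="key")
assert len(J) == N
print(N, dict(block_sizes.astype(int)))
```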
Since the data is high-dimensional, one often regularizes it to avoid overfitting, and so in addition to ordinary regression, one could also ask to solve regularized regression on such datasets, such as ridge regression. One could then ask if it is possible to design input-sparsity time algorithms on joins for regression or ridge regression.

1.1. Our Contributions

We start by describing our results for least squares regression in the important case when the join is on two tables. We note that the two-table case is a very well-studied case, see, e.g., (Alon et al., 1999; 2002; Gucht et al., 2015). Moreover, one can always reduce to the case of a two-table join by precomputing part of a general join. The following two theorems state our results for computing a subspace embedding and for solving regression on the join $J = T_1 \bowtie T_2$ of two tables. Our results demonstrate a substantial improvement over all prior algorithms for this problem. In particular, they answer the two questions above, showing that it is possible to compute a subspace embedding in input-sparsity time, and that regression can also be solved in input-sparsity time.

To the best of our knowledge, the fastest algorithm for linear regression on two tables has a worst-case time complexity of $O(nd + nD^2)$, where $n = \max(n_1, n_2)$, $d$ is the number of columns, and $D$ is the dimensionality of the data after encoding the categorical data. Note that in the case of numerical data (the dense case) we have $D = d$, since there is no one-hot encoding, and the time complexity is $O(nd^2)$; it can be further improved to $O(nd^{\omega-1})$, where $\omega < 2.373$ is the exponent of fast matrix multiplication. This time complexity is the same as the fastest known time complexity for exact linear regression on a single table. In the case of categorical features (sparse data), using sparse tensors, $D^2$ can be replaced by a quantity that is at least the number of non-zero elements in $J^TJ$ (which is at least $d^2$ and at most $D^2$) using the algorithm in (Abo Khamis et al., 2018).

We state two results with differing leading terms and low-order additive terms, as one may be more useful than the other depending on whether the input tables are dense or sparse.

Theorem 1 (In-Database Subspace Embedding). Suppose $J = T_1 \bowtie T_2 \in \mathbb{R}^{N \times d}$ is a join of two tables, where $T_1 \in \mathbb{R}^{n_1 \times d_1}$, $T_2 \in \mathbb{R}^{n_2 \times d_2}$. Then Algorithm 1 outputs a sketching matrix $S \in \mathbb{R}^{k \times N}$ such that $\tilde{J} = SJ$ is an $\epsilon$-subspace embedding for $J$, meaning $\|SJx\|_2^2 = (1 \pm \epsilon)\|Jx\|_2^2$ simultaneously for all $x \in \mathbb{R}^d$, with probability at least 9/10 (see Footnote 2). The running time to return $SJ$ is the minimum of $\tilde{O}((n_1 + n_2)d/\epsilon^2 + d^3/\epsilon^2)$ and $\tilde{O}((\mathrm{nnz}(T_1) + \mathrm{nnz}(T_2))/\epsilon^2 + (n_1 + n_2)/\epsilon^2 + d^5/\epsilon^2)$ (see Footnote 3). In the former case, we have $k = O(d^2/\epsilon^2)$, and in the latter case we have $k = O(d^4/\epsilon^2)$.

(Footnote 2: We remark that using standard techniques for amplifying the success probability of an SE (see Section 2.3 of (Woodruff et al., 2014)), one can boost the success probability to $1 - \delta$ by repeating the entire algorithm $O(\log \delta^{-1})$ times, increasing the running time by a multiplicative $O(\log \delta^{-1})$ factor. One must then compute the SVD of each of the $O(\log \delta^{-1})$ sketches, which results in an additive $O(kd^{\omega-1}\log \delta^{-1})$ term in the running time, where $\omega \approx 2.373$ is the exponent of matrix multiplication. Note that this additive dependence on $d$ is only slightly ($d^{0.373}$) larger than the dependence required for constant probability as stated in the theorem.)
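Operationally, the guarantee of Theorem 1 says that every singular value of $SJ$, restricted to the column space of $J$, lies in $[1-\epsilon, 1+\epsilon]$ relative to those of $J$. The sketch below (ours, with a dense Gaussian sketch standing in for Algorithm 1 and a synthetic $J$ small enough to materialize) checks this property numerically.

```python
# A minimal check of the eps-subspace-embedding property; Algorithm 1 would
# produce SJ without ever forming J, but the check itself is the same.
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 20000, 10, 400
J = rng.standard_normal((N, d))                        # stands in for the materialized join
S = rng.standard_normal((k, N)) / np.sqrt(k)           # stand-in Gaussian sketch
SJ = S @ J

# Spot check on random directions x: ||SJx|| should be (1 ± eps) ||Jx||.
for _ in range(3):
    x = rng.standard_normal(d)
    print("||SJx|| / ||Jx|| =", np.linalg.norm(SJ @ x) / np.linalg.norm(J @ x))

# Exhaustive check over the whole subspace: the singular values of SJ V Sigma^{-1}
# (= SU, where J = U Sigma V^T) are exactly the extreme distortion factors.
U, s, Vt = np.linalg.svd(J, full_matrices=False)
distortions = np.linalg.svd(SJ @ Vt.T @ np.diag(1.0 / s), compute_uv=False)
print("max distortion:", distortions.max(), "min distortion:", distortions.min())
```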
Next, by following a standard reduction from a subspace embedding to an algorithm for regression, we obtain extremely efficient machine-precision regression algorithms for two-table database joins.

Theorem 2 (Machine Precision Regression). Suppose $J = T_1 \bowtie T_2 \in \mathbb{R}^{N \times d}$ is a join of two tables, where $T_1 \in \mathbb{R}^{n_1 \times d_1}$, $T_2 \in \mathbb{R}^{n_2 \times d_2}$. Let $U \subseteq [d]$ be any subset, let $J_U \in \mathbb{R}^{N \times |U|}$ be $J$ restricted to the columns in $U$, and let $b \in \mathbb{R}^N$ be any column of the join $J$. Then there is an algorithm which outputs $\hat{x} \in \mathbb{R}^{|U|}$ such that with probability 9/10 (see Footnote 4) we have $\|J_U\hat{x} - b\|_2 \le (1 + \epsilon)\min_{x \in \mathbb{R}^{|U|}}\|J_U x - b\|_2$. The running time required to compute $\hat{x}$ is the minimum of $\tilde{O}(((n_1 + n_2)d + d^3)\log(1/\epsilon))$ and $\tilde{O}((\mathrm{nnz}(T_1) + \mathrm{nnz}(T_2) + d^5)\log(1/\epsilon))$.

(Footnote 3: We use $\tilde{O}$ notation to omit factors of $\log(N)$.)

(Footnote 4: The probability of success here is the same as the probability of success of constructing a subspace embedding; see the earlier footnote about amplifying this success probability.)

General Joins. We next consider arbitrary joins on more than two tables. In this case, we primarily focus on the ridge regression problem $\min_x \|Jx - b\|_2^2 + \lambda\|x\|_2^2$, for a regularization parameter $\lambda$. This problem is a popular regularized variant of regression and was considered in the context of database joins in (Abo Khamis et al., 2018). We introduce a general framework to apply sketching methods over arbitrary joins in the supplementary material; our method is able to take a sketch with certain properties as a black box, and can be applied both to Tensor-Sketch (Avron et al., 2014; Pagh, 2013; Pham & Pagh, 2013), as well as to recent improvements to this sketch (Ahle et al., 2020; Woodruff & Zandieh, 2020) for certain joins. Unlike previous work, which required computing $J^TJ$ exactly, we show how to use sketching to approximate it up to high accuracy, where the number of entries of $J^TJ$ computed depends on the so-called statistical dimension of $J$, which can be much smaller than the data dimension $D$.

Evaluation. Empirically, we compare our algorithms on various databases to the previous best FAQ-based algorithm of (Abo Khamis et al., 2018), which computes each entry of the covariance matrix $J^TJ$. For two-table joins, we focus on the standard regression problem. We use the algorithm described in Section 3, replacing the Fast Tensor-Sketch with Tensor-Sketch for better practical performance. For general joins, we focus on the ridge regression problem; such joins can be very high dimensional, and ridge regression helps to prevent overfitting. We apply our sketching approach to the entire join and obtain an approximation to it, where our complexity is in terms of the statistical dimension rather than the actual dimension $D$. Our results demonstrate significant speedups over the previous best algorithm, with only a small sacrifice in accuracy. For example, for the join of two tables in the MovieLens dataset, which has 23 features, we obtain a 10-fold speedup over the FAQ-based algorithm, while maintaining a 0.66% relative error. For the natural join of three tables in the real MovieLens dataset, which is a join with 24 features, we obtain a 3-fold speedup over the FAQ-based algorithm with only 0.28% MSE relative error. Further details can be found in Section 4.

1.2. Related Work on Sketching Structured Data

The use of specialized sketches for different classes of structured matrices $A$ has been a topic of substantial interest. The Tensor-Sketch algorithm of (Pagh, 2013) can be applied to Kronecker products $A = A_1 \otimes \cdots \otimes A_m$ without explicitly computing the product.
The efficiency of this algorithm was recently improved by (Ahle et al., 2020). The special case when all $A_i$ are equal is known as the polynomial kernel, which was considered in (Pham & Pagh, 2013) and extended by (Avron et al., 2014). Kronecker product regression has also been studied for the $\ell_p$ loss functions (Diao et al., 2017; 2019), which also gave improved algorithms for $\ell_2$ regression. In (Avron et al., 2013; Shi & Woodruff, 2019) it is shown that regression on $A$ can be solved in time $T(A)\cdot\mathrm{poly}(\log(nd))$, where $T(A) \le \mathrm{nnz}(A)$ is the time needed to compute the matrix-vector product $Ay$ for any $y \in \mathbb{R}^d$. For many classes of $A$, such as Vandermonde matrices, $T(A)$ is substantially smaller than $\mathrm{nnz}(A)$. Finally, a flurry of work has used sketching to obtain faster algorithms for low rank approximation of structured matrices. In (Musco & Woodruff, 2017; Bakshi et al., 2019), low rank approximations to positive semidefinite (PSD) matrices are computed in time sublinear in the number of entries of $A$. This was also shown for distance matrices in (Bakshi & Woodruff, 2018; Indyk et al., 2019; Bakshi et al., 2019).

1.3. Related In-Database Machine Learning Work

The work of (Abo Khamis et al., 2016) introduced Inside-Out, a polynomial time algorithm for calculating functional aggregation queries (FAQs) over joins without performing the joins themselves, which can be utilized to train various types of machine learning models. The Inside-Out algorithm builds on several earlier papers, including (Aji & McEliece, 2000; Dechter, 1996; Kohlas & Wilson, 2008; Grohe & Marx, 2006). Relational linear regression, singular value decomposition, and factorization machines are studied extensively in multiple prior works (Rendle, 2013; Kumar et al., 2015b; Schleich et al., 2016; Khamis et al., 2018; Abo Khamis et al., 2018; Kumar et al., 2016; Elgamal et al., 2017; Kumar et al., 2015a). The best known time complexity for training linear regression when the features are continuous is $O(d^4 m n^{\mathrm{fhtw}}\log(n))$, where $\mathrm{fhtw}$ is the fractional hypertree width of the join query. Note that $\mathrm{fhtw}$ is 1 for acyclic joins. For categorical features, the time complexity is $O(d^2 m n^{\mathrm{fhtw}+2})$ in the worst case; however, (Abo Khamis et al., 2018) uses sparse computation of the results to reduce this time depending on the join instance. In the case of polynomial regression, the calculation of pairwise interactions among the features can be time-consuming, and this is addressed in (Abo Khamis et al., 2018; Li et al., 2019). Relational support vector machines with Gaussian kernels are studied in (Yang et al.). In (Cheng & Koudas, 2019), a relational algorithm is introduced for Independent Gaussian Mixture Models, which can be used for kernel density estimation by estimating the underlying distribution of the data.

2. Preliminaries

2.1. Database Joins

We first introduce the notion of a block of a join, which will be important in our analysis. Let $T_1, \ldots, T_m$ be tables, with $T_i \in \mathbb{R}^{n_i \times d_i}$. Let $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m \in \mathbb{R}^{N \times d}$ be an arbitrary join on the tables $T_i$. Let $Q$ be the subset of columns which are contained in at least two tables, i.e., the columns which are joined upon. For any subset $U$ of columns and any table $T$ containing a set of columns $U'$, let $T|_U$ be the projection of $T$ onto the columns in $U \cap U'$. Similarly define $r|_U$ for a row $r$. Let $C$ be the set of columns in $J$, and let $C_j \subseteq C$ be the columns contained in $T_j$. Define the set of blocks $B = B(J) \subset \mathbb{R}^{|Q|}$ of the join to be the set of distinct rows in the projection of $J$ onto $Q$.
In other words, $B$ is the set of distinct rows which occur in the restriction of $J$ to the columns being joined on. For any $j \in [m]$, let $\hat{T}_j \in \mathbb{R}^{n_j \times d}$ be the embedding of the rows of $T_j$ into the join $J$, obtained by padding $T_j$ with zero-valued columns for each column not contained in $T_j$, and such that for any column $c$ contained in more than one $T_j$, we define the matrices $\hat{T}_j$ so that exactly one of the $\hat{T}_j$ contains $c$ (it does not matter which of the tables containing the column $c$ has $c$ assigned to it in the definition of $\hat{T}_j$). More formally, we fix any partition $\{\hat{C}_j\}_{j \in [m]}$ of $C$, such that $C = \cup_j \hat{C}_j$ and $\hat{C}_j \subseteq C_j$ for all $j$.

For simplicity, given a block $\vec{i} \in B$, which was defined as a row in $\mathbb{R}^{|Q|}$, we drop the vector notation and write $i = \vec{i} \in B$. For a given $i = (i_1, \ldots, i_{|Q|}) \in B$, let $s_{(i)}$ denote the size of the block, meaning the number of rows $r$ of the join $J$ such that $i_j$ is in the $j$-th column of $r$ for all $j \in Q$. For $i \in B$, let $T^{(i)}_j$ be the subset of rows $r$ in $T_j$ such that $r|_Q = i|_{C_j}$, and similarly define $\hat{T}^{(i)}_j$ (respectively $J^{(i)}$) to be the subset of rows $r$ in $\hat{T}_j$ (respectively $J$) such that $r|_Q = i|_{\hat{C}_j}$ (respectively $r|_Q = i$). For a row $r$ such that $r|_Q = i$ we say that $r$ belongs to the block $i \in B$. Let $s_{(i),j}$ denote the number of rows of $T^{(i)}_j$, so that $s_{(i)} = \prod_{j=1}^m s_{(i),j}$. As an example, consider the join $T_1(A, B) \bowtie T_2(B, C)$: we have one block for each distinct value of $B$ that is present in both $T_1$ and $T_2$, and for a given block $B = b$, the size of the block can be computed as the number of rows in $T_1$ with $B = b$, multiplied by the number of rows in $T_2$ with $B = b$.

[Figure 1: Example of two-table join blocks.]

[Figure 2: Examples of the tables $T^{(i)}_j$ in the same example.]

Using the above notion of blocks of a join, we can construct $J$ as a stacking of matrices $J^{(i)}$ for $i \in B$. For the case of two-table joins $J = T_1 \bowtie T_2$, we have
$$J^{(i)} = \hat{T}^{(i)}_1 \otimes \mathbf{1}_{s_{(i),2}} + \mathbf{1}_{s_{(i),1}} \otimes \hat{T}^{(i)}_2 \in \mathbb{R}^{s_{(i)} \times d}.$$
In other words, $J^{(i)}$ is the subset of rows of $J$ contained in block $i$. Observe that the entire join $J$ is the result of stacking the matrices $J^{(i)}$ on top of each other, for all $i \in B$. In other words, if $B = \{i_1, i_2, \ldots, i_{|B|}\}$, the join $J = T_1 \bowtie T_2$ is given by $J = [(J^{(i_1)})^T, (J^{(i_2)})^T, \ldots, (J^{(i_{|B|})})^T]^T$.

Figure 1 illustrates an example of blocks in a two-table join. In this example, column $f_2$ is the column that we are joining the two tables on, and there are two values for $f_2$ that are present in both tables, namely the values $\{1, 2\}$. Thus $B = \{B_1, B_2\}$, where $B_1 = 1$ and $B_2 = 2$. In other words, block $B_1$ is the block for value 1, and its size is $s_1 = 4$, and similarly $B_2$ has size $s_2 = 4$. Figure 1 illustrates how the join $J$ can be written as stacking together the block matrices $J^{(1)}$ and $J^{(2)}$. Figure 2 shows the tables $T^{(i)}_j$ for different values of $i$ and $j$ in the same example.

Finally, for any subset $U \subseteq [N]$, let $J_U$ denote the set of rows of $J$ belonging to $U$. If $L$ is a set of blocks of $J$, meaning $L \subseteq B(J)$, then let $J_L$ denote the set of rows of $J$ belonging to some block $i \in L$ (recall that a row $r$ belongs to a block $i \in L \subseteq B$ if $r|_Q = i$).
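The following small check (ours, on made-up tables, not the paper's code) verifies the block decomposition above: stacking $J^{(i)} = \hat{T}^{(i)}_1 \otimes \mathbf{1}_{s_{(i),2}} + \mathbf{1}_{s_{(i),1}} \otimes \hat{T}^{(i)}_2$ over the blocks reproduces the materialized two-table join.

```python
# A minimal verification of the block decomposition of J = T1 ⋈ T2 on column f2.
import numpy as np
import pandas as pd

T1 = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [1, 1, 2]})   # join column: f2
T2 = pd.DataFrame({"f2": [1, 1, 2], "f3": [1.0, 2.0, 3.0]})

J = T1.merge(T2, on="f2")[["f1", "f2", "f3"]].to_numpy(float)  # materialized join (toy scale only)

blocks = []
for key in sorted(set(T1["f2"]) & set(T2["f2"])):              # the blocks i in B
    A = T1[T1["f2"] == key][["f1", "f2"]].to_numpy(float)      # T1^(i)
    Bm = T2[T2["f2"] == key][["f3"]].to_numpy(float)           # T2^(i), non-shared columns
    s1, s2 = len(A), len(Bm)
    T1_hat = np.hstack([A, np.zeros((s1, 1))])                 # pad to columns (f1, f2, f3)
    T2_hat = np.hstack([np.zeros((s2, 2)), Bm])                # f2 is assigned to T1's side
    blocks.append(np.kron(T1_hat, np.ones((s2, 1))) + np.kron(np.ones((s1, 1)), T2_hat))

J_blocks = np.vstack(blocks)
row_sort = lambda M: M[np.lexsort(M.T[::-1])]                  # compare as multisets of rows
print(np.allclose(row_sort(J), row_sort(J_blocks)))            # True
```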
2.2. Linear Algebra

We use boldface font, e.g., $\mathbf{A}, \mathbf{B}, \mathbf{J}$, throughout to denote matrices. Given $A \in \mathbb{R}^{n \times d}$ with rank $r$, we write $A = U\Sigma V^T$ to denote the singular value decomposition (SVD) of $A$, where $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{d \times r}$, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the non-zero singular values of $A$. For $i \in [d]$, we write $\sigma_i(A)$ to denote the $i$-th (possibly zero-valued) singular value of $A$, so that $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_d(A)$. We also use $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ to denote the maximum and minimum singular values of $A$ respectively, and let $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$ denote the condition number of $A$. Let $A^+$ denote the Moore-Penrose pseudoinverse of $A$, namely $A^+ = V\Sigma^{-1}U^T$. Let $\|A\|_F = (\sum_{i,j} A_{i,j}^2)^{1/2}$ denote the Frobenius norm of $A$, and $\|A\|_2 = \sigma_{\max}(A)$ the spectral norm. We write $I_n \in \mathbb{R}^{n \times n}$ to denote the $n$-dimensional identity matrix. For a matrix $A \in \mathbb{R}^{n \times d}$, we write $\mathrm{nnz}(A)$ to denote the number of non-zero entries of $A$. We can assume that each row of the table $T_j$ is non-zero, since otherwise the row can be removed, and thus $\mathrm{nnz}(T_j) \ge n_j$. Given $A \in \mathbb{R}^{n \times d}$, we write $A_{i,*} \in \mathbb{R}^{1 \times d}$ to denote the $i$-th row (vector) of $A$, and $A_{*,i} \in \mathbb{R}^{n \times 1}$ to denote the $i$-th column (vector) of $A$. For values $a, b \in \mathbb{R}$ and $\epsilon > 0$, we write $a = (1 \pm \epsilon)b$ to denote the containment $(1 - \epsilon)b \le a \le (1 + \epsilon)b$. For $n \in \mathbb{Z}_+$, let $[n] = \{1, 2, \ldots, n\}$. Throughout, we will use $\tilde{O}(\cdot)$ notation to omit $\mathrm{poly}(\log N)$ factors.

Definition 3 (Statistical Dimension). For a matrix $A \in \mathbb{R}^{n \times d}$ and a non-negative scalar $\lambda$, the $\lambda$-statistical dimension is defined to be $d_\lambda = \sum_i \frac{\lambda_i(A^TA)}{\lambda_i(A^TA) + \lambda}$, where $\lambda_i(A^TA)$ is the $i$-th eigenvalue of $A^TA$.

Definition 4 (Subspace Embedding). For an $\epsilon \ge 0$, we say that $\tilde{A} \in \mathbb{R}^{m \times d}$ is an $\epsilon$-subspace embedding for $A \in \mathbb{R}^{n \times d}$ if for all $x \in \mathbb{R}^d$ we have $(1 - \epsilon)\|Ax\|_2 \le \|\tilde{A}x\|_2 \le (1 + \epsilon)\|Ax\|_2$.

Note that if $\tilde{A} \in \mathbb{R}^{m \times d}$ is an $\epsilon$-subspace embedding for $A \in \mathbb{R}^{n \times d}$, this in particular implies that $\sigma_i(\tilde{A}) = (1 \pm \epsilon)\sigma_i(A)$ for all $i \in [d]$.

Leverage Scores. The leverage score of the $i$-th row $A_{i,*}$ of $A \in \mathbb{R}^{n \times d}$ is defined to be $\tau_i(A) = A_{i,*}(A^TA)^+A_{i,*}^T$. Let $\tau(A) \in \mathbb{R}^n$ be the vector such that $(\tau(A))_i = \tau_i(A)$. Note that $\tau(A)$ is the diagonal of $A(A^TA)^+A^T$. Our algorithm will utilize generalized leverage scores. Given matrices $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{n_1 \times d}$, the generalized leverage scores of $A$ with respect to $B$ are defined as
$$\tau^B_i(A) = \begin{cases} A_{i,*}(B^TB)^+A_{i,*}^T & \text{if } A_{i,*} \perp \ker(B), \\ 1 & \text{otherwise,} \end{cases}$$
where $A_{i,*} \perp \ker(B)$ denotes that $A_{i,*}$ is orthogonal to the kernel of $B$. Note that for a matrix $B$ with SVD $B = U\Sigma V^T$, we have $(B^TB)^+ = V\Sigma^{-2}V^T$. Thus, for any $x \in \mathbb{R}^d$, we have $x^T(B^TB)^+x = \|x^TV\Sigma^{-1}\|_2^2$, and in particular $\tau^B_i(A) = \|A_{i,*}V\Sigma^{-1}\|_2^2$ if $A_{i,*} \perp \ker(B)$. The proof of the following can be found in the supplementary material.

Proposition 1. If $\tilde{B} \in \mathbb{R}^{n_1' \times d}$ is an $\epsilon$-subspace embedding for $B \in \mathbb{R}^{n_1 \times d}$, and $A \in \mathbb{R}^{n \times d}$ is any matrix, then $\tau^{\tilde{B}}_i(A) = (1 \pm O(\epsilon))\tau^B_i(A)$.

Our algorithm will employ a mixture of several known oblivious subspace embeddings as tools to construct our overall database join SE. The first result we will need is an improved variant of Tensor-Sketch, which is an SE that can be applied quickly to tensor products of matrices.

Lemma 5 (Fast Tensor-Sketch, Theorem 3 of (Ahle et al., 2020)). Fix any matrices $A_1, A_2, \ldots, A_m$, where $A_i \in \mathbb{R}^{n_i \times d_i}$, and fix $\epsilon > 0$ and $\lambda \ge 0$. Let $n = n_1 \cdots n_m$ and $d = d_1 \cdots d_m$. Let $A = A_1 \otimes A_2 \otimes \cdots \otimes A_m$ have statistical dimension $d_\lambda$. Then there is an oblivious randomized sketching algorithm which produces a matrix $S \in \mathbb{R}^{k \times n}$, where $k = O(d_\lambda m^4/\epsilon^2)$, such that with probability $1 - 1/\mathrm{poly}(n)$, we have for all $x \in \mathbb{R}^d$ that $\|SAx\|_2^2 + \lambda\|x\|_2^2 = (1 \pm \epsilon)(\|Ax\|_2^2 + \lambda\|x\|_2^2)$. Note that for the case of $\lambda = 0$, this implies that $SA$ is an $\epsilon$-subspace embedding for $A$. Moreover, $SA$ can be computed in time $O(\sum_{i=1}^m \mathrm{nnz}(A_i)\, m^5/\epsilon^2 + kdm)$ (see Footnote 5).

(Footnote 5: Theorem 3 of (Ahle et al., 2020) is written to be applied to the special case of the polynomial kernel, where $A_1 = A_2 = \cdots = A_m$. However, the algorithm itself does not use this fact, nor does it require the factors in the tensor product to be non-distinct.)

For the special case of $\lambda = 0$ in Lemma 5, the statistical dimension is $d$, and Tensor-Sketch is just a standard SE.
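For concreteness, here is a minimal implementation (ours) of the classic Tensor-Sketch construction of (Pagh, 2013; Pham & Pagh, 2013) — CountSketches of the factors combined by FFT — rather than the Fast Tensor-Sketch of Lemma 5; it sketches the rows of $A_1 \otimes A_2$ without ever forming the Kronecker product. Sizes and data are hypothetical.

```python
# Tensor-Sketch of A1 ⊗ A2 from CountSketches of A1 and A2 (circular convolution via FFT).
import numpy as np

rng = np.random.default_rng(7)

def countsketch(k, n):
    # k x n CountSketch: one random ±1 entry per column.
    S = np.zeros((k, n))
    S[rng.integers(0, k, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

n1, n2, d1, d2, k = 60, 60, 3, 3, 4096
A1, A2 = rng.standard_normal((n1, d1)), rng.standard_normal((n2, d2))

C1, C2 = countsketch(k, n1) @ A1, countsketch(k, n2) @ A2     # k x d1 and k x d2
F1, F2 = np.fft.fft(C1, axis=0), np.fft.fft(C2, axis=0)
SA = np.fft.ifft(np.einsum('ka,kb->kab', F1, F2), axis=0).real.reshape(k, d1 * d2)

# Compare against the explicit Kronecker product (feasible only at toy scale).
K = np.kron(A1, A2)                                           # (n1*n2) x (d1*d2)
for _ in range(3):
    x = rng.standard_normal(d1 * d2)
    print("||S(A1⊗A2)x|| / ||(A1⊗A2)x|| =", np.linalg.norm(SA @ x) / np.linalg.norm(K @ x))
```

The key point is that the cost depends on $\mathrm{nnz}(A_1) + \mathrm{nnz}(A_2)$ and $k$, never on the $n_1 n_2$ rows of the Kronecker product.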
Lemma 6 (OSNAP Transform (Nelson & Nguyên, 2013)). Given any $A \in \mathbb{R}^{N \times d}$, there is a randomized oblivious sketching algorithm that produces a matrix $W \in \mathbb{R}^{t \times N}$ with $t = O(d/\epsilon^2)$, such that $WA$ can be computed in time $O(\mathrm{nnz}(A)/\epsilon^2)$, and such that $WA$ is an $\epsilon$-subspace embedding for $A$ with probability at least 99/100. Moreover, each column of $W$ has at most $O(1)$ non-zero entries.

Lemma 7 (Count-Sketch (Clarkson & Woodruff, 2013)). For any fixed matrix $A \in \mathbb{R}^{n \times d}$ and any $\epsilon > 0$, there exists an algorithm which produces a matrix $S \in \mathbb{R}^{k \times n}$, where $k = O(d^2/\epsilon^2)$, such that $SA$ is an $\epsilon$-subspace embedding for $A$ with probability at least 99/100. Moreover, each column of $S$ contains exactly one non-zero entry, and therefore $SA$ can be computed in $O(\mathrm{nnz}(A))$ time.

3. Subspace Embeddings for Two-Table Database Joins

In this section, we describe our algorithms for fast computation of in-database subspace embeddings for joins of two tables $J = T_1 \bowtie T_2$, where $T_i \in \mathbb{R}^{n_i \times d_i}$ and $J \in \mathbb{R}^{N \times d}$. As a consequence of our subspace embeddings, we obtain an input-sparsity time algorithm for machine-precision in-database regression. Here, machine precision refers to a convergence rate of $\log(1/\epsilon)$ to the optimal solution.

Our subspace embedding algorithm can be run with two separate hyper-parameterizations: one we refer to as the dense case, where the tables $T_1, T_2$ have many non-zero entries, and the other as the sparse case, where we exploit the sparsity of the tables $T_1, T_2$. In the former, we obtain $\tilde{O}(\frac{1}{\epsilon^2}((n_1 + n_2)d + d^3))$ runtime for constructing our subspace embedding, and in the latter we obtain $\tilde{O}(\frac{1}{\epsilon^2}(\mathrm{nnz}(T_1) + \mathrm{nnz}(T_2) + d^5))$ time. Thus, when the matrices $T_1, T_2$ are dense, we have $(n_1 + n_2)d = \Theta(\mathrm{nnz}(T_1) + \mathrm{nnz}(T_2))$, in which case the former algorithm has a strictly better runtime dependence on $n_1, n_2$, and $d$. However, for the many applications where $T_1, T_2$ are sparse, the latter algorithm will yield substantial improvements in runtime. By first reading off the sparsity of $T_1, T_2$ and choosing the hyperparameters which minimize the runtime, the final runtime of the algorithm is the minimum of the two aforementioned runtimes.

Our main subspace embedding is given in Algorithm 1. The full proofs are deferred to the supplementary material. As noted in Section 2.1, we can describe the join $J$ as the result of stacking several blocks $J^{(i)}$, where the rows of $J^{(i)}$ consist of all pairwise concatenations of a row of $T^{(i)}_1$ and a row of $T^{(i)}_2$, where the $T^{(i)}_j$'s partition $T_j$. We deal separately with blocks $i$ for which $J^{(i)}$ contains a very large number of rows, and with smaller blocks. Formally, we split the set of blocks $B(J)$ into $B_{\mathrm{big}}$ and $B_{\mathrm{small}}$. For each block $J^{(i)}$ from $B_{\mathrm{big}}$, we apply a Fast Tensor-Sketch transform to obtain a subspace embedding for that block. For the smaller blocks, however, we need a much more involved routine. Our algorithm computes a random sample of the rows of the blocks $J^{(i)}$ from $B_{\mathrm{small}}$, denoted $\tilde{J}_{\mathrm{small}}$. Using the results of (Cohen et al., 2015), it follows that sampling sufficiently many rows from the distribution induced by the generalized leverage scores $\tau^{\tilde{J}_{\mathrm{small}}}_i(J_{\mathrm{small}})$ of $J_{\mathrm{small}}$ with respect to $\tilde{J}_{\mathrm{small}}$ yields a subspace embedding of $J_{\mathrm{small}}$.
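This sampling primitive can be illustrated directly (ours, with generic dense matrices standing in for $J_{\mathrm{small}}$ and its uniform sample; not the paper's code): compute $\tau^B_i(A) = \|A_{i,*}V\Sigma^{-1}\|_2^2$ from the SVD of $B$ and keep each row independently with probability proportional to its score. The oversampling constant `c` below stands in for the usual $\Theta(\log d/\epsilon^2)$ factor; Algorithm 1 additionally compresses $V\Sigma^{-1}$ with a Gaussian matrix $G$ and never enumerates the rows of $J_{\mathrm{small}}$ explicitly.

```python
# A minimal sketch of generalized-leverage-score sampling.
import numpy as np

rng = np.random.default_rng(1)
n, n1, d = 5000, 800, 8
A = rng.standard_normal((n, d))                     # stands in for J_small
B = A[rng.choice(n, size=n1, replace=False)]        # stands in for the uniform sample of its rows

U, s, Vt = np.linalg.svd(B, full_matrices=False)
Y = Vt.T @ np.diag(1.0 / s)                         # V Sigma^{-1}
tau = np.sum((A @ Y) ** 2, axis=1)                  # generalized leverage scores (rows here lie in rowspan(B))

# Keep each row independently with probability p_i = min(1, c * tau_i); rescale by 1/sqrt(p_i).
c = 20.0
p = np.minimum(1.0, c * tau)
keep = rng.random(n) < p
A_sampled = A[keep] / np.sqrt(p[keep])[:, None]
print("kept", int(keep.sum()), "of", n, "rows")
print("relative spectral error:",
      np.linalg.norm(A_sampled.T @ A_sampled - A.T @ A, 2) / np.linalg.norm(A.T @ A, 2))
```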
Algorithm 1: Subspace embedding for the join $J = T_1 \bowtie T_2$.
1: In the dense case, set $\gamma = 1$; in the sparse case, set $\gamma = d$. Compute the block sizes $s_{(i)}$, and let $B_{\mathrm{big}} = \{i \in B \mid \max_j s_{(i),j} \ge d\gamma\}$. Set $B_{\mathrm{small}} = B \setminus B_{\mathrm{big}}$, and let $n_{\mathrm{small}}$ denote the number of rows of $J_{B_{\mathrm{small}}}$.
2: For each $i \in B_{\mathrm{big}}$, generate a Fast Tensor-Sketch matrix $S_i \in \mathbb{R}^{t' \times s_{(i)}}$ (Lemma 5) and compute $S_i J^{(i)}$.
3: Let $J_{\mathrm{big}}$ be the matrix obtained by stacking the matrices $\{S_i J^{(i)}\}_{i \in B_{\mathrm{big}}}$. Generate a Count-Sketch matrix $S'$ (Lemma 7) and compute $S' J_{\mathrm{big}}$.
4: Let $J_{\mathrm{small}} = J_{B_{\mathrm{small}}}$, sample uniformly a subset $U$ of $m = \Theta((n_1 + n_2)/\gamma)$ rows of $J_{\mathrm{small}} \in \mathbb{R}^{n_{\mathrm{small}} \times d}$, and form the matrix $\tilde{J}_{\mathrm{small}} = (J_{\mathrm{small}})_U \in \mathbb{R}^{m \times d}$.
5: Generate an OSNAP transform $W$ (Lemma 6), compute $W\tilde{J}_{\mathrm{small}}$ and the SVD $W\tilde{J}_{\mathrm{small}} = U\Sigma V^T$.
6: Generate a Gaussian matrix $G \in \mathbb{R}^{d \times t}$ with entries drawn i.i.d. from $\mathcal{N}(0, 1/t^2)$, where $t = \Theta(\log N)$, and a Gaussian vector $g \sim \mathcal{N}(0, I_d)$.
7: For each row $i$ of $J_{\mathrm{small}}$, set $\alpha_i = \|(J_{\mathrm{small}})_{i,*}(I_d - VV^T)g\|_2^2$, and set $\tilde{\tau}_i = 1$ if $\alpha_i > 0$, and $\tilde{\tau}_i = \|(J_{\mathrm{small}})_{i,*}V\Sigma^{-1}G\|_2^2$ otherwise.
8: Using Algorithms 2 and 3 with $Y = V\Sigma^{-1}G$, construct a diagonal row sampling matrix $S'' \in \mathbb{R}^{n_{\mathrm{small}} \times n_{\mathrm{small}}}$ such that $S''_{i,i} = 1/\sqrt{p_i}$ with probability $p_i$ and $S''_{i,i} = 0$ otherwise, where $p_i \ge \min\{1, \Theta(\epsilon^{-2}\log d)\cdot\tilde{\tau}_i\}$.
9: Return $\tilde{J}$, where $\tilde{J}$ is the result of stacking $S'J_{\mathrm{big}}$ with $S''J_{\mathrm{small}}$.

However, it is not possible to write down (let alone compute) all the values $\tau^{\tilde{J}_{\mathrm{small}}}_i(J_{\mathrm{small}})$, since there can be more rows in $J_{\mathrm{small}}$ than our entire allowable running time. To handle this issue, we first note that by Proposition 1 and the discussion prior to it, the value $\tau^{\tilde{J}_{\mathrm{small}}}_i(J_{\mathrm{small}})$ is well-approximated by $\|(J_{\mathrm{small}})_{i,*}V\Sigma^{-1}\|_2^2$, which in turn is well-approximated by $\|(J_{\mathrm{small}})_{i,*}V\Sigma^{-1}G\|_2^2$ if $G$ is a Gaussian matrix with only a small $\Theta(\log N)$ number of columns. Thus, sampling from the generalized leverage scores $\tau^{\tilde{J}_{\mathrm{small}}}_i(J_{\mathrm{small}})$ can be approximately reduced to the problem of sampling a row $i$ from $J_{\mathrm{small}}Y$ with probability proportional to $\|(J_{\mathrm{small}})_{i,*}Y\|_2^2$, where $Y$ is any matrix given as input. We then design a fast algorithm which accomplishes precisely this task: namely, for any join $J$ and input matrix $Y$ with a small number of columns, it samples rows from $JY$ with probability proportional to the squared row norms of $JY$. Since $J_{\mathrm{small}} = J_{B_{\mathrm{small}}}$ is itself a database join, this is the desired sampler. This procedure is given in Algorithm 2 (pre-processing step) and Algorithm 3 (sampling step). We can apply this sampling primitive to efficiently sample from the generalized leverage scores in time substantially less than that needed to construct $J_{\mathrm{small}}$, which ultimately allows for our final subspace embedding guarantee of Theorem 1.

Algorithm 2: Pre-processing step for fast $\ell_2$-sampling of rows of $JY$, given input $J = T_1 \bowtie T_2$ and $Y \in \mathbb{R}^{d \times r}$.
1: For each block $i \in B$, compute $a^{(i)} = \hat{T}^{(i)}_1 Y$ and $b^{(i)} = \hat{T}^{(i)}_2 Y$.
2: For each block $i \in B$, construct binary trees $\tau^{(i)}(T_1), \tau^{(i)}(T_2)$ as follows:
3: Each node $v \in \tau^{(i)}(T_j)$ is a vector $v \in \mathbb{R}^{3r}$.
4: If $v$ is not a leaf, then $v = v_{\mathrm{lchild}} + v_{\mathrm{rchild}}$, with $v_{\mathrm{lchild}}, v_{\mathrm{rchild}}$ the left and right children of $v$.
5: The leaves of $\tau^{(i)}(T_j)$ are given by the set $\{v^{(i),j}_l = (v^{(i),j}_{l,1}, v^{(i),j}_{l,2}, \ldots, v^{(i),j}_{l,r}) \in \mathbb{R}^{3r} \mid l \in [s_{(i),j}]\}$, where $v^{(i),1}_{l,q} = (1,\ 2(a^{(i)})_{l,q},\ (a^{(i)})_{l,q}^2) \in \mathbb{R}^3$ and $v^{(i),2}_{l,q} = ((b^{(i)})_{l,q}^2,\ (b^{(i)})_{l,q},\ 1) \in \mathbb{R}^3$.
6: Compute the values $\mathrm{root}(\tau^{(i)}(T_1))$, $\mathrm{root}(\tau^{(i)}(T_2))$ for all $i \in B$, and also compute the sum $\sum_{i \in B} \langle \mathrm{root}(\tau^{(i)}(T_1)), \mathrm{root}(\tau^{(i)}(T_2))\rangle$.

Algorithm 3: Sampling step for fast $\ell_2$-sampling of rows of $JY$.
1: Sample a block $i \in B$ with probability proportional to $\langle \mathrm{root}(\tau^{(i)}(T_1)), \mathrm{root}(\tau^{(i)}(T_2))\rangle$.
2: Sample $l_1 \in [s_{(i),1}]$ with probability proportional to $\langle v^{(i),1}_{l_1}, \mathrm{root}(\tau^{(i)}(T_2))\rangle$.
3: Sample $l_2 \in [s_{(i),2}]$ with probability proportional to $\langle v^{(i),1}_{l_1}, v^{(i),2}_{l_2}\rangle$.
4: Return the row corresponding to $(l_1, l_2)$ in block $i$.
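The identity driving Algorithms 2 and 3 is $\langle v^{(i),1}_{l_1}, v^{(i),2}_{l_2}\rangle = \|a^{(i)}_{l_1,*} + b^{(i)}_{l_2,*}\|_2^2$, i.e., the squared norm of row $(l_1, l_2)$ of $J^{(i)}Y$. The sketch below (ours, on toy blocks, not the paper's code) implements the same two-stage sampling distribution using flat sums in place of the binary trees, so each sample costs time linear rather than logarithmic in the block side lengths.

```python
# l2-sampling of rows of J^(i) Y from the 3r-dimensional summary vectors only.
import numpy as np

rng = np.random.default_rng(2)

def summaries(a, b):
    # v1[l] = concat over q of (1, 2*a[l,q], a[l,q]^2); v2[l] = concat of (b[l,q]^2, b[l,q], 1).
    v1 = np.stack([np.concatenate([[1.0, 2 * aq, aq ** 2] for aq in row]) for row in a])
    v2 = np.stack([np.concatenate([[bq ** 2, bq, 1.0] for bq in row]) for row in b])
    return v1, v2

# Two hypothetical blocks; a = T1hat^(i) Y, b = T2hat^(i) Y with r = 2 columns.
blocks = [(rng.standard_normal((3, 2)), rng.standard_normal((4, 2))),
          (rng.standard_normal((2, 2)), rng.standard_normal((5, 2)))]

def sample(blocks):
    vs = [summaries(a, b) for a, b in blocks]
    weights = [v1.sum(0) @ v2.sum(0) for v1, v2 in vs]   # <root(tree of T1), root(tree of T2)>
    i = rng.choice(len(blocks), p=np.array(weights) / sum(weights))
    v1, v2 = vs[i]
    w1 = v1 @ v2.sum(0)                                  # <v1_{l1}, root(tree of T2)>
    l1 = rng.choice(len(w1), p=w1 / w1.sum())
    w2 = v2 @ v1[l1]                                     # <v1_{l1}, v2_{l2}> = ||a_{l1} + b_{l2}||^2
    l2 = rng.choice(len(w2), p=w2 / w2.sum())
    return i, l1, l2

# Sanity check of the inner-product identity within a block.
a, b = blocks[0]
v1, v2 = summaries(a, b)
assert np.allclose(v1 @ v2.T, ((a[:, None, :] + b[None, :, :]) ** 2).sum(-1))
print(sample(blocks))
```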
Finally, to obtain our input-sparsity runtime machine-precision regression algorithm, we apply our subspace embedding with constant $\epsilon$ to precondition the join $J$, after which the regression problem can be solved quickly via gradient descent. While a general gradient step is not always possible to compute efficiently with respect to the join $J$, we demonstrate that when the products used in the gradient step arise from vectors in the column span of $J$, the updates can be computed efficiently, which yields our main regression result (Theorem 2). The full proofs are deferred to the supplementary material.

4. Evaluation

We study the performance of our sketching method on several real datasets, both for two-table joins and general joins (see Footnote 6). We first introduce the datasets we use in the experiments. We consider two datasets: LastFM (Cantador et al., 2011) and MovieLens (Harper & Konstan, 2015). Both of them contain several relational tables. We will compare our algorithm with the FAQ-based algorithm on the joins of some relations.

(Footnote 6: Code available at https://github.com/AnonymousFireman/ICML_code.)

The LastFM dataset has three relations: Userfriends (the friend relations between users), Userartists (the artists listened to by each user), and Usertaggedartiststimestamps (the tag assignments of artists provided by each particular user, along with the timestamps). The MovieLens dataset also has three relations: Ratings (the ratings of movies given by the users and the timestamps), Users (gender, age, occupation, and zip code information of each user), and Movies (release year and the category of each movie).

4.1. Two-Table Joins

In the experiments for two-table joins, we solve the regression problem $\min_x \|J_U x - b\|_2^2$, where $J = T_1 \bowtie T_2 \in \mathbb{R}^{N \times d}$ is a join of two tables, $U \subseteq [d]$, and $b$ is one of the columns of $J$. In our experiments, suppose column $p$ is the column we want to predict. We set $U = [d] \setminus \{p\}$ and $b$ to be the $p$-th column of $J$. To solve the regression problem, the FAQ-based algorithm computes the covariance matrix $J^TJ$ by running the FAQ algorithm for every two columns, and then solves the normal equations $J^TJx = J^Tb$. Our algorithm computes a subspace embedding $\tilde{J}$ of $J$, and then solves the sketched regression problem $\min_x \|\tilde{J}_U x - \tilde{b}\|_2^2$, i.e., solves the corresponding normal equations $\tilde{J}_U^T\tilde{J}_U x = \tilde{J}_U^T\tilde{b}$.

We compare our algorithm to the FAQ-based algorithm on the LastFM and MovieLens datasets. The baseline employs the FAQ-based algorithm to calculate each entry in $J^TJ$. For the LastFM dataset, we consider the join of Userartists and Usertaggedartiststimestamps: $J_1 = \mathrm{UA} \bowtie_{\mathrm{UA.user}=\mathrm{UTA.user}} \mathrm{UTA}$. Our regression task is to predict how often a user listens to an artist based on the tags. For the MovieLens dataset, we consider the join of Ratings and Movies: $J_2 = \mathrm{R} \bowtie_{\mathrm{R.movie}=\mathrm{M.movie}} \mathrm{M}$. Our regression task is to predict the rating that a user gives to a movie. In our experiments, we perform the dataset preparation mentioned in (Schleich et al., 2016) to normalize the values in each column to the range $[0, 1]$: for each column, let $v_{\max}$ and $v_{\min}$ denote the maximum and minimum values in this column; we normalize each value $v$ to $(v - v_{\min})/(v_{\max} - v_{\min})$.
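The experiments above solve the sketched normal equations directly. To reach machine precision as in Theorem 2, a constant-factor subspace embedding is instead used as a preconditioner; the following is a minimal illustration (ours, with a dense Gaussian sketch and a generic ill-conditioned matrix, not the paper's join-aware update rule).

```python
# Machine-precision least squares via sketch-based preconditioning.
import numpy as np

rng = np.random.default_rng(3)
N, d, k = 5000, 20, 1000
A = rng.standard_normal((N, d)) * np.logspace(0, 4, d)   # ill-conditioned stand-in for J_U
b = rng.standard_normal(N)

S = rng.standard_normal((k, N)) / np.sqrt(k)              # stand-in for a constant-eps join sketch
_, R = np.linalg.qr(S @ A)                                 # SA = QR; R preconditions A

x = np.zeros(d)
for it in range(41):
    g = A.T @ (A @ x - b)                                  # gradient of 0.5 * ||Ax - b||^2
    x -= np.linalg.solve(R, np.linalg.solve(R.T, g))       # preconditioned step: x -= R^{-1} R^{-T} g
    if it % 8 == 0:
        print(f"iter {it:2d}   ||A^T(Ax - b)|| = {np.linalg.norm(A.T @ (A @ x - b)):.3e}")
```

Because $SA = QR$ makes $AR^{-1}$ nearly orthonormal, the gradient norm contracts by a constant factor per iteration, which is the $\log(1/\epsilon)$ convergence referred to as machine precision.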
4.2. General Joins

For general joins, we consider the ridge regression problem. Specifically, our goal is to find a vector $x$ that minimizes $\|Jx - b\|_2^2 + \lambda\|x\|_2^2$, where $J = T_1 \bowtie \cdots \bowtie T_m \in \mathbb{R}^{N \times d}$ is an arbitrary join, $b$ is one of the columns of $J$, and $\lambda > 0$ is the regularization parameter. The optimal solution to the ridge regression problem can be found by solving the normal equations $(J^TJ + \lambda I_d)x = J^Tb$. The FAQ-based algorithm is the same as in the experiment for two-table joins: it directly runs the FAQ algorithm a total of $d(d+1)/2$ times to compute every entry of $J^TJ$. We run our algorithm, discussed in the supplementary material, and the FAQ-based algorithm on the MovieLens-25M dataset, which is the largest of the MovieLens (Harper & Konstan, 2015) datasets. We consider the join of Ratings, Users, and Movies: $J_3 = \mathrm{R} \bowtie_{\mathrm{R.user}=\mathrm{U.user}} \mathrm{U} \bowtie_{\mathrm{U.movie}=\mathrm{M.movie}} \mathrm{M}$. Our regression task is to predict the rating that a user gives to a movie.

[Figure 3: Experimental results for general joins. (a) MSE vs. $\lambda$ for the FAQ-based algorithm; (b) MSE vs. $\lambda$ for our algorithm; (c) relative error vs. $\lambda$.]

Table 1: Experimental Results for Two-Table Joins

          n_1        n_2       d     T_bf (s)   T_ours (s)   err
    J_1   92834      186479    6     .034       .011         0.70%
    J_2   1000209    3883      23    .820       .088         0.66%

4.3. Results

We run the FAQ-based algorithm and our algorithm on these joins and compare their running times. To measure accuracy in the experiments for two-table joins, we compute the relative mean-squared error, given by
$$\mathrm{err} = \frac{\|J_U x_{\mathrm{ours}} - b\|_2^2 - \|J_U x_{\mathrm{bf}} - b\|_2^2}{\|J_U x_{\mathrm{bf}} - b\|_2^2},$$
where $x_{\mathrm{bf}}$ is the solution given by the FAQ-based algorithm and $x_{\mathrm{ours}}$ is the solution given by our algorithm. All results (runtime, accuracy) are averaged over 5 runs of each algorithm. In our implementation, we adjust the target dimension in our sketching algorithm for each experiment, as in practice it appears unnecessary to parameterize according to the worst-case theoretical bounds when the number of features is small, as in our experiments. Additionally, for two-table joins, we replace the Fast Tensor-Sketch with Tensor-Sketch (Ahle et al., 2020; Pagh, 2013) for the same reason. The implementation is written in MATLAB and run on an Intel Core i7-7500U CPU with 8GB of memory. We let $T_{\mathrm{bf}}$ be the running time of the FAQ-based algorithm and $T_{\mathrm{ours}}$ be the running time of our approach, measured in seconds.

Table 1 shows the results of our experiments for two-table joins. From these we can see that our approach gives a solution with relative error less than 1%, and its running time is significantly less than that of the FAQ-based algorithm.

For general joins, due to the size of the dataset, we implement our algorithm in Taichi (Hu et al., 2019; 2020) and run it on an Nvidia GTX 1080 Ti GPU. We split the dataset into a training set and a validation set, run the regression on the training set, and measure the MSE (mean squared error) on the validation set. We fix the target dimension and try different values of $\lambda$ to see which value achieves the best MSE. Our algorithm runs in 0.303s, while the FAQ-based algorithm runs in 0.988s. The relative error of the MSE (namely, $(\mathrm{MSE}_{\mathrm{ours}} - \mathrm{MSE}_{\mathrm{bf}})/\mathrm{MSE}_{\mathrm{bf}}$, both measured under the optimal $\lambda$) is only 0.28%. We plot the MSE vs. $\lambda$ curves for the FAQ-based algorithm and our algorithm in Figures 3(a) and 3(b), respectively. We observe that the optimal choice of $\lambda$ is much larger in the sketched problem than in the original problem. This is because the statistical dimension $d_\lambda$ decreases as $\lambda$ increases; since we fix the target dimension, $\epsilon$ thus decreases. So a larger $\lambda$ can give a better approximate solution, yielding a better MSE even if it is not the best choice in the unsketched problem.
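This effect can be reproduced in a few lines (ours, on synthetic data with a decaying spectrum standing in for a one-hot-encoded join; not the paper's experiment): as $\lambda$ grows, the statistical dimension $d_\lambda$ shrinks, and a sketch of fixed size approximates the ridge objective increasingly well.

```python
# Statistical dimension d_lambda and sketch-and-solve ridge regression with a fixed sketch size.
import numpy as np

rng = np.random.default_rng(4)
n, D = 4000, 200
A = rng.standard_normal((n, D)) * (np.arange(1, D + 1) ** -1.0)   # fast-decaying column scales
b = A @ rng.standard_normal(D) + 0.1 * rng.standard_normal(n)

eigs = np.linalg.eigvalsh(A.T @ A)
k = 100                                                           # fixed sketch target dimension
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    d_lam = float(np.sum(eigs / (eigs + lam)))                    # statistical dimension
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    SA, Sb = S @ A, S @ b
    x_sk = np.linalg.solve(SA.T @ SA + lam * np.eye(D), SA.T @ Sb)
    x_ex = np.linalg.solve(A.T @ A + lam * np.eye(D), A.T @ b)
    cost = lambda x: np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x) ** 2
    rel = (cost(x_sk) - cost(x_ex)) / cost(x_ex)
    print(f"lambda={lam:8.3f}   d_lambda={d_lam:7.2f}   relative cost error={rel:.4f}")
```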
We also plot the relative error of the objective function in Figure 3(c). For ridge regression, the relative error becomes
$$\frac{\big(\|Jx_{\mathrm{ours}} - b\|_2^2 + \lambda\|x_{\mathrm{ours}}\|_2^2\big) - \big(\|Jx_{\mathrm{bf}} - b\|_2^2 + \lambda\|x_{\mathrm{bf}}\|_2^2\big)}{\|Jx_{\mathrm{bf}} - b\|_2^2 + \lambda\|x_{\mathrm{bf}}\|_2^2}.$$
We can see that the relative error decreases as $\lambda$ increases, in accordance with our theoretical analysis.

5. Conclusion

In this work, we demonstrate that subspace embeddings for database join queries can be computed in time substantially faster than forming the join, yielding input sparsity time algorithms for regression on joins of two tables up to machine precision, and we extend our results to ridge regression on arbitrary joins. Our results improve on the state-of-the-art FAQ-based algorithms for performing in-database regression on joins. Empirically, our algorithms are substantially faster than the state-of-the-art algorithms for this problem.

Acknowledgments: The authors would like to acknowledge support from the National Science Foundation (NSF) grant No. CCF-1815840, the Office of Naval Research (ONR) grant N00014-18-1-256, and a Simons Investigator Award.

References

https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. https://www.relational.ai/. Abo Khamis, M., Ngo, H. Q., and Rudra, A. Faq: questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 13-28. ACM, 2016. Abo Khamis, M., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. In-database learning with sparse tensors. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS 18, pp. 325-340, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-4706-8. doi: 10.1145/3196959.3196960. URL http://doi.acm.org/10.1145/3196959.3196960. Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A., Woodruff, D. P., and Zandieh, A. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 141-160. SIAM, 2020. Aji, S. M. and McEliece, R. J. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325-343, 2000. Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci., 64(3):719-747, 2002. doi: 10.1006/jcss.2001.1813. URL https://doi.org/10.1006/jcss.2001.1813. Avron, H., Sindhwani, V., and Woodruff, D. P. Sketching structured matrices for faster nonlinear regression. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 2994-3002, 2013. Avron, H., Nguyen, H., and Woodruff, D. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems, pp. 2258-2266, 2014. Bakshi, A. and Woodruff, D. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems, pp. 3782-3792, 2018. Bakshi, A., Chepurko, N., and Woodruff, D. P. Robust and sample optimal algorithms for psd low-rank approximation. arXiv, abs/1912.04177, 2019. Cantador, I., Brusilovsky, P., and Kuflik, T.
2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, Rec Sys 2011, New York, NY, USA, 2011. ACM. Cheng, Z. and Koudas, N. Nonlinear models over normalized data. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1574 1577. IEEE, 2019. Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC 13, pp. 81 90, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450320290. doi: 10.1145/ 2488608.2488620. URL https://doi.org/10. 1145/2488608.2488620. Cohen, M. B., Lee, Y. T., Musco, C., Musco, C., Peng, R., and Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 181 190, 2015. Dechter, R. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, 1996. Diao, H., Song, Z., Sun, W., and Woodruff, D. P. Sketching for kronecker product regression and p-splines. Co RR, abs/1712.09473, 2017. URL http://arxiv.org/ abs/1712.09473. Diao, H., Jayaram, R., Song, Z., Sun, W., and Woodruff, D. P. Optimal sketching for kronecker product regression and low rank approximation, 2019. In-Database Regression in Input Sparsity Time Elgamal, T., Luo, S., Boehm, M., Evfimievski, A. V., Tatikonda, S., Reinwald, B., and Sen, P. Spoof: Sumproduct optimization and operator fusion for large-scale machine learning. In CIDR, 2017. Grohe, M. and Marx, D. Constraint solving via fractional edge covers. In SODA, pp. 289 298, 2006. Gucht, D. V., Williams, R., Woodruff, D. P., and Zhang, Q. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 199 212, 2015. Harper, F. M. and Konstan, J. A. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), December 2015. ISSN 2160-6455. doi: 10.1145/2827872. URL https://doi.org/10. 1145/2827872. Hellerstein, J., R e, C., Schoppmann, F., Wang, D. Z., Fratkin, E., Gorajek, A., Ng, K. S., Welton, C., Feng, X., Li, K., et al. The madlib analytics library or mad skills, the sql. ar Xiv preprint ar Xiv:1208.4165, 2012. Hu, Y., Li, T.-M., Anderson, L., Ragan-Kelley, J., and Durand, F. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG), 38(6):201, 2019. Hu, Y., Anderson, L., Li, T.-M., Sun, Q., Carr, N., Ragan Kelley, J., and Durand, F. Difftaichi: Differentiable programming for physical simulation. ICLR, 2020. Indyk, P., Vakilian, A., Wagner, T., and Woodruff, D. Sample-optimal low-rank approximation of distance matrices. ar Xiv preprint ar Xiv:1906.00339, 2019. Khamis, M. A., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. Ac/dc: in-database learning thunderstruck. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, pp. 8. ACM, 2018. Kohlas, J. and Wilson, N. Semiring induced valuation algebras: Exact and approximate local computation algorithms. Artif. Intell., 172(11):1360 1399, 2008. Kumar, A., Jalal, M., Yan, B., Naughton, J., and Patel, J. M. 
Demonstration of santoku: optimizing machine learning over normalized data. Proceedings of the VLDB Endowment, 8(12):1864 1867, 2015a. Kumar, A., Naughton, J., and Patel, J. M. Learning generalized linear models over normalized data. In ACM SIGMOD International Conference on Management of Data, pp. 1969 1984, 2015b. Kumar, A., Naughton, J., Patel, J. M., and Zhu, X. To join or not to join?: Thinking twice about joins before feature selection. In International Conference on Management of Data, pp. 19 34, 2016. Li, S., Chen, L., and Kumar, A. Enabling and optimizing non-linear feature interactions in factorized linear algebra. In Proceedings of the 2019 International Conference on Management of Data, pp. 1571 1588, 2019. Musco, C. and Woodruff, D. P. Sublinear time low-rank approximation of positive semidefinite matrices. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 672 683. IEEE, 2017. Nelson, J. and Nguyˆen, H. L. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 ieee 54th annual symposium on foundations of computer science, pp. 117 126. IEEE, 2013. Pagh, R. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):1 17, 2013. Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 239 247, 2013. Rendle, S. Scaling factorization machines to relational data. In Proceedings of the VLDB Endowment, volume 6, pp. 337 348. VLDB Endowment, 2013. Schleich, M., Olteanu, D., and Ciucanu, R. Learning linear regression models over factorized joins. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD 16, pp. 3 18. ACM, 2016. ISBN 978-1-4503-3531-7. Shi, X. and Woodruff, D. P. Sublinear time numerical linear algebra for structured matrices. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4918 4925, 2019. doi: 10.1609/aaai.v33i01. 33014918. URL https://doi.org/10.1609/ aaai.v33i01.33014918. Woodruff, D. P. and Zandieh, A. Near input sparsity time kernel embeddings via adaptive sampling. In International Conference on Machine Learning (ICML), 2020. Woodruff, D. P. et al. Sketching as a tool for numerical linear algebra. Foundations and Trends R in Theoretical Computer Science, 10(1 2):1 157, 2014. Yang, K., Gao, Y., Liang, L., Yao, B., Wen, S., and Chen, G. Towards factorized svm with gaussian kernels over normalized data.