Dimensionality Reduction for Sum-of-Distances Metric

Zhili Feng 1   Praneeth Kacham 1   David P. Woodruff 1

1 Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Praneeth Kacham.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We give a dimensionality reduction procedure to approximate the sum of distances of a given set of $n$ points in $\mathbb{R}^d$ to any "shape" that lies in a $k$-dimensional subspace. Here, by shape we mean any set of points in $\mathbb{R}^d$. Our algorithm takes an input in the form of an $n \times d$ matrix $A$, where each row of $A$ denotes a data point, and outputs a subspace $P$ of dimension $O(k^3/\varepsilon^6)$ such that the projections of each of the $n$ points onto the subspace $P$ and the distances of each of the points to the subspace $P$ are sufficient to obtain a $(1 \pm \varepsilon)$-approximation to the sum of distances to any arbitrary shape that lies in a $k$-dimensional subspace of $\mathbb{R}^d$. These include important problems such as $k$-median, $k$-subspace approximation, and $(j, l)$ subspace clustering with $j \cdot l \le k$. Dimensionality reduction reduces the data storage requirement to $(n+d)k^3/\varepsilon^6$ from $\mathrm{nnz}(A)$. Here $\mathrm{nnz}(A)$ could potentially be as large as $nd$. Our algorithm runs in time $\mathrm{nnz}(A)/\varepsilon^2 + (n+d)\,\mathrm{poly}(k/\varepsilon)$, up to logarithmic factors. For dense matrices, where $\mathrm{nnz}(A) \approx nd$, we give a faster algorithm that runs in time $nd + (n+d)\,\mathrm{poly}(k/\varepsilon)$, up to logarithmic factors. Our dimensionality reduction algorithm can also be used to obtain $\mathrm{poly}(k/\varepsilon)$ size coresets for the $k$-median and $(k,1)$-subspace approximation problems in polynomial time.

1. Introduction

Machine learning models often require millions of high-dimensional data samples in order to train. For example, an image with moderate resolution can easily have more than a million pixels. It is crucial that we can decrease the size of the data to save on computational power. One way to decrease the size of the data is dimensionality reduction, where we project our data samples onto a low-dimensional subspace and perform the task on the low-dimensional points. Given a set of $n$ points $A = \{a_1, \ldots, a_n\}$ in $\mathbb{R}^d$, the projection of $A$ onto a subspace $P$ of $k$ dimensions needs only $k$ parameters for each point in the dataset. Thus the size of the data is proportional to $(n+d)k$, which can be much smaller than $nd$. Therefore, if there exists a subspace $P$ of dimension $k$, where $k$ is much smaller than $n$ and $d$, and for which the projections of $A$ onto the subspace $P$ alone are sufficient to perform a certain task on the dataset $A$, then we can achieve a significant reduction in the size of the data.

One very common task that requires dimensionality reduction is the shape-fitting problem. A problem instance is defined by a quadruple $(A, \mathcal{S}, \mathrm{dist}, f)$, where $A = \{a_1, \ldots, a_n\} \subseteq \mathbb{R}^d$ is a set of points, $\mathrm{dist} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\ge 0}$ is a metric which we will also refer to as the distance function, $\mathcal{S}$ is a collection of subsets of $\mathbb{R}^d$ which we call shapes, and $f : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ is a function. The task is to find a shape $S \in \mathcal{S}$ that minimizes $\sum_i f(\mathrm{dist}(a_i, S))$. The most common function $f$ used is $f(x) = x^2$, as it has a natural Frobenius norm interpretation for many tasks and has closed-form solutions for natural sets $\mathcal{S}$ of shapes. Recently, the function $f(x) = x$ has been considered as it is more robust to outliers than the function $f(x) = x^2$, meaning that it does not square the distance to an erroneous point, allowing the objective to fit more of the remaining (non-outlier) data points.
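To fix notation with a concrete computation, the following sketch evaluates the shape-fitting objective $\sum_i f(\mathrm{dist}(a_i, S))$ when the shape $S$ is given explicitly as a finite set of points (e.g., a candidate $k$-median center set). The function and variable names are ours; the snippet only illustrates the objective, not any algorithm from this paper.

```python
import numpy as np

def shape_fitting_cost(A, S, f=lambda x: x):
    """Evaluate sum_i f(dist(a_i, S)) for a shape S given as a finite set of points.

    A : (n, d) array, one data point per row.
    S : (m, d) array of points forming the shape (e.g., k candidate centers).
    f : outer function; f(x) = x gives the sum-of-distances (k-median style)
        objective, f(x) = x**2 the sum-of-squared-distances (k-means style) one.
    """
    # dist(a_i, S) = min over s in S of ||a_i - s||_2
    dists = np.min(np.linalg.norm(A[:, None, :] - S[None, :, :], axis=2), axis=1)
    return np.sum(f(dists))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
S = rng.normal(size=(5, 20))                               # 5 hypothetical centers
cost_l1 = shape_fitting_cost(A, S)                         # f(x) = x
cost_l2 = shape_fitting_cost(A, S, f=lambda x: x ** 2)     # f(x) = x^2
```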
The most common dimensionality reduction techniques include Principal Component Analysis (PCA) and the Johnson-Lindenstrauss (JL) transform. PCA projects the original dataset onto a low-dimensional subspace for which the data variance is the largest. On the other hand, the JL transform provides a data-oblivious dimensionality reduction that preserves pairwise distances between points in the dataset. Feldman et al. (2013) show that if $P$ is the subspace spanned by the top $O(k/\varepsilon^2)$ singular vectors of the data matrix $A$, which is given by PCA, then for any shape $S$ that lies in a $k$-dimensional space, the quantity $\sum_i \min_{s \in S}\|a_i - s\|_2^2$ can be approximated by $\sum_i \min_{s \in S}\|P_P a_i - s\|_2^2 + \sum_i \|a_i - P_P a_i\|_2^2$, where $P_P a_i$ denotes the Euclidean projection of $a_i$ onto the subspace $P$, thereby giving a dimensionality reduction technique for the shape-fitting problem instantiated with $f(x) = x^2$, Euclidean distance function $\mathrm{dist}(x, y) = \|x - y\|_2$, and $\mathcal{S}$ being the collection of all $k$-dimensional shapes.
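To make the decomposition of Feldman et al. (2013) described above concrete, the sketch below computes both sides for the special case of a finite center set and a top singular value subspace. The target dimension, data, and names are illustrative choices of ours, and how close the two quantities end up depends on the chosen dimension, as in their result.

```python
import numpy as np

def pca_subspace(A, r):
    """d x r orthonormal basis for the top-r right singular subspace of A (PCA directions)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:r].T

def sum_sq_dist_to_centers(A, C):
    """sum_i min_{c in C} ||a_i - c||_2^2 for a finite center set C (rows)."""
    d2 = np.min(((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    return d2.sum()

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 50))
C = rng.normal(size=(5, 50))              # a shape S given by 5 centers

B = pca_subspace(A, r=20)                 # basis of the subspace P
A_proj = A @ B @ B.T                      # projections P_P a_i
residual = ((A - A_proj) ** 2).sum()      # sum_i ||a_i - P_P a_i||_2^2

exact  = sum_sq_dist_to_centers(A, C)                  # sum_i min_s ||a_i - s||_2^2
approx = sum_sq_dist_to_centers(A_proj, C) + residual  # decomposed approximation
```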
In this work, we concentrate on shape-fitting problems with $\mathrm{dist}(x, y) = \|x - y\|_2$ and $f(x) = x$. Unfortunately, neither PCA nor the JL transform is known to work in this case. We give fast algorithms to find a subspace $P$ of $\tilde{O}(k^3/\varepsilon^6)$ dimensions (we use $\tilde{O}(f(n))$ to denote $O(f(n)\,\mathrm{polylog}(f(n)))$) that allows us to compute a $(1 \pm \varepsilon)$-approximation to $\sum_i \mathrm{dist}(a_i, S)$ for any shape $S$ that lies in a $k$-dimensional subspace. Examples of such shapes include all $k$-dimensional subspaces themselves, which corresponds to the subspace approximation problem, as well as all sets of $k$ points, which corresponds to the $k$-median problem. Our results also apply to the $(j, l)$-projective clustering problem, with $j \cdot l \le k$, where we seek to find $j$ subspaces, each of dimension at most $l$, so as to minimize the sum of distances of each input point to its nearest subspace among the $j$ that we have chosen. We also show empirically that we need fewer dimensions than our theoretical analysis predicts to obtain good approximations.

A coreset is another type of data structure to reduce the size of a dataset $A$. Namely, a coreset $P$ is a data structure consuming a much smaller amount of memory than $A$, which can be used as a substitute for $A$ for any query $Y$ on $A$. For example, in the $k$-median problem, the query $Y = \{y_1, \ldots, y_k\}$ can be a set of $k$ points, and we want to find a coreset $P$ to obtain a $(1+\varepsilon)$-approximation to $\sum_{i=1}^n \|a_i - y_{a_i}\|_2$, where $y_{a_i}$ is the closest point to $a_i$ in $Y$. Often, we want to construct a strong coreset, meaning that with high probability, $P$ can be used in place of $A$ simultaneously for all possible query sets $Y$. If this is the case, then we can throw away the original dataset $A$, which saves us not only computational power but also storage.

There is a long line of work which focuses on constructing coresets for subspace approximation with the sum of squared distances loss function, as well as for the $k$-means problem (see, e.g., (Deshpande et al., 2006; Deshpande and Varadarajan, 2007; Feldman and Langberg, 2011; Feldman et al., 2010; 2013; Varadarajan and Xiao, 2012; Shyamalkumar and Varadarajan, 2007; Bădoiu et al., 2002; Chen, 2009; Feldman and Schulman, 2012; Frahling and Sohler, 2005; 2008; Har-Peled and Kushal, 2007; Har-Peled and Mazumdar, 2004; Langberg and Schulman, 2010)). Feldman et al. (2013) give the first coresets of size independent of $d$. For subspace approximation, they give strong coresets of size $O(k/\varepsilon)$, and of size $\tilde{O}(k^3/\varepsilon^4)$ for $k$-means. Cohen et al. (2015) improve the result and give an input sparsity time algorithm to construct the coreset. Later, Sohler and Woodruff (2018) give a strong coreset of size $\mathrm{poly}(k/\varepsilon)$ for the $k$-median problem, as well as the subspace approximation problem with the sum of distances loss function, obtaining the first strong coresets independent of $n$ and $d$ for this problem. Their algorithm runs in $\tilde{O}(\mathrm{nnz}(A) + (n+d)\,\mathrm{poly}(k/\varepsilon) + \exp(\mathrm{poly}(k/\varepsilon)))$ time. Recent work by Makarychev et al. (2019) provides an oblivious dimensionality reduction for $k$-median to an $O(\varepsilon^{-2}\log(k/\varepsilon))$-dimensional space while preserving the cost of every clustering. This dimension reduction result can also be used to construct a strong coreset of size $\mathrm{poly}(k/\varepsilon)$.

Sohler and Woodruff (2018) gave an algorithm to compute the first polynomial size coresets for $k$-median using their dimensionality reduction, albeit with a running time exponential in $k$ and $1/\varepsilon$, as discussed. In a similar way, we can obtain an $\tilde{O}(k^4/\varepsilon^8)$ size coreset for $k$-median in polynomial time using our dimensionality reduction algorithm. In concurrent and independent work, Huang and Vishnoi (2020) gave a polynomial time algorithm to compute a coreset of size $\tilde{O}(k/\varepsilon^4)$. We stress that we can run the second stage of the coreset construction algorithm of Huang and Vishnoi (2020) on a coreset of size $\tilde{O}(k^4/\varepsilon^8)$ to obtain a coreset of size $\tilde{O}(k/\varepsilon^4)$, just as in (Huang and Vishnoi, 2020). Also, their techniques cannot be extended to give an efficient dimensionality reduction algorithm to approximate the sum-of-distances metric, as their coreset construction arguments are based on an existential dimensionality reduction result. As an aside, we observe that the coreset construction algorithm of Huang and Vishnoi (2020) can be implemented to have a running time of $\tilde{O}(\mathrm{nnz}(A) + (n+d)\,\mathrm{poly}(k/\varepsilon))$, improving their $\tilde{O}(\mathrm{nnz}(A)\cdot k)$ time algorithm. To our knowledge, this is the first input sparsity time algorithm to construct a coreset for the fundamental $k$-median problem. We give a proof of our observation in the supplementary material.

The size of the coresets constructed based on the sensitivity sampling framework of Feldman and Langberg (2011) depends on the dimension of the data. An important consequence of our dimensionality reduction algorithm is that the coresets constructed on the data after first reducing its dimensionality can be much smaller for many important problems.

1.1. Our Results

Our main contribution is that we obtain the first polynomial time, in fact near-linear time, dimension reduction algorithm that returns a $\mathrm{poly}(k/\varepsilon)$-dimensional subspace such that the projections of the input points onto this subspace, as well as the distances of the points to this subspace, can be used to compute a $(1 \pm \varepsilon)$-approximation to the sum of distances of the set $A$ to any $k$-dimensional shape $S$.

Theorem 1.1 (Dimensionality Reduction). Given $A \in \mathbb{R}^{n \times d}$ and $0 < \varepsilon < 1$, there exists an algorithm that runs in time $\tilde{O}(\mathrm{nnz}(A)/\varepsilon^2 + (n+d)\,\mathrm{poly}(k/\varepsilon))$ and outputs a subspace $P$ of dimension $\tilde{O}(k^3/\varepsilon^6)$ such that, with probability $\ge 2/3$, for any shape $S \subseteq \mathbb{R}^d$ that lies in a $k$-dimensional subspace,
$$\sum_i \sqrt{\mathrm{dist}(P_P a_i, S)^2 + \mathrm{dist}(a_i, P)^2} = (1 \pm \varepsilon)\sum_i \mathrm{dist}(a_i, S).$$

When $A$ is dense, i.e., $\mathrm{nnz}(A) \approx nd$, the quantity $\mathrm{nnz}(A)/\varepsilon^2 \approx nd/\varepsilon^2$ may be prohibitive. In this case, we also provide a fast dimensionality reduction algorithm which runs in $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon))$ time.
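The sketch below spells out how the guarantee of Theorem 1.1 would be used: given an orthonormal basis $B$ for the subspace $P$, the sum of distances to a (here finite) shape is estimated from the projections and the residual distances only. The names and the finite-shape assumption are ours; any basis (e.g., one produced by Algorithm 3 below, or a singular value subspace for comparison) can be plugged in.

```python
import numpy as np

def sum_dist_estimate(A, B, S):
    """Compare sum_i sqrt(dist(P_P a_i, S)^2 + dist(a_i, P)^2) with sum_i dist(a_i, S).

    A : (n, d) data points; B : (d, r) orthonormal basis for the subspace P;
    S : (m, d) finite point set representing the k-dimensional shape.
    Returns (estimate, exact).
    """
    proj = A @ B @ B.T                                   # P_P a_i
    resid = np.linalg.norm(A - proj, axis=1)             # dist(a_i, P)
    d_proj = np.min(np.linalg.norm(proj[:, None, :] - S[None, :, :], axis=2), axis=1)
    d_true = np.min(np.linalg.norm(A[:, None, :] - S[None, :, :], axis=2), axis=1)
    return np.sum(np.sqrt(d_proj ** 2 + resid ** 2)), np.sum(d_true)
```

Only the basis $B$, the $r$ coordinates $B^T a_i$, and the scalar residuals need to be stored to form the estimate, which is what gives the $(n+d)\,\mathrm{poly}(k/\varepsilon)$ storage reduction discussed above.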
Theorem 1.2. For any $\varepsilon \in (0, 1)$ and $k \ge 1$, there is an $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon))$ time algorithm that finds an $\tilde{O}(k^{3.5}/\varepsilon^6)$-dimensional subspace $P$ such that, with probability $\ge 2/3$, for any shape $S \subseteq \mathbb{R}^d$ that lies in a $k$-dimensional subspace,
$$\sum_i \sqrt{\mathrm{dist}(P_P a_i, S)^2 + \mathrm{dist}(a_i, P)^2} = (1 \pm \varepsilon)\sum_i \mathrm{dist}(a_i, S).$$

Given a subspace $P$ as in the above theorems, it is still expensive to compute the projections of the rows of $A$ onto the subspace $P$ as well as the distances to the subspace $P$. We also give an algorithm to compute approximate projections and approximate distances that still satisfy the guarantees of the above theorems, obtaining the following theorem.

Theorem 1.3 (Size Reduction). Given a matrix $A \in \mathbb{R}^{n\times d}$ and a subspace $P$ of $r = \tilde{O}(k^3/\varepsilon^6)$ dimensions that satisfies the guarantees of Theorems 1.1 and 1.2, there is an algorithm that runs in time $\tilde{O}(\mathrm{nnz}(A) + (n+d)\,\mathrm{poly}(k/\varepsilon))$ and outputs vectors $a^B_i \in \mathbb{R}^r$ and values $v_i \in \mathbb{R}_{\ge 0}$ for all $i$ such that for any shape $S$ that lies in a $k$-dimensional subspace,
$$\sum_i \sqrt{\mathrm{dist}(Ba^B_i, S)^2 + v_i^2} = (1 \pm \varepsilon)\sum_i \mathrm{dist}(a_i, S),$$
where $B$ is an orthonormal basis for the subspace $P$. Thus the storage requirement drops from $\mathrm{nnz}(A)$ to $(n+d)k^3/\varepsilon^6$.

2. Preliminaries and Technical Overview

We let $A \in \mathbb{R}^{n\times d}$ denote our input matrix. The rows of $A$ are interpreted as a set of $n$ points in $\mathbb{R}^d$. Throughout the paper, we use $A_i$ and $a_i$ to denote the $i$-th row of $A$, and $A_{*i}$ to denote the $i$-th column. Similarly, for $J \subseteq [n]$, $A_J$ denotes the matrix with only the rows of $A$ indexed by $J$. For $n \in \mathbb{Z}^+$, $[n]$ denotes the set $\{1, 2, 3, \ldots, n\}$. For a matrix $A$, we use $A^{+}$ to denote its Moore-Penrose pseudoinverse. We write $x = (a, b)y$ to denote that $ay \le x \le by$. If $a = 1-\varepsilon$ and $b = 1+\varepsilon$, we abbreviate the notation as $x = (1\pm\varepsilon)y$. Given a subspace $B$, we use $P_B$ to denote the projection matrix onto $B$, i.e., for any vector $u$, we have $P_B u = \arg\min_{v \in B}\|u - v\|_2$. Let $B^{\perp}$ denote the orthogonal complement of the subspace $B$. We use bold capital letters such as $\mathbf{S}$, $\mathbf{L}$ to stress that these are random matrices that are explicitly sampled.

Definition 2.1 ((p, 2)-norm). For a matrix $A \in \mathbb{R}^{n\times d}$, its $(p,2)$-norm is $\|A\|_{p,2} = (\sum_{i=1}^n \|A_i\|_2^p)^{1/p}$. We define $\|A\|_h$ to be $\|A^T\|_{1,2}$, which is the sum of the $\ell_2$ norms of the columns of $A$.

Definition 2.2 ((k, p)-clustering). Given an input matrix $A \in \mathbb{R}^{n\times d}$, let $\mathcal{X}$ be the collection of all sets containing $k$ points. The $(k,p)$-clustering problem denotes the optimization problem $\min_{X \in \mathcal{X}} \sum_{A_i \in A} d(A_i, X)^p$. If $p = 2$, we have the $k$-means problem, while if $p = 1$, we have the $k$-median problem.

Definition 2.3 ((k, p)-subspace approximation). Given an input matrix $A \in \mathbb{R}^{n\times d}$, let $\mathcal{P}$ be the set of all subspaces of dimension at most $k$. The $(k,p)$-subspace approximation problem denotes the optimization problem $\min_{P \in \mathcal{P}} \sum_{i \in [n]} d(A_i, P)^p$. We let $\mathrm{SubApx}_{k,p}(A)$ denote the optimum value of the $(k,p)$-subspace approximation problem on $A$.

Definition 2.4 (ε-strong coreset). For the $(k,p)$-clustering problem with input matrix $A \in \mathbb{R}^{n\times d}$, a weighted $\varepsilon$-strong coreset is a tuple $(C, w)$, where $C \in \mathbb{R}^{m\times d}$ and $w : \mathrm{rows}(C) \to \mathbb{R}_+$, such that simultaneously for all $X \subseteq \mathbb{R}^d$ with $|X| = k$, $\sum_i w(C_i)\, d(C_i, X)^p = (1\pm\varepsilon)\sum_i d(A_i, X)^p$. The definition can be generalized to any data structure that lets us compute a $(1\pm\varepsilon)$ approximation to $\sum_{A_i \in A} d(A_i, X)^p$ for all sets $X$ of size $k$. A similar notion of strong coreset can be defined for the $(k,p)$-subspace approximation problem as well.

Definition 2.5 ((α, β)-bicriteria approximation). Given an input matrix $A \in \mathbb{R}^{n\times d}$ for the $(k,1)$-subspace approximation problem, we say that a subspace $Q$ is an $(\alpha, \beta)$-bicriteria approximation if $\dim(Q) \le \beta$ and $\sum_{i=1}^n d(A_i, Q) \le \alpha\,\mathrm{SubApx}_{k,1}(A)$.

Definition 2.6 (ℓ1 subspace embedding). Let $A \in \mathbb{R}^{n\times d}$, $\Pi \in \mathbb{R}^{s\times n}$. We call $\Pi$ an $(\alpha, \beta)$ $\ell_1$ subspace embedding if for all $x \in \mathbb{R}^d$, $\alpha\|Ax\|_1 \le \|\Pi A x\|_1 \le \beta\|Ax\|_1$. If $\Pi A = QR$ is the QR decomposition, then we let $\|A_i R^{-1}\|_1$ be the $\ell_1$ leverage score of the $i$-th row. See (Cohen and Peng, 2015; Wang and Woodruff, 2019) for several constructions of $\ell_1$ subspace embeddings.
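As a quick reference for these definitions, the following sketch evaluates the $(p,2)$-norm, the $\|\cdot\|_h$ norm, and the $(k,1)$-subspace approximation cost of a candidate subspace. It is a direct transcription of Definitions 2.1 and 2.3 with names of our choosing, not part of the paper's algorithms.

```python
import numpy as np

def p2_norm(A, p):
    """||A||_{p,2}: the l_p norm of the vector of row l_2 norms (Definition 2.1)."""
    return float(np.sum(np.linalg.norm(A, axis=1) ** p) ** (1.0 / p))

def h_norm(A):
    """||A||_h = ||A^T||_{1,2}: the sum of l_2 norms of the columns of A."""
    return p2_norm(A.T, 1)

def k1_subspace_cost(A, V):
    """sum_i d(A_i, span(V)) = ||A(I - VV^T)||_{1,2} for an orthonormal d x k basis V.

    The (k,1)-subspace approximation problem (Definition 2.3 with p = 1) minimizes
    this quantity over all k-dimensional subspaces; here we only evaluate it for a
    candidate V, e.g., to check an (alpha, beta)-bicriteria solution (Definition 2.5).
    """
    return p2_norm(A - A @ V @ V.T, 1)
```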
2.1. Technical Overview

Let $A \in \mathbb{R}^{n\times d}$ be the input matrix. Sohler and Woodruff (2018) show that if a subspace $S$ satisfies
$$\|A(I - P_S)\|_{1,2} - \|A(I - P_{S+W})\|_{1,2} \le \varepsilon^2\,\mathrm{SubApx}_{k,1}(A) \qquad (1)$$
for all $k$-dimensional subspaces $W$, then we can reduce the dimension of the input points by projecting the points onto $S$, while still being able to compute a $(1\pm\varepsilon)$-approximation to the sum of distances to any $k$-dimensional shape. They construct such a subspace $S$ by directly computing a $(1+\varepsilon, \mathrm{poly}(k/\varepsilon))$ bicriteria approximation for the $(i\cdot k, 1)$ subspace approximation problem on $A$, where $i$ is a randomly chosen index in $[1/\varepsilon^2]$. This introduces the $\exp(\mathrm{poly}(k/\varepsilon))$ term in their running time. We show that we can compute $(1+\varepsilon, \mathrm{poly}(k/\varepsilon))$-bicriteria solutions for the $(k,1)$-subspace approximation problem on $A(I - P)$, for adaptively chosen projection matrices $P$, and that with constant probability, the union of the bicriteria solutions we compute has the desired property (1).

We solve the problem of finding a $(1+\varepsilon, \mathrm{poly}(k/\varepsilon))$-bicriteria solution for the $(k,1)$-subspace approximation problem on the input $A(I - P)$, where $P$ is an arbitrary projection matrix onto a subspace of dimension at most $\mathrm{poly}(k/\varepsilon)$, based on techniques from (Clarkson and Woodruff, 2015). We simplify their arguments and obtain tighter parameters for their algorithms. We solve the problem in two stages. First we compute an $(O(1), \tilde{O}(k))$-approximation, i.e., we find a subspace $\hat{X}$ of dimension at most $\tilde{O}(k)$ such that $\|A(I-P)(I - P_{\hat{X}})\|_{1,2} \le O(1)\,\mathrm{SubApx}_{k,1}(A(I-P))$. To achieve this guarantee, we make use of so-called lopsided embeddings. Clarkson and Woodruff (2015) show that if a matrix $S$ is an $\varepsilon$ lopsided embedding for $(V_k, (A(I-P))^T)$, where $V_k$ is an orthonormal basis for the $k$-dimensional subspace that attains the cost $\mathrm{SubApx}_{k,1}(A(I-P))$, then $\min_{\text{rank-}k\ X}\|A(I-P)S^TX - A(I-P)\|_{1,2} \le (1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A(I-P))$. We first show that a Gaussian matrix $S$ with $O(k)$ rows is an $O(1)$ lopsided embedding with probability $\ge 9/10$. Then we show that if a random matrix $L$ is an $O(1)$ $\ell_1$ subspace embedding for the matrix $A(I-P)S^T$ and satisfies $\mathbb{E}_L[\|LM\|_{1,2}] = \|M\|_{1,2}$ for any fixed matrix $M$, then the row space of $LA(I-P)$ is an $O(1)$ approximation. We use the Lewis weight sampling algorithm of Cohen and Peng (2015) to sample a matrix $L$ that satisfies these properties. As the matrix $S^T$, which is a Gaussian matrix, has only $O(k)$ columns, the matrix $L$ has only $\tilde{O}(k)$ rows. We can also instead use the $\ell_1$ subspace embeddings of Wang and Woodruff (2019) to construct an $\tilde{O}(k^{3.5})$-sized $\ell_1$ embedding by leverage score sampling (Woodruff, 2014).

Next, based on the $(O(1), \tilde{O}(k))$ bicriteria solution, we perform non-adaptive residual sampling. This was shown to give a $(1+\varepsilon, \tilde{O}(k^3/\varepsilon^2))$ bicriteria solution in (Clarkson and Woodruff, 2015) when an $O(1)$ approximate solution is used. Thus, we obtain a subspace $\hat{S}$ for which $\|A(I-P)(I - P_{\hat{S}})\|_{1,2} \le (1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A(I-P))$. Starting with $P = 0$, we obtain a $(1+\varepsilon, k^3/\varepsilon^2)$ bicriteria subspace $\hat{S}$. However, the dimensionality reduction requires a subspace that satisfies (1). To obtain such a guarantee, we crucially run this algorithm adaptively $\Theta(1/\varepsilon)$ times. Let $\hat{S}_i$ be the subspace obtained in the $i$-th iteration.
In the $i$-th iteration, we find a bicriteria solution for the $(k,1)$ subspace approximation problem on the matrix $A(I - P_{\hat{S}_1 \oplus \cdots \oplus \hat{S}_{i-1}})$. We then show that the final subspace $\hat{S} = \bigoplus_j \hat{S}_j$, with probability $\ge 9/10$, satisfies $\|A(I - P_{\hat{S}})\|_{1,2} - \|A(I - P_{\hat{S}+W})\|_{1,2} \le \varepsilon\,\mathrm{SubApx}_{k,1}(A)$ for all $k$-dimensional subspaces $W$. Thus, running the above procedure with parameter $\varepsilon^2$ gives a subspace that satisfies (1). We show that each iteration of the algorithm takes $\tilde{O}(\mathrm{nnz}(A) + (n+d)\,\mathrm{poly}(k/\varepsilon))$ time and, as we run the algorithm adaptively for $1/\varepsilon^2$ iterations, the total time complexity of the algorithm is $O(\mathrm{nnz}(A)/\varepsilon^2 + (n+d)\,\mathrm{poly}(k/\varepsilon))$.

For dense inputs $A$, the algorithm described above has a running time of $O(nd/\varepsilon^2 + (n+d)\,\mathrm{poly}(k/\varepsilon))$, which can be prohibitive when both $n$ and $d$ are large. We observe that in each of the $1/\varepsilon^2$ iterations, the algorithm computes sampling probabilities $p_i$ for all $n$ rows, whereas it samples only $\mathrm{poly}(k/\varepsilon)$ rows independently with these probabilities in any particular iteration. We propose a novel alternative sampling scheme in which we partition the rows of the matrix $A(I-P)$ into equal size blocks $I_1, \ldots, I_b \subseteq [n]$. We show that, given several precomputed matrices, we can quickly obtain estimates $\mathrm{apx}_j$ that approximate $\sum_{i \in I_j} p_i$ for all $j \in [b]$. Now, to sample a row $i \in [n]$ with probability approximately equal to $p_i$, we first sample a block $I_j$ with probability proportional to $\mathrm{apx}_j$, which is close to $\sum_{i \in I_j} p_i$, and then compute the probabilities $p_i$ only for the rows in the sampled blocks. If the number of samples is less than the number of blocks, we see that we compute the actual probabilities only for a few rows. In order to be able to estimate $\mathrm{apx}_j$, we make use of several standard properties of Cauchy and Gaussian random matrices. Finally, we show that each of the precomputed matrices required can be computed in time $\tilde{O}(nd)$ using the fast rectangular matrix multiplication algorithm of Coppersmith (1982).

In addition to providing a tool for data size reduction, our dimensionality reduction also leads to small coreset constructions for various problems, with sizes that depend only on the problem parameter $k$ instead of $n$ or $d$. As shown by Sohler and Woodruff (2018), the points projected onto the subspace given by a dimensionality reduction algorithm can be used to construct coresets of size $\mathrm{poly}(k/\varepsilon)$ for the $k$-median and $(k,1)$-subspace approximation problems. We note that the same constructions work with our dimensionality reduction algorithm. We include the details of such constructions in the supplementary material.

3. Sum of Distances to a k-dimensional Shape

Let $A = \{a_1, \ldots, a_n\}$ be a given set of points and $P$ be a $\mathrm{poly}(k/\varepsilon)$ dimensional subspace that satisfies (1). Let $S \subseteq \mathbb{R}^d$ be an arbitrary shape such that $\mathrm{span}(S)$ has dimension at most $k$. We want to obtain an $\varepsilon$ approximation to $\sum_i \mathrm{dist}(a_i, S)$. Sohler and Woodruff (2018) show that for any such shape $S$, $\sum_i \sqrt{\mathrm{dist}(a_i, P_P a_i)^2 + \mathrm{dist}(P_P a_i, S)^2} = (1\pm\varepsilon)\sum_i \mathrm{dist}(a_i, S)$. The following theorem is a more general version that works with approximate projections onto the subspace $P$ and approximate distances to the subspace $P$. A similar lemma is stated as Lemma 14 in (Sohler and Woodruff, 2018). We correct an error in Equation 2 of their proof.

Theorem 3.1. Let $P$ be an $r$-dimensional subspace of $\mathbb{R}^d$ such that $\sum_i \big(\mathrm{dist}(a_i, P) - \mathrm{dist}(a_i, P+W)\big) \le \frac{\varepsilon^2}{80}\mathrm{SubApx}_{k,1}(A)$ for all $k$-dimensional subspaces $W$. Let $B \in \mathbb{R}^{d\times r}$ be an orthonormal basis for the subspace $P$.
For each $a_i$, let $a^B_i \in \mathbb{R}^r$ be such that $\mathrm{dist}(a_i, Ba^B_i) \le (1+\varepsilon_c)\,\mathrm{dist}(a_i, P)$, and let $(1-\varepsilon_c)\,\mathrm{dist}(a_i, P) \le \mathrm{apx}_i \le (1+\varepsilon_c)\,\mathrm{dist}(a_i, P)$, for $\varepsilon_c = \varepsilon^2/6$. Then for any $k$-dimensional shape $S$,
$$\sum_i \sqrt{\mathrm{dist}(Ba^B_i, S)^2 + \mathrm{apx}_i^2} = (1 \pm 5\varepsilon)\sum_i \mathrm{dist}(a_i, S).$$

The above theorem shows that we only have to compute approximate projections onto the subspace, which can be done in input sparsity time by using high probability subspace embeddings obtained from CountSketch matrices (see Section 2.3 of (Woodruff, 2014) and (Liang et al., 2014)).

4. Dimensionality Reduction for Sparse Matrices

4.1. Constructing an (O(1), Õ(k))-bicriteria Subspace Approximation

We first show how to obtain an $(O(1), \tilde{O}(k))$-bicriteria solution for $(k,1)$-subspace approximation. A key tool we use is a lopsided embedding, defined as follows:

Definition 4.1 (Lopsided embedding). A matrix $S$ is a lopsided $\varepsilon$-embedding for matrices $A$ and $B$ with respect to a matrix norm $\|\cdot\|$ and a constraint set $\mathcal{C}$ if (i) for all matrices $X$ of the appropriate dimensions, $\|S(AX - B)\| \ge (1-\varepsilon)\|AX - B\|$, and (ii) for $B^* = AX^* - B$, we have $\|SB^*\| \le (1+\varepsilon)\|B^*\|$, where $X^* = \arg\min_{X \in \mathcal{C}}\|AX - B\|$.

Let $U_k \in \mathbb{R}^{n\times k}$ and $V_k^T \in \mathbb{R}^{k\times d}$ be rank $k$ matrices such that $\|U_k V_k^T - A\|_{1,2} = \mathrm{SubApx}_{k,1}(A)$. Clarkson and Woodruff (2015) show that if $S$ is a lopsided $\varepsilon$-embedding for the matrices $(V_k, A^T)$ with respect to the norm $\|\cdot\|_h$, then $\min_{\text{rank-}k\ X}\|AS^TX - A\|_{1,2} \le (1+O(\varepsilon))\,\mathrm{SubApx}_{k,1}(A)$. We show that a suitably scaled Gaussian random matrix $S$ with $\tilde{O}(k)$ rows is a lopsided $(1/4)$-embedding for the matrices $(V_k, A^T)$ with probability $\ge 9/10$. Thus, we have that with probability $\ge 9/10$,
$$\min_{\text{rank-}k\ X}\|AS^TX - A\|_{1,2} \le (3/2)\,\mathrm{SubApx}_{k,1}(A).$$
We next prove that a row-sampling based $\ell_1$ subspace embedding for the column space of the matrix $AS^T$ can be used to obtain a bicriteria solution to the subspace approximation problem. The following lemma summarizes the results discussed above. The results of the lemma are a significant improvement over Lemma 44 of (Clarkson and Woodruff, 2015) and have simpler proofs that do not involve $\varepsilon$-nets.

Lemma 4.1. (i) If $S^T$ is a random Gaussian matrix with $O(k)$ columns, then $S$ is a $1/4$-lopsided embedding for $(V_k, A^T)$ with respect to the $\|\cdot\|_h$ norm with probability $\ge 9/10$. Therefore, with probability $\ge 9/10$,
$$\min_{\text{rank-}k\ X}\|AS^TX - A\|_{1,2} \le (3/2)\,\mathrm{SubApx}_{k,1}(A).$$
(ii) If $L$ is a random matrix drawn from a distribution such that with probability $\ge 9/10$, $\alpha\|AS^Ty\|_1 \le \|LAS^Ty\|_1 \le \beta\|AS^Ty\|_1$ for all vectors $y$, and if $\mathbb{E}_L[\|LM\|_{1,2}] = \|M\|_{1,2}$ for any matrix $M$, then with probability $\ge 3/5$, all matrices $X$ of appropriate dimensions such that $\|LAS^TX - LA\|_{1,2} \le 10\,\mathrm{SubApx}_{k,1}(A)$ satisfy $\|AS^TX - A\|_{1,2} \le O(2 + 40/\alpha)\,\mathrm{SubApx}_{k,1}(A)$.

Using the above lemma, we now have the following theorem, which shows that Algorithm 1 returns an $(O(1), \tilde{O}(k))$ approximation.

Theorem 4.1. Given any matrix $A \in \mathbb{R}^{n\times d}$ and a matrix $B \in \mathbb{R}^{d\times c_1}$, with $c_1 = \mathrm{poly}(k/\varepsilon)$, having orthonormal columns, Algorithm 1 returns a matrix $\hat{X}$ with $\tilde{O}(k)$ orthonormal columns that with probability $1-\delta$ satisfies
$$\|A(I - BB^T)(I - \hat{X}\hat{X}^T)\|_{1,2} \le O(1)\,\mathrm{SubApx}_{k,1}(A(I - BB^T)),$$
in time $\tilde{O}((\mathrm{nnz}(A) + d\,\mathrm{poly}(k/\varepsilon))\log(1/\delta))$.

Algorithm 1 POLYAPPROX
Input: $A \in \mathbb{R}^{n\times d}$, $B \in \mathbb{R}^{d\times c_1}$, $k \in \mathbb{Z}$, $\delta$
Output: $\hat{X} \in \mathbb{R}^{d\times c_2}$
  cols $\leftarrow O(k + 1/\delta^2)$
  $S^T \leftarrow N(0,1)^{d\times \text{cols}}$
  $L \leftarrow \text{LEWISWEIGHT}(A(I - BB^T)S^T, 1/2)$ (Cohen and Peng, 2015)
  $\hat{X} \leftarrow$ orthonormal basis for $\mathrm{rowspace}(LA(I - BB^T))$
  Repeat the above $O(\log(1/\delta))$ times and return the best $\hat{X}$, i.e., the $\hat{X}$ minimizing $\|A(I - \hat{X}\hat{X}^T)G\|_{1,2}$, where $G$ is a Gaussian matrix with $O(\log(n))$ columns
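The following Python sketch mirrors the structure of Algorithm 1 under simplifying assumptions: the Lewis weight computation is replaced by the basic fixed-point iteration of Cohen and Peng (2015) run on a dense matrix, all constants and sample sizes are illustrative placeholders rather than the ones required by Theorem 4.1, and the $O(\log(1/\delta))$ repetitions with the Gaussian cost check are omitted. It is meant only to show how the Gaussian sketch, the $\ell_1$ row sampling, and the final row space basis fit together.

```python
import numpy as np

def approx_l1_lewis_weights(M, n_iter=10):
    """Approximate l_1 Lewis weights of M via the fixed-point iteration
    w_i <- sqrt(m_i^T (M^T W^{-1} M)^{-1} m_i) of Cohen and Peng (2015).
    Dense, unoptimized; for illustration only."""
    n, _ = M.shape
    w = np.ones(n)
    for _ in range(n_iter):
        G = M.T @ (M / w[:, None])                        # M^T W^{-1} M
        Ginv = np.linalg.pinv(G)
        w = np.sqrt(np.maximum(np.einsum('ij,jk,ik->i', M, Ginv, M), 0.0))
        w = np.maximum(w, 1e-12)                          # numerical safeguard
    return w

def poly_approx(A, B, k, n_samples=None, seed=0):
    """Sketch of Algorithm 1 (POLYAPPROX): returns an orthonormal X_hat whose column
    space should contain an O(1)-approximate (k,1)-subspace approximation of
    A(I - BB^T). B is a d x c1 matrix with orthonormal columns (possibly c1 = 0)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    AP = A - A @ B @ B.T                                  # A(I - BB^T)
    St = rng.normal(size=(d, 2 * k))                      # Gaussian S^T with O(k) columns
    w = approx_l1_lewis_weights(AP @ St)                  # weights for A(I - BB^T)S^T
    s = n_samples or min(n, 20 * k)                       # Õ(k) samples (illustrative)
    idx = rng.choice(n, size=s, replace=True, p=w / w.sum())
    # X_hat: orthonormal basis for the row space of the sampled rows of A(I - BB^T)
    Xhat, _ = np.linalg.qr(AP[idx].T)
    return Xhat
```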
4.2. Constructing a (1 + ε, Õ(k³/ε²))-bicriteria Subspace Approximation

Using the $(O(1), \tilde{O}(k))$-bicriteria subspace approximation solution found, we design a finer sampling process based on Theorem 45 of (Clarkson and Woodruff, 2015) to further pick a subspace of dimension $\tilde{O}(k^3/\varepsilon^2)$ that contains a $(1+\varepsilon)$-approximate solution for subspace approximation of the matrix $A(I - BB^T)$. The following lemma states that, given a subspace of cost at most $K\,\mathrm{SubApx}_{k,1}(A)$, a sample of $\tilde{O}(K\cdot k^3/\varepsilon^2)$ rows, with probabilities chosen proportional to the distances of the rows of the matrix $A$ to the subspace, can be used to construct a subspace that is a $(1+\varepsilon)$ approximation.

Algorithm 2 EPSAPPROX
Input: $A$, $B$, $\hat{X}$, $k$, $K$, $\varepsilon$, $\delta > 0$
Output: $U \in \mathbb{R}^{d\times c}$ such that $U^TB = 0$
  $t \leftarrow O(\log(n/\delta))$, $G \leftarrow N(0, 1/t)^{d\times t}$
  $M \leftarrow A(I - BB^T)(I - \hat{X}\hat{X}^T)G$
  $p_i \leftarrow \|M_i\|_2 / \|M\|_{1,2}$ for all $i \in [n]$
  $s \leftarrow \tilde{O}(K\cdot k^3/\varepsilon^2 \cdot \log(1/\delta))$
  $S \leftarrow$ multiset of $s$ independent samples drawn from the distribution $p$
  $U \leftarrow$ orthonormal basis for the column space of the matrix $(I - BB^T)[\hat{X}\ \ (A_S)^T]$
  Return $U$

Lemma 4.2. Given a matrix $A \in \mathbb{R}^{n\times d}$ and a matrix $\hat{X} \in \mathbb{R}^{d\times c}$ that satisfies $\|A(I - \hat{X}\hat{X}^T)\|_{1,2} \le K\,\mathrm{SubApx}_{k,1}(A)$, suppose we generate a matrix $S$ of $s = \tilde{O}((K/\alpha)\cdot k^3/\varepsilon^2\cdot \log(1/\delta))$ rows, each chosen independently to be the $i$-th standard basis vector with probability $p_i$. Here, $\sum_{i\in[n]} p_i = 1$ and for all $i \in [n]$, $p_i \ge \alpha\, q_i/\sum_i q_i$, where $q_i = \|A_i(I - \hat{X}\hat{X}^T)\|_2$. Let $U$ be an orthonormal basis for the row space of $[\hat{X}^T; SA]$. Then, with probability $1-\delta$, $\|A(I - UU^T)\|_{1,2} \le (1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A)$.

The proof of the above lemma is the same as the proof of Theorem 45 of (Clarkson and Woodruff, 2015), with a minor change to account for the approximation error $\alpha$. Now the following theorem shows that Algorithm 2 satisfies the conditions of the previous lemma.

Theorem 4.2 (Residual Sampling). Given a matrix $A \in \mathbb{R}^{n\times d}$ and matrices $B \in \mathbb{R}^{d\times c_1}$ and $\hat{X} \in \mathbb{R}^{d\times c_2}$ with orthonormal columns such that $\|A(I - BB^T)(I - \hat{X}\hat{X}^T)\|_{1,2} \le K\,\mathrm{SubApx}_{k,1}(A(I - BB^T))$, Algorithm 2 returns a matrix $U$ having $c = \tilde{O}(c_2 + K\cdot k^3/\varepsilon^2\cdot\log(1/\delta))$ orthonormal columns such that with probability $1-\delta$,
$$\|A(I - BB^T)(I - UU^T)\|_{1,2} \le (1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A(I - BB^T)).$$
The algorithm runs in time $\tilde{O}(\mathrm{nnz}(A) + d\,\mathrm{poly}(k/\varepsilon))$. Moreover, we also have that $U^TB = 0$, i.e., the column spaces of $U$ and $B$ are orthogonal to each other.

The proof of the theorem mainly involves showing that $\|M_i\|_2$ is proportional to the residual $\|A_i(I - BB^T)(I - \hat{X}\hat{X}^T)\|_2$. This is done by using the fact that if $G$ is a Gaussian matrix with $O(\log(1/\delta))$ columns, then $\|x^TG\|_2 = (1/2, 3/2)\,\|x\|_2$ with probability $1-\delta$. We then apply Lemma 4.2 to conclude that the solution computed by the algorithm is a bicriteria solution of cost at most $(1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A(I - BB^T))$. Therefore, using the $(O(1), \tilde{O}(k))$ bicriteria solution obtained using Algorithm 1, we can obtain a $(1+\varepsilon, \tilde{O}(k^3/\varepsilon^2))$ bicriteria solution.

5. Dimensionality Reduction

With the algorithm from the previous section to construct a $(1+\varepsilon, \tilde{O}(k^3/\varepsilon^2))$ bicriteria solution, we are now ready to construct a subspace that satisfies (1). Recall that the crucial property we need for the subspace $S$ is that for all $k$-dimensional subspaces $W$,
$$\|A(I - P_S)\|_{1,2} - \|A(I - P_{S+W})\|_{1,2} \le \varepsilon^2\,\mathrm{SubApx}_{k,1}(A).$$
To get such a subspace, we run Algorithms 1 and 2 adaptively and then show that the union of all the $(1+\varepsilon)$-approximate bicriteria solutions satisfies the above property with parameter $O(\varepsilon)$. Thus, running the algorithm with parameter $\Theta(\varepsilon^2)$ gives a subspace with the desired property.
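Continuing the sketch above, the next snippet imitates Algorithm 2: it sketches the residuals with a Gaussian matrix $G$, samples rows with probability proportional to the sketched residual norms, and returns an orthonormal basis that is kept orthogonal to $B$. Again, the sketch width and sample size are illustrative placeholders, not the values required by Theorem 4.2.

```python
import numpy as np

def eps_approx(A, B, Xhat, k, K=1, eps=0.5, seed=0):
    """Sketch of Algorithm 2 (EPSAPPROX): residual sampling on top of an
    (O(1), Õ(k))-bicriteria solution Xhat; returns U with U^T B = 0."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    t = max(1, int(np.ceil(np.log2(n + 1))))              # O(log n) Gaussian columns
    G = rng.normal(scale=1.0 / np.sqrt(t), size=(d, t))   # entries N(0, 1/t)
    AP = A - A @ B @ B.T                                  # A(I - BB^T)
    M = (AP - AP @ Xhat @ Xhat.T) @ G                     # sketched residuals
    row_norms = np.linalg.norm(M, axis=1)
    p = row_norms / row_norms.sum()                       # p_i proportional to ||M_i||_2
    s = min(n, int(np.ceil(K * k ** 3 / eps ** 2)))       # Õ(K k^3 / eps^2) samples
    idx = rng.choice(n, size=s, replace=True, p=p)
    C = np.hstack([Xhat, AP[idx].T])                      # span of Xhat and sampled rows
    C = C - B @ (B.T @ C)                                 # enforce U^T B = 0
    U, _ = np.linalg.qr(C)
    return U
```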
Theorem 5.1. Given an $n\times d$ matrix $A$, $k \in \mathbb{Z}$, and an accuracy parameter $\varepsilon > 0$, Algorithm 4 returns a matrix $B$ with $\tilde{O}(k^3/\varepsilon^6)$ orthonormal columns and a matrix $\mathrm{Apx} = [X\ \ v]$ such that, with probability $\ge 9/10$, for any $k$-dimensional shape $S$,
$$\sum_i \sqrt{\mathrm{dist}(Bx_i, S)^2 + v_i^2} = (1\pm\varepsilon)\sum_i \mathrm{dist}(A_i, S).$$
The algorithm runs in time $O(\mathrm{nnz}(A)/\varepsilon^2 + (n+d)\,\mathrm{poly}(k/\varepsilon))$.

Algorithm 3 DIMENSIONREDUCTION
Input: $A \in \mathbb{R}^{n\times d}$, $k$, $\varepsilon > 0$
Output: $B \in \mathbb{R}^{d\times c}$ with orthonormal columns
  $i \leftarrow$ uniformly random integer from $[10/\varepsilon + 1]$
  Initialize $B \leftarrow [\,]$
  for $i$ iterations do
    $\hat{X} \leftarrow \text{POLYAPPROX}(A, B, k, \varepsilon/100)$
    $U \leftarrow \text{EPSAPPROX}(A, B, \hat{X}, k, \tilde{O}(\sqrt{k}), \varepsilon, \varepsilon/100)$
    $B \leftarrow [B\ |\ U]$
  end for
  Return $B$

Algorithm 4 COMPLETEDIMREDUCE
Input: $A \in \mathbb{R}^{n\times d}$, $k \in \mathbb{Z}$, $\varepsilon > 0$
Output: $\mathrm{Apx} \in \mathbb{R}^{n\times(c+1)}$
  Let $B = \text{DIMENSIONREDUCTION}(A, k, \Theta(\varepsilon^2))$
  $t = O(\log(n))$
  Compute $(S_jB, S_jA^T)$ for $j \in [t]$, where each $S_j$ is an independent CountSketch matrix with $\mathrm{poly}(k/\varepsilon)$ rows
  for $i = 1, \ldots, n$ do
    Let $U_j D_j V_j^T \leftarrow \mathrm{SVD}(S_j[B\ \ A_i^T])$ for all $j \in [t]$
    for $j \in [t]$ do
      Check if, for at least half of the indices $j' \ne j$, all singular values of $D_{j'}V_{j'}^T(D_jV_j^T)^{-1}$ are in $[1 - \Theta(\varepsilon^2), 1 + \Theta(\varepsilon^2)]$
      If the above check holds, set $x_i \leftarrow (S_jB)^{+}(S_jA_i^T)$, $v_i \leftarrow \|(I - (S_jB)(S_jB)^{+})(S_jA_i^T)\|_2$ and go to the next $i$
    end for
  end for
  Return $B$ and the $n\times(c+1)$ matrix $\mathrm{Apx}$ with $\mathrm{Apx}_i = [x_i\ \ v_i]$

Let $B_i$ be the value of the matrix $B$ after $i$ iterations in Algorithm 3. The proof of the above theorem first shows that Algorithm 3 outputs a subspace $B$ satisfying (1). This is done by showing that for at least a constant fraction of $j \in [10/\varepsilon + 1]$, the terms $\|A(I - B_jB_j^T)\|_{1,2}$ and $\|A(I - B_{j+1}B_{j+1}^T)\|_{1,2}$ are close. This further means that the rows of the matrix $A(I - B_jB_j^T)$ cannot be projected onto any $k$-dimensional subspace $W$ to make $\|A(I - B_jB_j^T)(I - WW^T)\|_{1,2}$ substantially smaller than $\|A(I - B_jB_j^T)\|_{1,2}$. Thus, we can show that with constant probability, for $i$ chosen randomly by Algorithm 3, the subspace $\mathrm{colspan}(B_i)$ satisfies (1). Then, the proof uses the fact that for every $i \in [n]$, the algorithm finds a matrix $S_j$ that is a $\Theta(\varepsilon^2)$ subspace embedding for $[B\ \ A_i^T]$. This is shown to be true in (Liang et al., 2014). Now, if $S_j$ is a subspace embedding, it can be shown that the vector $x_i$ and the value $v_i$ satisfy the conditions of Theorem 3.1, thus proving the above theorem.
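Putting the two stages together, the loop below follows the structure of Algorithm 3, using the poly_approx and eps_approx sketches from the previous sections; the full pipeline of Algorithm 4 would call it with accuracy parameter $\Theta(\varepsilon^2)$ and then compute the approximate projections and distances with CountSketch-based subspace embeddings, which we do not reproduce here. All constants are again illustrative.

```python
import numpy as np

def dimension_reduction(A, k, eps, seed=0):
    """Sketch of Algorithm 3 (DIMENSIONREDUCTION): run the two-stage bicriteria
    solver adaptively for a uniformly random number of iterations in [10/eps + 1]
    and accumulate the returned subspaces into one orthonormal basis B."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    iters = int(rng.integers(1, int(10 / eps) + 2))       # uniform over [10/eps + 1]
    B = np.zeros((d, 0))                                  # start with the empty basis
    for i in range(iters):
        Xhat = poly_approx(A, B, k, seed=seed + i)
        U = eps_approx(A, B, Xhat, k, eps=eps, seed=seed + i)
        B = np.hstack([B, U])                             # B <- [B | U]
        B, _ = np.linalg.qr(B)                            # re-orthonormalize for safety
    return B
```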
6. Linear Time Algorithm for Dense Matrices

We see from Algorithm 4 that, after computing a subspace that satisfies (1), we can compute approximate projections and approximate distances to the subspace in time $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon))$. We now show that the subspace can also be found in $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon))$ time, thereby giving a near linear time algorithm for dimensionality reduction for dense matrices.

6.1. Computing an (O(1), poly(k)) Approximation

Consider constructing an $\ell_1$ subspace embedding for the matrix $A(I - BB^T)S^T$ in Algorithm 1. The algorithm uses Lewis weights to sample a matrix that is an $O(1)$ $\ell_1$ subspace embedding for $A(I - BB^T)S^T$ with high probability. We instead use the following theorem to compute an $\ell_1$ subspace embedding which is more amenable to giving fast algorithms for dense matrices.

Theorem 6.1 (Section 3.1 of Woodruff (2014)). Given a matrix $A \in \mathbb{R}^{n\times d}$, let $L \in \mathbb{R}^{r\times n}$ be a random matrix that is an $(\alpha, \beta)$ $\ell_1$ subspace embedding for the matrix $A$. Let $LA = QR$ be the QR decomposition of the matrix $LA$, and let $\ell_i = \|A_iR^{-1}\|_1$ for $i \in [n]$. If we generate a matrix $L'$ with $N = O((d^2\sqrt{r}/(\gamma\varepsilon^2))(\beta/\alpha)\log(1/(\delta\varepsilon)))$ rows, each chosen independently as the $i$-th standard basis vector times $1/(Np_i)$ with probability $p_i$, where $p_i \ge \gamma\,\ell_i/\sum_i \ell_i$ for all $i \in [n]$, then the matrix $L'$ satisfies, with probability $1-\delta$, for all vectors $x$,
$$(1-\varepsilon)\|Ax\|_1 \le \|L'Ax\|_1 \le (1+\varepsilon)\|Ax\|_1.$$

Therefore, given an $(\alpha, \beta)$ $\ell_1$ subspace embedding with $r$ rows for the matrix $A(I - BB^T)S^T \in \mathbb{R}^{n\times O(k)}$, we can compute a $(1\pm\varepsilon)$ $\ell_1$ embedding with $N = O((k^2\sqrt{r}/(\gamma\varepsilon^2))(\beta/\alpha)\log(1/\varepsilon))$ rows. Using the $\ell_1$ subspace embedding of Theorem 1.3 of Wang and Woodruff (2019), we have $r, (\beta/\alpha) = O(k\log(k))$.

Theorem 6.2. Given $A \in \mathbb{R}^{n\times d}$, $B \in \mathbb{R}^{d\times c_1}$, $k \in \mathbb{Z}$ and $\delta$, there exists an algorithm that returns $\hat{X}$ with $\tilde{O}(k^{3.5})$ orthonormal columns that with probability $1-\delta$ satisfies
$$\|A(I - BB^T)(I - \hat{X}\hat{X}^T)\|_{1,2} \le O(1)\,\mathrm{SubApx}_{k,1}(A(I - BB^T)).$$
Given that the matrices $C_1A_{I_j}$ for all $j \in [b]$ and $WA$, where $C_1$ is a Cauchy matrix with $O(\log(n\,\mathrm{poly}(k/\varepsilon)))$ rows and $W$ is the subspace embedding of Wang and Woodruff (2019) for $O(k)$-dimensional spaces, are precomputed for each of the $O(\log(1/\delta))$ trials, the algorithm can be implemented in time $\tilde{O}(((nd/b)\cdot k^{3.5} + d\,\mathrm{poly}(k/\varepsilon))\log(1/\delta))$.

We first partition the rows of the matrix $A$ into $b$ sets denoted $I_1, \ldots, I_b$. To prove the above theorem, we use the following fact: if $C$ is a Cauchy matrix with $O(\log(n/\delta)/\varepsilon^2)$ rows, then for any vector $x$ of $n$ dimensions, $\mathrm{median}(\mathrm{abs}(Cx)) = (1\pm\varepsilon)\|x\|_1$ with probability $1-\delta$. This fact lets us quickly compute an approximation to the sum of the leverage scores of the rows that lie in $I_j$ without computing individual leverage scores. Using these approximations, we can sample $r$ rows by computing the leverage scores of all rows in just $r$ blocks, instead of computing the leverage scores of all the rows of the matrix, which gives the cost saving described in the theorem.

6.2. Computing a (1 + ε, poly(k/ε)) Approximation

Theorem 6.3. Given a matrix $A \in \mathbb{R}^{n\times d}$, orthonormal matrices $B$ and $\hat{X}$ such that $\|A(I - BB^T)(I - \hat{X}\hat{X}^T)\|_{1,2} \le K\,\mathrm{SubApx}_{k,1}(A(I - BB^T))$, and parameters $k$, $\varepsilon$, and $\delta$, there exists an algorithm that outputs a matrix $U$ with $\mathrm{poly}(K\cdot k/\varepsilon)$ orthonormal columns such that with probability $1-\delta$,
$$\|A(I - BB^T)(I - UU^T)\|_{1,2} \le (1+\varepsilon)\,\mathrm{SubApx}_{k,1}(A(I - BB^T)).$$
Given that $C_1A_{I_j}$ is precomputed for all $j \in [b]$, where $C_1$ is a Cauchy matrix with $O(\log(n\,\mathrm{poly}(k/\varepsilon)))$ rows, the algorithm runs in time $\tilde{O}((nd/b)\cdot(K\cdot k^3/\varepsilon^2\cdot\log(1/\delta)) + d\,\mathrm{poly}(k/\varepsilon))$.

Note that Algorithm 2 samples $O(K\,\mathrm{poly}(k/\varepsilon))$ rows using probabilities proportional to the residuals of the rows with respect to the $O(1)$ approximation. We again divide the rows of the matrix $A(I - BB^T)(I - \hat{X}\hat{X}^T)$ into $b$ parts denoted $I_1, \ldots, I_b$. We use the following fact from (Plan and Vershynin, 2013): if $G$ is a Gaussian matrix with $m$ columns, then with high probability $(1/m)\|x^TG\|_1 \approx \sqrt{2/\pi}\,\|x\|_2$. Thus $\|A_{I_j}(I - BB^T)(I - \hat{X}\hat{X}^T)\|_{1,2}$ can be approximated by scaling $\|A_{I_j}(I - BB^T)(I - \hat{X}\hat{X}^T)G\|_{1,1}$, i.e., the sum of absolute values of the entries of the matrix $A_{I_j}(I - BB^T)(I - \hat{X}\hat{X}^T)G$. We show that these approximations can be computed quickly given the precomputed matrices required in the theorem statement. Given the approximations to the sums of residuals of the rows in each $I_j$, we only have to compute the residuals of $(n/b)\cdot c$ rows to sample $c$ rows from the distribution over residuals.

Replacing Algorithms 1 and 2 by the algorithms given by the above theorems lets us compute a subspace satisfying (1) in time $\tilde{O}((nd/b)\cdot k^{3.5}/\varepsilon^6 + (n+d)\,\mathrm{poly}(k/\varepsilon) + T)$, where $T$ is the time required to compute the precomputed matrices required by the algorithms. By choosing $b = k^{3.5}/\varepsilon^6$, we obtain that a subspace can be computed in time $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon) + T)$.
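The snippet below is a toy rendering of the two-level sampling idea behind Theorems 6.2 and 6.3: block masses are estimated with small Cauchy sketches (standing in for the precomputed products $C_1A_{I_j}$, which are where the actual time savings come from), a block is drawn proportionally to its estimated mass, and exact row probabilities are computed only inside the sampled blocks. Sketch sizes and the input format are our own illustrative choices.

```python
import numpy as np

def cauchy_l1_estimates(CX):
    """median over rows of |CX| estimates the l_1 norm of each column of X
    when C has i.i.d. standard Cauchy entries."""
    return np.median(np.abs(CX), axis=0)

def two_level_residual_sampling(R, b, n_samples, sketch_rows=25, seed=0):
    """R : n x t matrix of (already sketched) residuals, e.g. A(I-BB^T)(I-XX^T)G.
    Returns row indices sampled approximately proportional to their residual norms,
    computing exact norms only inside the sampled blocks."""
    rng = np.random.default_rng(seed)
    n, _ = R.shape
    blocks = np.array_split(np.arange(n), b)
    # (1) estimate the mass ||R_{I_j}||_{1,1} of each block from a small Cauchy sketch
    est = np.empty(len(blocks))
    for j, I in enumerate(blocks):
        C = rng.standard_cauchy(size=(sketch_rows, len(I)))
        est[j] = cauchy_l1_estimates(C @ R[I]).sum()
    q = est / est.sum()
    samples = []
    for _ in range(n_samples):
        j = rng.choice(len(blocks), p=q)                  # (2) pick a block
        I = blocks[j]
        norms = np.linalg.norm(R[I], axis=1)              # (3) exact norms in that block
        samples.append(int(I[rng.choice(len(I), p=norms / norms.sum())]))
    return np.array(samples)
```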
Notice that the above theorems only require at most $\mathrm{poly}(k/\varepsilon)$ products of the form $MA$ for matrices $M$ with at most $\mathrm{poly}(k/\varepsilon)$ rows. These products can be obtained by computing $\bar{M}A$, where $\bar{M}$ is formed by stacking all the matrices $M$. Now $\bar{M}$ only has $\mathrm{poly}(k/\varepsilon)$ rows, and the product $\bar{M}A$ can be computed in time $\tilde{O}(nd)$ by using the fast rectangular matrix multiplication algorithm of Coppersmith (1982). Thus, a subspace satisfying (1) can be computed in time $\tilde{O}(nd + (n+d)\,\mathrm{poly}(k/\varepsilon))$.

7. Experiments

We perform experiments to empirically verify that we can attain a non-trivial amount of data reduction while still being able to compute an approximate sum of distances to a $k$-dimensional shape. In our experiments, we set $n = 10000$ and $k = 5$. We use various subspaces to compute an approximation to the sum of distances to a $k$ center set.

7.1. Synthetic Data

We set $d = 10000$ and choose a set $C$ of 5 centers in $\mathbb{R}^d$ randomly. For each center, we add Cauchy noise to generate 2000 independent samples, and hence obtain a dataset $A$ of size 10000. We run our algorithm with a target dimension of 100 and store the subspaces at intermediate steps of the loop in Algorithm 3. We compare the approximation computed for $\sum_i \mathrm{dist}(A_i, C)$ using the subspace computed by Algorithm 3, a random subspace, and the top singular value subspace of the same dimensions. See Figure 1 for a plot. We note that for all dimensions, the subspace output by Algorithm 3 does better than a random subspace at approximating the sum of distances, and particularly at lower dimensions, the subspace output by Algorithm 3 does better than the top singular value subspace of the same dimension.

Figure 1. Comparison of subspaces output by Algorithm 3 on the synthetic dataset with random and singular value subspaces.

7.2. Real-World Data

We run our dimensionality reduction algorithm on a randomly chosen subset $A$ of size 10000 of the Cover Type dataset (Dua and Graff, 2017). We compute a $k$-means solution $C$ on the dataset and then evaluate the sum of distances to the center set $C$. Similar to the case of synthetic data, we compare the approximate sum of distances to $C$ computed using the subspace output by Algorithm 3, a random subspace, and the top singular value subspace. See Figure 2 for a plot. We note that, again, the subspace output by our algorithm performs better than the random subspace at all dimensions. But note that the singular value subspace approximates the sum of distances to the set $C$ better than the subspace output by our algorithm. This occurs due to the fact that the data is inherently low dimensional, and therefore if $P$ is the top singular value subspace, then $P_P a_i \approx a_i$ and $\mathrm{dist}(a_i, P) \approx 0$, and therefore $\sqrt{\mathrm{dist}(a_i, P)^2 + \mathrm{dist}(P_P a_i, C)^2} \approx \mathrm{dist}(a_i, C)$. Thus, if the matrix $A$ can be approximated well by a low dimensional matrix, then we can instead use the top singular value subspace to reduce the dimension of the data and still be able to compute an approximate sum of distances to $k$-dimensional shapes, although we do not have any theoretical bounds for singular value subspaces.

Figure 2. Comparison of subspaces output by Algorithm 3 on the Cover Type dataset with random and singular value subspaces.

An implementation of our Algorithm 3 and code for our experiments is available at https://gitlab.com/praneeth10/dimensionality-reduction-for-sum-of-distances.

Acknowledgements

The authors would like to acknowledge support from the National Institute of Health (NIH) grant 5R01 HG 10798-2, Office of Naval Research (ONR) grant N00014-18-1-256, and a Simons Investigator Award.
References

Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 250–257. ACM, 2002.

Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.

Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 310–329. IEEE, 2015.

Michael B. Cohen and Richard Peng. Lp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 183–192, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450335362. doi: 10.1145/2746539.2746567. URL https://doi.org/10.1145/2746539.2746567.

Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172. ACM, 2015.

D. Coppersmith. Rapid multiplication of rectangular matrices. SIAM Journal on Computing, 11(3):467–471, 1982. doi: 10.1137/0211037. URL https://doi.org/10.1137/0211037.

Amit Deshpande and Kasturi Varadarajan. Sampling-based dimension reduction for subspace approximation. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 641–650. ACM, 2007.

Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1117–1126. Society for Industrial and Applied Mathematics, 2006.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, pages 569–578, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306911. doi: 10.1145/1993636.1993712. URL https://doi.org/10.1145/1993636.1993712.

Dan Feldman and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1343–1354. Society for Industrial and Applied Mathematics, 2012.

Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 630–649. Society for Industrial and Applied Mathematics, 2010.

Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434–1453. Society for Industrial and Applied Mathematics, 2013.

Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 209–217. ACM, 2005.

Gereon Frahling and Christian Sohler.
A fast k-means implementation using coresets. International Journal of Computational Geometry & Applications, 18(06):605–625, 2008.

Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.

Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 291–300. ACM, 2004.

Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in Euclidean spaces: Importance sampling is nearly optimal. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 1416–1429, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi: 10.1145/3357713.3384296. URL https://doi.org/10.1145/3357713.3384296.

Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307–323, May 2006. ISSN 0004-5411. doi: 10.1145/1147954.1147955. URL https://doi.org/10.1145/1147954.1147955.

Michael Langberg and Leonard J. Schulman. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 598–607. SIAM, 2010.

Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, and David Woodruff. Improved distributed principal component analysis. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/52947e0ade57a09e4a1386d08f17b656-Paper.pdf.

Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1027–1038. ACM, 2019.

Jiří Matoušek. Lecture notes on metric embeddings. Technical report, ETH Zürich, 2013.

Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013. doi: 10.1002/cpa.21442. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21442.

Nariankadu D. Shyamalkumar and Kasturi Varadarajan. Efficient subspace approximation algorithms. In SODA, volume 7, pages 532–540, 2007.

Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 802–813. IEEE, 2018.

Kasturi Varadarajan and Xin Xiao. On the sensitivity of shape fitting problems. arXiv preprint arXiv:1209.4893, 2012.

Ruosong Wang and David P. Woodruff. Tight bounds for lp oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 1825–1843, USA, 2019. Society for Industrial and Applied Mathematics.

David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, October 2014. ISSN 1551-305X. doi: 10.1561/0400000060. URL https://doi.org/10.1561/0400000060.