# Computing Approximate ℓp Sensitivities

Swati Padmanabhan, Massachusetts Institute of Technology, pswt@mit.edu
David P. Woodruff, Carnegie Mellon University, dwoodruf@cs.cmu.edu
Qiuyi (Richard) Zhang, Google Research, qiuyiz@google.com

Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating sensitivities, which we show to be equivalent to approximate regression, are known only for the ℓ2 setting, in which they are popularly termed leverage scores. In this work, we provide the first efficient algorithms for approximating ℓp sensitivities and related summary statistics of a given matrix. In particular, for a given n × d matrix, we compute an α-approximation to its ℓ1 sensitivities at the cost of O(n/α) sensitivity computations. For estimating the total ℓp sensitivity (i.e., the sum of ℓp sensitivities), we provide an algorithm based on importance sampling of ℓp Lewis weights, which computes a constant-factor approximation to the total sensitivity at the cost of roughly O(√d) sensitivity computations. Furthermore, we estimate the maximum ℓ1 sensitivity, up to a √d factor, using O(d) sensitivity computations. We generalize all these results to ℓp norms for p > 1. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.

1 Introduction

Many modern large-scale machine learning datasets comprise tall matrices A ∈ ℝ^{n×d} with d features and n ≫ d training examples, often making it computationally expensive to run them through even basic algorithmic primitives. Therefore, many applications spanning dimension reduction, privacy, and fairness aim to estimate the importance of a given datapoint as a first step. Such importance estimates afford us a principled approach to focus our attention on a subset of only the most important examples. In view of this benefit, several sampling approaches have been extensively developed in the machine learning literature. One of the simplest such methods prevalent in practice is uniform sampling. Prominent examples of this technique include variants of stochastic gradient descent methods [1-4] such as [5-14] developed for the finite-sum minimization problem
$$\operatorname*{minimize}_{\theta \in \mathcal{X}} \; \sum_{i=1}^{n} f_i(\theta), \tag{1.1}$$
where, in the context of empirical risk minimization, each f_i : 𝒳 → ℝ_{≥0} in (1.1) measures the loss incurred by the i-th data point from the training set, and in generalized linear models, each f_i represents a link function applied to a linear predictor evaluated at the i-th data point. The aforementioned algorithms randomly sample a function f_i and make progress via gradient estimation techniques. However, uniform sampling can lead to significant information loss when the number of important training examples is relatively small. Hence, importance sampling methods that assign higher sampling probabilities to more important examples have emerged both in theory [15, 16] and practice [17-21].
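All of these importance-sampling schemes rest on the same inverse-probability reweighting principle: sample a term with probability p_i and rescale it by 1/p_i, so that the reweighted subsample estimates the full sum in (1.1) without bias. The following minimal numpy sketch illustrates this; the norm-based sampling probabilities are a hypothetical stand-in for the sensitivity-based probabilities introduced next.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
theta = rng.standard_normal(d)
f = np.abs(A @ theta - b)                    # per-example losses f_i(theta)

# Hypothetical importance-sampling probabilities: here, proportional to row norms of A.
p = np.linalg.norm(A, axis=1)
p /= p.sum()

m = 100                                      # number of sampled terms
idx = rng.choice(n, size=m, p=p)
estimate = np.mean(f[idx] / p[idx])          # inverse-probability reweighting
print(f.sum(), estimate)                     # the reweighted subsample is unbiased for the full sum
```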
One such technique is based on sensitivity scores, defined [15] for each coordinate i ∈ [n] as:
$$\sigma_i \;\stackrel{\text{def}}{=}\; \max_{x \in \mathcal{X}} \frac{f_i(x)}{\sum_{j=1}^{n} f_j(x)}. \tag{1.2}$$
Sampling each f_i independently with probability p_i ∝ σ_i and scaling the sampled row with 1/p_i preserves the objective in (1.1) in expectation for every x ∈ 𝒳. Further, one gets a (1 ± ε)-approximation to this objective by sampling Õ(S d ε^{-2}) functions [22, Theorem 1.1], where S = Σ_{i=1}^n σ_i is called the total sensitivity, and d is an associated VC dimension. Numerous advantages of sensitivity sampling over ℓp Lewis weight sampling [23] were highlighted in the recent work [24], prominent among which is that a sampling matrix built using the ℓp sensitivities of A is significantly smaller than one built using its ℓp Lewis weights in many important cases, e.g., when the total sensitivity S is small for p > 2 (which is seen in many structured matrices, such as combinations of Vandermonde, sparse, and low-rank matrices as studied in [25], as well as those in many naturally occurring datasets (cf. Section 4)). Additionally, sensitivity sampling has found use in constructing ε-approximators for shape-fitting functions, also called strong coresets [15, 16, 26-28], clustering [16, 29, 30], logistic regression [31, 32], and least squares regression [33, 34]. Despite their benefits, the computational burden of estimating these sensitivities has been known to be significant, requiring solving an expensive maximization problem for each datapoint, which is a limiting factor in their use [35]. A key exception is for f_i(x) = (a_i^⊤ x)^2, when the sensitivity score of f_i is exactly the leverage score of a_i, which can be computed quickly in O(log(n)) sensitivity calculations, as opposed to the naive O(n) sensitivity calculations (cf. Section 2).

1.1 Our contributions

In this work, we initiate a systematic study of algorithms for approximately computing the ℓp sensitivities of a matrix, generally up to a constant factor. We first define ℓp sensitivities:

Definition 1.1. [15] Given a full-rank matrix A ∈ ℝ^{n×d} with n ≥ d and a scalar p ∈ (0, ∞), let a_i denote the i-th row of matrix A. Then the vector of ℓp sensitivities of A is defined as¹ (cf. Section 2 for the notation)
$$\sigma_p(a_i) \;\stackrel{\text{def}}{=}\; \max_{x \in \mathbb{R}^d,\, Ax \neq 0} \frac{|a_i^\top x|^p}{\|Ax\|_p^p}, \quad \text{for all } i \in [n], \tag{1.3}$$
and the total ℓp sensitivity is defined as S_p(A) := Σ_{i∈[n]} σ_p(a_i).

These ℓp sensitivities yield sampling probabilities for ℓp regression, a canonical optimization problem that captures least squares regression, maximum flow, and linear programming, in addition to appearing in applications like low-rank matrix approximation [36], sparse recovery [37], graph-based semi-supervised learning [38-40], and data clustering [41]. However, algorithms for approximating ℓp sensitivities were known only for p = 2 (when they are called leverage scores). As is typical in large-scale machine learning, we assume n ≫ d. We now state our contributions, which are three-fold.

(1) Approximation algorithms. We design algorithms for three prototypical computational tasks associated with the ℓp sensitivities of a matrix A ∈ ℝ^{n×d}. Specifically, for an approximation parameter α > 1, we can simultaneously estimate all ℓp sensitivities up to an additive error of O((α^p/n) S_p(A)) via O(n/α) individual sensitivity calculations, reducing our runtime by sacrificing some accuracy. We limit most of our results in the main text to ℓ1 sensitivities, but extend these to all p ≥ 1 in Appendix C.
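To make Definition 1.1 concrete for p = 1, the sketch below computes a single ℓ1 sensitivity exactly via the scale-invariance reduction σ1(a_i) = 1 / min_{a_i^⊤x = 1} ‖Ax‖_1, posed as a linear program. This is the naive per-row route rather than the paper's faster algorithms; the use of scipy.optimize.linprog and the helper name ell1_sensitivity are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def ell1_sensitivity(A: np.ndarray, i: int) -> float:
    """Exact ell_1 sensitivity of row i of A via the LP
    min ||Ax||_1  subject to  a_i^T x = 1,   with  sigma_1(a_i) = 1 / optimum."""
    n, d = A.shape
    # Variables: x in R^d (free) and t in R^n with t_j >= |(Ax)_j|.
    c = np.concatenate([np.zeros(d), np.ones(n)])          # minimize sum_j t_j
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])   #  Ax - t <= 0  and  -Ax - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.concatenate([A[i], np.zeros(n)])[None, :]    # a_i^T x = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] * d + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return 1.0 / res.fun

# Example: ell_1 sensitivities lie in (0, 1] and sum to the total sensitivity S_1(A).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
sigma = np.array([ell1_sensitivity(A, i) for i in range(A.shape[0])])
print(sigma.max(), sigma.sum())
```

This naive approach solves one LP per row; the algorithms below reduce the number of such per-row computations.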
To state our results for the ℓp setting, we use nnz(A) to denote the number of non-zero entries of the matrix A, ω for the matrix multiplication constant, and we introduce the following piece of notation.

Definition 1.2 (Notation for ℓp Results). We introduce the notation LP(m, d, p) to denote the cost of approximating one ℓp sensitivity of an m × d matrix up to an accuracy of a given constant factor.

¹From hereon, for notational conciseness, we omit stating Ax ≠ 0 in the problem constraint.

Theorem 1.3 (Estimating all sensitivities; informal version of Theorem 3.1 and Theorem C.2). Given a matrix A ∈ ℝ^{n×d} and 1 ≤ α ≤ n, there exists an algorithm (Algorithm 1) which returns a vector σ̃ ∈ ℝ^n_{≥0} such that, with high probability, for each i ∈ [n], we have σ1(a_i) ≤ σ̃_i ≤ O(σ1(a_i) + (α/n) S1(A)). Our algorithm's runtime is Õ(nnz(A) + (n/α) d^ω). More generally, for p ≥ 1, we can compute (via Algorithm 6) an estimate σ̃ ∈ ℝ^n_{≥0} satisfying σ_p(a_i) ≤ σ̃_i ≤ O(α^{p-1} σ_p(a_i) + (α^p/n) S_p(A)) with high probability for each i ∈ [n], at a cost of O(nnz(A) + (n/α) · LP(O(d^{max(1,p/2)}), d, p)).

Two closely related properties arising widely in algorithms research are the total sensitivity S1(A) and the maximum sensitivity ‖σ1(A)‖∞. As stated earlier, one important feature of the total sensitivity is that it determines the total sample size needed for function approximation, thus making its efficient estimation crucial for fast ℓ1 regression. Additionally, it was shown in [24] that the sample complexity of sensitivity sampling is much lower than that of Lewis weight sampling in many practical applications, such as when the total sensitivity is small; therefore, an approximation of the total sensitivity can be used as a fast test for whether or not to proceed with sensitivity sampling (involving the costly task of calculating all sensitivities). While the total sensitivity helps us characterize the whole dataset (as explained above), the maximum sensitivity captures the importance of the most important datapoint and is used in, e.g., experiment design and in reweighting matrices for low coherence [42]. Additionally, it captures the maximum extent to which a datapoint can influence the objective function and is therefore used in differential privacy [43]. While one could naively estimate the total and maximum sensitivities by computing all n sensitivities using Theorem 1.3, we give faster algorithms that, strikingly, have no polynomial dependence on n. In the assumed regime of n ≫ d, this is a significant runtime improvement.

Theorem 1.4 (Estimating the total sensitivity; informal version of Theorem 3.5). Given a matrix A ∈ ℝ^{n×d}, a scalar p ≥ 1, and a small constant γ < 1, there exists an algorithm (Algorithm 2) which returns a positive scalar ŝ such that, with a probability of at least 0.99, we have S_p(A) ≤ ŝ ≤ (1 + O(γ)) S_p(A). Our algorithm's runtime is Õ(nnz(A) + (1/γ²) d^{|p/2-1|} · LP(O(d^{max(1,p/2)}), d, p)).

In Algorithm 4 (presented in Appendix B.2), we give an alternate method to estimate S1(A) based on recursive leverage score sampling, without ℓp Lewis weight calculations, which we believe to be of independent interest. All of our algorithms utilize approximation properties of ℓp Lewis weights and subspace embeddings, and recent developments in the fast computation of these quantities.

Theorem 1.5 (Estimating the maximum sensitivity; informal version of Theorem 3.7 and Theorem C.3). Given a matrix A ∈ ℝ^{n×d}, there exists an algorithm (Algorithm 3) which returns a scalar σ̃ > 0 such that ‖σ1(A)‖∞ ≤ σ̃ ≤ O(√d · ‖σ1(A)‖∞). Our algorithm's runtime is Õ(nnz(A) + d^{ω+1}).
Furthermore, via Algorithm 7, this result holds for all p [1, 2] and for p > 2, this guarantee costs e O nnz(A) + dp/2 LP(O(dp/2), d, p) . (2) Hardness results. In the other direction, while it is known that ℓp sensitivity calculation reduces to ℓp regression, we show the converse. Specifically, we establish hardness of computing ℓp sensitivities in terms of the cost of solving multiple corresponding ℓp regression problems, by showing that a (1 + ε)-approximation to ℓp-sensitivities solves ℓp multi-regression up to an accuracy factor of 1 + ε. Theorem 1.6 (Leave-One-Out ℓp Multiregression Reduces to ℓp Sensitivities; informal2). Suppose that we are given an sensitivity approximation algorithm A, which for any matrix A Rn d and accuracy parameter ε (0, 1), computes (1 ε )σp(A ) in time T (n , d , nnz(A ), ε ). Given a matrix A Rn d with n d, let OPTi := miny Rd 1 A: iy + A:i p p and y i := arg miny Rd 1 A: iy + A:i p for all the i [d]. Then, there exists an algorithm that takes A Rn d and computes (1 ε)OPTi for all i in time T (n + d, d, nnz(A), ε). 2Please see Lemma D.2 for the formal version with a small additive error. For p = 2, approximating all ℓp sensitivities to a high accuracy would therefore require improving ℓp multi-regression algorithms with Ω(d) instances, for which the current best high-accuracy algorithms take poly(d) nnz(A) time [44, 45]. More concretely, our hardness results imply that in general one cannot compute all sensitivities as quickly as leverage scores unless there is a major breakthrough in multi-response regression. In order to show this, we introduce a reduction algorithm that efficiently regresses columns of A against linear combinations of other columns of A, by simply adding a row to A capturing the desired linear combination and then computing sensitivities, which will reveal the cost of the corresponding regression problem. We can solve Θ(d) of these problems by augmenting the matrix to increase the rows and columns by at most a constant factor. We defer these details to Appendix D. (3) Numerical experiments. We consolidate our theoretical contributions with a demonstration of the empirical advantage of our approximation algorithms over the naive approach of estimating these sensitivities using the UCI Machine Learning Dataset Repository [46] in Section 4. We found that many datasets have extremely small total sensitivities; therefore, fast sensitivity approximation algorithms utilize this small intrinsic dimensionality of real-world data far better than Lewis weights sampling, with sampling bounds often a factor of 2-5x better. We also found these sensitivities to be easy to approximate, with the accuracy-runtime tradeoff much better than our theory suggests, with up to 40x faster in runtime. Lastly, we show that our algorithm for estimating the total sensitivity produces accurate estimates, up to small constant factors, with a runtime speedup of at least 4x. 1.2 Related work Introduced in the pioneering work of [15], sensitivity sampling falls under the broad framework of importance sampling . It has found use in constructing ϵ-approximators of numerical integrands of several function classes in numerical analysis, statistics (e.g., in the context of VC dimension), and computer science (e.g., in coresets for clustering). 
The results from [15] were refined and extended by [16], with more advances in the context of shape-fitting problems in [26 28], clustering [16, 29, 30], logistic regression [31, 32], least squares regression [33, 34, 47], principal component analysis [48], reinforcement learning [49, 50], and pruning of deep neural networks [51 53]. Sampling algorithms for ℓp subspace embeddings have also been extensively studied in functional analysis [54 59] as well as in theoretical computer science [23, 60, 44, 45]. Another notable line of work in importance sampling uses Lewis weights to determine sampling probabilities. As we explain in Section 2, Lewis weights are closely related to sensitivities. Many modern applications of Lewis weights in theoretical computer science we introduced in [23], which gave input sparsity time algorithms for approximating Lewis weights and used them to obtain fast algorithms for solving ℓp regression. They have subsequently been used in row sampling algorithms for data pre-processing [33, 61, 62, 60, 23], computing dimension-free strong coresets for k-median and subspace approximation [63], and fast tensor factorization in the streaming model [64]. Lewis weights are also used for ℓ1 regression in: [65] for stochastic gradient descent pre-conditioning, [66] for quantile regression, and [67] to provide algorithms for linear algebraic problems in the sliding window model. Lewis weights have also become widely used in convex geometry [68], randomized numerical linear algebra [23, 42, 69, 64, 70], and machine learning [71 73]. 2 Notation and preliminaries We use boldface uppercase and lowercase letters for matrices and vectors respectively. We denote the ith row vector of a matrix A by ai. Given a matrix M, we use Tr(M) for its trace, rank(M) for its rank, M for its Moore-Penrose pseudoinverse, and |M| for the number of its rows. When x is an integer, we use [x] for the set of integers 1, 2, . . . , x. For two positive scalars x and y and a scalar α (0, 1), we use x α y to denote (1 α)y x (1 + α)y; we sometimes use x y to mean (1 c)y x (1 + c)y for some appropriate universal constant c. We use e O to hide dependence on no(1). We acknowledge that there is some notation overloading between p (when used to denote the scalar in, for example, ℓp norms) and pi (when used to denote some probabilities) but the distinction is clear from context. We defer the proofs of facts stated in this section to Appendix A. Notation for sensitivities. We start with a slightly general definition of sensitivities: σB p (ai) to denote the ℓp sensitivity of ai with respect to B; when computing the sensitivity with respect to the same matrix that the row is drawn from, we omit the matrix superscript: σB p (ai) := max x |a i x|p Bx p p and σp(ai) := max x |a i x|p Ax p p . (2.1) When referring to the vector of ℓp sensitivities of all the rows of a matrix, we omit the subscript in the argument: so, σB p (A) is the vector of ℓp sensitivities of rows of the matrix A, each computed with respect to the matrix B (as in Equation (2.1)), and analogously, σp(A) is the vector of ℓp sensitivities of the rows of matrix A, each computed with respect to itself. 
Similarly, the total sensitivity is the sum of ℓp sensitivities of the rows of A with respect to matrix B, for which we use the following notation: SB p (A) := X i [|A|] σB p (ai) and Sp(A) := X i [|A|] σp(ai), (2.2) where, analogous to the rest of the notation regarding sensitivities, we omit the superscript when the sensitivities are being computed with respect to the input matrix itself. While σB p (ai) could be infinite, by appending ai to B, the generalized sensitivity becomes once again contained in [0, 1]. Specifically, a simple rearrangement of the definition gives us the identity σB ai p (ai) = 1/(1 + 1/σB p (ai)). We now define leverage scores, which are a special type of sensitivities and also amenable to efficient computation [60], which makes them an ideal building block in algorithms for approximating sensitivities. Leverage scores are ℓ2 sensitivities. The leverage score of the ith row ai of A is τi(A) def = a i (A A) ai. The ith leverage score is also the ith diagonal entry of the orthogonal projection matrix A(A A) A . Since the eigenvalues of an orthogonal projection matrix are zero or one, it implies τi(A) = 1 i A(A A) A 1i 1 for all i [n] and Pn i=1 τi(A) d (see [74]). It turns out that τi(A) = σ2(ai) (see [75]). This may be verified by applying a change of basis to the variable y = Ax in the definition of sensitivity and working out that the maximizer is the vector parallel to y = A ai. This also gives an explicit formula for the total ℓ2 sensitivity: S2(A) = rank(A). The fastest algorithm for constant factor approximation to leverage scores is by [60], as stated next. Fact 2.1 ([60, Lemma 7]). Given a matrix A Rn d, we can compute constant-factor approximations to all its leverage scores in time O(nnz(A) log(n) + dω log(d)). In this paper, the above is the result we use to compute leverage scores (to a constant accuracy). Indeed, since the error accumulates multiplicatively across different steps of our algorithms, a high-accuracy algorithm for leverage score computation does not serve any benefit over this constant-accuracy one. Leverage scores approximate ℓ1 sensitivities up to a distortion factor of n, as seen from the below known fact we prove, for completeness, in Appendix A. Fact 2.2 (Crude sensitivity approximation via leverage scores). Given a matrix A Rn d, let σ1(ai) denote the ith ℓ1 sensitivity of A with respect to A, and let τi(A) denote its ith leverage score with respect to A. Then we have q In general, ℓp sensitivities of rows of a matrix suffer a distortion factor of at most n|1/2 1/p| from corresponding leverage scores , which follows from Hölder s inequality. The ℓp generalizations of leverages scores are called ℓp Lewis weights [54, 23], which satisfy the recurrence wi = τi(W1/2 1/p A). However, unlike when p = 2, ℓp Lewis weights are not equal to sensitivities, and from an approximation perspective, they provide only a one-sided bound on the sensitivities σA p (ai) d max(0,p/2 1) w A p (ai) [75, 76]. While we have algorithms for efficiently computing leverage scores [60] and Lewis weights [23, 77], extensions for utilizing these ideas for fast and accurate sensitivity approximation algorithms for all p have been limited, which is the gap this paper aims to fill. A key regime of interest for many downstream algorithms is a small constant factor approximation to true sensitivities, such as for subsampling [78]. Subspace embeddings. We use sampling-based constant-approximation ℓp subspace embeddings that we denote by Sp. 
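To make the ℓ2 case above concrete, the following minimal numpy sketch computes exact leverage scores from the pseudoinverse formula (the naive O(nd²) route rather than the fast approximation of Fact 2.1), checks that they sum to rank(A), and uses them to build the leverage-score row-sampling ℓ2 subspace embedding formalized in Definition 2.3 below; the oversampling factor is an illustrative choice.

```python
import numpy as np

def leverage_scores(A: np.ndarray) -> np.ndarray:
    """Exact leverage scores tau_i = a_i^T (A^T A)^+ a_i (naive O(n d^2) computation)."""
    G_pinv = np.linalg.pinv(A.T @ A)
    return np.einsum('ij,jk,ik->i', A, G_pinv, A)

rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.standard_normal((n, d))
tau = leverage_scores(A)
assert np.isclose(tau.sum(), np.linalg.matrix_rank(A))   # sum of leverage scores = rank(A)

# Row sampling with probabilities p_i proportional to tau_i, rescaling kept rows by 1/sqrt(p_i):
# with high probability SA is an ell_2 subspace embedding of A (cf. Definition 2.3).
m = 40 * d                                               # illustrative oversampling factor
p = np.minimum(1.0, m * tau / tau.sum())
keep = rng.random(n) < p
SA = A[keep] / np.sqrt(p[keep])[:, None]

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))    # close to 1
```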
When the type of embedding is clear from context, we omit the subscript. Definition 2.3 (ℓp Subspace Embedding). Given A Rn d (typically n d), we call S Rr n (typically d r n) an ϵ-approximate ℓp subspace embedding of A if SAx p ε Ax p x Rd. When the diagonal matrix S is defined as Sii = 1/ pi with probability pi = Ω(τi(A)) (and 0 otherwise), then with high probability SA is an ℓ2 subspace embedding for A [33, 79]; further, S has at most e O(dε 2) non-zero entries. For general p, we use Lewis weights to construct ℓp subspace embeddings fast [23, Theorem 7.1]. This fast construction has been made possible by the reduction of Lewis weight computation to a few (polynomial in p) leverage score computations for p < 4 [23] as well as p 4 [77]. This efficient construction is the reason for our choice of this subspace embedding. We note that there has been an extensive amount of literature on designing ℓ1 subspace embeddings for example using Cauchy [80] and exponential [81] random variables. For p 2, it is known that the expected sample complexity of rows is Pn i=1 pi = O(d) [34], and for p > 2, this is known to be at most O(dp/2) [35]. While these bounds are optimal in the worst case, one important application of faster sensitivity calculations is to compute the minimal subspace embedding dimension for average case applications, particularly those seen in practice. This gap is captured by the total sensitivity Sp(A), which has been used to perform more sample efficient subspace embeddings [67, 27, 28], as sampling e O(Sp(A)d/ϵ2) rows suffice to provide an ℓp subspace embedding. Accurate sensitivity calculations. While leverage scores give a crude approximation, constantfactor approximations of ℓp sensitivities seem daunting since we must compute worst-case ratios over the input space. However, we can compute the cost of one sensitivity specific row by reducing it to ℓp regression since by scale invariance, we have 1 σB p (ai) = mina i x=1 Bx p p. This reduction allows us to use recent developments in fast approximate algorithms for ℓp regression [44, 25, 45], which utilize subspace embeddings to first reduce the number of rows of the matrix to O(poly(d)). For the specific case of p = 1, which is the focus of the main body of our paper, we may reduce to approximate ℓ1 regression on a O(d) d matrix, which is a form of linear programming. Fact 2.4. Given an n d matrix, the cost of computing k of its ℓ1 sensitivities is e O(nnz(A)+k dω). 3 Approximating functions of sensitivities We first present a constant-probability algorithm approximating the ℓ1 sensitivities in Algorithm 1. Our algorithm is a natural one: Since computing the ℓ1 sensitivities of all the rows simultaneously is computationally expensive, we instead hash α-sized blocks of random rows into smaller (constant-sized) buckets and compute the sensitivities of the smaller, O(n/α)-sized matrix P so generated, computing the sensitivities with respect to SA, the ℓ1 subspace embedding of A. Running this algorithm multiple times gives our high-probability guarantee via the standard median trick. Theorem 3.1. Given a full-rank matrix A Rn d and an approximation factor 1 < α n, let σ1(ai) be the ℓ1 sensitivity of the ith row of A. Then there exists an algorithm that, in time e O nnz(A) + n α dω , returns eσ Rn 0 such that with high probability, for each i [n], σ1(ai) eσi O(σ1(ai) + α n S1(A)). (3.1) Proof Sketch of Theorem 3.1; full proof in Appendix B.1. We achieve our guarantee via Algorithm 1. 
To see this, first note that SAx 1 Θ( Ax 1) (since as per Line 1, SA is an ℓ1 subspace embedding of A). Combining this with the fact that every row in A is mapped, via the designed randomness, to some row in the matrix P helps establish the desired approximation guarantee. The runtime follows from using Fact 2.4 to compute ℓ1 sensitivities with respect to SA of |P| = O(n/α) rows. As seen in Appendix C.1, our techniques described above also generalize to the p 1 case. Specifically, in Theorem C.2, we show that reducing ℓp sensitivity calculations by an α factor gives an approximation guarantee of the form O(αp 1σp(ai)) + αp n Sp(A)). Remark 3.2. The sensitivities returned by our algorithm are approximate, with relative and additive error, but are useful in certain settings as they preserve ℓp regression approximation guarantees while increasing total sample complexity by only a small poly(d) factor compared to true sensitivities while still keeping it much smaller than that obtained via Lewis weights. To see this with a simple toy Algorithm 1 Approximating ℓ1-Sensitivities: Row-wise Approximation Input: Matrix A Rn d, approximation factor α (1, n) Output: Vector eσ Rn >0 that satisfies, for each i [n], with probability 0.9, that σ1(ai) eσi O(σ1(ai) + S1(A) α 1: Compute for A an ℓ1 subspace embedding SA O(d) d (cf. Definition 2.3) 2: Partition A into n α blocks B1, . . . , Bn/α each comprising α randomly selected rows 3: for the ℓth block Bℓ, with ℓ [ n 4: Sample α-dimensional independent Rademacher vectors r(ℓ) 1 , . . . , r(ℓ) 100 5: For each j [100], compute the row vectors r(ℓ) j Bℓ Rd 6: end for 7: Let P R100 n α d be the matrix of all vectors from Line 5. Compute σSA 1 (P) using Fact 2.4 8: for each i [n] do 9: Denote by J the set of row indices in P that ai is mapped to in Line 5 10: Set eσi = maxj J(σSA 1 (pj)) 11: end for 12: Return eσ example, consider the case p > 2. Note that by Theorem 1.5 of [24], the sample complexity using approximate sensitivities is e O(α2p 2S2 2/p p (A)), using the αp factor blowup via Theorem C.2, that using true sensitivities is e O(S2 2/p p (A)), and that using ℓp Lewis weights is O(dp/2). Suppose also that the total sensitivity Sp(A) = d. Further assume that n = d10 and α = n 1 10p = d1/p; then the dimension-dependence of sample complexities given by our approximate sensitivities, true sensitivities, and Lewis weights are, respectively, e O(d4), e O(d2), and e O(dp/2) (ignoring small factors in the exponent that do not affect the asymptotic analysis). Thus, our approximate sensitivities provide a tradeoff in computational cost and sample complexity between using true sensitivities and Lewis weights. 3.1 Estimating the sum of ℓp sensitivities In this section, we present a one-shot algorithm that provides a constant-approximation to the total ℓp sensitivity for all p 1. This algorithm is based on importance sampling using ℓp Lewis weights. For our desired guarantees, we crucially need the approximation bounds of ℓp Lewis weights, as provided by Fact 3.3, efficient computation of ℓp Lewis weights as provided by [23] for p < 4 and by [77] for p 4, and the fact that a random ℓp sampling matrix S (cf. Fact 3.4) yields SA, a constant-approximation subspace embedding to A [24, Theorem 1.8]. 
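A minimal numpy sketch of this estimator (the formal version is Algorithm 2, stated next) is given below, under simplifying assumptions: the Lewis weights are computed by the classical fixed-point iteration rather than the fast routines of [23, 77], the sensitivities are computed with respect to A itself rather than the sketch S_pA, and the per-row sensitivity oracle (e.g., the LP routine sketched in Section 1.1) is passed in as a parameter.

```python
import numpy as np

def lewis_weights(A: np.ndarray, p: float = 1.0, iters: int = 30) -> np.ndarray:
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2},
    which converges to the ell_p Lewis weights for p < 4."""
    n, d = A.shape
    w = np.full(n, d / n)                     # positive initialization; weights sum to d at the fixed point
    for _ in range(iters):
        M = np.linalg.inv(A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A))
        w = np.einsum('ij,jk,ik->i', A, M, A) ** (p / 2.0)
    return w

def estimate_total_sensitivity(A: np.ndarray, p: float, sensitivity_oracle, m: int, seed: int = 0) -> float:
    """Importance-sampling estimate of S_p(A), in the spirit of Algorithm 2:
    sample rows with probability v_i = w_p(a_i) / sum_j w_p(a_j), average sigma_p(a_i) / v_i."""
    n, _ = A.shape
    w = lewis_weights(A, p)
    v = w / w.sum()
    idx = np.random.default_rng(seed).choice(n, size=m, p=v)
    return float(np.mean([sensitivity_oracle(A, i, p) / v[i] for i in idx]))

# Usage with a hypothetical oracle, e.g. the LP-based ell1_sensitivity sketched in Section 1.1:
# estimate_total_sensitivity(A, 1.0, lambda A, i, p: ell1_sensitivity(A, i), m=200)
```

The key property is unbiasedness: when i is drawn with probability v_i, the expectation of σ_p(a_i)/v_i is exactly S_p(A), and Fact 3.3 below controls the variance because it bounds the ratio σ_p(a_i)/v_i.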
Algorithm 2 Approximating the Sum of ℓp-Sensitivities for p 1 Input: Matrix A Rn d, approximation factor γ (0.01, 1), and a scalar p 1 Output: A positive scalar that satisfies, with probability 0.99, that Sp(A) bs (1 + O(γ))Sp(A) 1: Compute all ℓp Lewis weights of A, denoting the ith weight by wp(ai) 2: Define the sampling vector v Rn 0 by vi = wp(ai) d for all i [n] 3: Sample m = O(d|1 p/2|) rows with replacement, picking the ith row with a probability of vi 4: Construct a random ℓp sampling matrix Sp A with probabilities {vi}n i=1 (cf. Fact 3.4) 5: For each sampled row ij, j [m], compute rj = σ Sp A p (aij ) vij 6: Return 1 m Pm j=1 rj Fact 3.3 (Lewis Weights Bounds Sensitivities, [42, 82]). Given a matrix A Rn d and p 1, the ℓp sensitivity σA p (ai) and ℓp Lewis weight wp(ai) of the row ai satisfy, for all i [n], σA p (ai) d max(0,p/2 1) wp(ai). Fact 3.4 (Sampling via Lewis Weights [23, 35]). Given A Rn d and p > 0. Consider a random diagonal ℓp sampling matrix S Rn n with sampling probabilities {pi} proportional to the ℓp Lewis weight of A, i.e., for each i [n], the ith diagonal entry is independently set to be Si,i = 1/p1/p i with probability pi 0 otherwise . Then, with high probability, S with O(ε 2dmax(1,p/2)(log d)2 log(d/ε)) rows is an ℓp subspace embedding for A (cf. Definition 2.3). Theorem 3.5. Given a matrix A Rn d and an approximation factor γ (0, 1), there exists an algorithm, which returns a positive scalar bs such that, with a probability 0.99, we have Sp(A) bs (1 + O(γ))Sp(A). Our algorithm s runtime is e O nnz(A) + 1 γ2 d|p/2 1|LP(O(dmax(1,p/2), d, p) . Proof. Without loss of generality, we may assume A to be full rank. Then, its Lewis weights satisfy Pn i=1 wp(ai) = d. Per Line 2 of Algorithm 2, our sampling distribution is chosen to be vi = wp(ai) d for all i [n]. We sample from A rows with replacement, with row i picked with a probability of vi. From the definition of vi in Line 2 and rj in Line 5 and the fact that Sp A is a constant factor ℓp subspace embedding of A, our algorithm s output satisfies the following unbiasedness condition: σSp A p (aij) vij vij = SSp A p (A) 2Sp(A). By independence, we also have the following bound on the variance of 1 σSp A p (ai)2 σSp A p (ai)2 with the final step holding by the choice of vi in Line 2. In the case that p 2, we have σSp A p (ai)2 i [n] σSp A p (ai) dp/2 1 2dp/2 where the first inequality uses Fact 3.3. Therefore, applying Chebyshev s inequality on 1 (as defined in Line 5) with m = O dp/2 Sp(A)γ2 gives us a γ-multiplicative accuracy in approximating Sp(A) (with the desired constant probability). For p 2, we additionally have [24, Theorem 1.7] the lower bound Sp(A) d, which when plugged into the value of m gives the claimed sample complexity. A similar argument may be applied for the case p < 2; specifically, we have that σSp A p (ai)2 i [n] σSp A p (ai) 2d where the second step is by wp(ai) σp(ai) from Fact 3.3. For p < 2, we also have Sp(A) dp/2 from [24, Theorem 1.7]. Applying Chebyshev s inequality with this fact gives a sample complexity of m = O d1 p/2/γ2 . This completes the correctness guarantee. Runtime. We first compute all ℓp Lewis weights of A Rn d up to a constant multiplicative accuracy, the cost of which is O 1 1 |1 p/2| log(log(n)) leverage score computations for p < 4 [23] and O(p3 log(np)) leverage score computations for p 4 [77]. Next, we compute m = O(d|1 p/2|) sensitivities with respect to Sp A. From [23, 35], Sp has, with high probability, O(dp/2 log d) rows when p > 2 and O(d log d) rows when p 2. 
Summing the cost of computing these m sensitivities and that of computing the Lewis weights gives the claimed runtime. Remark 3.6. We present in Appendix B.2 an alternate algorithm, Algorithm 4, for estimating the total ℓp sensitivity for p = 1. Algorithm 4 uses recursive computations of leverage scores, in contrast to Algorithm 2 which uses Lewis weights in a one-shot manner, and may be of independent interest. 3.2 Approximating the maximum ℓ1 sensitivity In this section, we present an algorithm that approximates σ1(A) = maxi [|A|] σ1(ai), the maximum of the ℓ1 sensitivities of the rows of a matrix A. As alluded to in Section 1, a first approach to this problem would be to simply estimate all n sensitivities and compute their maximum. To do better than this, one idea, inspired by the random hashing approach of Algorithm 1, is as follows. If the matrix has a large number of high-sensitivity rows, then, intuitively, the appropriately scaled maximum sensitivity of a uniformly sampled subset of these rows should approximate σ1(A) . Specifically, assume that the matrix has at least α rows of sensitivity at least σ1(A) / α; uniformly sample n/α rows and return eσa, the (appropriately scaled) maximum of σS1A 1 (ai), for the sampled rows i. Then, σ1(A) eσa O( α σ1(A) ) with a constant probability. Here the upper bound guarantee is without any condition on the number of rows with large ℓ1 sensitivities. In the other case, if the matrix does not have too many high-sensitivity rows, we could estimate the maximum sensitivity by hashing rows into small buckets via Rademacher combinations of uniformly sampled blocks of α rows each (just like in Algorithm 1). Then, eσb, the scaled maximum of the ℓ1 sensitivities of these rows satisfies σ1(A) eσb O( α σ1(A) ). Here, it is the lower bound that comes for free (i.e., without any condition on the number of rows with large ℓ1 sensitivities). Despite the above strategies working for each case, there is no way to combine these approaches without knowledge of whether the input matrix has a enough high-sensitivity rows or not. We therefore avoid this approach and instead present Algorithm 3, where we make use of S A, an ℓ subspace embedding of A. Our algorithm hinges on the recent development [75, Theorem 1.3] on efficient construction of such an embedding with simply a subset of e O(d) rows of A. Algorithm 3 Approximating the Maximum of ℓ1-Sensitivities Input: Matrix A Rn d Output: Scalar bs R 0 that satisfies σ1(A) bs C d σ1(A) 1: Compute, for A, an ℓ subspace embedding S A RO(d log2(d)) d such that S A is a subset of the rows of A [75, Theorem 1.3] 2: Compute, for A, an ℓ1 subspace embedding S1A RO(d) d d σS1A 1 (S A) Theorem 3.7 (Approximating the Maximum of ℓ1 Sensitivities). Given a matrix A Rn d, there exists an algorithm, which in time e O(nnz(A) + dω+1), outputs a positive scalar bs that satisfies Ω( σ1(A) ) bs O( Proof Sketch of Theorem 3.7; full proof in Appendix B.3. We achieve our guarantee via Algorithm 3. Define x and ai as: x , i = arg maxx Rd,i [|A|] |a i x| Ax 1 . Since S1A is an ℓ1 subspace embedding of A, we have, for any x Rd, that S1Ax 1 = Θ( Ax 1). If S A contains the row ai , then S1Ax 1=Θ( Ax 1) implies σS1A 1 (S A) = Θ(1) σ1(A) . In the other case, suppose ai is not included in S A. Then we observe that σS1A 1 (S A) = max x Rd,cj S A c j x S1Ax 1 max x Rd S Ax S1Ax 1 = Θ(1) max x Rd S Ax (3.2) where the the second step is by choosing a specific vector in the numerator and the third step uses S1Ax 1 = Θ( Ax 1). 
We further have
$$\max_{x \in \mathbb{R}^d} \frac{\|S_\infty A x\|_\infty}{\|Ax\|_1} \;\geq\; \frac{\|S_\infty A x^\star\|_\infty}{\|Ax^\star\|_1} \;\geq\; \frac{1}{\sqrt{d}} \cdot \frac{|a_{i^\star}^\top x^\star|}{\|Ax^\star\|_1} \;=\; \frac{1}{\sqrt{d}} \, \|\sigma_1(A)\|_\infty, \tag{3.3}$$
where the first step is by choosing x = x⋆, the second step is by the guarantee of the ℓ∞ subspace embedding, and the final step is by the definition of σ1(a_{i⋆}). Combining Inequality (3.2) and Inequality (3.3) gives the claimed lower bound on ‖σ1^{S1A}(S∞A)‖∞. The runtime follows from the cost of computing S∞A from [75] and that of computing Õ(d) ℓ1 sensitivities with respect to S1A.

Figure 1: Average absolute log ratios for all ℓp sensitivity approximations for wine, with the theoretical additive bound (dashed).

4 Numerical experiments

We demonstrate our fast sensitivity approximations on multiple real-world datasets in the UCI Machine Learning Dataset Repository [46], such as the wine and fires datasets, for different p and varying approximation parameters α. We focus on the wine dataset here, with full details in Appendix E.

Experimental Setup. For each dataset and each p, we first apply the Lewis weight subspace embedding to compute a smaller matrix SA. Then, we run our sensitivity computation Algorithm 1 and measure the average and maximum absolute log ratio |log(σ_approx/σ_true)| to capture the relative error of the sensitivity approximation. Note that this is stricter than our guarantees, which give only an O(α) additive error; therefore, we have no real upper bound on the maximum absolute log ratio due to small sensitivities. We plot an upper bound of log(α^p), which is the worst-case additive error approximation, although it does provide relative error guarantees with respect to the average sensitivity. We compare our fast algorithm for computing the total sensitivity with the naive brute-force method of approximating all sensitivities and then averaging.

Analysis. In practice, we found most real-world datasets have easy-to-approximate sensitivities, with total ℓp sensitivity about 2-5 times lower than the theoretical upper bound. Figure 1 shows that the average absolute log ratios for approximating the ℓp sensitivities are much smaller than those suggested by the theoretical upper bound, even for large α. Specifically, when α = 40, we incur only a 16x accuracy deterioration in exchange for a 40x faster algorithm. This is significantly better than the worst-case accuracy guarantee, which for p = 3 would be α^p = 40^3. More importantly, we find that empirical total sensitivities are much lower than the theoretical upper bounds suggest, often by a factor of at least 5, especially for p > 2. This implies that real-world structure can be utilized for improved data compression and justifies the importance of sensitivity estimation for real-world datasets. Furthermore, our novel algorithm approximates the total sensitivity up to an overestimate within a factor of 1.3, in less than a quarter of the time of the brute-force algorithm. Our observations generalize to other real-world datasets (see Appendix E).

| p | Total Sensitivity Upper Bound | Brute-Force/True | Approximation | Brute-Force Runtime (s) | Approximate Runtime (s) |
|---|---|---|---|---|---|
| 1 | 14 | 5.2 | 6.4 | 667 | 105 |
| 1.5 | 14 | 11.6 | 14.2 | 673 | 131 |
| 2.5 | 27.1 | 13.8 | 14.9 | 693 | 209 |
| 3 | 52.4 | 7.2 | 8.9 | 686 | 192 |

Table 1: Runtime comparison for computing total sensitivities for the wine dataset, which has matrix shape (177, 14).

Acknowledgements

We are very grateful to: Sagi Perel, Arun Jambulapati, and Kevin Tian for helpful discussions about related work; Taisuke Yasuda for his generous help discussing results in ℓp sensitivity sampling; and our NeurIPS reviewers for their time, effort, and many constructive suggestions.
Swati gratefully acknowledges funding for her student researcher position from Google Research and research assistantship at UW from the NSF award CCF-1749609 (via Yin Tat Lee). Part of this work was done while D. Woodruff was at Google Research. D. Woodruff also acknowledges support from a Simons Investigator Award. [1] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, 1951. 1 [2] Léon Bottou and Yann Cun. Large scale online learning. Advances in neural information processing systems, 16, 2003. [3] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004. [4] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421 436. Springer, 2012. 1 [5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1 27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. 1 [6] Nicolas Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in neural information processing systems, 25, 2012. [7] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(2), 2013. [8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013. [9] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. Advances in neural information processing systems, 26, 2013. [10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014. [11] Julien Mairal. Incremental majorization-minimization optimization with application to largescale machine learning. SIAM Journal on Optimization, 25(2), 2015. [12] Zeyuan Allen-Zhu and Yang Yuan. Improved svrg for non-strongly-convex or sum-of-nonconvex objectives. In International conference on machine learning, 2016. [13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, 2016. [14] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1), 2017. 1 [15] Michael Langberg and Leonard J Schulman. Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, 2010. 2, 4 [16] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, 2011. 2, 4 [17] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/ paper/2014/file/f29c21d4897f78948b91f03172341b7b-Paper.pdf. 2 [18] Peilin Zhao and Tong Zhang. 
Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1 9, 2015. [19] Angelos Katharopoulos and François Fleuret. Biased importance sampling for deep neural network training. ar Xiv preprint ar Xiv:1706.00043, 2017. [20] Tyler B. Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, 2018. [21] Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, 2022. 2 [22] Vladimir Braverman, Dan Feldman, Harry Lang, Adiel Statman, and Samson Zhou. New frameworks for offline and streaming coreset constructions. ar Xiv preprint ar Xiv:1612.00889, 2016. 2 [23] Michael B Cohen and Richard Peng. Lp row sampling by lewis weights. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, 2015. 2, 4, 5, 6, 7, 8, 28, 29 [24] David Woodruff and Taisuke Yasuda. Sharper bounds for ℓp sensitivity sampling. In International Conference on Machine Learning, pages 37238 37272. PMLR, 2023. 2, 3, 7, 8 [25] Raphael Meyer, Cameron Musco, Christopher Musco, David P Woodruff, and Samson Zhou. Fast regression for structured inputs. In International Conference on Learning Representations (ICLR), 2022. 2, 6 [26] Kasturi Varadarajan and Xin Xiao. On the sensitivity of shape fitting problems. In 32nd International Conference on Foundations of Software Technology and Theoretical Computer Science, 2012. 2, 4 [27] Vladimir Braverman, Dan Feldman, Harry Lang, Adiel Statman, and Samson Zhou. Efficient coreset constructions via sensitivity sampling. Proceedings of Machine Learning Research, 157 (2021), 2021. 6 [28] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In International Conference on Artificial Intelligence and Statistics, 2022. 2, 4, 6 [29] Olivier Bachem, Mario Lucic, and Andreas Krause. Coresets for nonparametric estimation-the case of dp-means. In International Conference on Machine Learning, 2015. 2, 4 [30] Mario Lucic, Olivier Bachem, and Andreas Krause. Strong coresets for hard and soft bregman clustering with applications to exponential family mixtures. In Artificial intelligence and statistics, 2016. 2, 4 [31] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. Advances in Neural Information Processing Systems, 29, 2016. 2, 4 [32] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David Woodruff. On coresets for logistic regression. Advances in Neural Information Processing Systems, 31, 2018. 2, 4 [33] Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 2006. 2, 4, 6 [34] Michael W Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 2011. 2, 4, 6 [35] David P Woodruff and Taisuke Yasuda. Online lewis weight sampling. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2023. 
2, 6, 8, 26, 28, 29 [36] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P Woodruff. Algorithms for ℓp low-rank approximation. In International Conference on Machine Learning, 2017. 2 [37] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12), 2005. 2 [38] Morteza Alamgir and Ulrike Luxburg. Phase transition in the family of p-resistances. Advances in neural information processing systems, 24, 2011. 2 [39] Jeff Calder. Consistency of lipschitz learning with infinite unlabeled data and finite labeled data. SIAM Journal on Mathematics of Data Science, 1(4), 2019. [40] Mauricio Flores, Jeff Calder, and Gilad Lerman. Analysis and algorithms for ℓp-based semisupervised learning on graphs. Applied and Computational Harmonic Analysis, 60, 2022. 2 [41] Abderrahim Elmoataz, Matthieu Toutain, and Daniel Tenbrinck. On the p-laplacian and - laplacian on graphs with applications in image and data processing. SIAM Journal on Imaging Sciences, 8(4), 2015. 2 [42] Kenneth Clarkson, Ruosong Wang, and David Woodruff. Dimensionality reduction for tukey regression. In International Conference on Machine Learning, 2019. 3, 4, 7 [43] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014. 3 [44] Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent irls algorithm for p-norm linear regression. Advances in Neural Information Processing Systems, 32, 2019. 4, 6, 30 [45] Arun Jambulapati, Yang P Liu, and Aaron Sidford. Improved iteration complexities for overconstrained p-norm regression. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022. 4, 6, 30 [46] Arthur Asuncion and David Newman. Uci machine learning repository, 2007. 4, 10 [47] Anant Raj, Cameron Musco, and Lester Mackey. Importance sampling via local sensitivity. In International Conference on Artificial Intelligence and Statistics, 2020. 4 [48] Alaa Maalouf, Adiel Statman, and Dan Feldman. Tight sensitivity bounds for smaller coresets. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2051 2061, 2020. 4 [49] Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33:6123 6135, 2020. 4 [50] Dingwen Kong, Ruslan Salakhutdinov, Ruosong Wang, and Lin F Yang. Online sub-sampling for reinforcement learning with general function approximation. ar Xiv preprint ar Xiv:2106.07203, 2021. 4 [51] Lucas Liebenwein, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. Provable filter pruning for efficient neural networks. In International Conference on Learning Representations, 2019. 4 [52] Murad Tukan, Loay Mualem, and Alaa Maalouf. Pruning neural networks via coresets and convex geometry: Towards no assumptions. Advances in Neural Information Processing Systems, 2022. [53] Ben Mussay, Dan Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. Dataindependent structured pruning of neural networks via coresets. IEEE Transactions on Neural Networks and Learning Systems, 33(12):7829 7841, 2021. 4 [54] D Lewis. Finite dimensional subspaces of ℓp. Studia Mathematica, 1978. 4, 5 [55] Gideon Schechtman. More on embedding subspaces of ℓp in ℓn r . Compositio Mathematica, 1987. 
[56] Jean Bourgain, Joram Lindenstrauss, and Vitali Milman. Approximation of zonoids by zonotopes. 1989. [57] Michel Talagrand. Proceedings of the American Mathematical Society, 1990. [58] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. [59] Gideon Schechtman and Artem Zvavitch. Embedding subspaces of ℓp into ℓn p, 0 < p < 1. Mathematische Nachrichten, 2001. 4 [60] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, 2015. 4, 5, 19 [61] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1), 2012. 4 [62] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 2013. 4 [63] Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018. 4 [64] Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, and Supratim Shit. Streaming coresets for symmetric tensor factorization. In International Conference on Machine Learning, 2020. 4 [65] David Durfee, Kevin A Lai, and Saurabh Sawlani. ℓ1 regression using lewis weights preconditioning and stochastic gradient descent. In Conference On Learning Theory, 2018. 4 [66] Yi Li, Ruosong Wang, Lin Yang, and Hanrui Zhang. Nearly linear row sampling algorithm for quantile regression. In Proceedings of the 37th International Conference on Machine Learning, 2020. 4 [67] Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020. 4, 6 [68] Aditi Laddha, Yin Tat Lee, and Santosh Vempala. Strong self-concordance and sampling. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, 2020. 4 [69] Yi Li, Ruosong Wang, Lin Yang, and Hanrui Zhang. Nearly linear row sampling algorithm for quantile regression. In International Conference on Machine Learning, 2020. 4 [70] Arvind V Mahankali and David P Woodruff. Optimal ℓ1 column subset selection and a fast ptas for low rank approximation. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021. 4 [71] Tung Mai, Cameron Musco, and Anup Rao. Coresets for classification simplified and strengthened. Advances in Neural Information Processing Systems, 34, 2021. 4 [72] Xue Chen and Michal Derezinski. Query complexity of least absolute deviation regression via robust uniform convergence. In Conference on Learning Theory, 2021. [73] Aditya Parulekar, Advait Parulekar, and Eric Price. L1 regression with lewis weights subsampling. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2021. 4 [74] Frederic G Foster. On the stochastic matrices associated with certain queuing processes. The Annals of Mathematical Statistics, 24(3), 1953. 5 [75] David P Woodruff and Taisuke Yasuda. High-dimensional geometric streaming in polynomial space. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 732 743, 2022. 
5, 9, 28 [76] David P Woodruff Cameron Musco, Christopher Musco, David P Woodruff, and Taisuke Yasuda. Active sampling for linear regression beyond the l2 norm. Co RR, 2021. 5 [77] Maryam Fazel, Yin Tat Lee, Swati Padmanabhan, and Aaron Sidford. Computing lewis weights to high precision. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2022. 5, 6, 7, 8 [78] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1 2), 2014. 5 [79] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 2007. 6 [80] Christian Sohler and David P Woodruff. Subspace embeddings for the ℓ_1-norm with applications. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 755 764, 2011. 6 [81] David Woodruff and Qin Zhang. Subspace embeddings and ℓ_p-regression using exponential random variables. In Conference on Learning Theory, pages 546 567. PMLR, 2013. 6 [82] Cameron Musco, Christopher Musco, David P Woodruff, and Taisuke Yasuda. Active linear regression for ℓp norms and beyond. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 2022. 7 [83] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM (JACM), 68(1), 2021. 17 [84] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2020. 17 A Omitted proofs: general technical results Fact A.1 (Crude sensitivity approximation via leverage scores). Given a matrix A Rn d, let σ1(ai) denote the ith ℓ1 sensitivity of A with respect to A, and let τi(A) denote its ith leverage score with respect to A. Then we have q Proof. The proof of this claim follows from a simple application of standard norm inequalities, and we present one here for completeness. For any u Rn, we have u 2 u 1 n u 2. Therefore, for any x Rd, we have |x ai| n Ax 2 |x ai| Ax 1 |x ai| Ax 2 . (A.1) Let xτ be the vector that realizes the ith leverage score, and xσ be the vector that realizes the ith ℓ1 sensitivity (i.e. they are the maximizers of sensitivity). Then, we may conclude from Inequality (A.1) that s n Axτ 2 2 |x τ ai| Axτ 1 |x σ ai| Axσ 1 which gives the claimed result. Fact A.2. Given an n d matrix, the cost of computing k of its ℓ1 sensitivities is e O(nnz(A)+k dω). Proof. Given an n d matrix A, we first compute SA, its ℓ1 subspace embedding of size O(d) d, and compute the ith sensitivity of A as follows. σA 1 (ai) = max x Rd |a i x| Ax 1 = max x Rd |a i x| Θ( SAx 1) O(1) σSA 1 (ai), where the second step is by the definition of ℓ1 subspace embedding (cf. Definition 2.3). To compute σSA 1 (ai), we need to compute mina i x=1 SAx 1, which, by introducing a new variable for each of the rows of SAx, can be transformed into a linear program with O(d) variables and d constraints. To see the claim on runtime, the cost of computing the subspace embedding is nnz(A). Having computed this once, we can then solve k linear programs, each at cost dω (which is the current fastest LP solver [83]). Putting together these two costs gives the claimed runtime. Lemma A.3. Given positive numbers a1, a2, . . . , am, let r := maxi [m] ai mini [m] ai and A(true) := P i [m] ai. Suppose we sample, uniformly at random with replacement, a set S [m] of these numbers, and let A(est) := m i:ai S ai. 
Then, if |S| 10 r(1+γ) δ for a large enough absolute constant C, we can ensure with at least a probability of 1 δ, that |A(est) A(true)| γA(true). Proof. By construction, the expected value of a sample ai in S is 1 m A(true). Denoting by P the uniform distribution over a1, a2, . . . , am, and let σ(P) be the standard deviation of this distribution P. Then, we can apply Bernstein s inequality [84, Theorem 2.8.1] to see that the absolute error |A(est) A(true)| satisfies the following guarantee for all t > 0: Pr n |A(est) A(true)| t o = Pr m A(true) t |S| 1 2 t2 |S|2 |S| σ(P)2+ 1 m maxi [m] ai m2 σ(P)2+ 1 3 m t maxi [m] ai Since each of the ais is positive, we note that i [m] a2 i 1 m max i [m] ai X i [m] ai = 1 m max i [m] ai A(true). (A.3) Combining Inequality (A.2) and Inequality (A.3) for t = γA(true) implies that Pr n |A(est) A(true)| γA(true)o e 1 2 γ2A(true)|S| 3 γ) maxi [m] ai holds for the choice of |S| 10 m γ2 (1 + γ) maxi [m] ai A(true) log 1 δ yields the claimed error guarantee. B Omitted proofs: ℓ1 sensitivities B.1 Estimating all ℓ1 sensitivities Theorem 3.1. Given a full-rank matrix A Rn d and an approximation factor 1 < α n, let σ1(ai) be the ℓ1 sensitivity of the ith row of A. Then there exists an algorithm that, in time e O nnz(A) + n α dω , returns eσ Rn 0 such that with high probability, for each i [n], σ1(ai) eσi O(σ1(ai) + α n S1(A)). (3.1) Proof. Note that in Line 2 of Algorithm 1, the rows of A are partitioned into randomly created n/α blocks. Suppose ai, the ith row of A, falls into the block Bℓ. Recall, that in Line 5, the rows from Bℓare mapped to those in P with row indices in the set J. Then, we compute the ith ℓ1 sensitivity estimate as eσi = maxj J σSA 1 (pj). We observe that for all x Rd, SAx 1 = Θ( Ax 1). (B.1) We use Equation (B.1) below to establish the claimed bounds. Let x = arg maxx Rd |a i x| Ax 1 be the vector that realizes the ith ℓ1 sensitivity of A. Further, let the jth row of P be r(ℓ) k Bℓ. Then, with a probability of at least 1/2, we have σSA 1 (pj) = max x Rd |r(ℓ) k Bℓx| SAx 1 |r(ℓ) k Bℓx | SAx 1 = Θ(1)|r(ℓ) k Bℓx | Ax 1 Θ(1) |a i x | Ax 1 = Θ(1)σ1(ai), where the first step is by definition of σSA 1 (pj), the second is by evaluating the function being maximized at a specific choice of x, the third step is by applying SAx 1 = Θ( Ax 1) from Equation (B.1), and the final step is by definition of x and σ1(ai). For the fourth step, we use the fact that |rℓ k Bℓx | = |rℓ k,ia i x + P j =i rℓ k,j(Bℓx )j| |a i x | with a probability of at least 1/2 since the vector rℓ k has coordinates that are +1 or 1 with equal probability. By a union bound over the |J| independent rows that block Bℓis mapped to, we establish the claimed lower bound in Inequality (B.2) with a probability of at least 0.9. To show an upper bound on σSA 1 (pj), we observe that σSA 1 (pj) = max x Rd |r(ℓ) k Bℓx| SAx 1 max x Rd Bℓx 1 SAx 1 = Θ(1) max x Rd Bℓx 1 Ax 1 Θ(1) X j:aj Bℓ max x Rd |a j x| Ax 1 , where the second step is by Hölder inequality and the fact that the entries of r(ℓ) k are all bounded between 1 and +1, and the third step is by Equation (B.1). The final term of the above chain of inequalities may be rewritten as Θ(1) P j:aj Bℓsj(A), by the definition of Bℓ. Because Bℓis a group of α rows selected uniformly at random out of n rows and contains the row ai, we have: j:aj Bℓ, j =i j =i σ1(aj). 
Therefore, Markov inequality gives us that with a probability of at least 0.9, we have σSA 1 (pj) σ1(aj) + S1(A)O( α The median trick then establishes the bounds in Inequality (B.2) and Inequality (B.4) with high probability. The claimed runtime is obtained by the cost of constructing SA plus O(n/α) computations of ℓ1 sensitivities with respect to SA, for which we may use Fact 2.4. B.2 Estimating the sum of ℓ1 sensitivities In this section, we present a randomized algorithm that approximates the total sensitivity S1(A), up to a γ-multiplicative approximation for some γ (0, 1) that is not too small. This algorithm is significantly less general than Algorithm 2 (which holds for all p 1) but it uses extremely simple facts about leverage scores and what we think is a new recursion technique in this context, which we believe could be of potential independent interest. Theorem B.1. Given a matrix A Rn d and an approximation factor γ (0, 1) such that n3 , there exists an algorithm, which in time e O nnz(A) + dω γ2 , returns a positive scalar bs such that, with a probability of 0.99, we have S1(A) bs (1 + O(γ))S1(A). Algorithm 4 Approximating the Sum of ℓ1-Sensitivities Input: Matrix A Rn d and approximation factor γ (0, 1) Output: A positive scalar that satisfies, with probability 0.99, that S1(A) bs (1 + O(γ))S1(A) 1: Compute SA RO(d) d, a 1/2-approximate ℓ1 subspace embedding of A (cf. Definition 2.3) 2: Set ρ = O(γ/log(log(n + d))). Compute S A, a ρ-approximate ℓ1 subspace embedding of A 3: Construct a matrix b A comprising only those rows of A with leverage scores at least 1/n10 4: Let s be the output of Algorithm 5 with inputs b A, SA, S A, n, and ρ 5: Return bs := (1 + γ) s + 1 n5 (|A| | b A|) B.2.1 Proof sketch. We achieve our guarantee via Algorithm 4 and Algorithm 5, which are based on the following key principles. The first is Fact 2.2: the ith sensitivity and ith leverage score of A Rn d satisfy q τi(A). The second idea, derived from Bernstein s inequality (and formalized in Lemma A.3), is: the sum of a set of positive numbers with values bounded between a and b can be approximated to a γ-multiplicative factor using e O((b/a)γ 2) uniformly sampled (and appropriately scaled) samples. The third principle is that σ1(ai) σSA 1 (ai) when SA is a constant-approximation ℓ1 subspace embedding of A. Based on these ideas, we first divide the rows of A into B = Θ(log n) submatrices based on which of the B buckets [1/2, 1], [1/4, 1/2], [1/8, 1/4], etc. their leverage scores (computed with respect to A) fall into. Since each of these B submatrices comprises rows with leverage scores in the range [a, 2a] for some a, per Fact 2.2, we have p a 2a. Therefore, by Lemma A.3, the total sensitivity of this submatrix may be approximated by the appropriately scaled total sensitivity of O( n) of its uniformly sampled rows. Our design choice to split rows based on leverage scores is based on the fact that leverage scores (unlike sensitivities) are efficiently computable (using [60]). Computing the total sensitivity of an e O( n)-sized matrix thus costs e O(nnz(A) + n dω) under this scheme. Applying the idea sketched above recursively further reduces the cost. Suppose, after dividing the input matrix A into B submatrices Mi and subsampling f Mi from these matrices, we try to recurse on each of these B submatrices f Mi. 
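A minimal numerical sketch of the scheme just described, under our own simplifications: exact leverage scores from a thin QR factorization, geometric bucketing, and the uniform-subsampling estimator of Lemma A.3, without the subspace embeddings or the recursion of Algorithm 5.

```python
import numpy as np

def leverage_scores(A):
    """tau_i(A) = ||q_i||_2^2 for the thin QR factorization A = QR."""
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)

def bucketed_sum_estimate(values, buckets, samples_per_bucket, rng):
    """Estimate sum(values) by uniform sampling within each bucket (Lemma A.3)."""
    total = 0.0
    for idx in buckets:
        if idx.size == 0:
            continue
        s = rng.choice(idx, size=min(samples_per_bucket, idx.size), replace=True)
        total += idx.size * values[s].mean()   # (|bucket| / |S|) * sum of samples
    return total

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))
tau = leverage_scores(A)

# Geometric buckets [2^-(k+1), 2^-k): rows in a bucket have comparable leverage
# scores, hence (by Fact 2.2) l1 sensitivities within a bounded ratio.
levels = np.clip(np.floor(-np.log2(tau)).astype(int), 0, 60)
buckets = [np.flatnonzero(levels == k) for k in range(levels.max() + 1)]

# We estimate the (cheap) sum of leverage scores as a stand-in for the sum of
# sensitivities, purely to illustrate the estimator; the true sum is exactly d.
est = bucketed_sum_estimate(tau, buckets, samples_per_bucket=50, rng=rng)
print(est, tau.sum())   # both close to d = 10
```

In Algorithm 5, the per-bucket sums are of sensitivities rather than leverage scores, and the sampled submatrices are recursed on rather than summed directly.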
A key requirement for this idea to work is that the sum of the ℓ1 sensitivities of the rows of f Mi in isolation be close enough to the true sum of ℓ1 sensitivities of Algorithm 5 Subroutine for Approximating the Sum of ℓ1-Sensitivities Input: Matrices M Rr d, SA RO(d) d, S A RO(d) d, scalars n and ρ (0, 1) Output: A positive scalar bs that satisfies, with probability 0.99, that SS A 1 (M) bs (1 + O(ρ))SS A 1 (M). 1: Set B = Θ(log n), D = O(log(log(n + d))), δ = 0.01 BD and b = e O D4 2: if the number of rows of M is at most b then 3: Let SS A 1 (M) be the sum of ℓ1 sensitivities of the rows of M with respect to S A 4: Using Fact 2.4 on M, compute bs such that SS A 1 (M) bs (1 + O(ρ))SS A 1 (M) 5: Return bs 6: else 7: Construct the matrix C := M SA . Set r = O( p |C|(1 + ρ)ρ 2 log(δ 1)). 8: Compute τ(C), the vector of leverage scores of C 9: Divide 1 n20 , 1 into B geometrically decreasing sub-intervals [1/2, 1], [1/4, 1/2], . . . , etc. 10: for each bucket i [B] created in the previous step do 11: Construct matrix Mi with those rows of M whose leverage scores fall in the ith bucket 12: Construct matrix f Mi composed of r uniformly sampled (with replacement) rows of Mi 13: Compute esi, the output of Algorithm 5 with inputs f Mi, SA, S A, n, and ρ. 14: end for 15: Return bs := (1 + ρ) P i [B] |Mi| |f Mi|esi 16: end if those rows with respect to the original matrix A. In general, this is not true. However, we can meet our requirement using the ℓ1 subspace embedding SA of the (original) matrix A. Specifically, if we vertically concatenate a matrix f Mi with SA and call this matrix C, then for each row f Mi[j] f Mi, we have σC 1 (f Mi[j]) σA 1 (f Mi[j]). Further, |C| O( n), and so in the step subsampling the rows of f Mi, we choose O(n1/4) samples. Recursing this exponentially decreases the row dimension of the input matrix at each level, with (log n)k matrices at the kth level of recursion. B.2.2 Helper lemmas for proving Theorem B.1 For our formal proof, we need Definition B.2, Fact B.3, and Definition B.4. Definition B.2 (Notation for recursion tree of Algorithm 5). Recall that B = Θ(log(n)) as in Line 1 of Algorithm 5. Then we have the following notation corresponding to Algorithm 5. We define a rooted B-ary tree T corresponding to the execution of Algorithm 5. We use Tj to denote the subtree starting from the root node with all the nodes up to and including those at the jth level. We denote the ith node at the jth level by T(j,i). Thus, every node is uniquely specified by its level in the tree and its index within that level. The root node T(1,1) at level 1 corresponds to the first call to Algorithm 5 from Algorithm 4. For the node T(j,i), we denote by M(j,i) the input matrix to the corresponding recursive call; note that all the other inputs are global and remain the same throughout the recursive call. Analogously, we use bs(j,i) to denote our estimate of SS A 1 (M(j,i)), the sum of ℓ1 sensitivities of rows of the matrix M(j,i) in the node T(j,i), defined with respect to the sketched matrix S A. Fact B.3. If event A happens with probability 1 τ1 and event B happens with probability 1 τ2 conditioned on event A, then the probability of both events A and B happening is at least 1 τ1 τ2. Definition B.4. 
We say that a node N with matrix M satisfies the (bρ, bδ)-approximation guarantee if bs, the output of Algorithm 5 on M, satisfies the following guarantee for the true sum SS A 1 (M) of ℓ1 sensitivities of the rows in M: bs bρ SS A 1 (M) with a probability at least 1 bδ for some approximation parameter bρ (0, 1) and an error probability parameter bδ (0, 1). Lemma B.5 (Partitioning of M). In Algorithm 5, even though we ignore the range 0, 1 n20 when creating the submatrices M1, M2, . . . , MB from M, every row in M is part of exactly one of the submatrices M1, M2, . . . , MB. In other words, the matrix M is partitioned into the matrices M1, M2, . . . , MB, with no row missed out. Proof. Our proof strategy is to show that for all row indices k [|M|], each of the leverage scores τk(C) satisfy τk(C) 1 n20 . Therefore, creating buckets of rows with the range of leverage score starting at 1 n20 (instead of 0) does not miss out any rows. To this end, we note that for every row ak in the input M to Algorithm 5, we have: r |A| σA 1 (ak) = max x Rd |x ak| Ax 1 max x Rd 3|x ak| Cx 1 = 3σC 1 (ak) 3 p (B.5) where the first step is because in Line 3 of Algorithm 4, we discard those rows ak of A with leverage scores smaller than 1/n10 and there are n rows in A, so every row ak in the input matrix M to Algorithm 5 satisfies τk(A) 1 n10 ; the second step is by Fact 2.2 applied to ak; the third step is by the definition of the ℓ1 sensitivity of ak with respect to A; the fourth step is by observing that Cx 1 = SAx 1 + Mx 1 2 Ax 1 + Ax 1 = 3 Ax 1 since SA is a constant-factor (with the constant assumed 1/2) ℓ1 subspace embedding of A and because M is a submatrix of A; the fifth step is by the definition of ℓ1 sensitivity of ak with respect to C; the final step is by Fact 2.2 applied to ak within C. Observing the first and last terms of this inequality chain, we have τk(C) 1 3n11 for all k [|C|]. This finishes the proof of the claim. Lemma B.6. For the recursion tree T corresponding to Algorithm 5, let the size b of the input matrices at the leaf nodes be b = 50D3 log(100B) γ2 2D3 log(100B) d log(d) . Then, the depth of the recursion tree is less than D := 1 + log(log(2n + 2d log(d))). Proof. For now, let β be arbitrary but less than β2 n + d , where d is the row dimension of the subspace embedding SA. Then, the size of the input matrix to the nodes of the recursion tree as we increase in depth evolves as d + n. . . . Thus, this sequence evolves as d + xt, with x0 = n + d . To see the limit of this sequence, we set x = β d + x, and solve for this quadratic equation as We set the base case to be b := 7β(β + d ), a constant times larger than this asymptotic limit. We now compute the depth t at which the recursion attains the value b is attained. To do so, we consider a sequence {yt} which overestimates the sequence {xt} of matrix sizes for a big range of t. We define {yt} as y0 = n + d , and yt+1 = β p 2yt, and it can be checked that for t such that yt d , we have yt+1 xt+1. Moreover, for all t 1, yt = β1+1/2+ +1/2t 1(2y0)1/2t = β2 21 t(2y0)1/2t. The limit of this sequence can easily be seen to β2. We now split our analysis in two cases based on the limits of these two sequences. Case 1: β2 d . Since the limit of the sequence {yt} is β2, it is easy to see that yt is always greater d (since yt is monotonically decreasing as y0 β2). Thus, xt will always be less than yt in this case. We will now show that yt (and thus xt) will be less than b in t0 = 2 log log(2y0) steps. 
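The doubly logarithmic depth bound can also be sanity-checked numerically: iterating x_{t+1} = β√(d′ + x_t) from x_0 = n + d′ reaches the base-case size b ≈ 7β(β + √d′) within roughly log log n steps. The sketch below uses arbitrary illustrative parameter values, not the constants of Algorithm 5.

```python
import math

def recursion_depth(n, d_prime, beta):
    """Iterate x_{t+1} = beta * sqrt(d' + x_t) from x_0 = n + d' until the
    base-case size b = 7 * beta * (beta + sqrt(d')) is reached."""
    b = 7 * beta * (beta + math.sqrt(d_prime))
    x, t = n + d_prime, 0
    while x > b:
        x = beta * math.sqrt(d_prime + x)
        t += 1
    return t, b

# Illustrative parameters only.
n, d_prime, beta = 10**8, 500, 200.0
depth, base = recursion_depth(n, d_prime, beta)
print(depth, 1 + math.log2(math.log2(2 * n + 2 * d_prime)))   # observed depth vs. the D bound
```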
Solving the inequality for t as follows: yt β2(2y0)1/2t 4β2 b , it suffices to ensure that (2y0)1/2t 4, which happens if t t0. Thus, t 2 log log 2y0. Case 2: β2 < d . In this case, the sequence yt will eventually go below d and thus the inequality yt xt might eventually break down. Let t = t denote the first time step at which yt 5d (which will be finite in this case). Then, by definition, yt xt for all t < t . We will now upper bound the value of t as follows: yt β2(2y0)1/2t d (2y0)1/2t 4d ; Thus, it suffices to ensure that (2y0)1/2t 4, which happens if t t0 := 2 log log(2y0). Hence, t t0. Moreover, the value of yt 5d , implies that xt 1 yt 1 (5d )2 2β2 . We will now show that xt +1 b using the evolution of xt+1 = β d + xt as follows: First, xt β d + β xt 1 β d + 5d 6d ; Then, xt +1 β d b . Thus, t 1 + 2 log log 2y0. Therefore, in both of these cases, we have that after at most D = 1 + log log(2n + 2d ) depth, the base case of b = 7β(β + d ) is achieved at the leaf node. Recall that the algorithm sets D = 2 log log(n + d ) + 1 and ρ = γ/D for d = d log(d) which implies that β := C (1+ρ) ρ2 log(δ 1) = CD (D+γ) γ2 log(100BD) 2D3 log(100B) γ2 . Since Algorithm 5 sets b = 50D3 log(100B) γ2 2D3 log(100B) d log(d) , which is larger than b , we may conclude that the choice of b implies that the recursion tree T of Algorithm 5 has a depth less than D = 1 + log(log(2n + 2d log(d))). Lemma B.7. Suppose the root node T(1,1) of Algorithm 5 satisfies a (bρ, bδ)-approximation guarantee as defined in Definition B.4, and denote by bs the output of Algorithm 4. Then, with a probability of at least 1 bδ, we have 1 1 + γ bs bρ+ρ SA 1 (A). As a result of this lemma, in the final proof of correctness of Algorithm 4 (appearing later as proof of Theorem B.1), it suffices to show that bs satisfies a (bρ, bδ)-approximation guarantee for some appropriate choices of these parameters. Proof of Lemma B.7. Suppose the root node satisfies Definition B.4 with some (bρ, bδ)-approximation guarantee. This means that there exists an es such that the output of Algorithm 5 on the input matrix b A satisfies es bρ SS A 1 ( b A) with a probability at least 1 bδ. (B.7) Since S A is, by design, a ρ-approximation ℓ1 subspace embedding for A, it means for all x Rd we have (1 ρ) Ax 1 S Ax 1 (1 + ρ) Ax 1. (B.8) Therefore, for every row ai b A, we have by the definition of ℓ1 sensitivity and Inequality (B.8) that the ℓ1 sensitivity of ai computed with respect to S A is a ρ-approximation of the ℓ1 sensitivity of ai with respect to A: σS A 1 (ai) = max x Rd |a i x| S Ax 1 ρ max x Rd |a i x| Ax 1 = σA 1 (ai). Therefore, by linearity, this ρ-approximation factor transfers over in connecting the total sensitivity of b A with respect to S A to the total sensitivity of b A with respect to A: SS A 1 ( b A) = X i [| b A|] σS A 1 (ai) ρ X i [| b A|] σA 1 (ai) = SA 1 ( b A). (B.9) Therefore, chaining Equation (B.7) and Equation (B.9) yields the following approximation guarantee: es bρ+ρ SA 1 ( b A) with a probability at least 1 bδ. (B.10) There is no change in the error probability above from Equation (B.7) because Equation (B.9) holds deterministically. Next, since in Line 3 of Algorithm 4, we dropped rows of A that have leverage scores less than 1 n10 , for each of the dropped rows ai, we have σA 1 (ai) p n5 . 
(B.11) Therefore, the quantity we return, bs, satisfies the following approximation guarantee: 1 1 + γ bs := es + 1 n5 (|A| | b A|) (1 (bρ + ρ))SA 1 ( b A) + X i:ai / b A σA 1 (ai) (1 (bρ + ρ))SA 1 (A), (B.12) where the first step is by definition of bs in Algorithm 4; the second step is by Equation (B.10); the third step is by applying Inequality (B.11) to each of the rows not present in b A. In the other direction, 1 1 + γ bs = es + 1 n5 (|A| | b A|) (1 + (bρ + ρ))SA 1 ( b A) + 1 n4 (1 + (bρ + ρ))SA 1 (A), (B.13) where the first step is by definition of bs in Algorithm 4; the second step is by the fact that |A| n; the final step is by combining the assumption ρ 1 n4 and the facts that all sensitivities are non-negative and that the total ℓ1 sensitivity is at least one. Combining Inequality (B.12) and Inequality (B.13) finishes the proof. Lemma B.8 (Single-Level Computation in Recursion Tree). Consider a node N in the recursion tree corresponding to Algorithm 5, and let M be the input matrix for N. Further assume that N is not a leaf node. Denote by N1, N2, . . . , NB (where B = Θ(log(n)) is the number of recursive calls at each node in Algorithm 5) the children of node N, with each node Ni with the input matrix f Mi (as constructed in Algorithm 5). Suppose each of these child nodes Ni satisfies a (bρ, bδ)-approximation guarantee (cf. Definition B.4). Then N satisfies a (bρ + ρ, (bδ + δ) B)-approximation guarantee. Proof. We first reiterate the steps of Algorithm 5 that are crucial to this proof. As proved in Lemma B.5, Algorithm 5 partitions the rows of the matrix M in node N into B submatrices M1, M2, . . . , MB based on their leverage scores. Next, for each matrix Mi, we uniformly sample with replacement O( p |C|(1 + ρ)ρ 2 log(δ 1)) rows and term this set of rows as matrix f Mi, which we then pass to Algorithm 5 recursively in node Ni (recall that we assume that the node N is not a leaf node). In other words, each child node Ni has as its input matrix f Mi. Since we assume that all the child nodes N1, N2, . . . , NB satisfy a (bρ, bδ)-approximation guarantee as in Definition B.4, it means that, for all i [B], the output bsi of Algorithm 5 on node Ni satisfies the following guarantee: bsi bρ SS A 1 (f Mi) with a probability at least 1 bδ. (B.14) Next, we recall that each Mi is composed of only those rows of M whose leverage scores (computed with respect to C) lie in the ith bucket [2 i, 2 i+1]. Combining this with Fact 2.2, this implies that the ℓ1 sensitivity of the jth row Mi[j] of the matrix Mi, computed with respect to C, satisfies s |C| σC 1 (Mi[j]) q 2 i+1. (B.15) Additionally, we claim that 1 3(1 + ρ)σC 1 (Mi[j]) σS A 1 (Mi[j]) 3(1 + ρ)σC 1 (Mi[j]). (B.16) To see this, we observe that σC 1 (Mi[j]) = max x Rd |x Mi[j]| Cx 1 max x Rd |x Mi[j]| 3 Ax 1 max x Rd |x Mi[j]| 3(1 + ρ) S Ax 1 = σS A 1 (Mi[j]) 3(1 + ρ) , (B.17) where the first step is by the definition of ℓ1 sensitivity of Mi[j] (i.e., the jth row of matrix Mi) with respect to C; the second step is by the definition of C, the fact that SA is a 1/2-factor ℓ1 subspace embedding of A, and because M is a subset of rows of A; the third step is because S A is a ρ-approximate ℓ1 subspace embedding for A, and the final step is by the definition of σA 1 (Mi[j]). 
In the other direction, we have σC 1 (Mi[j]) = max x Rd |x Mi[j]| Cx 1 max x Rd |x Mi[j]| 0.5 Ax 1 max x Rd (1 + ρ)|x Mi[j]| 0.5 S Ax 1 3(1+ρ)σS A 1 (Mi[j]), where the second step is by the definition of C as a vertical concatenation of SA and M and further using that SA is a constant factor ℓ1 subspace embedding of A and dropping the non-negative term Mx 1. Combining Inequality (B.15) and Inequality (B.16), we see that for all rows j [|Mi|], we have 1 3(1 + ρ) |C| σS A 1 (Mi[j]) 3(1 + ρ) 2 i+1. (B.19) Since Inequality (B.19) shows upper and lower bounds on each sensitivity σS A 1 (Mi) of the matrix Mi, we may use Lemma A.3 to approximate the sum of these sensivities. In particular, by Lemma A.3, we may construct a matrix f Mi composed of 20 |C|(1+2ρ) log(δ 1) ρ2 rows of Mi sampled uniformly at random with replacement, and we are guaranteed (1 + ρ)|Mi| |f Mi| SS A 1 (f Mi) ρ SS A 1 (Mi) with a probability at least 1 δ. (B.20) Note that the output bs of Algorithm 5 when applied to node N is defined as bs := (1 + ρ) X |f Mi| bsi. (B.21) Therefore, by union bound over the failure probability (cf. Fact B.3), we can combine Equation (B.14) and sum Equation (B.20) over all i B child nodes to conclude that bs from Equation (B.21) satisfies a (bρ + ρ, (bδ + δ)B)-approximation guarantee of Definition B.4. B.2.3 Proof of Theorem B.1 Proof of Theorem B.1. We first sketch our strategy to prove Theorem B.1 s correctness guarantee. Essentially, our goal is to establish Definition B.4 for the root node T(1,1). That this goal suffices to prove the theorem is justified in Lemma B.7. We now show that the node T(1,1) satisfies a (bρ, bδ)-approximation guarantee for some bρ and bδ. We use the method of induction. Correctness guarantee of Theorem B.1. We know that the leaf nodes all satisfy the (ρ, 0)- approximation guarantee in Definition B.4 since at this level, the base case is triggered, where we compute a ρ-approximation estimate of the total ℓ1 sensitivity with respect to the subspace embedding S A. Having established this approximation guarantee at the leaf nodes, we can use Lemma B.8 to inductively propagate the property of (ρ, δ)-approximation guarantee up the recursion tree T . From Lemma B.6, we have that the recursion tree T of Algorithm 5 is of depth at most D = O(log log(n + d)), and the number of child nodes at any given node is at most B = Θ(log(n)); we factor these into the approximation guarantee and error probability. Base case. At the leaf nodes of the recursion tree, we directly compute the total sensitivities with respect to S A. Recall that we denote the depth of the tree by D. Then, because S A is, by definition, a ρ-approximate subspace embedding for A, we have, for each i that indexes a leaf node, bs(D,i) ρ SS A 1 (M(D,i)) with a probability of 1. (B.22) Induction proof. Consider the subtree Tj truncated at (but including) the nodes at level j in the recursion tree T corresponding to Algorithm 5. 
Assume the induction hypothesis that all the leaf nodes in the subtree Tj satisfy Definition B.4 with parameters ρ (D + 1 j), δ BD j+1 B is, at the matrix of each such node i, we have an estimate bs(j,i) of its total sensitivity SS A 1 (M(j,i)) such that bs(j,i) ρ (D+1 j) SS A 1 (M(j,i)) with a probability at least 1 δ BD j+1 B Then, each of the leaf nodes of the subtree Tj 1 (i.e., the subtree that stops one level above Tj) satisfies the assumption in Lemma B.8 since its child nodes are either leaf nodes of the entire recursion tree T (for which this statement has been shown in Inequality (B.22)) or, if they are not leaf nodes of the recursion tree, they satisfy Definition B.4 with parameters bρ = ρ (D + 1 j) and bδ = δ BD j+1 B B 1 . Therefore, Lemma B.8 implies that each of the leaf nodes of Tj 1 satisfies Definition B.4 with approximation parameter bρ and error probability bδ defined as follows: bρ = ρ+ρ (D+1 j) = ρ (D+2 j) and bδ = B δ + δ BD j+1 B = δ BD j+2 B In other words, the outputs bs(j 1,k) of Algorithm 5 on the matrices M(j 1,k) in the leaf nodes of Tj 1 each satisfy: bs(j 1,k) O(ρ (D+2 j)) SS A 1 (M(j 1,k)) with a probability at least 1 δ BD j+2 B (B.24) The base case in Inequality (B.22) and the satisfaction of the induction hypothesis in Inequality (B.24) together finish the proof of the induction hypothesis. Finishing the proof of correctness. In light of the above conclusion, for the root node T(1,1), which is at the level j = 1, we may plug in j = 1 in Inequality (B.23) and obtain that the output bs(1,1) of Algorithm 5 at the root node satisfies bs(1,1) O(Dρ) SS A 1 (M(1,1)) with a probability of at least 1 δ BD B In Algorithm 5, we choose the parameters δ = 0.01 BD and ρ = O γ D . Plugging these parameters into Inequality (B.25), we conclude that the root node satisfies Definition B.4 with a (γ, 0.01)- approximation guarantee. This finishes the proof of correctness of the claim. Proof of runtime. Let b denote the maximum number of rows of a matrix at a leaf node of T . Further, recall from Fact 2.4 that the cost of computing one ℓ1 sensitivity of an n d matrix is, L = e O(nnz(A) + dω). (B.26) Then as stated in Line 1, the algorithm computes the ρ-approximate total sensitivity at the leaf nodes; Fact 2.4, this incurs a cost of C(b) = e O(nnz(A) + b dω). (B.27) At any other node, the computational cost is the work done at that level plus the total cost of the recursive calls at that level. Let C(r) denote the runtime of Algorithm 5 when the input is an r d matrix. Then, ( B C r(1+ρ) ρ2 log(δ 1) + L r > b C(b) r b , where b is the number of rows in the base case, and L is the cost of computing leverage scores and equals L = e O(nnz(A) + dω) from Fact 2.1. For the depth of the recursion D (where the top level is D = 1), plugging in our choices of δ = 0.01 BD and ρ = O γ D , this may be expressed as via expanding the recursion as: C(n) = BD C(b) + L BD 1 for n b. We plug into this expression the values of b and D from Lemma B.6, L from Equation (B.26), C(b) from Equation (B.27), and B = Θ(log(n)) to get the claimed runtime. B.3 Estimating the maximum ℓ1 sensitivity Theorem 3.7 (Approximating the Maximum of ℓ1 Sensitivities). Given a matrix A Rn d, there exists an algorithm, which in time e O(nnz(A) + dω+1), outputs a positive scalar bs that satisfies Ω( σ1(A) ) bs O( Proof. First, we have S1Ax 1 = Θ( Ax 1). (B.29) We use Equation (B.29) to establish the claimed bounds below. First we set some notation. 
Define x and ai as follows: x , i = arg max x Rd,i [|A|] |a i x| Ax 1 (B.30) Thus, x is the vector that realizes the maximum sensitivity of the matrix A, and the row ai is the row of A with maximum sensitivity with respect to A. Suppose the matrix S A contains the row ai . Then we have max i:ci S A σ1(ci) = max x Rd,ck S A |c k x| S1Ax 1 = Θ(1) max x Rd,ck S A |c k x| Ax 1 = Θ(1) max x Rd |a i x| Ax 1 , where the first step is by definition of ℓ1 sensitivity of S A with respect to S1A, the second step is by Equation (B.29), and the third step by noting that the matrix S A is a subset of the rows of A (which includes ai ). By definition of σ1(ai ) in Equation (B.30), we have max x Rd |a i x| Ax 1 = σ1(A) , . (B.32) Then, combining Equation (B.31) and Inequality (B.32) gives the guarantee in this case. In the other case, suppose ai is not included in S A. Then we observe that the upper bound from the preceding inequalities still holds. For the lower bound, we observe that σS1A 1 (S A) = max x Rd,cj S A c j x S1Ax 1 max x Rd S Ax S1Ax 1 = Θ(1) max x Rd S Ax (B.33) where the the second step is by choosing a specific vector in the numerator and the third step uses S1Ax 1 = Θ( Ax 1). We further have, max x Rd S Ax d Ax 1 |a i x | d σ1(A) , (B.34) where the first step is by choosing x = x , the second step is by the distortion guarantee of ℓ subspace embedding [35], and the final step is by definition of σ1(ai ). Combining Inequality (C.10) and Inequality (C.11) gives the claimed lower bound on σS1A 1 (S A) . The runtime follows from the computational cost and row dimension of S A from [35] and the cost of computing ℓ1 sensitivities with respect to S1A as per Fact 2.4. C Omitted proofs: ℓp sensitivities We recall the following notation that we use for stating our results in this setting. Definition C.1 (Notation for ℓp Results). We introduce the notation LP(m, d, p) to denote the cost of approximating one ℓp sensitivity of an m d matrix up to an accuracy of a given constant factor. C.1 Estimating all ℓp sensitivities We generalize Algorithm 1 to the case p 1 below. Theorem C.2 (Approximating all ℓp sensitivities). Given a full-rank matrix A Rn d, an approximation factor 1 < α n, and a scalar p 1, let σp(ai) be the ith ℓp sensitivity. Then there exists an algorithm that returns a vector eσ Rn 0 such that with high probability, for each i [n], we have Ω(σp(ai)) eσi O(αp 1σp(ai)) + αp n Sp(A)). (C.1) Our algorithm runs in time e O nnz(A) + n α LP(dmax(1,p/2), d, p) (cf. Definition 1.2). Algorithm 6 Approximating ℓp-Sensitivities: Row-wise Approximation Inputs: Matrix A Rn d, approximation factor α (1, n], scalar p 1 Output: Vector eσ Rn >0 that satisfies, for each i [n], with probability 0.9, that σp(ai) eσi O(αp 1σp(ai) + αp 1: Compute, for A, an ℓp subspace embedding Sp A R e O(dmax(1,p/2)) d 2: Partition A into n α blocks B1, . . . , Bn/α each comprising α randomly selected rows. 3: for the ℓth block Bℓ, with ℓ [ n 4: Sample α-dimensional independent Rademacher vectors r(ℓ) 1 , . . . , r(ℓ) 100 5: For each j [100], compute the row vectors r(ℓ) j Bℓ Rd 6: end for 7: Let P R100 n α d be the matrix of all vectors computed in Line 5. Compute σSp A p (P) using Definition 1.2 8: for each i [n] do 9: Denote by J the set of row indices in P that ai is mapped to in Line 5 10: Set eσi = maxj J(σSp A p (pj)) 11: end for 12: Return eσ Proof. Suppose the ith row of A falls into the bucket Bℓ. Suppose, further, that the rows from Bℓare mapped to those in P with row indices in the set J. 
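The block-compression step analyzed in this proof (Lines 2-7 of Algorithm 6, and likewise Algorithm 1 for p = 1) can be sketched in a few lines of Python. The function below is our own illustrative implementation; it only builds the compressed matrix P and the index map J, leaving the per-row sensitivity computations against S_pA abstract.

```python
import numpy as np

def rademacher_block_compress(A, alpha, repeats=100, rng=None):
    """Partition the rows of A into ceil(n/alpha) random blocks and form
    `repeats` random +/-1 combinations of each block. Returns P and, for each
    row of A, the row indices of P it is mapped to (the set J in the proof)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    perm = rng.permutation(n)
    blocks = np.array_split(perm, int(np.ceil(n / alpha)))
    rows, mapped = [], {i: [] for i in range(n)}
    for block in blocks:
        B = A[block]                                       # alpha x d sub-block
        signs = rng.choice([-1.0, 1.0], size=(repeats, len(block)))
        for r in signs:
            for i in block:
                mapped[i].append(len(rows))
            rows.append(r @ B)                             # one compressed row r^T B_l
    return np.vstack(rows), mapped

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 8))
P, J = rademacher_block_compress(A, alpha=50, repeats=100, rng=2)
print(P.shape)   # roughly (100 * n / alpha, d) = (2000, 8)
```

The estimates of Theorem C.2 are then eσ_i = max_{j ∈ J[i]} σ_p^{S_pA}(p_j), computed with any per-row sensitivity routine applied against the sketch S_pA.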
Then, Algorithm 6 returns a vector of sensitivity estimates eσ Rn with the ith coordinate defined as eσi = maxj J σp(cj). For any x Rd, we have Sp Ax p p Ax p p. (C.2) We use Equation (C.2) to establish the claimed bounds below. Let x = arg maxx Rd |a i x|p Ax p p be the vector that realizes the ith ℓp sensitivity of A. Further, let the jth row of P be r(ℓ) k Bℓ. Then, with probability of at least 1/2, we have σSp A p (pj) = max x Rd |r(ℓ) k Bℓx|p Sp Ax p p |r(ℓ) k Bℓx |p Sp Ax p p = Θ(1)|r(ℓ) k Bℓx |p Ax p p Θ(1) |a i x |p Ax p p = Θ(1)σp(ai), where the first step is by definition of σp(pj), the second is by evaluating the function being maximized at a specific choice of x, the third step is by Equation (C.2), and the final step is by definition of x and σp(ai). For the fourth step, we use the fact that |r(ℓ) k Bℓx |p = |r(ℓ) k,ia i x + P j =i r(ℓ) k,j(Bx )j|p |a i x |p with a probability of at least 1/2 since the vector r(ℓ) k has coordinates that are 1 or +1 with equal probability. By a union bound over j J independent rows that block Bℓis mapped to, we establish the claimed lower bound in Inequality (C.3) with probability at least 0.9. To show an upper bound on σSp A p (pj), we observe that max x Rd |r(ℓ) k Bℓx|p Sp Ax p p αp 1 max x Rd Bℓx p p Sp Ax p p Θ(αp 1) X j:aj Bℓ max x Rd |a j x|p Ax p p = Θ(αp 1) X j:aj Bℓ σp(aj), where the second step is by Equation (C.2) and opening up Bℓx p p in terms of the rows in Bℓ, and the final step is by definition of Bℓ. To see the first step, we observe that |r(ℓ) k Bℓx|p Bℓx p p r(ℓ) k p q Bℓx p p αp/q = Bℓx p p αp 1, q = 1, the first step is by Hölder s inequality, the second is because each entry of r(ℓ) k is either +1 or 1 and r(ℓ) k Rα, and the third is because p q = p 1. Because Bℓis a group of α rows selected uniformly at random out of n rows and contains the row ai, we have: j:aj Bℓ,j =i σp(aj) j =i σp(ai). Therefore, Markov inequality gives us that with a probability of at least 0.9, we have σSp A p (pj) O(αp 1σp(ai) + αp n Sp(A)). (C.5) The median trick then establishes Inequality (C.4) and Inequality (C.5) with high probability. To establish the runtime, note that we first construct the subspace embedding Sp A and then compute the ℓp sensitivities of P (with O(n/α) rows) with respect to Sp A. The size of the subspace embedding Sp A is O(d1 p/2 d) (see [75, Table 1]). This completes the runtime claim. C.2 Estimating the maximum of ℓp sensitivities In Section 3.2 and Appendix B.3, we showed our result for estimating the maximum ℓ1 sensitivity. In this section, we show how to extend this result to all p 1. Algorithm 7 generalizes Algorithm 3. Algorithm 7 Approximating the Maximum of ℓp-Sensitivities Input: Matrix A Rn d and p 1 (with p = 2) Output: Scalar bs R 0 that satisfies that σp(A) bs C dp/2 σp(A) 1: Compute, for A, an ℓ subspace embedding S A RO(d log2(d)) d such that S A is a subset of the rows of A [35] 2: Compute, for A, an ℓp subspace embedding Sp A RO(d) d p [1, 2) RO(dp/2) d p > 2 3: Return dp/2 σSp A p (S A) Theorem C.3 (Approximating the maximum of ℓp sensitivities). Given a matrix A Rn d and p 1 (with3 p = 2), there exists an algorithm, which outputs a positive scalar bs that satisfies Ω( σp(A) ) bs O(dp/2 σp(A) ). The runtime of the algorithm is e O(nnz(A) + dω+1) for p [1, 2) and e O(nnz(A) + dp/2 LP(O(dp/2), d, p)) for p > 2. Proof. First, we have Sp Ax p p = Θ( Ax p p). (C.6) We use Equation (C.6) to establish the claimed bounds below. First we set some notation. 
Define x* and a_{i*} as follows:
(x^*, i^*) = \arg\max_{x \in R^d,\, i \in [|A|]} \frac{|a_i^\top x|^p}{\|Ax\|_p^p}. (C.7)
Thus, x* is the vector that realizes the maximum ℓp sensitivity of the matrix A, and a_{i*} is the row of A with maximum sensitivity with respect to A. Suppose the matrix S∞A contains the row a_{i*}. Then we have
\max_{i:\, c_i \in S_\infty A} \sigma_p^{S_pA}(c_i) = \max_{x \in R^d,\, c_k \in S_\infty A} \frac{|c_k^\top x|^p}{\|S_pAx\|_p^p} = \Theta(1) \max_{x \in R^d,\, c_k \in S_\infty A} \frac{|c_k^\top x|^p}{\|Ax\|_p^p} = \Theta(1) \max_{x \in R^d} \frac{|a_{i^*}^\top x|^p}{\|Ax\|_p^p}, (C.8)
where the first step is by the definition of the ℓp sensitivity of the rows of S∞A with respect to S_pA, the second step is by Equation (C.6), and the third step is by noting that the matrix S∞A is a subset of the rows of A (which includes a_{i*}). By the definition of σp(a_{i*}) in Equation (C.7), we have
\max_{x \in R^d} \frac{|a_{i^*}^\top x|^p}{\|Ax\|_p^p} = \|\sigma_p(A)\|_\infty. (C.9)
Then, combining Equation (C.8) and Equation (C.9) gives the guarantee in this case. In the other case, suppose a_{i*} is not included in S∞A. Then we observe that the upper bound from the preceding inequalities still holds. For the lower bound, we observe that
\|\sigma_p^{S_pA}(S_\infty A)\|_\infty = \max_{x \in R^d,\, c_j \in S_\infty A} \frac{|c_j^\top x|^p}{\|S_pAx\|_p^p} \ge \max_{x \in R^d} \frac{\|S_\infty Ax\|_\infty^p}{\|S_pAx\|_p^p} = \Theta(1) \max_{x \in R^d} \frac{\|S_\infty Ax\|_\infty^p}{\|Ax\|_p^p}, (C.10)
where the second step is by choosing, for each x, a specific row in the numerator, and the third step uses \|S_pAx\|_p^p = \Theta(\|Ax\|_p^p). We further have
\max_{x \in R^d} \frac{\|S_\infty Ax\|_\infty^p}{\|Ax\|_p^p} \ge \frac{\|S_\infty Ax^*\|_\infty^p}{\|Ax^*\|_p^p} \ge \frac{\|Ax^*\|_\infty^p}{d^{p/2}\,\|Ax^*\|_p^p} \ge \frac{|a_{i^*}^\top x^*|^p}{d^{p/2}\,\|Ax^*\|_p^p} = \frac{1}{d^{p/2}}\,\|\sigma_p(A)\|_\infty, (C.11)
where the first step is by choosing x = x*, the second step is by the distortion guarantee of the ℓ∞ subspace embedding, and the final step is by the definition of σp(a_{i*}). Combining Inequality (C.10) and Inequality (C.11) gives the claimed lower bound on \|\sigma_p^{S_pA}(S_\infty A)\|_\infty. The runtime follows from the cost of computing S∞A from [35] and the cost of computing, and the size of, S_pA from [23, Figure 1].

3 For p = 2, the result of [23] gives a constant factor approximation to all leverage scores in nnz(A) + d^ω time.

D Lower Bounds

Theorem D.1 (ℓp Regression Reduces to ℓp Sensitivities). Suppose that we are given an algorithm A, which for any matrix A′ ∈ R^{n′×d′} and accuracy parameter ε′ ∈ (0, 1), computes (1 ± ε′)σp(A′) in time T(n′, d′, nnz(A′), ε′). Then, there exists an algorithm that takes A ∈ R^{n×d} and b ∈ R^n as inputs and computes (1 ± ε) \min_{y \in R^d} \|Ay - b\|_p^p ± λ^p ε in time T(n + 1, d + 1, nnz(A) + nnz(b) + 1, ε) for any λ > 0.

Proof. Given A ∈ R^{n×d} and b ∈ R^n, consider the matrix
A' := \begin{pmatrix} A & b \\ 0 & \lambda \end{pmatrix} \in R^{(n+1)\times(d+1)}.
Then the (n+1)-th ℓp sensitivity of A′ is
\sigma_p(a'_{n+1}) = \max_{x \in R^{d+1}} \frac{|\lambda \langle e_{d+1}, x\rangle|^p}{\|A'x\|_p^p} = \max_{x \in R^{d+1}:\, x_{d+1} = \lambda^{-1}} \frac{1}{\|A'x\|_p^p} = \frac{1}{1 + \min_{y \in R^d} \|Ay - \lambda^{-1}b\|_p^p},
where the second step is by the scale-invariance of the definition of sensitivity with respect to the variable of optimization. By scale-invariance, we also see that for any c > 0, \min_{y} \|Ay - cb\|_p^p = c^p \min_{y} \|Ay - b\|_p^p. Therefore, rearranging gives \min_{y \in R^d} \|Ay - b\|_p^p = \frac{\lambda^p}{\sigma_p(a'_{n+1})} - \lambda^p, which is how we can estimate the cost of the regression. Then computing σp(a′_{n+1}) to a multiplicative accuracy of ε gives the value of \min_{y \in R^d} \|Ay - b\|_p^p up to an additive error of \varepsilon \cdot \frac{\lambda^p}{\sigma_p(a'_{n+1})} = \varepsilon\,(\lambda^p + \min_{y \in R^d} \|Ay - b\|_p^p), giving our final bounds.
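For p = 2, where the ℓ2 sensitivity is exactly a leverage score, the reduction above can be checked numerically in a few lines; the random data and the choice λ = 1 below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 300, 6, 1.0
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Augmented matrix A' = [[A, b], [0, lambda]] from the proof of Theorem D.1.
A_aug = np.zeros((n + 1, d + 1))
A_aug[:n, :d], A_aug[:n, d], A_aug[n, d] = A, b, lam

# For p = 2, the sensitivity of the last row is its leverage score
# tau = a'^T (A'^T A')^{-1} a'.
last = A_aug[n]
tau = last @ np.linalg.solve(A_aug.T @ A_aug, last)

# Theorem D.1: min_y ||Ay - b||_2^2 = lambda^2 / sigma_2(a'_{n+1}) - lambda^2.
recovered = lam**2 / tau - lam**2
direct = np.sum((A @ np.linalg.lstsq(A, b, rcond=None)[0] - b) ** 2)
print(recovered, direct)   # the two values agree up to numerical error
```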
Using the same technique as in Theorem D.1, we can similarly extend the reduction to a multiple-regression task. In particular, we show that the values of a family of certain regularized leave-one-out regression problems for a matrix may be obtained by simply computing the ℓp sensitivities (leverage scores when p = 2) of an associated matrix; this observation demonstrates that finding fast algorithms for sensitivities is as hard as multiple regression tasks. Specifically, for a matrix A ∈ R^{n×d}, we denote by A_{:-i} ∈ R^{n×(d-1)} the submatrix of A with its i-th column, denoted A_{:i}, removed. We show that approximate sensitivity calculations can approximately solve \min_{y \in R^{d-1}} \|A_{:-i}\,y + A_{:i}\|_p^p for all i.

Lemma D.2 (Regularized Leave-One-Out ℓp Multiregression Reduces to ℓp Sensitivities). Suppose that we are given an algorithm A, which for any matrix A′ ∈ R^{n′×d′} and accuracy parameter ε′ ∈ (0, 1), computes (1 ± ε′)σp(A′) in time T(n′, d′, nnz(A′), ε′). Given a matrix A ∈ R^{n×d} with n ≥ d, let OPT_i := \min_{y \in R^{d-1}} \|A_{:-i}\,y + A_{:i}\|_p^p and y^*_i := \arg\min_{y \in R^{d-1}} \|A_{:-i}\,y + A_{:i}\|_p for all i ∈ [d]. Then, there exists an algorithm that takes A ∈ R^{n×d} as input and computes (1 ± ε)\,OPT_i ± λ^p(1 + \|y^*_i\|_p^p) in time T(n + d, d, nnz(A), ε) for any λ > 0.

Proof. Given A ∈ R^{n×d}, consider the matrix
A' := \begin{pmatrix} A \\ \lambda I \end{pmatrix} \in R^{(n+d)\times d},
formed by vertically appending a scaled identity matrix to A, with λ > 0. By the definition of ℓp sensitivities, we have
\sigma_p(a'_{n+i}) = \max_{x \in R^d} \frac{|x^\top a'_{n+i}|^p}{\|A'x\|_p^p} = \max_{x \in R^d:\, x_i = \lambda^{-1}} \frac{1}{\|A'x\|_p^p} = \max_{x \in R^d:\, x_i = \lambda^{-1}} \frac{1}{\|A_{:-i}\,x_{-i} + \lambda^{-1}A_{:i}\|_p^p + \lambda^p\|x\|_p^p} = \frac{1}{\min_{y \in R^{d-1}} \{1 + \lambda^p\|y\|_p^p + \|A_{:-i}\,y + \lambda^{-1}A_{:i}\|_p^p\}},
where the last step writes y = x_{-i} and uses \lambda^p\|x\|_p^p = 1 + \lambda^p\|x_{-i}\|_p^p when x_i = λ^{-1}. Recall that y^*_i := \arg\min_{y} \|A_{:-i}\,y + A_{:i}\|_p, and since t ↦ t^p is a monotone transform, OPT_i = \min_{y} \|A_{:-i}\,y + A_{:i}\|_p^p = \|A_{:-i}\,y^*_i + A_{:i}\|_p^p. By scale-invariance, we see that for any c > 0, c\,y^*_i = \arg\min_{y} \|A_{:-i}\,y + c\,A_{:i}\|_p^p and \min_{y} \|A_{:-i}\,y + c\,A_{:i}\|_p^p = c^p\,OPT_i. Therefore, plugging in y = λ^{-1}y^*_i, we note that
\min_{y \in R^{d-1}} \{1 + \lambda^p\|y\|_p^p + \|A_{:-i}\,y + \lambda^{-1}A_{:i}\|_p^p\} \le 1 + \|y^*_i\|_p^p + \lambda^{-p}\,OPT_i.
However, we also note that, by non-negativity,
\min_{y \in R^{d-1}} \{1 + \lambda^p\|y\|_p^p + \|A_{:-i}\,y + \lambda^{-1}A_{:i}\|_p^p\} \ge \lambda^{-p}\,OPT_i.
Combining all of this gives the claimed approximation guarantee by noting that
OPT_i \le \frac{\lambda^p}{\sigma_p(a'_{n+i})} \le OPT_i + \lambda^p(1 + \|y^*_i\|_p^p).
Therefore, our algorithm is simply to return λ^p / s_p(a'_{n+i}), where s_p(a'_{n+i}) is the ε-approximate estimate of the sensitivity, and our bound follows. The runtime guarantee follows immediately from the assumed runtime for calculating the ℓp sensitivities of A′.

Corollary D.3. Assuming that leave-one-out ℓp multi-regression with Ω(d) instances requires poly(d)·nnz(A) + poly(1/ε) time [44, 45], computing the ℓp sensitivities of an n × d matrix A also costs at least poly(d)·nnz(A) + poly(1/ε) time.

E Additional Experiments

Here we include additional experiments on other, similar datasets.

p     Total Sensitivity Upper Bound   Brute-Force   Approximation   Brute-Force Runtime (s)   Approximate Runtime (s)
1     11                              4.7           3.9             1940                      303
1.5   11                              8.3           10.3            1980                      280
2.5   20                              10.3          11.8            1970                      326
3     36.4                            6.9           7.3             1970                      371
Table 2: Runtime comparison for computing total sensitivities for the fires dataset, which has matrix shape (517, 11). We include the theoretical upper bound for the total sensitivities using Lewis weights calculations.

p     Total Sensitivity Upper Bound   Brute-Force   Approximation   Brute-Force Runtime (s)   Approximate Runtime (s)
1     11                              4.1           5.2             540                       154
1.5   11                              10.1          6.7             560                       201
2.5   20                              19.9          12.2            390                       117
3     36.4                            8.8           6.3             354                       136
Table 3: Runtime comparison for computing total sensitivities for the concrete dataset, which has matrix shape (101, 11). We include the theoretical upper bound for the total sensitivities using Lewis weights calculations.

Figure 2: Average absolute log ratios for all ℓp sensitivity approximations for fires.
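The Total Sensitivity Upper Bound column in Tables 2 and 3 is consistent with the bound S_p(A) ≤ d for p ≤ 2 and S_p(A) ≤ d^{p/2} for p > 2 (the captions attribute it to Lewis-weight calculations); the snippet below, with a helper name of our own choosing, reproduces the column for d = 11.

```python
# Reproduces the "Total Sensitivity Upper Bound" column of Tables 2 and 3
# (d = 11 features): d for p <= 2, and d^(p/2) for p > 2.
def total_sensitivity_upper_bound(d, p):
    return d if p <= 2 else d ** (p / 2)

d = 11
for p in (1, 1.5, 2.5, 3):
    print(p, round(total_sensitivity_upper_bound(d, p), 1))
# -> 1: 11, 1.5: 11, 2.5: 20.0, 3: 36.5 (the tables list 36.4 for p = 3)
```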