# Leveraged volume sampling for linear regression

Michał Dereziński and Manfred K. Warmuth
Department of Computer Science, University of California, Santa Cruz
mderezin@berkeley.edu, manfred@ucsc.edu

Daniel Hsu
Computer Science Department, Columbia University, New York
djhsu@cs.columbia.edu

Abstract

Suppose an n × d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number k ≪ n of the responses, and then produce a weight vector whose sum of squares loss over all points is at most 1 + ε times the minimum. When k is very small (e.g., k = d), jointly sampling diverse subsets of points is crucial. One such method called volume sampling has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses k needed to achieve a 1 + ε loss approximation. Surprisingly, we show that volume sampling can have poor behavior when we require a very accurate approximation: indeed, worse than some i.i.d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size k = O(d log d + d/ε) suffices to guarantee total loss at most 1 + ε times the minimum with high probability. Thus we improve on the best previously known sample size for an unbiased estimator, k = O(d²/ε). Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a determinantal rejection sampling technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.

1 Introduction

Consider a linear regression problem where the input points in R^d are provided, but the associated response for each point is withheld unless explicitly requested. The goal is to sample the responses for just a small subset of inputs, and then produce a weight vector whose total square loss on all n points is at most 1 + ε times that of the optimum.¹ This scenario is relevant in many applications where data points are cheap to obtain but responses are expensive. Surprisingly, with the aid of having all input points available, such multiplicative loss bounds are achievable without any range dependence on the points or responses common in on-line learning [see, e.g., 8]. A natural and intuitive approach to this problem is volume sampling, since it prefers diverse sets of points that will likely result in a weight vector with low total loss, regardless of what the corresponding responses turn out to be [11]. Volume sampling is closely related to optimal design criteria [18, 26], which are appropriate under statistical models of the responses; here we study a worst-case setting where the algorithm must use randomization to guard itself against worst-case responses.

¹The total loss being 1 + ε times the optimum is the same as the regret being ε times the optimum.
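To make the problem setup concrete, the following NumPy sketch (ours, not from the paper; the uniform subset here is only a placeholder for the sampling schemes studied below) computes the full least-squares optimum, a least-squares estimator that only ever sees k of the responses, and the multiplicative factor by which its total loss exceeds the minimum.

```python
import numpy as np

def total_loss(X, y, w):
    """Square loss over all n points: L(w) = ||Xw - y||^2."""
    r = X @ w - y
    return float(r @ r)

rng = np.random.default_rng(0)
n, d, k = 1000, 10, 40
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Full least-squares optimum w* = X^+ y (uses all n responses).
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

# Subsampled estimator: only the k responses y[S] are ever observed.
# (A uniform subset is a stand-in here; the paper studies volume sampling.)
S = rng.choice(n, size=k, replace=False)
w_S = np.linalg.lstsq(X[S], y[S], rcond=None)[0]

# Multiplicative approximation factor L(w_S) / L(w*), ideally at most 1 + eps.
print(total_loss(X, y, w_S) / total_loss(X, y, w_star))
```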
Volume sampling and related determinantal point processes are employed in many machine learning and statistical contexts, including linear regression [11, 13, 26], clustering and matrix approximation [4, 14, 15], summarization and information retrieval [19, 23, 24], and fairness [6, 7]. The availability of fast algorithms for volume sampling [11, 26] has made it an important technique in the algorithmic toolbox alongside i.i.d. leverage score sampling [17] and spectral sparsification [5, 25].

It is therefore surprising that using volume sampling in the context of linear regression, as suggested in previous works [11, 26], may lead to suboptimal performance. We construct an example in which, even after sampling up to half of the responses, the loss of the weight vector from volume sampling is a fixed factor > 1 larger than the minimum loss. Indeed, this poor behavior arises because for any sample size > d, the marginal probabilities from volume sampling are a mixture of uniform probabilities and leverage score probabilities, and uniform sampling is well known to be suboptimal when the leverage scores are highly non-uniform.

Figure 1: Plots of the total loss for the sampling methods (averaged over 100 runs) versus sample size (shading is standard error) for the libsvm dataset cpusmall [9].

A possible recourse is to abandon volume sampling in favor of leverage score sampling [17, 33]. However, all i.i.d. sampling methods, including leverage score sampling, suffer from a coupon collector problem that prevents their effective use at small sample sizes [13]. Moreover, the resulting weight vectors are biased (when regarded as estimators for the least squares solution based on all responses). This is a nuisance when averaging multiple solutions (e.g., as produced in distributed settings). In contrast, volume sampling offers multiplicative loss bounds even with sample sizes as small as d, and it is the only known non-trivial method that gives unbiased weight vectors [11].

We develop a new solution, called leveraged volume sampling, that retains the aforementioned benefits of volume sampling while avoiding its flaws. Specifically, we propose a variant of volume sampling based on rescaling the input points to correct the resulting marginals. On the algorithmic side, this leads to a new determinantal rejection sampling procedure which offers significant computational advantages over existing volume sampling algorithms, while at the same time being strikingly simple to implement. We prove that this new sampling scheme retains the benefits of volume sampling (like unbiasedness) but avoids the bad behavior demonstrated in our lower bound example. Along the way, we prove a new generalization of the Cauchy–Binet formula, which is needed for the rejection sampling denominator. Finally, we develop a new method for proving matrix tail bounds for leveraged volume sampling. Our analysis shows that the unbiased least-squares estimator constructed this way achieves a 1 + ε approximation factor from a sample of size O(d log d + d/ε), addressing an open question posed by [11].

Experiments. Figure 1 presents experimental evidence on a benchmark dataset (cpusmall from the libsvm collection [9]) that the potential bad behavior of volume sampling proven in our lower bound does occur in practice. Appendix E shows more datasets and a detailed discussion of the experiments.
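Leverage scores l_i = x_i^T (X^T X)^{-1} x_i play a central role both in the comparison above and in the rescaling introduced later in the paper (q_i = l_i/d). The following NumPy sketch (ours; an illustration, not the authors' code) computes them and performs plain i.i.d. leverage score sampling with the usual 1/sqrt(k q_i) rescaling.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores l_i = x_i^T (X^T X)^{-1} x_i, i.e. the diagonal of the
    hat matrix X (X^T X)^{-1} X^T.  For a full-rank n x d matrix they sum to d."""
    Q, _ = np.linalg.qr(X)          # orthonormal basis of the column span of X
    return np.sum(Q**2, axis=1)     # l_i = squared norm of the i-th row of Q

rng = np.random.default_rng(0)
n, d, k = 1000, 10, 40
X = rng.standard_normal((n, d))

l = leverage_scores(X)              # l.sum() == d up to rounding
q = l / d                           # the distribution used for rescaling later

# i.i.d. leverage score sampling: k indices with replacement from q,
# each selected row (and its response) rescaled by 1/sqrt(k * q_i).
idx = rng.choice(n, size=k, replace=True, p=q)
X_sub = X[idx] / np.sqrt(k * q[idx])[:, None]   # y[idx] would be rescaled the same way
```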
In summary, leveraged volume sampling avoids the bad behavior of standard volume sampling, and performs considerably better than leverage score sampling, especially for small sample sizes k.

Related work. Despite the ubiquity of volume sampling in many contexts already mentioned above, it has only recently been analyzed for linear regression. Focusing on small sample sizes, [11] proved multiplicative bounds for the expected loss of size k = d volume sampling. Because the estimators produced by volume sampling are unbiased, averaging a number of such estimators yields an estimator based on a sample of size k = O(d²/ε) with expected loss at most 1 + ε times the optimum. It was shown in [13] that if the responses are assumed to be linear functions of the input points plus white noise, then size k = O(d/ε) volume sampling suffices for obtaining the same expected bounds. These noise assumptions on the response vector are also central to the task of A-optimal design, where volume sampling is a key technique [2, 18, 28, 29]. All of these previous results were concerned with bounds that hold in expectation; it is natural to ask if similar (or better) bounds can also be shown to hold with high probability, without noise assumptions. Concentration bounds for volume sampling and other strong Rayleigh measures were studied in [30], but these results are not sufficient to obtain the tail bounds needed here. Other techniques applicable to our linear regression problem include leverage score sampling [17] and spectral sparsification [5, 25]. Leverage score sampling is an i.i.d. sampling procedure which achieves tail bounds matching the ones we obtain here for leveraged volume sampling; however, it produces biased weight vectors, and experimental results (see [13] and Appendix E) show that it has weaker performance for small sample sizes. A different and more elaborate sampling technique based on spectral sparsification [5, 25] was recently shown to be effective for linear regression [10]; however, this method also does not produce unbiased estimates, which is a primary concern of this paper and desirable in many settings. Unbiasedness seems to require delicate control of the sampling probabilities, which we achieve using determinantal rejection sampling.

Outline and contributions. We set up our task of subsampling for linear regression in the next section and present our lower bound for standard volume sampling. A new variant of rescaled volume sampling is introduced in Section 3. We develop techniques for proving matrix expectation formulas for this variant which show that for any rescaling the weight vector produced for the subproblem is unbiased. Next, we show that when rescaling with leverage scores, a new algorithm based on rejection sampling is surprisingly efficient (Section 4): other than the preprocessing step of computing leverage scores, the runtime does not depend on n (a major improvement over existing volume sampling algorithms). Then, in Section 4.1 we prove multiplicative loss bounds for leveraged volume sampling by establishing two important properties which are hard to prove for joint sampling procedures. We conclude in Section 5 with an open problem and with a discussion of how rescaling with approximate leverage scores gives further time improvements for constructing an unbiased estimator.

2 Volume sampling for linear regression

In this section, we describe our linear regression setting, and review the guarantees that standard volume sampling offers in this context.
Then, we present a surprising lower bound which shows that under worst-case data, this method can exhibit undesirable behavior.

2.1 Setting

Suppose the learner is given n input vectors x_1, …, x_n ∈ R^d, which are arranged as the rows of an n × d input matrix X. Each input vector x_i has an associated response variable y_i ∈ R from the response vector y ∈ R^n. The goal of the learner is to find a weight vector w ∈ R^d that minimizes the square loss:

  w* := argmin_{w ∈ R^d} L(w),   where   L(w) := Σ_{i=1}^n (x_i^T w − y_i)² = ||Xw − y||².

Given both matrix X and vector y, the least squares solution can be directly computed as w* = X^+ y, where X^+ is the pseudo-inverse. Throughout the paper we assume w.l.o.g. that X has (full) rank d.² In our setting, the learner is initially given the entire input matrix X, while the response vector y remains hidden. The learner is then allowed to select a subset S of row indices in [n] = {1, …, n} for which the corresponding responses y_i are revealed. The learner next constructs an estimate ŵ of w* using matrix X and the partial vector of observed responses. Finally, the learner is evaluated by the loss over all rows of X (including the ones with unobserved responses), and the goal is to obtain a multiplicative loss bound, i.e., that for some ε > 0,

  L(ŵ) ≤ (1 + ε) L(w*).

²Otherwise just reduce X to a subset of independent columns. Also assume X has no rows of all zeros (every weight vector has the same loss on such rows, so they can be removed).

2.2 Standard volume sampling

Given X ∈ R^{n×d} and a size k ≥ d, standard volume sampling jointly chooses a set S of k indices in [n] with probability

  Pr(S) = det(X_S^T X_S) / ( (n−d choose k−d) · det(X^T X) ),

where X_S is the submatrix of the rows from X indexed by the set S. The learner then obtains the responses y_i, for i ∈ S, and uses the optimum solution w*_S = (X_S)^+ y_S for the subproblem (X_S, y_S) as its weight vector. The sampling procedure can be performed using reverse iterative sampling (shown below), which, if carefully implemented, takes O(nd²) time (see [11, 13]).

Reverse iterative sampling
  VolumeSample(X, k):
    S ← [n]
    while |S| > k:
      for all i ∈ S:  q_i ← 1 − x_i^T (X_S^T X_S)^{−1} x_i
      sample i out of S with probability proportional to q_i
      S ← S \ {i}
    return S

The key property (unique to volume sampling) is that the subsampled estimator w*_S is unbiased, i.e.,

  E[w*_S] = w*,   where   w* = argmin_w ||Xw − y||².

As discussed in [11], this property has important practical implications in distributed settings: mixtures of unbiased estimators remain unbiased (and can conveniently be used to reduce variance). Also, if the rows of X are in general position, then for volume sampling

  E[ (X_S^T X_S)^{−1} ] = (n − d + 1)/(k − d + 1) · (X^T X)^{−1}.   (1)

This is important because in A-optimal design bounding tr((X_S^T X_S)^{−1}) is the main concern. Given these direct connections of volume sampling to linear regression, it is natural to ask whether this distribution achieves a loss bound of (1 + ε) times the optimum for small sample sizes k.

2.3 Lower bound for standard volume sampling

We show that standard volume sampling cannot guarantee 1 + ε multiplicative loss bounds on some instances, unless over half of the rows are chosen to be in the subsample.

Theorem 1 Let (X, y) be an n × d least squares problem, such that

  X = [ I_{d×d} ; γ I_{d×d} ; … ; γ I_{d×d} ],   y = [ 1_d ; 0_d ; … ; 0_d ],   where γ > 0

(the d × d identity stacked on top of n/d − 1 copies of γ I_{d×d}, with responses equal to 1 on the first d rows and 0 elsewhere). Let w*_S = (X_S)^+ y_S be obtained from size k volume sampling for (X, y). Then

  lim_{γ→0}  E[L(w*_S)] / L(w*)  ≥  1 + (n − k)/(n − d),   (2)

and there is a γ > 0 such that for any k ≤ n/2,

  Pr( L(w*_S) ≥ (3/2) · L(w*) )  >  1/4.   (3)

Proof In Appendix A we show (2), and that for the chosen (X, y) we have L(w*) = Σ_{i=1}^d (1 − l_i) (see (8)), where l_i = x_i^T (X^T X)^{−1} x_i is the i-th leverage score of X. Here, we show (3).
The marginal probability of the i-th row under volume sampling (as given by [12]) is

  Pr(i ∈ S) = α l_i + (1 − α) · 1 = 1 − α (1 − l_i),   where α := (n − k)/(n − d).

Next, we bound the probability that all of the first d input vectors were selected by volume sampling:

  Pr( [d] ⊆ S )  ≤(∗)  Π_{i=1}^d Pr(i ∈ S)  =  Π_{i=1}^d ( 1 − α(1 − l_i) )  ≤  exp( −α Σ_{i=1}^d (1 − l_i) )  =  exp( −α L(w*) ),

where (∗) follows from negative associativity of volume sampling (see [26]). If for some i ∈ [d] we have i ∉ S, then L(w*_S) ≥ 1. So for γ such that L(w*) = 2/3, and any k ≤ n/2 (for which α ≥ 1/2),

  Pr( L(w*_S) ≥ (3/2) L(w*) )  ≥  1 − Pr( [d] ⊆ S )  ≥  1 − exp( −α · 2/3 )  ≥  1 − e^{−1/3}  >  1/4.   □

Note that this lower bound only makes use of the negative associativity of volume sampling and the form of the marginals. However, the tail bounds we prove in Section 4.1 rely on more subtle properties of volume sampling. We begin by creating a variant of volume sampling with rescaled marginals.

3 Rescaled volume sampling

Given any size k ≥ d, our goal is to jointly sample k row indices π_1, …, π_k with replacement (instead of a subset S of [n] of size k, we get a sequence π ∈ [n]^k). The second difference to standard volume sampling is that we rescale the i-th row (and response) by 1/√q_i, where q = (q_1, …, q_n) is any discrete distribution over the set of row indices [n], such that Σ_{i=1}^n q_i = 1 and q_i > 0 for all i ∈ [n]. We now define q-rescaled size k volume sampling as a joint sampling distribution over π ∈ [n]^k, s.t.

  q-rescaled size k volume sampling:   Pr(π)  ∝  det( Σ_{i=1}^k (1/q_{π_i}) x_{π_i} x_{π_i}^T ) · Π_{i=1}^k q_{π_i}.   (5)

Using the following rescaling matrix Q_π := Σ_{i=1}^k (1/q_{π_i}) e_{π_i} e_{π_i}^T ∈ R^{n×n}, we rewrite the determinant as det(X^T Q_π X). As in standard volume sampling, the normalization factor in rescaled volume sampling can be given in a closed form through a novel extension of the Cauchy–Binet formula (proof in Appendix B.1).

Proposition 2 For any X ∈ R^{n×d}, k ≥ d and q_1, …, q_n > 0, such that Σ_{i=1}^n q_i = 1, we have

  Σ_{π∈[n]^k} det(X^T Q_π X) Π_{i=1}^k q_{π_i}  =  k(k−1)···(k−d+1) · det(X^T X).

Given a matrix X ∈ R^{n×d}, vector y ∈ R^n and a sequence π ∈ [n]^k, we are interested in a least-squares problem (Q_π^{1/2} X, Q_π^{1/2} y), which selects instances indexed by π, and rescales each of them by the corresponding 1/√q_{π_i}. This leads to a natural subsampled least squares estimator

  w*_π := argmin_{w ∈ R^d} || Q_π^{1/2} (Xw − y) ||²  =  (Q_π^{1/2} X)^+ Q_π^{1/2} y.

The key property of standard volume sampling is that the subsampled least-squares estimator is unbiased. Surprisingly, this property is retained for any q-rescaled volume sampling (proof in Section 3.1). As we shall see, this will give us great leeway for choosing q to optimize our algorithms.

Theorem 3 Given a full rank X ∈ R^{n×d} and a response vector y ∈ R^n, for any q as above, if π is sampled according to (5), then

  E[w*_π] = w*,   where   w* = argmin_w ||Xw − y||².

The matrix expectation equation (1) for standard volume sampling (discussed in Section 2) has a natural extension to any rescaled volume sampling, but now the equality turns into an inequality (proof in Appendix B.2):

Theorem 4 Given a full rank X ∈ R^{n×d} and any q as above, if π is sampled according to (5), then

  E[ (X^T Q_π X)^{−1} ]  ⪯  1/(k − d + 1) · (X^T X)^{−1}.
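For intuition, the definition (5), the normalization in Proposition 2, and the unbiasedness claim of Theorem 3 can all be checked numerically by brute-force enumeration over [n]^k when n, k and d are tiny. The sketch below (ours; the problem sizes and the random choice of q are arbitrary) does exactly that.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, d, k = 5, 2, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
q = rng.random(n); q /= q.sum()              # any positive distribution q on [n]

w_star = np.linalg.lstsq(X, y, rcond=None)[0]

probs, estimators = [], []
for pi in product(range(n), repeat=k):       # every sequence pi in [n]^k
    pi = np.array(pi)
    Xs = X[pi] / np.sqrt(q[pi])[:, None]     # rows rescaled by 1/sqrt(q_{pi_i})
    ys = y[pi] / np.sqrt(q[pi])
    probs.append(np.linalg.det(Xs.T @ Xs) * np.prod(q[pi]))   # unnormalized Pr(pi), eq. (5)
    estimators.append(np.linalg.lstsq(Xs, ys, rcond=None)[0]) # w*_pi
probs, estimators = np.array(probs), np.array(estimators)

# Proposition 2: the normalizer equals k(k-1)...(k-d+1) * det(X^T X).
falling = np.prod([k - j for j in range(d)])
assert np.isclose(probs.sum(), falling * np.linalg.det(X.T @ X))

# Theorem 3: the q-rescaled volume sampling estimator is unbiased.
expectation = (probs[:, None] * estimators).sum(axis=0) / probs.sum()
assert np.allclose(expectation, w_star)
```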
3.1 Proof of Theorem 3

We show that the least-squares estimator w*_π = (Q_π^{1/2} X)^+ Q_π^{1/2} y produced from any q-rescaled volume sampling is unbiased, illustrating a proof technique which is also useful for showing Theorem 4, as well as Propositions 2 and 5. The key idea is to apply the pseudo-inverse expectation formula for standard volume sampling (see e.g., [11]) first on the subsampled estimator w*_π, and then again on the full estimator w*. In the first step, this formula states:

  det(X^T Q_π X) · (Q_π^{1/2} X)^+ Q_π^{1/2}  =  Σ_{S ∈ ([k] choose d)} det( X_{π_S}^T Q_{π_S} X_{π_S} ) · (Q_{π_S}^{1/2} X_{π_S})^+ Q_{π_S}^{1/2},

where ([k] choose d) := { S ⊆ {1, …, k} : |S| = d } and π_S denotes the subsequence of π indexed by the elements of set S. Note that since S is of size d, we can decompose the determinant:

  det( X_{π_S}^T Q_{π_S} X_{π_S} )  =  det(X_{π_S})² · Π_{i∈S} 1/q_{π_i}.

Whenever this determinant is non-zero, w*_{π_S} is the exact solution of a system of d linear equations:

  (1/√q_{π_i}) x_{π_i}^T w = (1/√q_{π_i}) y_{π_i},   for i ∈ S.

Thus, the rescaling of each equation by 1/√q_{π_i} cancels out, and we can simply write w*_{π_S} = (X_{π_S})^+ y_{π_S}. (Note that this is not the case for sets larger than d whenever the optimum solution incurs positive loss.) We now proceed with summing over all π ∈ [n]^k. Following Proposition 2, we define the normalization constant as Z := k(k−1)···(k−d+1) · det(X^T X), and obtain:

  E[w*_π] = (1/Z) Σ_{π∈[n]^k} ( Π_{i=1}^k q_{π_i} ) Σ_{S∈([k] choose d)} det(X_{π_S})² ( Π_{i∈S} 1/q_{π_i} ) (X_{π_S})^+ y_{π_S}
    (1)  = ( (k choose d) / Z ) [ Σ_{σ∈[n]^d} det(X_σ)² (X_σ)^+ y_σ ] [ Σ_{σ̄∈[n]^{k−d}} Π_{j=1}^{k−d} q_{σ̄_j} ]
    (2)  = ( (k choose d) d! / Z ) [ Σ_{S⊆[n]: |S|=d} det(X_S)² (X_S)^+ y_S ] ( Σ_{i=1}^n q_i )^{k−d}
    (3)  = ( (k choose d) d! det(X^T X) / Z ) w*  =  w*.

Note that in (1) we separate π into two parts, σ and σ̄ (respectively, for subsets S and [k]\S), and sum over them separately. The binomial coefficient counts the number of ways that S can be placed into the sequence π. In (2) we observe that whenever σ has repetitions, the determinant det(X_σ) is zero, so we can switch to summing over sets. Finally, (3) again uses the standard size d volume sampling unbiasedness formula, now for the least-squares task (X, y), and the fact that the q_i's sum to 1.   □

4 Leveraged volume sampling: a natural rescaling

Rescaled volume sampling can be viewed as selecting a sequence of k rank-1 matrices from the covariance matrix X^T X = Σ_{i=1}^n x_i x_i^T. If π_1, …, π_k are sampled i.i.d. from q, i.e., Pr(π) = Π_{i=1}^k q_{π_i}, then the matrix (1/k) X^T Q_π X is an unbiased estimator of the covariance matrix, because E[ q_{π_i}^{−1} x_{π_i} x_{π_i}^T ] = X^T X. In rescaled volume sampling (5), Pr(π) ∝ ( Π_{i=1}^k q_{π_i} ) · det(X^T Q_π X)/det(X^T X), and the latter volume ratio introduces a bias to that estimator. However, we show that this bias vanishes when q is exactly proportional to the leverage scores (proof in Appendix B.3).

Proposition 5 For any q and X as before, if π ∈ [n]^k is sampled according to (5), then

  E[Q_π]  =  (k − d) · I + diag( l_1/q_1, …, l_n/q_n ).

In particular, E[ (1/k) X^T Q_π X ] = X^T E[ (1/k) Q_π ] X = X^T X if and only if q_i = l_i/d > 0 for all i ∈ [n].

This special rescaling, which we call leveraged volume sampling, has other remarkable properties. Most importantly, it leads to a simple and efficient algorithm we call determinantal rejection sampling: repeatedly sample O(d²) indices π_1, …, π_s i.i.d. from q = (l_1/d, …, l_n/d), and accept the sample with probability proportional to its volume ratio. Having obtained a sample, we can further reduce its size via reverse iterative sampling.

Determinantal rejection sampling
  1: Input: X ∈ R^{n×d}, q = (l_1/d, …, l_n/d), k ≥ d
  2: s ← max{k, 4d²}
  3: repeat
  4:   Sample π_1, …, π_s i.i.d. ∼ (q_1, …, q_n)
  5:   Sample Accept ∼ Bernoulli( det( (1/s) X^T Q_π X ) / det(X^T X) )
  6: until Accept = true
  7: S ← VolumeSample( [ (1/√q_{π_1}) x_{π_1}, …, (1/√q_{π_s}) x_{π_s} ]^T, k )
  8: return π_S

We show next that this procedure not only returns a q-rescaled volume sample, but also, exploiting the fact that q is proportional to the leverage scores, it requires (surprisingly) only a constant number of iterations of rejection sampling with high probability.

Theorem 6 Given the leverage score distribution q = (l_1/d, …, l_n/d) and the determinant det(X^T X) for matrix X ∈ R^{n×d}, determinantal rejection sampling returns a sequence π_S distributed according to leveraged volume sampling, and w.p. at least 1 − δ finishes in time O((d² + k) d² ln(1/δ)).

Proof We use a composition property of rescaled volume sampling (proof in Appendix B.4):

Lemma 7 Consider the following sampling procedure, for s ≥ k:
  π ∼ q-rescaled size s volume sampling from X,
  X̃ := [ (1/√q_{π_1}) x_{π_1}, …, (1/√q_{π_s}) x_{π_s} ]^T   (the rescaled subproblem matrix),
  S ∼ standard size k volume sampling from X̃.
Then π_S is distributed according to q-rescaled size k volume sampling from X.

First, we show that the rejection sampling probability in line 5 of the algorithm is bounded by 1:

  det( (1/s) X^T Q_π X ) / det(X^T X)  =  det( (1/s) (X^T X)^{−1/2} X^T Q_π X (X^T X)^{−1/2} )  ≤(∗)  ( (1/(sd)) Σ_{i=1}^s l_{π_i}/q_{π_i} )^d  =  1,

where (∗) follows from the geometric–arithmetic mean inequality for the eigenvalues of the underlying matrix (whose trace equals d, since q_i = l_i/d). This shows that the sequence π is drawn according to q-rescaled volume sampling of size s. Now, Lemma 7 implies the correctness of the algorithm. Next, we use Proposition 2 to compute the expected value of the acceptance probability from line 5 under the i.i.d. sampling of line 4:

  E[ det( (1/s) X^T Q_π X ) / det(X^T X) ]  =  s(s−1)···(s−d+1) / s^d  ≥  ( 1 − (d−1)/s )^d  ≥  1 − d(d−1)/s  ≥  3/4,

where we also used Bernoulli's inequality and the fact that s ≥ 4d² (see line 2). Since the expected value of the acceptance probability is at least 3/4, an easy application of Markov's inequality shows that at each trial there is at least a 50% chance of it being above 1/2. So, the probability of needing more than r trials is less than (1 − 1/4)^r. Note that the computational cost of one trial is no more than the cost of an SVD decomposition of the matrix X^T Q_π X (for computing the determinant), which is O(sd²). The cost of reverse iterative sampling (line 7) is also O(sd²) with high probability (as shown by [13]). Thus, the overall runtime is O((d² + k) d² r), where r ≤ ln(1/δ)/ln(4/3) w.p. at least 1 − δ.   □
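The following NumPy sketch (ours, a simplified illustration rather than the authors' implementation) puts the pieces together: the rejection step from lines 2–6 of the algorithm, followed by the down-sampling step justified by Lemma 7, here implemented with a naive reverse iterative loop instead of the optimized O(sd²) version of [13].

```python
import numpy as np

def leverage_scores(X):
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def reverse_iterative_volume_sample(Xt, k, rng):
    """Standard size-k volume sampling of rows of Xt (naive, unoptimized loop)."""
    S = list(range(Xt.shape[0]))
    while len(S) > k:
        A_inv = np.linalg.inv(Xt[S].T @ Xt[S])
        # removal probability of row i is prop. to 1 - x_i^T (X_S^T X_S)^{-1} x_i
        p = 1.0 - np.einsum('ij,jk,ik->i', Xt[S], A_inv, Xt[S])
        p = np.clip(p, 0.0, None)
        i = rng.choice(len(S), p=p / p.sum())
        S.pop(i)
    return np.array(S)

def determinantal_rejection_sample(X, k, rng):
    """Leveraged volume sampling of a sequence of k row indices (with replacement)."""
    n, d = X.shape
    q = leverage_scores(X) / d                  # q_i = l_i / d
    s = max(k, 4 * d * d)
    _, logdet_full = np.linalg.slogdet(X.T @ X)
    while True:
        pi = rng.choice(n, size=s, replace=True, p=q)
        Xt = X[pi] / np.sqrt(q[pi])[:, None]    # rows x_{pi_i} / sqrt(q_{pi_i})
        _, logdet_sub = np.linalg.slogdet((Xt.T @ Xt) / s)
        accept_prob = np.exp(logdet_sub - logdet_full)   # det((1/s) X^T Q X) / det(X^T X)
        if rng.random() < accept_prob:
            break
    S = reverse_iterative_volume_sample(Xt, k, rng)      # Lemma 7: down-sample to size k
    return pi[S]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
pi = determinantal_rejection_sample(X, k=20, rng=rng)
```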
4.1 Tail bounds for leveraged volume sampling

An analysis of leverage score sampling, essentially following [33, Section 2], which in turn draws from [31], highlights two basic sufficient conditions on the (random) subsampling matrix Q_π that lead to multiplicative tail bounds for L(w*_π). It is convenient to shift to an orthogonalization of the linear regression task (X, y) by replacing matrix X with the matrix U = X (X^T X)^{−1/2} ∈ R^{n×d}. It is easy to check that the columns of U have unit length and are orthogonal, i.e., U^T U = I. Now, v* = U^T y is the least-squares solution for the orthogonal problem (U, y), and the prediction vector U v* = U U^T y for (U, y) is the same as the prediction vector X w* = X (X^T X)^{−1} X^T y for the original problem (X, y). The same property holds for the subsampled estimators, i.e., U v*_π = X w*_π, where v*_π = (Q_π^{1/2} U)^+ Q_π^{1/2} y. Volume sampling probabilities are also preserved under this transformation, so w.l.o.g. we can work with the orthogonal problem.

Now, L(v*_π) can be rewritten as

  L(v*_π) = || U v*_π − y ||²  =(1)  || U v* − y ||² + || U (v*_π − v*) ||²  =(2)  L(v*) + || v*_π − v* ||²,   (6)

where (1) follows via the Pythagorean theorem from the fact that U(v*_π − v*) lies in the column span of U and the residual vector r = y − U v* is orthogonal to all columns of U, and (2) follows from U^T U = I. By the definition of v*_π, we can write || v*_π − v* || as follows:

  || v*_π − v* ||  =  || (U^T Q_π U)^{−1} U^T Q_π (y − U v*) ||  ≤  || (U^T Q_π U)^{−1} || · || U^T Q_π r ||,   (7)

where ||A|| denotes the matrix 2-norm (i.e., the largest singular value) of A; when A is a vector, then ||A|| is its Euclidean norm. This breaks our task down to showing two key properties:
1. Matrix multiplication: upper bounding the Euclidean norm || U^T Q_π r ||,
2. Subspace embedding: upper bounding the matrix 2-norm || (U^T Q_π U)^{−1} ||.

We start with a theorem that implies strong guarantees for approximate matrix multiplication with leveraged volume sampling. Unlike with i.i.d. sampling, this result requires controlling the pairwise dependence between indices selected under rescaled volume sampling. Its proof is an interesting application of a classical Hadamard matrix product inequality from [3] (proof in Appendix C).

Theorem 8 Let U ∈ R^{n×d} be a matrix s.t. U^T U = I. If the sequence π ∈ [n]^k is selected using leveraged volume sampling of size k ≥ 2d, then for any r ∈ R^n,

  E[ || (1/k) U^T Q_π r − U^T r ||² ]  ≤  O(d/k) · ||r||².
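The two quantities that Theorem 8 and the subspace embedding result below control are easy to evaluate numerically. The sketch below (ours; it uses i.i.d. leverage score sampling as a stand-in for leveraged volume sampling, since both use the same rescaling matrix Q_π) computes the matrix-multiplication term ||(1/k) U^T Q_π r|| and the smallest eigenvalue of (1/k) U^T Q_π U that enter the proof of Corollary 10.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2000, 8, 200
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Orthogonalize: U = X (X^T X)^{-1/2}, so that U^T U = I.
evals, evecs = np.linalg.eigh(X.T @ X)
U = X @ evecs @ np.diag(evals**-0.5) @ evecs.T

v_star = U.T @ y
r = y - U @ v_star                    # residual, so U^T r = 0

# Rescaled sample: q_i = l_i / d, and Q_pi puts weight 1/q_i on each selected index.
l = np.sum(U**2, axis=1)              # leverage scores (sum to d)
q = l / d
pi = rng.choice(n, size=k, replace=True, p=q)   # i.i.d. stand-in for volume sampling
UQ = (U[pi] / q[pi][:, None]).T       # columns are (1/q_i) u_i for i in pi

# Matrix-multiplication term: ||(1/k) U^T Q_pi r|| relative to ||r|| (small).
print(np.linalg.norm(UQ @ r[pi]) / k / np.linalg.norm(r))

# Subspace-embedding term: lambda_min((1/k) U^T Q_pi U), bounded away from 0.
eigs = np.linalg.eigvalsh((UQ @ U[pi]) / k)
print(eigs.min())
```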
Next, we turn to the subspace embedding property. The following result is remarkable because standard matrix tail bounds used to prove this property for leverage score sampling are not applicable to volume sampling. In fact, obtaining matrix Chernoff bounds for negatively associated joint distributions like volume sampling is an active area of research, as discussed in [21]. We address this challenge by defining a coupling procedure for volume sampling and uniform sampling without replacement, which leads to a curious reduction argument described in Appendix D.

Theorem 9 Let U ∈ R^{n×d} be a matrix s.t. U^T U = I. There is an absolute constant C, s.t. if the sequence π ∈ [n]^k is selected using leveraged volume sampling of size k ≥ C d ln(d/δ), then

  Pr( λ_min( (1/k) U^T Q_π U ) ≤ 1/8 )  ≤  δ.

Theorems 8 and 9 imply that the unbiased estimator w*_π produced from leveraged volume sampling achieves multiplicative tail bounds with sample size k = O(d log d + d/ε).

Corollary 10 Let X ∈ R^{n×d} be a full rank matrix. There is an absolute constant C, s.t. if the sequence π ∈ [n]^k is selected using leveraged volume sampling of size k ≥ C ( d ln(d/δ) + d/(εδ) ), then for the estimator w*_π = argmin_w || Q_π^{1/2} (Xw − y) ||², we have

  L(w*_π) ≤ (1 + ε) L(w*)   with probability at least 1 − δ.

Proof Let U = X (X^T X)^{−1/2}. Combining Theorem 8 with Markov's inequality, we have that for large enough C,

  || U^T Q_π r ||²  ≤  (ε k²/8²) · ||r||²   w.h.p., where r = y − U v*,

while Theorem 9 guarantees that || (U^T Q_π U)^{−1} || ≤ 8/k w.h.p. Finally, following (6) and (7) above, we have that w.h.p.

  L(w*_π)  ≤  L(w*) + || (U^T Q_π U)^{−1} ||² · || U^T Q_π r ||²  ≤  L(w*) + (8²/k²) · (ε k²/8²) ||r||²  =  (1 + ε) L(w*).   □

5 Conclusion

We developed a new variant of volume sampling which produces the first known unbiased subsampled least-squares estimator with strong multiplicative loss bounds. In the process, we proved a novel extension of the Cauchy–Binet formula, as well as other fundamental combinatorial equalities. Moreover, we proposed an efficient algorithm called determinantal rejection sampling, which is to our knowledge the first joint determinantal sampling procedure that (after an initial O(nd²) preprocessing step for computing leverage scores) produces its k samples in time Õ((d² + k)d²), independent of the data size n. When n is very large, the preprocessing time can be reduced to Õ(nd + d⁵) by rescaling with sufficiently accurate approximations of the leverage scores. Surprisingly, the estimator stays unbiased and the loss bound still holds with only slightly revised constants. For the sake of clarity we presented the algorithm based on rescaling with exact leverage scores in the main body of the paper; however, we outline the changes needed when using approximate leverage scores in Appendix F.

In this paper we focused on tail bounds. However, we conjecture that there are also volume sampling based unbiased estimators achieving expected loss bounds E[L(w*_π)] ≤ (1 + ε) L(w*) with sample size O(d/ε).

Acknowledgements

Michał Dereziński and Manfred K. Warmuth were supported by NSF grant IIS-1619271. Daniel Hsu was supported by NSF grant CCF-1740833.

References

[1] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 126–135, Sydney, Australia, 2017.
[3] T. Ando, Roger A. Horn, and Charles R. Johnson. The singular values of a Hadamard product: A basic inequality. Journal of Linear and Multilinear Algebra, 21(4):345–365, 1987.
[4] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
[5] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.
[6] L. Elisa Celis, Amit Deshpande, Tarun Kathuria, and Nisheeth K. Vishnoi. How to be fair and diverse? arXiv:1610.07183, October 2016.
[7] L. Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth K. Vishnoi. Fair and diverse DPP-based data summarization. arXiv:1802.04023, February 2018.
[8] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996. Earlier version in 6th COLT, 1993.
[9] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] Xue Chen and Eric Price. Condition number-free query and active learning of linear families. CoRR, abs/1711.10051, 2017.
[11] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems 30, pages 3087–3096, Long Beach, CA, USA, December 2017.
[12] Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018.
[13] Michał Dereziński and Manfred K. Warmuth. Subsampling for ridge regression via regularized volume sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.
[14] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 329–338, Washington, DC, USA, 2010.
[15] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '06, pages 1117–1126, Philadelphia, PA, USA, 2006.
[16] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(1):3475–3506, December 2012.
[17] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136, 2006.
[18] Valerii V. Fedorov, William J. Studden, and E. M. Klimko, editors. Theory of Optimal Experiments. Probability and Mathematical Statistics. Academic Press, New York, 1972.
[19] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pages 349–356, New York, NY, USA, 2016.
[20] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv:1001.2738, January 2010.
[21] Nicholas J. A. Harvey and Neil Olver. Pipage rounding, pessimistic estimators and matrix concentration. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 926–945. SIAM, 2014.
[22] Wassily Hoeffding.
Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[23] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200. Omnipress, 2011.
[24] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.
[25] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 250–269. IEEE, 2015.
[26] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5045–5054, 2017.
[27] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, February 2011.
[28] Zelda E. Mariet and Suvrit Sra. Elementary symmetric polynomials for optimal experimental design. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2136–2145, 2017.
[29] Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. arXiv:1802.08318, July 2018.
[30] Robin Pemantle and Yuval Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140–160, 2014.
[31] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS '06, pages 143–152, Washington, DC, USA, 2006.
[32] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, August 2012.
[33] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.