# Robust Partially-Compressed Least-Squares

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Stephen Becker, University of Colorado Boulder, stephen.becker@colorado.edu
Ban Kawas, IBM T.J. Watson Research Center, bkawas@us.ibm.com
Marek Petrik, University of New Hampshire, mpetrik@cs.unh.edu

**Abstract.** Randomized matrix compression techniques, such as the Johnson-Lindenstrauss transform, have emerged as an effective and practical way to solve large-scale problems efficiently. The focus on computational efficiency, however, comes at the cost of solution quality and accuracy. In this paper, we investigate compressed least-squares problems and propose new models and algorithms that address the error and noise introduced by compression. While maintaining computational efficiency, our models provide robust solutions that are more accurate than those of classical compressed variants. We introduce tools from robust optimization, together with a form of partial compression, to improve the error-time trade-offs of compressed least-squares solvers. We develop an efficient solution algorithm for our Robust Partially-Compressed (RPC) model based on a reduction to a one-dimensional search.

## Introduction

Random projection is a simple and effective dimensionality-reduction technique that enables significant speedups in solving large-scale machine learning problems (Dasgupta 2000; Mahoney 2011; Woodruff 2014). It has been successfully used, for example, in classification (Pilanci and Wainwright 2014; Zhang et al. 2013), clustering (Boutsidis, Zouzias, and Drineas 2010; Fern and Brodley 2003; Urruty, Djeraba, and Simovici 2007), and least-squares problems (Drineas et al. 2011; Pilanci and Wainwright 2014). The focus of this paper is on the latter. We consider the following canonical least-squares estimator, with $A \in \mathbb{R}^{M\times N}$:

$$x_{\mathrm{LS}} \;\stackrel{\text{def}}{=}\; \arg\min_x \tfrac12\|Ax-b\|^2 = (A^T A)^{-1} A^T b \tag{1}$$

where $\|\cdot\|$, for vectors, denotes the Euclidean norm throughout the paper, and $A$ has full column rank. We assume that $M \gg N$ and refer to $x_{\mathrm{LS}}$ as the solution of the uncompressed problem.

When $M$ is very large, solving the least-squares problem in (1) can be time-consuming and computationally expensive. To gain the necessary speedups, random projections are used. The standard approach proceeds as follows (Drineas et al. 2011). First, we construct a compression matrix $\Phi \in \mathbb{R}^{m\times M}$ from a random distribution such that $\mathbb{E}[\Phi^T\Phi] = I$ and $m \ll M$. Then, we solve the fully compressed problem:

$$x_{\mathrm{CLS}} \;\stackrel{\text{def}}{=}\; \arg\min_x \tfrac12\|\Phi(Ax-b)\|^2 \tag{2}$$

Numerous efficient methods for constructing the compression matrix $\Phi$ have been developed; surveys are provided in (Boutsidis, Drineas, and Magdon-Ismail 2013; Drineas et al. 2011; Mahoney 2011). We describe and use several common constructions of $\Phi$ in the empirical results section below. When $m$ is small, the fully compressed least-squares problem (2) is much easier to solve than the uncompressed least-squares problem (1).
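To make the classical pipeline concrete, the following minimal numpy sketch builds a scaled Gaussian compression matrix and solves (2). The function name, the seed handling, and the choice of a Gaussian sketch (rather than one of the faster constructions surveyed above) are our own illustrative assumptions.

```python
import numpy as np

def compressed_ls(A, b, m, seed=0):
    """Minimal sketch of the fully compressed estimator (2), using a scaled
    Gaussian sketch with entries N(0, 1/m) so that E[Phi^T Phi] = I."""
    rng = np.random.default_rng(seed)
    M, _ = A.shape
    Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, M))
    # Solve argmin_x ||Phi (A x - b)|| on the small m-by-N problem.
    x_cls, *_ = np.linalg.lstsq(Phi @ A, Phi @ b, rcond=None)
    return x_cls
```

Only an $m \times N$ least-squares problem is solved, which is where the speedup over (1) comes from.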
However, compression can introduce significant errors into the solution $x_{\mathrm{CLS}}$ when compared with the uncompressed solution $x_{\mathrm{LS}}$, and one is forced to consider the trade-off between accuracy and efficiency. As our main contribution, we propose and analyze two new models that address this issue and provide a desirable trade-off, enabling robust solutions while preserving a small computational complexity. The results in Fig. 1, which we explain in detail in the experimental section, demonstrate the main issue with standard methods as well as how we alleviate it.

[Figure 1: Cumulative probability distribution of residuals of various compressed least-squares methods (Full LS; Compressed LS, plain/regularized/robust; Partially Compressed LS, plain/regularized/robust). The horizontal axis is the residual normalized by that of the full least-squares solution; the vertical axis is the empirical probability that the residual of a method is within a factor t of the best method.]

Our new model is the following partially-compressed least-squares estimator:

$$\min_x\ \tfrac12\|\Phi A x\|^2 - b^T A x \tag{3}$$

which is denoted Partial Compressed LS in Fig. 1, and whose solution is given by

$$x_{\mathrm{PCLS}} \;\stackrel{\text{def}}{=}\; (A^T\Phi^T\Phi A)^{-1}A^T b. \tag{4}$$

Note that only the computationally expensive part of the ordinary least-squares estimator, which involves inverting $A^TA$, is compressed. Also notice, in comparison, that the objective function of the fully compressed least-squares estimator is $\tfrac12\|\Phi A x\|^2 - b^T\Phi^T\Phi A x$. While it is not the focus of this paper (our goal here is to introduce the new estimator (3)), it is important to note that the prediction error $\|Ax_{\mathrm{PCLS}} - Ax_{\mathrm{LS}}\|$ of the partially compressed solution $x_{\mathrm{PCLS}}$ is not always smaller than that of the fully compressed one, $x_{\mathrm{CLS}}$. We describe when and why this is the case in the section on approximation error bounds.

To further reduce the residuals of partial least squares, we derive our second model, the robust partially-compressed least-squares estimator (RPC), denoted Partial Comp. LS, robust in Fig. 1. RPC explicitly models the errors introduced by compression and is closely related to robust least-squares regression (El Ghaoui and Lebret 1997). Leveraging robust optimization, a data-driven approach that has been widely used over the last two decades (Ben-Tal, El Ghaoui, and Nemirovski 2009), makes it possible to reduce the solution error without excessively increasing the computational complexity. In our numerical results, we observe an effect similar to that of applying robust optimization to the fully compressed least-squares solution: increased accuracy and reduced error.

While we show that RPC can be formulated as a second-order cone program (SOCP), generic off-the-shelf SOCP solvers may be slow for large problems. Therefore, as one of our contributions, we develop a fast algorithm based on a one-dimensional search that can be significantly faster than CPLEX. Using this fast algorithm, the RPC model is asymptotically just as efficient as the non-robust model. Table 1 puts our results in the context of prior related work.

| | Least Squares | Ridge Regression | Robust Least Squares |
|---|---|---|---|
| No compression | many | e.g., (Boyd and Vandenberghe 2004) | (El Ghaoui and Lebret 1997) |
| Partial compression | new: (3) | new: (5) | new: (6) |
| Full compression | e.g., (Drineas et al. 2011) | e.g., (Boyd and Vandenberghe 2004) | new (but algorithm via El Ghaoui) |

Table 1: Our work in the context of previous results. The equation numbers point to the objective functions for each method.

Our empirical results show that both partially-compressed and robust partially-compressed solutions can outperform models that use full compression in terms of solution quality. We also show that the compressed variants are more computationally efficient than ordinary least squares, especially as the dimensions grow.

## Robust Partially-Compressed Least-Squares

As described above, our objective is to enhance solution quality and increase robustness against the noise and errors introduced by compression. One way of improving robustness is ridge regression; applied to our model (3), it yields the formulation

$$\min_x\ \tfrac12\|\Phi A x\|^2 - b^T A x + \mu\|x\|^2, \tag{5}$$

for some regularization parameter $\mu$.
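As a concrete illustration of (4) and its ridge variant (5), the following numpy sketch solves the partially compressed normal equations; the function name and interface are ours, and the factor 2 on $\mu$ comes directly from differentiating the objective (5) as written.

```python
import numpy as np

def partial_compressed_ls(A, b, Phi, mu=0.0):
    """Sketch of the partially compressed estimator (4); with mu > 0 this becomes
    the ridge-regularized variant (5), whose normal equations gain a 2*mu*I term."""
    P = Phi @ A                                   # only A is compressed
    G = P.T @ P + 2.0 * mu * np.eye(A.shape[1])   # A^T Phi^T Phi A (+ 2 mu I)
    return np.linalg.solve(G, A.T @ b)            # the right-hand side stays uncompressed
```

Calling `partial_compressed_ls(A, b, Phi)` reproduces (4); passing `mu > 0` gives the ridge variant (5).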
One caveat of ridge regression is that it does not capture the error structure introduced by compression, which differs significantly from the error structure present in the data of the original, uncompressed ordinary least-squares problem. Robust optimization (Ben-Tal, El Ghaoui, and Nemirovski 2009), however, enables us to do exactly that: it allows us to model the error structure explicitly. The following is our Robust Partially-Compressed (RPC) estimator:

$$x_{\mathrm{RPC}} = \arg\min_x\ \max_{\|\Delta P\|_F\le\rho}\ \tfrac12\|(P+\Delta P)x\|^2 - b^TAx \tag{6}$$

where $P = \Phi A$ and $\Delta P$ is a matrix variable of size $m\times N$. This formulation allows a targeted model of the noise: it captures the fact that $\Phi A x$ is a random variable while $b^TAx$ is not. That is, the uncertainty is restricted to the data matrix $P$ alone, since partial compression does not introduce any noise into the right-hand side.

It is worth noting that, without compression, applying robust optimization to the ordinary least-squares problem yields the same solution as ridge regression with a data-dependent parameter (El Ghaoui and Lebret 1997). As we will show, this is not the case in our setting: robust partially-compressed least-squares does not reduce to ridge regression. Empirically, we have also observed that robust partially-compressed least-squares tends to yield better results than ridge regression and offers more interpretable parameters. These observations motivated us to focus on the RPC model (6) and to derive a corresponding efficient solution algorithm.

## Optimal Solution of RPC

While the inner optimization in (6) is a non-convex problem, we show in the following lemma that it has a closed-form solution.

**Lemma 1.** The inner maximization in (6) can be reformulated, for any $x$, as
$$\max_{\|\Delta P\|_F\le\rho}\|(P+\Delta P)x\|^2 = \big(\|Px\|+\rho\|x\|\big)^2. \tag{7}$$
In addition, the maximal value is achieved for $\Delta P = \frac{\rho}{\|Px\|\,\|x\|}\,Pxx^T$.

*Proof.* The objective can be upper-bounded using the triangle inequality:
$$\max_{\|\Delta P\|_F\le\rho}\|(P+\Delta P)x\|^2 \le \max_{\|\Delta P\|_F\le\rho}\big(\|Px\|+\|\Delta Px\|\big)^2 \le \big(\|Px\|+\rho\|x\|\big)^2.$$
To show that this bound is tight, consider $\overline{\Delta P} = \frac{\rho}{\|Px\|\,\|x\|}\,Pxx^T$. It can readily be verified that $\|\overline{\Delta P}\|_F = \rho$. Then, by algebraic manipulation,
$$\max_{\|\Delta P\|_F\le\rho}\|(P+\Delta P)x\|^2 \ge \|(P+\overline{\Delta P})x\|^2 = \big(\|Px\|+\rho\|x\|\big)^2. \qquad\square$$
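A quick numerical check of Lemma 1 (with arbitrary dimensions, perturbation radius, and random data of our own choosing) confirms that the proposed worst-case perturbation has Frobenius norm exactly $\rho$ and attains the closed-form value (7):

```python
import numpy as np

rng = np.random.default_rng(1)
P, x, rho = rng.normal(size=(20, 5)), rng.normal(size=5), 0.7

# Worst-case perturbation from Lemma 1: Delta P = rho/(||Px|| ||x||) * P x x^T
Px = P @ x
dP = rho / (np.linalg.norm(Px) * np.linalg.norm(x)) * np.outer(Px, x)

lhs = np.linalg.norm((P + dP) @ x) ** 2                     # value attained by dP
rhs = (np.linalg.norm(Px) + rho * np.linalg.norm(x)) ** 2   # closed form (7)
print(np.linalg.norm(dP, "fro"), lhs, rhs)                  # ||dP||_F == rho and lhs == rhs
```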
Using Lemma 1, the robust partially-compressed estimator $x_{\mathrm{RPC}}$ is the optimal solution of

$$\min_x\ \tfrac12\big(\|Px\|+\rho\|x\|\big)^2 - b^TAx. \tag{8}$$

We now analyze the structure of the optimal solution and point out connections to, and differences from, ridge regression.

**Theorem 1.** The optimal solution $x_{\mathrm{RPC}}$ of (8) must satisfy
$$x_{\mathrm{RPC}} = \frac{1}{\alpha+\rho\beta}\big(\alpha^{-1}P^TP+\rho\beta^{-1}I\big)^{-1}A^Tb \tag{9}$$
with $\alpha = \|Px_{\mathrm{RPC}}\|$ and $\beta = \|x_{\mathrm{RPC}}\|$, or $x_{\mathrm{RPC}} = 0$ if $A^Tb = 0$.

*Proof.* The theorem follows from the first-order optimality conditions. The objective in (8) is convex everywhere and differentiable everywhere except at $x=0$. The solution $x=0$ is optimal only if $A^Tb=0$: the objective at $x=0$ is $0$, and if $A^Tb\ne 0$, then for sufficiently small $t>0$ the point $tA^Tb$ gives a strictly negative objective (since $t^2 = o(t)$ as $t\to 0$), hence $x=0$ is not optimal. If $x\ne 0$, the following first-order conditions are necessary and sufficient:
$$0 = \big(\|Px\|+\rho\|x\|\big)\Big(\frac{P^TPx}{\|Px\|}+\rho\frac{x}{\|x\|}\Big) - A^Tb,$$
from which we derive (9). The theorem follows by setting $\alpha$ and $\beta$ to the required values. $\square$

Theorem 1 shows that the optimal solution of the robust partially-compressed least-squares problem is structurally similar to a ridge-regression solution. The two main differences are that there are two parameters, $\alpha$ and $\beta$, and that these parameters are data-dependent. When $\rho$ is set to 1, as in our empirical study, one advantage over ridge regression is that there is no regularization parameter $\mu$ to fine-tune; one relies only on the data-driven parameters $\alpha$ and $\beta$. Even when the free parameter $\rho$ of RPC does need tuning (we have not done so in our results and simply set $\rho = 1$), $\rho$ has a structural meaning: it is the size of the uncertainty set in (6) and (7), so it is straightforward to build intuition for how to set its value, which is not the case for the regularization parameter $\mu$. In ongoing work, beyond the scope of this paper, we are building connections between $\rho$ and the compression dimension $m$, which will allow $\rho$ to be set appropriately as a function of $m$.

Note that Theorem 1 does not provide a method to compute $x_{\mathrm{RPC}}$, since $\alpha$ and $\beta$ depend on $x_{\mathrm{RPC}}$. However, since (8) is a convex optimization problem, we can reformulate it as the following second-order cone program (SOCP) in standard form:
$$\min_{x,t,u,z}\ \tfrac12 z - b^TAx \quad\text{s.t.}\quad \|Px\|\le t,\quad \rho\|x\|\le u,\quad \left\|\begin{pmatrix}t+u\\ (z-1)/2\end{pmatrix}\right\|\le \frac{z+1}{2}. \tag{10}$$
The last constraint translates to $z \ge (t+u)^2$. While efficient polynomial-time algorithms exist for solving SOCPs, they are typically significantly slower than solving least-squares problems. To achieve a practical speedup, we therefore need a more efficient algorithm; we propose a reduction to a one-dimensional optimization problem in the following section.

## Efficient Computation of RPC

In this section, we describe an approach that is faster than solving the SOCP (10), based on a reduction to a one-dimensional search. First, we reformulate the optimization problem (8) as
$$\min_{x,t}\ \tfrac12 t^2 - b^TAx \quad\text{s.t.}\quad \|Px\|+\rho\|x\|\le t. \tag{11}$$
Our goal is to derive and then solve the dual problem. The Lagrangian of (11) is
$$L(x,t,\tau) = \tfrac12 t^2 - b^TAx + \tau\big(\|Px\|+\rho\|x\|-t\big).$$
Since strong duality holds, we solve the one-dimensional dual maximization problem $\max_{\tau\ge 0} g(\tau)$, where
$$g(\tau) = \min_t\Big[\tfrac12 t^2-\tau t\Big] + \min_x\Big[\tau\big(\|Px\|+\rho\|x\|\big)-b^TAx\Big] = -\tfrac12\tau^2 + \min_x\ \underbrace{\tau\big(\|Px\|+\rho\|x\|\big)-b^TAx}_{h_\tau(x)}. \tag{12}$$
The second equality follows from minimizing over $t$; indeed, $\|Px\|+\rho\|x\| = t = \tau$ for the optimal primal and dual solutions. Observe that $h_\tau(x)$ is positively homogeneous in $x$ and therefore
$$\min_x h_\tau(x) = \begin{cases} -\infty & \tau<\tau^\star \quad\text{(Case 1)}\\ 0 & \tau=\tau^\star \quad\text{(Case 2)}\\ 0 & \tau>\tau^\star \quad\text{(Case 3)}\end{cases} \tag{13}$$
where $\tau^\star\ge 0$ is the optimal dual value. Intuitively, to find the optimal solution we need to locate the largest $\tau$ for which the minimum $\min_x h_\tau(x)=0$ is attained at a nonzero $x$; this threshold is exactly $\tau^\star$. The appendix derives the resulting procedure, which is summarized in Algorithm 1. Observe that the function $h_\tau(x)$ is convex. The main idea is to reduce the optimization to a single-dimensional minimization and to solve it with Newton's method.

**Algorithm 1: Efficient algorithm for solving RPC**

- **Input:** $A$, $b$, $\Phi$, $P=\Phi A$, $\rho$. **Output:** $x$.
- Initialization: $(U,\Sigma,V)\leftarrow\mathrm{SVD}(P)$; $\bar b\leftarrow V^TA^Tb$; $\tau\leftarrow\rho\|b\|^2/2$.
- Repeat (solve $x\leftarrow\arg\min_x h_\tau(x)$):
  - $\gamma\leftarrow$ root of $\varphi(\gamma)=\tau^{-2}\sum_{i=1}^N \bar b_i^2/(\gamma\sigma_i^2+\rho)^2-1$ (Newton's method);
  - $y\leftarrow\tau^{-1}(\gamma\Sigma^2+\rho I)^{-1}\bar b$; (when $\tau=\tau^\star$, $\alpha=\|\Sigma y\|$)
  - $\tau\leftarrow\tau\,\|\Sigma y\|\,\gamma$;
- until $\big|\,\|\Sigma y\|\,\gamma-1\big|\le\epsilon$.
- Recover the solution: $\alpha\leftarrow\tau/(1+\rho\gamma)$ (using $\alpha+\rho\beta=\tau$); $\beta\leftarrow(\tau-\alpha)/\rho$; $x\leftarrow\beta\,Vy$.
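For concreteness, here is a small numpy/scipy sketch of the one-dimensional reduction. Instead of searching over the dual variable $\tau$ as Algorithm 1 does, it solves the equivalent scalar consistency equation in $\gamma = \beta/\alpha$ implied by the fixed-point condition (9); the function names, the bracketing strategy, and the use of scipy's `brentq` are our own choices, and the code assumes that $P = \Phi A$ has full column rank and $A^Tb \ne 0$ (otherwise $x = 0$ by Theorem 1).

```python
import numpy as np
from scipy.optimize import brentq

def rpc_solve(A, b, Phi, rho=1.0):
    """One-dimensional search for the RPC problem (8):
    min_x 0.5*(||P x|| + rho*||x||)**2 - b^T A x, with P = Phi @ A."""
    P = Phi @ A
    _, s, Vt = np.linalg.svd(P, full_matrices=False)   # P = U diag(s) V^T
    bbar = Vt @ (A.T @ b)                              # right-hand side in the SVD basis

    def w(gamma):                                      # unnormalized solution direction
        return bbar / (gamma * s**2 + rho)

    def psi(gamma):                                    # consistency residual for gamma = beta/alpha
        wg = w(gamma)
        return gamma * np.linalg.norm(s * wg) - np.linalg.norm(wg)

    hi = 1.0                                           # psi(0) < 0 and psi grows to ||bbar/s|| > 0
    while psi(hi) <= 0:
        hi *= 2.0
    gamma = brentq(psi, 0.0, hi)                       # any root is a stationary point of the
                                                       # convex objective, hence a global minimizer
    y = gamma / (1.0 + rho * gamma) * w(gamma)         # rescale so that (9) holds exactly
    return Vt.T @ y                                    # x = V y
```

As in Algorithm 1, the only $O(N^3)$-type cost is the thin SVD of $P$; each evaluation of the scalar residual costs $O(N)$.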
We also use the SVD of $P$ to make the search more efficient, so that only a single $O(N^3)$ step is needed. In terms of computational complexity, Algorithm 1 requires $O(mN^2+N^3)$ operations, and all operations inside the loop are dominated by $O(N^3)$; the number of iterations needed depends on the desired precision. Table 2 compares the asymptotic computational complexity of the proposed robust partial compression with that of computing the full least-squares solution.

| | Least Squares | Robust Partial Compression, Gaussian | Robust Partial Compression, Walsh-Hadamard | Robust Partial Compression, Counting |
|---|---|---|---|---|
| Compression time | -- | $O(mMN)$ | $O(MN\log M)$ | $O(\mathrm{nnz})$ |
| Solution time | $O(MN^2)$ | $O(mN^2+N^3)$ | $O(mN^2+N^3)$ | $O(mN^2+N^3)$ |
| Total time | $O(MN^2)$ | $O(mMN+mN^2)$ | $O(MN\log M+mN^2)$ | $O(\mathrm{nnz}+mN^2)$ |

Table 2: Asymptotic computational complexity of various compression methods. The symbol nnz denotes the number of non-zero elements of $A$; we assume that $m\ge N$ and $M\ge N$.

## Approximation Error Bounds

An analysis of solution quality is known for the fully compressed least-squares problem (e.g., Pilanci and Wainwright 2014). In this section, we derive bounds for partially-compressed least-squares regression. First, the following simple analysis elucidates the relative trade-offs between the fully and partially projected solutions. Let $x^\star$ be the solution of the full least-squares problem (1) and $z^\star = b-Ax^\star$ its residual; recall that $A^Tz^\star = 0$. When $x_{\mathrm{CLS}}$ is the solution of (2), then
$$x_{\mathrm{CLS}} = (A^T\Phi^T\Phi A)^{-1}A^T\Phi^T\Phi b = x^\star + (A^T\Phi^T\Phi A)^{-1}A^T\Phi^T\Phi z^\star.$$
On the other hand, the solution $x_{\mathrm{PCLS}}$ in (4) satisfies
$$x_{\mathrm{PCLS}} = (A^T\Phi^T\Phi A)^{-1}A^Tb = (A^T\Phi^T\Phi A)^{-1}A^TAx^\star.$$
The error in $x_{\mathrm{CLS}}$ is additive and is a function of the residual $z^\star$. The error in $x_{\mathrm{PCLS}}$ is, on the other hand, multiplicative and independent of $z^\star$. As a result, a small $z^\star$ favors the standard fully compressed formulation, while a large $z^\star$ favors the new partially compressed one.

We now show that, in the sense of the following definition, the residual of the optimal solution of the partial-projection problem is close to the residual of the true least-squares solution.

**Definition 1 ($\epsilon$-optimal solution).** We say that a solution $\hat x$ is $\epsilon$-optimal if it satisfies
$$\|A(\hat x - x_{\mathrm{LS}})\| \le \epsilon\,\|Ax_{\mathrm{LS}}\|, \qquad \epsilon\in(0,1), \tag{14}$$
where $x_{\mathrm{LS}}$ is an optimal solution of the original high-dimensional problem (1).

For sub-Gaussian and ROS sketches, we show that the results of (Pilanci and Wainwright 2014) can be extended to bound the approximation error of partially-compressed least-squares in the sense of $\epsilon$-optimality. These results are nearly independent of the number of rows $M$ of the data matrix (except through $\|Ax_{\mathrm{LS}}\|$). The main guarantee for unconstrained least-squares is given in the following theorem (proof in the appendix), which provides an exponential tail bound.

**Theorem 2 (Approximation Guarantee).** Given a normalized sketching matrix $\Phi\in\mathbb{R}^{m\times M}$ and universal constants $c_0, c_0', c_1, c_2$, the sketched solution $x_{\mathrm{PCLS}}$ in (4) is $\epsilon$-optimal (14) with probability at least $1-c_1\exp(-c_2\,m\,\epsilon^2)$, for any tolerance parameter $\epsilon\in(0,1)$, whenever the sketch (compression) size $m$ is bounded below by
(i) $m > c_0\,\mathrm{rank}(A)/\epsilon^2$, if $\Phi$ is a scaled sub-Gaussian sketch;
(ii) $m > c_0'\,\mathrm{rank}(A)\log^4(N)/\epsilon^2$, if $\Phi$ is a scaled randomized orthogonal system (ROS) sketch.

By a scaled sketch we mean $\mathbb{E}[\Phi^T\Phi] = I$: for partial compression, the scaling of $\Phi$ does affect the answer, unlike in full compression. For example, in the Gaussian case we draw the entries of $\Phi$ from $N(0,1/m)$ instead of $N(0,1)$.
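As a quick numerical illustration (not part of the paper's experiments), the sketch below draws a scaled Gaussian sketch and reports the $\epsilon$-optimality measure of Definition 1 for the partially compressed solution as $m$ grows; the dimensions, noise level, and seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 10000, 30
A = rng.normal(size=(M, N))
b = A @ rng.normal(size=N) + 0.5 * rng.normal(size=M)      # planted model plus noise

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
for m in (2 * N, 8 * N, 32 * N):
    Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, M))  # scaled sub-Gaussian sketch
    P = Phi @ A
    x_pcls = np.linalg.solve(P.T @ P, A.T @ b)             # estimator (4)
    eps = np.linalg.norm(A @ (x_pcls - x_ls)) / np.linalg.norm(A @ x_ls)
    print(m, eps)                                          # decays roughly like sqrt(N/m)
```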
## Empirical Results

Our focus in this section is on the improvement in solution error compared with the uncompressed least-squares solution, as well as on the improvement over regular full compression. We also investigate the computational speed of the algorithms and show that partial compression is just as fast as full compression (and hence sometimes faster than standard least-squares), and that robust partial compression is only roughly twice as slow (and asymptotically has the same cost). For completeness, we compare with (ridge-)regularized and robust versions of the standard compressed LS problem (2). The robust version is solved with the algorithm of (El Ghaoui and Lebret 1997), since it can be treated as a robust ordinary least-squares problem.

We first use a data set from the 1992 National Health Interview Survey, containing 44,085 rows and only 9 columns; since it is highly overcomplete and contains potentially sensitive data, it is a good candidate for sketching. We perform 100 realizations of 5,000 randomly chosen training points and 10,000 testing points, and for each realization we draw 50 random Walsh-Hadamard sketches with $m = 10N$. The residual on the testing data (median over all 5,000 realizations) is shown in Fig. 1. For regularized variants, we set $\mu$ to be 5 times the minimum eigenvalue of $A^T\Phi^T\Phi A$, and for robust variants we set $\rho = 1$.

Figure 1 presents the results similarly to a CDF or a performance profile as used in benchmarking software; a smaller area above the curve indicates better performance. A point such as (0.5, 1.02) means that on half of the simulations, the method achieved a residual within a factor of 1.02 of the least-squares residual. There are two clear observations from Fig. 1: partial compression gives lower residuals than full compression, and the regularized and robust variants may do slightly worse in the lower-left (i.e., more bias) but better in the worst-case upper-right (i.e., less variance). Put another way, the robust and regularized versions have stronger tail bounds than the standard versions. We also see a slight benefit of robustness over regularization, though the effect depends on how $\mu$ and $\rho$ are chosen.

Figure 3 shows the breakdown of timing for the individual parts of each of the algorithms we consider. The compression method is the counting sketch in all compressed methods, with the exception of Blendenpik (mex), which uses a Walsh-Hadamard random matrix via the Spiral WHT package (Püschel et al. 2005); both Blendenpik (Avron, Maymounkov, and Toledo 2010) versions are set to use low accuracy for the LSQR step. The matrix $A$ is a $5\times 10^4 \times 500$ random matrix with condition number $10^6$.

[Figure 3: Breakdown of the timing of the parts of the algorithms (multiplication, partial eigendecomposition, solve of the small system) for A\b (6.1 s), normal equations, Blendenpik (mex), Blendenpik, full compression, partial compression, partial (regularized), and partial (robust). The time for A\b continues off the chart.]
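For reference, a counting sketch can be formed and applied in time proportional to the number of non-zeros of $A$, matching the $O(\mathrm{nnz})$ entry of Table 2. The construction below is a standard CountSketch written by us for illustration; the exact implementation used in our timings may differ in details.

```python
import numpy as np
import scipy.sparse as sp

def counting_sketch(M, m, seed=0):
    """Sketch of a counting (CountSketch) matrix: each of the M columns of Phi has a
    single random +/-1 entry, so Phi @ A costs O(nnz(A)) and E[Phi^T Phi] = I."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=M)          # hash each of the M input rows to a bucket
    signs = rng.choice([-1.0, 1.0], size=M)    # independent random sign per row
    return sp.csr_matrix((signs, (rows, np.arange(M))), shape=(m, M))
```

With `Phi = counting_sketch(A.shape[0], m)`, the product `Phi @ A` touches each non-zero of $A$ exactly once.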
Fig. 2 investigates the effect of the number of rows $m$ of the compressed matrix. As expected, the residual quickly deteriorates when $m$ is small (note the logarithmic scale). Employing regularization decreases the residual significantly. Finally, robust partially-compressed least squares leads to a further improvement over regularization. This is expected, since regularization can be seen as a form of approximation to the robust solution.

[Figure 2: Residual (normalized by the optimal residual) as a function of the normalized sampling size m/N, for full compression, partial compression, partial (regularized), and partial (robust). For very high compression (m/M very small), even with m = N, robustness/regularization is beneficial.]

We also evaluated our new methods on other synthetic and realistic problems. The additional results likewise show that robust partial least squares can lead to significant improvements over standard projected least squares; see the extended version of the paper for these results (Becker et al. 2015).

## Conclusion

We developed two new models to address the noise and errors introduced into the solutions of compressed least-squares problems: the partially-compressed and the robust partially-compressed least-squares models. Our models reduce the error introduced by random projection, or sketching, while retaining the computational benefits of matrix compression. The robust model specifically captures the error structure introduced by partial compression, unlike ridge regression applied to the partially compressed model. Our experimental results indicate that the robust partial-compression model outperforms both the partially-compressed model (with or without regularization) and the fully compressed one. Partial compression alone can also significantly improve solution quality over full compression. We also derived the first approximation-error bounds for the partially-compressed least-squares model. While partially-compressed least-squares retains the same computational complexity as full compression, the robust approach introduces an additional difficulty in solving the convex optimization problem. With our algorithm based on a one-dimensional parameter search, even robust partially-compressed least-squares can be faster than ordinary least-squares.

## Appendix

### Minimization of h_τ in Algorithm 1

In this section, we describe how to efficiently minimize $h_\tau(x)$; recall that $h_\tau$ is defined in (12). Consider the problem of computing $\arg\min_x h_\tau(x)$ for a fixed $\tau$. In Case 2 of (13), there exists a solution $x\ne 0$, and therefore the function is differentiable there and the optimality conditions read
$$\tau\Big(\frac{P^TPx}{\alpha}+\rho\frac{x}{\beta}\Big) = A^Tb, \qquad \alpha = \|Px\|,\quad \beta = \|x\|. \tag{15}$$
The optimality conditions are scale-invariant for $x\ne 0$, and therefore we can construct a solution normalized so that $\beta = \|x\| = 1$. Let $VDV^T = P^TP$ be an eigenvalue decomposition of $P^TP$, i.e., $D_{ii} = d_i = \sigma_i^2$ are the squared singular values of $P$, and $V$ holds the right singular vectors of $P = U\Sigma V^T$. We make the change of variables $y = V^Tx$ (hence $\|y\| = \|x\|$) and define $\bar b = V^TA^Tb$, which gives an equation for $y$ that is separable once $\alpha$ is fixed. We thus need to solve
$$\tau(\gamma D+\rho I)y = \bar b \tag{16}$$
$$1 = \beta = \|y\| \tag{17}$$
$$1/\gamma = \alpha = \|\Sigma y\| \tag{18}$$
Since $d_i\ge 0$, the solution of (16) is unique for a given $\gamma$. Therefore, equations (16)-(18) are satisfied if and only if there exists a $\gamma$ such that the solution of (16) satisfies both (17) and (18). We use Newton's method to compute a $\gamma$ for which the solution of (16) also satisfies (17). Define
$$\varphi(\gamma) = \tau^{-2}\sum_{i=1}^N \frac{\bar b_i^2}{(\gamma\sigma_i^2+\rho)^2} - 1,$$
so that (16) and (17) are satisfied if $\varphi(\gamma) = 0$ for some $\gamma\ge 0$. We note that
$$\varphi'(\gamma) = -2\,\tau^{-2}\sum_{i=1}^N \frac{\sigma_i^2\,\bar b_i^2}{(\gamma\sigma_i^2+\rho)^3},$$
which is always negative when $\gamma\ge 0$; hence $\varphi$ is monotone and we are guaranteed a unique root (i.e., the situation is analogous to convex minimization) in the region $\gamma\ge 0$. We can apply any variant of safeguarded Newton-style methods to solve for this root.
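To make this concrete, here is a minimal safeguarded Newton iteration (with a bisection fallback) for the scalar root of $\varphi$, assuming the $\tau^{-2}$ scaling reconstructed above; the function name, tolerances, and the early return when no positive root exists are our own choices.

```python
import numpy as np

def phi_root(bbar, s, rho, tau, tol=1e-10, max_iter=100):
    """Safeguarded Newton iteration for the monotone scalar equation phi(gamma) = 0,
    with phi(gamma) = tau**(-2) * sum_i bbar_i**2 / (gamma*s_i**2 + rho)**2 - 1."""
    phi  = lambda g: np.sum(bbar**2 / (g * s**2 + rho)**2) / tau**2 - 1.0
    dphi = lambda g: -2.0 * np.sum(s**2 * bbar**2 / (g * s**2 + rho)**3) / tau**2

    if phi(0.0) <= 0:            # no positive root in this regime (cf. the case analysis below)
        return 0.0
    lo, hi = 0.0, 1.0
    while phi(hi) > 0:           # phi is decreasing: expand until the sign changes
        lo, hi = hi, 2.0 * hi
    g = 0.5 * (lo + hi)
    for _ in range(max_iter):
        f = phi(g)
        if abs(f) < tol:
            break
        if f > 0:                # root lies to the right of g
            lo = g
        else:                    # root lies to the left of g
            hi = g
        step = g - f / dphi(g)   # Newton step (dphi < 0 everywhere)
        g = step if lo < step < hi else 0.5 * (lo + hi)  # bisect if the step leaves the bracket
    return g
```

Inside Algorithm 1, `phi_root` would be called with the current value of $\tau$, the singular values $\sigma_i$, and $\bar b = V^TA^Tb$.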
Let $x = Vy$, where $y$ is the solution produced by the Newton iteration, and let $\gamma = \alpha^{-1}$ denote the corresponding root. We now check whether (18) is satisfied for this particular value of $\gamma$ to determine which case of (13) we are in. If (18) is satisfied and $h_\tau(x) = 0$, then we are in Case 2 and $\tau = \tau^\star$; that is, the complementary slackness conditions hold and the minimum of $h_\tau(x)$ is 0. If, on the other hand, $h_\tau(x) < 0$, then we are in Case 1, and scaling $x$ yields an arbitrarily small (negative) value. Finally, if $y$ does not satisfy (18), then the optimal solution is $y = 0$ and we are in Case 3; note that $h_\tau(x)$ is not differentiable at $x = 0$.

Finally, if we are in Case 2, then $x$ is a scaled optimal solution. To recover the optimal solution, we use $t = \tau$ to scale $x$ appropriately. Specifically, since we took $\beta = 1$ and worked with $\gamma = \alpha^{-1}$, this is equivalent to working with $\gamma = \beta/\alpha$, so we can recover the properly scaled $\beta = \alpha\gamma$ and hence $\alpha = (1+\rho\gamma)^{-1}\tau$.

### Proof of Theorem 2

*Proof.* The proof uses the stochastic arguments of (Pilanci and Wainwright 2014) directly and modifies their deterministic argument (their Lemma 1). For brevity, write $\hat x = x_{\mathrm{PCLS}}$ and $x^\star = x_{\mathrm{LS}}$. From the optimality of $\hat x$ for the partially compressed least-squares problem (3), we have
$$\|\Phi A\hat x\|^2 \le \|\Phi A x\|^2 + 2\big\langle A(\hat x - x),\, b\big\rangle \tag{19}$$
for all $x$, and in particular for $x = x^\star$. From the optimality of $x^\star$ for (1), the gradient at $x^\star$ is zero, so $\langle Ax,\, Ax^\star - b\rangle = 0$ for any $x$; using $x = \hat x - x^\star$ and rearranging gives
$$\big\langle A(\hat x - x^\star),\, b\big\rangle = \big\langle A(\hat x - x^\star),\, Ax^\star\big\rangle. \tag{20}$$
Thus
$$\begin{aligned}
\tfrac12\|\Phi A(\hat x - x^\star)\|^2 &= \tfrac12\|\Phi A\hat x\|^2 + \tfrac12\|\Phi A x^\star\|^2 - \big\langle\Phi A\hat x,\, \Phi A x^\star\big\rangle\\
&\le \|\Phi A x^\star\|^2 + \big\langle A(\hat x - x^\star),\, b\big\rangle - \big\langle\Phi A\hat x,\, \Phi A x^\star\big\rangle\\
&= \big\langle A(\hat x - x^\star),\, b\big\rangle - \big\langle\Phi A(\hat x - x^\star),\, \Phi A x^\star\big\rangle\\
&= \big\langle A(\hat x - x^\star),\, (I-\Phi^T\Phi)A x^\star\big\rangle,
\end{aligned}$$
where the inequality follows from (19) with $x = x^\star$ and the final equality follows from (20). Normalizing both sides appropriately, we obtain
$$\tfrac12\underbrace{\frac{\|\Phi A(\hat x - x^\star)\|^2}{\|A(\hat x - x^\star)\|^2}}_{U_1} \le \frac{\|Ax^\star\|}{\|A(\hat x - x^\star)\|}\underbrace{\Big\langle \frac{A(\hat x - x^\star)}{\|A(\hat x - x^\star)\|},\ (I-\Phi^T\Phi)\frac{Ax^\star}{\|Ax^\star\|}\Big\rangle}_{U_2},$$
so that $\|A(\hat x - x^\star)\| \le (2U_2/U_1)\,\|Ax^\star\|$. To complete the proof, we need to show that $2U_2/U_1$ is bounded above by $\epsilon\in(0,1)$ for both the sub-Gaussian and the ROS sketch. Define
$$Z_1(A) = \inf_{v\in\mathrm{range}(A),\,\|v\|=1}\|\Phi v\|^2 \qquad\text{and}\qquad Z_2(A) = \sup_{v\in\mathrm{range}(A),\,\|v\|=1}\big\langle u,\ (\Phi^T\Phi - I)v\big\rangle,$$
where $u$ is any fixed vector of unit norm. Then $U_2/U_1 \le Z_2/Z_1$, so it suffices to bound $Z_2/Z_1$ (the extra factor of 2 is absorbed into the universal constants). Taking the scaling of $\Phi$ into account, $Z_2/Z_1 < \epsilon$ if: (a) $\Phi$ is a scaled sub-Gaussian sketch and condition (i) of the theorem holds, by Lemmas 2 and 3 of (Pilanci and Wainwright 2014); or (b) $\Phi$ is a scaled ROS sketch and condition (ii) of the theorem holds, by Lemmas 4 and 5 of (Pilanci and Wainwright 2014). $\square$
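As a small numerical illustration of the quantities $Z_1$ and $Z_2$ (our own code, with arbitrary dimensions and a scaled Gaussian sketch): $Z_1$ is the squared smallest singular value of $\Phi Q$ for an orthonormal basis $Q$ of $\mathrm{range}(A)$, and for a fixed unit vector $u$ the supremum defining $Z_2$ is attained in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, m = 5000, 30, 600
A = rng.normal(size=(M, N))
Q, _ = np.linalg.qr(A)                                   # orthonormal basis of range(A)
u = Q @ rng.normal(size=N); u /= np.linalg.norm(u)       # a fixed unit vector (here in range(A))

Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, M))    # scaled Gaussian sketch
Z1 = np.linalg.svd(Phi @ Q, compute_uv=False).min() ** 2         # inf ||Phi v||^2 over range(A)
Z2 = np.linalg.norm(Q.T @ (Phi.T @ (Phi @ u) - u))               # sup_v <u, (Phi^T Phi - I) v>
print(Z1, Z2, Z2 / Z1)                                   # Z2/Z1 becomes small once m >> rank(A)
```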
## References

Avron, H.; Maymounkov, P.; and Toledo, S. 2010. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM Journal on Scientific Computing 32(3):1217–1236.

Becker, S.; Kawas, B.; Petrik, M.; and Ramamurthy, K. N. 2015. Robust partially-compressed least-squares. arXiv:1510.04905.

Ben-Tal, A.; El Ghaoui, L.; and Nemirovski, A. 2009. Robust Optimization. Princeton University Press.

Boutsidis, C.; Drineas, P.; and Magdon-Ismail, M. 2013. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory 59(10):6880–6892.

Boutsidis, C.; Zouzias, A.; and Drineas, P. 2010. Random projections for k-means clustering. In Advances in Neural Information Processing Systems (NIPS), 298–306.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge: Cambridge University Press.

Dasgupta, S. 2000. Experiments with random projection. In Conference on Uncertainty in Artificial Intelligence (UAI), 143–151.

Drineas, P.; Mahoney, M. W.; Muthukrishnan, S.; and Sarlós, T. 2011. Faster least squares approximation. Numerische Mathematik 117(2):219–249.

El Ghaoui, L., and Lebret, H. 1997. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications 18(4):1035–1064.

Fern, X. Z., and Brodley, C. E. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In International Conference on Machine Learning (ICML), 186–193.

Mahoney, M. W. 2011. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3.

Pilanci, M., and Wainwright, M. J. 2014. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory 61:5096–5115.

Püschel, M.; Moura, J. M. F.; Johnson, J.; Padua, D.; Veloso, M.; Singer, B.; Xiong, J.; Franchetti, F.; Gacic, A.; Voronenko, Y.; Chen, K.; Johnson, R. W.; and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation 93(2):232–275.

Urruty, T.; Djeraba, C.; and Simovici, D. A. 2007. Clustering by random projections. In Advances in Data Mining: Theoretical Aspects and Applications. Springer. 107–119.

Woodruff, D. P. 2014. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science 10(1-2):1–157.

Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2013. Recovering the optimal solution by dual random projection. In Conference on Learning Theory (COLT), 135–157.