# Robust Regression via Hard Thresholding

Kush Bhatia, Prateek Jain, and Purushottam Kar
Microsoft Research, India; Indian Institute of Technology Kanpur, India
{t-kushb,prajain}@microsoft.com, purushot@cse.iitk.ac.in

**Abstract.** We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. More specifically, for a data matrix $X \in \mathbb{R}^{p \times n}$ and an underlying model $w^*$, the response vector is generated as $y = X^\top w^* + b$, where $b \in \mathbb{R}^n$ is the corruption vector supported over at most $C \cdot n$ coordinates. Existing exact recovery results for RLSR focus solely on $L_1$-penalty based convex formulations and impose relatively strict model assumptions, such as requiring the corruptions $b$ to be selected independently of $X$. In this work, we study a simple hard-thresholding algorithm called TORRENT which, under mild conditions on $X$, can recover $w^*$ exactly even if $b$ corrupts the response variables in an adversarial manner, i.e. both the support and the entries of $b$ are selected adversarially after observing $X$ and $w^*$. Our results hold under deterministic assumptions which are satisfied if $X$ is sampled from any sub-Gaussian distribution. Moreover, unlike existing results that apply only to a fixed $w^*$ generated independently of $X$, our results are universal and hold for any $w^* \in \mathbb{R}^p$. Next, we propose gradient descent-based extensions of TORRENT that can scale efficiently to large-scale problems, such as high-dimensional sparse recovery, and prove similar recovery guarantees for these extensions. Empirically we find TORRENT, and more so its extensions, offering significantly faster recovery than state-of-the-art $L_1$ solvers. For instance, even on moderate-sized datasets (with $p$ = 50K) with around 40% corrupted responses, a variant of our proposed method called TORRENT-HYB is more than 20× faster than the best $L_1$ solver.

> If among these errors are some which appear too large to be admissible, then those equations which produced these errors will be rejected, as coming from too faulty experiments, and the unknowns will be determined by means of the other equations, which will then give much smaller errors.
>
> A. M. Legendre, *On the Method of Least Squares*, 1805.

## 1 Introduction

Robust Least Squares Regression (RLSR) addresses the problem of learning a reliable set of regression coefficients in the presence of several arbitrary corruptions in the response vector. Owing to the wide applicability of regression, RLSR features as a critical component of several important real-world applications in a variety of domains such as signal processing [1], economics [2], computer vision [3, 4], and astronomy [2]. Given a data matrix $X = [x_1, \ldots, x_n]$ with $n$ data points in $\mathbb{R}^p$ and the corresponding response vector $y \in \mathbb{R}^n$, the goal of RLSR is to learn a $\hat{w}$ such that

$$(\hat{w}, \hat{S}) = \arg\min_{w \in \mathbb{R}^p,\ S \subset [n]:\, |S| \geq (1-\beta)\cdot n}\ \sum_{i \in S} (y_i - x_i^\top w)^2. \tag{1}$$

(This work was done while P.K. was a postdoctoral researcher at Microsoft Research India.)

That is, we wish to simultaneously determine the set of corruption-free points $\hat{S}$ and also estimate the best model parameters over that set of clean points. However, the optimization problem given above is non-convex (jointly in $w$ and $S$) in general and might not directly admit efficient solutions. Indeed, there exist reformulations of this problem that are known to be NP-hard to optimize [1]. To address this problem, most existing methods with provable guarantees assume that the observations are obtained from some generative model.
A commonly adopted model is the following:

$$y = X^\top w^* + b, \tag{2}$$

where $w^* \in \mathbb{R}^p$ is the true model vector that we wish to estimate and $b \in \mathbb{R}^n$ is the corruption vector that can have arbitrary values. A common assumption is that the corruption vector is sparsely supported, i.e. $\|b\|_0 \leq \alpha \cdot n$ for some $\alpha > 0$ (a small simulation of this model with an adaptive adversary is sketched below, after our summary of contributions).

Recently, [4] and [5] obtained a surprising result which shows that one can recover $w^*$ exactly even when $\alpha \to 1$, i.e., when almost all the points are corrupted, by solving an $L_1$-penalty based convex optimization problem: $\min_{w,b} \|w\|_1 + \lambda \|b\|_1$, s.t. $X^\top w + b = y$. However, these results require the corruption vector $b$ to be selected oblivious of $X$ and $w^*$. Moreover, the results impose severe restrictions on the data distribution, requiring that the data be either sampled from an isotropic Gaussian ensemble [4], or row-sampled from an incoherent orthogonal matrix [5]. Finally, these results hold only for a fixed $w^*$ and are not universal in general.

In contrast, [6] studied RLSR with less stringent assumptions, allowing arbitrary corruptions in response variables as well as in the data matrix $X$, and proposed a trimmed inner product based algorithm for the problem. However, their recovery guarantees are significantly weaker. Firstly, they are able to recover $w^*$ only up to an additive error $\alpha\sqrt{p}$ (or $\alpha\sqrt{s}$ if $w^*$ is $s$-sparse). Hence, they require $\alpha \leq 1/\sqrt{p}$ just to claim a non-trivial bound, which amounts to tolerating only a vanishing fraction of corruptions. More importantly, even with $n \to \infty$ and extremely small $\alpha$, they are unable to guarantee exact recovery of $w^*$. A similar result was obtained by [7], albeit using a sub-sampling based algorithm with stronger assumptions on $b$.

In this paper, we focus on a simple and natural thresholding based algorithm for RLSR. At a high level, at each step $t$, our algorithm alternately estimates an active set $S_t$ of clean points and then updates the model to obtain $w^{t+1}$ by minimizing the least squares error on the active set. This intuitive algorithm seems to embody a long-standing heuristic first proposed by Legendre [8] over two centuries ago (see the introductory quotation in this paper) that has been adopted in later literature [9, 10] as well. However, to the best of our knowledge, this technique has never been rigorously analyzed before in non-asymptotic settings, despite its appealing simplicity.

**Our Contributions:** The main contribution of this paper is an exact recovery guarantee for the thresholding algorithm mentioned above, which we refer to as TORRENT-FC (see Algorithm 1). We provide our guarantees in the model given in (2) where the corruptions $b$ are selected adversarially but restricted to have at most $\alpha \cdot n$ non-zero entries, where $\alpha$ is a global constant dependent only on $X$.¹ Under deterministic conditions on $X$, namely the subset strong convexity (SSC) and smoothness (SSS) properties (see Definition 1), we guarantee that TORRENT-FC converges at a geometric rate and recovers $w^*$ exactly. We further show that these properties (SSC and SSS) are satisfied w.h.p. if a) the data $X$ is sampled from a sub-Gaussian distribution and, b) $n \geq p \log p$. We would like to stress three key advantages of our result over the results of [4, 5]: a) we allow $b$ to be adversarial, i.e., both the support and the values of $b$ may be selected adversarially based on $X$ and $w^*$, b) we make assumptions on the data that are natural, as well as significantly less restrictive than what existing methods make, and c) our analysis admits universal guarantees, i.e., holds for any $w^*$.
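As referenced above, the following is a minimal NumPy sketch (not from the paper) of the observation model (2) with an adaptive adversary. All names and the particular adversary are illustrative assumptions: here the adversary inspects $X$ and $w^*$, picks the largest responses, and overwrites them with values consistent with a different model, in the spirit of the construction discussed in footnote 1 below.

```python
import numpy as np

# Sketch of model (2): y = X^T w* + b, with b chosen AFTER seeing X and w*.
rng = np.random.default_rng(0)
n, p, alpha = 1000, 20, 0.2

X = rng.standard_normal((p, n))           # columns are the data points x_i
w_star = rng.standard_normal(p)
y_clean = X.T @ w_star                    # uncorrupted responses

# An illustrative adaptive adversary: corrupt the alpha*n points with largest
# |y_i|, replacing their responses with ones consistent with another model.
w_tilde = -w_star
k = int(alpha * n)
support = np.argsort(-np.abs(y_clean))[:k]
b = np.zeros(n)
b[support] = X[:, support].T @ (w_tilde - w_star)

y = y_clean + b                           # observed (corrupted) responses
print("corrupted fraction:", np.count_nonzero(b) / n)
```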
We would also like to stress that while hard-thresholding based methods have been studied rigorously for the sparse-recovery problem [11, 12], hard thresholding has not been studied formally for the robust regression problem. [13] study soft-thresholding approaches to the robust regression problem but without any formal guarantees. Moreover, the two problems are quite different, and hence techniques from the sparse-recovery analyses do not extend to robust regression.

¹ Note that for an adaptive adversary, as is the case in our work, recovery cannot be guaranteed for $\alpha \geq 1/2$, since the adversary can introduce corruptions as $b_i = x_i^\top(\tilde{w} - w^*)$ for an adversarially chosen model $\tilde{w}$. This would make it impossible for any algorithm to distinguish between $w^*$ and $\tilde{w}$, thus making recovery impossible.

Despite its simplicity, TORRENT-FC does not scale very well to datasets with large $p$ as it solves least squares problems at each iteration. We address this issue by designing a gradient descent-based algorithm (TORRENT-GD), and a hybrid algorithm (TORRENT-HYB), both of which enjoy a geometric rate of convergence and can recover $w^*$ under the model assumptions mentioned above. We also propose extensions of TORRENT for the RLSR problem in the sparse regression setting where $p \gg n$ but $\|w^*\|_0 = s^* \ll p$. Our algorithm TORRENT-HD is based on TORRENT-FC but uses the Iterative Hard Thresholding (IHT) algorithm, a popular algorithm for sparse regression. As before, we show that TORRENT-HD also converges geometrically to $w^*$ if a) the corruption index $\alpha$ is less than some constant $C$, b) $X$ is sampled from a sub-Gaussian distribution and, c) $n \geq s^* \log p$.

Finally, we experimentally evaluate existing $L_1$-based algorithms and our hard thresholding-based algorithms. The results demonstrate that our proposed algorithms (TORRENT-FC/GD/HYB) can be significantly faster than the best $L_1$ solvers, exhibit better recovery properties, and are more robust to dense white noise. For instance, on a problem with 50K dimensions and 40% corruption, TORRENT-HYB was found to be 20× faster than $L_1$ solvers, as well as achieve lower error rates.

## 2 Problem Formulation

Given a set of data points $X = [x_1, x_2, \ldots, x_n]$, where $x_i \in \mathbb{R}^p$, and the corresponding response vector $y \in \mathbb{R}^n$, the goal is to recover a parameter vector $w^*$ which solves the RLSR problem (1). We assume that the response vector $y$ is generated using the following model:

$$y = y^* + b + \varepsilon, \quad \text{where } y^* = X^\top w^*.$$

Hence, in the above model, (1) reduces to estimating $w^*$. We allow the model $w^*$, representing the regressor, to be chosen in an adaptive manner after the data features have been generated. The above model allows two kinds of perturbations to $y_i$: dense but bounded noise $\varepsilon_i$ (e.g. white noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, $\sigma \geq 0$), as well as potentially unbounded corruptions $b_i$ introduced by an adversary. The only requirement we enforce is that the gross corruptions be sparse. $\varepsilon$ shall represent the dense noise vector, for example $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$, and $b$ the corruption vector, such that $\|b\|_0 \leq \alpha \cdot n$ for some corruption index $\alpha > 0$. We shall use the notation $S_* = \overline{\mathrm{supp}(b)} \subseteq [n]$ to denote the set of clean points, i.e. points that have not faced unbounded corruptions. We consider adaptive adversaries that are able to view the generated data points $x_i$, as well as the clean responses $y_i^*$ and dense noise values $\varepsilon_i$, before deciding which locations to corrupt and by what amount.

We denote the unit sphere in $p$ dimensions by $S^{p-1}$. For any $\gamma \in (0, 1]$, we let $S_\gamma = \{S \subseteq [n] : |S| = \gamma \cdot n\}$ denote the set of all subsets of size $\gamma \cdot n$.
For any set $S$, we let $X_S := [x_i]_{i \in S} \in \mathbb{R}^{p \times |S|}$ denote the matrix whose columns are composed of the points in that set. Also, for any vector $v \in \mathbb{R}^n$ we use the notation $v_S$ to denote the $|S|$-dimensional vector consisting of those components that are in $S$. We use $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ to denote, respectively, the smallest and largest eigenvalues of a square symmetric matrix $X$. We now introduce two properties, namely Subset Strong Convexity and Subset Strong Smoothness, which are key to our analyses.

**Definition 1** (SSC and SSS Properties). A matrix $X \in \mathbb{R}^{p \times n}$ satisfies the Subset Strong Convexity Property (resp. Subset Strong Smoothness Property) at level $\gamma$ with strong convexity constant $\lambda_\gamma$ (resp. strong smoothness constant $\Lambda_\gamma$) if the following holds:

$$\lambda_\gamma \leq \min_{S \in S_\gamma} \lambda_{\min}(X_S X_S^\top) \leq \max_{S \in S_\gamma} \lambda_{\max}(X_S X_S^\top) \leq \Lambda_\gamma.$$

**Remark 1.** We note that the uniformity enforced in the definitions of the SSC and SSS properties is not for the sake of convenience but rather a necessity. Indeed, a uniform bound is required in the face of an adversary which can perform corruptions after data and response variables have been generated, and choose to corrupt precisely that set of points where the SSC and SSS parameters are the worst.

## 3 TORRENT: Thresholding Operator-based Robust Regression Method

We now present TORRENT, a Thresholding Operator-based Robust RegrEssioN meThod for performing robust regression at scale. Key to our algorithms is the hard thresholding operator, which we define below.

**Algorithm 1** TORRENT: Thresholding Operator-based Robust Regression Method
Input: Training data $\{x_i, y_i\}$, $i = 1 \ldots n$, step length $\eta$, thresholding parameter $\beta$, tolerance $\epsilon$
1: $w^0 \leftarrow 0$, $S_0 = [n]$, $t \leftarrow 0$, $r^0 \leftarrow y$
2: while $\|r^t_{S_t}\|_2 > \epsilon$ do
3:   $w^{t+1} \leftarrow$ UPDATE$(w^t, S_t, \eta, r^t, S_{t-1})$
4:   $r^{t+1}_i \leftarrow y_i - \langle w^{t+1}, x_i \rangle$
5:   $S_{t+1} \leftarrow$ HT$(r^{t+1}, (1-\beta)n)$
6:   $t \leftarrow t + 1$
7: end while
8: return $w^t$

**Algorithm 2** UPDATE (TORRENT-FC)
Input: Current model $w$, current active set $S$
1: return $\arg\min_{w} \sum_{i \in S} (y_i - \langle w, x_i \rangle)^2$

**Algorithm 3** UPDATE (TORRENT-GD)
Input: Current model $w$, current active set $S$, step size $\eta$
1: $g \leftarrow X_S (X_S^\top w - y_S)$
2: return $w - \eta \cdot g$

**Algorithm 4** UPDATE (TORRENT-HYB)
Input: Current model $w$, current active set $S$, step size $\eta$, current residuals $r$, previous active set $S'$, stability tolerance $\Delta$
1: // Use the GD update if the active set $S$ is changing a lot
2: if $|S \setminus S'| > \Delta$ then
3:   $w' \leftarrow$ UPDATE-GD$(w, S, \eta, r, S')$
4: else
5:   // If the active set is stable, use the FC update
6:   $w' \leftarrow$ UPDATE-FC$(w, S)$
7: end if
8: return $w'$

**Definition 2** (Hard Thresholding Operator). For any vector $v \in \mathbb{R}^n$, let $\sigma_v \in S_n$ be the permutation that orders the elements of $v$ in ascending order of their magnitudes, i.e. $|v_{\sigma_v(1)}| \leq |v_{\sigma_v(2)}| \leq \ldots \leq |v_{\sigma_v(n)}|$. Then, for any $k \leq n$, we define the hard thresholding operator as

$$\mathrm{HT}(v; k) = \left\{ i \in [n] : \sigma_v^{-1}(i) \leq k \right\}.$$

Using this operator, we present our algorithm TORRENT (Algorithm 1) for robust regression. TORRENT follows a most natural iterative strategy of alternately estimating an active set of points which have the least residual error on the current regressor, and then updating the regressor to provide a better fit on this active set. We offer three variants of our algorithm, based on how aggressively the algorithm tries to fit the regressor to the current active set.

We first propose a fully corrective algorithm TORRENT-FC (Algorithm 2) that performs a fully corrective least squares regression step in an effort to minimize the regression error on the active set. This algorithm makes significant progress in each step, but at the cost of more expensive updates. A minimal code sketch of the hard thresholding operator and of this fully corrective instantiation follows.
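The sketch below is a minimal NumPy illustration of Definition 2 and of Algorithm 1 with the fully corrective update (Algorithm 2). It is not from the paper: the function names, the placement of the stopping rule, and the use of `np.linalg.lstsq` for the least squares step are our own illustrative choices.

```python
import numpy as np

def hard_threshold(v, k):
    """HT(v; k) from Definition 2: indices of the k smallest-magnitude entries of v."""
    return np.argsort(np.abs(v))[:k]

def torrent_fc(X, y, beta, eps=1e-6, max_iter=100):
    """Sketch of Algorithm 1 with the fully corrective update (Algorithm 2).
    X is p x n with data points as columns; beta upper-bounds the corruption fraction."""
    p, n = X.shape
    S = np.arange(n)                     # S_0 = [n]
    k = int((1 - beta) * n)              # active-set size used by HT
    w = np.zeros(p)
    for _ in range(max_iter):
        # Fully corrective step: exact least squares restricted to the active set.
        w, *_ = np.linalg.lstsq(X[:, S].T, y[S], rcond=None)
        r = y - X.T @ w                  # residuals r_i = y_i - <w, x_i>
        S = hard_threshold(r, k)         # keep the points with smallest residuals
        if np.linalg.norm(r[S]) <= eps:  # stopping rule in the spirit of line 2
            break
    return w, S
```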
To address this cost, we then propose a milder, gradient descent-based variant TORRENT-GD (Algorithm 3) that performs a much cheaper update of taking a single step in the direction of the gradient of the objective function on the active set. This reduces the regression error on the active set but does not minimize it. This turns out to be beneficial in situations where dense noise is present along with sparse corruptions, since it prevents the algorithm from overfitting to the current active set.

Both the algorithms proposed above have their pros and cons: the FC algorithm provides significant improvements with each step but is expensive to execute, whereas the GD variant, although efficient in executing each step, offers slower progress. To get the best of both these algorithms, we propose a third, hybrid variant TORRENT-HYB (Algorithm 4) that adaptively selects either the FC or the GD update depending on whether the active set is stable across iterations or not (a minimal sketch of the GD and HYB updates appears after the proof sketch below).

In the next section we show that this hard thresholding-based strategy offers a linear convergence rate for the algorithm in all three of its variants. We shall also demonstrate the applicability of this technique to high-dimensional sparse recovery settings in a subsequent section.

## 4 Convergence Guarantees

For ease of exposition, we will first present our convergence analyses for cases where dense noise is not present, i.e. $y = X^\top w^* + b$, and will handle cases with dense noise and sparse corruptions later. We first analyze the fully corrective TORRENT-FC algorithm. The convergence proof in this case relies on the optimality of the two steps carried out by the algorithm: the fully corrective step that selects the best regressor on the active set, and the hard thresholding step that discovers a new active set by selecting points with the least residual error on the current regressor.

**Theorem 3.** Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{p \times n}$ be the given data matrix and $y = X^\top w^* + b$ be the corrupted output with $\|b\|_0 \leq \alpha \cdot n$. Let Algorithm 2 be executed on this data with the thresholding parameter set to $\beta \geq \alpha$. Let $\Sigma_0$ be an invertible matrix such that $\tilde{X} = \Sigma_0^{-1/2} X$ satisfies the SSC and SSS properties at level $\gamma$ with constants $\lambda_\gamma$ and $\Lambda_\gamma$ respectively (see Definition 1). If the data satisfies $\frac{(1+\sqrt{2})\Lambda_\beta}{\lambda_{1-\beta}} < 1$, then after $t = O\left(\log\left(\frac{1}{\sqrt{n}} \frac{\|b\|_2}{\epsilon}\right)\right)$ iterations, Algorithm 2 obtains an $\epsilon$-accurate solution $w^t$, i.e. $\|w^t - w^*\|_2 \leq \epsilon$.

*Proof (Sketch).* Let $r^t = y - X^\top w^t$ be the vector of residuals at time $t$ and $C_t = X_{S_t} X_{S_t}^\top$. Also let $S_* = \overline{\mathrm{supp}(b)}$ be the set of uncorrupted points. The fully corrective step ensures that

$$w^{t+1} = C_t^{-1} X_{S_t} y_{S_t} = C_t^{-1} X_{S_t} \left( X_{S_t}^\top w^* + b_{S_t} \right) = w^* + C_t^{-1} X_{S_t} b_{S_t},$$

whereas the hard thresholding step ensures that $\|r^{t+1}_{S_{t+1}}\|_2 \leq \|r^{t+1}_{S_*}\|_2$. Combining the two gives us

$$\|b_{S_{t+1}}\|_2^2 \leq \|X_{S_* \setminus S_{t+1}}^\top C_t^{-1} X_{S_t} b_{S_t}\|_2^2 + 2\, b_{S_{t+1}}^\top X_{S_{t+1}}^\top C_t^{-1} X_{S_t} b_{S_t}$$
$$\overset{\zeta_1}{=} \|\tilde{X}_{S_* \setminus S_{t+1}}^\top (\tilde{X}_{S_t} \tilde{X}_{S_t}^\top)^{-1} \tilde{X}_{S_t} b_{S_t}\|_2^2 + 2\, b_{S_{t+1}}^\top \tilde{X}_{S_{t+1}}^\top (\tilde{X}_{S_t} \tilde{X}_{S_t}^\top)^{-1} \tilde{X}_{S_t} b_{S_t}$$
$$\overset{\zeta_2}{\leq} \frac{\Lambda_\beta^2}{\lambda_{1-\beta}^2} \|b_{S_t}\|_2^2 + \frac{2\,\Lambda_\beta}{\lambda_{1-\beta}} \|b_{S_t}\|_2 \|b_{S_{t+1}}\|_2,$$

where $\zeta_1$ follows from setting $\tilde{X} = \Sigma_0^{-1/2} X$ and $X_S^\top C_t^{-1} X_{S'} = \tilde{X}_S^\top (\tilde{X}_{S_t} \tilde{X}_{S_t}^\top)^{-1} \tilde{X}_{S'}$, and $\zeta_2$ follows from the SSC and SSS properties, $\|b_{S_t}\|_0 \leq \|b\|_0 \leq \beta \cdot n$ and $|S_* \setminus S_{t+1}| \leq \beta \cdot n$. Solving the quadratic equation and performing other manipulations gives us the claimed result.
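As referenced above, here is a minimal NumPy sketch of the two cheaper updates, Algorithms 3 and 4. It is not from the paper: the function names and the `delta` stability tolerance are illustrative assumptions; Theorem 5 below sets the step length to $\eta = 1/\Lambda_{1-\beta}$.

```python
import numpy as np

def update_gd(w, X, y, S, eta):
    """TORRENT-GD update (Algorithm 3): a single gradient step on the
    least-squares objective restricted to the active set S."""
    g = X[:, S] @ (X[:, S].T @ w - y[S])
    return w - eta * g

def update_hyb(w, X, y, S, S_prev, eta, delta):
    """TORRENT-HYB update (Algorithm 4): use the cheap GD step while the
    active set is still changing a lot, and the fully corrective step once
    it has stabilised. `delta` is the stability tolerance."""
    if np.setdiff1d(S, S_prev).size > delta:
        return update_gd(w, X, y, S, eta)
    w_fc, *_ = np.linalg.lstsq(X[:, S].T, y[S], rcond=None)
    return w_fc
```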
Theorem 3 relies on a deterministic (fixed design) assumption, specifically $\frac{(1+\sqrt{2})\Lambda_\beta}{\lambda_{1-\beta}} < 1$, in order to guarantee convergence. We can show that a large class of random designs, including Gaussian and sub-Gaussian designs, actually satisfy this requirement. That is to say, data generated from these distributions satisfy the SSC and SSS conditions such that $\frac{(1+\sqrt{2})\Lambda_\beta}{\lambda_{1-\beta}} < 1$ with high probability (a rough numerical probe of this condition is sketched at the end of this section). Theorem 4 explicates this for the class of Gaussian designs.

**Theorem 4.** Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{p \times n}$ be the given data matrix with each $x_i \sim \mathcal{N}(0, \Sigma)$. Let $y = X^\top w^* + b$ with $\|b\|_0 \leq \alpha \cdot n$. Also, let $\alpha \leq \beta < \frac{1}{65}$ and $n \geq \Omega\left(p + \log\frac{1}{\delta}\right)$. Then, with probability at least $1 - \delta$, the data satisfies $\frac{(1+\sqrt{2})\Lambda_\beta}{\lambda_{1-\beta}} < \frac{9}{10}$. More specifically, after $T \geq 10 \log\left(\frac{1}{\sqrt{n}} \frac{\|b\|_2}{\epsilon}\right)$ iterations of Algorithm 1 with the thresholding parameter set to $\beta$, we have $\|w^T - w^*\|_2 \leq \epsilon$.

**Remark 2.** Note that Theorem 4 provides rates that are independent of the condition number $\frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}$ of the distribution. We also note that results similar to Theorem 4 can be proven for the larger class of sub-Gaussian distributions. We refer the reader to Section G for the same.

**Remark 3.** We remind the reader that our analyses can readily accommodate dense noise in addition to sparse unbounded corruptions. We direct the reader to Appendix A, which presents convergence proofs for our algorithms in these settings.

**Remark 4.** We would like to point out that the design requirements made by our analyses are very mild when compared to existing literature. Indeed, the work of [4] assumes the Bouquet Model where distributions are restricted to be isotropic Gaussians, whereas the work of [5] assumes a more stringent model of sub-orthonormal matrices, something that even Gaussian designs do not satisfy. Our analyses, on the other hand, hold for the general class of sub-Gaussian distributions.

We now analyze the TORRENT-GD algorithm, which performs cheaper, gradient-style updates on the active set. We will show that this method nevertheless enjoys a linear rate of convergence.

**Theorem 5.** Let the data settings be as stated in Theorem 3 and let Algorithm 3 be executed on this data with the thresholding parameter set to $\beta \geq \alpha$ and the step length set to $\eta = \frac{1}{\Lambda_{1-\beta}}$. If the data satisfies $\max\left(\eta\sqrt{\Lambda_\beta},\ 1 - \eta\lambda_{1-\beta}\right) \leq \frac{1}{4}$, then after $t = O\left(\log\left(\frac{\|b\|_2}{\sqrt{n}} \cdot \frac{1}{\epsilon}\right)\right)$ iterations, Algorithm 1 obtains an $\epsilon$-accurate solution $w^t$, i.e. $\|w^t - w^*\|_2 \leq \epsilon$.

Similar to TORRENT-FC, the assumptions made by the TORRENT-GD algorithm are also satisfied by the class of sub-Gaussian distributions. The proof of Theorem 5, given in Appendix D, details these arguments.

Given the convergence analyses for TORRENT-FC and GD, we now move on to provide a convergence analysis for the hybrid TORRENT-HYB algorithm which interleaves FC and GD steps. Since the exact interleaving adopted by the algorithm depends on the data and is not known in advance, this poses a problem. We address this problem by giving below a uniform convergence guarantee, one that applies to every interleaving of the FC and GD update steps.

**Theorem 6.** Suppose Algorithm 4 is executed on data that allows Algorithms 2 and 3 convergence rates of $\eta_{FC}$ and $\eta_{GD}$ respectively. Suppose we have $2 \cdot \eta_{FC} \cdot \eta_{GD} < 1$. Then, for any interleaving of the FC and GD steps that the policy may enforce, after $t = O\left(\log\left(\frac{1}{\sqrt{n}} \frac{\|b\|_2}{\epsilon}\right)\right)$ iterations, Algorithm 4 ensures an $\epsilon$-optimal solution, i.e. $\|w^t - w^*\|_2 \leq \epsilon$.

We point out to the reader that the assumption made by Theorem 6, i.e. $2 \cdot \eta_{FC} \cdot \eta_{GD} < 1$, is readily satisfied by random sub-Gaussian designs, albeit at the cost of reducing the noise tolerance limit. As we shall see, TORRENT-HYB offers attractive convergence properties, merging the fast convergence rates of the FC step with the speed and protection against overfitting provided by the GD step.
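As referenced above, the following is a rough numerical probe (not from the paper) of the quantity $\frac{(1+\sqrt{2})\Lambda_\beta}{\lambda_{1-\beta}}$ for a standard Gaussian design. Sampling random subsets can only under-estimate $\Lambda_\beta$ and over-estimate $\lambda_{1-\beta}$ (the true constants involve extrema over all subsets in $S_\beta$ and $S_{1-\beta}$), so this is an optimistic sanity check rather than a certificate; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, beta, trials = 2000, 50, 0.05, 200

X = rng.standard_normal((p, n))
k_small, k_big = int(beta * n), int((1 - beta) * n)

Lam_beta, lam_1mb = 0.0, np.inf
for _ in range(trials):
    S = rng.choice(n, size=k_small, replace=False)   # candidate set in S_beta
    Lam_beta = max(Lam_beta, np.linalg.eigvalsh(X[:, S] @ X[:, S].T)[-1])
    T = rng.choice(n, size=k_big, replace=False)     # candidate set in S_(1-beta)
    lam_1mb = min(lam_1mb, np.linalg.eigvalsh(X[:, T] @ X[:, T].T)[0])

print("optimistic estimate of (1+sqrt(2)) * Lambda_beta / lambda_(1-beta):",
      (1 + np.sqrt(2)) * Lam_beta / lam_1mb)
```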
## 5 High-dimensional Robust Regression

In this section, we extend our approach to the robust high-dimensional sparse recovery setting. As before, we assume that the response vector $y$ is obtained as $y = X^\top w^* + b$, where $\|b\|_0 \leq \alpha \cdot n$. However, this time we also assume that $w^*$ is $s^*$-sparse, i.e. $\|w^*\|_0 \leq s^*$. As before, we shall neglect white/dense noise for the sake of simplicity. We reiterate that it is not possible to use existing results from sparse recovery (such as [11, 12]) directly to solve this problem.

Our objective is to recover a sparse model $\hat{w}$ such that $\|\hat{w} - w^*\|_2 \leq \epsilon$. The challenge here is to forgo a sample complexity of $n \geq p$ and instead perform recovery with $n \geq s^* \log p$ samples alone. For this setting, we modify the FC update step of the TORRENT-FC method to the following:

$$w^{t+1} \leftarrow \inf_{\|w\|_0 \leq s} \sum_{i \in S_t} (y_i - \langle w, x_i \rangle)^2, \tag{3}$$

for some target sparsity level $s \ll p$. We refer to this modified algorithm as TORRENT-HD. Assuming $X$ satisfies the RSC/RSS properties (defined below), (3) can be solved efficiently using results from sparse recovery, for example the IHT algorithm [11, 14] analyzed in [12] (a minimal sketch of such an IHT-based update appears after Figure 1 below).

**Definition 7** (RSC and RSS Properties). A matrix $X \in \mathbb{R}^{p \times n}$ will be said to satisfy the Restricted Strong Convexity Property (resp. Restricted Strong Smoothness Property) at level $s = s_1 + s_2$ with strong convexity constant $\alpha_{s_1+s_2}$ (resp. strong smoothness constant $L_{s_1+s_2}$) if the following holds for all $\|w_1\|_0 \leq s_1$ and $\|w_2\|_0 \leq s_2$:

$$\alpha_s \|w_1 - w_2\|_2^2 \leq \|X^\top (w_1 - w_2)\|_2^2 \leq L_s \|w_1 - w_2\|_2^2.$$

For our results, we shall require the subset versions of both these properties.

**Definition 8** (SRSC and SRSS Properties). A matrix $X \in \mathbb{R}^{p \times n}$ will be said to satisfy the Subset Restricted Strong Convexity (resp. Subset Restricted Strong Smoothness) Property at level $(\gamma, s)$ with strong convexity constant $\alpha_{(\gamma,s)}$ (resp. strong smoothness constant $L_{(\gamma,s)}$) if, for all subsets $S \in S_\gamma$, the matrix $X_S$ satisfies the RSC (resp. RSS) property at level $s$ with constant $\alpha_s$ (resp. $L_s$).

We now state the convergence result for the TORRENT-HD algorithm (Theorem 9 below).

Figure 1: (a), (b) and (c) Phase-transition diagrams (total points vs. corrupted points, $p = 50$, $\sigma = 0$) depicting the recovery properties of the TORRENT-FC, TORRENT-HYB and L1-DALM algorithms. The colors red and blue represent a high and low probability of success respectively. A method is considered successful in an experiment if it recovers $w^*$ up to a $10^{-4}$ relative error. Both variants of TORRENT can be seen to recover $w^*$ in the presence of a larger number of corruptions than the L1 solver. (d) Variation in recovery error with the magnitude of corruption ($p = 500$, $n = 2000$, $\alpha = 0.25$, $\sigma = 0.2$). As the corruption is increased, TORRENT-FC and TORRENT-HYB show improved performance while the problem becomes more difficult for the L1 solver.
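In keeping with the forward reference above, here is a minimal NumPy sketch (not from the paper) of an IHT-style approximation to the update (3): gradient steps on the active-set least-squares objective followed by projection onto $s$-sparse vectors. The function name, step size, and iteration count are illustrative assumptions rather than the tuned procedure analyzed in [12].

```python
import numpy as np

def iht_active_set(X, y, S, s, eta, n_steps=200):
    """Approximate the TORRENT-HD update (3): IHT on the least-squares
    objective restricted to the active set S, keeping only the s
    largest-magnitude coordinates after each gradient step."""
    p = X.shape[0]
    w = np.zeros(p)
    for _ in range(n_steps):
        g = X[:, S] @ (X[:, S].T @ w - y[S])   # gradient on the active set
        w = w - eta * g
        keep = np.argsort(np.abs(w))[-s:]       # project onto s-sparse vectors
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    return w
```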
**Theorem 9.** Let $X \in \mathbb{R}^{p \times n}$ be the given data matrix and $y = X^\top w^* + b$ be the corrupted output with $\|w^*\|_0 \leq s^*$ and $\|b\|_0 \leq \alpha \cdot n$. Let $\Sigma_0$ be an invertible matrix such that $\Sigma_0^{-1/2} X$ satisfies the SRSC and SRSS properties at level $(\gamma, 2s+s^*)$ with constants $\alpha_{(\gamma, 2s+s^*)}$ and $L_{(\gamma, 2s+s^*)}$ respectively (see Definition 8). Let Algorithm 2 be executed on this data with the TORRENT-HD update, the thresholding parameter set to $\beta \geq \alpha$, and $s \geq 32 \left( \frac{L_{(1-\beta, 2s+s^*)}}{\alpha_{(1-\beta, 2s+s^*)}} \right)^2 s^*$. If $X$ also satisfies $\frac{4 L_{(\beta, s+s^*)}}{\alpha_{(1-\beta, s+s^*)}} < 1$, then after $t = O\left(\log\left(\frac{1}{\sqrt{n}} \frac{\|b\|_2}{\epsilon}\right)\right)$ iterations, Algorithm 2 obtains an $\epsilon$-accurate solution $w^t$, i.e. $\|w^t - w^*\|_2 \leq \epsilon$. In particular, if $X$ is sampled from a Gaussian distribution $\mathcal{N}(0, \Sigma)$ and $n \geq \Omega\left(s^* \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)} \log p\right)$, then for all values of $\alpha \leq \beta < \frac{1}{65}$, we can guarantee $\|w^t - w^*\|_2 \leq \epsilon$ after $t = O\left(\log\left(\frac{1}{\sqrt{n}} \frac{\|b\|_2}{\epsilon}\right)\right)$ iterations of the algorithm (w.p. $\geq 1 - 1/n^{10}$).

**Remark 5.** The sample complexity required by Theorem 9 is identical to the one required by analyses for high-dimensional sparse recovery [12], save constants. Also note that TORRENT-HD can tolerate the same corruption index as TORRENT-FC.

## 6 Experiments

Several numerical simulations were carried out on linear regression problems in low-dimensional as well as sparse high-dimensional settings. The experiments show that TORRENT not only offers statistically better recovery properties as compared to $L_1$-style approaches, but that it can also be more than an order of magnitude faster.

**Data:** For the low-dimensional setting, the regressor $w^* \in \mathbb{R}^p$ was chosen to be a random unit norm vector. Data was sampled as $x_i \sim \mathcal{N}(0, I_p)$ and response variables were generated as $y_i^* = \langle w^*, x_i \rangle$. The set of corrupted points $\bar{S}_*$ was selected as a uniformly random $(\alpha n)$-sized subset of $[n]$ and the corruptions were set to $b_i \sim U(-5\|y^*\|, 5\|y^*\|)$ for $i \in \bar{S}_*$. The corrupted responses were then generated as $y_i = y_i^* + b_i + \varepsilon_i$ where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. For the sparse high-dimensional setting, $\mathrm{supp}(w^*)$ was selected to be a random $s^*$-sized subset of $[p]$. Phase-transition diagrams (Figure 1) were generated by repeating each experiment 100 times. For all other plots, each experiment was run over 20 random instances of the data and the plots depict the mean results.

**Algorithms:** We compared various variants of our algorithm TORRENT to the regularized $L_1$ algorithm for robust regression [4, 5]. Note that the $L_1$ problem can be written as $\min_z \|z\|_1$ s.t. $Az = y$, where $A = [X^\top \ \ \frac{1}{\lambda} I_{n \times n}]$ and $z = [w^\top \ \ \lambda b^\top]^\top$ (a small consistency check of this construction is sketched at the end of this subsection). We used the Dual Augmented Lagrange Multiplier (DALM) $L_1$ solver implemented by [15] to solve the $L_1$ problem. We ran a fine-tuned grid search over the $\lambda$ parameter for the $L_1$ solver and quote the best results obtained from the search. In the low-dimensional setting, we compared the recovery properties of TORRENT-FC (Algorithm 2) and TORRENT-HYB (Algorithm 4) with the DALM-L1 solver, while for the high-dimensional case, we compared TORRENT-HD against the DALM-L1 solver. Both the $L_1$ solver and our methods were implemented in Matlab and were run on a single-core 2.4 GHz machine with 8 GB RAM.

**Choice of $L_1$-solver:** An extensive comparative study of various $L_1$ minimization algorithms was performed by [15], who showed that the DALM and Homotopy solvers outperform other counterparts both in terms of recovery properties and timings. We extended their study to our observation model and found the DALM solver to be significantly better than the other $L_1$ solvers; see Figure 3 in the appendix.
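As referenced above, the following sketch (not from the paper; data sizes and the random corruption pattern are illustrative) builds the stacked matrix $A = [X^\top \ \frac{1}{\lambda} I]$ and verifies that the true $z = [w^{*\top} \ \lambda b^\top]^\top$ satisfies the constraint $Az = y$; an actual $L_1$ solver such as DALM [15] would then be run on $(A, y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, lam = 200, 10, 0.2, 1.0

X = rng.standard_normal((p, n))
w_star = rng.standard_normal(p)
b = np.zeros(n)
idx = rng.choice(n, size=int(alpha * n), replace=False)
b[idx] = rng.uniform(-5, 5, size=idx.size)     # sparse corruptions
y = X.T @ w_star + b

# Stacked design so that A @ z = y with z = [w ; lam * b]:
A = np.hstack([X.T, (1.0 / lam) * np.eye(n)])
z_true = np.concatenate([w_star, lam * b])
print(np.allclose(A @ z_true, y))              # True: the constraint is consistent
```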
We also observed, similar to [15], that the Approximate Message Passing (AMP) solver diverges on our problem, as the input matrix to the $L_1$ solver, $A = [X^\top \ \frac{1}{\lambda} I]$, is non-Gaussian.

Figure 2: In the low-dimensional (a, b) as well as the sparse high-dimensional (c, d) settings, TORRENT offers better recovery as the fraction of corrupted points $\alpha$ is varied. In terms of runtime, TORRENT is an order of magnitude faster than $L_1$ solvers in both settings. In the low-dimensional setting, TORRENT-HYB is the fastest of all the variants. [Panels: (a) recovery error vs. fraction of corrupted points, $p = 500$, $n = 2000$, $\sigma = 0.2$ (TORRENT-FC, TORRENT-HYB, L1-DALM); (b) recovery error vs. time, $p = 300$, $n = 1800$, $\alpha = 0.41$, $\kappa = 5$ (TORRENT-FC, TORRENT-HYB, TORRENT-GD, L1-DALM); (c) recovery error vs. fraction of corrupted points, $p = 10000$, $n = 2303$, $s^* = 50$ (TORRENT-HD, L1-DALM); (d) recovery error vs. time, $p = 50000$, $n = 5410$, $\alpha = 0.4$, $s^* = 100$ (TORRENT-HD, L1-DALM).]

**Evaluation Metric:** We measure the performance of various algorithms using the standard $L_2$ error: $r_{\hat{w}} = \|\hat{w} - w^*\|_2$. For the phase-transition plots (Figure 1), we deemed an algorithm successful on an instance if it obtained a model $\hat{w}$ with error $r_{\hat{w}} < 10^{-4} \cdot \|w^*\|_2$. We also measured the CPU time required by each of the methods, so as to compare their scalability.

### 6.1 Low Dimensional Results

**Recovery Property:** The phase-transition plots presented in Figure 1 represent our recovery experiments in graphical form. Both the fully corrective and hybrid variants of TORRENT show better recovery properties than the $L_1$-minimization approach, as indicated by the number of runs (out of 100) in which each algorithm was able to correctly recover $w^*$. Figure 2 shows the variation in recovery error as a function of $\alpha$ in the presence of white noise and exhibits the superiority of TORRENT-FC and TORRENT-HYB over L1-DALM. Here again, TORRENT-FC and TORRENT-HYB achieve significantly lower recovery error than L1-DALM for all $\alpha \leq 0.5$. Figure 3 in the appendix shows that the variation of $\|\hat{w} - w^*\|_2$ with varying $p$, $\sigma$ and $n$ follows a similar trend, with TORRENT having significantly lower recovery error in comparison to the $L_1$ approach.

Figure 1(d) brings out an interesting trend in the recovery property of TORRENT. As we increase the magnitude of corruption from $U(-\|y^*\|, \|y^*\|)$ to $U(-20\|y^*\|, 20\|y^*\|)$, the recovery error for TORRENT-HYB and TORRENT-FC decreases, as expected, since it becomes easier to identify the grossly corrupted points. However, the $L_1$ solver was unable to exploit this and in fact exhibited an increase in recovery error.

**Run Time:** In order to ascertain the recovery guarantees of TORRENT on ill-conditioned problems, we performed an experiment where data was sampled as $x_i \sim \mathcal{N}(0, \Sigma)$ with the entries of $\mathrm{diag}(\Sigma)$ sampled from $U(0, 5)$. Figure 2 plots the recovery error as a function of time. TORRENT-HYB was able to correctly recover $w^*$ about 50× faster than L1-DALM, which spent a considerable amount of time pre-processing the data matrix $X$. Even after allowing the $L_1$ algorithm to run for 500 iterations, it was unable to reach the desired residual error of $10^{-4}$. Figure 2 also shows that our TORRENT-HYB algorithm is able to converge to the optimal solution much faster than TORRENT-FC or TORRENT-GD. This is because TORRENT-FC solves a least squares problem at each step and thus, even though it requires significantly fewer iterations to converge, each iteration in itself is very expensive.
While each iteration of TORRENT-GD is cheap, it is still limited by the slow $O\left(\left(1 - \frac{1}{\kappa}\right)^t\right)$ convergence rate of the gradient descent algorithm, where $\kappa$ is the condition number of the covariance matrix. TORRENT-HYB, on the other hand, is able to combine the strengths of both methods to achieve faster convergence.

### 6.2 High Dimensional Results

**Recovery Property:** Figure 2 shows the variation in recovery error in the high-dimensional setting as the number of corrupted points was varied. For these experiments, $n$ was set to $5 s^* \log(p)$ and the fraction of corrupted points $\alpha$ was varied from 0.1 to 0.7. While L1-DALM fails to recover $w^*$ for $\alpha > 0.5$, TORRENT-HD offers perfect recovery even for $\alpha$ values up to 0.7.

**Run Time:** Figure 2 shows the variation in recovery error as a function of run time in this setting. L1-DALM was found to be an order of magnitude slower than TORRENT-HD, making it infeasible for sparse high-dimensional settings. One key reason for this is that the L1-DALM solver is significantly slower in identifying the set of clean points. For instance, whereas TORRENT-HD was able to identify the clean set of points in only 5 iterations, it took L1 around 250 iterations to do the same.

## References

[1] Christoph Studer, Patrick Kuppinger, Graeme Pope, and Helmut Bölcskei. Recovery of Sparsely Corrupted Signals. IEEE Transactions on Information Theory, 58(5):3115–3130, 2012.

[2] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. John Wiley and Sons, 1987.

[3] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.

[4] John Wright and Yi Ma. Dense Error Correction via ℓ1-Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.

[5] Nam H. Nguyen and Trac D. Tran. Exact recoverability from dense corrupted observations via L1 minimization. IEEE Transactions on Information Theory, 59(4):2036–2058, 2013.

[6] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust Sparse Regression under Adversarial Corruption. In 30th International Conference on Machine Learning (ICML), 2013.

[7] Brian McWilliams, Gabriel Krummenacher, Mario Lucic, and Joachim M. Buhmann. Fast and Robust Least Squares Estimation in Corrupted Linear Models. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[8] Adrien-Marie Legendre (1805). On the Method of Least Squares. In D.E. Smith, editor, A Source Book in Mathematics (translated from the French), pages 576–579. New York: Dover Publications, 1959.

[9] Peter J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, 79(388):871–880, 1984.

[10] Peter J. Rousseeuw and Katrien van Driessen. Computing LTS Regression for Large Data Sets. Data Mining and Knowledge Discovery, 12(1):29–45, 2006.

[11] Thomas Blumensath and Mike E. Davies. Iterative Hard Thresholding for Compressed Sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[12] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On Iterative Hard Thresholding Methods for High-dimensional M-Estimation. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[13] Yiyuan She and Art B. Owen. Outlier Detection Using Nonconvex Penalized Regression. arXiv:1006.2592 (stat.ME).

[14] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In 26th International Conference on Machine Learning (ICML), 2009.
[15] Allen Y. Yang, Arvind Ganesh, Zihan Zhou, Shankar Sastry, and Yi Ma. A Review of Fast ℓ1-Minimization Algorithms for Robust Face Recognition. CoRR abs/1007.3753, 2012.

[16] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

[17] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. IEEE Transactions on Information Theory, 57(7):4660–4671, 2011.

[18] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.