# Robust Bayesian Regression via Hard Thresholding

Zheyi Fan 1,2, Zhaohui Li 3, Qingpei Hu 1,2
1 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China
2 School of Mathematical Sciences, University of Chinese Academy of Sciences, China
3 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA
1,2 {fanzheyi,qingpeihu}@amss.ac.cn, 3 zhaohui.li@gatech.edu

By combining robust regression and prior information, we develop an effective robust regression method that can resist adaptive adversarial attacks. Due to the widespread existence of noise and data corruption, it is necessary to recover the true regression parameters when a certain proportion of the response variables have been corrupted. Methods to overcome this problem often involve robust least-squares regression. However, few methods achieve good performance when dealing with severe adaptive adversarial attacks. Building on the combination of prior information and robust regression via hard thresholding from [1], this paper proposes an algorithm that improves the breakdown point when facing adaptive adversarial attacks. Furthermore, to improve robustness and reduce the estimation error caused by the inclusion of a prior, the idea of Bayesian reweighting is used to construct a more robust algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions. Extensive experiments show that, under different dataset attacks, our algorithms achieve state-of-the-art results compared with other benchmark algorithms, demonstrating the robustness of the proposed approach.

1 Introduction

Least-squares methods are widely used because of their simplicity and ease of operation. However, due to the inevitable existence of outliers, least-squares methods, such as linear regression, may suffer significant bias in practical applications. Therefore, to meet the challenge of learning reliable regression coefficients in the presence of significant corruption in the response vector, this paper focuses on robust least-squares regression (RLSR). RLSR has excellent application value in many fields, such as signal processing [8][12][26], economics [23], industry [21], biology [13], remote sensing [11] and intelligent transportation [25].

Given a data matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, the corresponding response vector $y \in \mathbb{R}^n$, and a number $k$ representing the number of corruptions in the data, the RLSR problem can be described as:

$$(\hat{w}, \hat{S}) = \arg\min_{w \in \mathbb{R}^p,\; S \subset [n],\; |S| = n-k} \; \sum_{i \in S} \big(y_i - x_i^T w\big)^2 \qquad (1)$$

That is, we aim to recover the correct point set $S$ and the regression coefficient $w$ simultaneously so as to achieve the minimum regression error. However, this problem is NP-hard, so it is difficult to optimize directly [19]. To solve the above problem, a commonly used data generation model is $y = X^T w^* + b^* + \boldsymbol{\epsilon}$, where $w^*$ is the true regression coefficient we wish to recover and $\boldsymbol{\epsilon}$ is a dense white noise vector drawn from a specified distribution, so that potentially all $n$ of its entries are nonzero. The vector $b^*$ is $k$-sparse, which means that it has only $k$ non-zero values, representing $k$ unbounded noise terms in the response vector. After years of development, there are many ways to find a reasonable solution to the problem in Eq. (1). However, these methods typically only achieve good performance under specific conditions.
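The generative model above can be made concrete with a short simulation. The sketch below is illustrative only: the function name, array shapes, and default parameter values are our own choices, although the ingredients (a unit-norm $w^*$, covariates drawn from $\mathcal{N}(0, I_d)$, dense Gaussian noise, and a $k$-sparse corruption vector with entries from $U[0, 10]$) mirror the experimental setup described later in Section 5.

```python
import numpy as np

def generate_corrupted_data(n=1000, d=100, k=100, sigma=1.0, seed=None):
    """Draw (X, y, w_star, b_star) from the model y = X^T w* + b* + eps.

    X is d x n (columns are samples), b* is k-sparse, eps is dense N(0, sigma^2).
    Sizes and the U[0, 10] corruption magnitudes are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)              # unit-norm true coefficient
    X = rng.normal(size=(d, n))                   # covariates x_i ~ N(0, I_d)
    eps = rng.normal(scale=sigma, size=n)         # dense white noise
    b_star = np.zeros(n)
    corrupted = rng.choice(n, size=k, replace=False)
    b_star[corrupted] = rng.uniform(0.0, 10.0, size=k)   # k unbounded corruptions
    y = X.T @ w_star + b_star + eps
    return X, y, w_star, b_star
```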
The main challenge is the low breakdown point of conventional methods. The breakdown point $k$ is a measure of robustness, namely the number of corruptions that an RLSR algorithm can tolerate. We can express $k$ as a proportion of all data points: $k = \alpha \cdot n$. Many RLSR algorithms cannot guarantee theoretical convergence as the value of $k$ increases. For example, McWilliams et al. [14] used weighted subsampling for linear regression, but only achieved a breakdown point of $\alpha = O(1/\sqrt{d})$. Prasad et al. [18] proposed a robust gradient estimator that can be applied to linear regression, but their method only tolerates corruption up to $\alpha = O(1/\log d)$. Other methods may have a higher breakdown point, but tend to assume a specific pattern of data corruption.

One representative adversary model for introducing data corruption is the oblivious adversarial attack (OAA), in which the opponent generates the $k$-sparse vector $b^*$ while completely ignoring $X$, $w^*$, and $\boldsymbol{\epsilon}$. The work of Bhatia et al. [1] and Suggala et al. [20] reported excellent results against OAAs by using a novel hard thresholding method; indeed, [20] suggested that $\alpha$ may even approach 1 as $n \to \infty$. The recent online fast robust regression algorithm [16] also converges consistently under a mild condition by using the Stochastic Gradient Descent (SGD) algorithm. However, these methods cannot resist adaptive adversarial attacks (AAAs), in which the opponent can view $X$, $w^*$, and $\boldsymbol{\epsilon}$ before determining $b^*$. Handling AAAs is a challenging task, so many methods can only guarantee a very low breakdown point, especially when the data distribution is not normal [4][9][15]. Bhatia et al. [2] proposed a thresholding operator-based robust regression method whose breakdown point reaches 1/65 for the noiseless model, i.e., $\boldsymbol{\epsilon} \equiv 0$; however, their method gives a consistent estimate only in the noiseless case. Karmalkar et al. [10] obtained good results in sparse robust linear regression by applying $L_1$ regression, and the breakdown point of their method reaches 0.239, but their estimate is consistent only when the white noise $\boldsymbol{\epsilon}$ is sparse. Diakonikolas et al. [6] considered the situation in which $X$ and $y$ may have outliers simultaneously and proposed a filter algorithm whose error bound is $O(\alpha \log(1/\alpha)\sigma)$; however, their method requires the exact covariance of the true data distribution, or numerous unlabeled correct data with which to estimate it, which are often unavailable in practice.

The limitations of the above-mentioned methods can be attributed to a lack of prior knowledge about the real data, making it difficult to distinguish the set of correct points in the case of AAAs. Gülçehre et al. [7] showed that prior information effectively improves the accuracy of machine learning. In many application scenarios in industry, economics, and biology, prior knowledge such as previous experimental data or engineering data is available. The goal of this paper is to propose a new robust regression method that can integrate available prior information, even if that prior information is not very accurate. The typical approach for integrating prior information is the Bayesian method, which provides a way of formalizing the process of learning from data to update beliefs in accordance with recent notions of knowledge synthesis [5]. However, the generic Bayesian method is also sensitive to outliers. Thus, robust Bayesian methods should be considered in order to produce more reliable estimates in the presence of data corruption. Polson et al.
[17] used local variance to assign each point a local parameter that makes the estimation result robust. Furthermore, Wang et al. [22] proposed a local parameterization method and used empirical Bayesian estimation to determine the global parameters. Bhatia et al. [3] proposed a Bayesian descent method using an unadjusted Langevin algorithm (ULA), which guarantees convergence in a finite number of steps. Wang et al. [24] employed Bayesian reweighting to assign different weights to samples, thus reducing the impact of outliers.

In this paper, we combine the Bayesian method with the hard thresholding method of [1] and propose two algorithms, which we call TRIP and BRHT. By assigning a simple normal prior to the coefficients, TRIP significantly increases the breakdown point when resisting AAAs. To further improve the accuracy of estimation, we propose the BRHT algorithm, which applies the Bayesian reweighting method [24] to coefficient estimation. Experiments show that BRHT is even more resistant to AAAs and gives lower estimation errors, demonstrating that our method achieves significantly improved robustness.

Our Contributions: The main contribution of this paper is to propose new methods that combine prior information and robust regression to increase the breakdown point when encountering AAAs. We derive the theoretical guarantees given by our proposed algorithms. Compared with the consistent robust regression (CRR) algorithm proposed in [1], we prove that our algorithms guarantee convergence under a weaker condition, which also shows that our methods improve the breakdown point. We also conduct extensive experiments to test the effectiveness of the algorithms. Compared with several baseline algorithms, the experimental results show that our methods significantly outperform the alternatives under AAAs. Moreover, the BRHT algorithm is also competitive against OAAs.

Paper Organization: We state the problem formulation and present some notation and tools in Section 2. In Section 3, we describe the details of our proposed TRIP and BRHT algorithms. The theoretical properties of these two algorithms are discussed in Section 4. Section 5 presents extensive experimental results that demonstrate the excellent performance of our proposed algorithms. Section 6 concludes the paper.

2 Problem Formulation

In this study, we mainly focus on the problem of RLSR under AAAs. We are given a covariate matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $x_i \in \mathbb{R}^d$. The true coefficient of the regression model is denoted by $w^*$. The response vector $y \in \mathbb{R}^n$ is generated by:

$$y = X^T w^* + b^* + \boldsymbol{\epsilon} \qquad (2)$$

The perturbations to the response vector consist of two parts: the adversarial corruption introduced by $b^*$, which is a $k$-sparse vector, and the dense white noise with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Our goal is to recover the true regression coefficient $w^*$ while simultaneously identifying the uncorrupted set $S$.

To illustrate our problem, we first pay attention to the standard robust regression problem in Eq. (1). From the viewpoint of probability, the problem can be transformed into a log-likelihood version:

$$(\hat{w}, \hat{S}) = \arg\max_{w \in \mathbb{R}^p,\; S \subset [n],\; |S| = n-k} \; \sum_{i \in S} \log \ell(w \mid y_i, x_i, \sigma^2) \qquad (3)$$

We now convert the problem in Eq. (3) into a Bayesian version. From the Bayesian viewpoint, we consider $p_w(w)$ as the prior of the coefficients in the model. In addition, we add the localization parameter $r$ and its prior $p_r(r)$ to reflect the change introduced by each additional sample.
For any subset $S \subset [n]$, the joint distribution of all parameters and the data $X_S, y_S$ is:

$$p(y_S, w, r_S \mid X_S) = p_w(w)\, p_r(r_S) \prod_{i \in S} \ell(y_i \mid r_i, w, x_i, \sigma^2) \qquad (4)$$

The posterior distribution $p(w, r_S \mid X_S, y_S)$ and the joint distribution $p(y_S, w, r_S \mid X_S)$ differ by a normalization constant. We ignore this normalization constant in the posterior distribution and only consider the main terms of the parameters. We then formulate the Bayesian RLSR problem of searching for the subset and coefficients by maximizing the log-posterior:

$$(\hat{w}, \hat{S}) = \arg\max_{w \in \mathbb{R}^p,\; r \in \mathbb{R}^n_+,\; S \subset [n],\, |S| = n-k} \; \log p_w(w) + \sum_{i \in S} \big[\log \ell(y_i \mid r_i, w, x_i, \sigma^2) + \log p_r(r_i)\big] \qquad (5)$$

Note that we do not place any prior on $\sigma^2$ and simply treat it as an adjustable parameter. This is because, in the initial stage of the algorithm described in Section 3, the estimated $\sigma^2$ would be large due to the existence of outliers, and this would bias the estimation of $w$ excessively toward the prior distribution. This bias is harmful, especially when the prior is not sufficiently accurate. This phenomenon can be observed in Section 3.1.

To prove the positive effect of the prior on the RLSR problem, we require the properties of Subset Strong Convexity (SSC) and Subset Strong Smoothness (SSS). Given a set $S \subset [n]$, $X_S := [x_i]_{i \in S} \in \mathbb{R}^{d \times |S|}$ denotes the matrix whose columns are indexed by $S$. The smallest and largest eigenvalues of a square symmetric matrix $X$ are denoted by $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$.

Definition 1 (SSC Property). A matrix $X \in \mathbb{R}^{d \times n}$ is said to satisfy the SSC property at level $m$ with constant $\lambda_m$ if the following holds:

$$\lambda_m \le \min_{|S| = m} \lambda_{\min}(X_S X_S^T) \qquad (6)$$

Definition 2 (SSS Property). A matrix $X \in \mathbb{R}^{d \times n}$ is said to satisfy the SSS property at level $m$ with constant $\Lambda_m$ if the following holds:

$$\max_{|S| = m} \lambda_{\max}(X_S X_S^T) \le \Lambda_m \qquad (7)$$

These two properties were proposed in [2]; they are intended to standardize the generation of the data matrix so that it is not too abnormal, and they are used to prove the theorems in Section 4.

3 Methodology

We first ignore the localization parameter $r$ in Eq. (5) and propose a simple method called TRIP in Section 3.1. TRIP demonstrates the effect of a prior on the hard thresholding method. To improve the robustness and the accuracy of the estimation, the BRHT algorithm is proposed in Section 3.2.

3.1 TRIP: Hard Thresholding Approach to Robust Regression with Simple Prior

We propose a robust regression algorithm called TRIP (Algorithm 1), a hard Thresholding approach to Robust regression with sImple Prior. In this subsection, only the prior $p_w(w)$ is considered and the localization parameter $r$ is not added to the model. We assume the variance $\sigma^2$ of $\epsilon_i$ can be set by ourselves and that the prior $p_w(w)$ follows a normal distribution $\mathcal{N}(w_0, \Sigma_0)$, where $w_0$ and $\Sigma_0$ are determined in advance. Through these simple parameter settings, the problem in Eq. (5) is transformed into the problem in Eq. (1) with an additional regularization term:

$$(\hat{w}, \hat{S}) = \arg\min_{w \in \mathbb{R}^p,\; S \subset [n],\, |S| = n-k} \; \sum_{i \in S} (y_i - x_i^T w)^2 + (w - w_0)^T M (w - w_0) \qquad (8)$$

where $M = (\Sigma_0/\sigma^2)^{-1}$.

Algorithm 1 TRIP: hard Thresholding approach to Robust regression with sImple Prior
Input: Covariates $X = [x_1, \ldots, x_n]$, responses $y = [y_1, \ldots, y_n]^T$, prior knowledge $w_0$, penalty matrix $M$, corruption index $k$, tolerance $\epsilon$
Output: solution $\hat{w}$
1: $b^0 \leftarrow 0$, $t \leftarrow 0$, $P_{MX} \leftarrow X^T (XX^T + M)^{-1} X$, $P_{MM} \leftarrow X^T (XX^T + M)^{-1} M$
2: while $\|b^t - b^{t-1}\|_2 > \epsilon$ do
3:   $b^{t+1} \leftarrow \mathrm{HT}_k\big(P_{MX} b^t + (I - P_{MX})\, y - P_{MM}\, w_0\big)$
4:   $t \leftarrow t + 1$
5: end while
6: return $\hat{w} \leftarrow (XX^T)^{-1} X (y - b^t)$
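The listing above translates almost directly into NumPy. The sketch below is a minimal, unoptimized rendering of Algorithm 1 under the notation of this section ($X$ is $d \times n$ with columns as samples); the function and variable names are ours, and the hard-thresholding operator used here simply keeps the $k$ largest-magnitude entries, as formalized in Definition 3 below.

```python
import numpy as np

def hard_threshold(v, k):
    """HT_k: keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(-np.abs(v))[:k]
        out[idx] = v[idx]
    return out

def trip(X, y, w0, M, k, tol=1e-4, max_iter=500):
    """Minimal sketch of Algorithm 1 (TRIP). X is d x n, y has length n."""
    d, n = X.shape
    G = X @ X.T + M                        # (XX^T + M)
    P_MX = X.T @ np.linalg.solve(G, X)     # X^T (XX^T + M)^{-1} X
    P_MM = X.T @ np.linalg.solve(G, M)     # X^T (XX^T + M)^{-1} M
    b = np.zeros(n)
    for _ in range(max_iter):
        b_new = hard_threshold(P_MX @ b + (np.eye(n) - P_MX) @ y - P_MM @ w0, k)
        if np.linalg.norm(b_new - b) <= tol:
            b = b_new
            break
        b = b_new
    # final coefficient estimate on the de-corrupted responses
    w_hat = np.linalg.solve(X @ X.T, X @ (y - b))
    return w_hat, b
```

In practice, the simple form $M = sI$ with $s$ chosen by cross-validation (suggested at the end of this subsection) keeps the sketch's inputs easy to specify.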
To solve the problem in Eq. (8), we are motivated by the hard thresholding method proposed by Bhatia et al. [1], which concentrates on recovering the errors instead of selecting the cleanest set. The problem in Eq. (8) can be formulated as

$$\min_{w \in \mathbb{R}^p,\; \|b\|_0 \le k} \; \frac{1}{2}\big\|X^T w - (y - b)\big\|_2^2 + \frac{1}{2}(w - w_0)^T M (w - w_0).$$

Thus, if we have an estimate $\hat{b}$ of the corruption vector $b^*$, the estimate of $w$ can easily be obtained as $\hat{w} = (XX^T + M)^{-1}[X(y - \hat{b}) + M w_0]$. Substituting this estimate into the optimization problem, we obtain a new formulation of the problem:

$$\min_{\|b\|_0 \le k} \; f(b) = \frac{1}{2}\big\|(P_{MX} - I)(y - b) + P_{MM}\, w_0\big\|_2^2 \qquad (9)$$

where $P_{MX} = X^T (XX^T + M)^{-1} X$ and $P_{MM} = X^T (XX^T + M)^{-1} M$. The hard thresholding step in the TRIP algorithm can be viewed as $b^{t+1} = \mathrm{HT}_k(b^t - \nabla f(b^t))$, where $k$ is the selected corruption index. The hard thresholding operator $\mathrm{HT}_k$ is defined as follows.

Definition 3 (Hard Thresholding). For any vector $r \in \mathbb{R}^n$, let $\delta_r^{-1}(i)$ denote the rank of the $i$th element of $r$ when the elements are arranged in descending order of magnitude. Then, for any $k < n$, the hard thresholding operator is defined as $\hat{r} = \mathrm{HT}_k(r)$, where $\hat{r}_i = r_i$ if $\delta_r^{-1}(i) \le k$ and $\hat{r}_i = 0$ otherwise.

The difference between the proposed TRIP algorithm and the original CRR [1] lies in the form of the iteration step. The iteration step in both TRIP and CRR can be expressed uniformly as $\mathrm{HT}_k(y - X^T w^t)$, but $w^t = (XX^T + M)^{-1}[X(y - b^t) + M w_0]$ in TRIP, whereas $w^t = (XX^T)^{-1} X (y - b^t)$ in CRR. The $w^t$ in CRR is simply a least-squares estimate, while the prior added in TRIP can be regarded as adding a quadratic regularization term in each iteration. This quadratic regularization prevents the iterates from straying too far from the prior mean, which also helps to ensure the numerical stability of the solution. Thus, as long as the prior is not mis-specified too severely, TRIP is more likely to identify the uncorrupted points, and the final result of TRIP is more robust than that of CRR. Therefore, the prior plays an important role in TRIP. However, the weight of the prior in the solution depends entirely on the matrix $M = (\Sigma_0/\sigma^2)^{-1}$. Therefore, if we use an estimate $\hat{\sigma}^2$ in place of $\sigma^2$, or give $\sigma^2$ a prior and compute its posterior distribution, the overestimation of $\sigma^2$ will cause a severe increase in $M$ in the initial iteration steps. This will mislead the iteration and cause some deviation in the final results. To overcome this difficulty, we can directly treat $M$ as an adjustable parameter by specifying a simple form such as $M = sI$, where $s$ is a positive number; a suitable value of $s$ can be chosen through 5-fold or 10-fold cross-validation.

3.2 BRHT: Robust Bayesian Reweighting Regression via Hard Thresholding

In this subsection, we describe how the Bayesian reweighting method is combined with hard thresholding to give a more robust algorithm, BRHT (Algorithm 2), a robust Bayesian Reweighting regression via Hard Thresholding.

Algorithm 2 BRHT: robust Bayesian Reweighting regression via Hard Thresholding
Input: Covariates $X = [x_1, \ldots, x_n]$, responses $y = [y_1, \ldots, y_n]^T$, prior distributions $p_r(r)$, $p_w(w)$, corruption index $k$, tolerance $\epsilon$
Output: solution $\hat{w}$
1: $b^0 \leftarrow 0$, $t \leftarrow 0$
2: while $\|b^t - b^{t-1}\|_2 > \epsilon$ do
3:   $w^t \leftarrow \mathrm{VBEM}(X, y - b^t, p_r(r), p_w(w))$
4:   $b^{t+1} \leftarrow \mathrm{HT}_k(y - X^T w^t)$
5:   $t \leftarrow t + 1$
6: end while
7: return $\hat{w} \leftarrow (XX^T)^{-1} X (y - b^t)$

Algorithm 3 VBEM: Variational Bayes Expectation Maximization
Input: Covariates $X = [x_1, \ldots, x_n]$, responses $y = [y_1, \ldots, y_n]^T$, prior distributions $p_r(r)$, $p_w(w)$
Output: solution $\hat{w}$
1: repeat
2:   update $q(r)$
3:   update $q(w)$
4: until convergence
5: return $\hat{w} \leftarrow \mathrm{MAP}(q(w))$
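The outer loop of Algorithm 2 is easy to sketch once the per-iteration coefficient estimate is abstracted away. In the sketch below, `estimate_w` is a hypothetical callable standing in for the VBEM step (a mean-field sketch of that step appears after the reweighted probabilistic model is introduced in the next part of this section); the function names and defaults are ours, not the paper's.

```python
import numpy as np

def hard_threshold(v, k):
    """HT_k: keep the k largest-magnitude entries of v (Definition 3)."""
    out = np.zeros_like(v)
    idx = np.argsort(-np.abs(v))[:k]
    out[idx] = v[idx]
    return out

def brht_outer_loop(X, y, estimate_w, k, tol=1e-4, max_iter=200):
    """Sketch of Algorithm 2 (BRHT).

    `estimate_w(X, y_clean)` returns a coefficient estimate from the partially
    de-corrupted responses y - b^t; the paper obtains it via VBEM on the
    reweighted probabilistic model.
    """
    d, n = X.shape
    b = np.zeros(n)
    for _ in range(max_iter):
        w_t = estimate_w(X, y - b)                 # robust Bayesian estimate of w
        b_new = hard_threshold(y - X.T @ w_t, k)   # re-estimate the corruption
        if np.linalg.norm(b_new - b) <= tol:
            b = b_new
            break
        b = b_new
    w_hat = np.linalg.solve(X @ X.T, X @ (y - b))  # final least-squares refit
    return w_hat, b
```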
We first introduce the reweighted probabilistic model (RPM) proposed in [24] for traditional linear regression. For the covariates $X$ and the response $y$, the RPM can be formulated as follows:

$$p(y, w, r \mid X) = \frac{1}{Z}\, p_w(w)\, p_r(r) \prod_{i=1}^{n} \ell(y_i \mid w, x_i, \sigma^2)^{r_i} \qquad (10)$$

where $r$ is the vector of local weights assigned to the samples, $Z$ is the normalizing constant, $\ell(y_i \mid w, x_i, \sigma^2)$ is the likelihood of the normal distribution $\mathcal{N}(x_i^T w, \sigma^2)$, and $p_w(w)$, $p_r(r)$ are the priors of $w$ and $r$, respectively. Ignoring the normalizing constant, the problem in Eq. (5) can be transformed into the following form under this RPM setting:

$$(\hat{w}, \hat{S}) = \arg\max_{w \in \mathbb{R}^p,\; r \in \mathbb{R}^n_+,\; S \subset [n],\, |S| = n-k} \; \log p_w(w) + \sum_{i \in S} \big[r_i \log \ell(y_i \mid w, x_i, \sigma^2) + \log p_r(r_i)\big] \qquad (11)$$

The specific form of the prior $p_r(r_i)$ can be any distribution over nonnegative random variables, including (but not limited to) the Gamma, Beta, or log-normal distribution. Here, we again use the normal distribution $\mathcal{N}(w_0, \Sigma_0)$ as the form of $p_w(w)$. To solve the optimization problem in Eq. (11), we use the two-step BRHT algorithm. The key iteration step in BRHT is $b^{t+1} \leftarrow \mathrm{HT}_k(y - X^T w^t)$, where $w^t$ is calculated by maximizing the log-posterior of the RPM:

$$(w^t, r^t) = \arg\max_{w \in \mathbb{R}^d,\; r \in \mathbb{R}^n_+} \; \log p_w(w) + \log p_r(r) + \sum_{i=1}^{n} r_i \log \ell(y_i - b_i^t \mid w, x_i, \sigma^2) \qquad (12)$$

However, direct inference in Eq. (12) is hard because of the nonconvexity of the problem. In general, the parameters in this RPM can be divided into two parts: the global variable $w$ and the local latent variables $r$. To solve the inference problem of the RPM, a feasible method is variational Bayesian expectation maximization (VBEM). We set $q(w, r) = q(w)q(r)$ to approximate the true posterior after several iterations of VBEM (Algorithm 3), and replace the estimate of $w$ by the maximum a posteriori (MAP) estimate from the final approximate posterior. Full details of the VBEM method are given in Appendix A.

It is reasonable to ask why we use Bayesian reweighting. Note that the iteration step in the TRIP algorithm is $b^{t+1} \leftarrow \mathrm{HT}_k(P_{MX} b^t + (I - P_{MX}) y - P_{MM} w_0) = \mathrm{HT}_k(y - X^T w^t)$, where $w^t = (XX^T + M)^{-1}[X(y - b^t) + M w_0]$. Although we can show that TRIP already guarantees theoretical convergence, the estimation of $w$ in every iteration still uses least squares with a penalty term, which is easily affected by the corrupted points. This disadvantage forces us to assign a higher weight to the prior to resist severe data corruption in the case of AAAs. However, a higher weight on the prior means a larger estimation bias. When the Bayesian reweighting method is applied, the estimate in each step is more robust than the least-squares result, and thus the weight on the prior can be reduced while still guiding the iteration. Therefore, the estimation bias is relatively small and the results are more robust. This is reflected in the experimental results presented in Section 5.

We should also explain why we place a prior on only a few parameters. It is important to ensure that the prior weight is neither too high nor too low. As mentioned earlier, if we treat $\sigma^2$ as a parameter to be estimated, this places too much weight on the prior. We also avoid priors $p_w(w)$ that introduce additional uncertainty, such as the hierarchical form $p(w, \alpha) = \mathcal{N}(w_0, \alpha^{-1}\Sigma_0)\,\mathrm{Gam}(\alpha \mid a_\alpha, b_\alpha)$. Data corruption means that a subset of the training data may differ greatly from the prior information. Thus, when calculating the posterior, the variance of $w$ controlled by $\alpha$ would become very large in order to fit the data, and the prior information $w_0$ would receive a lower weight in the inference step. These problems would mislead the estimation and the selection of subsets, so we do not consider the uncertainty of these quantities and simply treat them as model parameters to be set in advance. An adjustment method for all the parameters in BRHT is described in Appendix D.
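To make the VBEM step concrete, the following sketch gives one possible mean-field implementation of Algorithm 3 under assumptions we choose for illustration: a Gamma$(a, b)$ prior (shape $a$, rate $b$) on each weight $r_i$ and the normal prior $\mathcal{N}(w_0, \Sigma_0)$ on $w$, with $\sigma^2$ fixed. The closed-form $q(r)$ and $q(w)$ updates below follow from standard mean-field calculations for the model in Eq. (10); the paper's exact updates are given in its Appendix A and may differ in detail.

```python
import numpy as np

def vbem_estimate(X, y, w0, Sigma0, sigma2=1.0, a=1.0, b=1.0, n_iter=50):
    """Mean-field VBEM sketch for the RPM in Eq. (10).

    Assumptions (ours, for illustration): p_r(r_i) = Gamma(a, rate=b),
    p_w(w) = N(w0, Sigma0), sigma2 fixed.  Returns the mean of the Gaussian
    factor q(w) (its MAP) and the expected sample weights E[r_i].
    """
    d, n = X.shape
    Sigma0_inv = np.linalg.inv(Sigma0)
    mu, Sigma = w0.copy(), Sigma0.copy()
    for _ in range(n_iter):
        # E_{q(w)}[log N(y_i | x_i^T w, sigma2)] for each sample i
        resid2 = (y - X.T @ mu) ** 2 + np.einsum("ij,jk,ik->i", X.T, Sigma, X.T)
        exp_loglik = -0.5 * np.log(2 * np.pi * sigma2) - resid2 / (2 * sigma2)
        # q(r_i) = Gamma(a, b - E[log lik_i]); its mean is the sample weight
        r_mean = a / (b - exp_loglik)
        # q(w) is Gaussian: a weighted ridge regression toward the prior mean
        prec = Sigma0_inv + (X * r_mean) @ X.T / sigma2
        Sigma = np.linalg.inv(prec)
        mu = Sigma @ (Sigma0_inv @ w0 + X @ (r_mean * y) / sigma2)
    return mu, r_mean
```

A thin wrapper such as `lambda X, y: vbem_estimate(X, y, w0, Sigma0)[0]` can then be passed as the `estimate_w` argument of the BRHT outer loop sketched earlier, with $y - b^t$ supplied as the response vector.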
4 Theoretical Convergence Analysis

In this section, we establish the convergence theory for the TRIP algorithm and explain how the prior effectively enhances the convergence of the RLSR model; Theorems 1 and 2 summarize the results. We also present theoretical guarantees for the BRHT algorithm in Theorems 3-5, which further demonstrate the special properties achieved by using Bayesian reweighting. Before presenting the convergence results, we first introduce some notation. Let $\lambda^t := (XX^T + M)^{-1} X (b^t - b^*)$, $g := (I - P_{MX})\boldsymbol{\epsilon}$, and $f := P_{MM}(w^* - w_0)$. Let $S_t := [n] \setminus \mathrm{supp}(b^t)$ be the chosen subset that is considered to be uncorrupted, and let $I_t := \mathrm{supp}(b^t) \cup \mathrm{supp}(b^*)$.

Theorem 1. Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the given data matrix and $y = X^T w^* + b^* + \boldsymbol{\epsilon}$ the corrupted output with sparse corruption satisfying $\|b^*\|_0 = k^* \le n$. For a specific positive semi-definite matrix $M$, let $X$ satisfy the SSC and SSS properties such that $2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T + M)}} < 1$. Then, if $k > k^*$, it is guaranteed with probability at least $1 - \delta$ that, for any $\varepsilon, \delta > 0$,
$$\|b^{T_0} - b^*\|_2 \;\le\; \varepsilon + O(e_0) + O\!\left(\frac{\sqrt{\Lambda_{k+k^*}}\,\lambda_{\max}(M)}{\lambda_{\min}(XX^T + M)}\right)\|w^* - w_0\|_2$$
after $T_0 = O\big(\log\frac{\|b^*\|_2}{\varepsilon}\big)$ iterations of TRIP, where $e_0 = O\Big(\sigma\sqrt{(k + k^*)\log\frac{n}{\delta (k + k^*)}}\Big)$ under the normal design.

Theorem 2. Under the conditions of Theorem 1, and assuming that the $x_i \in \mathbb{R}^d$ are generated from the standard normal distribution, if $k > k^*$, it is guaranteed with probability at least $1 - \delta$ that, for any $\varepsilon, \delta > 0$, the estimated coefficient $w^{T_0}$ satisfies
$$\|w^{T_0} - w^*\|_2 \;\le\; O\!\left(\tfrac{1}{\sqrt{n}}\right)(\varepsilon + e_0) + O\!\left(\frac{\sqrt{k + k^*}\,\lambda_{\max}(M)}{n^{3/2}}\right)\|w^* - w_0\|_2$$
after $T_0 = O\big(\log\frac{\|b^*\|_2}{\varepsilon}\big)$ steps.

For positive semi-definite matrices $XX^T$ and $M$, $\lambda_{\min}(XX^T + M) \ge \lambda_{\min}(XX^T) + \lambda_{\min}(M)$. Thus, the condition $2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T + M)}} < 1$ in Theorem 1 is weaker than the condition $2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T)}} < 1$ in Lemma 5 of Bhatia et al. [1], which shows that a prior can effectively improve the convergence of the algorithm. Assigning a higher weight to the prior means that $M$ has larger eigenvalues, so the convergence condition is more easily satisfied. As a result, the TRIP algorithm can tolerate a higher proportion of outliers than the CRR method of Bhatia et al. [1] and achieves a higher breakdown point. In fact, under the condition $\lim_{n \to \infty} \lambda_{\min}(M)/n = \xi$, an approximate expression for the breakdown point of TRIP can be given when $\xi$ is not too large: it is linear in $n$ with a constant involving $0.3023$, $0.0887$, and $0.0040\,\xi$; details can be found in Appendix C.2.
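The comparison of convergence conditions can be summarized in a single display. The following block merely restates the argument above (writing the condition in the ratio form used in Theorem 1) and introduces no new result:

$$\lambda_{\min}(XX^T + M) \;\ge\; \lambda_{\min}(XX^T) + \lambda_{\min}(M)
\quad\Longrightarrow\quad
2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T + M)}} \;\le\; 2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T)}},$$

so whenever the CRR condition (right-hand side less than 1) holds, the TRIP condition (left-hand side less than 1) holds as well, and a larger $\lambda_{\min}(M)$, i.e., a heavier prior weight, makes the TRIP condition strictly easier to satisfy.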
However, the improved convergence comes at the cost of an unavoidable reduction in precision. This can be seen from Theorem 2: if the data corruption is such that $k$ is $O(n)$ and the maximum eigenvalue of $M$ is also $O(n)$, then the bias of $\hat{w}$ cannot be decreased by adding more samples, which shows that there is a trade-off between convergence and accuracy. A reliable prior improves both accuracy and convergence because it can be given a higher weight. However, an inaccurate prior can also be helpful as long as it is quite different from the distribution of the outliers, and the convergence can then be improved through a prior with a low weight.

To prove the properties of our BRHT algorithm, we define the following two intermediate quantities to simplify the exposition:

$$U(w, r, S) = \log p_w(w) + \sum_{i \in S} \big[\log p_r(r_i) + r_i \log \ell(y_i \mid w, x_i, \sigma^2)\big] \qquad (13)$$

$$M(w, r, b) = \log p_w(w) + \sum_{i=1}^{n} \big[\log p_r(r_i) + r_i \log \ell(y_i - b_i \mid w, x_i, \sigma^2)\big] \qquad (14)$$

Theorem 3. Suppose that the priors of the $r_i$ are independently and identically distributed (iid). Consider the $t$th iteration step of the BRHT algorithm, where $(w^t, r^t) = \arg\max_{w \in \mathbb{R}^d, r \in \mathbb{R}^n_+} M(w, r, b^t)$ and $b^t = \mathrm{HT}_k(y - X^T w^{t-1})$ is obtained from the hard thresholding step. Then, we have that $U(w^t, r^t, S_{t+1}) \ge U(w^{t-1}, r^{t-1}, S_t)$.

Theorem 4. Consider a data matrix $X$ and a specific positive semi-definite matrix $M$ satisfying the SSC and SSS properties such that $2\sqrt{\frac{\Lambda_{k+k^*}}{\lambda_{\min}(XX^T + M)}} < 1$. Then, there exist $\alpha > 0$ and $0 < \gamma \le 1 + \epsilon$, where $\epsilon$ is a small number, such that if $k > k^*$ and the covariance $\Sigma$ in the prior $p_w(w)$ is $\alpha\sigma^2 M^{-1}$, it is guaranteed with probability at least $1 - \delta$ that, for any $\varepsilon, \delta > 0$,
$$\|b^{T_0} - b^*\|_2 \;\le\; \varepsilon + O(e_0) + O\!\left(\frac{\sqrt{\Lambda_{k+k^*}}\,\lambda_{\max}(M)}{\lambda_{\min}(XX^T + M)}\right)\gamma\, \|w^* - w_0\|_2$$
after $T_0 = O\big(\log\frac{\gamma\|b^*\|_2}{\varepsilon}\big)$ iterations of BRHT, where $e_0 = O\Big(\sigma\sqrt{(k + k^*)\log\frac{n}{\delta(k + k^*)}}\Big)$ under the normal design.

Theorem 5. Under the conditions of Theorem 4 and with the $x_i \in \mathbb{R}^d$ generated from the standard normal distribution, there exist $\alpha > 0$ and $0 < \gamma \le 1 + \epsilon$, where $\epsilon$ is a small number, such that if $k > k^*$ and the covariance $\Sigma$ in the prior $p_w(w)$ is $\alpha\sigma^2 M^{-1}$, it is guaranteed with probability at least $1 - \delta$ that, for any $\varepsilon, \delta > 0$, the estimated coefficient $w^{T_0}$ satisfies
$$\|w^{T_0} - w^*\|_2 \;\le\; O\!\left(\tfrac{1}{\sqrt{n}}\right)(\varepsilon + e_0) + O\!\left(\frac{\sqrt{k + k^*}\,\lambda_{\max}(M)}{n^{3/2}}\right)\gamma\,\|w^* - w_0\|_2$$
after $T_0 = O\big(\log\frac{\gamma\|b^*\|_2}{\varepsilon}\big)$ steps.

Theorem 3 shows that our BRHT algorithm is reasonable because it improves the objective of the problem in Eq. (11) at each step. Theorems 4 and 5 guarantee the convergence of the parameters: if introducing the prior $p_w(w) = \mathcal{N}(w_0, \Sigma_0)$ into the TRIP algorithm makes it convergent, then there exists some $\alpha > 0$ such that the prior $p_w(w) = \mathcal{N}(w_0, \alpha\Sigma_0)$ in the BRHT algorithm also guarantees convergent parameters, and the bias of the estimate of $w$ will be $\gamma$ times that of TRIP. Note that $\alpha$ is usually relatively large in practice, which results in a lower prior weight in BRHT. Thus, when convergence can be guaranteed, the bias of the estimator $\hat{w}$ can be significantly reduced because BRHT assigns a lower weight to the prior. Even if the data are seriously corrupted, BRHT can ensure good results without significant error. All proofs of these theorems are given in Appendix C.

5 Experiments

In this section, we first consider how to effectively corrupt the dataset using two different attacks, OAA and AAA. We then report an extensive experimental evaluation to verify the robustness of the proposed methods.

Algorithm 4 ADCA: Adaptive Data Corruption Algorithm
Input: Covariates $X = [x_1, \ldots, x_n]$, responses $y = [y_1, \ldots, y_n]^T$, true parameter $w^*$, penalty coefficient $\delta$, corruption index $k$, tolerance $\epsilon$
Output: corrupted responses
1: $b^0 \leftarrow 0$, $t \leftarrow 0$, $P_{\delta X} \leftarrow X^T (XX^T - \delta I)^{-1} X$, $P_\delta \leftarrow X^T (XX^T - \delta I)^{-1} \delta I$
2: while $\|b^t - b^{t-1}\|_2 > \epsilon$ do
3:   $b^{t+1} \leftarrow \mathrm{HT}_k\big(P_{\delta X} b^t + (I - P_{\delta X})\, y + P_\delta\, w^*\big)$
4:   $t \leftarrow t + 1$
5: end while
6: $\hat{w} \leftarrow (XX^T)^{-1} X (y - b^t)$
7: $C \leftarrow \mathrm{supp}(b^t)$
8: return $y_C \leftarrow X_C^T \hat{w}$
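Under the reconstruction of Algorithm 4 given above, the adaptive attack is itself a TRIP-like fixed-point iteration: it uses knowledge of the true $w^*$ to select a $k$-sparse corruption pattern and then overwrites the selected responses with predictions from the resulting misleading fit $\hat{w}$. The sketch below follows that listing; the function name, the defaults, and the sign convention on the $\delta I$ term are our reading of the pseudocode, not code from the paper.

```python
import numpy as np

def adca_attack(X, y, w_star, delta, k, tol=1e-4, max_iter=500):
    """Sketch of Algorithm 4 (ADCA): adaptively corrupt k responses.

    X is d x n, y has length n, w_star is the true coefficient, delta is the
    penalty coefficient (e.g. 0.1*n or 0.2*n as in Section 5.2).
    """
    d, n = X.shape
    G = X @ X.T - delta * np.eye(d)            # (XX^T - delta I), assumed sign
    P_dX = X.T @ np.linalg.solve(G, X)
    P_d = X.T @ np.linalg.solve(G, delta * np.eye(d))
    b = np.zeros(n)
    for _ in range(max_iter):
        v = P_dX @ b + (np.eye(n) - P_dX) @ y + P_d @ w_star
        b_new = np.zeros(n)
        top = np.argsort(-np.abs(v))[:k]       # hard thresholding HT_k
        b_new[top] = v[top]
        if np.linalg.norm(b_new - b) <= tol:
            b = b_new
            break
        b = b_new
    w_hat = np.linalg.solve(X @ X.T, X @ (y - b))   # misleading fit
    C = np.flatnonzero(b)                           # attacked indices
    y_attacked = y.copy()
    y_attacked[C] = X[:, C].T @ w_hat               # overwrite selected responses
    return y_attacked, C
```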
5.1 Data and Metrics

In our experiments, data generation consists of two steps. First, we generate the basic model. The true coefficient $w^*$ is chosen to be a random unit-norm vector. The covariates $x_i$ are iid $\mathcal{N}(0, I_d)$. The data are generated by $y_i = x_i^T w^* + \epsilon_i$, where the $\epsilon_i$ are iid $\mathcal{N}(0, \sigma^2)$; we set $\sigma = 1$ in the experiments. The second step is to generate the corrupted data using two kinds of attacks, OAA and AAA, as described in Section 5.2; the aim is to produce $k$ corrupted responses in the whole dataset. The prior coefficient $w_0$ is generated as $w^* + \nu u$, where $u$ is a random unit-norm vector and $\nu$ is a non-negative number ($\nu$ is set to 0.5 unless otherwise stated). $\Sigma_0$ takes the form $sI$, where $s$ takes a different value for each method.

All parameters are fixed within each experiment. Following the setting in [1], we measure the performance of the recovered regression coefficients by the standard $L_2$ error, $r_{\hat{w}} = \|\hat{w} - w^*\|_2$. To judge whether an algorithm has converged, we use the termination criterion $\|w^{t+1} - w^t\|_2 \le 10^{-4}$. All results are averaged over 10 runs.

5.2 Corruption Methods

To demonstrate the efficiency of our proposed methods, we apply two different attacks to the dataset: OAA and AAA. The details of these two attacks are as follows.

OAA: The set of corrupted points $S$ is selected as a uniformly random $k$-sized subset of $[n]$, and the corresponding response variables are set as $y_i = x_i^T w^* + b_i + \epsilon_i$, where the $b_i$ are sampled from the uniform distribution $U[0, 10]$ and the white noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

AAA: We use all information from the true data distribution to corrupt the data, via the proposed adaptive data corruption algorithm (ADCA). This algorithm is quite similar to TRIP; full details of ADCA (Algorithm 4) are given in Appendix B. $\delta$ is set to $0.1n$ for $n = 1000, p = 200$, and to $0.2n$ for $n = 2000, p = 100$. Both TRIP and BRHT employ the prior $p_w(w)$, but the variance $\Sigma_0$ of $p_w(w)$ under OAA is set to be four times higher than that under AAA, which means that the prior has a higher weight under AAA. Other parameters of these two algorithms are fixed. The prior distribution $p_r(r)$ is set to the Gamma distribution unless otherwise stated. We also design a leverage point attack (LPA) to further test the robustness of our methods. More results can be found in Appendix E.

5.3 Methods Comparison

Our methods are compared with three baselines: 1) CRR [1], an effective robust regression method when there are large numbers of random outliers; 2) reweighted robust Bayesian regression (RRBR) [24], which allows us to judge whether our proposed methods improve on the original reweighting method; and 3) Rob-ULA [3], an effective robust Bayesian inference method that approximately converges to the true posterior distribution in a finite number of steps in the presence of outliers. The parameters of RRBR are the same as in the BRHT algorithm, except for the hard thresholding part. For comparisons with further methods, please see Appendix E.

Figure 1: Recovery of parameters with respect to the number of data points $n$, dimensionality $d$, and corruption ratio $\alpha$. TRIP and BRHT are more robust under AAAs than CRR, and BRHT exhibits the best performance in all experiments. RRBR and Rob-ULA show some robustness, but offer slightly worse recovery in some experiments.

Figure 2: (a), (b) show the convergence characteristics of TRIP and BRHT. Both algorithms exhibit an estimation bias as the price of adding a prior under OAA, but BRHT is more accurate and behaves significantly better under AAA, while TRIP stalls during the iterative process. (c), (d) show the convergence under different weights of the prior $p_r(r)$ and the coefficient prior $p_w(w)$.

5.4 Recovery Properties of Coefficients and Uncorrupted Sets

CRR, RRBR, and Rob-ULA are excellent robust regression or Bayesian inference methods, but they all show limitations in the face of different kinds of attacks. CRR achieves the best performance under OAAs because it is theoretically unbiased, but it collapses rapidly when facing AAAs, as shown in Figures 1(c) and 1(d). RRBR and Rob-ULA take priors into consideration, but RRBR cannot resist AAAs and Rob-ULA produces poor results under OAAs, as shown in Figures 1(a)-1(d).
TRIP produces a good effect against AAAs, while BRHT is not only optimal against AAAs, but also displays a similar effect to CRR in the case of OAAs. This shows that BRHT is the most robust algorithm among those compared in this experiment. The TRIP and BRHT algorithms are compared in Figures 2(a) and 2(b). The TRIP method incorporates too much prior information in the case of OAAs, resulting in a greater estimation error than those of BRHT and CRR as shown in Figure 2(a). However, under AAAs, the prior information of TRIP and BRHT is enhanced by a factor of four, as described in Section 5.2. BRHT converges under a weaker prior, while TRIP becomes trapped around a local optimum. This shows that BRHT only needs to integrate weak prior information to ensure convergence. Figures 2(c) and 2(d) illustrate the convergence properties under different weights of the prior pr(r) and coefficient prior pw(w). Figure 2(c) shows that BRHT is not especially sensitive to the weight prior pr(r) when this prior is relatively reliable. A log-normal distribution is the best choice when the prior pw(w) is relatively close to the real parameters and a Gamma distribution is more robust when the prior is imprecise, as shown in Figure 2(d). 6 Conclusion This paper has described a novel robust regression algorithm named TRIP that achieves strong results in terms of resisting AAAs. By adding a prior to the robust regression via hard thresholding, the recovery of coefficients is significantly improved. Another algorithm, named BRHT, was designed to improve the robustness of TRIP and reduce the estimation error through the use of Bayesian reweighting regression. We prove that both algorithms have strong theoretical guarantees and that the algorithms converge linearly under a mild condition. Extensive experiments have illustrated that our algorithms outperform benchmark methods in terms of both robustness and efficiency. There are several interesting future directions to extend current work. Firstly, in this article, we only consider the case when y is corrupted. One would consider using the prior information to better deal with the problem where both y and X are corrupted. Secondly, it would be also interesting to further reduce the effect of a prior on the estimation to make it consistent. Acknowledgments and Disclosure of Funding This work was partly supported by National Key Research and Development Program of China (2021YFA1000300 and 2021YFA1000301), National Center for Mathematics and Interdisciplinary Sciences, and Key Laboratory of Systems and Control of CAS. [1] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. Advances in Neural Information Processing Systems, 30, 2017. [2] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. Advances in Neural Information Processing Systems, 28, 2015. [3] Kush Bhatia, Yi-An Ma, Anca D Dragan, Peter L Bartlett, and Michael I Jordan. Bayesian robustness: A nonasymptotic viewpoint. ar Xiv preprint ar Xiv:1907.11826, 2019. [4] Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, and Peter L. Bartlett. Optimal robust linear regression in nearly linear time. ar Xiv preprint ar Xiv:2007.08317, 2020. [5] Peter Congdon. Bayesian statistical modelling. John Wiley & Sons, 2007. [6] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. 
In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745 2754. SIAM, 2019. [7] Çalar Gülçehre and Yoshua Bengio. Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(1):226 257, 2016. [8] Chao Huang, Dong Wang, and Nitesh V Chawla. Scalable uncertainty-aware truth discovery in big data social sensing applications for cyber-physical systems. IEEE Transactions on Big Data, 6(4):702 713, 2017. [9] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. In Advances in Neural Information Processing Systems, volume 34, pages 4475 4488. Curran Associates, Inc., 2021. [10] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. ar Xiv preprint ar Xiv:1809.08055, 2018. [11] Xi Li, Xiaoling Chen, Yousong Zhao, Jia Xu, Fengrui Chen, and Hui Li. Automatic intercalibration of night-time light imagery using robust regression. Remote sensing letters, 4(1):45 54, 2013. [12] Ming Lin, Xiaomin Song, Qi Qian, Hao Li, Liang Sun, Shenghuo Zhu, and Rong Jin. Robust Gaussian process regression for real-time high precision GPS signal enhancement. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2838 2847, 2019. [13] Vanda M Lourenço, Ana M Pires, and M Kirst. Robust linear regression methods in association studies. Bioinformatics, 27(6):815 821, 2011. [14] Brian Mc Williams, Gabriel Krummenacher, Mario Lucic, and Joachim M Buhmann. Fast and robust least squares estimation in corrupted linear models. Advances in Neural Information Processing Systems, 27, 2014. [15] Ankit Pensia, Varun Jog, and Po-Ling Loh. Robust regression with covariate filtering: Heavy tails and adversarial contamination. ar Xiv preprint ar Xiv:2009.12976, 2020. [16] Scott Pesme and Nicolas Flammarion. Online robust regression via SGD on the l1 loss. In Advances in Neural Information Processing Systems, volume 33, pages 2540 2552. Curran Associates, Inc., 2020. [17] Nicholas G Polson and James G Scott. Shrink globally, act locally: Sparse Bayesian regularization and prediction. Bayesian Statistics, 9(105):501 538, 2010. [18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. ar Xiv preprint ar Xiv:1802.06485, 2018. [19] Christoph Studer, Patrick Kuppinger, Graeme Pope, and Helmut Bolcskei. Recovery of sparsely corrupted signals. IEEE Transaction on Information Theory, 58(5):31153130, may 2012. [20] Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, and Prateek Jain. Adaptive hard thresholding for near-optimal consistent robust regression. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2892 2897. PMLR, 25 28 Jun 2019. [21] C Ming Wang, Dominic F Vecchia, Matt Young, and Nathan A Brilliant. Robust regression applied to optical-fiber dimensional quality control. Technometrics, 39(1):25 33, 1997. [22] Chong Wang and David M Blei. A general method for robust Bayesian modeling. Bayesian Analysis, 13(4):1163 1191, 2018. [23] Donglin Wang, Don Hong, and Qiang Wu. Prediction of loan rate for mortgage data: Deep learning versus robust regression. Computational Economics, pages 1 14, 2022. [24] Yixin Wang, Alp Kucukelbir, and David M Blei. Robust probabilistic modeling with Bayesian data reweighting. 
In International Conference on Machine Learning, pages 3646 3655. PMLR, 2017. [25] He Yan, Yong Qi, Qiaolin Ye, and Dong-Jun Yu. Robust least squares twin support vector regression with adaptive foa and pso for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems, 2021. [26] Abdelhak M Zoubir, Visa Koivunen, Esa Ollila, and Michael Muma. Robust statistics for signal processing. Cambridge University Press, 2018. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] Section 4 and 5. (c) Did you discuss any potential negative societal impacts of your work? [No] Our research concentrate on regression and no social risks are involved. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] Appendix C. (b) Did you include complete proofs of all theoretical results? [Yes] Appendix C. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The experiments are based on simulation and it is easy to reproduce under given parameters. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Section 5.1, 5.2 and Appendix D. (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] The random experiments will demonstrate the results. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] The overall calculation scale is small. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] In each time we first mention other methods. (b) Did you mention the license of the assets? [No] We only reference other algorithms. (c) Did you include any new assets either in the supplemental material or as a URL? [No] Our experiments don t need that. (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] We don t use other s data. (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] Our data are randomly generated and do not involve these situations. 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]