# Robust Logistic Regression and Classification

Jiashi Feng (EECS Department & ICSI, UC Berkeley, jshfeng@berkeley.edu), Huan Xu (ME Department, National University of Singapore, mpexuh@nus.edu.sg), Shie Mannor (EE Department, Technion, shie@ee.technion.ac.il), Shuicheng Yan (ECE Department, National University of Singapore, eleyans@nus.edu.sg)

**Abstract.** We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result on estimating the logistic regression model with a corrupted covariate matrix that comes with performance guarantees. Beyond regression, we apply RoLR to solving binary classification problems where a fraction of the training samples are corrupted.

## 1 Introduction

Logistic regression (LR) is a standard probabilistic statistical classification model that has been used extensively across disciplines such as computer vision, marketing, and the social sciences, to name a few. Unlike linear regression, the outcome of LR on one sample is the probability that the sample is positive or negative, where the probability depends on a linear measure of the sample; therefore, LR is widely used for classification. More formally, for a sample $x_i \in \mathbb{R}^p$ whose label is denoted as $y_i$, the probability of $y_i$ being positive is predicted to be

$$P\{y_i = +1\} = \frac{1}{1 + e^{-\langle \beta, x_i\rangle}},$$

given the LR model parameter $\beta$. In order to obtain a parameter that performs well, a set of labeled samples $\{(x_1, y_1), \dots, (x_n, y_n)\}$ is typically collected to learn the LR parameter $\beta$ that maximizes the induced likelihood function over the training samples. However, in practice the training samples $x_1, \dots, x_n$ are usually noisy, and some of them may even contain adversarial corruptions. By "adversarial" we mean that the corruptions can be arbitrary, unbounded, and need not follow any specific distribution. For example, in image/video classification tasks, some images or videos may be corrupted unexpectedly due to sensor errors or severe occlusions of the contained objects. Such corrupted samples, called outliers, can skew the parameter estimation severely and hence destroy the performance of LR.

To see the sensitivity of LR to outliers more intuitively, consider a simple example where all the samples $x_i$ lie in the one-dimensional space $\mathbb{R}$, as shown in Figure 1. Using only the inlier samples yields a correct LR parameter (we show the induced function curve) that explains the inliers well. However, when only one sample is corrupted (it is originally negative but now lies close to the positive samples), the resulting regression curve is pulled far away from the ground-truth one, and the label predictions on the inliers become completely wrong. This demonstrates that LR is indeed fragile to sample corruptions. More rigorously, the non-robustness of LR can be shown by calculating its influence function [7] (detailed in the supplementary material).

Figure 1: The estimated logistic regression curve (red solid) is far away from the correct one (blue dashed) due to the existence of just one outlier (red circle).

As Figure 1 demonstrates, the maximum-likelihood estimate of LR is extremely sensitive to the presence of anomalous data in the sample.
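This fragility is easy to reproduce. The following is a minimal sketch (ours, not the paper's code) that fits a one-dimensional logistic regression with and without a single gross outlier using scikit-learn; the data values and the outlier location are made up purely for illustration.

```python
# Illustrative sketch of the Figure 1 phenomenon (hypothetical data, not the
# paper's experiment): one relabeled point far on the positive side flattens
# the near-unregularized logistic fit relative to the clean fit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_neg = rng.normal(-2.0, 0.5, 20)                  # inliers with label -1
x_pos = rng.normal(+2.0, 0.5, 20)                  # inliers with label +1
X = np.r_[x_neg, x_pos].reshape(-1, 1)
y = np.r_[-np.ones(20), np.ones(20)]

clean = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)

# One negative sample is corrupted so that it sits beyond the positive cluster.
X_dirty = np.r_[X, [[5.0]]]
y_dirty = np.r_[y, [-1.0]]
dirty = LogisticRegression(C=1e6, max_iter=10_000).fit(X_dirty, y_dirty)

print("slope without outlier:", clean.coef_.ravel())
print("slope with one outlier:", dirty.coef_.ravel())
```

With separable inliers the near-unregularized fit is very steep, while the single mislabeled point makes the data non-separable and shrinks the fitted slope dramatically.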
Pregibon also observed this non-robustness of LR in [14]. To address this important issue, Pregibon [14], Cook and Weisberg [4], and Johnson [9] proposed procedures to identify observations that are influential for estimating $\beta$ based on certain outlyingness measures. Stefanski et al. [16, 10] and Bianco et al. [2] also proposed robust estimators which, however, require robustly estimating the covariate matrix or boundedness of the outliers. Moreover, the breakdown point (the percentage of corrupted points that can make the output of an algorithm arbitrarily bad) of those methods is generally inversely proportional to the sample dimensionality and diminishes rapidly for high-dimensional samples.

We propose a new robust logistic regression algorithm, called RoLR, which optimizes a robustified linear correlation between the response $y$ and the linear measure $\langle\beta, x\rangle$ via an efficient linear-programming-based procedure. We demonstrate that the proposed RoLR achieves robustness to arbitrary covariate corruptions. Even when a constant fraction of the training samples are corrupted, RoLR is still able to learn the LR parameter with a non-trivial upper bound on the error. Besides this theoretical guarantee on parameter estimation, we also provide empirical and population risk bounds for RoLR. Moreover, RoLR only needs to solve a linear programming problem and is thus scalable to large-scale data sets, in sharp contrast to previous LR optimization algorithms, which typically resort to the (computationally expensive) iteratively reweighted method [11]. The proposed RoLR can be easily adapted to solving binary classification problems where corrupted training samples are present, and we provide a theoretical classification performance guarantee for RoLR as well. Due to space limitations, we defer all proofs to the supplementary material.

## 2 Related Works

Several previous works have investigated approaches to robustify logistic regression (LR) [15, 13, 17, 16, 10]. The majority of them are M-estimator based: they minimize a more complicated, more robust loss function than the standard loss function (negative log-likelihood) of LR. For example, Pregibon [15] proposed the following M-estimator:

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n \rho(\ell_i(\beta)),$$

where $\ell_i(\cdot)$ is the negative log-likelihood of the $i$-th sample $x_i$ and $\rho(\cdot)$ is a Huber-type function [8] such as

$$\rho(t) = \begin{cases} t, & \text{if } t \le c,\\ 2\sqrt{tc} - c, & \text{if } t > c,\end{cases}$$

with $c$ a positive parameter. However, the resulting estimator is not robust to outliers with high-leverage covariates, as shown in [5].

Recently, Ding et al. [6] introduced T-logistic regression as a robust alternative to standard LR; it replaces the exponential distribution in LR by the t-exponential distribution family. However, T-logistic regression only guarantees that the output parameter converges to a local optimum of the loss function, rather than to the ground-truth parameter.

Our work is largely inspired by the following two recent works [3, 13] on robust sparse regression. In [3], Chen et al. proposed to replace the standard vector inner product by a trimmed one, and obtained a linear regression algorithm that is robust to unbounded covariate corruptions. In this work, we utilize the same simple yet powerful operation to achieve robustness. In [13], a convex programming method for estimating the sparse parameter of the logistic regression model is proposed:

$$\hat\beta = \arg\max_\beta \sum_{i=1}^n y_i\langle x_i, \beta\rangle, \quad \text{s.t. } \|\beta\|_1 \le \sqrt{s},\ \|\beta\| \le 1,$$

where $s$ is the sparsity prior parameter on $\beta$.
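For concreteness, this estimator is straightforward to express with an off-the-shelf convex solver. The following is a hedged sketch (ours, not code from [13]): `X` holds the samples as rows, `y` the $\pm 1$ labels, and `s` is the assumed sparsity level.

```python
# Sketch of the convex program stated above: maximize the correlation sum
# subject to an l1 "sparsity budget" and an l2 ball constraint.
import cvxpy as cp
import numpy as np

def sparse_logistic_estimate(X, y, s):
    p = X.shape[1]
    beta = cp.Variable(p)
    objective = cp.Maximize(cp.sum(cp.multiply(y, X @ beta)))
    constraints = [cp.norm1(beta) <= np.sqrt(s), cp.norm2(beta) <= 1]
    cp.Problem(objective, constraints).solve()
    return beta.value
```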
However, this method is not robust to a corrupted covariate matrix: a few, or even one, corrupted samples may dominate the correlation in the objective function and yield arbitrarily bad estimates. In this work, we propose a robust algorithm to remedy this issue.

## 3 Robust Logistic Regression

### 3.1 Problem Setup

We consider the problem of logistic regression (LR). Let $\mathbb{S}^{p-1}$ denote the unit sphere and $\mathbb{B}_2^p$ the Euclidean unit ball in $\mathbb{R}^p$. Let $\beta^*$ be the ground-truth parameter of the LR model. We assume the training samples are covariate-response pairs $\{(x_i, y_i)\}_{i=1}^{n+n_1} \subseteq \mathbb{R}^p \times \{-1, +1\}$ which, if not corrupted, obey the following LR model:

$$P\{y_i = +1\} = \tau(\langle\beta^*, x_i\rangle + v_i), \qquad (1)$$

where the function $\tau(\cdot)$ is defined as $\tau(z) = \frac{1}{1+e^{-z}}$. The additive noise $v_i \sim N(0, \sigma_e^2)$ is an i.i.d. Gaussian random variable with zero mean and variance $\sigma_e^2$; in the noiseless case we set $\sigma_e^2 = 0$. Since LR only depends on $\langle\beta^*, x_i\rangle$, we can always scale the samples $x_i$ to make the magnitude of $\beta^*$ at most $1$. Thus, without loss of generality, we assume $\beta^* \in \mathbb{S}^{p-1}$.

Out of the $n + n_1$ samples, a constant number ($n_1$) may be adversarially corrupted, and we make no assumptions on these outliers. Throughout the paper we use $\lambda \triangleq n_1/n$ to denote the outlier fraction. We call the remaining $n$ non-corrupted samples authentic samples; they obey the following standard sub-Gaussian design [12, 3].

Definition 1 (Sub-Gaussian design). We say that a random matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{p\times n}$ is sub-Gaussian with parameter $(\frac{1}{n}\Sigma_x, \frac{1}{n}\sigma_x^2)$ if: (1) each column $x_i \in \mathbb{R}^p$ is sampled independently from a zero-mean distribution with covariance $\frac{1}{n}\Sigma_x$, and (2) for any unit vector $u \in \mathbb{R}^p$, the random variable $\langle u, x_i\rangle$ is sub-Gaussian with parameter $\frac{1}{\sqrt{n}}\sigma_x$. (Here the parameter refers to the sub-Gaussian norm of a random variable $Y$, $\|Y\|_{\psi_2} = \sup_{q\ge 1} q^{-1/2}(\mathbb{E}|Y|^q)^{1/q}$.)

Sub-Gaussian random variables enjoy several nice concentration properties, one of which is stated in the following lemma [12].

Lemma 1 (Sub-Gaussian concentration [12]). Let $X_1, \dots, X_n$ be $n$ i.i.d. zero-mean sub-Gaussian random variables with parameter $\sigma_x/\sqrt{n}$ and variance at most $\sigma_x^2/n$. Then $\big|\sum_{i=1}^n X_i^2 - \sigma_x^2\big| \le c_1\sigma_x^2\sqrt{\frac{\log p}{n}}$ with probability at least $1 - p^{-2}$, for some absolute constant $c_1$.

Based on this concentration property, we obtain the following bound on the magnitude of a collection of sub-Gaussian random variables [3].

Lemma 2. Suppose $X_1, \dots, X_n$ are $n$ independent sub-Gaussian random variables with parameter $\sigma_x/\sqrt{n}$. Then $\max_{i=1,\dots,n}|X_i| \le 4\sigma_x\sqrt{(\log n + \log p)/n}$ with probability at least $1 - p^{-2}$.

This lemma provides a rough bound on the magnitude of the inlier samples, and this bound serves as a threshold for pre-processing the samples in the RoLR algorithm introduced next.
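The magnitude bound of Lemma 2 is easy to illustrate numerically. The following short sketch (ours, added only as an illustration) draws Gaussian variables with the stated scaling and checks how often their maximum exceeds the threshold; the trial count and parameter values are arbitrary.

```python
# Numerical illustration of Lemma 2: for variables with standard deviation
# sigma_x / sqrt(n), the maximum absolute value essentially never exceeds
# the threshold 4 * sigma_x * sqrt((log n + log p) / n).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma_x = 1000, 20, 1.0
threshold = 4 * sigma_x * np.sqrt((np.log(n) + np.log(p)) / n)

exceed = 0
for _ in range(1000):
    X = rng.normal(scale=sigma_x / np.sqrt(n), size=n)
    exceed += np.max(np.abs(X)) > threshold
print("fraction of trials exceeding the bound:", exceed / 1000)
```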
### 3.2 RoLR Algorithm

We now introduce the details of the proposed Robust Logistic Regression (RoLR) algorithm. RoLR first removes the samples with overly large magnitude and then maximizes a trimmed correlation between the remaining samples and the estimated LR model. The intuition behind maximizing the trimmed correlation is the following: if the outliers have too large a magnitude, they do not contribute to the correlation and thus do not affect the learning of the LR parameter; otherwise, their effect on the learning is bounded (in fact, it can be bounded in terms of the inlier samples, thanks to the trimmed statistic). Algorithm 1 gives the implementation details of RoLR.

Algorithm 1 (RoLR)
Input: contaminated training samples $\{(x_1, y_1), \dots, (x_{n+n_1}, y_{n+n_1})\}$, an upper bound $n_1$ on the number of outliers, the number of inliers $n$, and the sample dimension $p$.
Initialization: set $T = 4\sqrt{\log p/n + \log n/n}$.
Preprocessing: remove the samples $(x_i, y_i)$ whose magnitude satisfies $\|x_i\| \ge T$.
Solve the linear programming problem in Eqn. (3); equivalently, $\hat\beta = \arg\max_{\beta\in\mathbb{B}_2^p} \sum_{i=1}^n [y\langle\beta, x\rangle]_{(i)}$.
Output: $\hat\beta$.

Note that, within the RoLR algorithm, we need to optimize the following sorted statistic:

$$\max_{\beta\in\mathbb{B}_2^p}\ \sum_{i=1}^n [y\langle\beta, x\rangle]_{(i)}, \qquad (2)$$

where $[\cdot]_{(i)}$ denotes the order statistic of the involved variable $z$, sorted so that $[z]_{(1)} \le [z]_{(2)} \le \dots \le [z]_{(n+n_1)}$. The problem in Eqn. (2) thus maximizes the sum of the $n$ smallest correlation terms, which is equivalent to minimizing the sum of the top $n$ terms of the negated correlations; this is a convex problem and can be solved by an off-the-shelf solver (such as CVX). We note that it can also be converted into the following linear programming problem (with a quadratic constraint), which enjoys higher computational efficiency. To see this, we first introduce auxiliary variables $t_i \in \{0, 1\}$ indicating whether the corresponding term $y_i\langle\beta, x_i\rangle$ falls among the smallest $n$ ones. The problem in Eqn. (2) can then be written as

$$\max_{\beta\in\mathbb{B}_2^p}\ \min_{t_i}\ \sum_{i=1}^{n+n_1} t_i\, y_i\langle\beta, x_i\rangle, \quad \text{s.t. } \sum_{i=1}^{n+n_1} t_i \ge n,\ 0 \le t_i \le 1.$$

Here the constraints $\sum_{i=1}^{n+n_1} t_i \ge n$, $0 \le t_i \le 1$ come from the standard relaxation of $\sum_{i=1}^{n+n_1} t_i = n$, $t_i \in \{0, 1\}$. The above problem is a max-min linear program. To decouple the variables $\beta$ and $t_i$, we turn to the dual of the inner minimization problem. Let $\nu$ and $\xi_i$ be the Lagrange multipliers for the constraints $\sum_{i=1}^{n+n_1} t_i \ge n$ and $t_i \le 1$, respectively. The dual form with respect to $t_i$ is then

$$\max_{\beta, \nu, \xi_i}\ \nu n - \sum_{i=1}^{n+n_1}\xi_i, \quad \text{s.t. } y_i\langle\beta, x_i\rangle - \nu + \xi_i \ge 0,\ \ \beta\in\mathbb{B}_2^p,\ \nu \ge 0,\ \xi_i \ge 0. \qquad (3)$$

Reformulating logistic regression as a linear programming problem in this way significantly enhances the scalability of LR to large-scale data sets, a property that is very appealing in practice: linear programming is computationally efficient and has no trouble handling up to $1\times 10^6$ variables on a standard PC.
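As an illustration of Algorithm 1, the following sketch (ours, not the authors' implementation) solves the trimmed-correlation problem of Eqn. (2) directly with cvxpy's `sum_smallest` atom instead of forming the dual LP (3); the helper name `rolr_fit` and the handling of the threshold are assumptions.

```python
# Minimal sketch of Algorithm 1: magnitude-based preprocessing followed by
# maximizing the sum of the n smallest terms y_i <beta, x_i> over the unit ball.
import cvxpy as cp
import numpy as np

def rolr_fit(X, y, n_inliers):
    """X: (m, p) array with samples as rows; y: (m,) array of +/-1 labels."""
    m, p = X.shape
    # Threshold from Algorithm 1; it presumes the scaling of Definition 1
    # (covariance Sigma_x / n), so rescale X accordingly for unscaled data.
    T = 4.0 * np.sqrt(np.log(p) / n_inliers + np.log(m) / n_inliers)
    keep = np.linalg.norm(X, axis=1) < T
    Xk, yk = X[keep], y[keep]

    beta = cp.Variable(p)
    corr = cp.multiply(yk, Xk @ beta)            # terms y_i <beta, x_i>
    k = min(n_inliers, Xk.shape[0])              # trim to the n smallest terms
    problem = cp.Problem(cp.Maximize(cp.sum_smallest(corr, k)),
                         [cp.norm2(beta) <= 1])
    problem.solve()
    return beta.value
```

`sum_smallest` is concave, so this is a convex (second-order cone) program matching Eqn. (2); a solver that exploits the explicit dual LP (3) would scale better for very large sample sizes.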
### 3.3 Performance Guarantee for RoLR

In contrast to traditional LR algorithms, RoLR does not perform maximum-likelihood estimation. Instead, RoLR maximizes a trimmed sum of the correlations $y_i\langle\beta, x_i\rangle$. This strategy reduces the computational complexity of LR and, more importantly, enhances the robustness of the parameter estimation, using the fact that authentic samples have a positive expected correlation between $y_i$ and $\langle\beta, x_i\rangle$, as described in the following lemma.

Lemma 3. Fix $\beta^* \in \mathbb{S}^{p-1}$. Suppose the sample $(x, y)$ is generated by the model described in (1). The expectation of the product $y\langle\beta^*, x\rangle$ is

$$\mathbb{E}\, y\langle\beta^*, x\rangle = \frac{\sigma_x^2}{2}\,\mathbb{E}\,\mathrm{sech}^2(g/2),$$

where $g \sim N(0, \sigma_x^2 + \sigma_e^2)$ is a Gaussian random variable and $\sigma_e^2$ is the noise level in (1). Furthermore, this expectation can be bounded as

$$\varphi_+(\sigma_e^2, \sigma_x^2) \le \mathbb{E}\, y\langle\beta^*, x\rangle \le \varphi_-(\sigma_e^2, \sigma_x^2),$$

where $\varphi_+(\sigma_e^2, \sigma_x^2)$ and $\varphi_-(\sigma_e^2, \sigma_x^2)$ are positive. In particular, they can take the form $\varphi_+(\sigma_e^2, \sigma_x^2) = \frac{\sigma_x^2}{3}\,\mathrm{sech}^2\!\big(\frac{1+\sigma_e^2}{2}\big)$ and $\varphi_-(\sigma_e^2, \sigma_x^2) = \frac{\sigma_x^2}{3} + \frac{\sigma_x^2}{6}\,\mathrm{sech}^2\!\big(\frac{1+\sigma_e^2}{2}\big)$.

The following lemma shows that the difference of correlations is an effective surrogate for the difference of the LR parameters. Thus we can always decrease $\|\hat\beta - \beta^*\|$ by maximizing $\sum_i y_i\langle\hat\beta, x_i\rangle$.

Lemma 4. Fix $\beta^* \in \mathbb{S}^{p-1}$ as the ground-truth parameter in (1) and let $\beta \in \mathbb{B}_2^p$. Denote $\eta = \mathbb{E}\, y\langle\beta^*, x\rangle$. Then $\mathbb{E}\, y\langle\beta, x\rangle = \eta\langle\beta, \beta^*\rangle$, and thus

$$\mathbb{E}\big[y\langle\beta^*, x\rangle - y\langle\beta, x\rangle\big] = \eta\big(1 - \langle\beta, \beta^*\rangle\big) \ge \frac{\eta}{2}\,\|\beta^* - \beta\|^2.$$

Based on these two lemmas, along with some concentration properties of the inlier samples (shown in the supplementary material), we have the following performance guarantee for RoLR on LR parameter recovery.

Theorem 1 (RoLR for recovering the LR parameter). Let $\lambda \triangleq n_1/n$ be the outlier fraction, $\hat\beta$ be the output of Algorithm 1, and $\beta^*$ be the ground-truth parameter. Suppose the $n$ authentic samples are generated by the model described in (1). Then, with probability larger than $1 - 4\exp(-c_2 n/8)$,

$$\|\hat\beta - \beta^*\| \le \frac{2\lambda\,\varphi_-(\sigma_e^2, \sigma_x^2)}{\varphi_+(\sigma_e^2, \sigma_x^2)} + \frac{2(\lambda + 4 + \sqrt{5\lambda})}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{p}{n}} + \frac{8\lambda\,\sigma_x^2}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{\log n + \log p}{n}}.$$

Here $c_2$ is an absolute constant.

Remark 1. To make the above result more explicit, consider the asymptotic case where $p/n \to 0$. The bound then becomes $\|\hat\beta - \beta^*\| \le 2\lambda\,\varphi_-(\sigma_e^2, \sigma_x^2)/\varphi_+(\sigma_e^2, \sigma_x^2)$, which holds with probability larger than $1 - 4\exp(-c_2 n/8)$. In the noiseless case, i.e., $\sigma_e = 0$, and assuming $\sigma_x^2 = 1$, we have $\varphi_+ = \frac{1}{3}\mathrm{sech}^2(\frac{1}{2}) \approx 0.2622$ and $\varphi_- = \frac{1}{3} + \frac{1}{6}\mathrm{sech}^2(\frac{1}{2}) \approx 0.4644$. The ratio is $\varphi_-/\varphi_+ \approx 1.7715$, so the bound simplifies to $\|\hat\beta - \beta^*\| \le 3.54\lambda$. Recall that $\hat\beta, \beta^* \in \mathbb{S}^{p-1}$, hence the maximal value of $\|\hat\beta - \beta^*\|$ is $2$. Thus, for the above result to be non-trivial, we need $3.54\lambda \le 2$, i.e., $\lambda \le 0.56$. In other words, in the noiseless case RoLR estimates the LR parameter with a non-trivial error bound (a breakdown point) with up to $0.56/1.56 \times 100\% \approx 36\%$ of the samples being outliers.

## 4 Empirical and Population Risk Bounds of RoLR

Besides parameter recovery, we are also concerned with the prediction performance of the estimated LR model in practice. The standard prediction loss function $\ell(\cdot, \cdot)$ of LR is non-negative and bounded, and is defined as

$$\ell((x_i, y_i), \beta) = \frac{1}{1 + \exp\{y_i\langle\beta, x_i\rangle\}}. \qquad (4)$$

The goodness of an LR predictor $\beta$ is measured by its population risk $R(\beta) = \mathbb{E}_{P(X,Y)}\,\ell((x, y), \beta)$, where $P(X, Y)$ is the joint distribution of the covariate $X$ and the response $Y$. However, the population risk can rarely be calculated directly, since the distribution $P(X, Y)$ is usually unknown. In practice we therefore consider the empirical risk, computed over the provided training samples:

$$R_{\mathrm{emp}}(\beta) = \frac{1}{n}\sum_{i=1}^n \ell((x_i, y_i), \beta).$$

Note that the empirical risk is computed only over the authentic samples and hence cannot be optimized directly when outliers are present. Based on the bound on $\|\hat\beta - \beta^*\|$ provided in Theorem 1, we can easily obtain the following empirical risk bound for RoLR, since the LR loss function in Eqn. (4) is Lipschitz continuous.

Corollary 1 (Bound on the empirical risk). Let $\hat\beta$ be the output of Algorithm 1 and $\beta^*$ be the optimal parameter minimizing the empirical risk. Suppose the $n$ authentic samples are generated by the model described in (1). Define $X \triangleq 4\sigma_x\sqrt{(\log n + \log p)/n}$. Then, with probability larger than $1 - 4\exp(-c_2 n/8)$, the empirical risk of $\hat\beta$ is bounded by

$$R_{\mathrm{emp}}(\hat\beta) - R_{\mathrm{emp}}(\beta^*) \le X\left[\frac{2\lambda\,\varphi_-(\sigma_e^2, \sigma_x^2)}{\varphi_+(\sigma_e^2, \sigma_x^2)} + \frac{2(\lambda + 4 + \sqrt{5\lambda})}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{p}{n}} + \frac{8\lambda\,\sigma_x^2}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{\log n + \log p}{n}}\right].$$

Given the empirical risk bound, we can readily obtain a bound on the population risk by appealing to standard generalization results stated in terms of various function-class complexities. Widely used complexity measures include the VC-dimension [18] and the Rademacher and Gaussian complexities [1]. Compared with the Rademacher complexity, which is data dependent, the VC-dimension is more universal, although the resulting generalization bound can be slightly loose.
Here, we adopt the VC-dimension to measure the function complexity and obtain the following population risk bound.

Corollary 2 (Bound on the population risk). Let $\hat\beta$ be the output of Algorithm 1 and $\beta^*$ be the optimal parameter. Suppose the parameter space $\mathbb{S}^{p-1} \ni \beta$ has finite VC-dimension $d$, and the $n$ authentic samples are generated by the model described in (1). Define $X \triangleq 4\sigma_x\sqrt{(\log n + \log p)/n}$. Then, with probability larger than $1 - 4\exp(-c_2 n/8) - \delta$, the population risk of $\hat\beta$ is bounded by

$$R(\hat\beta) - R(\beta^*) \le X\left[\frac{2\lambda\,\varphi_-(\sigma_e^2, \sigma_x^2)}{\varphi_+(\sigma_e^2, \sigma_x^2)} + \frac{2(\lambda + 4 + \sqrt{5\lambda})}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{p}{n}} + \frac{8\lambda\,\sigma_x^2}{\varphi_+(\sigma_e^2, \sigma_x^2)}\sqrt{\frac{\log n + \log p}{n}}\right] + c_3\sqrt{\frac{d + \ln(1/\delta)}{n}}.$$

Here both $c_2$ and $c_3$ are absolute constants.

## 5 Robust Binary Classification

### 5.1 Problem Setup

Different from the sample generation model for LR, in the standard binary classification setting the label $y_i$ of a sample $x_i$ is deterministically determined by the sign of the linear measure $\langle\beta^*, x_i\rangle$ of the sample. Namely, the samples are generated by the following model:

$$y_i = \mathrm{sign}\big(\langle\beta^*, x_i\rangle + v_i\big). \qquad (5)$$

Here $v_i$ is Gaussian noise as in Eqn. (1). Since $y_i$ is deterministically related to $\langle\beta^*, x_i\rangle$, the expected correlation $\mathbb{E}\, y\langle\beta^*, x\rangle$ achieves its maximal value in this setup (see Lemma 5), which ensures that RoLR also performs well for classification. We again assume that the training samples contain $n$ authentic samples and at most $n_1$ outliers.

### 5.2 Performance Guarantee for Robust Classification

Lemma 5. Fix $\beta^* \in \mathbb{S}^{p-1}$. Suppose the sample $(x, y)$ is generated by the model described in (5). The expectation of the product $y\langle\beta^*, x\rangle$ is

$$\mathbb{E}\, y\langle\beta^*, x\rangle = \sqrt{\frac{2\sigma_x^4}{\pi(\sigma_x^2 + \sigma_e^2)}}.$$

Comparing this result with Lemma 3, for binary classification we can compute the expectation of the correlation exactly, and this expectation is always larger than in the LR setting. The correlation depends on the signal-to-noise ratio $\sigma_x/\sigma_e$. In the noiseless case, $\sigma_e = 0$ and the expected correlation is $\sigma_x\sqrt{2/\pi}$, the familiar mean of a half-normal distribution. Similarly to the analysis of RoLR for LR, based on Lemma 5 we obtain the following performance guarantee for RoLR on classification problems.

Theorem 2. Let $\hat\beta$ be the output of Algorithm 1 and $\beta^*$ be the optimal parameter minimizing the empirical risk. Suppose the $n$ authentic samples are generated by the model described in (5). Then, with probability larger than $1 - 4\exp(-c_2 n/8)$,

$$\|\hat\beta - \beta^*\| \le 2\lambda + \frac{2(\lambda + 4 + \sqrt{5\lambda})}{\sigma_x^2}\sqrt{\frac{(\sigma_e^2 + \sigma_x^2)\pi\, p}{2n}} + 8\lambda\sqrt{\frac{(\sigma_e^2 + \sigma_x^2)\pi}{2}}\sqrt{\frac{\log n + \log p}{n}}.$$

The proof of Theorem 2 is similar to that of Theorem 1. Also, as in the LR case, it is straightforward to obtain empirical and population risk bounds for RoLR classification from the above parameter error bound. Due to space limitations, we only sketch how to obtain them. For classification, the most natural loss function is the 0-1 loss. However, the 0-1 loss is non-convex and non-smooth, and we cannot obtain a non-trivial bound on its value in terms of $\|\hat\beta - \beta^*\|$ as we did for the logistic loss. Fortunately, several convex surrogates of the 0-1 loss have been proposed and achieve good classification performance, including the hinge loss, the exponential loss, and the logistic loss. These loss functions are all Lipschitz continuous, so we can bound their empirical and then population risks as for logistic regression.
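The closed form in Lemma 5 is easy to sanity-check numerically. The following sketch (an illustration we add here, not from the paper) draws data from model (5) with isotropic Gaussian covariates and compares the empirical average of $y\langle\beta^*, x\rangle$ with the stated expression; the parameter values are arbitrary.

```python
# Monte-Carlo check of the classification correlation: for y = sign(<b*, x> + v)
# with x ~ N(0, sigma_x^2 I) and v ~ N(0, sigma_e^2), the sample average of
# y <b*, x> should approach sqrt(2 sigma_x^4 / (pi (sigma_x^2 + sigma_e^2))).
import numpy as np

rng = np.random.default_rng(1)
p, m, sigma_x, sigma_e = 20, 200_000, 1.0, 0.5
beta_star = rng.normal(size=p)
beta_star /= np.linalg.norm(beta_star)

X = rng.normal(scale=sigma_x, size=(m, p))
v = rng.normal(scale=sigma_e, size=m)
y = np.sign(X @ beta_star + v)

empirical = np.mean(y * (X @ beta_star))
predicted = np.sqrt(2 * sigma_x**4 / (np.pi * (sigma_x**2 + sigma_e**2)))
print(empirical, predicted)   # the two values nearly coincide
```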
## 6 Simulations

In this section, we conduct simulations to verify the robustness of RoLR along with its applicability to robust binary classification. We compare RoLR with standard logistic regression, which estimates the model parameter by maximizing the log-likelihood function.

We randomly generate the samples according to the model in Eqn. (1) for the logistic regression problem. In particular, we first sample the model parameter $\beta^* \sim N(0, I_p)$ and normalize it as $\beta^* := \beta^*/\|\beta^*\|_2$. Here $p$ is the dimension of the parameter, which is also the dimension of the samples. The samples are drawn i.i.d. as $x_i \sim N(0, \Sigma_x)$ with $\Sigma_x = I_p$, and the Gaussian noise is sampled as $v_i \sim N(0, \sigma_e^2)$. The sample label $y_i$ is then generated according to $P\{y_i = +1\} = \tau(\langle\beta^*, x_i\rangle + v_i)$ for the LR case. For the classification case, the sample labels are generated by $y_i = \mathrm{sign}(\langle\beta^*, x_i\rangle + v_i)$, and an additional $n_t = 1{,}000$ authentic samples are generated for testing. The entries of the outliers $x_o$ are i.i.d. random variables uniformly distributed on $[-\sigma_o, \sigma_o]$ with $\sigma_o = 10$. The labels of the outliers are generated by $y_o = -\mathrm{sign}(\langle\beta^*, x_o\rangle)$; that is, the outliers follow the model with the opposite sign from the inliers, which according to our experiments is the most adversarial outlier model. The ratio of outliers to inliers is denoted $\lambda = n_1/n$, where $n_1$ is the number of outliers and $n$ is the number of inliers. We fix $n = 1{,}000$ and vary $\lambda$ from $0$ to $1.2$ with a step of $0.1$. We repeat the simulation under each outlier-fraction setting $10$ times and plot the performance (average and variance) of RoLR and ordinary LR versus the ratio of outliers to inliers in Figure 2. In particular, for the task of logistic regression, we measure performance by the parameter estimation error $\|\hat\beta - \beta^*\|$. For classification, we use the classification error rate on the test samples, $\#\{\hat y_i \neq y_i\}/n_t$, as the performance measure, where $\hat y_i = \mathrm{sign}(\hat\beta^\top x_i)$ is the predicted label for sample $x_i$ and $y_i$ is the ground-truth label.

Figure 2: Performance comparison between RoLR, ordinary LR, and LR with the thresholding preprocessing as in RoLR (LR+P), for (a) regression parameter estimation and (b) classification, under the setting of $\sigma_e = 0.5$, $\sigma_o = 10$, $p = 20$ and $n = 1{,}000$. The simulation is repeated 10 times.

The results, shown in Figure 2, clearly demonstrate that RoLR performs much better than standard LR on both tasks. Even when the outlier fraction is small ($\lambda = 0.1$), RoLR already outperforms LR by a large margin. From Figure 2(a), we observe that when $\lambda \ge 0.3$ the parameter estimation error of LR reaches around $1.3$, which is rather unsatisfactory, since simply outputting the trivial solution $\hat\beta = 0$ has an error of $1$ (recall $\|\beta^*\|_2 = 1$). In contrast, RoLR keeps the estimation error around $0.5$ even when $\lambda = 0.8$, i.e., when around $45\%$ of the samples are outliers. To see the role of the preprocessing in RoLR, we also apply this preprocessing to LR and plot its performance as LR+P in the figure. The preprocessing step indeed helps remove certain outliers with large magnitudes. However, when the fraction of outliers increases to $\lambda = 0.5$, more outliers with magnitudes below the pre-defined threshold survive the preprocessing and drive the error of LR+P above $1$. This demonstrates that maximizing the trimmed correlation is more essential than the thresholding for the robustness gain of RoLR.
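Before turning to the classification results, the sampling protocol above can be condensed into a short sketch (again an illustration we add, not the authors' code). Here `rolr_fit` is the hypothetical helper from the sketch in Section 3.2, and rescaling the design by $1/\sqrt{n}$ is our assumption so that the threshold $T$, stated under the scaling convention of Definition 1, is on the right scale; the maximizer of Eqn. (2) is unchanged by such a global rescaling.

```python
# Sketch of one run of the synthetic experiment described above (illustrative).
import numpy as np

rng = np.random.default_rng(2)
p, n, lam, sigma_e, sigma_o = 20, 1000, 0.4, 0.5, 10.0
n1 = int(lam * n)

beta_star = rng.normal(size=p)
beta_star /= np.linalg.norm(beta_star)

# Authentic samples follow the LR model (1).
X_in = rng.normal(size=(n, p))
v = rng.normal(scale=sigma_e, size=n)
prob_pos = 1.0 / (1.0 + np.exp(-(X_in @ beta_star + v)))
y_in = np.where(rng.random(n) < prob_pos, 1.0, -1.0)

# Outliers: uniform entries on [-sigma_o, sigma_o] with opposite-sign labels.
X_out = rng.uniform(-sigma_o, sigma_o, size=(n1, p))
y_out = -np.sign(X_out @ beta_star)

X = np.vstack([X_in, X_out])
y = np.concatenate([y_in, y_out])

# Rescale so the magnitude threshold of Algorithm 1 matches the data scale.
beta_hat = rolr_fit(X / np.sqrt(n), y, n)
print("parameter estimation error:", np.linalg.norm(beta_hat - beta_star))
```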
From the classification results shown in Figure 2(b), we observe that LR again starts to break down from $\lambda = 0.2$: its classification error rate reaches $0.8$, which is even worse than random guessing. In contrast, RoLR still achieves satisfactory classification performance, with an error rate around $0.4$, even when $\lambda$ approaches $1$. When $\lambda > 1$, RoLR also breaks down, as outliers dominate the training samples. When there are no outliers, with the same inliers ($n = 1\times 10^3$ and $p = 20$), the logistic regression estimation error of LR is $0.06$ while that of RoLR is $0.13$. This performance degradation of RoLR arises because RoLR maximizes the linear correlation statistic instead of the likelihood, as LR does, when inferring the regression parameter; this is the price RoLR pays for robustness. We provide further investigations, as well as results on real large-scale data, in the supplementary material.

## 7 Conclusions

We investigated the problem of logistic regression (LR) in the practical case where the covariate matrix is adversarially corrupted, and showed that standard LR methods fail in this case. We proposed a novel LR method, RoLR, to address this issue, and demonstrated both theoretically and experimentally that RoLR is robust to covariate corruptions. Moreover, we devised a linear programming algorithm to solve RoLR that is computationally efficient and scales to large problems. We further applied RoLR to successfully learn classifiers from corrupted training samples.

## Acknowledgments

The work of H. Xu was partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112. The work of S. Mannor was partially funded by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and by the Israel Science Foundation (ISF contract 920/12).

## References

[1] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
[2] Ana M. Bianco and Víctor J. Yohai. Robust estimation in the logistic regression model. Springer, 1996.
[3] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In ICML, 2013.
[4] R. Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. 1982.
[5] J. B. Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society, Series B (Methodological), pages 225–265, 1988.
[6] Nan Ding, S. V. N. Vishwanathan, Manfred Warmuth, and Vasil S. Denchev. T-logistic regression for binary and multiclass classification. Journal of Machine Learning Research, 5:1–55, 2013.
[7] Frank R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
[8] Peter J. Huber. Robust Statistics. Springer, 2011.
[9] Wesley Johnson. Influence measures for logistic regression: Another point of view. Biometrika, 72(1):59–65, 1985.
[10] Hans R. Künsch, Leonard A. Stefanski, and Raymond J. Carroll. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84(406):460–466, 1989.
[11] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient L1 regularized logistic regression. In AAAI, 2006.
[12] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Annals of Statistics, 40(3):1637–1664, 2012.
[13] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
[14] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, pages 705–724, 1981.
[15] Daryl Pregibon. Resistant fits for some commonly used logistic models with medical applications. Biometrics, pages 485–498, 1982.
[16] Leonard A. Stefanski, Raymond J. Carroll, and David Ruppert. Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73(2):413–424, 1986.
[17] Julie Tibshirani and Christopher D. Manning. Robust logistic regression using shift parameters. arXiv preprint arXiv:1305.4987, 2013.
[18] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.