# Distributionally Robust Logistic Regression

Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, Daniel Kuhn
École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
{soroosh.shafiee, peyman.mohajerin, daniel.kuhn}@epfl.ch

Abstract

This paper proposes a distributionally robust approach to logistic regression. We use the Wasserstein distance to construct a ball in the space of probability distributions centered at the uniform distribution on the training samples. If the radius of this ball is chosen judiciously, we can guarantee that it contains the unknown data-generating distribution with high confidence. We then formulate a distributionally robust logistic regression model that minimizes a worst-case expected logloss function, where the worst case is taken over all distributions in the Wasserstein ball. We prove that this optimization problem admits a tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases. We further propose a distributionally robust approach based on Wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier. These bounds are given by the optimal values of two highly tractable linear programs. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.

1 Introduction

Logistic regression is one of the most frequently used classification methods [1]. Its objective is to establish a probabilistic relationship between a continuous feature vector and a binary explanatory variable. However, in spite of its overwhelming success in machine learning, data analytics, medicine and other fields, logistic regression models can display a poor out-of-sample performance if training data is sparse. In this case modelers often resort to ad hoc regularization techniques in order to combat overfitting effects. This paper aims to develop new regularization techniques for logistic regression and to provide intuitive probabilistic interpretations for existing ones by using tools from modern distributionally robust optimization.

Logistic Regression: Let $x \in \mathbb{R}^n$ denote a feature vector and $y \in \{-1, +1\}$ the associated binary label to be predicted. In logistic regression, the conditional distribution of $y$ given $x$ is modeled as

$$\mathrm{Prob}(y \mid x) = \left[1 + \exp\left(-y \langle \beta, x \rangle\right)\right]^{-1}, \tag{1}$$

where the weight vector $\beta \in \mathbb{R}^n$ constitutes an unknown regression parameter. Suppose that $N$ training samples $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$ have been observed. Then, the maximum likelihood estimator of classical logistic regression is found by solving the geometric program

$$\inf_{\beta} \; \frac{1}{N} \sum_{i=1}^N l_\beta(\hat{x}_i, \hat{y}_i), \tag{2}$$

whose objective function is given by the sample average of the logloss function $l_\beta(x, y) = \log\!\big(1 + \exp(-y \langle \beta, x \rangle)\big)$. It has been observed, however, that the resulting maximum likelihood estimator may display a poor out-of-sample performance. Indeed, it is well documented that minimizing the average logloss function leads to overfitting and weak classification performance [2, 3]. In order to overcome this deficiency, it has been proposed to modify the objective function of problem (2) [4, 5, 6]. An alternative approach is to add a regularization term to the logloss function in order to mitigate overfitting. These regularization techniques lead to the modified optimization problem

$$\inf_{\beta} \; \frac{1}{N} \sum_{i=1}^N l_\beta(\hat{x}_i, \hat{y}_i) + \varepsilon R(\beta), \tag{3}$$

where $R(\beta)$ and $\varepsilon$ denote the regularization function and the associated coefficient, respectively.
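To make the baseline concrete, the following minimal sketch minimizes the average logloss in (2), optionally augmented with a regularizer as in (3). It is an illustration only, not the implementation used in the paper; the synthetic data, the variable names, and the choice of $R(\beta) = \|\beta\|_2$ are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def logloss(beta, X, y):
    # l_beta(x, y) = log(1 + exp(-y <beta, x>)), averaged over the samples;
    # np.logaddexp(0, t) = log(1 + exp(t)) avoids numerical overflow.
    return np.mean(np.logaddexp(0.0, -y * (X @ beta)))

def regularized_objective(beta, X, y, eps):
    # Problem (3) with R(beta) = ||beta||_2 chosen as an example regularizer.
    return logloss(beta, X, y) + eps * np.linalg.norm(beta)

# Synthetic data drawn from the logistic model (1) with beta = (10, 0, ..., 0).
rng = np.random.default_rng(0)
n, N = 10, 100
beta_true = np.r_[10.0, np.zeros(n - 1)]
X = rng.standard_normal((N, n))
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-X @ beta_true)), 1.0, -1.0)

res = minimize(regularized_objective, np.full(n, 0.1), args=(X, y, 0.1), method="BFGS")
beta_hat = res.x
```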
A popular choice for the regularization term is $R(\beta) = \|\beta\|$, where $\|\cdot\|$ denotes a generic norm such as the $\ell_1$- or the $\ell_2$-norm. The use of $\ell_1$-regularization tends to induce sparsity in $\beta$, which in turn helps to combat overfitting effects [7]. Moreover, $\ell_1$-regularized logistic regression serves as an effective means for feature selection. It is further shown in [8] that $\ell_1$-regularization outperforms $\ell_2$-regularization when the number of training samples is smaller than the number of features. On the downside, $\ell_1$-regularization leads to non-smooth optimization problems, which are more challenging. Algorithms for large-scale regularized logistic regression are discussed in [9, 10, 11, 12].

Distributionally Robust Optimization: Regression and classification problems are typically modeled as optimization problems under uncertainty. To date, optimization under uncertainty has been addressed by several complementary modeling paradigms that differ mainly in the representation of uncertainty. For instance, stochastic programming assumes that the uncertainty is governed by a known probability distribution and aims to minimize a probability functional such as the expected cost or a quantile of the cost distribution [13, 14]. In contrast, robust optimization ignores all distributional information and aims to minimize the worst-case cost under all possible uncertainty realizations [15, 16, 17]. While stochastic programs may rely on distributional information that is not available or hard to acquire in practice, robust optimization models may adopt an overly pessimistic view of the uncertainty and thereby promote over-conservative decisions. The emerging field of distributionally robust optimization aims to bridge the gap between the conservatism of robust optimization and the specificity of stochastic programming: it seeks to minimize a worst-case probability functional (e.g., the worst-case expectation), where the worst case is taken with respect to an ambiguity set, that is, a family of distributions consistent with the given prior information on the uncertainty. The vast majority of the existing literature focuses on ambiguity sets characterized through moment and support information, see e.g. [18, 19, 20]. However, ambiguity sets can also be constructed via distance measures in the space of probability distributions such as the Prohorov metric [21] or the Kullback-Leibler divergence [22]. Due to its attractive measure concentration properties, we use here the Wasserstein metric to construct ambiguity sets.

Contribution: In this paper we propose a distributionally robust perspective on logistic regression. Our research is motivated by the well-known observation that regularization techniques can improve the out-of-sample performance of many classifiers. In the context of support vector machines and the Lasso, there have been several recent attempts to give ad hoc regularization techniques a robustness interpretation [23, 24]. However, to the best of our knowledge, no such connection has been established for logistic regression. In this paper we aim to close this gap by adopting a new distributionally robust optimization paradigm based on Wasserstein ambiguity sets [25]. Starting from a data-driven distributionally robust statistical learning setup, we will derive a family of regularized logistic regression models that admit an intuitive probabilistic interpretation and encapsulate the classical regularized logistic regression (3) as a special case.
Moreover, by invoking recent measure concentration results, our proposed approach provides a probabilistic guarantee for the emerging regularized classifiers, which seems to be the first result of this type. All proofs are relegated to the technical appendix. We summarize our main contributions as follows:

- Distributionally robust logistic regression model and tractable reformulation: We propose a data-driven distributionally robust logistic regression model based on an ambiguity set induced by the Wasserstein distance. We prove that the resulting semi-infinite optimization problem admits an equivalent reformulation as a tractable convex program.
- Risk estimation: Using similar distributionally robust optimization techniques based on the Wasserstein ambiguity set, we develop two highly tractable linear programs whose optimal values provide confidence bounds on the misclassification probability or risk of the emerging classifiers.
- Out-of-sample performance guarantees: Adopting a distributionally robust framework allows us to invoke results from the measure concentration literature to derive finite-sample probabilistic guarantees. Specifically, we establish out-of-sample performance guarantees for the classifiers obtained from the proposed distributionally robust optimization model.
- Probabilistic interpretation of existing regularization techniques: We show that standard regularized logistic regression is a special case of our framework. In particular, we show that the regularization coefficient $\varepsilon$ in (3) can be interpreted as the size of the ambiguity set underlying our distributionally robust optimization model.

2 A distributionally robust perspective on statistical learning

In the standard statistical learning setting all training and test samples are drawn independently from some distribution $P$ supported on $\Xi = \mathbb{R}^n \times \{-1, +1\}$. If the distribution $P$ was known, the best weight parameter $\beta$ could be found by solving the stochastic optimization problem

$$\inf_{\beta} \; E_P[l_\beta(x, y)] = \int_{\mathbb{R}^n \times \{-1,+1\}} l_\beta(x, y)\, P(\mathrm{d}(x, y)). \tag{4}$$

In practice, however, $P$ is only indirectly observable through $N$ independent training samples. Thus, the distribution $P$ is itself uncertain, which motivates us to address problem (4) from a distributionally robust perspective. This means that we use the training samples to construct an ambiguity set $\mathcal{P}$, that is, a family of distributions that contains the unknown distribution $P$ with high confidence. Then we solve the distributionally robust optimization problem

$$\inf_{\beta} \; \sup_{Q \in \mathcal{P}} \; E_Q[l_\beta(x, y)], \tag{5}$$

which minimizes the worst-case expected logloss function. The construction of the ambiguity set $\mathcal{P}$ should be guided by the following principles. (i) Tractability: It must be possible to solve the distributionally robust optimization problem (5) efficiently. (ii) Reliability: The optimizer of (5) should be near-optimal in (4), thus facilitating attractive out-of-sample guarantees. (iii) Asymptotic consistency: For large training data sets, the solution of (5) should converge to the one of (4). In this paper we propose to use the Wasserstein metric to construct $\mathcal{P}$ as a ball in the space of probability distributions that satisfies (i)–(iii).

Definition 1 (Wasserstein Distance). Let $\mathcal{M}(\Xi^2)$ denote the set of probability distributions on $\Xi \times \Xi$. The Wasserstein distance between two distributions $P$ and $Q$ supported on $\Xi$ is defined as

$$W(Q, P) := \inf_{\Pi \in \mathcal{M}(\Xi^2)} \left\{ \int_{\Xi^2} d(\xi, \xi')\, \Pi(\mathrm{d}\xi, \mathrm{d}\xi') \;:\; \Pi(\mathrm{d}\xi, \Xi) = Q(\mathrm{d}\xi),\; \Pi(\Xi, \mathrm{d}\xi') = P(\mathrm{d}\xi') \right\},$$

where $\xi = (x, y)$ and $d(\xi, \xi')$ is a metric on $\Xi$.
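For a concrete illustration of Definition 1, the sketch below evaluates the Wasserstein distance between two small discrete distributions by solving the transportation linear program with SciPy. The support points and probability vectors are made up for illustration, and the ground metric follows Definition 2 below with the ℓ1-norm and κ = 1; none of this is taken from the paper's experiments.

```python
import numpy as np
from scipy.optimize import linprog

# Support points xi = (x, y) and two discrete distributions Q, P (example data).
points = [(np.array([0.0, 0.0]), +1), (np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
q = np.array([0.5, 0.5, 0.0])
p = np.array([1/3, 1/3, 1/3])
kappa = 1.0
m = len(points)

# Ground metric d(xi, xi') = ||x - x'||_1 + kappa * |y - y'| / 2 (Definition 2).
D = np.array([[np.sum(np.abs(x1 - x2)) + kappa * abs(y1 - y2) / 2
               for (x2, y2) in points] for (x1, y1) in points])

# Decision variable: transport plan Pi (m x m), flattened row-major.
A_eq = np.zeros((2 * m, m * m))
for i in range(m):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums:    sum_j Pi[i, j] = q[i]
    A_eq[m + i, i::m] = 1.0            # column sums: sum_j Pi[j, i] = p[i]
b_eq = np.concatenate([q, p])

res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("Wasserstein distance W(Q, P) approx.", res.fun)
```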
The Wasserstein distance represents the minimum cost of moving the distribution $P$ to the distribution $Q$, where the cost of moving a unit mass from $\xi$ to $\xi'$ amounts to $d(\xi, \xi')$. In the remainder, we denote by $B_\varepsilon(P) := \{Q : W(Q, P) \le \varepsilon\}$ the ball of radius $\varepsilon$ centered at $P$ with respect to the Wasserstein distance. In this paper we propose to use Wasserstein balls as ambiguity sets. Given the training data points $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$, a natural candidate for the center of the Wasserstein ball is the empirical distribution $\hat{P}_N = \frac{1}{N}\sum_{i=1}^N \delta_{(\hat{x}_i, \hat{y}_i)}$, where $\delta_{(\hat{x}_i, \hat{y}_i)}$ denotes the Dirac point measure at $(\hat{x}_i, \hat{y}_i)$. Thus, we henceforth examine the distributionally robust optimization problem

$$\inf_{\beta} \; \sup_{Q \in B_\varepsilon(\hat{P}_N)} \; E_Q[l_\beta(x, y)] \tag{6}$$

equipped with a Wasserstein ambiguity set. Note that (6) reduces to the average logloss minimization problem (2) associated with classical logistic regression if we set $\varepsilon = 0$.

3 Tractable reformulation and probabilistic guarantees

In this section we demonstrate that (6) can be reformulated as a tractable convex program and establish probabilistic guarantees for its optimal solutions.

3.1 Tractable reformulation

We first define a metric on the feature-label space, which will be used in the remainder.

Definition 2 (Metric on the Feature-Label Space). The distance between two data points $(x, y), (x', y') \in \Xi$ is defined as
$$d\big((x, y), (x', y')\big) = \|x - x'\| + \kappa\, |y - y'|/2,$$
where $\|\cdot\|$ is any norm on $\mathbb{R}^n$ and $\kappa$ is a positive weight.

The parameter $\kappa$ in Definition 2 represents the relative emphasis between feature mismatch and label uncertainty. The following theorem presents a tractable reformulation of the distributionally robust optimization problem (6) and thus constitutes the first main result of this paper.

Theorem 1 (Tractable Reformulation). The optimization problem (6) is equivalent to

$$\hat{J} := \inf_{\beta} \sup_{Q \in B_\varepsilon(\hat{P}_N)} E_Q[l_\beta(x, y)] = \left\{
\begin{array}{cll}
\min\limits_{\beta, \lambda, s_i} & \lambda \varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\[2mm]
\text{s.t.} & l_\beta(\hat{x}_i, \hat{y}_i) \le s_i & \forall i \le N \\
& l_\beta(\hat{x}_i, -\hat{y}_i) - \lambda \kappa \le s_i & \forall i \le N \\
& \|\beta\|_* \le \lambda, &
\end{array}\right. \tag{7}$$

where $\|\cdot\|_*$ denotes the dual of the norm $\|\cdot\|$ on the feature space. Note that (7) constitutes a tractable convex program for most commonly used norms $\|\cdot\|$.

Remark 1 (Regularized Logistic Regression). As the parameter $\kappa > 0$ characterizing the metric $d(\cdot, \cdot)$ tends to infinity, the second constraint group in the convex program (7) becomes redundant. Hence, (7) reduces to the celebrated regularized logistic regression problem
$$\inf_{\beta} \; \varepsilon \|\beta\|_* + \frac{1}{N}\sum_{i=1}^N l_\beta(\hat{x}_i, \hat{y}_i),$$
where the regularization function is determined by the dual norm on the feature space, while the regularization coefficient coincides with the radius of the Wasserstein ball. Note that for $\kappa = \infty$ the Wasserstein distance between two distributions is infinite if they assign different labels to a fixed feature vector with positive probability. Any distribution in $B_\varepsilon(\hat{P}_N)$ must then have non-overlapping conditional supports for $y = +1$ and $y = -1$. Thus, setting $\kappa = \infty$ reflects the belief that the label is a (deterministic) function of the feature and that label measurements are exact. As this belief is not tenable in most applications, an approach with $\kappa < \infty$ may be more satisfying.
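The convex program (7) can be prototyped in a few lines with an off-the-shelf modeling tool. The paper's own experiments use YALMIP and IPOPT (see Section 4); the cvxpy-based sketch below is an illustrative alternative, not the authors' implementation. It assumes the ℓ2-norm on the feature space, which is self-dual, so the dual-norm constraint becomes $\|\beta\|_2 \le \lambda$; the function name `drlr_fit` and the synthetic data are hypothetical.

```python
import cvxpy as cp
import numpy as np

def drlr_fit(X_hat, y_hat, eps, kappa):
    """Solve the convex program (7) for the l2-norm on the feature space."""
    N, n = X_hat.shape
    beta = cp.Variable(n)
    lam = cp.Variable()
    s = cp.Variable(N)

    margins = cp.multiply(y_hat, X_hat @ beta)        # y_i <beta, x_i>
    constraints = [
        cp.logistic(-margins) <= s,                   # l_beta(x_i,  y_i)              <= s_i
        cp.logistic(margins) - lam * kappa <= s,      # l_beta(x_i, -y_i) - lam*kappa  <= s_i
        cp.norm(beta, 2) <= lam,                      # dual-norm constraint ||beta||_* <= lam
    ]
    objective = cp.Minimize(lam * eps + cp.sum(s) / N)
    cp.Problem(objective, constraints).solve()
    return beta.value, lam.value

# Example usage with synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_hat = rng.standard_normal((50, 10))
y_hat = np.sign(rng.standard_normal(50))
beta_hat, lam_hat = drlr_fit(X_hat, y_hat, eps=0.05, kappa=1.0)
```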
3.2 Out-of-sample performance guarantees

We now exploit a recent measure concentration result characterizing the speed at which $\hat{P}_N$ converges to $P$ with respect to the Wasserstein distance [26] in order to derive out-of-sample performance guarantees for distributionally robust logistic regression. In the following, we let $\hat{\Xi}_N := \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$ be a set of $N$ independent training samples from $P$, and we denote by $\hat{\beta}$, $\hat{\lambda}$, and $\hat{s}_i$ the optimal solutions and by $\hat{J}$ the corresponding optimal value of (7). Note that these values are random objects as they depend on the random training data $\hat{\Xi}_N$.

Theorem 2 (Out-of-Sample Performance). Assume that the distribution $P$ is light-tailed, i.e., there is $a > 1$ with $A := E_P[\exp(\|x\|^a)] < +\infty$. If the radius $\varepsilon$ of the Wasserstein ball is set to

$$\varepsilon_N(\eta) = \left(\frac{\log(c_1 \eta^{-1})}{c_2 N}\right)^{1/a} \mathbb{1}\left\{N < \tfrac{\log(c_1 \eta^{-1})}{c_2 c_3}\right\} + \left(\frac{\log(c_1 \eta^{-1})}{c_2 N}\right)^{1/n} \mathbb{1}\left\{N \ge \tfrac{\log(c_1 \eta^{-1})}{c_2 c_3}\right\}, \tag{8}$$

then we have $P^N\{P \in B_\varepsilon(\hat{P}_N)\} \ge 1 - \eta$, implying that $P^N\{\hat{\Xi}_N : E_P[l_{\hat{\beta}}(x, y)] \le \hat{J}\} \ge 1 - \eta$ for all sample sizes $N \ge 1$ and confidence levels $\eta \in (0, 1]$. Moreover, the positive constants $c_1$, $c_2$, and $c_3$ appearing in (8) depend only on the light-tail parameters $a$ and $A$, the dimension $n$ of the feature space, and the metric on the feature-label space.

Remark 2 (Worst-Case Loss). Denoting the empirical logloss on the training set $\hat{\Xi}_N$ by $E_{\hat{P}_N}[l_{\hat{\beta}}(x, y)]$, the worst-case loss $\hat{J}$ can be expressed as

$$\hat{J} = \hat{\lambda}\varepsilon + E_{\hat{P}_N}[l_{\hat{\beta}}(x, y)] + \frac{1}{N}\sum_{i=1}^N \max\left\{0,\; \hat{y}_i \langle \hat{\beta}, \hat{x}_i \rangle - \hat{\lambda}\kappa\right\}. \tag{9}$$

Note that the last term in (9) can be viewed as a complementary regularization term that does not appear in standard regularized logistic regression. This term accounts for label uncertainty and decreases with $\kappa$. Thus, $\kappa$ can be interpreted as our trust in the labels of the training samples. Note that this regularization term vanishes for $\kappa \to \infty$. One can further prove that $\hat{\lambda}$ converges to $\|\hat{\beta}\|_*$ for $\kappa \to \infty$, implying that (9) reduces to the standard regularized logistic regression in this limit.

Remark 3 (Performance Guarantees). The following comments are in order:

I. Light-Tail Assumption: The light-tail assumption of Theorem 2 is restrictive but seems to be unavoidable for any a priori guarantees of the type described in Theorem 2. Note that this assumption is automatically satisfied if the features have bounded support or if they are known to follow, for instance, a Gaussian or exponential distribution.

II. Asymptotic Consistency: For any fixed confidence level $\eta$, the radius $\varepsilon_N(\eta)$ defined in (8) drops to zero as the sample size $N$ increases, and thus the ambiguity set shrinks to a singleton. To be more precise, with probability 1 across all training datasets, a sequence of distributions in the ambiguity set of radius (8) converges in the Wasserstein metric, and thus weakly, to the unknown data-generating distribution $P$; see [25, Corollary 3.4] for a formal proof. Consequently, the solution of (6) can be shown to converge to the solution of (4) as $N$ increases.

III. Finite Sample Behavior: The a priori bound (8) on the size of the Wasserstein ball has two growth regimes. For small $N$, the radius decreases as $N^{-1/a}$, and for large $N$ it scales with $N^{-1/n}$, where $n$ is the dimension of the feature space. We refer to [26, Section 1.3] for further details on the optimality of these rates and potential improvements for special cases. Note that when the support of the underlying distribution $P$ is bounded or $P$ has a Gaussian distribution, the parameter $a$ can effectively be set to 1.
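The decomposition (9) in Remark 2 is straightforward to evaluate once (7) has been solved. The following minimal sketch does so for a given solution, for example the output of the `drlr_fit` sketch above; all inputs are assumed example data, and the function name is hypothetical.

```python
import numpy as np

def worst_case_loss(beta_hat, lam_hat, X_hat, y_hat, eps, kappa):
    # Equation (9): lam*eps + empirical logloss + label-uncertainty regularizer.
    margins = y_hat * (X_hat @ beta_hat)                       # y_i <beta_hat, x_i>
    empirical_logloss = np.mean(np.logaddexp(0.0, -margins))   # E_{P_N}[l_beta(x, y)]
    label_uncertainty = np.mean(np.maximum(0.0, margins - lam_hat * kappa))
    return lam_hat * eps + empirical_logloss + label_uncertainty
```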
3.3 Risk Estimation: Worst- and Best-Cases

One of the main objectives in logistic regression is to control the classification performance. Specifically, we are interested in predicting labels from features. This can be achieved via a classifier function $f_\beta : \mathbb{R}^n \to \{+1, -1\}$, whose risk $R(\beta) := P\{y \ne f_\beta(x)\}$ represents the misclassification probability. In logistic regression, a natural choice for the classifier is $f_\beta(x) = +1$ if $\mathrm{Prob}(+1 \mid x) > 0.5$ and $f_\beta(x) = -1$ otherwise, where the conditional probability $\mathrm{Prob}(y \mid x)$ is defined in (1). The risk associated with this classifier can be expressed as $R(\beta) = E_P\big[\mathbb{1}\{y \langle \beta, x \rangle \le 0\}\big]$.

As in Section 3.1, we can use worst- and best-case expectations over Wasserstein balls to construct confidence bounds on the risk.

Theorem 3 (Risk Estimation). For any $\hat{\beta}$ depending on the training dataset $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$ we have:

(i) The worst-case risk $R_{\max}(\hat{\beta}) := \sup_{Q \in B_\varepsilon(\hat{P}_N)} E_Q\big[\mathbb{1}\{y \langle \hat{\beta}, x \rangle \le 0\}\big]$ is given by

$$\begin{array}{cll}
\min\limits_{\lambda, s_i, r_i, t_i} & \lambda \varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\[2mm]
\text{s.t.} & 1 - r_i\, \hat{y}_i \langle \hat{\beta}, \hat{x}_i \rangle \le s_i & \forall i \le N \\
& 1 + t_i\, \hat{y}_i \langle \hat{\beta}, \hat{x}_i \rangle - \lambda \kappa \le s_i & \forall i \le N \\
& r_i \|\hat{\beta}\|_* \le \lambda, \quad t_i \|\hat{\beta}\|_* \le \lambda & \forall i \le N \\
& r_i,\, t_i,\, s_i \ge 0 & \forall i \le N.
\end{array} \tag{10a}$$

If the Wasserstein radius $\varepsilon$ is set to $\varepsilon_N(\eta)$ as defined in (8), then $R_{\max}(\hat{\beta}) \ge R(\hat{\beta})$ with probability $1 - \eta$ across all training sets $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$.

(ii) Similarly, the best-case risk $R_{\min}(\hat{\beta}) := \inf_{Q \in B_\varepsilon(\hat{P}_N)} E_Q\big[\mathbb{1}\{y \langle \hat{\beta}, x \rangle < 0\}\big]$ is given by

$$R_{\min}(\hat{\beta}) = 1 - \left\{\begin{array}{cll}
\min\limits_{\lambda, s_i, r_i, t_i} & \lambda \varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\[2mm]
\text{s.t.} & 1 + r_i\, \hat{y}_i \langle \hat{\beta}, \hat{x}_i \rangle \le s_i & \forall i \le N \\
& 1 - t_i\, \hat{y}_i \langle \hat{\beta}, \hat{x}_i \rangle - \lambda \kappa \le s_i & \forall i \le N \\
& r_i \|\hat{\beta}\|_* \le \lambda, \quad t_i \|\hat{\beta}\|_* \le \lambda & \forall i \le N \\
& r_i,\, t_i,\, s_i \ge 0 & \forall i \le N.
\end{array}\right. \tag{10b}$$

If the Wasserstein radius $\varepsilon$ is set to $\varepsilon_N(\eta)$ as defined in (8), then $R_{\min}(\hat{\beta}) \le R(\hat{\beta})$ with probability $1 - \eta$ across all training sets $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^N$.

We emphasize that (10a) and (10b) constitute highly tractable linear programs. Moreover, we have $R_{\min}(\hat{\beta}) \le R(\hat{\beta}) \le R_{\max}(\hat{\beta})$ with probability $1 - 2\eta$.
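The worst-case risk LP (10a) can likewise be prototyped directly from the theorem statement. The sketch below again assumes the ℓ2-norm (self-dual) on the feature space, so $\|\hat{\beta}\|_* = \|\hat{\beta}\|_2$; the fixed weight vector may come, for example, from the `drlr_fit` sketch above, and the function name and the explicit nonnegativity of λ (implied by the other constraints) are my additions.

```python
import cvxpy as cp
import numpy as np

def worst_case_risk(beta_hat, X_hat, y_hat, eps, kappa):
    """Solve the linear program (10a) for a fixed weight vector beta_hat."""
    N = X_hat.shape[0]
    lam = cp.Variable(nonneg=True)
    s = cp.Variable(N, nonneg=True)
    r = cp.Variable(N, nonneg=True)
    t = cp.Variable(N, nonneg=True)

    margins = y_hat * (X_hat @ beta_hat)          # constants y_i <beta_hat, x_i>
    dual_norm = np.linalg.norm(beta_hat, 2)       # ||beta_hat||_* for the l2-norm
    constraints = [
        1 - cp.multiply(r, margins) <= s,
        1 + cp.multiply(t, margins) - lam * kappa <= s,
        r * dual_norm <= lam,
        t * dual_norm <= lam,
    ]
    prob = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), constraints)
    prob.solve()
    return prob.value
```

The best-case LP (10b) is analogous: the signs in front of $r_i$ and $t_i$ are flipped and the optimal value is subtracted from 1.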
4 Numerical Results

We now showcase the power of distributionally robust logistic regression in simulated and empirical experiments. All optimization problems are implemented in MATLAB via the modeling language YALMIP [27] and solved with the state-of-the-art nonlinear programming solver IPOPT [28]. All experiments were run on an Intel XEON CPU (3.40GHz). For the largest instance studied ($N = 1000$), the problems (2), (3), (7) and (10) were solved in 2.1, 4.2, 9.2 and 0.05 seconds, respectively.

4.1 Experiment 1: Out-of-Sample Performance

We use a simulation experiment to study the out-of-sample performance guarantees offered by distributionally robust logistic regression. As in [8], we assume that the features $x \in \mathbb{R}^{10}$ follow a multivariate standard normal distribution and that the conditional distribution of the labels $y \in \{+1, -1\}$ is of the form (1) with $\beta = (10, 0, \ldots, 0)$. The true distribution $P$ is uniquely determined by this information. If we use the $\ell_\infty$-norm to measure distances in the feature space, then $P$ satisfies the light-tail assumption of Theorem 2 for any $a$ with $1 < a < 2$. Finally, we set $\kappa = 1$.

Our experiment comprises 100 simulation runs. In each run we generate $N \in \{10, 10^2, 10^3\}$ training samples and $10^4$ test samples from $P$. We calibrate the distributionally robust logistic regression model (6) to the training data and use the test data to evaluate the average logloss as well as the correct classification rate (CCR) of the classifier associated with $\hat{\beta}$. We then record the percentage $\hat{\eta}_N(\varepsilon)$ of simulation runs in which the average logloss exceeds $\hat{J}$. Moreover, we calculate the average CCR across all simulation runs.

Figure 1 displays both $1 - \hat{\eta}_N(\varepsilon)$ and the average CCR as a function of $\varepsilon$ for different values of $N$. Note that $1 - \hat{\eta}_N(\varepsilon)$ quantifies the probability (with respect to the training data) that $P$ belongs to the Wasserstein ball of radius $\varepsilon$ around the empirical distribution $\hat{P}_N$. Thus, $1 - \hat{\eta}_N(\varepsilon)$ increases with $\varepsilon$. The average CCR benefits from the regularization induced by the distributional robustness and increases with $\varepsilon$ as long as the empirical confidence $1 - \hat{\eta}_N(\varepsilon)$ is smaller than 1. As soon as the Wasserstein ball is large enough to contain the distribution $P$ with high confidence ($1 - \hat{\eta}_N(\varepsilon) \approx 1$), however, any further increase of $\varepsilon$ is detrimental to the average CCR. Figure 1 also indicates that the radius $\varepsilon$ implied by a fixed empirical confidence level scales inversely with the number of training samples $N$. Specifically, for $N = 10, 10^2, 10^3$, the Wasserstein radius implied by the confidence level $1 - \hat{\eta} = 95\%$ is given by $\varepsilon \approx 0.2, 0.02, 0.003$, respectively. This observation is consistent with the a priori estimate (8) of the Wasserstein radius $\varepsilon_N(\eta)$ associated with a given $\eta$. Indeed, as $a \to 1$, Theorem 2 implies that $\varepsilon_N(\eta)$ scales with $N^{-1/a} \approx N^{-1}$ in the small-sample regime of (8).

Figure 1: Out-of-sample performance (solid blue line) and the average CCR (dashed red line) as a function of the Wasserstein radius $\varepsilon$ for (a) $N = 10$, (b) $N = 100$, and (c) $N = 1000$ training samples.

4.2 Experiment 2: The Effect of the Wasserstein Ball

In the second simulation experiment we study the statistical properties of the out-of-sample logloss. As in [2], we set $n = 10$ and assume that the features follow a multivariate standard normal distribution, while the conditional distribution of the labels is of the form (1) with $\beta$ sampled uniformly from the unit sphere. We use the $\ell_2$-norm in the feature space, and we set $\kappa = 1$. All results reported here are averaged over 100 simulation runs. In each trial, we use $N = 10^2$ training samples to calibrate problem (6) and $10^4$ test samples to estimate the logloss distribution of the resulting classifier.

Figure 2(a) visualizes the conditional value-at-risk (CVaR) of the out-of-sample logloss distribution for various confidence levels and for different values of $\varepsilon$. The CVaR of the logloss at level $\alpha$ is defined as the conditional expectation of the logloss above its $(1 - \alpha)$-quantile, see [29]. In other words, the CVaR at level $\alpha$ quantifies the average of the $\alpha \cdot 100\%$ worst logloss realizations. As expected, using a distributionally robust approach renders the logistic regression problem more risk-averse, which results in uniformly lower CVaR values of the logloss, particularly for smaller confidence levels. Thus, increasing the radius of the Wasserstein ball reduces the right tail of the logloss distribution. Figure 2(c) confirms this observation by showing that the cumulative distribution function (CDF) of the logloss converges to a step function for large $\varepsilon$. Moreover, one can prove that the weight vector $\hat{\beta}$ tends to zero as $\varepsilon$ grows. Specifically, for $\varepsilon \ge 0.1$ we have $\hat{\beta} \approx 0$, in which case the logloss approximates the deterministic value $\log(2) \approx 0.69$. Zooming into the CVaR graph of Figure 2(a) at the high confidence levels, we observe that the 100%-CVaR, which in fact coincides with the expected logloss, increases with $\varepsilon$; see Figure 2(b).

Figure 2: CVaR and CDF of the logloss function for different Wasserstein radii $\varepsilon$: (a) CVaR versus quantile of the logloss function, (b) CVaR versus quantile of the logloss function (zoomed), (c) cumulative distribution of the logloss function.
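The CVaR values reported in Figure 2, and later in Table 1, can be estimated from test-set losses as in the sketch below. The simple top-fraction average shown here is a standard empirical estimator, not necessarily the exact estimator used by the authors, and the placeholder loss data is made up.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    # Average of the alpha*100% largest (worst) logloss realizations.
    losses = np.sort(np.asarray(losses))
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[-k:].mean()

# Example: CVaR at the 5% level of some assumed test-set loglosses.
rng = np.random.default_rng(0)
test_logloss = rng.exponential(0.3, size=10_000)   # placeholder losses
print(empirical_cvar(test_logloss, alpha=0.05))
```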
4.3 Experiment 3: Real World Case Studies and Risk Estimation

Next, we validate the performance of the proposed distributionally robust logistic regression method on the MNIST dataset [30] and three popular datasets from the UCI repository: Ionosphere, Thoracic Surgery, and Breast Cancer [31]. In this experiment, we use the distance function of Definition 2 with the $\ell_1$-norm. We examine three different models: logistic regression (LR), regularized logistic regression (RLR), and distributionally robust logistic regression with $\kappa = 1$ (DRLR). All results reported here are averaged over 100 independent trials. In each trial related to a UCI dataset, we randomly select 60% of the data to train the models and the rest to test the performance. Similarly, in each trial related to the MNIST dataset, we randomly select $10^3$ samples from the training dataset and test the performance on the complete test dataset.

The results in Table 1 (top) indicate that DRLR outperforms RLR in terms of CCR by about the same amount by which RLR outperforms classical LR (0.3% to 1%), consistently across all experiments. We also evaluated the out-of-sample CVaR of the logloss, which is a natural performance indicator for robust methods. Table 1 (bottom) shows that DRLR wins by a large margin (outperforming RLR by 4% to 43%).

Table 1: The average and standard deviation of the CCR (top) and the CVaR of the logloss (bottom) evaluated on the test dataset.

| CCR | LR | RLR | DRLR |
|---|---|---|---|
| Ionosphere | 84.8 ± 4.3% | 86.1 ± 3.1% | 87.0 ± 2.6% |
| Thoracic Surgery | 82.7 ± 2.0% | 83.1 ± 2.0% | 83.8 ± 2.0% |
| Breast Cancer | 94.4 ± 1.8% | 95.5 ± 1.2% | 95.8 ± 1.2% |
| MNIST 1 vs 7 | 97.8 ± 0.6% | 98.0 ± 0.3% | 98.6 ± 0.2% |
| MNIST 4 vs 9 | 93.7 ± 1.1% | 94.6 ± 0.5% | 95.1 ± 0.4% |
| MNIST 5 vs 6 | 94.9 ± 1.6% | 95.7 ± 0.5% | 96.7 ± 0.4% |

| CVaR | LR | RLR | DRLR |
|---|---|---|---|
| Ionosphere | 10.5 ± 6.9 | 4.2 ± 1.5 | 3.5 ± 2.0 |
| Thoracic Surgery | 3.0 ± 1.9 | 2.3 ± 0.3 | 2.2 ± 0.2 |
| Breast Cancer | 20.3 ± 15.1 | 1.3 ± 0.4 | 0.9 ± 0.2 |
| MNIST 1 vs 7 | 3.9 ± 2.8 | 0.67 ± 0.13 | 0.38 ± 0.06 |
| MNIST 4 vs 9 | 8.7 ± 6.5 | 1.45 ± 0.20 | 1.09 ± 0.08 |
| MNIST 5 vs 6 | 14.1 ± 9.5 | 1.35 ± 0.20 | 0.84 ± 0.08 |

In the remainder we focus on the Ionosphere case study (the results of which are representative for the other case studies). Figures 3(a) and 3(b) depict the logloss and the CCR for different Wasserstein radii $\varepsilon$. DRLR ($\kappa = 1$) outperforms RLR ($\kappa = \infty$) consistently for all sufficiently small values of $\varepsilon$. This observation can be explained by the fact that DRLR accounts for uncertainty in the label, whereas RLR does not. Thus, there is a wider range of Wasserstein radii that result in an attractive out-of-sample logloss and CCR. This effect facilitates the choice of $\varepsilon$ and could be a significant advantage in situations where it is difficult to determine $\varepsilon$ a priori.

In the experiment underlying Figure 3(c), we first fix $\hat{\beta}$ to the optimal solution of (7) for $\varepsilon = 0.003$ and $\kappa = 1$. Figure 3(c) shows the true risk $R(\hat{\beta})$ and its confidence bounds. As expected, for $\varepsilon = 0$ the upper and lower bounds coincide with the empirical risk on the training data, which is a lower bound for the true risk on the test data due to over-fitting effects. As $\varepsilon$ increases, the confidence interval between the bounds widens and eventually covers the true risk. For instance, at $\varepsilon \approx 0.05$ the confidence interval is given by $[0, 0.19]$ and contains the true risk with probability $1 - 2\hat{\eta} = 95\%$.

Figure 3: Average logloss, CCR and risk for different Wasserstein radii $\varepsilon$ (Ionosphere dataset): (a) the average logloss for RLR ($\kappa = +\infty$) and DRLR ($\kappa = 1$), (b) the average correct classification rate for RLR ($\kappa = +\infty$) and DRLR ($\kappa = 1$), (c) risk estimation (true risk, upper and lower bounds) and its confidence level $1 - 2\hat{\eta}$.

Acknowledgments: This research was supported by the Swiss National Science Foundation under grant BSCGI0 157733.

References

[1] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. John Wiley & Sons, 2004.
[2] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust logistic regression and classification. In Advances in Neural Information Processing Systems, pages 253-261, 2014.
[3] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482-494, 2013.
[4] N. Ding, S. Vishwanathan, M. Warmuth, and V. S. Denchev. t-logistic regression for binary and multiclass classification. The Journal of Machine Learning Research, 5:1-55, 2013.
[5] C. Liu. Robit Regression: A Simple Robust Alternative to Logistic and Probit Regression, pages 227-238. John Wiley & Sons, 2005.
[6] P. J. Rousseeuw and A. Christmann. Robustness against separation and outliers in logistic regression. Computational Statistics & Data Analysis, 43(3):315-332, 2003.
[7] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267-288, 1996.
[8] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 78-85, 2004.
[9] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. The Journal of Machine Learning Research, 8:1519-1555, 2007.
[10] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research, 12:1865-1892, 2011.
[11] J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large-scale ℓ1-regularized logistic regression. The Journal of Machine Learning Research, 11:713-741, 2010.
[12] S. Yun and K.-C. Toh. A coordinate gradient descent method for ℓ1-regularized convex minimization. Computational Optimization and Applications, 48(2):273-307, 2011.
[13] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming. SIAM, 2009.
[14] J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, 2011.
[15] A. Ben-Tal and A. Nemirovski. Robust optimization methodology and applications. Mathematical Programming B, 92(3):453-480, 2002.
[16] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35-53, 2004.
[17] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[18] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595-612, 2010.
[19] J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4):902-917, 2010.
[20] W. Wiesemann, D. Kuhn, and M. Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358-1376, 2014.
[21] E. Erdoğan and G. Iyengar. Ambiguous chance constrained problems and robust optimization. Mathematical Programming B, 107(1-2):37-61, 2006.
[22] Z. Hu and L. J. Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Technical report, available from Optimization Online, 2013.
[23] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. The Journal of Machine Learning Research, 10:1485-1510, 2009.
[24] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561-3574, 2010.
[25] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. http://arxiv.org/abs/1505.05116, 2015.
[26] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, pages 1-32, 2014.
[27] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In IEEE International Symposium on Computer Aided Control Systems Design, pages 284-289, 2004.
[28] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming A, 106(1):25-57, 2006.
[29] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.
[30] Y. LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.
[31] K. Bache and M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.