Wasserstein Logistic Regression with Mixed Features

Aras Selvi, Mohammad Reza Belbasi, Martin B. Haugh, Wolfram Wiesemann
Imperial College Business School, Imperial College London, United Kingdom
{a.selvi19, r.belbasi21, m.haugh, ww}@imperial.ac.uk

Abstract

Recent work has leveraged the popular distributionally robust optimization paradigm to combat overfitting in classical logistic regression. While the resulting classification scheme displays a promising performance in numerical experiments, it is inherently limited to numerical features. In this paper, we show that distributionally robust logistic regression with mixed (i.e., numerical and categorical) features, despite amounting to an optimization problem of exponential size, admits a polynomial-time solution scheme. We subsequently develop a practically efficient column-and-constraint approach that solves the problem as a sequence of polynomial-time solvable exponential conic programs. Our model retains many of the desirable theoretical features of previous works, but in contrast to the literature it does not admit an equivalent representation as a regularized logistic regression, that is, it represents a genuinely novel variant of logistic regression. We show that our method outperforms both the unregularized and the regularized logistic regression on categorical as well as mixed-feature benchmark instances.

1 Introduction

Consider a data set $(x^i, y^i)_{i=1}^N$ with feature vectors $x^i$ and associated binary labels $y^i \in \{-1, +1\}$. Classical logistic regression assumes that the labels depend probabilistically on the features via

$$\mathrm{Prob}(y \mid x) = \left[ 1 + \exp\big(-y\,[\beta_0 + \beta^\top x]\big) \right]^{-1},$$

where the parameters $(\beta_0, \beta) \in \mathbb{R}^{1+n}$ are estimated from the empirical risk minimization problem

$$\underset{(\beta_0, \beta)}{\text{minimize}} \;\; \frac{1}{N} \sum_{i=1}^{N} l_\beta(x^i, y^i) \quad \text{subject to} \;\; (\beta_0, \beta) \in \mathbb{R}^{n+1}$$

with the log-loss function $l_\beta(x, y) := \log\big(1 + \exp(-y\,[\beta_0 + \beta^\top x])\big)$. Its compelling performance across many domains, the availability of mature and computationally efficient algorithms as well as its interpretability have all contributed to the widespread adoption of logistic regression [6, 18, 27].

Similar to other machine learning models, logistic regression is prone to overfitting, especially when the number of training samples is small relative to the number of considered features. Moreover, logistic regression can be sensitive to erroneous feature and label values as well as distribution shifts under which the training and test sets stem from different distributions. In recent years, distributionally robust (DR) optimization [3, 4] has been proposed to simultaneously address these challenges. To this end, the DR optimization paradigm models a machine learning task as a zero-sum game between the decision maker, who seeks to obtain the most accurate model (e.g., the parameter vector $\beta$ in a logistic regression), and a fictitious adversary who observes the decision maker's model and subsequently selects the worst data-generating distribution in the vicinity of the empirical distribution formed from the available training data. In this context, the similarity of distributions is commonly measured in terms of moment bounds [9, 39], ϕ-divergences such as the Kullback-Leibler divergence [19, 22] or the Wasserstein distance [15, 25].
It has been observed that in many cases, the resulting DR machine learning models are equivalent to regularized versions of the same models, in which case distributional robustness provides a new perspective on regularizers that are often selected ad hoc [21, 41]. Most importantly, DR machine learning models often admit computationally efficient reformulations as finite-dimensional convex optimization problems that can be solved in polynomial time with off-the-shelf solvers.

In this work we study a DR variant of logistic regression where the distance between the empirical distribution and the unknown true data-generating distribution is measured by the popular Wasserstein (also known as Kantorovich-Rubinstein or earth mover's) distance [15, 25]. This problem was first studied by [31], who show that the resulting DR logistic regression problem admits an equivalent reformulation as a polynomial-size convex optimization problem if all features are numerical. Applying similar techniques to the mixed-feature logistic regression problem, which appears to be most prevalent in practice, would result in an exponential-size convex optimization problem whose naïve solution with the methods of [31] does not scale to interesting problem sizes (cf. Section 4).

Our contributions may be summarized as follows. (i) On the theoretical side, we show that the complexity of DR mixed-feature regression crucially relies on the selected loss function. In particular, for the log-loss function employed in logistic regression, the problem, despite its natural representation as an exponential-size convex optimization problem, admits a polynomial-time solution scheme. We also show that in stark contrast to earlier variants of the problem, our mixed-feature regression does not admit an equivalent representation as a regularized problem. This provides compelling evidence that the DR logistic regression problem with mixed features is fundamentally different from the DR problem with only numerical features. (ii) On the computational side, we propose a column-and-constraint scheme that solves the DR mixed-feature logistic regression as a sequence of polynomial-time solvable exponential conic programs. We show that the key step of our procedure, the identification of the most violated constraint, can be implemented efficiently for a broad range of metrics, despite its natural representation as a combinatorial optimization problem. Indeed, identifying the most violated constraint by brute force is out of the question, while solving the most violated constraint problem with the standard approach of [43] would have an unacceptably high runtime. (iii) On the numerical side, we show that our column-and-constraint scheme drastically reduces computation times over a naïve monolithic implementation of the regression problem. We also show that our model performs favorably on standard categorical and mixed-feature benchmark instances when compared against classical and regularized logistic regression.

The literature on DR machine learning under the Wasserstein distance is vast and rapidly growing. Recent works have considered, among others, the use of Wasserstein DR models in multi-label learning [14], generative adversarial networks [2, 8] and the generation of adversarial examples [40], density estimation [38] and learning Gaussian mixture models [20], graph-based semi-supervised learning [35], supervised dimensionality reduction [13] and reinforcement learning [1].
DR optimization has also been found to help alleviate problems with overfitting [24, 21], label uncertainty [21] and distribution shifts [37, 36]. We refer to [21] for a recent review of the literature. Our work is most closely related to [31], and we compare our findings with the results of that work in Section 2.

We proceed as follows. Section 2 defines and analyzes the mixed-feature DR logistic regression problem, which is solved in Section 3 via column-and-constraint generation. We report numerical results in Section 4. Auxiliary material and all proofs are relegated to the appendix.

Notation. We define $\mathbb{B} = \{0, 1\}$ and $[N] = \{1, \ldots, N\}$ for $N \in \mathbb{N}$. The set of all probability distributions supported on a set $\Xi$ is denoted by $\mathcal{P}_0(\Xi)$, while the Dirac distribution placing unit probability mass on $x \in \mathbb{R}^n$ is denoted by $\delta_x \in \mathcal{P}_0(\mathbb{R}^n)$. The indicator function $\mathbb{1}[E]$ attains the value 1 (0) whenever the expression $E$ is (not) satisfied. Finally, we use Roman $\mathrm{d}$ to denote differentials, to distinguish them from the $d$ used to denote distances.

2 Mixed-Feature DR Logistic Regression

Section 2.1 derives an exponential-size convex programming formulation of the DR logistic regression problem that serves as the basis of our analysis. Section 2.2 compares our formulation with a naïve model that treats categorical features as continuous ones. Section 2.3, finally, shows that our formulation, despite its exponential size, can be solved in polynomial time due to the benign structure of the log-loss function. It also establishes that in stark contrast to the literature, our formulation does not reduce to a regularized non-robust logistic regression, that is, it constitutes a genuinely novel variant of the logistic regression problem. Our formulation enjoys strong finite-sample and asymptotic performance guarantees from the literature. Moreover, despite its exponential size, our model always accommodates worst-case distributions that exhibit a desirable sparsity pattern. We relegate these results to Appendix A.

2.1 Formulation as an Exponential-Size Convex Program

From now on, we consider a mixed-feature data set $\{\xi^i := (x^i, z^i, y^i)\}_{i \in [N]}$ with $n$ numerical features $x^i = (x^i_1, \ldots, x^i_n) \in \mathbb{R}^n$, $m$ categorical features $z^i = (z^i_1, \ldots, z^i_m) \in \mathcal{C}(k_1) \times \ldots \times \mathcal{C}(k_m)$ and a binary label $y^i \in \{-1, +1\}$. Here, $\mathcal{C}(s) = \{z \in \mathbb{B}^{s-1} : \sum_{j \in [s-1]} z_j \le 1\}$ for $s \in \mathbb{N} \setminus \{1\}$ represents the one-hot encoding of a categorical feature with $s$ possible values; in particular, $\mathcal{C}(2) = \mathbb{B}$ encodes a binary feature. We denote by $\mathcal{C} = \mathcal{C}(k_1) \times \ldots \times \mathcal{C}(k_m)$ and $\Xi = \mathbb{R}^n \times \mathcal{C} \times \{-1, +1\}$ the support of the categorical features as well as the data set, respectively, and we let $k = k_1 + \ldots + k_m - m$ be the number of slopes used for the categorical features.

If we had access to the true data-generating distribution $\mathbb{P}_0 \in \mathcal{P}_0(\Xi)$, we would solve the (non-robust) logistic regression problem

$$\underset{\beta}{\text{minimize}} \;\; \mathbb{E}_{\mathbb{P}_0} \big[ l_\beta(x, z, y) \big] \quad \text{subject to} \;\; \beta = (\beta_0, \beta_N, \beta_C) \in \mathbb{R}^{1+n+k},$$

where $\mathbb{E}_{\mathbb{P}_0}$ denotes the expectation under $\mathbb{P}_0$, and where the log-loss function $l_\beta$ now takes the form

$$l_\beta(x, z, y) := \log\big( 1 + \exp(-y\,[\beta_0 + \beta_N^\top x + \beta_C^\top z]) \big)$$

to account for the presence of categorical features. Since the distribution $\mathbb{P}_0$ is unknown in practice, the empirical risk minimization problem replaces $\mathbb{P}_0$ with the empirical distribution $\widehat{\mathbb{P}}_N := \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i}$ that places equal probability mass on all observations $\{\xi^i\}_{i \in [N]}$. Standard arguments show that when these observations are i.i.d., the empirical risk minimization problem recovers the logistic regression under $\mathbb{P}_0$ as $N \to \infty$.
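To make the one-hot encoding $\mathcal{C}(s)$ and the mixed-feature log-loss $l_\beta$ concrete, the following Julia snippet is a minimal sketch (our own illustration, not part of the paper's code base); the names `onehot` and `mixed_logloss` are hypothetical.

```julia
using LinearAlgebra

# Map the `level`-th value (1, ..., s) of a categorical feature with s possible
# values to its one-hot encoding in C(s) ⊆ {0,1}^{s-1}; level 1 plays the role of
# the all-zeros reference category.
function onehot(level::Int, s::Int)
    @assert s >= 2 && 1 <= level <= s
    z = zeros(Int, s - 1)
    level > 1 && (z[level - 1] = 1)
    return z
end

# Mixed-feature log-loss l_β(x, z, y) = log(1 + exp(-y[β0 + βN'x + βC'z])), where
# `z` concatenates the one-hot encoded categorical features (a vector of length k).
mixed_logloss(β0, βN, βC, x, z, y) = log1p(exp(-y * (β0 + dot(βN, x) + dot(βC, z))))

# Example: m = 2 categorical features with k₁ = 3 and k₂ = 2 possible values, so k = 3.
z = vcat(onehot(2, 3), onehot(1, 2))                              # [1, 0, 0]
mixed_logloss(0.1, [0.5, -0.3], [1.0, -1.0, 0.2], [0.7, 1.2], z, +1)
```

Averaging `mixed_logloss` over the observations $\{\xi^i\}_{i \in [N]}$ recovers the empirical risk minimization objective described above.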
In practice, however, data tends to be scarce, and the empirical risk minimization problem exhibits an optimism bias that is also known as overfitting [6, 18, 27], the error maximization effect of optimization [10, 23] or the optimizer's curse [34].

DR logistic regression combats the aforementioned overfitting phenomenon by solving the semi-infinite optimization problem

$$\underset{\beta}{\text{minimize}} \;\; \sup_{\mathbb{Q} \in \mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}} \big[ l_\beta(x, z, y) \big] \quad \text{subject to} \;\; \beta = (\beta_0, \beta_N, \beta_C) \in \mathbb{R}^{1+n+k}, \tag{1}$$

where the ambiguity set $\mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N)$ contains all distributions $\mathbb{Q}$ in a (soon to be defined) vicinity of the empirical distribution $\widehat{\mathbb{P}}_N$, and where the expectation is taken with respect to $\mathbb{Q}$. Problem (1) can be interpreted as a zero-sum game between the decision maker, who chooses a logistic regression model parameterized by $\beta$, and a fictitious adversary that observes $\beta$ and subsequently chooses the worst distribution (in terms of the incurred log-loss) from $\mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N)$. Contrary to the classical non-robust logistic regression, problem (1) is guaranteed to overestimate the log-loss incurred by $\beta$ under the unknown true distribution $\mathbb{P}_0$ as long as $\mathbb{P}_0$ is contained in $\mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N)$. On the other hand, problem (1) recovers the empirical risk minimization problem when the ambiguity set $\mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N)$ approaches a singleton set that only contains the empirical distribution $\widehat{\mathbb{P}}_N$.

Note that problem (1) is convex as the convexity of the logistic regression objective $\mathbb{E}_{\mathbb{Q}} [l_\beta(x, z, y)]$ is preserved under the supremum operator. That said, the problem typically constitutes a semi-infinite program: it comprises finitely many decision variables but, once the embedded supremum in the objective is brought to the constraints via an epigraph reformulation, infinitely many constraints whenever the ambiguity set harbors infinitely many distributions. As such, it is not obvious how the problem can be solved efficiently.

In this paper, we choose as ambiguity set the Wasserstein ball $\mathcal{B}_\epsilon(\widehat{\mathbb{P}}_N) := \{\mathbb{Q} \in \mathcal{P}_0(\Xi) : W(\mathbb{Q}, \widehat{\mathbb{P}}_N) \le \epsilon\}$ of radius $\epsilon > 0$ that is centered at the empirical distribution $\widehat{\mathbb{P}}_N$.

Definition 1 (Wasserstein Distance). The type-1 Wasserstein (Kantorovich-Rubinstein, or earth mover's) distance between two distributions $\mathbb{P} \in \mathcal{P}_0(\Xi)$ and $\mathbb{Q} \in \mathcal{P}_0(\Xi)$ is defined as

$$W(\mathbb{Q}, \mathbb{P}) := \inf_{\Pi \in \mathcal{P}_0(\Xi^2)} \left\{ \int_{\Xi^2} d(\xi, \xi') \, \Pi(\mathrm{d}\xi, \mathrm{d}\xi') \;:\; \Pi(\mathrm{d}\xi, \Xi) = \mathbb{Q}(\mathrm{d}\xi), \; \Pi(\Xi, \mathrm{d}\xi') = \mathbb{P}(\mathrm{d}\xi') \right\}, \tag{2}$$

where $\xi = (x, z, y) \in \Xi$ and $\xi' = (x', z', y') \in \Xi$, while $d(\xi, \xi')$ is the ground metric on $\Xi$.

The Wasserstein distance can be interpreted as the minimum cost of moving $\mathbb{Q}$ to $\mathbb{P}$ when $d(\xi, \xi')$ is the cost of moving a unit mass from $\xi$ to $\xi'$. The Wasserstein radius $\epsilon$ thus imposes a budget on the transportation cost that the adversary can spend on perturbing the empirical distribution $\widehat{\mathbb{P}}_N$. We next define the ground metric $d$ that we use throughout the paper.

Definition 2 (Ground Metric). We measure the distance between two data points $\xi = (x, z, y) \in \Xi$ and $\xi' = (x', z', y') \in \Xi$ with $z = (z_1, \ldots, z_m)$ and $z' = (z'_1, \ldots, z'_m)$ as

$$d(\xi, \xi') := \|x - x'\| + d_{\mathcal{C}}(z, z') + \kappa \cdot \mathbb{1}[y \neq y'], \tag{3a}$$

where $\|\cdot\|$ is any rational norm on $\mathbb{R}^n$, $\kappa > 0$, and the metric $d_{\mathcal{C}}$ on $\mathcal{C}$ satisfies

$$d_{\mathcal{C}}(z, z') := \Bigg( \sum_{i \in [m]} \mathbb{1}[z_i \neq z'_i] \Bigg)^{1/p} \quad \text{for some } p > 0. \tag{3b}$$

Intuitively, the ground metric $d$ measures the distance of the numerical features $x$ and $x'$ by the norm distance $\|x - x'\|$, whereas the distance of the categorical features $z$ and $z'$ is measured by (a power of) the number of discrepancies between $z$ and $z'$. Likewise, any discrepancy between the labels $y$ and $y'$ is accounted for by a constant $\kappa$, which allows for scaling between the features and the labels.
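As an illustration, the following Julia snippet is a minimal sketch (not part of the paper's code base) of the ground metric (3); it assumes the Euclidean norm for the numerical part, and the arguments `z` and `z2` collect the one-hot encoded categorical features as vectors of vectors.

```julia
using LinearAlgebra

# Ground metric d(ξ, ξ') = ‖x − x'‖ + d_C(z, z') + κ·1[y ≠ y'] from Definition 2,
# with d_C(z, z') = (number of categorical discrepancies)^(1/p).
function ground_metric(x, z, y, x2, z2, y2; κ = 1.0, p = 1.0)
    numerical     = norm(x - x2)                                  # ‖x − x'‖ (Euclidean here)
    discrepancies = sum(zj != zj2 for (zj, zj2) in zip(z, z2))    # Σ 1[z_i ≠ z'_i]
    categorical   = float(discrepancies)^(1 / p)
    label         = κ * (y != y2)                                 # κ·1[y ≠ y']
    return numerical + categorical + label
end

# Example with n = 2 numerical and m = 2 categorical features (k₁ = 3, k₂ = 2).
ξ1 = ([0.4, 1.0], [[1, 0], [0]], +1)
ξ2 = ([0.4, 0.5], [[0, 1], [0]], -1)
ground_metric(ξ1[1], ξ1[2], ξ1[3], ξ2[1], ξ2[2], ξ2[3]; κ = 1.0, p = 1.0)   # 0.5 + 1 + 1 = 2.5
```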
We first extend the convex optimization model of [31] for the DR continuous-feature logistic regression to an exponential conic model that accommodates categorical features.

Theorem 1 (Exponential Conic Representation). The DR logistic regression problem (1) admits the equivalent reformulation

$$\begin{array}{ll}
\underset{\beta, \lambda, s, u, v}{\text{minimize}} & \lambda \epsilon + \dfrac{1}{N} \displaystyle\sum_{i \in [N]} s_i \\[1.5ex]
\text{subject to} & u^+_{i,z} + v^+_{i,z} \le 1 \\
& \big( u^+_{i,z},\, 1,\, -s_i - \lambda d_{\mathcal{C}}(z, z^i) \big) \in K_{\exp} \\
& \big( v^+_{i,z},\, 1,\, -y^i \beta_N^\top x^i - y^i \beta_C^\top z - y^i \beta_0 - s_i - \lambda d_{\mathcal{C}}(z, z^i) \big) \in K_{\exp} \\
& u^-_{i,z} + v^-_{i,z} \le 1 \\
& \big( u^-_{i,z},\, 1,\, -s_i - \lambda d_{\mathcal{C}}(z, z^i) - \lambda\kappa \big) \in K_{\exp} \\
& \big( v^-_{i,z},\, 1,\, y^i \beta_N^\top x^i + y^i \beta_C^\top z + y^i \beta_0 - s_i - \lambda d_{\mathcal{C}}(z, z^i) - \lambda\kappa \big) \in K_{\exp} \\
& \|\beta_N\|_* \le \lambda \\
& \beta = (\beta_0, \beta_N, \beta_C) \in \mathbb{R}^{1+n+k}, \;\; \lambda \ge 0, \;\; s \in \mathbb{R}^N \\
& (u^+_{i,z}, v^+_{i,z}, u^-_{i,z}, v^-_{i,z}) \in \mathbb{R}^4 \qquad \forall\, i \in [N] \text{ and } z \in \mathcal{C}
\end{array} \tag{4}$$

as a finite-dimensional exponential conic program, where $\|\cdot\|_*$ denotes the dual norm of the norm $\|\cdot\|$ from Definition 2 and $K_{\exp}$ denotes the exponential cone

$$K_{\exp} := \mathrm{cl}\left( \{(a, b, c) : a \ge b \exp(c/b), \; a > 0, \; b > 0\} \right) \subseteq \mathbb{R}^3.$$

The $s_i$'s and $\lambda$ that appear in (4) are dual variables that arise from the dual of the inner (i.e., sup) problem in (1). For each $z \in \mathcal{C}$ and $i \in [N]$, $(u^+_{i,z}, v^+_{i,z}, u^-_{i,z}, v^-_{i,z})$ is a vector of auxiliary variables that we use to model softplus constraints that arise from our intermediate analysis of problem (1) (Appendix E).

Figure 1: Left: Estimates of $\beta$ for the standard logistic regression, our mixed-feature as well as the continuous-feature model as a function of the size $N$ of the data set. All results are reported as averages over 2,000 statistically independent runs. Right: Comparison of the empirical distribution as well as the worst-case distributions for $\beta = 1$ under the two DR models for $N = 250$ samples. The probabilities are plotted on a log-scale to make small values visible.

Problem (4) is finite-dimensional and convex. Moreover, since exponential conic programs admit a self-concordant barrier function, they can be solved in polynomial time relative to their input size [28]. Due to the presence of categorical features in our regression problem, however, problem (4) comprises exponentially many variables and constraints, and a naïve solution of (4) would therefore require an exponential amount of time. The exponential size of problem (4) renders our formulation fundamentally different from the model of [31], and it significantly complicates the solution.

2.2 Comparison with Continuous-Feature DR Logistic Regression

Our formulation (4) of the mixed-feature DR logistic regression (1) accounts for categorical features $z \in \mathcal{C}$ at the expense of an exponential number of variables and constraints. It is therefore tempting to treat the categorical features as continuous ones and directly apply the continuous-feature-only reformulation of the DR logistic regression (1) proposed by [31]. In the following, we argue that such a reformulation would hedge against nonsensical worst-case distributions that in turn lead to overly conservative regression models.

To see this, consider a stylized setting where the data set $(z^i, y^i)_{i \in [N]}$ comprises a single binary feature $z^i \in \{-1, +1\}$ that impacts the label $y^i \in \{-1, +1\}$ via the logistic model $\mathrm{Prob}(y \mid z) = [1 + \exp(-y\,\beta z)]^{-1}$ with $\beta = 1$. (While we use $z^i \in \{0, 1\}$ in the other sections of the paper, the analysis and results of this subsection and Appendix B assume $z^i \in \{-1, +1\}$ as this makes comparisons with [31] easier.) We attempt to recover this logistic model in two ways:

1. Mixed-feature model. We employ our DR logistic regression (1) with a single categorical feature, using $p = 1$ and $\kappa = 1$ in the ground metric (cf. Definition 2).

2. Continuous-feature model. We employ a variant of problem (1) that treats the categorical feature $z^i$ as a continuous one.
For the continuous-feature model, we choose as ground metric $d(\xi, \xi') := \frac{1}{2}|z - z'| + \mathbb{1}[y \neq y']$. We therefore use the feature distances $\mathbb{1}[z \neq z']$ in the mixed-feature case and $\frac{1}{2}|z - z'|$ in the continuous-feature case; note that both attain the value 0 (1) for equal (different) feature values.

Figure 1 (left) compares the mean values (dashed lines) as well as the 15% and 85% quantiles (shaded regions) of $\beta$ for the standard logistic regression, our mixed-feature as well as the continuous-feature model as a function of the size $N$ of the data set. For both DR models, we employed the same Wasserstein radius $\epsilon \sim 1/\sqrt{N}$, which is motivated by the finite sample guarantee of Theorem 6 (cf. Appendix A). The figure shows that the continuous-feature model excessively shrinks its estimates of $\beta$, thus leading to overly conservative results. In fact, the true parameter value $\beta = 1$ consistently lies outside the confidence region of the continuous-feature model, independent of the sample size $N$. This is due to the fact that the continuous-feature model accounts for nonsensical worst-case distributions under which the categorical feature can take values outside its domain $\{-1, +1\}$; see Figure 1 (right). In fact, the continuous-feature model places non-zero probability on an extreme scenario under which the binary feature $z \in \{-1, +1\}$ attains the value 15,312. (In Appendix B we explain why this value of 15,312 arises.) Our mixed-feature model, on the other hand, restricts the worst-case distribution to the domain of the categorical feature and thus hedges against realistic distributions only.

We note that in practice, the radius $\epsilon$ of the Wasserstein ball will be chosen via cross-validation (cf. Section 4), in which case our mixed-feature model reliably outperforms the classical logistic regression on standard benchmark instances. While one may argue that the impact of nonsensical worst-case distributions in the continuous-feature model is alleviated by the cross-validated radii, Figure 1 offers strong theoretical and practical reasons against the use of a continuous-feature model for categorical or mixed-feature problems: the continuous-feature model hedges against a clumsily perturbed worst-case distribution that introduces a pronounced bias in the estimation. We also note that another possible approach would be to use the continuous-feature model but to restrict the support of binary features to $[-1, 1]$. Unfortunately, for the log-loss function no tractable reformulation with support constraints appears to be known; see [32]. Moreover, Section 4 will show that the mixed-feature model, despite its exponential size, can be solved quickly and reliably with our novel column-and-constraint approach from Section 3.

2.3 Complexity Analysis

We first show that despite its exponential size, the DR logistic regression problem (1) admits a polynomial-time solution. Key to this perhaps surprising finding is the shape of the loss function: while problem (1) is strongly NP-hard (and thus unlikely to admit a polynomial-time solution scheme) for generic loss functions, it can be solved in polynomial time for the log-loss function $l_\beta$ employed in logistic regression.

Theorem 2 (Complexity of the DR Logistic Regression (1)).
(i) For generic loss functions $l_\beta$, problem (1) is strongly NP-hard even if $n = 0$ and $N = 1$.
(ii) For the loss function $l_\beta(x, z, y) = \log\big( 1 + \exp(-y\,[\beta_0 + \beta_N^\top x + \beta_C^\top z]) \big)$ and the ground metric of Definition 2, problem (1) can be solved to δ-accuracy in polynomial time.
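The tractability in Theorem 2 (ii) rests on the observation, used in formulation (4), that each softplus constraint $\log(1 + \exp(a)) \le b$ can be written as $u + v \le 1$ with $(u, 1, -b) \in K_{\exp}$ and $(v, 1, a - b) \in K_{\exp}$, which off-the-shelf conic solvers handle. The snippet below is a minimal sketch, not the authors' implementation, of how one such constraint pair from (4) can be modeled in JuMP; the numeric values standing in for $y^i(\beta_0 + \beta_N^\top x^i + \beta_C^\top z)$ and $d_{\mathcal{C}}(z, z^i)$ are hypothetical placeholders, and the MOI exponential cone uses the coordinate order $(c, b, a)$ relative to the paper's $K_{\exp}$.

```julia
using JuMP, SCS   # SCS is an open-source conic solver supporting the exponential cone;
                  # the paper itself solves these programs with MOSEK.

model = Model(SCS.Optimizer)
set_silent(model)

# Placeholder data for a single (i, z) block; in the full model (4) these are
# expressions in β and there is one such block per i ∈ [N] and z ∈ C.
margin = 0.7    # hypothetical stand-in for y^i (β0 + βN'x^i + βC'z)
dC     = 0.0    # d_C(z, z^i) for the unperturbed block z = z^i
ϵ      = 0.1    # Wasserstein radius

@variable(model, λ >= 0)
@variable(model, s)             # plays the role of s_i
@variable(model, uplus >= 0)
@variable(model, vplus >= 0)

# Softplus constraint log(1 + exp(-margin)) ≤ s + λ d_C(z, z^i), written with two
# exponential cones: the paper's (a, 1, c) ∈ K_exp, i.e. a ≥ exp(c), is the vector
# [c, 1, a] in MOI.ExponentialCone() = {(x, y, z) : y exp(x/y) ≤ z, y > 0}.
@constraint(model, uplus + vplus <= 1)
@constraint(model, [-s - λ * dC, 1, uplus] in MOI.ExponentialCone())
@constraint(model, [-margin - s - λ * dC, 1, vplus] in MOI.ExponentialCone())

@objective(model, Min, λ * ϵ + s)
optimize!(model)
value(s)   # ≈ log(1 + exp(-margin)), since λ = 0 is optimal in this toy instance
```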
Recall that an optimization problem is solved to δ-accuracy (cf. Theorem 2) if a δ-suboptimal solution is identified that satisfies all constraints modulo a violation of at most δ. The consideration of δ-accurate solutions is standard in the numerical solution of nonlinear programs, where an optimal solution may be irrational.

A by now well-known result shows that when $m = 0$ (no categorical features), the DR logistic regression problem (1) reduces to a classical logistic regression with an additional regularization term $\epsilon \|\beta_N\|_*$ in the objective function when the output label weight $\kappa$ in Definition 2 approaches $\infty$ [31, 32]. We next show that this reduction to a classical regularized logistic regression no longer holds in our problem setting when categorical features are present.

Theorem 3 (Absence of a Reformulation as a Regularized Problem). Even when the output label weight $\kappa$ approaches $\infty$ in the DR logistic regression (1), problem (1) does not admit an equivalent reformulation

$$\underset{\beta}{\text{minimize}} \;\; \mathbb{E}_{\widehat{\mathbb{P}}_N} \big[ l_\beta(x, z, y) \big] + R(\beta) \quad \text{subject to} \;\; \beta = (\beta_0, \beta_N, \beta_C) \in \mathbb{R}^{1+n+k}$$

as a classical regularized logistic regression for any regularizer $R : \mathbb{R}^{1+n+k} \to \mathbb{R}$.

Note that Theorem 3 does not only preclude the existence of a specific regularizer, but it excludes the existence of any regularizer, no matter how complex its dependence on $\beta$ might be. We are not aware of any prior results of this form in the literature. Some insight for the result in Theorem 3 may be found via the special case we consider in its proof in Appendix E and in Remark 2 that follows it.

Algorithm 1 Column-and-Constraint Generation Scheme for Problem (4)
Set $\mathrm{LB}_0 = -\infty$, $\mathrm{UB}_0 = +\infty$, and $t = 1$
Choose (possibly empty) subsets $\mathcal{W}^+, \mathcal{W}^- \subseteq [N] \times \mathcal{C}$
while $\mathrm{LB}_{t-1} \neq \mathrm{UB}_{t-1}$ do
  Let $\theta$ be the optimal value and $(\beta, \lambda, s, u, v)$ be a minimizer of problem (4) where:
    the first constraint set is only enforced for all $(i, z) \in \mathcal{W}^+$;
    the second constraint set is only enforced for all $(i, z) \in \mathcal{W}^-$;
    we only include those $(u^+_{i,z}, v^+_{i,z})$ for which $(i, z) \in \mathcal{W}^+$;
    we only include those $(u^-_{i,z}, v^-_{i,z})$ for which $(i, z) \in \mathcal{W}^-$.
  Identify a most violated index $(i^+, z^+)$ of the first constraint set in (4) and add it to $\mathcal{W}^+$
  Identify a most violated index $(i^-, z^-)$ of the second constraint set in (4) and add it to $\mathcal{W}^-$
  Let $\vartheta^+$ and $\vartheta^-$ be the violations of $(i^+, z^+)$ and $(i^-, z^-)$, respectively
  Update $\mathrm{LB}_t = \theta$ and $\mathrm{UB}_t = \min\{\mathrm{UB}_{t-1},\, \theta + \log(1 + \max\{\vartheta^+, \vartheta^-\})\}$
  Update $t = t + 1$
end while

Algorithm 2 Identification of Most Violated Constraints in the Reduced Problem (4)
for $j \in \{1, \ldots, m\}$ do
  Find a feature value $z^\star_j$ that minimizes $y^i \beta_{C,j}^\top z_j$ across all $z_j \in \mathcal{C}(k_j) \setminus \{z^i_j\}$
end for
Let $\pi : [m] \to [m]$ be an ordering such that $y^i \beta_{C,\pi(j)}^\top (z^\star_{\pi(j)} - z^i_{\pi(j)}) \le y^i \beta_{C,\pi(j')}^\top (z^\star_{\pi(j')} - z^i_{\pi(j')})$ for all $1 \le j \le j' \le m$
Set $\mathcal{W} = \emptyset$
for $\delta \in \{0, 1, \ldots, m\}$ do
  Update $\mathcal{W} = \mathcal{W} \cup \{z\}$, where $z_j = z^\star_j$ if $\pi(j) \le \delta$ and $z_j = z^i_j$ otherwise
end for
Determine the most violated constraint from the candidate set $\mathcal{W}$

3 Column-and-Constraint Solution Scheme

Our DR logistic regression problem (4) comprises exponentially many variables and constraints, which renders its solution as a monolithic exponential conic program challenging. Instead, Algorithm 1 employs a column-and-constraint generation scheme, e.g., [43], that alternates between (i) the solution of relaxations of problem (4) that omit most of its variables and constraints and (ii) adding those variables and constraints that promise to maximally tighten the relaxations.

Theorem 4. Algorithm 1 solves problem (4) in finitely many iterations. Moreover, $\mathrm{LB}_t$ and $\mathrm{UB}_t$ constitute monotone sequences of lower and upper bounds on the optimal value of problem (4).
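To make the loop structure concrete, the following Julia skeleton is a rough sketch of Algorithm 1 under our reading, not the authors' implementation; the helper functions `solve_reduced` (solving problem (4) restricted to the current index sets) and `most_violated` (running Algorithm 2 over all data points) are hypothetical, and a numerical tolerance replaces the exact stopping criterion $\mathrm{LB}_t = \mathrm{UB}_t$.

```julia
# Column-and-constraint generation skeleton (our sketch). `solve_reduced` is assumed
# to return the optimal value θ and the minimizer of the reduced problem (4);
# `most_violated` is assumed to return a most violated index (i, z) of the requested
# constraint set together with its violation ϑ. Both helpers are hypothetical.
function column_and_constraint(data; tol = 1e-6, maxiter = 1_000)
    Wplus, Wminus = Set{Any}(), Set{Any}()    # index sets W⁺ and W⁻ of (i, z) pairs
    LB, UB = -Inf, Inf
    sol = nothing
    for t in 1:maxiter
        UB - LB <= tol && break               # tolerance-based proxy for LB_t = UB_t
        θ, sol = solve_reduced(data, Wplus, Wminus)
        iplus, zplus, vioplus   = most_violated(data, sol, :first)
        iminus, zminus, viominus = most_violated(data, sol, :second)
        push!(Wplus, (iplus, zplus))
        push!(Wminus, (iminus, zminus))
        LB = θ                                                   # relaxation gives a lower bound
        UB = min(UB, θ + log1p(max(vioplus, viominus, 0.0)))     # upper-bound update of Algorithm 1
    end
    return sol, LB, UB
end
```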
A key step in Algorithm 1 is the identification of most violated indices $(i, z) \in [N] \times \mathcal{C}$ of the first and second constraint set in the reduced DR regression problem (4) for a fixed solution $(\beta, \lambda, s, u, v)$. For the first constraint set in (4), the identification of such indices requires for each data point $i \in [N]$ the solution of the combinatorial problem

$$\begin{array}{ll}
\underset{z}{\text{maximize}} & \min\limits_{u^+, v^+} \left\{ u^+ + v^+ \;:\; \begin{array}{l} \big( u^+, 1, -s_i - \lambda d_{\mathcal{C}}(z, z^i) \big) \in K_{\exp}, \\ \big( v^+, 1, -y^i \beta_N^\top x^i - y^i \beta_C^\top z - y^i \beta_0 - s_i - \lambda d_{\mathcal{C}}(z, z^i) \big) \in K_{\exp} \end{array} \right\} \\[1ex]
\text{subject to} & z \in \mathcal{C},
\end{array} \tag{5}$$

where an optimal value greater than 1 corresponds to a violated constraint; an analogous problem can be defined for the second constraint set in (4). Despite its combinatorial nature, the above problem can be solved efficiently by means of Algorithm 2.

Theorem 5. Algorithm 2 identifies a most violated constraint for a given data point $i \in [N]$ in time $\mathcal{O}(k + n + m^2)$.

Figure 2: Runtime comparison between our column-and-constraint scheme and a naïve solution of problem (4) as a monolithic exponential conic program. Left: Runtimes for $m = 10$ binary features as a function of the number $N$ of data points. Middle: Runtimes for $N = 50$ as a function of $m$. Right: Runtimes of our column-and-constraint scheme only for varying combinations of $m$ and $N$. In all graphs, shaded regions correspond to 10%-90% confidence regions and bold lines report median values over 50 statistically independent runs. Note the log-scale in the plots.

Algorithm 2 first determines for each categorical feature $z_j$, $j = 1, \ldots, m$, the feature value $z^\star_j \in \mathcal{C}(k_j) \setminus \{z^i_j\}$ that, ceteris paribus, contributes to a maximal constraint violation in time $\mathcal{O}(k)$. The subsequent step sorts the different features $j = 1, \ldots, m$ in descending order of their contribution to constraint violations in time $\mathcal{O}(m \log m)$. The last step of Algorithm 2, finally, constructs the candidate constraints $z$ in which $\delta = 0, \ldots, m$ categorical feature values deviate from $z^i$ and subsequently picks the most violated constraint among these in time $\mathcal{O}(n + m^2)$. We note that the overall complexity of $\mathcal{O}(k + n + m^2)$ can be reduced to $\mathcal{O}(k + n + m \log m)$ by a clever use of data structures in the final step; for ease of exposition, we omit the details. Finally, since Algorithm 2 is applied to each data point $i \in [N]$, the overall complexity increases by a factor of $N$. We provide additional intuition behind Algorithm 2 immediately before the proof of Theorem 5 in Appendix E.

4 Numerical Results

Section 4.1 first compares the runtimes of our column-and-constraint scheme from Section 3 with those of solving the DR logistic regression problem (4) naïvely as a monolithic exponential conic program. We subsequently compare the classification performance of problem (1) with those of a classical unregularized and regularized logistic regression on standard benchmark instances with categorical features (Section 4.2) and mixed features (Appendix C). All algorithms were implemented in Julia [5] (MIT license) and executed on Intel Xeon 2.66GHz processors with 8GB memory in single-core mode. We use MOSEK 9.3 [26] (commercial) to solve all exponential conic programs through JuMP [12] (MPL2 license). (We note that the open-source solvers Ipopt and CVXOpt could be used instead of MOSEK.) All source codes and detailed results are available on GitHub (https://github.com/selvi-aras/WassersteinLR).
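The following Julia function is a minimal sketch of Algorithm 2 as we read it (not the authors' implementation): it builds the candidate set $\mathcal{W}$ of at most $m + 1$ categorical vectors for a given data point; evaluating the objective of (5) at each candidate and keeping the maximizer then yields the most violated constraint. The function name and the data layout (`βC` as per-feature coefficient blocks, `zi` as one-hot encoded feature blocks) are our own conventions.

```julia
using LinearAlgebra

# Build the candidate set W of Algorithm 2 for data point i (our sketch).
# βC[j] holds the k_j - 1 slopes of categorical feature j, zi[j] its observed
# one-hot encoding, and yi ∈ {-1, +1} the observed label.
function candidate_perturbations(βC::Vector{Vector{Float64}}, zi::Vector{Vector{Int}}, yi::Int)
    m = length(βC)
    zstar = Vector{Vector{Int}}(undef, m)   # best deviating value per feature
    gains = zeros(m)                        # its contribution y^i βC,j' (z*_j - z^i_j)
    for j in 1:m
        kj = length(βC[j]) + 1
        best_val, best_z = Inf, zi[j]
        for level in 1:kj                   # enumerate C(k_j)
            zj = zeros(Int, kj - 1)
            level > 1 && (zj[level - 1] = 1)
            zj == zi[j] && continue         # the candidate must differ from the observed value
            val = yi * dot(βC[j], zj)
            if val < best_val
                best_val, best_z = val, zj
            end
        end
        zstar[j] = best_z
        gains[j] = yi * dot(βC[j], best_z - zi[j])
    end
    order = sortperm(gains)                 # most violation-increasing features first
    W = [copy(zi)]                          # δ = 0: no deviation from z^i
    current = copy(zi)
    for δ in 1:m                            # δ = 1, ..., m deviating features
        current = copy(current)
        current[order[δ]] = zstar[order[δ]]
        push!(W, current)
    end
    return W
end
```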
4.1 Runtime Comparison with Monolithic Formulation

We first compare the computation times of our column-and-constraint scheme from Section 3 with those of solving the DR logistic regression problem (4) naïvely as a monolithic exponential conic program. To this end, we randomly generate synthetic logistic regression instances with varying numbers $N$ of data points and $m$ of binary features. While Figure 2 (left) shows that both approaches scale similarly in the number $N$ of data points, Figure 2 (middle) reveals that the solution of the monolithic formulation scales exponentially in the number $m$ of binary features. In contrast, our column-and-constraint scheme scales gracefully in both the number $N$ of data points and the number $m$ of binary features (cf. Figure 2, right). A similar behavior can be observed with more general, non-binary categorical features; we omit the results due to space constraints.

4.2 Performance on Categorical-Feature Instances

We next compare the classification performance (in terms of the out-of-sample classification error) of our unregularized ("DRO") and Lasso-regularized ("r-DRO") DR logistic regression problem (1) with those of a classical unregularized ("LR") as well as Lasso-regularized ("r-LR"), mass transportation-regularized ("MT") [31, 32], and robust Wasserstein profile inference-regularized ("PI") [7] logistic regression on the 14 most popular UCI data sets that contain only categorical features and have more than 30 rows [11] (varying licenses). Instances with multiple output labels are converted into instances with binary output labels by distinguishing between the majority class and all other classes.

| Data Set | N | k | m | LR | DRO (κ=1) | DRO (κ=m) | r-LR | r-DRO (κ=1) | r-DRO (κ=m) | MT (κ=1) | MT (κ=m) | PI (α=0.05) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| breast-cancer | 277 | 42 | 9 | 29.56% | 29.15% | 28.55% | 29.05% | 29.16% | 28.82% | 29.44% | 29.40% | 29.33% |
| spect | 267 | 22 | 22 | 18.81% | 17.72% | 17.51% | 17.79% | 18.38% | 15.83% | 18.49% | 18.74% | 20.60% |
| monks-3 | 554 | 11 | 6 | 2.07% | 2.08% | 2.10% | 2.14% | 2.19% | 2.13% | 2.15% | 2.15% | 37.87% |
| tic-tac-toe | 958 | 18 | 9 | 1.92% | 1.69% | 1.67% | 1.68% | 1.68% | 1.69% | 1.72% | 1.81% | 30.02% |
| kr-vs-kp | 3,196 | 37 | 36 | 2.64% | 2.66% | 2.64% | 2.69% | 2.66% | 2.66% | 2.77% | 2.74% | 12.85% |
| balance-scale | 625 | 16 | 4 | 0.85% | 0.82% | 0.72% | 0.73% | 0.71% | 0.64% | 0.72% | 0.69% | 36.10% |
| hayes-roth | 160 | 11 | 4 | 17.00% | 16.31% | 16.09% | 18.66% | 18.19% | 17.78% | 17.50% | 16.38% | 37.59% |
| lymphography | 148 | 42 | 18 | 21.17% | 17.83% | 16.72% | 17.38% | 17.52% | 17.03% | 19.41% | 17.83% | 20.00% |
| car | 1,728 | 15 | 6 | 4.73% | 4.73% | 4.73% | 4.99% | 4.76% | 4.87% | 4.73% | 4.77% | 15.37% |
| splices | 3,189 | 229 | 60 | 6.80% | 5.92% | 6.66% | 6.22% | 5.99% | 6.02% | 6.57% | 6.87% | 11.08% |
| house-votes-84 | 435 | 32 | 16 | 6.60% | 4.43% | 5.37% | 4.54% | 4.41% | 5.16% | 4.78% | 5.54% | 4.26% |
| hiv | 6,590 | 152 | 8 | 5.94% | 5.93% | 5.92% | 5.90% | 5.90% | 5.90% | 6.04% | 6.03% | 20.45% |
| primary-tumor | 339 | 25 | 17 | 13.93% | 13.66% | 13.76% | 14.51% | 14.06% | 14.19% | 13.87% | 13.85% | 19.18% |
| audiology | 226 | 92 | 69 | 14.18% | 14.03% | 14.40% | 3.11% | 2.80% | 3.91% | 12.64% | 7.49% | 15.49% |

Table 1: Classification errors of unregularized as well as Lasso-, mass transportation-, and profile inference-regularized variants of the classical logistic regression and our DR regression on UCI benchmark instances with categorical features only. The smallest error within each model group (unregularized vs. regularized) is highlighted in italics, whereas the smallest error overall (across all groups) is printed in bold. The dagger (†) and double dagger (‡) symbols next to the best model denote statistically significant improvements over LR and the second best model, respectively.
The instances vary in the number $N$ of data points, the number $m$ of categorical features as well as, accordingly, the number $k$ of slopes considered in the one-hot encoding of the categorical features (cf. Section 2.1). All results are reported as means over 100 random training set-test set splits (80%:20%). The radius $\epsilon \in \{0, 10^{-5}, \ldots, 10^{-4}, \ldots, 1\}$ of the Wasserstein ball as well as the Lasso penalty $\gamma \in \{0, \tfrac{1}{2} \cdot 10^{-5}, \ldots, \tfrac{1}{2} \cdot 10^{-4}, \ldots, \tfrac{1}{2}\}$ are selected via 5-fold cross-validation. We consider two variants of our DR logistic regression that employ a different output label weight ($\kappa = 1$ vs. $\kappa = m$) in the ground metric (cf. Definition 2). The results are reported in Table 1. (Run-times are provided in Appendix D, where we also describe how statistical significance was assessed.)

The table shows that for the unregularized models, the classical logistic regression achieves the lowest classification error in 21% of the instances, whereas our DR logistic regression achieves the lowest classification error in 36% and 79% of the instances for $\kappa = 1$ and $\kappa = m$, respectively. (The double-counting of ties means that these percentages do not sum to 100%.) For the regularized models, the results change to 14% (classical logistic regression) vs. 57% (each of our models). Within the numerical-feature Wasserstein DRO benchmarks, PI achieves the lowest classification error in 14% of the instances, whereas MT achieves the lowest classification error in 42% and 57% of the instances for $\kappa = 1$ and $\kappa = m$, respectively. Globally, one of our methods is the winning approach in 12 out of 14 data sets (in the remaining 2 they are the second best approach without statistically significant inferiority). In 9 of the data sets our methods win strictly (without a tie), and in 7 of these data sets the improvements are statistically significant over all other approaches.

5 Conclusions

We proposed a new DR mixed-feature logistic regression model where the proximity between the empirical distribution and the unknown true data-generating distribution is measured by the popular Wasserstein distance. Despite its exponential-size formulation, we prove that the underlying optimization problem can be solved in polynomial time, and we develop a practically efficient column-and-constraint generation scheme for its solution. The promising performance of our model is demonstrated in numerical experiments on standard benchmark instances. We note that our column-and-constraint scheme readily extends to other DR mixed-feature machine learning models such as linear regression, support vector machines and decision trees. Applying our algorithm in those settings is a promising direction for future research. A further interesting direction is the development of tailored solution schemes rather than using off-the-shelf solvers for solving our DR mixed-feature logistic regression problem.

Acknowledgments and Disclosure of Funding

Funding from the EPSRC grant EP/W003317/1 is gratefully acknowledged. The first author acknowledges support from The Alan Turing Institute. The authors are grateful for the comments and suggestions of the anonymous reviewers, which have helped to significantly improve the manuscript. This work has been produced without the involvement of any commercial entities that may cause a conflict of interest or competing interests.

References

[1] M. A. Abdullah, H. Ren, H. Bou-Ammar, V. Milenkovic, R. Luo, M. Zhang, and J. Wang. Wasserstein robust reinforcement learning. http://arxiv.org/abs/1907.13196.

[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 2017.
[3] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.

[4] D. Bertsimas and D. den Hertog. Robust and Adaptive Optimization. Dynamic Ideas, 2022.

[5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. https://doi.org/10.1137/141000671.

[6] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[7] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019. doi:10.1017/jpr.2019.49.

[8] S. Chakrabarti, H. Yiming, T. Li, S. Feizi, and X. Wu. Quantum Wasserstein generative adversarial networks. In Advances in Neural Information Processing Systems, volume 32, 2019.

[9] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):596–612, 2010.

[10] V. DeMiguel and F. J. Nogales. Portfolio selection with robust estimation. Operations Research, 57(3):560–577, 2009.

[11] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml.

[12] I. Dunning, J. Huchette, and M. Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.

[13] R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

[14] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, volume 28, 2015.

[15] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. https://arxiv.org/abs/1604.02199.

[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[17] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer, 1988.

[18] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

[19] Z. Hu and L. J. Hong. Kullback-Leibler divergence constrained distributionally robust optimization. http://www.optimization-online.org/DB_FILE/2012/11/3677.pdf.

[20] S. Kolouri, G. K. Rohde, and H. Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[21] D. Kuhn, P. Mohajerin Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS TutORials in Operations Research, pages 130–169, 2019.

[22] H. Lam. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105, 2019.

[23] R. O. Michaud. The Markowitz optimization enigma: Is optimized optimal? Financial Analysts Journal, 45(1):31–42, 1989.

[24] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1–2):115–166, 2018.

[25] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1–2):115–166, 2018.
[26] MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 9.0. http://docs.mosek.com/9.0/toolbox/index.html, 2019.

[27] K. P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.

[28] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.

[29] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328, 1997.

[30] S. A. Serrano. Algorithms for unsymmetric cone optimization and an implementation for problems with the exponential cone. PhD thesis, Stanford University, 2015.

[31] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28, 2015.

[32] S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.

[33] A. Shapiro. On duality theory of conic linear problems. In Semi-Infinite Programming, pages 135–165. Kluwer Academic Publishers, 2001.

[34] J. E. Smith and R. L. Winkler. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

[35] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher. Wasserstein propagation for semi-supervised learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[36] T. Sutter, A. Krause, and D. Kuhn. Robust generalization despite distribution shift via minimum discriminating information. In Advances in Neural Information Processing Systems, volume 34, pages 29754–29767, 2021.

[37] B. Taskesen, M.-C. Yue, J. Blanchet, D. Kuhn, and V. A. Nguyen. Sequential domain adaptation by synthesizing distributionally robust experts. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10162–10172, 2021.

[38] J. Weed and Q. Berthet. Estimation of smooth densities in Wasserstein distance. In Proceedings of the 32nd Conference on Learning Theory, 2019.

[39] W. Wiesemann, D. Kuhn, and M. Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.

[40] E. Wong, F. Schmidt, and Z. Kolter. Wasserstein adversarial examples via projected Sinkhorn iterations. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[41] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(51):1485–1510, 2009.

[42] M. Yue, D. Kuhn, and W. Wiesemann. On linear optimization over Wasserstein balls. Mathematical Programming, 2021.

[43] B. Zeng and L. Zhao. Solving two-stage robust optimization problems using a column-and-constraint generation method. Operations Research Letters, 41(5):457–461, 2013.

[44] J. Zhang, P. Ma, W. Zhong, and C. Meng. Projection-based techniques for high-dimensional optimal transport problems. WIREs Computational Statistics, e1587.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The abstract explains (i) the motivation for using Wasserstein DR optimization in machine learning, (ii) recent successes of this approach in logistic regression with numerical features and (iii) the lack of work on the mixed-feature setting; these explanations are further elaborated upon in Section 1. This section also provides a high-level summary of our contributions, which are in line with the developments in the remainder of the paper.

(b) Did you describe the limitations of your work? [Yes] Our numerical results have been conducted in an objective and reproducible way (choosing the most popular UCI data sets), and our experiments include data sets where the standard logistic regression outperforms our approach.

(c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work extends a well-known machine learning method (DR logistic regression) to a new but well-known setting (presence of categorical features). In light of the ethics review guidelines, we therefore believe that our work does not have any direct implications on society beyond the ones that are already known.

(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We have carefully reviewed the ethics guidelines on the NeurIPS 2022 conference webpage. We believe that our paper conforms to these guidelines.

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] All of our lemmas and theorems state the full set of assumptions, with the exception of the blanket assumptions that are assumed to hold throughout the paper and that are clearly marked as such.

(b) Did you include complete proofs of all theoretical results? [Yes] All proofs are included in the appendix of the paper.

3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All code, data and instructions for our experimental results are published on GitHub. To maintain anonymity during the review process, we did not provide a link until the camera-ready version.

(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The most important choices are described in the numerical section, while some of the details are relegated to the appendix.

(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] The runtime results provide confidence regions across multiple random runs, and we report error bars for all other experiments in the appendix.

(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Information about the employed CPU is provided in the numerical results section; details about the high-performance cluster that we used are relegated to the appendix. Note that all experiments are run in single-core mode.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes] We cite UCI for the data sets, MOSEK for the solution of the optimization problems as well as Julia (and JuMP) for the implementation of the algorithms.

(b) Did you mention the license of the assets? [Yes] We mention the licenses of UCI, MOSEK and Julia in the numerical results section.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] All code, data and instructions for our experimental results are published on GitHub. To maintain anonymity during the review process, we did not provide a link until the camera-ready version.

(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] We are using the most popular UCI machine learning data sets, for which, to the best of our knowledge, consent has already been given.

(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We are using the most popular UCI machine learning data sets, for which, to the best of our knowledge, this issue has already been resolved.

5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] Our research does not have any crowdsourcing or human subject components.

(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] Our research does not have any crowdsourcing or human subject components.

(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] Our research does not have any crowdsourcing or human subject components.