Rank-one Convexification for Sparse Regression

Journal of Machine Learning Research (2025) 1-50. Submitted 1/19; Revised 10/20, 10/22, 9/24.

Alper Atamtürk (atamturk@berkeley.edu), Department of Industrial Engineering & Operations Research, University of California, Berkeley, CA 94720, USA

Andrés Gómez (gomezand@usc.edu), Department of Industrial & Systems Engineering, University of Southern California, Los Angeles, CA 90089, USA

Editor: Vahab Mirrokni

Abstract

Sparse regression models are increasingly prevalent due to their ease of interpretability and superior out-of-sample performance. However, the exact model of sparse regression with an $\ell_0$ constraint restricting the support of the estimators is a challenging (NP-hard) non-convex optimization problem. In this paper, we derive new strong convex relaxations for sparse regression. These relaxations are based on the convex-hull formulations for rank-one quadratic terms with indicator variables. The new relaxations can be formulated as semidefinite optimization problems in an extended space and are stronger and more general than the state-of-the-art formulations, including the perspective reformulation and formulations with the reverse Huber penalty and the minimax concave penalty functions. Furthermore, the proposed rank-one strengthening can be interpreted as a non-separable, non-convex, unbiased sparsity-inducing regularizer, which dynamically adjusts its penalty according to the shape of the error function without inducing bias for the sparse solutions. In our computational experiments with benchmark datasets, the proposed conic formulations are solved within seconds and result in near-optimal solutions (with 0.4% optimality gap on average) for non-convex $\ell_0$-problems. Moreover, the resulting estimators also outperform alternative convex approaches, such as lasso and elastic net regression, from a statistical perspective, achieving high prediction accuracy and good interpretability.

Keywords: sparse regression, best subset selection, lasso, elastic net, conic formulations, non-convex regularization

1 Introduction

Given a model matrix $X = [x_1, \ldots, x_p] \in \mathbb{R}^{n\times p}$ of explanatory variables, a vector $y \in \mathbb{R}^n$ of response variables, regularization parameters $\lambda, \mu \ge 0$ and a desired sparsity level $k \in \mathbb{Z}_+$, we consider the least squares regression problem
$$\min_{\beta\in\mathbb{R}^p}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 \quad \text{s.t.} \quad \|\beta\|_0 \le k, \tag{1}$$
where $\|\beta\|_0$ denotes the cardinality of the support of $\beta$. Problem (1) encompasses a broad range of regression models. It includes as special cases: ridge regression (Hoerl and Kennard, 1970), when $\lambda > 0$, $\mu = 0$ and $k \ge p$; lasso (Tibshirani, 1996), when $\lambda = 0$, $\mu \ge 0$ and $k \ge p$; elastic net (Zou and Hastie, 2005), when $\lambda, \mu > 0$ and $k \ge p$; and best subset selection (Miller, 2002), when $\lambda = \mu = 0$ and $k < p$. Additionally, Bertsimas and Van Parys (2020) propose to solve (1) with $\lambda > 0$, $\mu = 0$ and $k < p$ for high-dimensional regression problems, while Mazumder et al. (2022) study (1) with $\lambda = 0$, $\mu > 0$ and $k < p$ for problems with low signal-to-noise ratios (SNR).
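When $k \ge p$, the cardinality constraint is vacuous and (1) is a convex problem that off-the-shelf solvers handle directly. As a point of reference for the relaxations developed later, the following minimal sketch solves this convex special case (the elastic net) with cvxpy; the data and parameter values are illustrative assumptions, not taken from the paper.

```python
import cvxpy as cp
import numpy as np

# Convex special case of (1): k >= p, so the l0 constraint can be dropped.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(n)  # sparse ground truth
lam, mu = 0.5, 0.5                                        # illustrative values

beta = cp.Variable(p)
obj = cp.sum_squares(y - X @ beta) + lam * cp.sum_squares(beta) + mu * cp.norm1(beta)
cp.Problem(cp.Minimize(obj)).solve()
print(np.round(beta.value, 3))  # shrunk, but typically not exactly k-sparse
```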
The results in this paper cover all versions of (1) with $k < p$; moreover, they can be extended to problems with non-separable regularizations of the form $\lambda\|A\beta\|_2^2 + \mu\|C\beta\|_1$, resulting in sparse variants of the fused lasso (Padilla et al., 2017; Tibshirani et al., 2005), the generalized lasso (Lin et al., 2014; Tibshirani and Taylor, 2011) and the smooth lasso (Hebiri et al., 2011), among others.

1.1 Regularization Techniques

The motivation and benefits of regularization and sparsity are well documented in the literature. In particular, generalization and interpretability are the key drivers of sparsity in machine learning models. Generalization refers to the ability of a model to perform well out-of-sample. The principle of parsimony postulates that observed phenomena often admit simple explanations, and thus sparse models are preferable as they are more likely to capture such explanations; in fact, Hastie et al. (2001) coined the "bet on sparsity" principle: use an inference procedure that performs well in sparse problems, since no procedure does well in dense problems. Interpretability, on the other hand, is becoming increasingly important due to the deployment of machine learning models in high-stakes situations (Rudin, 2019), where complex models can sometimes lead to results that are unfair, discriminatory, or otherwise undesirable. As such, interpretability is critical, for example, in healthcare (Ustun and Rudin, 2016, 2019; Wanjiru et al., 2021) and public policy (Aghaei et al., 2019; Azizi et al., 2018; Zeng et al., 2017) settings. Moreover, interpretable learning models are also preferable when the output of the model is intended to serve as input to a downstream decision-making process (Cozad et al., 2014; Lombardi et al., 2017; Maragno et al., 2021).

Best subset selection, with $k < p$ and $\lambda = \mu = 0$, is the direct approach to enforce sparsity without introducing bias. In contrast, ridge regression with $\lambda > 0$ (Tikhonov regularization) is known to induce shrinkage and bias, which can be desirable, for example, when $X$ is not orthogonal, but it does not encourage sparsity. Lasso, on the other hand, employs $\ell_1$ regularization with $\mu > 0$ to achieve both shrinkage and sparsity. However, the inability to control shrinkage and sparsity separately may result in subpar performance in some cases (Kelner et al., 2024; Miller, 2002; Zhang et al., 2008, 2012, 2014; Zhao and Yu, 2006; Zou, 2006). Moreover, achieving a target sparsity level $k$ with lasso requires significant experimentation with the penalty parameter $\mu$ (Chichignoud et al., 2016).

When $k \ge p$, the $\ell_0$ cardinality constraint is redundant, and (1) reduces to a convex optimization problem that can be solved easily. When $k < p$, however, problem (1) is non-convex and NP-hard (Natarajan, 1995); finding an optimal solution may thus require excessive computational effort, and methods that solve it approximately are used instead (Huang et al., 2018; Nevo and Ritov, 2017), often relying on local heuristics (Hazimeh and Mazumder, 2020; Zhu et al., 2022, 2020). Due to the perceived difficulties of tackling the non-convex $\ell_0$ constraint in (1), simpler lasso-type approaches continue to be preferred for inference problems with sparsity (Hastie et al., 2015), and have also been incorporated into other learning models such as deep neural networks (Cherepanova et al., 2024; Dinh and Ho, 2020; Lemhadri et al., 2021).
Nonetheless, there has been a substantial effort to develop sparsity-inducing methodologies that do not incur as much shrinkage and bias as lasso does. These efforts have led to optimization problems of the form
$$\min_{\beta\in\mathbb{R}^p}\; \|y - X\beta\|_2^2 + \sum_{i=1}^p \rho_i(\beta_i), \tag{2}$$
where the $\rho_i : \mathbb{R} \to \mathbb{R}$ are non-convex regularization functions. Examples of such regularization functions include the $\ell_q$ penalties with $0 < q < 1$ (Frank and Friedman, 1993) and the Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan and Li, 2001). Although optimal solutions of (2) with non-convex regularizations may substantially improve upon the estimators obtained by lasso, solving (2) to optimality is still a difficult task (Hunter and Li, 2005; Mazumder et al., 2011; Zou and Li, 2008), and suboptimal solutions may not exhibit the improved statistical properties. To mitigate such difficulties, Zhang et al. (2010) propose the minimax concave penalty (MC+), a class of sparsity-inducing penalty functions in which the non-convexity of $\rho$ is offset by the convexity of $\|y - X\beta\|_2^2$ for sufficiently sparse solutions, so that (2) remains convex; Zhang et al. (2010) refer to this property as sparse convexity. Thus, in the ideal scenario (and with proper tuning of the parameter controlling the concavity of $\rho$), the MC+ penalty is able to retain the sparsity and unbiasedness of best subset selection while preserving convexity, resulting in the best of both worlds. However, due to the separable form of the regularization term, the effectiveness of MC+ depends strongly on the diagonal dominance of the matrix $X^\top X$ (this dependency is discussed in more detail in Section 3), and performance may be poor when the diagonal dominance is low.

Yet, in many practical applications, the matrix $X^\top X$ has low eigenvalues and lacks diagonal dominance. To illustrate, Table 1 presents the diagonal dominance of five datasets from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou, 2017) used by Gómez and Prokopyev (2021) and Miyashiro and Takano (2015), as well as the diabetes dataset with all second-order interactions used by Bertsimas et al. (2016) and Efron et al. (2004). The diagonal dominance of a positive semidefinite matrix $A$ is computed as
$$dd(A) := \frac{1}{\mathrm{tr}(A)} \max_{d\in\mathbb{R}^p_+} e^\top d \quad \text{s.t.} \quad A - \mathrm{diag}(d) \succeq 0,$$
where $e$ is the $p$-dimensional vector of ones, $\mathrm{diag}(d)$ is the diagonal matrix with $\mathrm{diag}(d)_{ii} = d_i$, and $\mathrm{tr}(A)$ denotes the trace of $A$. Accordingly, the diagonal dominance is the trace of the largest diagonal matrix that can be extracted from $A$ without violating positive semidefiniteness, divided by the trace of $A$. Table 1 shows that the diagonal dominance of $X^\top X$ is low, or even 0%, in some instances, and MC+ struggles on these datasets, as we demonstrate in Section 5.

dataset        p    n      dd x 100%
housing        13   506    26.7%
servo          19   167     0.0%
auto MPG       25   392     1.5%
solar flare    26   1,066   8.8%
breast cancer  37   196     3.6%
diabetes       64   442     0.0%
crime          100  1,993  13.5%

Table 1: Diagonal dominance of $X^\top X$ for benchmark datasets.
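The optimization defining $dd(A)$ is a small semidefinite program; the following is a minimal cvxpy sketch (the test matrix and the implicit solver choice are illustrative assumptions):

```python
import cvxpy as cp
import numpy as np

def diagonal_dominance(A):
    """dd(A): largest total diagonal extractable from A while keeping
    A - diag(d) PSD, divided by tr(A). Requires an SDP solver (e.g., SCS)."""
    p = A.shape[0]
    d = cp.Variable(p, nonneg=True)
    prob = cp.Problem(cp.Maximize(cp.sum(d)), [A - cp.diag(d) >> 0])
    prob.solve()
    return prob.value / np.trace(A)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(diagonal_dominance(A))  # 0.5 here: diag(1, 1) is the most extractable
```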
1.2 Mixed-Integer Optimization Formulations

An alternative to using non-convex regularizers is to leverage recent advances in mixed-integer optimization (MIO) to tackle (1) directly (Bertsimas and King, 2015; Bertsimas et al., 2016; Cozad et al., 2014). By introducing binary indicator variables $z \in \{0,1\}^p$, where $z_i = \mathbb{1}[\beta_i \ne 0]$, problem (1) can be reformulated as
$$\begin{aligned} y^\top y + \min_{\beta,z,u}\;& -2y^\top X\beta + \beta^\top(X^\top X + \lambda I)\beta + \mu\sum_{i=1}^p u_i && (3a)\\ \text{s.t.}\;& \sum_{i=1}^p z_i \le k && (3b)\\ & \beta_i \le u_i,\ -\beta_i \le u_i, \quad i = 1,\ldots,p && (3c)\\ & \beta_i(1 - z_i) = 0, \quad i = 1,\ldots,p && (3d)\\ & \beta \in \mathbb{R}^p,\ z \in \{0,1\}^p,\ u \in \mathbb{R}^p_+. && (3e) \end{aligned}$$
The non-convexity of (1) is captured by the complementarity constraints (3d) and the integrality constraints $z \in \{0,1\}^p$. In fact, one of the main challenges in solving (3) is handling constraints (3d). A standard approach in the MIO literature is to replace (3d) with the so-called big-M constraints
$$-Mz_i \le \beta_i \le Mz_i \tag{4}$$
for a sufficiently large number $M$ bounding the variables $\beta_i$. However, these big-M constraints (4) are poor approximations of constraints (3d), especially in regression problems, where no natural big-M value is available. Bertsimas et al. (2016) propose approaches to compute provable big-M values, but such values often result in prohibitively large computational times even for problems with a few dozen variables (or, even worse, may lead to numerical instabilities and cause convex solvers to crash). Alternatively, heuristic big-M values can be estimated, e.g., by setting $M = \tau\|\hat\beta\|_\infty$, where $\tau \in \mathbb{R}_+$ and $\hat\beta$ is a feasible solution of (1) found via a heuristic method¹. While such heuristic values yield reasonable performance for small enough values of $\tau$, they may eliminate optimal solutions.

1. This method with $\tau = 2$ was used in the computations of Bertsimas et al. (2016).

Due to the modeling power of MIO, formulation (3) can be further extended to incorporate additional considerations such as outliers (Insolia et al., 2022) or graphical structures arising when learning Bayesian networks (Küçükyavuz et al., 2023).

Branch-and-bound algorithms for MIO leverage strong convex relaxations of problems to prune the search space and reduce the number of sub-problems to enumerate (and, in some cases, eliminate the need for enumeration altogether). Additionally, strong relaxations can also be used to design exact screening rules that quickly identify critical explanatory variables and provably eliminate irrelevant ones, reducing the dimension of the inference problems to be solved (Atamtürk and Gómez, 2020; Deza and Atamtürk, 2022). Thus, a critical step in speeding up the solution of (3) is to derive convex relaxations that approximate the non-convex problem well (Atamtürk and Narayanan, 2007). Such strong relaxations can also be used directly to find good estimators for the inference problems (without branch-and-bound); in fact, it is well known that the natural convex relaxation of (3) with $\lambda = \mu = 0$ and big-M constraints is precisely lasso; see Dong et al. (2015), for example. Therefore, sparsity-inducing techniques that more accurately capture the properties of the non-convex constraint $\|\beta\|_0 \le k$ can be found by deriving tighter convex relaxations of (1). Pilanci et al. (2015) exploit the Tikhonov regularization term and convex analysis to construct an improved convex relaxation using the reverse Huber penalty. In a similar vein, Bertsimas and Van Parys (2020) leverage Tikhonov regularization and duality to give an efficient algorithm for high-dimensional sparse regression. We refer the reader to the recent paper by Tillmann et al. (2024) for a survey of solution approaches to problems with sparsity constraints, such as (3).
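The following sketch states the big-M formulation (3)-(4) with cvxpy; a mixed-integer solver (e.g., CPLEX, Gurobi, or SCIP) must be installed and available to cvxpy, and the synthetic data and the heuristic choice of $M$ are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 10, 3
X = rng.standard_normal((n, p))
y = X[:, :k] @ np.ones(k) + 0.1 * rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
M = 3 * np.max(np.abs(beta_ols))          # heuristic big-M; may cut optima

beta = cp.Variable(p)
z = cp.Variable(p, boolean=True)          # indicator variables
cons = [cp.sum(z) <= k,                   # cardinality constraint (3b)
        beta <= M * z, beta >= -M * z]    # big-M constraints (4)
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)), cons)
prob.solve()                              # pass solver=... if none is found
print(np.round(beta.value, 3))
```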
1.3 The Perspective Relaxation

Problem (3) is a mixed-integer convex quadratic optimization problem with indicator variables, a class of problems that has received a fair amount of attention in the optimization literature. In particular, the perspective relaxation (Aktürk et al., 2009; Frangioni and Gentile, 2006; Günlük and Linderoth, 2010) is by now a standard technique for substantially strengthening convex relaxations by exploiting separable quadratic terms. Specifically, consider the mixed-integer epigraph of a one-dimensional quadratic function with an indicator constraint,
$$Q_1 = \left\{ z \in \{0,1\},\ \beta \in \mathbb{R},\ t \in \mathbb{R}_+ : \beta_i^2 \le t,\ \beta_i(1 - z_i) = 0 \right\}.$$
The convex hull of $Q_1$ is obtained by relaxing the integrality constraint to a bound constraint and using the closure of the perspective function² of $\beta_i^2$, expressed as a rotated cone constraint:
$$\mathrm{cl\,conv}(Q_1) = \left\{ z \in [0,1],\ \beta \in \mathbb{R},\ t \in \mathbb{R}_+ : \frac{\beta_i^2}{z_i} \le t \right\}.$$

2. We use the convention that $\beta_i^2/z_i = 0$ when $\beta_i = z_i = 0$ and $\beta_i^2/z_i = \infty$ if $z_i = 0$ and $\beta_i \ne 0$.

Xie and Deng (2020) apply the perspective relaxation to the separable quadratic regularization term $\lambda\|\beta\|_2^2$, i.e., they reformulate (3) as
$$\begin{aligned} y^\top y + \min_{\beta,z,u}\;& -2y^\top X\beta + \beta^\top X^\top X\beta + \lambda\sum_{i=1}^p \frac{\beta_i^2}{z_i} + \mu\sum_{i=1}^p u_i && (5a)\\ \text{s.t.}\;& \sum_{i=1}^p z_i \le k && (5b)\\ & \beta_i \le u_i,\ -\beta_i \le u_i, \quad i = 1,\ldots,p && (5c)\\ & \beta \in \mathbb{R}^p,\ z \in \{0,1\}^p,\ u \in \mathbb{R}^p_+. && (5d) \end{aligned}$$
Moreover, they show that the continuous relaxation of (5) is equivalent to the continuous relaxation of the formulation used by Bertsimas and Van Parys (2020). Dong et al. (2015) also study the perspective relaxation in the context of regression. First, they show that using the reverse Huber penalty (Pilanci et al., 2015) is, in fact, equivalent to solving the convex relaxation of (5); thus the relaxations of Bertsimas and Van Parys (2020), Pilanci et al. (2015), and Xie and Deng (2020) all coincide. Second, they propose an optimal perspective relaxation, obtained by applying the perspective relaxation to a separable quadratic function $\beta^\top D\beta$, where $D$ is a nonnegative diagonal matrix such that $X^\top X + \lambda I - D \succeq 0$. Finally, they show that solving this stronger convex relaxation, the optimal perspective relaxation, is, in fact, equivalent to using the MC+ penalty (Zhang et al., 2010). However, the authors also point out that if $\lambda = 0$ and a suitable matrix $D$ cannot be found, then the optimal perspective relaxation reduces to lasso. For example, from Table 1, we see that the optimal perspective relaxation reduces to lasso for the servo and diabetes datasets.

The perspective relaxation is now a state-of-the-art method to convexify problems with separable terms and indicator variables. Recent works have extended perspective relaxations to more general classes of problems, such as inference problems with grouped variable selection constraints (Hazimeh et al., 2023b). However, there are relatively few convexification techniques for problems without separable terms (Frangioni et al., 2020; Gómez, 2021; Han et al., 2023; Jeon et al., 2017). In fact, among the previously discussed methods for sparse regression, the optimal perspective relaxation of Dong et al. (2015) is the only one that does not explicitly require the Tikhonov regularization $\lambda\|\beta\|_2^2$. Nonetheless, as the authors point out, if $\lambda = 0$ then the method is effective only when the matrix $X^\top X$ is sufficiently diagonally dominant, which, as illustrated in Table 1, is not necessarily the case in practice. As a consequence, perspective relaxation techniques may be insufficient to tackle problems where large shrinkage is undesirable and $\lambda$ is small.
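A sketch of the continuous relaxation of (5) in cvxpy, using quad_over_lin for the perspective terms $\lambda\beta_i^2/z_i$ (X and y are taken from the previous sketch; the parameter values are illustrative assumptions):

```python
import cvxpy as cp

lam, mu, k = 0.5, 0.1, 3
p = X.shape[1]
beta, z = cp.Variable(p), cp.Variable(p)
# quad_over_lin(beta[i], z[i]) is exactly the perspective beta_i^2 / z_i.
persp = cp.sum([lam * cp.quad_over_lin(beta[i], z[i]) for i in range(p)])
obj = cp.sum_squares(y - X @ beta) + persp + mu * cp.norm1(beta)
cp.Problem(cp.Minimize(obj), [z >= 0, z <= 1, cp.sum(z) <= k]).solve()
```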
1.4 Contributions

In this paper we derive stronger convex relaxations of (3) than the optimal perspective relaxation. These relaxations are obtained from the study of ideal (convex-hull) formulations of the mixed-integer epigraphs of non-separable rank-one quadratic functions with indicators. Since the perspective relaxation corresponds to the ideal formulation of a one-dimensional rank-one quadratic function, the proposed relaxations generalize and strengthen the existing results. In particular, they dominate perspective relaxation approaches for all values of the regularization parameter $\lambda$ and, critically, are able to achieve high-quality approximations of (1) even in low diagonal dominance settings with $\lambda = 0$. Alternatively, our results can also be interpreted as a new non-separable, non-convex, unbiased regularization penalty $\rho_{R1}(\beta)$ which: (i) imposes larger penalties on dense estimators than the separable minimax concave penalty $\rho_{MC+}(\beta)$ of Zhang et al. (2010), thus achieving better sparsity-inducing properties; and (ii) has its non-convexity offset by the convexity of the term $\|y - X\beta\|_2^2$, so that the resulting continuous problem can be solved to global optimality using convex optimization tools. In fact, the relaxations can be formulated as semidefinite optimization problems and, in certain special cases, as conic quadratic optimization problems. We point out that, in recent papers, some of the relaxations proposed here have been extended to more general settings with non-quadratic functions, low-rank (but not necessarily rank-one) functions, and side constraints (Han and Gómez, 2024; Shafiee and Kılınç-Karzan, 2024).

To illustrate the regularization point of view for the proposed relaxations, consider a two-predictor regression problem in Lagrangian form:
$$\min_{\beta\in\mathbb{R}^2}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \kappa\|\beta\|_0, \tag{6}$$
where
$$X^\top X = \begin{pmatrix} 1+\delta & 1\\ 1 & 1+\delta \end{pmatrix}$$
and $\delta \ge 0$ is a parameter controlling the diagonal dominance. Figure 1 depicts the graphs of well-known regularizations, including lasso ($\lambda = \kappa = 0$, $\mu = 1$), ridge ($\mu = \kappa = 0$, $\lambda = 1$), elastic net ($\kappa = 0$, $\lambda = \mu = 0.5$), the MC+ penalty for different values of $\delta$, and the proposed rank-one R1 regularization. The graphs of MC+ and R1 are obtained by setting $\lambda = \mu = 0$ and $\kappa = 1$ and using the appropriate convex strengthening; see Section 3 for details. Observe that the R1 regularization results in larger penalties than MC+ for all values of $\delta$, and the improvement increases as $\delta \to 0$.

In addition, Figure 2 shows the effect of using the lasso constraint $\|\beta\|_1 \le k$, the MC+ constraint $\rho_{MC+}(\beta) \le k$, and the rank-one constraint $\rho_{R1}(\beta) \le k$ in a two-dimensional problem to achieve sparse solutions satisfying $\|\beta\|_0 \le 1$. Specifically, let $\epsilon^* = \min_{\|\beta\|_0 \le 1} \|y - X\beta\|_2^2$ be the minimum residual error of a sparse solution of the least squares problem. Figure 2 shows in gray the (possibly dense) points satisfying $\|y - X\beta\|_2^2 \le \epsilon^*$, and it shows in color the set of feasible points satisfying $\rho(\beta) \le k$, where $\rho$ is a given regularization and $k$ is chosen so that the feasible region (color) intersects the level sets (gray). We see that neither lasso nor MC+ is able to exactly recover an optimal sparse solution for any diagonal dominance parameter $\delta$, despite significant shrinkage ($k < 1$). In contrast, the rank-one constraint $\rho_{R1}(\beta) \le k$ adapts to the curvature of the error function $\|y - X\beta\|_2^2$ to induce higher sparsity: in particular, the natural constraint $\rho_{R1}(\beta) \le 1$, with the target sparsity $k = 1$, results in exact recovery without shrinkage in all cases. Finally, Figure 3 shows the strength of the relaxations of (1) discussed in this paper.
The big-M relaxation is the natural convex relaxation of (3) obtained by replacing $z \in \{0,1\}^p$ with $z \in [0,1]^p$, used by Bertsimas et al. (2016) and Cozad et al. (2014). The perspective relaxation is the natural convex relaxation of (5), which is the basis of several recent methods (Bertsimas and Van Parys, 2020; Hazimeh et al., 2022, 2023a; Pilanci et al., 2015; Xie and Deng, 2020); note that this formulation may only be used if $\lambda > 0$. The optimal perspective relaxation, also referred to as sdp1 in this paper, was explicitly given by Dong et al. (2015). Interestingly, it was recently shown (Han et al., 2022) that sdp1 is equivalent to the standard Shor's SDP relaxation (Shor, 1987) of problem (3), a convex relaxation that has proven very effective at approximating discrete optimization problems (Goemans and Williamson, 1995). This paper proposes the new relaxations sdpr, discussed in Section 2, which dominate all existing relaxations in terms of strength. It also proposes the new formulation sdp_LB, discussed in Section 4, which is easier to solve than sdpr but still compares favorably with the big-M and perspective formulations.

Figure 1 (panels (d)-(f): MC+ with $\delta = 1.0, 0.3, 0.1$; panels (g)-(i): R1 with $\delta = 1.0, 0.3, 0.1$): Graphs of regularization penalties with $p = 2$. The horizontal axes correspond to values of $\beta_1$ and $\beta_2$, and the vertical axis corresponds to the regularization penalty. The ridge, elastic net, and lasso (top row) regularizations do not depend on the diagonal dominance, but induce substantial bias. The MC+ regularization (second row) does not induce as much bias, but it depends on the diagonal dominance ($\delta$). The new non-separable, non-convex R1 regularization (bottom row) induces larger penalties than MC+ for all diagonal dominance values and is a closer approximation of the exact $\ell_0$ penalty.

Figure 2 (panels (a)-(c): $\|\beta\|_1 \le 0.60, 0.84, 0.96$ for $\delta = 1.0, 0.3, 0.1$; panels (d)-(f): $\rho_{MC+}(\beta)$ and $\rho_{R1}(\beta)$ bounded by $0.95, 0.77, 0.53$ for $\delta = 1.0, 0.3, 0.1$; panels (g)-(i): $\rho_{R1}(\beta) \le 1.00$ for $\delta = 1.0, 0.3, 0.1$): The axes correspond to the sparse solutions satisfying $\|\beta\|_0 \le 1$. In gray: level sets given by $\|y - X\beta\|_2^2 \le \epsilon^*$; in red: feasible region for $\|\beta\|_1 \le k$; in green: feasible region for $\rho_{MC+}(\beta) \le k$; in blue: feasible region for $\rho_{R1}(\beta) \le k$. All lasso and MC+ solutions above are dense even with significant shrinkage ($k < 1$). The rank-one constraint attains sparse solutions on the axes with no shrinkage ($k = 1$) for all diagonal dominance values $\delta$.

Outline. The rest of the paper is organized as follows. In Section 2 we derive convex relaxations based on ideal formulations for rank-one quadratic terms with indicator variables. We also give an interpretation of the convex relaxations as unbiased regularizers and propose an explicit semidefinite optimization (SDP) formulation in an extended space, which can be implemented with off-the-shelf conic optimization solvers. In Section 3 we derive an explicit form of the regularization penalty for the two-dimensional case. In Section 4 we discuss the implementation of the proposed relaxations in a conic quadratic framework.
In Section 5 we present computational experiments with synthetic as well as benchmark datasets, demonstrating that (i) the proposed formulations deliver near-optimal solutions (with provable optimality gaps) of (1) in most cases, and (ii) using the proposed convex relaxations results in superior statistical performance when compared with the usual convex estimators such as elastic net. In Section 6, we conclude the paper with a few final remarks.

Figure 3: Strength of the relaxations discussed in the paper (big-M relaxation, perspective, optimal perspective (MC+/Shor), and the new formulations). An arrow from A to B indicates that B is a stronger relaxation than A, i.e., a better approximation of the non-convex problem (1). Blue boxes correspond to the new formulations proposed in this paper.

Notation. Define $P = \{1,\ldots,p\}$ and let $e \in \mathbb{R}^p$ be the vector of ones. Given $T \subseteq P$ and a vector $a \in \mathbb{R}^p$, define $a_T$ as the subvector of $a$ induced by $T$, $a_i = a_{\{i\}}$ as the $i$-th element of $a$, and $a(T) = \sum_{i\in T} a_i$. Given a symmetric matrix $A \in \mathbb{R}^{p\times p}$, let $A_T$ be the submatrix of $A$ induced by $T \subseteq P$, and let $\mathbb{S}^T_+$ be the set of $|T|\times|T|$ symmetric positive semidefinite matrices, i.e., $A_T \succeq 0 \iff A_T \in \mathbb{S}^T_+$. We write $a_T$ or $A_T$ to make explicit that a given vector or matrix is indexed by the elements of $T$ or $T\times T$, respectively. Given matrices $A$, $B$ of the same dimension, $A \circ B$ denotes the Hadamard product of $A$ and $B$, and $\langle A, B\rangle$ denotes their inner product. Given a vector $a \in \mathbb{R}^n$, let $\mathrm{diag}(a)$ be the $n\times n$ diagonal matrix $A$ with $A_{ii} = a_i$. For a set $X \subseteq \mathbb{R}^p$, $\mathrm{cl\,conv}(X)$ denotes the closure of the convex hull of $X$. Throughout the paper, we adopt the following convention for division by $0$: given a scalar $s \ge 0$, $s/0 = \infty$ if $s > 0$ and $s/0 = 0$ if $s = 0$. For a scalar $a \in \mathbb{R}$, let $\mathrm{sign}(a) = a/|a|$.

2 Convexification

In this section we introduce the proposed relaxations of problem (1). First, in Section 2.1, we describe the ideal relaxations for the mixed-integer epigraph of a rank-one quadratic term. Then, in Section 2.2, we use the relaxations derived in Section 2.1 to give strong relaxations of (1). Next, in Section 2.3, we give an interpretation of the proposed relaxations as unbiased sparsity-inducing regularizations. In Section 2.4 we present an explicit SDP representation of the proposed relaxations in an extended space. Finally, in Section 2.5, we comment on the strength of the proposed relaxations.

2.1 Rank-one Case

We first give a valid inequality for the mixed-integer epigraph of a convex quadratic function defined over the subsets of $P$. Given $A_T \in \mathbb{S}^T_+$, consider the set
$$Q_T = \left\{ (z,\beta,t) \in \{0,1\}^{|T|}\times\mathbb{R}^{|T|}\times\mathbb{R}_+ : \beta^\top A_T\beta \le t,\ \beta_i(1 - z_i) = 0,\ i \in T \right\}.$$

Proposition 1. The inequality
$$\beta^\top A_T\beta \le t\,z(T) \tag{7}$$
is valid for $Q_T$.

Proof. Let $(z,\beta,t) \in Q_T$; we verify that inequality (7) is satisfied. First observe that if $z = 0$, then $\beta = 0$ and inequality (7) reduces to $0 \le 0$, which is satisfied. Otherwise, if $z_i = 1$ for some $i \in T$, then $z(T) \ge 1$ and we find that $\beta^\top A_T\beta \le t \le t\,z(T)$, so inequality (7) is satisfied again. ∎

Observe that if $T$ is a singleton, i.e., $T = \{i\}$, then (7) reduces to the well-known perspective inequality $A_{ii}\beta_i^2 \le tz_i$. Moreover, if $T = \{i,j\}$ and $A_T$ is rank-one, i.e., $A_T = a_Ta_T^\top$ with $a_T = (a_i\ a_j)^\top$, so that $\beta^\top A_T\beta = |A_{ij}|\left(a\beta_i^2 + 2\,\mathrm{sign}(A_{ij})\beta_i\beta_j + \beta_j^2/a\right)$ for $A_{ij} = a_ia_j$ and $a = |a_i/a_j|$, then (7) reduces to
$$|A_{ij}|\left(a\beta_i^2 + 2\,\mathrm{sign}(A_{ij})\beta_i\beta_j + \beta_j^2/a\right) \le t(z_i + z_j), \tag{8}$$
one of the inequalities proposed by Jeon et al. (2017) in the context of quadratic optimization with indicators and bounded continuous variables.
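A quick numerical sanity check of Proposition 1 (an illustrative sketch; the matrix and the sampling scheme are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
q = 3                                   # |T|
L = rng.standard_normal((q, q))
A = L @ L.T                             # a random PSD matrix A_T
for _ in range(1000):
    z = rng.integers(0, 2, size=q)      # binary indicators
    beta = rng.standard_normal(q) * z   # enforce beta_i (1 - z_i) = 0
    t = beta @ A @ beta                 # smallest t in the epigraph
    # Proposition 1: beta' A beta <= t * z(T) holds on all of Q_T.
    assert beta @ A @ beta <= t * z.sum() + 1e-9
```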
Note that inequality (8) is, in general, weak for bounded continuous variables, as non-negativity or other bounds can be used to strengthen the inequalities (Atamtürk and Gómez, 2018); likewise, inequality (7) is, in general, weak for arbitrary matrices $A_T \in \mathbb{S}^T_+$. Nonetheless, as we show next, inequality (7) is sufficient to describe the ideal (convex hull) formulation of $Q_T$ if $A_T = a_Ta_T^\top$ is a rank-one matrix. Consider the special case of $Q_T$ defined with a rank-one matrix:
$$Q^{r1}_T = \left\{ (z,\beta,t) \in \{0,1\}^{|T|}\times\mathbb{R}^{|T|}\times\mathbb{R}_+ : (a_T^\top\beta)^2 \le t,\ \beta_i(1 - z_i) = 0,\ i \in T \right\}.$$

Theorem 2. If $a_i \ne 0$ for all $i \in T$, then
$$\mathrm{cl\,conv}(Q^{r1}_T) = \left\{ (z,\beta,t) \in [0,1]^{|T|}\times\mathbb{R}^{|T|}\times\mathbb{R}_+ : (a_T^\top\beta)^2 \le t,\ (a_T^\top\beta)^2 \le t\,z(T) \right\}.$$

Proof. Consider the optimization of an arbitrary linear function over $Q^{r1}_T$ and over
$$\bar Q_T := \left\{ (z,\beta,t) \in [0,1]^{|T|}\times\mathbb{R}^{|T|}\times\mathbb{R}_+ : (a_T^\top\beta)^2 \le t,\ (a_T^\top\beta)^2 \le t\,z(T) \right\},$$
namely
$$\min_{(z,\beta,t)\in Q^{r1}_T} u_T^\top z + v_T^\top\beta + \kappa t, \tag{9}$$
$$\min_{(z,\beta,t)\in \bar Q_T} u_T^\top z + v_T^\top\beta + \kappa t, \tag{10}$$
where $u_T, v_T \in \mathbb{R}^{|T|}$ and $\kappa \in \mathbb{R}$. We show that either there exists an optimal solution of (10) that is feasible for (9), and hence also optimal for (9) since $\bar Q_T$ is a relaxation of $Q^{r1}_T$, or that (9) and (10) are both unbounded.

Observe that if $\kappa < 0$, then letting $z = \beta = 0$ and $t \to \infty$, we see that both problems are unbounded. If $\kappa = 0$ and $v_T = 0$, then (10) reduces to $\min_{z\in[0,1]^{|T|}} u_T^\top z$, which has an optimal integral solution $z^*$, and $(z^*, 0, 0)$ is optimal for both (9) and (10). If $\kappa = 0$ and $v_i \ne 0$ for some $i \in T$, then letting $\beta_i \to \pm\infty$ (with the appropriate sign), $z_i = 1$, and $\beta_j = z_j = t = 0$ for $j \ne i$, we find that both problems are unbounded. Thus, we may assume without loss of generality that $\kappa > 0$ and, by scaling, that $\kappa = 1$. Additionally, as $a_T$ has no zero entry, we may assume without loss of generality that $a_T = e_T$, since otherwise $\beta$ and $v_T$ can be scaled by letting $\bar\beta_i = a_i\beta_i$ and $\bar v_i = v_i/a_i$ to arrive at an equivalent problem. Moreover, a necessary condition for (9)-(10) to be bounded is that
$$-\infty < \min_{\beta\in\mathbb{R}^{|T|}} v_T^\top\beta \quad \text{s.t.} \quad \beta(T) = \zeta \tag{11}$$
for any fixed $\zeta \in \mathbb{R}$. It is easily seen that (11) has an optimal solution if and only if $v_i = v_j$ for all $i \ne j$. Thus, we may also assume without loss of generality that $v_T^\top\beta = v_0\beta(T)$ for some scalar $v_0$. Performing the above simplifications, we find that (10) reduces to
$$\min_{z\in[0,1]^{|T|},\,\beta\in\mathbb{R}^{|T|},\,t\in\mathbb{R}} u_T^\top z + v_0\beta(T) + t \quad \text{s.t.} \quad \beta(T)^2 \le t,\ \beta(T)^2 \le t\,z(T). \tag{12}$$
Since the one-dimensional problem $\min_{\beta\in\mathbb{R}} v_0\beta + \beta^2$ has an optimal solution, it follows that (12) is bounded and has an optimal solution. We now prove that (12) has an optimal solution that is integral in $z$ and satisfies $\beta \circ (e - z) = 0$. Let $(z^*, \beta^*, t^*)$ be an optimal solution of (12). First note that if $0 < z^*(T) < 1$, then $(\gamma z^*, \gamma\beta^*, \gamma t^*)$ is feasible for (10) for $\gamma$ sufficiently close to $1$, with objective value $\gamma\left(u_T^\top z^* + v_0\beta^*(T) + t^*\right)$. If $u_T^\top z^* + v_0\beta^*(T) + t^* \ge 0$, then for $\gamma = 0$ the solution $(\gamma z^*, \gamma\beta^*, \gamma t^*)$ has an objective value that is equal or lower. Otherwise, for $\gamma = 1/z^*(T)$, $(\gamma z^*, \gamma\beta^*, \gamma t^*)$ is feasible and has a lower objective value. Thus, we find that either $0$ is optimal for (12) (and the proof is complete), or there exists an optimal solution with $z^*(T) \ge 1$. In the latter case, observe that any $(\bar z, \beta^*, t^*)$ with $\bar z \in \arg\min\{u_T^\top z : z(T) \ge 1,\ z \in [0,1]^{|T|}\}$ is also optimal for (12), and in particular there exists an optimal solution with $\bar z$ integral. Finally, let $i \in T$ be any index with $\bar z_i = 1$. Setting $\bar\beta_i = \beta^*(T)$ and $\bar\beta_j = 0$ for $j \ne i$, we find another optimal solution $(\bar z, \bar\beta, t^*)$ of (12) that satisfies the complementarity constraints, and thus is feasible and optimal for (9). ∎
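The two inequalities of Theorem 2 are directly representable in a conic modeling language. Below is a small cvxpy sketch minimizing a linear function over $\mathrm{cl\,conv}(Q^{r1}_T)$; the data are illustrative assumptions, with $v$ chosen parallel to $a$ so that the problem is bounded, as required in the proof.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
q = 3
a = rng.standard_normal(q) + 2.0        # nonzero entries, as in Theorem 2
u = rng.standard_normal(q)
v = 0.3 * a                             # v parallel to a keeps (10) bounded

z = cp.Variable(q)
beta = cp.Variable(q)
t = cp.Variable(nonneg=True)
cons = [z >= 0, z <= 1,
        cp.square(a @ beta) <= t,                     # (a' beta)^2 <= t
        cp.quad_over_lin(a @ beta, cp.sum(z)) <= t]   # (a' beta)^2 <= t z(T)
cp.Problem(cp.Minimize(u @ z + v @ beta + t), cons).solve()
print(z.value, beta.value, t.value)
```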
Remark 3. Observe that describing $\mathrm{cl\,conv}(Q^{r1}_T)$ requires two nonlinear inequalities in the original space of variables. More compactly, we can specify $\mathrm{cl\,conv}(Q^{r1}_T)$ using a single convex inequality, as
$$\mathrm{cl\,conv}(Q^{r1}_T) = \left\{ (z,\beta,t) \in [0,1]^{|T|}\times\mathbb{R}^{|T|}\times\mathbb{R}_+ : \frac{(a_T^\top\beta)^2}{\min\{1, z(T)\}} \le t \right\}.$$
Finally, we point out that $\mathrm{cl\,conv}(Q^{r1}_T)$ is conic quadratic representable, as $(z,\beta,t) \in \mathrm{cl\,conv}(Q^{r1}_T)$ if and only if there exists $w$ such that the system
$$z \in [0,1]^{|T|},\ \beta \in \mathbb{R}^{|T|},\ t \in \mathbb{R}_+,\ w \in \mathbb{R}_+,\quad w \le 1,\ w \le z(T),\ (a_T^\top\beta)^2 \le tw$$
is feasible, where the last constraint is a rotated conic quadratic constraint and all other constraints are linear.

2.2 General Case

Now consider again the mixed-integer optimization (3), rewritten as
$$\begin{aligned} y^\top y + \min_{\beta,z,u,t}\;& -2y^\top X\beta + \mu e^\top u + t && (13a)\\ \text{s.t.}\;& \beta^\top(X^\top X + \lambda I)\beta \le t && (13b)\\ & e^\top z \le k && (13c)\\ & \beta \le u,\ -\beta \le u && (13d)\\ & \beta \circ (e - z) = 0 && (13e)\\ & \beta \in \mathbb{R}^p,\ z \in \{0,1\}^p,\ u \in \mathbb{R}^p_+,\ t \in \mathbb{R}, && (13f) \end{aligned}$$
where the nonlinear term of the objective is moved to constraint (13b). A direct application of (7) yields the inequality $\beta^\top(X^\top X + \lambda I)\beta \le t\,z(P)$, which is weak and has no effect when $z(P) \ge 1$. Instead, a more effective approach is to decompose the matrix $X^\top X + \lambda I$ into a sum of low-dimensional rank-one matrices, and to use inequality (7) to strengthen each quadratic term in the decomposition separately, as illustrated in Example 1 below.

Example 1. Consider the example with $p = 3$ and
$$X^\top X + \lambda I = \begin{pmatrix} 25 & 15 & -5\\ 15 & 18 & 0\\ -5 & 0 & 11 \end{pmatrix}.$$
It follows that $\beta^\top(X^\top X + \lambda I)\beta = (5\beta_1 + 3\beta_2 - \beta_3)^2 + (3\beta_2 + \beta_3)^2 + 9\beta_3^2$, and we have the corresponding valid inequality
$$\frac{(5\beta_1 + 3\beta_2 - \beta_3)^2}{\min\{1, z_1 + z_2 + z_3\}} + \frac{(3\beta_2 + \beta_3)^2}{\min\{1, z_2 + z_3\}} + \frac{9\beta_3^2}{z_3} \le t. \tag{14}$$

The decomposition of $X^\top X + \lambda I$ illustrated in Example 1 is not unique. Since one does not obtain a strengthening when the denominator is one, it is important to have a decomposition that is both rank-one and sparse. This motivates the question of how to find a decomposition that results in the best convex relaxation, i.e., one that maximizes the left-hand side of (14). Specifically, let $\mathcal{P} \subseteq 2^P$ be a subset of the power set of $P$, i.e., $\mathcal{P} = \{T_1, \ldots, T_m\}$ with $T_h \subseteq P$, $h = 1,\ldots,m$. For each $h$, define a matrix variable $A_h$ whose nonzero elements correspond to the submatrix induced by $T_h$, and consider the valid inequality $\phi_{\mathcal{P}}(z,\beta) \le t$, where $\phi_{\mathcal{P}} : [0,1]^p\times\mathbb{R}^p \to \mathbb{R}$ is defined as
$$\begin{aligned} \phi_{\mathcal{P}}(z,\beta) := \max_{A_h, R}\;& \beta^\top R\beta + \sum_{h=1}^m \frac{\beta^\top A_h\beta}{\min\{1, z(T_h)\}} && (15a)\\ \text{s.t.}\;& \sum_{h=1}^m A_h + R = X^\top X + \lambda I && (15b)\\ & (A_h)_{ij} = 0, \quad h = 1,\ldots,m,\ \ i \notin T_h \text{ or } j \notin T_h && (15c)\\ & A_h \in \mathbb{S}^P_+, \quad h = 1,\ldots,m && (15d)\\ & R \in \mathbb{S}^P_+, && (15e) \end{aligned}$$
where strengthening (7) is applied to each low-dimensional quadratic term $\beta^\top A_h\beta$. For a fixed value of $(z,\beta)$, problem (15) finds the best decomposition of the matrix $X^\top X + \lambda I$ into a sum of positive semidefinite matrices $A_h$, $h = 1,\ldots,m$, and a positive semidefinite remainder matrix $R$, so as to maximize the strengthening. For a given decomposition, the objective (15a) is convex in $(z,\beta)$; thus $\phi_{\mathcal{P}}$ is a supremum of convex functions and is convex on its domain. Observe that the inclusion or omission of the empty set does not affect the function $\phi_{\mathcal{P}}$, and we assume for simplicity that $\emptyset \in \mathcal{P}$. Since inequalities (7) are ideal for rank-one matrices, the inequality $\phi_{\mathcal{P}}(z,\beta) \le t$ is particularly strong if the matrices $A_h$ are rank-one in optimal solutions of (15). As we now show, this is indeed the case if $\mathcal{P}$ is downward closed.

Proposition 4. If $\mathcal{P}$ is downward closed, i.e., $V \in \mathcal{P} \Rightarrow U \in \mathcal{P}$ for all $U \subseteq V$, then there exists an optimal solution to (15) in which all matrices $A_h$ are rank-one.
Proof. Let $T \in \mathcal{P}$, let $A_T$ be the matrix variable associated with $T$, and suppose that $A_T$ is not rank-one in an optimal solution of (15); suppose also, for simplicity, that $T = \{1,\ldots,p_0\}$ for some $p_0 \le p$, and let $\tilde T_i = \{i,\ldots,p_0\}$ for $i = 1,\ldots,p_0$. Since $A_T$ is positive semidefinite, there exists a Cholesky decomposition $A_T = LL^\top$, where $L$ is a lower triangular matrix (possibly with zeros on the diagonal if $A_T$ is not positive definite). Let $L_i$ denote the $i$-th column of $L$. Since $A_T$ is not a rank-one matrix, there exist at least two non-zero columns of $L$; let $L_j$ with $j > 1$ be the second non-zero column. Then
$$\frac{\beta_T^\top A_T\beta_T}{\min\{1, z(T)\}} = \frac{\beta_T^\top\big(\sum_{i\ne j} L_iL_i^\top\big)\beta_T}{\min\{1, z(T)\}} + \frac{\beta_T^\top(L_jL_j^\top)\beta_T}{\min\{1, z(T)\}} \le \frac{\beta_T^\top\big(\sum_{i\ne j} L_iL_i^\top\big)\beta_T}{\min\{1, z(T)\}} + \frac{\beta_T^\top(L_jL_j^\top)\beta_T}{\min\{1, z(\tilde T_j)\}}, \tag{16}$$
where the inequality holds because $L_j$ has zero entries in its first $j-1$ coordinates, so that $\beta_T^\top(L_jL_j^\top)\beta_T$ only involves the variables indexed by $\tilde T_j$, and $z(\tilde T_j) \le z(T)$. Finally, since $\tilde T_j \in \mathcal{P}$, the (better) decomposition in (16) is feasible for (15), and the proposition is proven. ∎

By dropping the complementarity constraints (13e), replacing the integrality constraints $z \in \{0,1\}^p$ with bound constraints $z \in [0,1]^p$, and utilizing the convex function $\phi_{\mathcal{P}}$ to reformulate (13b), we obtain the convex relaxation of (1)
$$\begin{aligned} y^\top y + \min_{\beta,z,u}\;& -2y^\top X\beta + \mu e^\top u + \phi_{\mathcal{P}}(z,\beta) && (17a)\\ \text{s.t.}\;& e^\top z \le k && (17b)\\ & \beta \le u,\ -\beta \le u && (17c)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+ && (17d) \end{aligned}$$
for a given $\mathcal{P} \subseteq 2^P$. In the next section, we give an interpretation of formulation (17) as a sparsity-inducing regularization penalty.

2.3 Interpretation as Regularization

Note that relaxation (17) can be rewritten as
$$\min_{\beta\in\mathbb{R}^p}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \rho_{R1}(\beta; k),$$
where
$$\rho_{R1}(\beta; k) := \min_{z\in[0,1]^p}\; \phi_{\mathcal{P}}(z,\beta) - \beta^\top(X^\top X + \lambda I)\beta \quad \text{s.t.} \quad e^\top z \le k \tag{18}$$
is the (non-convex) rank-one regularization penalty. Observe that $\rho_{R1}(\beta; k)$ is the difference of two convex functions: the quadratic function $\beta^\top(X^\top X + \lambda I)\beta$ arising from the fitness term and the Tikhonov regularization; and the projection of its convexification $\phi_{\mathcal{P}}(z,\beta)$ onto the original space of the regression variables $\beta$. As we now show, unlike the usual $\ell_1$ penalty, the rank-one regularization penalty does not induce a bias when $\beta$ is sparse.

Theorem 5. If $\|\beta\|_0 \le k$, then $\rho_{R1}(\beta; k) = 0$.

Proof. Let $(\beta, z) \in \mathbb{R}^p\times[0,1]^p$, and let $R$ and $A_h$, $h = 1,\ldots,m$, correspond to an optimal solution of (15). Since
$$\beta^\top(X^\top X + \lambda I)\beta = \beta^\top R\beta + \sum_{h=1}^m \beta^\top A_h\beta \le \beta^\top R\beta + \sum_{h=1}^m \frac{\beta^\top A_h\beta}{\min\{1, z(T_h)\}} = \phi_{\mathcal{P}}(z,\beta),$$
it follows that $\rho_{R1}(\beta; k) \ge 0$ for any $\beta \in \mathbb{R}^p$. Now let $\hat\beta$ satisfy $\|\hat\beta\|_0 \le k$, let $\hat T = \{i \in P : \hat\beta_i \ne 0\}$ be the support of $\hat\beta$, and let $\hat z$ with $\hat z_i = \mathbb{1}[i \in \hat T]$ be the indicator vector of $\hat T$. By construction, $e^\top\hat z \le k$, and $\hat z$ is feasible for problem (18). Moreover,
$$\rho_{R1}(\hat\beta; k) \le \phi_{\mathcal{P}}(\hat z, \hat\beta) - \hat\beta^\top(X^\top X + \lambda I)\hat\beta = \sum_{\substack{1\le h\le m\\ T_h\cap\hat T\ne\emptyset}} \left( \frac{\hat\beta^\top A_h\hat\beta}{\min\{1, \hat z(T_h)\}} - \hat\beta^\top A_h\hat\beta \right) = 0,$$
where the last equality holds because $\hat z(T_h) \ge 1$ whenever $T_h \cap \hat T \ne \emptyset$, while $\hat\beta^\top A_h\hat\beta = 0$ whenever $T_h \cap \hat T = \emptyset$. Thus, $\rho_{R1}(\hat\beta; k) = 0$. ∎

The rank-one regularization penalty $\rho_{R1}$ can also be interpreted from an optimization perspective: problem (15) is the separation problem that, given $(\beta, z) \in \mathbb{R}^p\times[0,1]^p$, finds a decomposition resulting in a most violated inequality after applying the rank-one strengthening. Thus, the regularization penalty $\rho_{R1}(\beta; k)$ is precisely the violation of this inequality when $z$ is chosen optimally. In Section 3 we derive an explicit form of $\rho_{R1}(\beta; k)$ for $p = 2$; Figure 1 plots the graphs of the usual regularization penalties and of $\rho_{R1}$ in the two-dimensional case, and Figure 2 illustrates the better sparsity-inducing properties of the regularization $\rho_{R1}$. Deriving explicit forms of $\rho_{R1}$ is cumbersome for $p \ge 3$. Fortunately, problem (17) can be explicitly reformulated in an extended space as an SDP and tackled using off-the-shelf conic optimization solvers.
2.4 Extended SDP Formulation

To state the extended SDP formulation, in addition to the variables $z \in [0,1]^p$ and $\beta \in \mathbb{R}^p$, we introduce variables $w \in [0,1]^m$ corresponding to the terms $w_h := \min\{1, z(T_h)\}$ and $B \in \mathbb{R}^{p\times p}$ corresponding to the terms $B_{ij} = \beta_i\beta_j$.

Theorem 6. Problem (17) is equivalent to the SDP
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (19a)\\ \text{s.t.}\;& e^\top z \le k && (19b)\\ & \beta \le u,\ -\beta \le u && (19c)\\ & w_h \le e^\top z_{T_h}, \quad h = 1,\ldots,m && (19d)\\ & w_hB_{T_h} - \beta_{T_h}\beta_{T_h}^\top \in \mathbb{S}^{T_h}_+, \quad h = 1,\ldots,m && (19e)\\ & B - \beta\beta^\top \in \mathbb{S}^P_+ && (19f)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ w \in [0,1]^m,\ B \in \mathbb{R}^{p\times p}. && (19g) \end{aligned}$$

Observe that (19) is indeed an SDP, as
$$w_hB_{T_h} - \beta_{T_h}\beta_{T_h}^\top \in \mathbb{S}^{T_h}_+ \iff \begin{pmatrix} w_h & \beta_{T_h}^\top\\ \beta_{T_h} & B_{T_h} \end{pmatrix} \succeq 0. \tag{20}$$
Indeed, by the Schur complement, the right-hand side of (20) is equivalent to $w_h \ge 0$ (automatically satisfied) and $B_{T_h} - \frac{1}{w_h}\beta_{T_h}\beta_{T_h}^\top \succeq 0$. Similarly, constraint (19f) can be modeled as $\begin{pmatrix} 1 & \beta^\top\\ \beta & B \end{pmatrix} \succeq 0$. Thus, constraints (19e) and (19f) are indeed SDP-representable, and the remaining constraints and the objective are linear.

Proof. It is easy to check that (19) is strictly feasible (e.g., set $\beta = 0$, $z = (k/p)e$, $w > 0$ sufficiently small, and $B = I$). Adding surplus matrix variables $\Gamma$ and $\Gamma_h$, $h = 1,\ldots,m$, write (19) as
$$y^\top y + \min_{(\beta,z,u,w)\in C} \left\{ -2y^\top X\beta + \mu e^\top u + \min_{B,\Gamma_h,\Gamma} \langle X^\top X + \lambda I, B\rangle \right\}$$
$$\text{s.t.}\quad w_hB_{T_h} - \Gamma_h = \beta_{T_h}\beta_{T_h}^\top\ \ \forall h\ \ (\bar A_h), \qquad B - \Gamma = \beta\beta^\top\ \ (R), \qquad \Gamma_h \in \mathbb{S}^{T_h}_+,\ \Gamma \in \mathbb{S}^P_+,\ B \in \mathbb{R}^{p\times p},$$
where $C = \{\beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ w \in [0,1]^m : \text{(19b), (19c), (19d)}\}$, and $\bar A_h$, $R$ denote the dual variables of the corresponding constraints. Using conic duality for the inner minimization problem, we find the equivalent problem
$$y^\top y + \min_{(\beta,z,u,w)\in C} \left\{ -2y^\top X\beta + \mu e^\top u + \max_{\bar A_h, R} \langle\beta\beta^\top, R\rangle + \sum_{h=1}^m \langle\beta_{T_h}\beta_{T_h}^\top, \bar A_h\rangle \right\}$$
$$\text{s.t.}\quad \sum_{h=1}^m w_h\bar A_h + R = X^\top X + \lambda I, \qquad (\bar A_h)_{ij} = 0 \text{ for } i \notin T_h \text{ or } j \notin T_h, \qquad \bar A_h \in \mathbb{S}^{T_h}_+,\ R \in \mathbb{S}^P_+.$$
After substituting $A_h = w_h\bar A_h$ and noting that there exists an optimal solution with $w_h = \min\{1, z(T_h)\}$, we obtain formulation (15). ∎

Note that if $\mathcal{P} = \{\emptyset\}$, there is no strengthening, and (19) is equivalent to elastic net ($\lambda, \mu > 0$), lasso ($\lambda = 0$, $\mu > 0$), ridge regression ($\lambda > 0$, $\mu = 0$), or ordinary least squares ($\lambda = \mu = 0$). As $|\mathcal{P}|$ increases, the quality of the conic relaxation (19) for the non-convex $\ell_0$-problem (1) improves, but the computational burden required to solve the resulting SDP also increases. In particular, the full rank-one strengthening with $\mathcal{P} = 2^P$ requires $2^p$ semidefinite constraints and is impractical. Proposition 4 suggests using downward-closed sets $\mathcal{P}$ of limited size:
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (21a)\\ \text{s.t.}\;& e^\top z \le k && (21b)\\ & \beta \le u,\ -\beta \le u && (21c)\\ (\text{sdp}r)\qquad & 0 \le w_T \le \min\{1, e^\top z_T\}, \quad T : |T| \le r && (21d)\\ & w_TB_T - \beta_T\beta_T^\top \in \mathbb{S}^T_+, \quad T : |T| \le r && (21e)\\ & B - \beta\beta^\top \in \mathbb{S}^P_+ && (21f)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ B \in \mathbb{R}^{p\times p},\ w \in \mathbb{R}^m && (21g) \end{aligned}$$
for some $r \in \mathbb{Z}_+$, where $m = |\{T \subseteq P : |T| \le r\}|$. In the above formulation, $w_T$ is the scalar coordinate of the $m$-dimensional vector $w$ corresponding to the set $T$. If $r = 1$, then sdp1 reduces to the optimal perspective relaxation proposed by Dong et al. (2015), which is equivalent to using the MC+ regularization. Our computational experiments show that, whereas sdp1 may be a weak convex relaxation for problems with low diagonal dominance, sdp2 achieves excellent relaxation bounds within reasonable computing times even in the case of low diagonal dominance. For clarity, we give the explicit form of sdp2:
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (22a)\\ \text{s.t.}\;& e^\top z \le k && (22b)\\ & \beta \le u,\ -\beta \le u && (22c)\\ & \begin{pmatrix} z_i & \beta_i\\ \beta_i & B_{ii} \end{pmatrix} \succeq 0, \quad i = 1,\ldots,p && (22d)\\ (\text{sdp}2)\qquad & 0 \le w_{ij} \le \min\{1, z_i + z_j\}, \quad i < j && (22e)\\ & \begin{pmatrix} w_{ij} & \beta_i & \beta_j\\ \beta_i & B_{ii} & B_{ij}\\ \beta_j & B_{ij} & B_{jj} \end{pmatrix} \succeq 0, \quad i < j && (22f)\\ & B - \beta\beta^\top \in \mathbb{S}^P_+ && (22g)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ B \in \mathbb{R}^{p\times p},\ w \in \mathbb{R}^{p(p-1)/2}. && (22h) \end{aligned}$$
Constraints (22e) and (22f) correspond exactly to constraints (21d)-(21e) for the sets with $|T| = 2$. Moreover, for the sets where $T$ is a singleton, that is, $T = \{i\}$ for some $i \in \{1,\ldots,p\}$, the constraints $0 \le w_i \le z_i$ can be omitted, since it is easily verified that $w_i = z_i$ in any optimal solution; thus, after substituting $w_i$, constraints (21e) reduce to constraints (22d). Finally, constraints (22b), (22c) and (22g) correspond to constraints (21b), (21c) and (21f), respectively.
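A direct transcription of sdp2 in cvxpy is sketched below; it makes no attempt at efficiency, and $X$, $y$, $k$, lam, mu are assumed to be defined as in the earlier sketches. The low-dimensional PSD blocks are tied to $(z, \beta, B)$ via auxiliary PSD variables and linear equations.

```python
import cvxpy as cp
import numpy as np

p = X.shape[1]
Q = X.T @ X + lam * np.eye(p)
beta, z = cp.Variable(p), cp.Variable(p)
u = cp.Variable(p, nonneg=True)
B = cp.Variable((p, p), symmetric=True)
cons = [z >= 0, z <= 1, cp.sum(z) <= k, beta <= u, -beta <= u]

G = cp.Variable((p + 1, p + 1), PSD=True)      # (22g): B - beta beta' >= 0
cons += [G[0, 0] == 1, G[0, 1:] == beta, G[1:, 1:] == B]
for i in range(p):
    # (22d) as a rotated cone: beta_i^2 <= z_i * B_ii
    cons.append(cp.quad_over_lin(beta[i], z[i]) <= B[i, i])
    for j in range(i + 1, p):
        w = cp.Variable()                      # w_ij
        M = cp.Variable((3, 3), PSD=True)      # the 3x3 block of (22f)
        cons += [w >= 0, w <= 1, w <= z[i] + z[j],              # (22e)
                 M[0, 0] == w, M[0, 1] == beta[i], M[0, 2] == beta[j],
                 M[1, 1] == B[i, i], M[1, 2] == B[i, j], M[2, 2] == B[j, j]]

obj = y @ y - 2 * y @ X @ beta + mu * cp.sum(u) + cp.trace(Q @ B)
prob = cp.Problem(cp.Minimize(obj), cons)
prob.solve()            # requires an SDP solver, e.g., SCS or MOSEK
print(prob.value, np.round(beta.value, 3))
```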
2.5 Additional Comments

Just as lasso can be interpreted as the best possible convex relaxation obtained from the set with bounded continuous variables $\{z \in \{0,1\},\ \beta \in [0,1] : \beta(1 - z) = 0\}$, and the perspective relaxation is the best possible relaxation obtained from the set $Q_1$, the relaxations proposed in this section are the best possible convex relaxations obtained from the study of $Q^{r1}_T$. Since $Q^{r1}_T$ generalizes these two simpler sets, the proposed formulations are a better approximation of (3) than lasso and the perspective relaxation. Nonetheless, $Q^{r1}_T$ is still considerably simpler than the feasible region of (3), which involves constraints on the binary variables and general convex quadratic functions, as described by the set $Q_T$.

We now briefly discuss two recent results that shed additional light on the strength of the formulations. Wei et al. (2022) consider a generalization of $Q^{r1}_T$ in which the binary variables are subject to additional constraints. The authors find that, with the $k$-sparsity constraint $\sum_{i=1}^p z_i \le k$, the rank-one relaxation described in Theorem 2 is still the best possible formulation. Moreover, the authors show that the convex hull for other classes of constraints has a similar structure to $\mathrm{cl\,conv}(Q^{r1}_T)$, and formulation sdpr can be extended to those cases. Wei et al. (2024) study the set $Q_T$ and show that $\mathrm{cl\,conv}(Q_T)$ can be described in an extended formulation with $O(p^2)$ additional variables as the intersection of a single conic constraint and a polyhedron $\Psi$. The proof of this result is not constructive: an explicit description of $\Psi$ is not given, as it requires an exponential number of linear inequalities. Nonetheless, the authors show that the formulations sdpr can be interpreted as relaxations in this extended space obtained by adding linear inequalities that are guaranteed to define high-dimensional faces of $\Psi$ (the dimension of the face is larger for small values of $r$), providing additional theoretical justification that the nonlinear inequalities (22d) and (22f) are in fact strong approximations of $\mathrm{cl\,conv}(Q_T)$.

3 Regularization for the Two-Dimensional Case

To better understand the properties of the proposed conic relaxations, in this section we study them from a regularization perspective. Consider formulation (17) in Lagrangian form with multiplier $\kappa$:
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \phi_{\mathcal{P}}(z,\beta) + \kappa e^\top z && (23a)\\ \text{s.t.}\;& \beta \le u,\ -\beta \le u && (23b)\\ & \beta \in \mathbb{R}^P,\ z \in [0,1]^P,\ u \in \mathbb{R}^P_+, && (23c) \end{aligned}$$
where $p = 2$ and
$$X^\top X + \lambda I = \begin{pmatrix} 1+\delta_1 & 1\\ 1 & 1+\delta_2 \end{pmatrix}. \tag{24}$$
Observe that assumption (24) is without loss of generality, provided that $X^\top X$ is not diagonal: given a two-dimensional convex quadratic function $a_1\beta_1^2 + 2a_{12}\beta_1\beta_2 + a_2\beta_2^2$ (with $a_{12} \ne 0$), the substitution $\bar\beta_1 = \alpha\beta_1$ and $\bar\beta_2 = (a_{12}/\alpha)\beta_2$ with $|a_{12}|/\sqrt{a_2} \le \alpha \le \sqrt{a_1}$ yields a quadratic form satisfying (24). Also note that we use the Lagrangian form instead of the cardinality-constrained form given in (18) for simplicity; however, since $\phi_{\mathcal{P}}(z,\beta)$ is convex in $z$, there exists a value of $\kappa$ such that both forms are equivalent, i.e., they result in the same optimal solutions $\hat\beta$ for the regression problem, with objective values differing by the constant $\kappa k$.
If $\mathcal{P} = \{\emptyset, \{1\}, \{2\}\}$, then (23) reduces to a perspective strengthening of the form
$$y^\top y + \min_{z\in[0,1]^2,\ \beta\in\mathbb{R}^2}\; -2y^\top X\beta + (\beta_1 + \beta_2)^2 + \delta_1\frac{\beta_1^2}{z_1} + \delta_2\frac{\beta_2^2}{z_2} + \mu\|\beta\|_1 + \kappa\|z\|_1. \tag{25}$$
The links between (25) and regularization were studied³ by Dong et al. (2015).

3. The case with $\mu = 0$ is explicitly considered in Dong et al. (2015), but the results extend straightforwardly to the case with $\mu > 0$. The results presented here differ slightly from those of Dong et al. (2015) to account for a different scaling in the objective function.

Proposition 7 (Dong et al. (2015)). Problem (25) is equivalent to the regularization problem
$$\min_{\beta\in\mathbb{R}^2}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \rho_{MC+}(\beta; \kappa, \delta),$$
where
$$\rho_{MC+}(\beta;\kappa,\delta) = \begin{cases} \sum_{i=1}^2\left(2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2\right) & \text{if } \delta_i\beta_i^2 \le \kappa,\ i = 1,2\\[2pt] \kappa + 2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2 & \text{if } \delta_i\beta_i^2 \le \kappa \text{ and } \delta_j\beta_j^2 > \kappa,\ j \ne i\\[2pt] 2\kappa & \text{if } \delta_i\beta_i^2 > \kappa,\ i = 1,2. \end{cases}$$

The regularization $\rho_{MC+}$ is non-convex and separable. Moreover, as pointed out by Dong et al. (2015), the regularization given in Proposition 7 is the same as the minimax concave penalty of Zhang et al. (2010); and, if $\lambda = \delta_1 = \delta_2$, it reduces to the reverse Huber penalty derived by Pilanci et al. (2015). Observe that the regularization function $\rho_{MC+}$ is highly dependent on the diagonal dominance $\delta$: specifically, in the low diagonal dominance setting with $\delta = 0$, we find that $\rho_{MC+}(\beta; \kappa, 0) = 0$.

We now consider the conic formulation (23) for the case $\mathcal{P} = \{\emptyset, \{1\}, \{2\}, \{1,2\}\}$, corresponding to the full rank-one strengthening:
$$y^\top y + \min_{z\in[0,1]^2,\ \beta\in\mathbb{R}^2}\; -2y^\top X\beta + \frac{(\beta_1 + \beta_2)^2}{\min\{1, z_1 + z_2\}} + \delta_1\frac{\beta_1^2}{z_1} + \delta_2\frac{\beta_2^2}{z_2} + \mu\|\beta\|_1 + \kappa\|z\|_1. \tag{26}$$

Proposition 8. Problem (26) is equivalent to the regularization problem
$$\min_{\beta\in\mathbb{R}^2}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \rho_{R1}(\beta; \kappa, \delta),$$
where, writing $Q = X^\top X + \lambda I$ and $g(\beta) = \beta^\top Q\beta + 2\sqrt{\delta_1\delta_2}|\beta_1\beta_2|$,
$$\rho_{R1}(\beta;\kappa,\delta) = \begin{cases} 2\sqrt{\kappa\,g(\beta)} - \beta^\top Q\beta & \text{if } g(\beta) < \kappa\\[2pt] \kappa + 2\sqrt{\delta_1\delta_2}|\beta_1\beta_2| & \text{if } \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2 \le \kappa \le g(\beta)\\[2pt] \sum_{i=1}^2\left(2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2\right) & \text{if } \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2 > \kappa \text{ and } \delta_i\beta_i^2 \le \kappa,\ i = 1,2\\[2pt] \kappa + 2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2 & \text{if } \delta_i\beta_i^2 \le \kappa \text{ and } \delta_j\beta_j^2 > \kappa,\ j \ne i\\[2pt] 2\kappa & \text{if } \delta_i\beta_i^2 > \kappa,\ i = 1,2. \end{cases}$$

Proof. We prove the result by projecting out the $z$ variables in (26), i.e., by giving closed-form solutions for them. There are three cases to consider, depending on the optimal value of $z_1 + z_2$.

Case 1: $z_1 + z_2 < 1$. Setting the derivatives of the objective of (26) with respect to $z_1$ and $z_2$ to zero, we find
$$\kappa - \delta_1\frac{\beta_1^2}{z_1^2} - \frac{(\beta_1+\beta_2)^2}{(z_1+z_2)^2} = 0, \qquad \kappa - \delta_2\frac{\beta_2^2}{z_2^2} - \frac{(\beta_1+\beta_2)^2}{(z_1+z_2)^2} = 0,$$
which implies $\sqrt{\delta_1}|\beta_1|z_2 = \sqrt{\delta_2}|\beta_2|z_1$. Define $\bar z := z_1/(\sqrt{\delta_1}|\beta_1|)$, so that $z_2 = \sqrt{\delta_2}|\beta_2|\bar z$ and $z_1 + z_2 = (\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|)\bar z$. We find that (26) reduces to
$$y^\top y + \min_{\bar z>0,\ \beta\in\mathbb{R}^2}\; -2y^\top X\beta + \mu\|\beta\|_1 + \frac{(\beta_1+\beta_2)^2 + \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2}{\left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)\bar z} + \kappa\left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)\bar z. \tag{27}$$
An optimal solution of (27) is attained at
$$\bar z^* = \frac{\sqrt{(\beta_1+\beta_2)^2 + \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2}}{\sqrt{\kappa}\left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)},$$
with objective value
$$y^\top y + \min_{\beta\in\mathbb{R}^2} -2y^\top X\beta + \mu\|\beta\|_1 + 2\sqrt{\kappa}\sqrt{(\beta_1+\beta_2)^2 + \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2}$$
$$= \min_{\beta\in\mathbb{R}^2} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + 2\sqrt{\kappa\,g(\beta)} - (\beta_1+\beta_2)^2 - \delta_1\beta_1^2 - \delta_2\beta_2^2,$$
using $(\beta_1+\beta_2)^2 + (\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|)^2 = g(\beta)$. Finally, this case occurs when $z_1 + z_2 < 1 \iff g(\beta) < \kappa$.

Case 2: $z_1 + z_2 > 1$. In this case, setting the derivatives of the objective of (26) with respect to $z_1$ and $z_2$ to zero yields $\bar z_i = \sqrt{\delta_i/\kappa}\,|\beta_i|$ for $i = 1, 2$.
Thus, in this case, an optimal solution $z^*$ of (26) has $z^*_i = \min\{\bar z_i, 1\}$, and problem (26) reduces to
$$y^\top y + \min_{\beta\in\mathbb{R}^2}\; -2y^\top X\beta + (\beta_1+\beta_2)^2 + \sum_{i=1}^2\left( \max\left\{\delta_i\beta_i^2,\ \sqrt{\kappa\delta_i}|\beta_i|\right\} + \min\left\{\sqrt{\kappa\delta_i}|\beta_i|,\ \kappa\right\} \right) + \mu\|\beta\|_1$$
$$= \min_{\beta\in\mathbb{R}^2} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \sum_{i=1}^2\left( \max\left\{\delta_i\beta_i^2,\ \sqrt{\kappa\delta_i}|\beta_i|\right\} + \min\left\{\sqrt{\kappa\delta_i}|\beta_i|,\ \kappa\right\} - \delta_i\beta_i^2 \right)$$
$$= \min_{\beta\in\mathbb{R}^2} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \begin{cases} \sum_{i=1}^2\left(2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2\right) & \text{if } \delta_i\beta_i^2 \le \kappa,\ i = 1,2\\[2pt] \kappa + 2\sqrt{\kappa\delta_i}|\beta_i| - \delta_i\beta_i^2 & \text{if } \delta_i\beta_i^2 \le \kappa \text{ and } \delta_j\beta_j^2 > \kappa,\ j \ne i\\[2pt] 2\kappa & \text{if } \delta_i\beta_i^2 > \kappa,\ i = 1,2. \end{cases}$$
Finally, this case occurs when $z_1 + z_2 > 1 \iff \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2 > \kappa$. Observe that, in this case, the penalty function is precisely the one given in Proposition 7.

Case 3: $z_1 + z_2 = 1$. In this case, problem (26) reduces to
$$y^\top y + \min_{0\le z_1\le 1,\ \beta\in\mathbb{R}^2}\; -2y^\top X\beta + (\beta_1+\beta_2)^2 + \delta_1\frac{\beta_1^2}{z_1} + \delta_2\frac{\beta_2^2}{1-z_1} + \mu\|\beta\|_1 + \kappa. \tag{28}$$
Setting the derivative with respect to $z_1$ in (28) to $0$, we have
$$0 = \delta_1\beta_1^2(1-z_1)^2 - \delta_2\beta_2^2z_1^2 = \delta_1\beta_1^2 - 2\delta_1\beta_1^2z_1 + (\delta_1\beta_1^2 - \delta_2\beta_2^2)z_1^2.$$
Thus, we find that
$$z_1 = \frac{2\delta_1\beta_1^2 - \sqrt{4\delta_1^2\beta_1^4 - 4\delta_1\beta_1^2(\delta_1\beta_1^2 - \delta_2\beta_2^2)}}{2(\delta_1\beta_1^2 - \delta_2\beta_2^2)} = \frac{\delta_1\beta_1^2 - \sqrt{\delta_1\delta_2}|\beta_1\beta_2|}{\delta_1\beta_1^2 - \delta_2\beta_2^2} = \frac{\sqrt{\delta_1}|\beta_1|\left(\sqrt{\delta_1}|\beta_1| - \sqrt{\delta_2}|\beta_2|\right)}{\left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)\left(\sqrt{\delta_1}|\beta_1| - \sqrt{\delta_2}|\beta_2|\right)}.$$
Moreover, since $0 \le z_1 \le 1$, we have
$$z_1 = \frac{\sqrt{\delta_1}|\beta_1|}{\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|} \quad \text{and} \quad 1 - z_1 = \frac{\sqrt{\delta_2}|\beta_2|}{\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|}.$$
Substituting into (28), we find the equivalent form
$$y^\top y + \min_{\beta\in\mathbb{R}^2} -2y^\top X\beta + (\beta_1+\beta_2)^2 + \left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2 + \mu\|\beta\|_1 + \kappa = \min_{\beta\in\mathbb{R}^2} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \mu\|\beta\|_1 + \kappa + 2\sqrt{\delta_1\delta_2}|\beta_1\beta_2|.$$
This final case occurs when neither Case 1 nor Case 2 does, i.e., when $\left(\sqrt{\delta_1}|\beta_1| + \sqrt{\delta_2}|\beta_2|\right)^2 \le \kappa \le g(\beta)$. ∎

Observe that, unlike $\rho_{MC+}$, the function $\rho_{R1}$ is not separable in $\beta_1$ and $\beta_2$ and does not vanish when $\delta = 0$: indeed, for $\delta = 0$ we find that
$$\rho_{R1}(\beta;\kappa,0) = \begin{cases} 2\sqrt{\kappa\,\beta^\top Q\beta} - \beta^\top Q\beta & \text{if } \beta^\top Q\beta < \kappa\\ \kappa & \text{if } \beta^\top Q\beta \ge \kappa, \end{cases}$$
with $Q = X^\top X + \lambda I$. The plots of $\rho_{MC+}$ and $\rho_{R1}$ shown in Figures 1 and 2 correspond to the natural value $\kappa = 1$.
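The closed forms of Propositions 7 and 8 translate directly into code. The following numpy sketch evaluates both penalties for $p = 2$ under the scaling (24); it is a direct transcription of the case analysis above, and the function names are our own.

```python
import numpy as np

def rho_mc(beta, kappa, delta):
    """MC+ penalty of Proposition 7 (separable over coordinates)."""
    total = 0.0
    for b, d in zip(beta, delta):
        if d * b ** 2 <= kappa:
            total += 2 * np.sqrt(kappa * d) * abs(b) - d * b ** 2
        else:
            total += kappa
    return total

def rho_r1(beta, kappa, delta, Q):
    """Rank-one penalty of Proposition 8; Q = X'X + lam * I under (24)."""
    b1, b2 = beta
    d1, d2 = delta
    quad = beta @ Q @ beta
    g = quad + 2 * np.sqrt(d1 * d2) * abs(b1 * b2)
    s = (np.sqrt(d1) * abs(b1) + np.sqrt(d2) * abs(b2)) ** 2
    if g < kappa:                        # Case 1: z1 + z2 < 1
        return 2 * np.sqrt(kappa * g) - quad
    if s <= kappa:                       # Case 3: z1 + z2 = 1
        return kappa + 2 * np.sqrt(d1 * d2) * abs(b1 * b2)
    return rho_mc(beta, kappa, delta)    # Case 2: coincides with MC+
```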
4 Conic Quadratic Relaxations

As mentioned in Section 1, strong convex relaxations of problem (1), such as sdpr, can either be used directly to obtain good estimators via conic optimization (the approach we take in our computations), or be embedded in a branch-and-bound algorithm to solve (1) to optimality. However, using SDP formulations such as (19) within branch-and-bound may be daunting since, to date, efficient branch-and-bound algorithms with SDP relaxations are not available. In contrast, conic quadratic optimization problems are considerably easier to solve than semidefinite optimization problems and thus scale to larger dimensions. Moreover, off-the-shelf mixed-integer conic quadratic optimization solvers exist and are actively maintained and improved by numerous software vendors. In this section, we show how the proposed conic relaxations, and specifically sdp2, can be implemented in a conic quadratic framework. The resulting convex formulations can then be used directly as fast approximations of the SDP formulations presented in Section 2, and they pave the way towards an integration with branch-and-bound solvers⁴.

4. An effective implementation would require careful constraint management strategies and integration with the different aspects of branch-and-bound solvers, e.g., branching strategies and heuristics. Such an implementation is beyond the scope of this paper.

4.1 Two-Dimensional PSD Constraints

Constraint (22d), $\beta_i^2 \le z_iB_{ii}$, is a rotated cone constraint, as $z_i \ge 0$ and $B_{ii} \ge 0$ in any feasible solution of (21), and is thus conic quadratic representable.

4.2 Three-Dimensional PSD Constraints

As we now show, constraints (22f) can be accurately approximated using conic quadratic constraints.

Proposition 9. Problem sdp2 is equivalent to the optimization problem
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (29a)\\ \text{s.t.}\;& e^\top z \le k && (29b)\\ & \beta \le u,\ -\beta \le u && (29c)\\ & z_iB_{ii} \ge \beta_i^2, \quad i \in P && (29d)\\ & 0 \le w_{ij} \le 1,\ w_{ij} \le z_i + z_j, \quad i \ne j && (29e)\\ & \frac{\alpha\beta_i^2 + 2\beta_i\beta_j + \beta_j^2/\alpha}{w_{ij}} \le 2B_{ij} + \alpha B_{ii} + B_{jj}/\alpha, \quad i \ne j,\ \alpha > 0 && (29f)\\ & \frac{\alpha\beta_i^2 - 2\beta_i\beta_j + \beta_j^2/\alpha}{w_{ij}} \le -2B_{ij} + \alpha B_{ii} + B_{jj}/\alpha, \quad i \ne j,\ \alpha > 0 && (29g)\\ & B - \beta\beta^\top \in \mathbb{S}^P_+ && (29h)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ B \in \mathbb{R}^{p\times p}. && (29i) \end{aligned}$$

Proof. It suffices to compute the optimal value of $\alpha$ in (29f)-(29g). Observe that the family of constraints (29f) can be written as
$$\frac{2\beta_i\beta_j}{w_{ij}} - 2B_{ij} \le \min_{\alpha\ge 0}\left\{ \alpha\left(B_{ii} - \frac{\beta_i^2}{w_{ij}}\right) + \frac{1}{\alpha}\left(B_{jj} - \frac{\beta_j^2}{w_{ij}}\right) \right\}. \tag{30}$$
Moreover, in an optimal solution of (29) we have $w_{ij} = \min\{1, z_i + z_j\}$; thus, due to constraints (29d), we find that $B_{ii} - \beta_i^2/w_{ij} \ge 0$ in optimal solutions of (29). If either $B_{ii} = \beta_i^2/w_{ij}$ or $B_{jj} = \beta_j^2/w_{ij}$, then the optimal value of the right-hand side of (30) is $0$, attained by letting $\alpha \to \infty$ or $\alpha \to 0$, respectively. Otherwise, the optimal $\alpha$ equals
$$\alpha^* = \sqrt{\frac{B_{jj}w_{ij} - \beta_j^2}{B_{ii}w_{ij} - \beta_i^2}}, \tag{31}$$
with objective value $2\sqrt{(B_{ii} - \beta_i^2/w_{ij})(B_{jj} - \beta_j^2/w_{ij})}$; observe that this expression is also correct when $B_{ii} = \beta_i^2/w_{ij}$ or $B_{jj} = \beta_j^2/w_{ij}$. Thus, the family of constraints (29f) reduces to
$$0 \le B_{ij}w_{ij} - \beta_i\beta_j + \sqrt{(B_{ii}w_{ij} - \beta_i^2)(B_{jj}w_{ij} - \beta_j^2)}. \tag{32}$$
Similarly, it can be shown that the family of constraints (29g) reduces to
$$0 \le \beta_i\beta_j - B_{ij}w_{ij} + \sqrt{(B_{ii}w_{ij} - \beta_i^2)(B_{jj}w_{ij} - \beta_j^2)}. \tag{33}$$
Together, constraints (32)-(33) are equivalent to
$$\left(w_{ij}B_{ii} - \beta_i^2\right)\left(w_{ij}B_{jj} - \beta_j^2\right) \ge \left(w_{ij}B_{ij} - \beta_i\beta_j\right)^2. \tag{34}$$
Moreover, note that constraints (21e) with $T = \{i,j\}$ are equivalent to
$$\begin{pmatrix} w_{ij}B_{ii} - \beta_i^2 & w_{ij}B_{ij} - \beta_i\beta_j\\ w_{ij}B_{ij} - \beta_i\beta_j & w_{ij}B_{jj} - \beta_j^2 \end{pmatrix} \succeq 0 \iff w_{ij}B_{ii} - \beta_i^2 \ge 0,\ w_{ij}B_{jj} - \beta_j^2 \ge 0,\ \text{and (34)}.$$
Since the first two constraints are implied by (29d) and $w_{ij} = \min\{1, z_i + z_j\}$ in optimal solutions, the proof is complete. ∎

Observe that, for any fixed value of $\alpha$, constraints (29f)-(29g) are conic quadratic representable. Thus, we can obtain relaxations of (29) of the form
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (35a)\\ \text{s.t.}\;& \text{(29b), (29c), (29d), (29e), (29h), (29i)} && (35b)\\ & \frac{\alpha\beta_i^2 + 2\beta_i\beta_j + \beta_j^2/\alpha}{\min\{1, z_i + z_j\}} \le 2B_{ij} + \alpha B_{ii} + B_{jj}/\alpha, \quad i \ne j,\ \alpha \in V^+_{ij} && (35c)\\ & \frac{\alpha\beta_i^2 - 2\beta_i\beta_j + \beta_j^2/\alpha}{\min\{1, z_i + z_j\}} \le -2B_{ij} + \alpha B_{ii} + B_{jj}/\alpha, \quad i \ne j,\ \alpha \in V^-_{ij}, && (35d) \end{aligned}$$
where $V^+_{ij}$ and $V^-_{ij}$ are any finite subsets of $\mathbb{R}_+$. Relaxation (35) can be refined dynamically: given an optimal solution of (35), new values of $\alpha$ generated according to (31) (resulting in most violated constraints) can be added to the sets $V^+_{ij}$ and $V^-_{ij}$, resulting in tighter relaxations. Note that the use of cuts, as described here, to improve the continuous relaxations of mixed-integer optimization problems is one of the main reasons for the dramatic improvements in MIO software (Bixby, 2012). In relaxation (35), $V^+_{ij}$ and $V^-_{ij}$ can be initialized with any (possibly empty) subsets of $\mathbb{R}_+$. However, setting $V^+_{ij} = V^-_{ij} = \{1\}$ yields a relaxation with a simple interpretation, discussed next.
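A small numpy sketch of this cut-generation step: given values $(\hat\beta, \hat B, \hat w)$ from a solution of (35), it computes the $\alpha$ of (31) yielding the most violated constraint (29f). The function and variable names are illustrative, not part of the paper's implementation.

```python
import numpy as np

def most_violated_alpha(bi, bj, Bii, Bjj, w, eps=1e-12):
    """Alpha of (31) for the pair (i, j); None if alpha -> infinity."""
    num = max(Bjj * w - bj ** 2, 0.0)
    den = max(Bii * w - bi ** 2, 0.0)
    if den <= eps:
        return None
    return np.sqrt(num / den)

def violation(alpha, bi, bj, Bii, Bjj, Bij, w):
    """Positive value means constraint (29f) is violated at this alpha."""
    lhs = (alpha * bi ** 2 + 2 * bi * bj + bj ** 2 / alpha) / w
    rhs = 2 * Bij + alpha * Bii + Bjj / alpha
    return lhs - rhs
```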
4.3 Diagonally Dominant Matrix Relaxation

Let $\Lambda \in \mathbb{S}^P_+$ be a diagonally dominant matrix. Observe that for any $(z,\beta) \in \{0,1\}^p\times\mathbb{R}^p$ such that $\beta \circ (e - z) = 0$,
$$\beta^\top\Lambda\beta = \sum_{i\in P}\Big(\Lambda_{ii} - \sum_{j\ne i}|\Lambda_{ij}|\Big)\beta_i^2 + \sum_{i\in P}\sum_{j>i}|\Lambda_{ij}|\left(\beta_i + \mathrm{sign}(\Lambda_{ij})\beta_j\right)^2 = \sum_{i\in P}\Big(\Lambda_{ii} - \sum_{j\ne i}|\Lambda_{ij}|\Big)\frac{\beta_i^2}{z_i} + \sum_{i\in P}\sum_{j>i}|\Lambda_{ij}|\frac{\left(\beta_i + \mathrm{sign}(\Lambda_{ij})\beta_j\right)^2}{\min\{1, z_i + z_j\}}, \tag{36}$$
where the last expression follows from using the perspective strengthening for the separable quadratic terms and inequality (7) for the non-separable, rank-one terms. Atamtürk et al. (2021) use a similar strengthening for signal estimation with nonnegative pairwise quadratic terms.

We now consider decompositions of the form $\Lambda + R = X^\top X + \lambda I$, where $\Lambda$ is a diagonally dominant matrix and $R \in \mathbb{S}^P_+$. Given such a decomposition, inequalities (36) can be used to strengthen the formulations. Specifically, we consider relaxations of (3) of the form
$$y^\top y + \min\; -2y^\top X\beta + \mu e^\top u + \hat\phi(z,\beta) \quad \text{s.t.} \quad \text{(17b), (17c), (17d)}, \tag{37}$$
where
$$\begin{aligned} \hat\phi(z,\beta) := \max_{\Lambda,R}\;& \beta^\top R\beta + \sum_{i\in P}\Big(\Lambda_{ii} - \sum_{j\ne i}|\Lambda_{ij}|\Big)\frac{\beta_i^2}{z_i} + \sum_{i\in P}\sum_{j>i}|\Lambda_{ij}|\frac{\left(\beta_i + \mathrm{sign}(\Lambda_{ij})\beta_j\right)^2}{\min\{1, z_i + z_j\}} && (38a)\\ \text{s.t.}\;& \Lambda + R = X^\top X + \lambda I && (38b)\\ & \Lambda_{ii} \ge \sum_{j\ne i}|\Lambda_{ij}|, \quad i \in P && (38c)\\ & R \in \mathbb{S}^P_+. && (38d) \end{aligned}$$

Proposition 10. Problem (37) is equivalent to
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (39a)\\ \text{s.t.}\;& e^\top z \le k && (39b)\\ & \beta \le u,\ -\beta \le u && (39c)\\ & z_iB_{ii} \ge \beta_i^2, \quad i \in P && (39d)\\ (\text{sdp}_{dd})\qquad & 0 \le w_{ij} \le 1,\ w_{ij} \le z_i + z_j, \quad i \ne j && (39e)\\ & \frac{\beta_i^2 + 2\beta_i\beta_j + \beta_j^2}{w_{ij}} \le 2B_{ij} + B_{ii} + B_{jj}, \quad i \ne j && (39f)\\ & \frac{\beta_i^2 - 2\beta_i\beta_j + \beta_j^2}{w_{ij}} \le -2B_{ij} + B_{ii} + B_{jj}, \quad i \ne j && (39g)\\ & B - \beta\beta^\top \in \mathbb{S}^P_+ && (39h)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ B \in \mathbb{R}^{p\times p}. && (39i) \end{aligned}$$

Proof. Let $\Gamma$, $\Gamma^+$, $\Gamma^-$ be nonnegative $p\times p$ matrices such that $\Gamma_{ii} = \Lambda_{ii}$ and $\Gamma_{ij} = 0$ for $i \ne j$, while $\Gamma^+_{ii} = \Gamma^-_{ii} = 0$ and $\Gamma^+_{ij} - \Gamma^-_{ij} = \Lambda_{ij}$ for $i \ne j$. Problem (38) can be written as
$$\begin{aligned} \hat\phi(z,\beta) = \max_{\Gamma,\Gamma^+,\Gamma^-,R}\;& \beta^\top R\beta + \sum_{i\in P}\Big(\Gamma_{ii} - \sum_{j\ne i}(\Gamma^+_{ij} + \Gamma^-_{ij})\Big)\frac{\beta_i^2}{z_i} && (40a)\\ & + \sum_{i\in P}\sum_{j>i}\left( \Gamma^+_{ij}\frac{(\beta_i + \beta_j)^2}{\min\{1, z_i+z_j\}} + \Gamma^-_{ij}\frac{(\beta_i - \beta_j)^2}{\min\{1, z_i+z_j\}} \right) && (40b)\\ \text{s.t.}\;& \Gamma + \Gamma^+ - \Gamma^- + R = X^\top X + \lambda I && (40c)\\ & \Gamma_{ii} \ge \sum_{j\ne i}(\Gamma^+_{ij} + \Gamma^-_{ij}), \quad i \in P && (40d)\\ & R \in \mathbb{S}^P_+. && (40e) \end{aligned}$$
Then, similarly to the proof of Theorem 6, it is easy to show that the dual of (40) is precisely (39). ∎

4.4 Relaxing the (p+1)-Dimensional PSD Constraint

We now discuss a relaxation of the semidefinite constraint $B - \beta\beta^\top \in \mathbb{S}^P_+$, present in all of the formulations above. Let $V$ be a matrix whose $j$-th column $V_j$ is an eigenvector of $X^\top X$. Consider the optimization problem
$$\begin{aligned} \underline\phi_{\mathcal{P}}(z,\beta) := \max_{A_T,R,\pi}\;& \beta^\top R\beta + \sum_{T\in\mathcal{P}} \frac{\beta_T^\top A_T\beta_T}{\min\{1, z(T)\}} && (41a)\\ \text{s.t.}\;& \sum_{T\in\mathcal{P}} A_T + R = X^\top X + \lambda I && (41b)\\ & A_T \in \mathbb{S}^T_+, \quad T \in \mathcal{P} && (41c)\\ & R = V\mathrm{diag}(\pi)V^\top && (41d)\\ & \pi \in \mathbb{R}^n_+. && (41e) \end{aligned}$$
Observe that the objective and constraints (41a)-(41c) are identical to those of (15). However, instead of (15e), we have $R = \sum_{j=1}^{\min\{p,n\}} \pi_jV_jV_j^\top$. Moreover, since $\pi \ge 0$, we have $R \in \mathbb{S}^P_+$ in any feasible solution of (41); thus (15) is a relaxation of (41) and, hence, $\underline\phi_{\mathcal{P}}$ is indeed a lower bound on $\phi_{\mathcal{P}}$. Finally, (41) is feasible if $\lambda = 0$ or if $\mathcal{P}$ contains all singletons, as it is possible to set $A_{\{i\}} = \lambda$, $A_T = 0$ for $|T| > 1$, and $\pi$ equal to the eigenvalues of $X^\top X$. Therefore, instead of (17), one may use the simpler convex relaxation
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \underline\phi_{\mathcal{P}}(z,\beta) && (42a)\\ \text{s.t.}\;& e^\top z \le k && (42b)\\ & \beta \le u,\ -\beta \le u && (42c)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+. && (42d) \end{aligned}$$

Proposition 11. If $\mathcal{P} = \{T \subseteq P : |T| \le 2\}$, then problem (42) is equivalent to
$$\begin{aligned} y^\top y + \min\;& -2y^\top X\beta + \mu e^\top u + \langle X^\top X + \lambda I, B\rangle && (43a)\\ \text{s.t.}\;& e^\top z \le k && (43b)\\ & \beta \le u,\ -\beta \le u && (43c)\\ & \begin{pmatrix} z_i & \beta_i\\ \beta_i & B_{ii} \end{pmatrix} \succeq 0, \quad i = 1,\ldots,p && (43d)\\ (\text{sdp}_{LB})\qquad & 0 \le w_{ij} \le \min\{1, z_i + z_j\}, \quad i < j && (43e)\\ & \begin{pmatrix} w_{ij} & \beta_i & \beta_j\\ \beta_i & B_{ii} & B_{ij}\\ \beta_j & B_{ij} & B_{jj} \end{pmatrix} \succeq 0, \quad i < j && (43f)\\ & V_j^\top\left(B - \beta\beta^\top\right)V_j \ge 0, \quad j = 1,\ldots,\min\{n,p\} && (43g)\\ & \beta \in \mathbb{R}^p,\ z \in [0,1]^p,\ u \in \mathbb{R}^p_+,\ B \in \mathbb{R}^{p\times p}. && (43h) \end{aligned}$$

Proof. The proof is based on conic duality, similar to the proof of Theorem 6. ∎

Observe that in formulation (43), the $(p+1)$-dimensional semidefinite constraint (19f) is replaced with the $\min\{p,n\}$ rank-one quadratic constraints (43g). We denote by sdp_LB the relaxation of sdp2 obtained by replacing (21f) with (43g). In general, sdp_LB is still an SDP due to constraints (43f); however, sdp_LB can be implemented in a conic quadratic framework by using cuts, as described in Section 4.2. Moreover, constraints (43g) could also be dynamically refined to better approximate the SDP constraint, or formulation (43) could be improved with ongoing research on approximating SDPs via mixed-integer conic quadratic optimization (Kocuk et al., 2016, 2018).
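Each constraint in (43g) is a rank-one quadratic (SOC-representable) inequality. The following sketch builds these eigenvector cuts in cvxpy; beta and B are assumed to be the cvxpy variables from the sdp2 sketch above, and the function name is our own.

```python
import cvxpy as cp
import numpy as np

def eigenvector_cuts(X, beta, B):
    """Cuts (43g): V_j'(B - beta beta')V_j >= 0 along eigenvectors of X'X,
    written in the convex form (v' beta)^2 <= v' B v."""
    _, V = np.linalg.eigh(X.T @ X)      # columns of V are eigenvectors V_j
    cuts = []
    for j in range(V.shape[1]):
        v = V[:, j]
        cuts.append(cp.square(v @ beta) <= v @ B @ v)
    return cuts
```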
Remark 12 We observe that formulation (43) is solved substantially faster than sdp2 (with MOSEK) with constraints (43f) formulated as semidefinite constraints. Indeed, the O(p²) low-dimensional constraints (22f) can actually be handled efficiently; the major computational bottleneck in solving sdp2 is handling the single large-dimensional positive semidefinite constraint (22g).

Remark 13 Since the first submission of this paper, additional relaxations and approximations of sdpr have been proposed in the literature; see Ben-Ameur (2024).

5 Computations

In this section, we report computational experiments with the proposed conic relaxations on synthetic as well as benchmark datasets. Semidefinite optimization problems are solved with the MOSEK 8.1 solver, and conic quadratic optimization problems (continuous and mixed-integer) are solved with the CPLEX 12.8 solver. All computations are performed on a laptop with a 1.80GHz Intel Core™ i7-8550U CPU and 16 GB main memory. All solver parameters were set to their default values. We divide our discussion into two parts: first, in 5.2, we focus on the relaxation quality of sdpr and its ability to approximate the exact ℓ0-problem (1); then, in 5.3, we adopt the same experimental framework used by Bertsimas et al. (2016); Hastie et al. (2017) to generate synthetic instances and evaluate the proposed conic formulations from an inference perspective. In both cases, our results compare favorably with existing approaches in the literature. Finally, in 5.4, we summarize our findings and discuss possible extensions.

5.1 Datasets

We use the benchmark datasets in Table 1. The first five were first used by Miyashiro and Takano (2015) in the context of MIO algorithms for best subset selection, and later used by Gómez and Prokopyev (2021). The diabetes dataset with all second-order interactions was introduced by Efron et al. (2004) in the context of lasso, and later used by Bertsimas et al. (2016). A few datasets require some manipulation to eliminate missing values and handle categorical variables. The processed datasets before standardization⁵ can be downloaded from http://atamturk.ieor.berkeley.edu/data/sparse.regression.

5. In our experiments, the datasets were standardized first.

In addition, we also use synthetic datasets generated similarly to those used by Bertsimas et al. (2016); Hastie et al. (2017). Here we present a summary of the simulation setup and refer the readers to Hastie et al. (2017) for an extended description. For given dimensions n, p, sparsity s, predictor autocorrelation ρ, and signal-to-noise ratio SNR, the instances are generated as follows:

1. The (true) coefficients β0 have the first s components equal to one, and the rest equal to zero.
2. The rows of the predictor matrix X ∈ ℝ^{n×p} are drawn i.i.d. from N_p(0, Σ), where Σ ∈ ℝ^{p×p} has entry (i, j) equal to ρ^|i−j|.
3. The response vector y ∈ ℝ^n is drawn from N(Xβ0, σ²I), where σ² = β0⊤Σβ0/SNR.

Similar data generation has been used in the literature (Bertsimas et al., 2016; Hastie et al., 2017).

5.2 Relaxation Quality

In this section, we test the ability of sdp2, given in (22), and of sdp LB, given in (43), to provide near-optimal solutions to problem (1), and compare their performance with MIO approaches.
In 5.2.1, we focus on the pure best subset selection problem with λ = 0, which has received relatively little attention in the literature (Bertsimas et al., 2016); in 5.2.2, we consider problems with ℓ0-ℓ2 regularization, which have received more attention (Bertsimas and Van Parys, 2020; Hazimeh and Mazumder, 2020; Hazimeh et al., 2022; Xie and Deng, 2020); in 5.2.3, we study the impact of the model complexity parameter r on the relaxation quality; and in 5.2.4, we study the scalability of the proposed methods.

Computing optimality gaps for sdpr. The optimal objective value ν_ℓ of sdpr provides a lower bound on the optimal objective value of (1). To obtain an upper bound, we use a simple greedy heuristic to retrieve a feasible solution for (1): given an optimal solution vector β̄ of sdpr, let β̄_(k) denote its k-th largest absolute value. For

  T = { i ∈ P : |β̄_i| ≥ β̄_(k) },

let β̂_T be the k-dimensional ols/ridge estimator using only the predictors in T, i.e., β̂_T = (X_T⊤X_T + λI_T)⁻¹X_T⊤y, where X_T denotes the n × k matrix obtained by removing the columns with indexes not in T, and let β̃ be the p-dimensional vector obtained by filling the missing entries of β̂_T with zeros. Since ‖β̃‖₀ ≤ k by construction, β̃ is feasible for (1), and its objective value ν_u is an upper bound on the optimal objective value of (1). Moreover, the optimality gap provided by any approach can be computed as

  gap = (ν_u − ν_ℓ)/ν_ℓ × 100.                                                    (44)

(A code sketch of this rounding heuristic is given after Table 2 below.) While stronger relaxations result in improved lower bounds ν_ℓ, the corresponding heuristic upper bounds ν_u are not necessarily better; thus, the optimality gaps are not guaranteed to improve with stronger relaxations. Nevertheless, as shown next, stronger relaxations, in general, yield much smaller gaps in practice. We point out that the main focus of the strong relaxations is to obtain improved lower bounds ν_ℓ. Randomized rounding methods (Pilanci et al., 2015; Xie and Deng, 2020), more sophisticated rounding heuristics (Dong et al., 2015), or alternative heuristic methods (Hazimeh and Mazumder, 2020) can be used to obtain improved upper bounds. Nevertheless, the quality of the upper bounds obtained from the greedy rounding method can be used to estimate how well the solutions from the relaxations match the sparsity pattern of the optimal solution.

5.2.1 The λ = 0 case

For each dataset with λ = µ = 0, we solve the conic relaxations sdp1, sdp2, and sdp LB of (1), and the mixed-integer formulation big-M given by (3)–(4). In our experiments, we set M = 3‖β̂_ols‖∞, where β̂_ols is the ordinary least squares estimator,⁶ and set a time limit of 10 minutes. For data with p ≤ 40 we solve problems with cardinalities k ∈ {3, …, 10}, and for diabetes and crime we solve problems with k ∈ {3, …, 30}. Table 2 shows, for each dataset and method, the average lower bound (LB) and upper bound (UB) found by each method, the gap (44), and the time required to solve the problems (in seconds); the average is taken across all k values. In all cases, lower and upper bounds are scaled so that the best upper bound for any given instance has value ν_u = 100.

The big-M method is highly inconsistent and prone to numerical difficulties due to the use of big-M constraints. First, for three datasets (servo, auto MPG, and breast cancer) the method fails due to numerical issues ("failure to solve MIP subproblem").
In addition, for solar flare the solver reports very fast solution times, but the solutions are, in fact, infeasible for problem (1): by default in CPLEX, if z_i ≤ 10⁻⁵ in a solution, then z_i is deemed to satisfy the integrality constraint z_i ∈ {0, 1}. Thus, if the big-M constant is large enough, then constraint (4) may, in fact, allow nonzero values of β_i even when z_i = 0. In particular, in solar flare we found that the solution β_mio reported by the MIO solver satisfies⁷ ‖β_mio‖₀ = 20, regardless of the value of k used, violating the sparsity constraint.

6. Bertsimas et al. (2016) set M = 2‖β̂‖∞ for some heuristic solution β̂.
7. We consider β_i ≠ 0 whenever |β_i| > 10⁻⁴.

dataset        method   LB             UB             gap(%)        time(s)
housing        sdp1     99.4 ± 0.6     100.1 ± 0.1     0.7 ± 0.0     0.03 ± 0.02
               sdp2     99.6 ± 0.6     100.1 ± 0.1     0.5 ± 0.6     0.07 ± 0.03
               sdp LB   98.8 ± 0.6     100.4 ± 0.1     1.6 ± 0.8     0.06 ± 0.03
               big-M   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0     0.01 ± 0.01
servo          sdp1     86.8 ± 5.5     109.5 ± 10.3   27.3 ± 20.6    0.02 ± 0.01
               sdp2     94.9 ± 2.9     106.2 ± 16.5   12.2 ± 19.8    0.10 ± 0.01
               sdp LB   89.5 ± 2.7     109.2 ± 15.5   21.8 ± 14.9    0.17 ± 0.03
               big-M    †
auto MPG       sdp1     75.3 ± 10.3    115.3 ± 6.0    55.8 ± 23.7    0.07 ± 0.04
               sdp2     96.7 ± 3.3     100.5 ± 0.8     4.0 ± 4.2     0.24 ± 0.02
               sdp LB   78.8 ± 7.7     101.6 ± 2.7    30.0 ± 14.0    0.40 ± 0.09
               big-M    †
solar flare    sdp1     97.5 ± 1.5     103.3 ± 1.1     6.0 ± 2.0     0.07 ± 0.03
               sdp2     99.2 ± 0.8     100.0 ± 0.0     1.0 ± 0.6     0.28 ± 0.06
               sdp LB   97.8 ± 1.6     102.3 ± 1.9     4.6 ± 2.7     0.13 ± 0.02
               big-M‡   98.1 ± 1.7      98.1 ± 1.7     –             0.01 ± 0.01
breast cancer  sdp1     88.9 ± 3.1     101.5 ± 1.7    14.4 ± 5.6     0.15 ± 0.02
               sdp2     98.0 ± 0.6     100.4 ± 0.8     2.4 ± 1.1     0.77 ± 0.07
               sdp LB   94.8 ± 0.5     100.5 ± 0.7     6.0 ± 0.5     0.40 ± 0.03
               big-M    †
diabetes       sdp1     95.2 ± 3.2     115.2 ± 11.8   22.2 ± 16.3    3.58 ± 0.77
               sdp2     97.4 ± 1.3     105.4 ± 4.2     8.2 ± 5.2     9.28 ± 1.12
               sdp LB   †
               big-M    99.0 ± 0.9     100.0 ± 0.0     1.0 ± 0.9   416.17 ± 260.57
crime          sdp1     97.8 ± 1.3     103.2 ± 2.4     5.6 ± 3.6    17.82 ± 0.98
               sdp2     99.0 ± 0.8     101.6 ± 2.0     2.7 ± 2.7    45.29 ± 4.06
               sdp LB   94.6 ± 2.0     109.7 ± 2.8    16.0 ± 4.9     5.87 ± 0.43
               big-M    96.4 ± 1.7     100.0 ± 0.0     3.7 ± 1.8   527.03 ± 185.64

Table 2: Results with λ = 0 on real instances. Lower and upper bounds are scaled so that the best upper bound for a given instance is 100. Mean ± stdev are reported. † = error in solving problem. ‡ = infeasible solution is reported as optimal.

We also point out that sdp LB struggles with numerical difficulties in diabetes: the problems are incorrectly found to be unbounded. In contrast, the sdpr methods are solved without numerical difficulties. In terms of relaxation quality, we find that sdp2 is the best, as expected: it consistently delivers better lower and upper bounds than the other conic relaxations, and it even outperforms big-M in terms of lower bounds and gaps on the largest dataset (crime). The strength of the relaxation comes at the expense of a 2–4-fold larger computation time than sdp1; on the other hand, sdp2 is substantially faster than big-M on large datasets. We see that neither sdp1 nor sdp LB dominates the other in terms of relaxation quality. While sdp1 is faster on the smaller datasets, sdp LB is faster on crime, indicating that sdp LB may scale better (we corroborate this statement in 5.2.4). Finally, big-M, in datasets where numerical issues do not occur, is able to find high-quality solutions consistently, but struggles to find matching lower bounds in larger instances despite the significantly higher computation time spent.

Figures 4 and 5 present detailed results on lower bounds and gaps as a function of the sparsity parameter k for the diabetes and crime datasets. For small values of k, big-M is arguably the best method, solving the problems to optimality.
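For reference, the greedy rounding heuristic used to produce the upper bounds ν_u in Table 2, described at the beginning of 5.2, can be sketched as follows. This is a minimal numpy illustration of our own, with hypothetical names.

```python
import numpy as np

def greedy_round(X, y, beta_relax, k, lam=0.0):
    """Keep the k largest |beta_i| of the relaxation solution, refit
    ols/ridge on that support, and fill the remaining entries with zeros."""
    n, p = X.shape
    T = np.argsort(-np.abs(beta_relax))[:k]     # support of the k largest entries
    XT = X[:, T]
    beta_T = np.linalg.solve(XT.T @ XT + lam * np.eye(k), XT.T @ y)
    beta_feas = np.zeros(p)
    beta_feas[T] = beta_T                        # feasible: ||beta_feas||_0 <= k
    nu_u = np.sum((y - X @ beta_feas) ** 2) + lam * np.sum(beta_T ** 2)
    return beta_feas, nu_u

# With a lower bound nu_l from the relaxation, the gap (44) is
# gap = (nu_u - nu_l) / nu_l * 100.
```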
However, as k increases, the quality of the lower bounds and gaps deteriorates: for diabetes, sdp2 finds better solutions than big-M for k ≥ 18; for crime, sdp1 and sdp2 find better lower bounds for k ≥ 8 (and, in the case of sdp2, better gaps as well), and sdp LB matches the lower bound found by big-M for k ≥ 14, despite requiring only five seconds (instead of 10 minutes) to find such lower bounds. Observe that the number of possible supports C(p, k) = O(p^k) for problem (1) scales exponentially with k; thus, enumerative methods such as branch-and-bound may struggle as k grows.

[Figure 4: Detailed results on the diabetes dataset with λ = 0: (a) lower bounds and (b) optimality gaps as a function of k, for sdp1, sdp2, and big-M.]

[Figure 5: Detailed results on the crime dataset with λ = 0: (a) lower bounds and (b) optimality gaps as a function of k, for sdp1, sdp2, sdp LB, and big-M.]

5.2.2 The λ > 0 case

For each dataset with⁸ λ = 0.05 and µ = 0, we solve the conic relaxations sdp1, sdp2, and sdp LB of (1), and the big-M-free mixed-integer formulation (5), with a time limit of 10 minutes (persp). This MIO formulation is possible since λ > 0, and has been shown to be competitive (Xie and Deng, 2020; Hazimeh et al., 2022) with the tailored algorithm proposed by Bertsimas and Van Parys (2020). For datasets with p ≤ 40 we solve the problems with cardinalities k ∈ {3, …, 10}, and for diabetes and crime we solve the problems with k ∈ {3, …, 30}. Table 3 shows, for each dataset and method, the average lower bound (LB) and upper bound (UB) found by each method, the gap (44), and the time required to solve the problems (in seconds); the average is taken across all k values. In all cases, lower and upper bounds are scaled so that the best upper bound for any given instance has value ν_u = 100.

8. Since the data is standardized so that each column has unit norm, a value of λ = 0.05 corresponds to an increase of 5% in the diagonal elements of the matrix X⊤X + λI.

We observe that instances with λ = 0.05 are much easier to solve than those with λ = 0: no numerical issues occur for sdp LB or persp, and lower and upper bounds are much better for all methods. The mixed-integer formulation persp comfortably solves the small instances with p ≤ 40 to optimality, but sdp2 yields better lower bounds and gaps for the larger instances diabetes and crime in a fraction of the time used by persp. Figures 6 and 7 present lower bounds and gaps as a function of the regularization parameter λ for the diabetes and crime datasets (with k = 15). We observe that for low values of λ, persp struggles to find good lower bounds: e.g., it is outperformed by all conic relaxations in crime for λ ≤ 0.02, and is worse than sdp2 for λ ≤ 0.1 in terms of lower bounds and gaps in both datasets. As λ increases, all methods deliver better bounds, and persp is eventually able to solve all problems to optimality. As expected, the performance of persp improves as λ increases. The perspective relaxation discussed in 1.3 exploits the separable terms introduced by the ℓ2-regularization: as λ increases, these separable terms have a larger weight in the objective, and the strength of the relaxation improves as a consequence. Note that the conic relaxations also improve with larger λ: they are based on decompositions of the matrix X⊤X + λI into one- and two-variable terms, and the addition of the separable terms allows for a much richer set of decompositions.
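The effect of the ℓ2 term on diagonal dominance is easy to see numerically. The sketch below, our own illustration with hypothetical names, reports the smallest "dominance slack" min_i { Q_ii − Σ_{j≠i} |Q_ij| } of Q = X⊤X + λI on a random standardized design, for growing λ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 30
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)          # columns standardized to unit norm

for lam in [0.0, 0.01, 0.05, 0.15, 0.5]:
    Q = X.T @ X + lam * np.eye(p)
    slack = np.diag(Q) - (np.abs(Q).sum(axis=1) - np.abs(np.diag(Q)))
    print(f"lambda={lam:5.2f}  min slack={slack.min():.3f}")
```

Since the columns have unit norm, each increase of λ adds exactly λ to every diagonal entry, so the slack grows linearly and the decomposition-based relaxations gain room to work with.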
For large values of λ, X⊤X + λI becomes highly diagonally dominant, and the perspective relaxation alone provides a substantial strengthening. In this case, the advanced conic relaxations have a marginal impact, and MIO methods with perspective strengthening perform better overall. In contrast, for low values of λ, the conic relaxations result in substantial strengthening over the perspective relaxation, and sdpr outperforms persp as a consequence.

dataset        method   LB             UB             gap(%)        time(s)
housing        sdp1     99.7 ± 0.4     100.2 ± 0.3     0.5 ± 0.6     0.03 ± 0.02
               sdp2     99.8 ± 0.3     100.1 ± 0.2     0.3 ± 0.5     0.06 ± 0.02
               sdp LB   99.5 ± 0.4     100.3 ± 0.3     0.8 ± 0.6     0.06 ± 0.02
               persp   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0     0.11 ± 0.03
servo          sdp1     95.9 ± 3.0     102.2 ± 6.7     6.7 ± 4.1     0.03 ± 0.01
               sdp2     99.5 ± 0.5     100.6 ± 1.1     1.1 ± 1.6     0.11 ± 0.01
               sdp LB   97.6 ± 1.4     102.0 ± 2.1     4.6 ± 3.3     0.16 ± 0.02
               persp   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0     0.28 ± 0.13
auto MPG       sdp1     89.1 ± 6.1     101.4 ± 1.2    14.4 ± 8.5     0.05 ± 0.01
               sdp2     99.8 ± 0.2     100.0 ± 0.1     0.2 ± 0.3     0.25 ± 0.04
               sdp LB   92.7 ± 3.1     101.1 ± 1.5     9.2 ± 4.0     0.35 ± 0.02
               persp   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0     1.29 ± 0.60
solar flare    sdp1     99.3 ± 0.5     100.1 ± 0.1     0.8 ± 0.5     0.07 ± 0.01
               sdp2     99.9 ± 0.1     100.1 ± 0.1     0.2 ± 0.1     0.28 ± 0.03
               sdp LB   99.2 ± 0.7     100.4 ± 1.2     1.2 ± 1.0     0.16 ± 0.03
               persp   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0     1.75 ± 1.07
breast cancer  sdp1     94.9 ± 1.8     100.8 ± 0.4     6.3 ± 2.4     0.18 ± 0.04
               sdp2     99.6 ± 0.2     100.1 ± 0.2     0.5 ± 0.3     0.72 ± 0.06
               sdp LB   97.5 ± 0.6     100.5 ± 0.4     2.9 ± 0.9     0.36 ± 0.05
               persp   100.0 ± 0.0     100.0 ± 0.0     0.0 ± 0.0    56.12 ± 44.34
diabetes       sdp1     98.9 ± 0.6     100.2 ± 0.2     1.2 ± 0.7     2.13 ± 0.24
               sdp2     99.6 ± 0.2     100.1 ± 0.1     0.5 ± 0.3     5.83 ± 0.79
               sdp LB   98.2 ± 1.3     100.3 ± 0.3     2.2 ± 1.4     1.48 ± 0.18
               persp    99.4 ± 0.5     100.0 ± 0.0     0.6 ± 0.5   441.90 ± 258.29
crime          sdp1     99.3 ± 0.9     100.3 ± 0.9     1.1 ± 1.7    19.15 ± 1.30
               sdp2     99.7 ± 0.4     100.2 ± 0.8     0.5 ± 1.0    43.86 ± 2.38
               sdp LB   98.7 ± 1.0     100.7 ± 1.3     2.0 ± 2.3     5.30 ± 0.35
               persp    99.5 ± 0.4     100.1 ± 0.1     0.6 ± 0.4   518.03 ± 175.65

Table 3: Results with λ = 0.05 on real instances. Lower and upper bounds are scaled so that the best upper bound for a given instance is 100. Mean ± stdev are reported.

[Figure 6: Detailed results on the diabetes dataset with k = 15: (a) lower bounds and (b) optimality gaps as a function of λ, for sdp1, sdp2, sdp LB, and persp.]

[Figure 7: Detailed results on the crime dataset with k = 15: (a) lower bounds and (b) optimality gaps as a function of λ, for sdp1, sdp2, sdp LB, and persp.]

5.2.3 The effect of the model complexity r

In 5.2.1–5.2.2 we reported computations with sdpr for r ≤ 2. In experiments with those datasets, sdp3 yields almost the same strengthening as sdp2, but at a much larger computational cost. Since sdp2 already achieves gaps close to 0 in those instances, there is little room for improvement with higher values of r. If the matrix X⊤X + λI has high diagonal dominance, which happens if n > p or if λ is large, then there are many ways to decompose it into low-dimensional rank-one terms, and sdpr with small r achieves good relaxations. In contrast, if the matrix X⊤X + λI has low diagonal dominance, it may be difficult to extract low-dimensional rank-one terms. In the extreme case of a rank-one matrix, while sdp_p results in the convex-hull description, sdpr with r < p achieves no improvement. In this section, we illustrate this phenomenon. First, we test on small synthetic instances with p = 15 and n = 10, setting the true sparsity to s = 5, autocorrelation ρ = 0.35, signal-to-noise ratio SNR ∈ {1, 5}, and sparsity k ∈ {3, 4, 5, 6, 7, 8}, and for each combination of parameters we generate five instances (a code sketch of the generation recipe of 5.1 is given below).
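For concreteness, the instance generation of 5.1 used throughout this subsection can be sketched as follows. This is a minimal numpy sketch of our own, not the authors' code, and all names are hypothetical.

```python
import numpy as np

def generate_instance(n, p, s, rho, snr, rng):
    """Steps 1-3 of the generation recipe in Section 5.1."""
    beta0 = np.zeros(p)
    beta0[:s] = 1.0                                            # step 1
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))         # Sigma_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)    # step 2
    sigma2 = beta0 @ Sigma @ beta0 / snr                       # step 3
    y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    return X, y, beta0

rng = np.random.default_rng(0)
X, y, beta0 = generate_instance(n=10, p=15, s=5, rho=0.35, snr=1.0, rng=rng)
```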
We report in Figure 8 the gaps obtained by sdpr for different values of r and λ, averaging across instances and different values of SNR and k. In addition, Figure 9 depicts the distribution of the computational times required to solve the problems. We observe that for λ = 0, sdpr with r ≤ 4 results in no strengthening and gaps of 100%; sdp5 results in a small improvement (note that 5 = p − n), while sdpr with r ≥ 6 results in larger improvements. These results suggest that, with λ = 0, stronger formulations require rank-one strengthening with at least p − n variables. We also observe that, as λ increases, the gaps reported by all methods decrease substantially, and the incremental strengthening obtained from larger values of r decreases: for λ ≥ 0.05, sdp4 performs almost identically to sdp8, and for λ = 0.15, sdp3 is similar to sdp8 and sdp2 already results in low optimality gaps. The computational time required to solve sdpr scales exponentially with r, since the number of constraints increases exponentially as well. We conclude that sdp2 is well suited for the n > p case or for medium values of λ (for larger values sdp1 or even the simple perspective relaxation may be preferable), while sdpr with r ≥ 3 achieves a good improvement in relaxation quality for low values of λ, at the expense of larger computational times.

Next, we test sdpr on the housing dataset with λ = 0. Since this dataset has high diagonal dominance (see Table 1), even sdp1 results in small optimality gaps (see the first row of Table 2). Thus, instead of using all n = 506 datapoints, we randomly select only n0 < n datapoints, resulting in smaller diagonal dominance; for n0 = 10, the matrix X⊤X is rank-deficient. Figure 10 shows the optimality gaps as a function of r and n0. The relaxation quality improves with increasing n0 and increasing r. For n0 = 20, setting r ≥ 5 produces optimality gaps close to 0%, while sdp1 results in a gap of 13%, sdp2 a gap of 8%, and sdp3 a gap of 3%. For n0 = 50, sdp2 results in a gap of 2%, and sdp3 yields a gap almost equal to 0%. Overall, the results are consistent with the experiments on synthetic data.

[Figure 8: Optimality gaps of sdpr with synthetic data, 1 ≤ r ≤ 8; panels (a)–(f) correspond to λ ∈ {0, 0.01, 0.02, 0.05, 0.10, 0.15}.]

[Figure 9: Time required to solve sdpr, 1 ≤ r ≤ 8.]

[Figure 10: Optimality gaps of sdpr with the altered housing dataset (using only n0 datapoints), 1 ≤ r ≤ 8; panels: (a) n0 = 10, (b) n0 = 20, (c) n0 = 50.]

5.2.4 On scalability

As discussed in 5.2.3, formulation sdpr for large values of r can be expensive to solve. Moreover, even sdp1 and sdp2 are semidefinite programs, which may not scale well for large values of p.
In this section, we present computations illustrating that, while this is indeed the case, formulation sdp LB, which replaces the semidefinite constraint B − ββ⊤ ∈ S^P_+ with the quadratic constraints (43g), scales much better and, in fact, can significantly outperform persp in terms of relaxation quality. We generate synthetic instances with p ∈ {100, 150, …, 500}, n = 500, true sparsity parameter s = 30, autocorrelation ρ = 0.35, signal-to-noise ratio SNR ∈ {1, 5}, and sparsity k = 30; for each combination of parameters we generate five instances, and solve them for λ ∈ {0.01, 0.02, 0.05, 0.15} and µ = 0. Table 4 reports, for sdp1, sdp2, sdp LB, and persp (using formulation (5) with a time limit of 600 seconds), the time required to solve the problems and the optimality gap proven. We observe that persp is unable to solve the problems within the ten-minute time limit and results in larger gaps than all other approaches, despite using substantially more time in most cases. We also observe that the sdpr formulations struggle in instances with p ≥ 200. Interestingly, sdp2 consistently requires 2–4 times more time than sdp1, regardless of the dimension p. A similar factor was observed in Tables 2 and 3 with real data, suggesting that computational times with sdp2 are within the same order of magnitude as those of sdp1. Finally, sdp LB is substantially faster than both sdp1 and sdp2. While it results in larger gaps than sdp2 (as expected, since the high-dimensional constraint (22g) is relaxed), it still yields better optimality gaps than persp.

  p     sdp1               sdp2                sdp LB             persp
        time(s)   gap(%)   time(s)    gap(%)   time(s)  gap(%)    time(s)  gap(%)
 100     19 ± 3   0.5±0.6    44 ± 9   0.1±0.1    5 ± 1   1.0±1.1    TL     3.3±4.2
 150    153 ± 19  1.1±1.5   356 ± 61  0.2±0.4   20 ± 2   1.9±2.2    TL     6.2±7.2
 200    673 ± 64  2.7±3.0  1,691 ±165 0.6±0.9   42 ± 3   4.3±4.0    TL    12.5±10.5
 250      ∗         ∗         ∗         ∗       79 ± 4   7.1±5.6    TL    17.2±13.2
 300      ∗         ∗         ∗         ∗      147 ± 7  12.2±7.6    TL    21.7±14.6
 350      ∗         ∗         ∗         ∗      248 ± 14 17.6±11.0   TL    25.9±17.0
 400      ∗         ∗         ∗         ∗      391 ± 36 24.0±14.9   TL    29.1±18.9
 450      ∗         ∗         ∗         ∗      394 ± 46 32.2±18.3   TL    34.3±21.0
 500      ∗         ∗         ∗         ∗      462 ± 43 39.3±21.9   TL    38.8±22.8

Table 4: Computational times and gaps on synthetic instances as a function of p. TL = time limit. ∗ = unable to solve (either due to very large computational times or memory issues). Numbers after ± are the sample standard deviation.

5.3 Inference study on synthetic instances

We now present inference results on synthetic data using the same simulation setup as Bertsimas et al. (2016) and Hastie et al. (2017); see Hastie et al. (2017) for an extended description. Specifically, we generate synthetic data as described in 5.1, and use the evaluation metrics of Hastie et al. (2017), described next.

5.3.1 Evaluation metrics

Let x0 denote the test predictor drawn from N_p(0, Σ) and let y0 denote its associated response value drawn from N(x0⊤β0, σ²). Given an estimator β̂ of β0, the following metrics are reported:

Relative risk: RR(β̂) = E(x0⊤β̂ − x0⊤β0)² / E(x0⊤β0)², with a perfect score of 0 and null score of 1.

Relative test error: RTE(β̂) = E(y0 − x0⊤β̂)² / σ², with a perfect score of 1 and null score of SNR + 1.

Proportion of variance explained: 1 − E(y0 − x0⊤β̂)² / Var(y0), with a perfect score of SNR/(1 + SNR) and null score of 0.

Sparsity: We record the number of nonzeros,⁹ ‖β̂‖₀, as done by Hastie et al. (2017). Additionally, we also report the number of variables correctly identified, given by Σ_{i=1}^p 1{β̂_i ≠ 0 and (β0)_i ≠ 0}. (A code sketch of these metrics is given below.)

5.3.2 Procedures

In addition to the training set of size n, a validation set of size n is generated with the same parameters, matching the precision of leave-one-out cross-validation.
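For reference, under the generative model of 5.1 the expectations in these metrics have closed forms: E(x0⊤β̂ − x0⊤β0)² = (β̂ − β0)⊤Σ(β̂ − β0) and E(y0 − x0⊤β̂)² = (β̂ − β0)⊤Σ(β̂ − β0) + σ², so all metrics can be evaluated exactly. The following minimal numpy sketch is our own illustration, with hypothetical names.

```python
import numpy as np

def eval_metrics(beta_hat, beta0, Sigma, sigma2):
    """Population versions of the metrics in 5.3.1."""
    d = beta_hat - beta0
    err = d @ Sigma @ d                        # E(x0'beta_hat - x0'beta0)^2
    rr = err / (beta0 @ Sigma @ beta0)         # relative risk
    rte = (err + sigma2) / sigma2              # relative test error
    pve = 1 - (err + sigma2) / (beta0 @ Sigma @ beta0 + sigma2)
    nnz = int(np.sum(np.abs(beta_hat) > 1e-5))             # sparsity, footnote 9
    correct = int(np.sum((np.abs(beta_hat) > 1e-5) & (beta0 != 0)))
    return rr, rte, pve, nnz, correct
```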
We use the following procedures to obtain the estimators β̂.

elastic net. We solve the elastic net procedure using the parametrization

  min_{β∈ℝ^p} ‖y − Xβ‖₂² + λ( α‖β‖₁ + (1 − α)‖β‖₂² ),

where α, λ ≥ 0 are the regularization parameters. We let α = 0.1ℓ for integer 0 ≤ ℓ ≤ 10, generate 50 values of λ ranging from λmax = ‖X⊤y‖∞ to λmax/200 on a log scale, and use the pair (α, λ) that results in the best prediction error on the validation set. A total of 500 (α, λ) pairs are tested.

sdp2. The estimator obtained from solving sdp2 (λ = µ = 0) for all values of k = 0, …, 7, choosing the one that results in the best prediction error on the validation set.

The elastic net procedure approximately corresponds to the lasso procedure with 100 tuning parameters used by Hastie et al. (2017). Similarly, sdp2 with cross-validation approximately corresponds to the best subset procedure with 51 tuning parameters¹⁰ used by Hastie et al. (2017); nonetheless, the estimators from Hastie et al. (2017) are obtained by running an MIO solver for 3 minutes, while ours are obtained from solving a strong convex relaxation to optimality.

5.3.3 Optimality gaps and computation times

Before describing the statistical results, we briefly comment on the relaxation quality and computation time of sdp2. Table 5 shows, for instances with n = 500, p = 100, and s = 5, the optimality gap and computation time of sdp2; each column represents the average over ten instances generated with the same parameters. In all cases, sdp2 produces optimal or near-optimal estimators, with an optimality gap of at most 0.3%. In fact, with sdp2, we find that 97% of the estimators for ρ = 0.00 and 68% of the estimators for ρ = 0.35 are provably optimal¹¹ for (1). For comparison, Hastie et al. (2017) report that, in their experiments, the MIO solver (with a time limit of three minutes) is able to prove optimality for only 35% of the instances generated with similar parameters. Although Hastie et al. (2017) do not report optimality gaps for the instances where optimality is not proven, we conjecture that such gaps are significantly larger than those reported in Table 5 due to the weak relaxations of big-M formulations. In summary, for this class of instances, sdp2 is able to produce optimal or practically optimal estimators of (1) in about 30 seconds.

9. An entry β̂_i is deemed to be nonzero if |β̂_i| > 10⁻⁵. This is the default integrality precision in commercial MIO solvers.
10. Hastie et al. (2017) use values of k = 0, …, 50. Nonetheless, in our computations with the same tuning parameters, we found that values of k ≥ 8 are never selected after cross-validation. Thus, our procedure with 8 tuning parameters yields the same results as the one with 51 parameters from a statistical viewpoint, but requires only a fraction of the computational effort.
11. A solution is deemed optimal if gap < 10⁻⁴, which is the default parameter in MIO solvers.

SNR               0.05  0.09  0.14  0.25  0.42  0.71  1.22  2.07  3.52  6.00   avg
ρ = 0.00   gap     0.1   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
           time   45.2  38.8  38.6  29.5  29.3  28.4  27.4  26.3  26.4  25.9  31.6
ρ = 0.35   gap     0.3   0.2   0.3   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.1
           time   48.0  47.6  49.4  44.1  39.3  30.7  29.0  29.1  27.3  28.0  37.3

Table 5: Optimality gap and computation time (in seconds) of sdp2 with n = 500, p = 100, s = k = 5, λ = µ = 0.

5.3.4 Results: accuracy metrics

Figure 11 plots the relative risk, relative test error, proportion of variance explained, and sparsity results as a function of the SNR for instances with n = 500, p = 100, s = 5, and ρ = 0. Figure 12 plots the same results for instances with ρ = 0.35.
The setting with ρ = 0.35 was also used by Hastie et al. (2017). We see that elastic net outperforms sdp2 in low SNR settings, i.e., SNR = 0.05 for ρ = 0 and SNR ≤ 0.14 for ρ = 0.35, but results in worse predictive performance for all other SNRs. Moreover, sdp2 is able to recover the true sparsity pattern of β0 for sufficiently large SNR, while elastic net is unable to do so. We also see that sdp2 performs comparatively better than elastic net in instances with ρ = 0. Indeed, for large autocorrelations ρ, features with (β0)_i = 0 still have predictive value; thus the dense estimator obtained by elastic net retains relatively good predictive performance (however, such dense solutions are undesirable from an interpretability perspective). In contrast, when ρ = 0, such features are simply noise, and elastic net results in overfitting, while methods that deliver sparse solutions, such as sdp2, perform much better in comparison. We also note that sdp2 selects models corresponding to sparsities k < s at low SNRs, while it consistently selects models with k = s at high SNRs. We point out that, as suggested by Mazumder et al. (2022), the results for low SNR could potentially be improved by fitting models with µ > 0.

Remark 14 As an alternative to cross-validation, the sparsity of the statistical model can also be selected using information criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1978). As shown by Gómez and Prokopyev (2021), this can be accomplished efficiently by combining the relaxation sdpr with fractional optimization techniques. We refer the reader to Gómez and Prokopyev (2021) for detailed computational and statistical results.

[Figure 11: Relative risk (to the null model), relative test error (to Bayes), proportion of variance explained, and sparsity (correct/incorrect selections) as a function of SNR, with n = 500, p = 100, s = 5, and ρ = 0.00, for elastic net and sdp2.]

[Figure 12: Relative risk (to the null model), relative test error (to Bayes), proportion of variance explained, and sparsity (correct/incorrect selections) as a function of SNR, with n = 500, p = 100, s = 5, and ρ = 0.35, for elastic net and sdp2.]

5.4 Summary and extensions

The conic formulations sdp2 and sdp LB are able to deliver near-optimal solutions to problem (1) with p in the hundreds, and result in substantially better performance than MIO methods (in computational times and numerical stability) in instances with low or no ℓ2 regularization.
Such performance makes the approaches directly applicable in several high-stakes domains where interpretability is a major consideration (e.g., p ≈ 50 in the highly publicized COMPAS recidivism case (Wexler, 2017); p ≈ 20 in the setting considered by Cozad et al. (2014), where the output of the regression is used to optimize a carbon capture adsorber). Due to the computational challenges of solving conic optimization problems via second-order methods, the solution approach presented in this paper (using an off-the-shelf solver) does not scale to larger instances. Nonetheless, as we now point out, the formulations presented here may serve as a basis for methods that scale to larger values of p, or that tackle statistical problems other than sparse regression.

First, tailored implementations that do not rely on off-the-shelf solvers are possible. For example, Liu et al. (2023) consider sparse inference problems with graphical models, which can be interpreted as a special case of (1) where the matrix X⊤X is sparse. They use sdp2 as a base relaxation of their method and develop a tailored primal-dual method to solve it; the resulting approach scales comfortably to problems with p in the thousands. Similarly, Wei and Küçükyavuz (2023) use outer approximations to implement the rank-one relaxations, which (coupled with state-of-the-art branch-and-bound algorithms) can solve mixed-integer quadratic optimization problems with up to 500 variables to optimality.

Second, sdpr can be used as a subroutine of a more sophisticated method. For example, Hazimeh and Mazumder (2020) propose a method that approximately solves problem (1), scales to problems with p ≈ 10⁵, and produces combinatorially local solutions. A key idea of their approach is to solve MIO problems involving only a small subset of the variables to escape local minima. In a similar vein, it would be possible to use relaxations sdp2 with a small subset of the variables to identify descent directions.

Finally, we point out that both the quadratic loss function and sparsity are used pervasively in machine learning; thus, the theory and methods developed in this paper may be generally applicable. For example, Bertsimas et al. (2021) incorporate some of the ideas discussed in 4 to develop algorithms that solve matrix completion problems to certifiable optimality.

6 Conclusions

In this paper we derive strong convex relaxations for sparse regression. The relaxations are based on the ideal formulations for rank-one quadratic terms with indicator variables. The new relaxations are formulated as semidefinite optimization problems in an extended space and are stronger and more general than the state-of-the-art formulations. In our computational experiments, the proposed conic formulations outperform the existing approaches, both in terms of accurately approximating the best subset selection problems and of achieving desirable estimation properties in statistical inference problems with sparsity.

Acknowledgments and Disclosure of Funding

A. Atamtürk is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533 and Grant N00014-24-1-2149 from the Office of Naval Research. A. Gómez is supported, in part, by Grants No. 1818700 and 2006762 from the National Science Foundation and Grant No. FA9550-24-1-0086 from the Air Force Office of Scientific Research.

References

Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. Learning optimal and fair decision trees for non-discriminative decision-making.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1418–1426, 2019.

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.

M. Selim Aktürk, Alper Atamtürk, and Sinan Gürel. A strong conic quadratic reformulation for machine-job assignment with controllable processing times. Operations Research Letters, 37:187–191, 2009.

Alper Atamtürk and Andrés Gómez. Strong formulations for quadratic optimization with M-matrices and indicator variables. Mathematical Programming, 170:141–176, 2018.

Alper Atamtürk and Andrés Gómez. Safe screening rules for ℓ0-regression from perspective relaxations. In International Conference on Machine Learning, pages 421–430. PMLR 119, 2020.

Alper Atamtürk and Vishnu Narayanan. Cuts for conic mixed integer programming. In M. Fischetti and D. P. Williamson, editors, Proceedings of the 12th International IPCO Conference, pages 16–29, 2007.

Alper Atamtürk, Andrés Gómez, and Shaoning Han. Sparse and smooth signal estimation: Convexification of ℓ0-formulations. Journal of Machine Learning Research, 22(52):1–43, 2021.

Mohammad Javad Azizi, Phebe Vayanos, Bryan Wilder, Eric Rice, and Milind Tambe. Designing fair, efficient, and interpretable policies for prioritizing homeless youth for housing resources. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research: 15th International Conference, CPAIOR 2018, Delft, The Netherlands, June 26–29, 2018, Proceedings 15, pages 35–51. Springer, 2018.

Walid Ben-Ameur. Subset selection and the cone of factor-width-k matrices. SIAM Journal on Optimization, 34(1):817–843, 2024.

Dimitris Bertsimas and Angela King. OR forum: An algorithmic approach to linear regression. Operations Research, 64:2–16, 2015.

Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression. The Annals of Statistics, 48(1):300–323, 2020.

Dimitris Bertsimas, Angela King, Rahul Mazumder, et al. Best subset selection via a modern optimization lens. The Annals of Statistics, 44:813–852, 2016.

Dimitris Bertsimas, Ryan Cory-Wright, and Jean Pauphilet. Mixed-projection conic optimization: A new paradigm for modeling rank constraints. Operations Research, 70(6):3321–3344, 2021.

Robert E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2012.

Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C. Bayan Bruss, Andrew G. Wilson, Tom Goldstein, and Micah Goldblum. A performance-driven benchmark for feature selection in tabular deep learning. Advances in Neural Information Processing Systems, 36, 2024.

Michaël Chichignoud, Johannes Lederer, and Martin J. Wainwright. A practical scheme and fast algorithm to tune the lasso with optimality guarantees. The Journal of Machine Learning Research, 17:8162–8181, 2016.

Alison Cozad, Nikolaos V. Sahinidis, and David C. Miller. Learning surrogate models for simulation-based optimization. AIChE Journal, 60:2211–2227, 2014.

Anna Deza and Alper Atamtürk. Safe screening for logistic regression with ℓ0-ℓ2 regularization. In KDIR, pages 119–126, 2022.

Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Vu C. Dinh and Lam S. Ho. Consistent feature selection for analytic deep neural networks. Advances in Neural Information Processing Systems, 33:2420–2431, 2020.

Hongbo Dong, Kun Chen, and Jeff Linderoth.
Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083, 2015.

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32:407–499, 2004.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.

Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106:225–236, 2006.

Antonio Frangioni, Claudio Gentile, and James Hungerford. Decompositions of semidefinite matrices and the perspective reformulation of nonseparable quadratic programs. Mathematics of Operations Research, 45(1):15–33, 2020.

Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35:109–135, 1993.

Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.

Andrés Gómez. Strong formulations for conic quadratic optimization with indicator variables. Mathematical Programming, 188(1):193–226, 2021.

Andrés Gómez and Oleg A. Prokopyev. A mixed-integer fractional optimization approach to best subset selection. INFORMS Journal on Computing, 33(2):551–565, 2021.

Oktay Günlük and Jeff Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables. Mathematical Programming, 124:183–205, 2010.

Shaoning Han and Andrés Gómez. Compact extended formulations for low-rank functions with indicator variables. Mathematics of Operations Research, 2024.

Shaoning Han, Andrés Gómez, and Alper Atamtürk. The equivalence of optimal perspective formulation and Shor's SDP for quadratic programs with indicator variables. Operations Research Letters, 50(2):195–198, 2022.

Shaoning Han, Andrés Gómez, and Alper Atamtürk. 2×2-convexifications for convex quadratic optimization with indicator variables. Mathematical Programming, 202(1):95–134, 2023.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Trevor Hastie, Robert Tibshirani, and Ryan J. Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.

Hussein Hazimeh and Rahul Mazumder. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5):1517–1537, 2020.

Hussein Hazimeh, Rahul Mazumder, and Ali Saab. Sparse regression at scale: Branch-and-bound rooted in first-order optimization. Mathematical Programming, 196(1):347–388, 2022.

Hussein Hazimeh, Rahul Mazumder, and Tim Nonet. L0Learn: A scalable package for sparse learning using ℓ0 regularization. Journal of Machine Learning Research, 24(205):1–8, 2023a.

Hussein Hazimeh, Rahul Mazumder, and Peter Radchenko. Grouped variable selection with discrete optimization: Computational and statistical perspectives. The Annals of Statistics, 51(1):1–32, 2023b.

Mohamed Hebiri, Sara Van De Geer, et al.
The smooth-lasso and other ℓ1+ℓ2-penalized methods. Electronic Journal of Statistics, 5:1184–1226, 2011.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.

Jian Huang, Yuling Jiao, Yanyan Liu, and Xiliang Lu. A constructive approach to L0 penalized regression. The Journal of Machine Learning Research, 19:403–439, 2018.

David R. Hunter and Runze Li. Variable selection using MM algorithms. Annals of Statistics, 33:1617, 2005.

Luca Insolia, Ana Kenney, Francesca Chiaromonte, and Giovanni Felici. Simultaneous feature selection and outlier detection with optimality guarantees. Biometrics, 78(4):1592–1603, 2022.

Hyemin Jeon, Jeff Linderoth, and Andrew Miller. Quadratic cone cutting surfaces for quadratic programs with on-off constraints. Discrete Optimization, 24:32–50, 2017.

Jonathan Kelner, Frederic Koehler, Raghu Meka, and Dhruv Rohatgi. Feature adaptation for sparse linear regression. Advances in Neural Information Processing Systems, 36, 2024.

Burak Kocuk, Santanu S. Dey, and X. Andy Sun. Strong SOCP relaxations for the optimal power flow problem. Operations Research, 64:1177–1196, 2016.

Burak Kocuk, Santanu S. Dey, and X. Andy Sun. Matrix minor reformulation and SOCP-based spatial branch-and-cut method for the AC optimal power flow problem. Mathematical Programming Computation, 10:557–596, 2018.

Simge Küçükyavuz, Ali Shojaie, Hasan Manzour, Linchuan Wei, and Hao-Hsiang Wu. Consistent second-order conic integer programming for learning Bayesian networks. Journal of Machine Learning Research, 24(322):1–38, 2023.

Ismael Lemhadri, Feng Ruan, Louis Abraham, and Robert Tibshirani. LassoNet: A neural network with feature sparsity. Journal of Machine Learning Research, 22(127):1–29, 2021.

Xiaodong Lin, Minh Pham, and Andrzej Ruszczyński. Alternating linearization for structured regularization problems. The Journal of Machine Learning Research, 15:3447–3481, 2014.

Peijing Liu, Salar Fattahi, Andrés Gómez, and Simge Küçükyavuz. A graph-based decomposition method for convex quadratic optimization with indicators. Mathematical Programming, 200(2):669–701, 2023.

Michele Lombardi, Michela Milano, and Andrea Bartolini. Empirical decision model learning. Artificial Intelligence, 244:343–367, 2017.

Donato Maragno, Holly Wiberg, Dimitris Bertsimas, S. Ilker Birbil, Dick den Hertog, and Adejuyigbe Fajemisin. Mixed-integer optimization with constraint learning. arXiv preprint arXiv:2111.04469, 2021.

Rahul Mazumder, Jerome H. Friedman, and Trevor Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106:1125–1138, 2011.

Rahul Mazumder, Peter Radchenko, and Antoine Dedieu. Subset selection with shrinkage: Sparse linear modeling when the SNR is low. Operations Research, 2022.

Alan Miller. Subset Selection in Regression. CRC Press, 2002.

Ryuhei Miyashiro and Yuichi Takano. Mixed integer second-order cone programming formulations for variable selection in linear regression. European Journal of Operational Research, 247:721–731, 2015.

Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.

Daniel Nevo and Ya'acov Ritov. Identifying a minimal class of models for high-dimensional data. The Journal of Machine Learning Research, 18:797–825, 2017.

Oscar Hernan Madrid Padilla, James Sharpnack, James G. Scott, and Ryan J. Tibshirani. The DFS fused lasso: Linear-time denoising over general graphs.
The Journal of Machine Learning Research, 18(176), 2017.

Mert Pilanci, Martin J. Wainwright, and Laurent El Ghaoui. Sparse learning via Boolean relaxations. Mathematical Programming, 151:63–87, 2015.

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.

Soroosh Shafiee and Fatma Kılınç-Karzan. Constrained optimization of rank-one functions with indicator variables. Mathematical Programming, pages 1–47, 2024.

Naum Z. Shor. Quadratic optimization problems. Soviet Journal of Computer and Systems Sciences, 25:1–11, 1987.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:91–108, 2005.

Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371, 2011.

Andreas M. Tillmann, Daniel Bienstock, Andrea Lodi, and Alexandra Schwartz. Cardinality minimization, constraints, and regularization: A survey. SIAM Review, 66(3):403–477, 2024.

Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102:349–391, 2016.

Berk Ustun and Cynthia Rudin. Learning optimized risk scores. Journal of Machine Learning Research, 20(150):1–75, 2019.

Catherine Wanjiru, William Ogallo, Girmaw Abebe Tadesse, Charles Wachira, Isaiah Onando Mulang, and Aisha Walcott-Bryant. Automated supervised feature selection for differentiated patterns of care. arXiv preprint arXiv:2111.03495, 2021.

Linchuan Wei and Simge Küçükyavuz. An outer approximation method for solving mixed-integer convex quadratic programs with indicators. arXiv preprint arXiv:2312.04812, 2023.

Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables. Mathematical Programming, 192(1):57–88, 2022.

Linchuan Wei, Alper Atamtürk, Andrés Gómez, and Simge Küçükyavuz. On the convex hull of convex quadratic optimization problems with indicators. Mathematical Programming, 204(1):703–737, 2024.

Rebecca Wexler. When a computer program keeps you in jail. New York Times, 2017.

Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression. SIAM Journal on Optimization, 30(4):3359–3386, 2020.

Jiaming Zeng, Berk Ustun, and Cynthia Rudin. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society Series A: Statistics in Society, 180(3):689–722, 2017.

Cun-Hui Zhang, Jian Huang, et al. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36:1567–1594, 2008.

Cun-Hui Zhang, Tong Zhang, et al. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27:576–593, 2012.

Cun-Hui Zhang et al. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942, 2010.

Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan.
Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.

Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.

Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. abess: A fast best-subset selection library in Python and R. Journal of Machine Learning Research, 23(202):1–7, 2022.

Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, and Xueqin Wang. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117–33123, 2020.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320, 2005.

Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36:1509–1533, 2008.