OKRidge: Scalable Optimal k-Sparse Ridge Regression

Jiachang Liu, Sam Rosen, Chudi Zhong, Cynthia Rudin
Duke University
{jiachang.liu, sam.rosen, chudi.zhong}@duke.edu, cynthia@cs.duke.edu

We consider an important problem in scientific discovery, namely identifying sparse governing equations for nonlinear dynamical systems. This involves solving sparse ridge regression problems to provable optimality in order to determine which terms drive the underlying dynamics. We propose a fast algorithm, OKRidge, for sparse ridge regression, using a novel lower bound calculation involving, first, a saddle point formulation, and from there, either (i) solving a linear system or (ii) using an ADMM-based approach, where the proximal operators can be efficiently evaluated by solving another linear system and an isotonic regression problem. We also propose a method to warm-start our solver, which leverages a beam search. Experimentally, our methods attain provable optimality with run times that are orders of magnitude faster than those of the existing MIP formulations solved by the commercial solver Gurobi.

1 Introduction

We are interested in identifying sparse and interpretable governing differential equations arising from nonlinear dynamical systems. These are scientific machine learning problems whose solution involves sparse linear regression. Specifically, these problems require the exact solution of sparse regression problems, with the most basic being sparse ridge regression:

$\min_\beta \ \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k,$  (1)

where k specifies the number of nonzero coefficients for the model. This formulation is general, but in the case of nonlinear dynamical systems, the outcome y is a derivative (usually in time or space) of each dimension x.
Here, we assume that the practitioner has included the true variables, along with many other possibilities, and is looking to determine which terms (which transformations of the variables) are real and which are not. This problem is NP-hard [49], and is more challenging in the presence of highly correlated features. Selection of correct features is vital in this context, as many solutions may give good results on training data, but will quickly deviate from the true dynamics when extrapolating past the observed data due to the chaotic nature of complex dynamical systems. Both heuristic and optimal algorithms have been proposed to solve these problems. Heuristic methods include greedy sequential adding of features [25, 16, 54, 22] or ensemble [31] methods. These methods are fast, but often get stuck in local minima, and there is no way to assess solution quality due to the lack of a lower bound on performance. Optimal methods provide an alternative, but are slow since they must prove optimality. MIOSR [9], a mixed-integer programming (MIP) approach, has been able to certify optimality of solutions given enough time. Slow solvers cause difficulty in performing cross-validation on λ2 (the ℓ2 regularization coefficient) and k (the sparsity level). We aim to solve sparse ridge regression to certifiable optimality, but in a fraction of the run time. We present a fast branch-and-bound (BnB) formulation, OKRidge. A crucial challenge is obtaining a tight and feasible lower bound for each node in the BnB tree. It is possible to calculate the lower bound via the SOS1 [9], big-M [10], or the perspective formulations (also known as the rotated second-order cone constraints) [32, 4, 59]; the mixed-integer problems can then be solved by a MIP solver. However, these formulations do not consider the special mathematical structure of the regression problem.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
To calculate a lower bound more efficiently, we first propose a new saddle point formulation for the relaxed sparse ridge regression problem. Based on the new saddle point formulation, we propose two novel methods to calculate the lower bound. The first method is extremely efficient and relies on solving only a linear system of equations. The second method is based on ADMM and can tighten the lower bound given by the first method. Together, these methods give us a tight lower bound, used to prune nodes and provide a small optimality gap. Additionally, we propose a method based on beam search [58] to get a near-optimal solution quickly, which can be a starting point for both our algorithm and other MIP formulations. Unlike previous methods, our method uses a dynamic programming approach so that previous solutions in the BnB tree can be used while exploring the current node, giving a massive speedup. In summary, our contributions are: (1) We develop a highly efficient customized branch-and-bound framework for achieving optimality in k-sparse ridge regression, using a novel lower bound calculation and heuristic search. (2) To compute the lower bound, we introduce a new saddle point formulation, from which we derive two efficient methods (one based on solving a linear system and the other on ADMM). (3) Our warm-start method is based on beam search and implemented in a dynamic programming fashion, avoiding redundant calculations. We prove that our warm-start method is an approximation algorithm whose guarantee has an exponent factor tighter than that of previous work. On benchmarks, OKRidge certifies optimality orders of magnitude faster than the commercial solver Gurobi. For dynamical systems, our method outperforms the state-of-the-art certifiable method by finding superior solutions, particularly in high-dimensional feature spaces.
2 Preliminary: Dual Formulation via the Perspective Function

There is an extensive literature on this topic, and a longer review of related work is in Appendix A. If we ignore the constant term $y^\top y$, we can rewrite the loss objective in Equation (1) as:

$L_{\text{ridge}}(\beta) := \beta^\top X^\top X \beta - 2 y^\top X \beta + \lambda_2 \sum_{j=1}^{p} \beta_j^2,$  (2)

with p as the number of features. We are interested in the following optimization problem:

$\min_\beta \ L_{\text{ridge}}(\beta) \quad \text{s.t.} \quad (1 - z_j)\beta_j = 0, \ \sum_{j=1}^{p} z_j \le k, \ z_j \in \{0, 1\},$  (3)

where k is the number of nonzero coefficients. With the sparsity constraint, the problem is NP-hard. The constraint $(1 - z_j)\beta_j = 0$ in Problem (3) can be reformulated with the SOS1, big-M, or the perspective formulation (with quadratic cone constraints), which can then be solved by a MIP solver. Since commercial solvers do not exploit the special structure of the problem, we develop a customized branch-and-bound framework.

For any function $f(a)$, the perspective function is $g(a, b) := b f(\frac{a}{b})$ on the domain $b > 0$ [32, 34, 26] and $g(a, b) = 0$ otherwise. Applying this to $f(a) = a^2$, we obtain $g(a, b) = \frac{a^2}{b}$. As shown by [4], replacing the loss term $\beta_j^2$ and the constraint $(1 - z_j)\beta_j = 0$ with the perspective formula $\beta_j^2 / z_j$ in Problem (3) does not change the optimal solution. By the Fenchel conjugate [4], $g(\cdot, \cdot)$ can be rewritten as $g(a, b) = \max_c \ ac - \frac{c^2}{4} b$. If we define a new perspective loss as:

$L^{\text{Fenchel}}_{\text{ridge}}(\beta, z, c) := \beta^\top X^\top X \beta - 2 y^\top X \beta + \lambda_2 \sum_{j=1}^{p} \left( \beta_j c_j - \frac{c_j^2}{4} z_j \right),$  (4)

then we can reformulate Problem (3) as:

$\min_{\beta, z} \max_c \ L^{\text{Fenchel}}_{\text{ridge}}(\beta, z, c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k, \ z_j \in \{0, 1\}.$  (5)

If we relax the binary constraint $\{0, 1\}$ to the interval $[0, 1]$ and swap max and min (no duality gap, as pointed out by [4]), we obtain the dual formulation for the convex relaxation of Problem (5):

$\max_c \min_{\beta, z} \ L^{\text{Fenchel}}_{\text{ridge}}(\beta, z, c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k, \ z_j \in [0, 1].$  (6)

While [4] uses the perspective formulation for safe feature screening, we use it to calculate a lower bound for Problem (3).
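As a quick sanity check on this conjugacy (a small numpy sketch of ours, not from the paper): for fixed a and b > 0, the expression $ac - \frac{c^2}{4}b$ is maximized at $c^* = 2a/b$, recovering $a^2/b$:

```python
import numpy as np

# Perspective of f(a) = a^2 is g(a, b) = a^2 / b for b > 0; its Fenchel
# representation is g(a, b) = max_c ( a*c - (c^2 / 4) * b ).
a, b = 1.7, 0.4
c_grid = np.linspace(-50, 50, 2_000_001)
numeric_max = np.max(a * c_grid - (c_grid**2 / 4) * b)
closed_form = a**2 / b                      # attained at c* = 2a/b
assert abs(numeric_max - closed_form) < 1e-6
```

The concave quadratic in c is what lets the inner maximization later be carried out in closed form.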
However, directly solving the max-min problem is computationally challenging. In Section 3.1, we propose two methods to achieve this in an efficient way.

3 Methodology

We propose a custom BnB framework to solve Problem (3). We have 3 steps to process each node in the BnB tree. First, we calculate a lower bound for the node, using two algorithms proposed in the next subsection. If the lower bound exceeds or equals the current best solution, we have proven that the node does not lead to any optimal solution, so we prune it. Otherwise, we go to Step 2, where we perform beam search to find a near-optimal solution. In Step 3, we use the solution from Step 2 and propose a branching strategy to create new nodes in the BnB tree. We continue until reaching the optimality gap tolerance. Below, we elaborate on each step. In Appendix E, we provide visual illustrations of BnB and beam search as well as the complete pseudocode of our algorithms.

3.1 Lower Bound Calculation

Tight Saddle Point Formulation. We first rewrite Equation (2) with a new hyperparameter λ:

$L_{\text{ridge-}\lambda}(\beta, z) := \beta^\top Q_\lambda \beta - 2 y^\top X \beta + (\lambda_2 + \lambda) \sum_{j=1}^{p} \beta_j^2,$  (7)

where $Q_\lambda := X^\top X - \lambda I$. We restrict $\lambda \in [0, \lambda_{\min}(X^\top X)]$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a matrix. We see that $Q_\lambda$ is positive semidefinite, so the first term remains convex. This trick is related to the optimal perspective formulation [62, 28, 37], but we set the diagonal matrix diag(d) in [28] to be λI. We call this trick the eigen-perspective formulation. The optimal perspective formulation requires solving semidefinite programming (SDP) problems, which have been shown not to scale to high dimensions [28], and MI-SDP is not supported by Gurobi. Solving Problem (3) is equivalent to solving the following problem:

$\min_{\beta, z} \ L_{\text{ridge-}\lambda}(\beta, z) \quad \text{s.t.} \quad (1 - z_j)\beta_j = 0, \ \sum_{j=1}^{p} z_j \le k, \ z_j \in \{0, 1\}.$  (8)

We get a continuous relaxation of Problem (3) if we relax {0, 1} to [0, 1].
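The eigen-perspective shift can be checked numerically (a minimal numpy sketch; the variable names are ours): with $\lambda = \lambda_{\min}(X^\top X)$, the matrix $Q_\lambda$ stays positive semidefinite while as much quadratic weight as possible moves onto the separable $\sum_j \beta_j^2$ term, and the shifted loss equals the original ridge loss for any β:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)[0]        # lambda_min(X^T X)
Q = XtX - lam * np.eye(8)               # eigen-perspective matrix Q_lambda

# Q_lambda is PSD (its smallest eigenvalue is ~0), so the first term is convex.
assert np.linalg.eigvalsh(Q)[0] > -1e-8

# The rewrite in Eq. (7) is exact: shifting lambda from Q to the separable term
# leaves the value of the loss unchanged.
beta = rng.standard_normal(8); y = rng.standard_normal(30); lam2 = 0.001
orig  = beta @ XtX @ beta - 2 * y @ (X @ beta) + lam2 * (beta @ beta)
shift = beta @ Q @ beta - 2 * y @ (X @ beta) + (lam2 + lam) * (beta @ beta)
assert abs(orig - shift) < 1e-8
```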
We can now define a new loss analogous to the loss defined in Equation (4):

$L^{\text{Fenchel}}_{\text{ridge-}\lambda}(\beta, z, c) := \beta^\top Q_\lambda \beta - 2 y^\top X \beta + (\lambda_2 + \lambda) \sum_{j=1}^{p} \left( \beta_j c_j - \frac{c_j^2}{4} z_j \right).$  (9)

Then, the dual formulation analogous to Problem (6) is:

$\max_c \min_{\beta, z} \ L^{\text{Fenchel}}_{\text{ridge-}\lambda}(\beta, z, c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k, \ z_j \in [0, 1].$  (10)

Solving Problem (10) provides us with a lower bound to Problem (8). More importantly, this lower bound becomes tighter as λ increases. This novel formulation is the starting point for our work. We next propose a reparametrization trick to simplify the optimization problem above. For the inner optimization problem in Problem (10), given any c, the optimality condition for β (take the gradient with respect to β and set it to 0) is:

$c = \frac{2}{\lambda_2 + \lambda} (X^\top y - Q_\lambda \beta).$  (11)

Inspired by this optimality condition, we have the following theorem:

Theorem 3.1. If we reparameterize $c = \frac{2}{\lambda_2 + \lambda}(X^\top y - Q_\lambda \gamma)$ with a new parameter γ, then Problem (10) is equivalent to the following saddle point optimization problem:

$\max_\gamma \min_z \ L^{\text{saddle}}_{\text{ridge-}\lambda}(\gamma, z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k, \ z_j \in [0, 1],$  (12)

where

$L^{\text{saddle}}_{\text{ridge-}\lambda}(\gamma, z) := -\gamma^\top Q_\lambda \gamma - \frac{1}{\lambda_2 + \lambda} (X^\top y - Q_\lambda \gamma)^\top \mathrm{diag}(z) (X^\top y - Q_\lambda \gamma),$  (13)

and diag(z) is a diagonal matrix with z on the diagonal.

To our knowledge, this is the first time this formulation has been given. Solving the saddle point formulation in Problem (12) to optimality gives us a tight lower bound. However, this is computationally hard. Our insight is that we can solve Problem (12) approximately while still obtaining a feasible lower bound. Let us define a new function h(γ) as shorthand for the inner minimization in Problem (12):

$h(\gamma) = \min_z \ L^{\text{saddle}}_{\text{ridge-}\lambda}(\gamma, z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k, \ z_j \in [0, 1].$  (14)

For any arbitrary $\gamma \in \mathbb{R}^p$, h(γ) is a valid lower bound for Problem (3). We should choose γ such that this lower bound h(γ) is tight. Below, we provide two efficient methods to calculate such a γ.

Fast Lower Bound Calculation. First, we provide a fast way to choose γ. The choice of γ is motivated by the following theorem:

Theorem 3.2.
The function h(γ) defined in Equation (14) is lower bounded by

$h(\gamma) \ge -\gamma^\top Q_\lambda \gamma - \frac{1}{\lambda_2 + \lambda} \|X^\top y - Q_\lambda \gamma\|_2^2.$  (15)

Furthermore, the right-hand side of Equation (15) is maximized if $\gamma = \hat{\gamma} = \mathrm{argmin}_\alpha \ L_{\text{ridge}}(\alpha)$, in which case h(γ) evaluated at $\hat{\gamma}$ becomes

$h(\hat{\gamma}) = L_{\text{ridge}}(\hat{\gamma}) + (\lambda_2 + \lambda) \, \mathrm{SumBottom}_{p-k}(\{\hat{\gamma}_j^2\}),$  (16)

where $\mathrm{SumBottom}_{p-k}(\cdot)$ denotes the summation of the smallest p − k terms of a given set.

Here we provide an intuitive explanation of why $h(\hat{\gamma})$ is a valid lower bound. Note that the ridge regression loss is strongly convex. Assuming that the strong convexity parameter is µ (see Appendix B), by the strong convexity property, we have that for any $\gamma \in \mathbb{R}^p$,

$L_{\text{ridge}}(\gamma) \ge L_{\text{ridge}}(\hat{\gamma}) + \nabla L_{\text{ridge}}(\hat{\gamma})^\top (\gamma - \hat{\gamma}) + \frac{\mu}{2} \|\gamma - \hat{\gamma}\|_2^2.$  (17)

Because $\hat{\gamma}$ minimizes $L_{\text{ridge}}(\cdot)$, we have $\nabla L_{\text{ridge}}(\hat{\gamma}) = 0$. For a k-sparse vector γ with $\|\gamma\|_0 \le k$, the minimum of the right-hand side of Inequality (17) is achieved if $\gamma_j = \hat{\gamma}_j$ for the top k terms of the $\hat{\gamma}_j^2$'s. This ensures the bound applies for all k-sparse γ. Thus, the k-sparse ridge regression loss is lower bounded by $L_{\text{ridge}}(\gamma) \ge L_{\text{ridge}}(\hat{\gamma}) + \frac{\mu}{2} \mathrm{SumBottom}_{p-k}(\{\hat{\gamma}_j^2\})$ for $\gamma \in \mathbb{R}^p$ with $\|\gamma\|_0 \le k$. For ridge regression, the strong convexity parameter µ can be chosen from $[0, 2(\lambda_2 + \lambda_{\min}(X^\top X))]$. If we let $\mu = 2(\lambda_2 + \lambda)$, we obtain $h(\hat{\gamma})$ in Theorem 3.2.

The lower bound $h(\hat{\gamma})$ can be calculated extremely efficiently by solving the ridge regression problem (solving the linear system $(X^\top X + \lambda_2 I)\gamma = X^\top y$ for γ) and adding the extra p − k terms. However, this bound is not the tightest we can achieve. In the next subsection, we discuss how to apply ADMM to maximize h(γ) further based on Equation (14).

Tight Lower Bound via ADMM. Let us define $p := X^\top y - Q_\lambda \gamma$. Starting from Problem (12), if we minimize over z in the inner optimization under the constraints $\sum_{j=1}^{p} z_j \le k$ and $z_j \in [0, 1]$ for all j, we have $z_j = 1$ for the top k terms of the $p_j^2$'s and $z_j = 0$ otherwise. Then, Problem (12) can be reformulated as follows:

$\min_\gamma \ F(\gamma) + G(p) \quad \text{s.t.}$
$Q_\lambda \gamma + p = X^\top y,$  (18)

where $F(\gamma) := \gamma^\top Q_\lambda \gamma$ and $G(p) := \frac{1}{\lambda_2 + \lambda} \mathrm{SumTop}_k(\{p_j^2\})$. The solution to this problem is a dense vector that can be used to provide a lower bound on the original k-sparse problem. This problem can be solved by the alternating direction method of multipliers (ADMM) [17]. Here, we apply the iterative algorithm with the scaled dual variable q [33]:

$\gamma^{t+1} = \mathrm{argmin}_\gamma \ F(\gamma) + \frac{\rho}{2} \|Q_\lambda \gamma + p^t - X^\top y + q^t\|_2^2$  (19)
$\theta^{t+1} = 2\alpha Q_\lambda \gamma^{t+1} - (1 - 2\alpha)(p^t - X^\top y)$  (20)
$p^{t+1} = \mathrm{argmin}_p \ G(p) + \frac{\rho}{2} \|\theta^{t+1} + p - X^\top y + q^t\|_2^2$  (21)
$q^{t+1} = q^t + \theta^{t+1} + p^{t+1} - X^\top y,$  (22)

where α is the relaxation factor and ρ is the step size. It is known that ADMM suffers from slow convergence when the step size is not properly chosen. According to [33], to ensure the optimal linear convergence rate bound factor, we can pick $\alpha = 1$ and $\rho = 2 \left( \lambda_{\max}(Q_\lambda) \lambda_{\min>0}(Q_\lambda) \right)^{-1/2}$,¹ where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix, and $\lambda_{\min>0}(\cdot)$ denotes the smallest positive eigenvalue of a matrix. Having settled the choices for the relaxation factor α and the step size ρ, we are left with the task of solving Equation (19) and Equation (21) (also known as evaluating the proximal operators [52]). Interestingly, Equation (19) can be evaluated by solving a linear system, while Equation (21) can be evaluated by recasting the problem as an isotonic regression problem.

Theorem 3.3. Let $F(\gamma) = \gamma^\top Q_\lambda \gamma$ and $G(p) = \frac{1}{\lambda_2 + \lambda} \mathrm{SumTop}_k(\{p_j^2\})$. Then the solution to the problem $\gamma^{t+1} = \mathrm{argmin}_\gamma \ F(\gamma) + \frac{\rho}{2} \|Q_\lambda \gamma + p^t - X^\top y + q^t\|_2^2$ is

$\gamma^{t+1} = \left( Q_\lambda + \frac{2}{\rho} I \right)^{-1} (X^\top y - p^t - q^t).$  (23)

Furthermore, let $a = X^\top y - \theta^{t+1} - q^t$ and let J be the indices of the top k terms of $\{|a_j|\}$. The solution to the problem $p^{t+1} = \mathrm{argmin}_p \ G(p) + \frac{\rho}{2} \|\theta^{t+1} + p - X^\top y + q^t\|_2^2$ is $p^{t+1}_j = \mathrm{sign}(a_j) \, \hat{v}_j$, where

$\hat{v} = \mathrm{argmin}_v \ \sum_{j=1}^{p} w_j (v_j - b_j)^2 \quad \text{s.t.} \quad v_i \ge v_l \ \text{if} \ |a_i| \ge |a_l|,$  (24)

with

$w_j = \begin{cases} 1 & \text{if } j \notin J \\ 1 + \frac{2}{\rho(\lambda_2 + \lambda)} & \text{otherwise} \end{cases}, \qquad b_j = \frac{|a_j|}{w_j}.$

Problem (24) is an isotonic regression problem and can be efficiently solved in linear time [12, 21].
3.2 Beam Search as a Heuristic

After finishing the lower bound calculation in Section 3.1, we next explain how to quickly reduce the upper bound in the BnB tree. We discuss how to add features, keep good solutions, and use dynamic programming to improve efficiency. Lastly, we give a theoretical guarantee on the quality of our solution.

¹ [33] also considers matrix preconditioning when computing the step size, but this is computationally expensive when the number of features is large, so we ignore matrix rescaling by letting E be the identity matrix in Section VI, Subsection A of [33].

Starting from the vector 0, we add one coordinate at a time into our support until we reach a solution with support size k. At each iteration, we pick the coordinate that results in the largest decrease in the ridge regression loss while keeping coefficients in the existing support fixed:

$j^* \in \mathrm{argmin}_j \ \min_\alpha \ L_{\text{ridge}}(\beta + \alpha e_j) \quad \Longleftrightarrow \quad j^* \in \mathrm{argmax}_j \ \frac{(\nabla_j L_{\text{ridge}}(\beta))^2}{\|X_{:j}\|_2^2 + \lambda_2},$  (25)

where $X_{:j}$ denotes the j-th column of X, and the right-hand side uses an analytical solution for the line search over α. This is similar to the sparse-simplex algorithm [6]. However, after adding a feature, we adjust the coefficients restricted to the new support by minimizing the ridge regression loss.

The above idea does not handle highly correlated features well: once a feature is added, it cannot be removed [61]. To alleviate this problem, we use beam search [58, 43], keeping the best B solutions at each stage of support expansion:

$j \in \mathrm{argBottomB}_j \left( \min_\alpha \ L_{\text{ridge}}(\beta + \alpha e_j) \right),$  (26)

where $j \in \mathrm{argBottomB}_j$ means that j belongs to the set of coordinates whose losses are among the B smallest. Afterwards, we finetune the solution on the newly expanded support and choose the best B solutions for the next stage of support expansion. A visual illustration of beam search can be found in Figure 6 in Appendix E, which also contains the detailed algorithm.
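As an illustration of the scoring rule in Eq. (25), here is a minimal numpy sketch of the B = 1 (plain greedy) special case with the refit step; the paper's beam search keeps the best B supports per stage instead. The names below are ours:

```python
import numpy as np

def greedy_support(X, y, lam2, k):
    """Forward selection per Eq. (25) with beam width B = 1 (plain greedy)."""
    n, p = X.shape
    support, beta = [], np.zeros(p)
    for _ in range(k):
        # Gradient of L_ridge(beta) = b'X'Xb - 2y'Xb + lam2*||b||^2
        grad = 2 * (X.T @ (X @ beta - y) + lam2 * beta)
        score = grad**2 / ((X**2).sum(axis=0) + lam2)   # Eq. (25) scoring
        score[support] = -np.inf                        # only new coordinates
        support.append(int(np.argmax(score)))
        # Refit ridge restricted to the expanded support (the "finetune" step).
        S = np.array(support)
        A = X[:, S].T @ X[:, S] + lam2 * np.eye(len(S))
        beta = np.zeros(p)
        beta[S] = np.linalg.solve(A, X[:, S].T @ y)
    return sorted(support), beta
```

For B > 1, one would keep the B best-scoring supports per stage and finetune each; the dynamic programming cache described next avoids refitting supports already seen elsewhere in the tree.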
Although many methods have been proposed for sparse ridge regression, none of them have been designed with the BnB tree structure in mind. Our approach is to take advantage of the search history of past nodes to speed up the search process for the current node. To achieve this, we follow a dynamic programming approach, saving the solutions of already-explored support sets. Therefore, whenever we need to adjust coefficients on the new support during beam search, we can simply retrieve the coefficients from the history if the support has been explored in the past. Essentially, we trade memory space for computational efficiency.

3.2.1 Provable Guarantee

Lastly, using similar methods to [29], we quantify the gap between our heuristic solution $\hat{\beta}$ and the optimal solution $\beta^*$ in Theorem 3.4. Compared with Theorem 5 in [29], we improve the factor in the exponent from $m_{2k}/M_{2k}$ to $m_{2k}/M_1$ (since $M_1 \le M_{2k}$, where $M_1$ and $M_{2k}$ are defined in [29]).

Theorem 3.4. Let us define a k-sparse vector pair domain to be $\Omega_k := \{(x, y) \in \mathbb{R}^p \times \mathbb{R}^p : \|x\|_0 \le k, \|y\|_0 \le k, \|x - y\|_0 \le k\}$. Any $M_1$ satisfying $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{M_1}{2} \|y - x\|_2^2$ for all $(x, y) \in \Omega_1$ is called a restricted smoothness parameter with support size 1, and any $m_{2k}$ satisfying $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{m_{2k}}{2} \|y - x\|_2^2$ for all $(x, y) \in \Omega_{2k}$ is called a restricted strong convexity parameter with support size 2k. If $\hat{\beta}$ is our heuristic solution from the beam-search method, and $\beta^*$ is the optimal solution, then:

$L_{\text{ridge}}(\beta^*) \le L_{\text{ridge}}(\hat{\beta}) \le (1 - e^{-m_{2k}/M_1}) \, L_{\text{ridge}}(\beta^*).$  (27)

3.3 Branching and Queuing

Branching: The most common branching techniques include most-infeasible branching and strong branching [2, 1, 15, 7]. However, these two techniques require fractional values for the binary variables $z_j$, which we do not compute in our framework.
Instead, we propose a new branching strategy based on our heuristic solution $\hat{\beta}$: we branch on the coordinate whose coefficient, if set to 0, would result in the largest increase in the ridge regression loss $L_{\text{ridge}}$ (see Appendix E for details):

$j^* = \mathrm{argmax}_j \ L_{\text{ridge}}(\hat{\beta} - \hat{\beta}_j e_j).$  (28)

The intuition is that the coordinate with the largest increase in $L_{\text{ridge}}$ potentially plays a significant role, so we want to fix such a coordinate as early as possible in the BnB tree.

Queuing: Besides the branching strategy, we need a queue to pick a node to explore among newly created nodes. Here, we use a breadth-first approach, evaluating nodes in the order they are created.

Figure 1: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying the number of features, for three correlation levels ρ = 0.1, 0.5, 0.9 (n = 100000, k = 10). Time is on the log scale. Our method is generally orders of magnitude faster than other approaches. Our method achieves the smallest optimality gap, especially when the feature correlation ρ is high.

Figure 2: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying sample sizes, for three correlation levels ρ = 0.1, 0.5, 0.9 (p = 3000, k = 10). Time is on the log scale. When ρ = 0.1 and ρ = 0.5, OKRidge is generally orders of magnitude faster than other approaches. In the case ρ = 0.9, we achieve the smallest optimality gap, as shown in the bottom row.

4 Experiments

We test the effectiveness of OKRidge on synthetic benchmarks and sparse identification of nonlinear dynamical systems (SINDy) [19]. Our main focus is: assessing how well our proposed lower bound calculation speeds up certification (Section 4.1), and evaluating the solution quality of OKRidge on challenging applications (Section 4.2). Additional extensive experiments are in Appendices G and H. Our algorithms are written in Python.
Any improvements we see over commercial MIP solvers, which are coded in C/C++, are solely due to our specialized algorithms.

Figure 3: Results on discovering sparse differential equations. On various metrics, OKRidge outperforms all other methods, including MIOSR, which uses a commercial (proprietary) MIP solver.

4.1 Assessing How Well Our Proposed Lower Bound Calculation Speeds Up Certification

Here, we demonstrate the speed of OKRidge for certifying optimality compared to existing MIPs solved by Gurobi [35]. We set a 1-hour time limit and a relative optimality gap tolerance of $10^{-4}$. We use a value of 0.001 for λ2. Our 4 baselines include MIPs with the SOS1, big-M (M = 50 to prevent cutting off optimal solutions), perspective [4], and eigen-perspective ($\lambda = \lambda_{\min}(X^\top X)$) [28] formulations. In the main text, we use plots to present the results. In Appendix G, we present the results in tables. Additionally, in Appendix G, we conduct perturbation studies on λ2 (λ2 = 0.1 and λ2 = 10) and M (M = 20 and M = 5). Finally, still in Appendix G, we also compare OKRidge with other MIPs, including the MOSEK solver [3], SubsetSelectionCIO [11], and L0BNB [39].

Similar to the data generation process in [11, 48], we first sample $x_i \in \mathbb{R}^p$ from a Gaussian distribution $N(0, \Sigma)$ with mean 0 and covariance matrix Σ, where $\Sigma_{ij} = \rho^{|i-j|}$. The variable ρ controls the feature correlation. Then, we create the coefficient vector $\beta^*$ with k nonzero entries, where $\beta^*_j = 1$ if $j \bmod (p/k) = 0$. Next, we construct the prediction $y_i = x_i^\top \beta^* + \epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \|X\beta^*\|_2^2 / \mathrm{SNR})$. SNR stands for the signal-to-noise ratio, and we choose SNR to be 5 in all our experiments.

In the first setting, we fix the number of samples at n = 100000 and vary the number of features p ∈ {100, 500, 1000, 3000, 5000} and the correlation levels ρ ∈ {0.1, 0.5, 0.9} (see Appendix G for ρ = 0.3 and ρ = 0.7). We warm-started the MIP solvers with our beam-search solutions. The results can be seen in Figure 1.
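The generation process above can be sketched as follows (a minimal numpy sketch with our own function name; the noise variance follows the formula as stated):

```python
import numpy as np

def make_synthetic(n, p, k, rho, snr=5.0, seed=0):
    """Synthetic benchmark: x_i ~ N(0, Sigma) with Sigma_ij = rho^|i-j|,
    beta*_j = 1 when j mod (p/k) == 0, eps_i ~ N(0, ||X beta*||_2^2 / SNR)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-style correlation
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[::p // k] = 1.0                                 # k evenly spaced nonzeros
    signal = X @ beta
    eps = rng.normal(0.0, np.sqrt(signal @ signal / snr), size=n)
    return X, signal + eps, beta
```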
From both figures, we see that OKRidge outperforms all existing MIPs solved by Gurobi, usually by orders of magnitude. In the second setting, we fix the number of features at p = 3000 and vary the number of samples n ∈ {3000, 4000, 5000, 6000, 7000} and the correlation levels ρ ∈ {0.1, 0.5, 0.9} (see Appendix G for ρ = 0.3 and ρ = 0.7). As in the first setting, we also warm-started the MIP solvers with our beam-search solutions. The results are in Figure 2. When n is close to p or the correlation is high (ρ = 0.9), no method can finish within the 1-hour time limit, but OKRidge prunes the search space well and achieves the smallest optimality gap. When n becomes larger in the cases ρ = 0.1 and ρ = 0.5, OKRidge runs orders of magnitude faster than all baselines.

4.2 Evaluating Solution Quality of OKRidge on Challenging Applications

On the previous synthetic benchmarks, many heuristics (including our beam search method) can find the optimal solution without branch-and-bound. In this subsection, we work on more challenging scenarios (sparse identification of differential equations). We replicate the experiments in [9] using three dynamical systems from the PySINDy library [27, 42]: the Lorenz System, the Hopf Bifurcation, and a magnetohydrodynamical (MHD) model [24].

The Lorenz System is a 3-D system with the nonlinear differential equations:

$dx/dt = -\sigma x + \sigma y, \quad dy/dt = \rho x - y - xz, \quad dz/dt = xy - \beta z,$

where we use the standard parameters σ = 10, β = 8/3, ρ = 28. The true sparsities for each dimension are (2, 3, 2).

The Hopf Bifurcation is a 2-D system with the nonlinear differential equations:

$dx/dt = \mu x + \omega y - A x^3 - A x y^2, \quad dy/dt = -\omega x + \mu y - A x^2 y - A y^3,$

where we use the standard parameters µ = 0.05, ω = 1, A = 1. The true sparsities for each dimension are (4, 4).

Finally, the MHD model is a 6-D system with the nonlinear differential equations:

$dV_1/dt = 4 V_2 V_3 - 4 B_2 B_3, \quad dV_2/dt = -7 V_1 V_3 + 7 B_1 B_2, \quad dV_3/dt = 3 V_1 V_2 - 3 B_1 B_2,$
$dB_1/dt = 2 B_3 V_2 - 2 V_3 B_2, \quad dB_2/dt = 5 V_3 B_1 - 5 B_3 V_1, \quad dB_3/dt = 9 V_1 B_2 - 9 B_1 V_2.$
The true sparsities for each dimension are (2, 2, 2, 2, 2, 2). We use all monomial features (candidate functions) up to 5th-order interactions. This results in 56 candidate functions for the Lorenz System, 21 for the Hopf Bifurcation, and 462 for the MHD model. Due to the high-order interaction terms, the features are highly correlated, resulting in poor performance of heuristic methods.

4.2.1 Baselines and Experimental Setup

In addition to MIOSR (which relies on the SOS1 formulation), we also compare with three common baselines in the SINDy literature: STLSQ [54], SSR [16], and E-STLSQ [31]. The baseline SR3 [25] is not included since the previous literature [9] shows that it performs poorly. We compare OKRidge with the other baselines using the PySINDy library [27, 42]. We follow the experimental setups in [9] for model selection, hyperparameter choices, and evaluation metrics (please see Appendix F for details). In Appendix H, we provide additional experiments on Gurobi with different MIP formulations and comparisons with more heuristic baselines.

4.2.2 Results

Figure 4: Running time comparison between OKRidge and MIOSR on the MHD system with 462 candidate functions. OKRidge is significantly faster than the previous state of the art.

Figure 3 displays the results. OKRidge (red curves) outperforms all baselines, including MIOSR (blue curves), across evaluation metrics. On the Lorenz System, all methods recover the true feature support when the training trajectory is long enough. When the training trajectory is short, i.e., the left part of each subplot (or equivalently, when the number of samples is small), OKRidge performs uniformly better than all other baselines. On the Hopf Bifurcation, all heuristic methods fail to recover the true support, resulting in poor performance. On the MHD model, OKRidge maintains the top performance and outperforms MIOSR on the true positivity rate.
This demonstrates the effectiveness of OKRidge, which incurs lower runtimes and yields better metric scores in high-dimensional settings. The highest runtimes are incurred on the MHD model (with 462 candidate functions/features); these are shown in Figure 4.

Limitations of OKRidge. When the feature dimension is low (on the order of 100 or below), Gurobi can solve the problem to optimality faster than OKRidge. This is observed on the synthetic benchmarks (p = 100) and also on the Hopf Bifurcation (p = 21). Since Gurobi is a commercial proprietary solver, we cannot inspect the details of its sophisticated implementation. Gurobi may resort to an enumeration/brute-force approach, which could be faster than spending time calculating lower bounds in the BnB tree. That said, OKRidge is still competitive with Gurobi in the low-dimensional setting, and OKRidge scales favorably in high-dimensional settings.

5 Conclusion

We presented a method for optimal sparse ridge regression that leverages a novel tight lower bound on the objective. We showed that the method is both faster and more accurate than existing approaches for learning differential equations, a key problem in scientific discovery. This tool (unlike its main competitor) does not require proprietary software with expensive licenses and can have a significant impact on various regression applications.

Code Availability

Implementations of OKRidge discussed in this paper are available at https://github.com/jiachangliu/OKRidge.

Acknowledgements

The authors gratefully acknowledge funding support from grants NSF IIS-2130250, NSF-NRT DGE2022040, NSF OAC-1835782, DOE DE-SC0023194, and NIH/NIDA R01 DA054994. The authors would also like to thank the anonymous reviewers for their insightful comments.

References

[1] T. Achterberg, T. Koch, and A. Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005.
[2] D. Applegate, R. Bixby, V. Chvátal, and W. Cook. On the solution of traveling salesman problems.
Documenta Mathematica, pages 645–656, 1998.
[3] MOSEK ApS. MOSEK Optimizer API for Python, Version 9, 2022.
[4] A. Atamturk and A. Gómez. Safe screening rules for l0-regression from perspective relaxations. In International Conference on Machine Learning, pages 421–430. PMLR, 2020.
[5] A. Atamtürk, A. Gómez, and S. Han. Sparse and smooth signal estimation: Convexification of l0-formulations. Journal of Machine Learning Research, 22(52), 2021.
[6] A. Beck and Y. C. Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480–1509, 2013.
[7] P. Belotti, C. Kirches, S. Leyffer, J. Linderoth, J. Luedtke, and A. Mahajan. Mixed-integer nonlinear optimization. Acta Numerica, 22:1–131, 2013.
[8] D. Bertsekas. Convex Optimization Theory, volume 1. Athena Scientific, 2009.
[9] D. Bertsimas and W. Gurnee. Learning sparse nonlinear dynamics via mixed-integer optimization. Nonlinear Dynamics, Jan 2023.
[10] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
[11] D. Bertsimas, J. Pauphilet, and B. Van Parys. Sparse regression: Scalable algorithms and empirical performance. Statistical Science, 35(4):555–578, 2020.
[12] M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47(1-3):425–439, 1990.
[13] T. Blumensath and M. E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008.
[14] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[15] P. Bonami, J. Lee, S. Leyffer, and A. Wächter. More branch-and-bound experiments in convex nonlinear integer programming. Preprint ANL/MCS-P1949-0911, Argonne National Laboratory, Mathematics and Computer Science Division, 91, 2011.
[16] L. Boninsegna, F. Nüske, and C. Clementi.
Sparse learning of stochastic dynamical equations. The Journal of Chemical Physics, 148(24):241723, June 2018.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[18] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003.
[19] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.
[20] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[21] F. M. Busing. Monotone regression: A simple and fast O(n) PAVA implementation. Journal of Statistical Software, 102:1–25, 2022.
[22] T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011.
[23] P. M. Camerini, L. Fratta, and F. Maffioli. On improving relaxation methods by modified gradient techniques. In Nondifferentiable Optimization, pages 26–34. Springer Berlin Heidelberg, 1975.
[24] V. Carbone and P. Veltri. Relaxation processes in magnetohydrodynamics - a triad-interaction model. Astronomy and Astrophysics, 259(1):359–372, June 1992.
[25] K. Champion, P. Zheng, A. Y. Aravkin, S. L. Brunton, and J. N. Kutz. A unified sparse optimization framework to learn parsimonious physics-informed models from data. IEEE Access, 8:169259–169271, 2020.
[26] P. L. Combettes. Perspective functions: Properties, constructions, and examples. Set-Valued and Variational Analysis, 26(2):247–264, 2018.
[27] B. de Silva, K. Champion, M. Quade, J.-C. Loiseau, J. Kutz, and S. Brunton.
PySINDy: A Python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020.
[28] H. Dong, K. Chen, and J. Linderoth. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083, 2015.
[29] E. R. Elenberg, R. Khanna, A. G. Dimakis, and S. Negahban. Restricted strong convexity implies weak submodularity. The Annals of Statistics, 46(6B):3539–3568, 2018.
[30] A. Eriksson, T. Thanh Pham, T.-J. Chin, and I. Reid. The k-support norm and convex envelopes of cardinality and rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3349–3357, 2015.
[31] U. Fasel, J. N. Kutz, B. W. Brunton, and S. L. Brunton. Ensemble-SINDy: Robust sparse model discovery in the low-data, high-noise limit, with active learning and control. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 478(2260), Apr. 2022.
[32] A. Frangioni and C. Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106(2):225–236, 2006.
[33] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas–Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2016.
[34] O. Günlük and J. Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables. Mathematical Programming, 124(1):183–205, 2010.
[35] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
[36] W. H. Haemers. Interlacing eigenvalues and graphs. Linear Algebra and its Applications, 226:593–616, 1995.
[37] S. Han, A. Gómez, and A. Atamtürk. The equivalence of optimal perspective formulation and Shor's SDP for quadratic programs with indicator variables. Operations Research Letters, 50(2):195–198, 2022.
[38] H. Hazimeh and R. Mazumder.
Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5):1517–1537, 2020.
[39] H. Hazimeh, R. Mazumder, and A. Saab. Sparse regression at scale: Branch-and-bound rooted in first-order optimization. Mathematical Programming, 196(1):347–388, 2022.
[40] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. Advances in Neural Information Processing Systems, 27, 2014.
[41] K. Kaheman, J. N. Kutz, and S. L. Brunton. SINDy-PI: A robust algorithm for parallel implicit sparse identification of nonlinear dynamics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2242), Oct. 2020.
[42] A. A. Kaptanoglu, B. M. de Silva, U. Fasel, K. Kaheman, A. J. Goldschmidt, J. Callaham, C. B. Delahunt, Z. G. Nicolaou, K. Champion, J.-C. Loiseau, J. N. Kutz, and S. L. Brunton. PySINDy: A comprehensive Python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022.
[43] J. Liu, C. Zhong, B. Li, M. Seltzer, and C. Rudin. FasterRisk: Fast and accurate interpretable risk scores. In Advances in Neural Information Processing Systems, 2022.
[44] J. Liu, C. Zhong, M. Seltzer, and C. Rudin. Fast sparse classification for generalized linear and additive models. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2022.
[45] N. M. Mangan, J. N. Kutz, S. L. Brunton, and J. L. Proctor. Model selection for dynamical systems via sparse regression and information criteria. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2204):20170009, Aug. 2017.
[46] D. A. Messenger and D. M. Bortz. Weak SINDy for partial differential equations. Journal of Computational Physics, 443:110525, 2021.
[47] D. A. Messenger and D. M. Bortz. Weak SINDy: Galerkin-based data-driven model selection. Multiscale Modeling & Simulation, 19(3):1474–1497, 2021.
[48] T. Moreau, M.
Massias, A. Gramfort, P. Ablin, P.-A. Bannier, B. Charlier, M. Dagréou, T. Dupré la Tour, G. Durif, C. F. Dantas, Q. Klopfenstein, J. Larsson, E. Lai, T. Lefort, B. Malézieux, B. Moufad, B. T. Nguyen, A. Rakotomamonjy, Z. Ramzi, J. Salmon, and S. Vaiter. Benchopt: Reproducible, efficient and collaborative optimization benchmarks. In Advances in Neural Information Processing Systems, 2022.
[49] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
[50] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[51] D. Needell and R. Vershynin. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 4(2):310–316, 2010.
[52] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[53] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming, 151(1):63–87, 2015.
[54] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
[55] M. E. Sander, J. Puigcerver, J. Djolonga, G. Peyré, and M. Blondel. Fast, differentiable and sparse top-k: A convex analysis perspective. In International Conference on Machine Learning, pages 29919–29936. PMLR, 2023.
[56] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[57] R. Vreugdenhil, V. A. Nguyen, A. Eftekhari, and P. M. Esfahani. Principal component hierarchy for sparse quadratic programs. In International Conference on Machine Learning, pages 10607–10616. PMLR, 2021.
[58] S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas, Nov. 2016. Association for Computational Linguistics.
[59] W. Xie and X. Deng. Scalable algorithms for the sparse ridge regression. SIAM Journal on Optimization, 30(4):3359–3386, 2020.
[60] G. Yuan, L. Shen, and W.-S. Zheng. A block decomposition algorithm for sparse optimization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 275–285, 2020.
[61] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011.
[62] X. Zheng, X. Sun, and D. Li. Improving the performance of MIQP solvers for quadratic programs with cardinality and minimum threshold constraints: A semidefinite program approach. INFORMS Journal on Computing, 26(4):690–703, 2014.
[63] J. Zhu, C. Wen, J. Zhu, H. Zhang, and X. Wang. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117–33123, 2020.

Appendix to OKRidge: Scalable Optimal k-Sparse Ridge Regression

Table of Contents

A Related Work
B Background Concepts
C Derivations and Proofs
  C.1 Proof of Theorem 3.1
  C.2 Proof of Theorem 3.2
  C.3 Proof of Theorem 3.3
  C.4 Proof of Theorem 3.4
  C.5 Derivation of g(a, b) = max_c (ac − (c²/4)b) via Fenchel Conjugate
D Existing MIP Formulations for k-Sparse Ridge Regression
E Algorithmic Charts
  E.1 Overall Branch-and-Bound Framework
  E.2 Fast Lower Bound Calculation
  E.3 Refining the Lower Bound through the ADMM Method
  E.4 (Optional) An Alternative Lower Bound Calculation via Subgradient Ascent and the CMF Method
  E.5 Beam Search Algorithm
  E.6 Branching
  E.7 A Note on the Connection with the Interlacing Eigenvalue Property
F More Experimental Setup Details
  F.1 Computing Platform
  F.2 Baselines, Licenses, and Links
  F.3 Differential Equation Experiments
G More Experimental Results on Synthetic Benchmarks
  G.1 Results Summarization
  G.2 Comparison with SOS1, Big-M, Perspective, and Eigen-Perspective Formulations with Different ℓ2 Regularizations
  G.3 Comparison with the Big-M Formulation with Different M Values
  G.4 Comparison with MOSEK Solver, Subset Selection CIO, and L0BnB
  G.5 Experiments on the Case p ≫ n
  G.6 Comparison between Fast Solve and ADMM
H More Experimental Results on Dynamical Systems
  H.1 Results Summarization
  H.2 Direct Comparison of OKRidge with MIOSR
  H.3 Comparison of OKRidge with SOS1, Big-M, Perspective, and Eigen-Perspective Formulations
  H.4 Comparison of Our Beam Search with More Heuristic Baselines
  H.5 Comparison of OKRidge with MIPs Warmstarted by Our Beam Search
  H.6 Comparison of OKRidge with MOSEK Solver, Subset Selection CIO, and L0BnB

A Related Work

Sparse Identification of Nonlinear Dynamical Systems: The Sparse Identification of Nonlinear Dynamical Systems (SINDy) framework [19] has been widely adopted for discovering dynamical systems from observed data. The basic framework consists of approximating derivatives, picking a library of candidate features built from the positional data, and performing sparse regression on the resulting design matrix. SINDy has been extended to PDEs [54, 46], implicit equations [41], noisy or low-data settings [31, 47], and constrained problems [25, 9]. The basic framework can be summarized as a standard regression problem:
Ẋ = Θ(X)Ξ, (29)
where the goal is to find Ξ = [β₁, . . . , β_d], with d being the number of dimensions of the dynamical system, X being the observed data, and Θ being a map from observed data to candidate functions. The problem is typically solved independently across dimensions, yielding d sparse regression problems. Many methods in the SINDy framework solve these regression problems with a greedy backward-selection approach. These methods have been shown to outperform LASSO [25] and Orthogonal Matching Pursuit [16] in learning dynamical systems. The backward-selection approach typically solves a non-sparse regression problem and then either removes all coefficients below a threshold, or removes one feature at a time until a desired sparsity level is reached [54]. Although these methods have performed well with reasonable run times, they struggle to identify the true dynamics as the number of candidate features increases. Advances in MIP formulations have led to solving SINDy problems with greater success and the ability to verify optimality [9]; however, they tend not to scale as well.
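To make the regression in Equation (29) concrete, the following is a minimal pure-Python sketch for a single dimension (d = 1): we build a tiny polynomial library Θ(X) and fit a k-sparse ridge model with k = 1 by brute force over library columns. All names are our own hypothetical illustration; this is not the OKRidge or PySINDy implementation.

```python
# Sketch of one per-dimension SINDy regression: x_dot ≈ Theta(x) @ beta with
# a sparsity constraint. Here k = 1, so brute force over single columns works.

def build_library(xs):
    """Theta(X): candidate functions [1, x, x^2] evaluated at each sample."""
    return [[1.0, x, x * x] for x in xs]

def ridge_1d(col, ydot, lam2):
    """Closed-form ridge fit on one column: min (ydot - c*col)^2 + lam2*c^2."""
    num = sum(c * y for c, y in zip(col, ydot))
    den = sum(c * c for c in col) + lam2
    coef = num / den
    loss = sum((y - coef * c) ** 2 for c, y in zip(col, ydot)) + lam2 * coef ** 2
    return coef, loss

def sparse_fit_k1(xs, ydot, lam2=1e-6):
    """Pick the single library term that best explains the derivative."""
    theta = build_library(xs)
    best = None
    for j in range(len(theta[0])):
        col = [row[j] for row in theta]
        coef, loss = ridge_1d(col, ydot, lam2)
        if best is None or loss < best[2]:
            best = (j, coef, loss)
    return best  # (library term index, coefficient, loss)

# Synthetic data from the known one-dimensional dynamics x_dot = -2x.
xs = [0.1 * i - 1.0 for i in range(21)]
ydot = [-2.0 * x for x in xs]
j, coef, loss = sparse_fit_k1(xs, ydot)  # recovers the x term with coef ≈ -2
```

For larger k and correlated libraries, brute force becomes intractable, which is exactly where the certified branch-and-bound approach of the main paper comes in.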
The method we present also verifies optimality and is much faster than the MIP formulations, particularly for larger numbers of features p.

Heuristic Methods: Greedy methods aim to solve the problem to near-optimality, with few guarantees. One direction is greedy pursuit (i.e., forward selection), where one coordinate at a time is added until the required support size is reached [56, 13, 50, 51, 22, 29]. Another direction is iterative thresholding, where gradient descent steps alternate with projection steps that enforce the support size requirement [14, 40, 11, 59, 57]. Other approaches include randomized rounding of the solution of the boolean-relaxed problem [53], swapping features [61, 6, 38, 63, 44], or solving a smaller problem optimally on a working set [60]. These heuristic solutions can greatly improve the computational speed of MIP solvers when used as warm starts. Typically, however, the heuristic algorithm starts from scratch each time it runs, which is slow when it is run repeatedly throughout the BnB tree. Our insight is that the search history of the heuristic method from previously solved nodes can be used to speed it up in the current node.

Optimal Methods and Lower Bound Calculation: In order to certify optimality for this NP-hard problem, we need to perform branch-and-bound and calculate a lower bound for each node in the BnB tree. For the lower bound calculation, early works include the SOS1 formulation and the big-M formulation [10, 9]. Both formulations can be implemented in a commercial solver. However, the SOS1 formulation is not scalable to high dimensions, while the big-M formulation is sensitive to the hyperparameter M used to balance scalability and correctness. To circumvent this problem, Subset Selection CIO [11] formulates the least-squares term using Fenchel duality with dual variables. After this reformulation, Subset Selection CIO uses callbacks to add cutting planes that yield a lower bound.
Although OKRidge also uses Fenchel duality, we apply it to the ℓ2 regularization term. We solve the problem in the feature space, while Subset Selection CIO solves it in the sample space. As pointed out in [39], the branch-and-cut method in Subset Selection CIO [11] runs slowly when the ℓ2 regularization is small. Recently, the perspective formulation [32, 34, 62, 28, 4, 59, 5, 37] of the ℓ2 term has also been used. Through convex relaxation, the lower bound can be obtained by solving a quadratically constrained quadratic program (QCQP) with rotated second-order cone constraints. However, the conic formulation is still computationally intensive for large-scale datasets and has difficulty attaining optimal solutions. Our work builds upon the perspective formulation, but we propose an efficient way to calculate the lower bound through Fenchel duality. Another line of work focuses on the optimal perspective formulation [62, 28, 37]. This requires solving a semidefinite programming (SDP) problem at each node, which has been shown not to scale to high dimensions [28]. Moreover, MI-SDP is not supported by Gurobi. Our work is related to the optimal perspective formulation, but we set the diagonal matrix diag(d) = λ_min(X^T X)·I. We call this the eigen-perspective formulation. Although this is not the optimal choice, it is a good approximation and is supported by Gurobi through the QCQP formulation discussed above. Lastly, there is recent work called L0BnB [39], which also implements a customized branch-and-bound framework without commercial solvers. However, L0BnB solves the ℓ0-regularized problem, which is not the same as the ℓ0-constrained problem. To obtain a solution with a specified sparsity, L0BnB needs to solve the problem multiple times with different ℓ0 regularizations.

B Background Concepts

Definition B.1 (Strong Convexity [20]).
A function f : ℝ^p → ℝ is strongly convex with parameter α > 0 if it satisfies the following first-order inequality:
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (α/2)‖y − x‖²₂ (30)
for all x, y ∈ ℝ^p. If the gradient ∇f does not exist but a subgradient exists, then we have
f(y) ≥ f(x) + g^T (y − x) + (α/2)‖y − x‖²₂ (31)
for all x, y ∈ ℝ^p, where g ∈ ∂f and ∂f is the subdifferential of f.

Definition B.2 (Smoothness [20]). A continuously differentiable function f : ℝ^p → ℝ is smooth with parameter β > 0 if the gradient ∇f is β-Lipschitz. Mathematically, this means that
‖∇f(x) − ∇f(y)‖₂ ≤ β‖x − y‖₂ (32)
for all x, y ∈ ℝ^p.

Lemma B.3 (Smoothness Property [20]). Let f : ℝ^p → ℝ be a β-smooth function. Then we have the following sandwich property (f(y) is sandwiched between two quadratic functions):
f(x) + ∇f(x)^T (y − x) + (β/2)‖y − x‖²₂ ≥ f(y) ≥ f(x) + ∇f(x)^T (y − x) − (β/2)‖y − x‖²₂ (33)
for all x, y ∈ ℝ^p.

Definition B.4 (k-sparse Pair Domain [29]). A domain Ω_k ⊆ ℝ^p × ℝ^p is a k-sparse vector pair domain if it contains all pairs of k-sparse vectors that differ in at most k entries, i.e.,
Ω_k := {(x, y) ∈ ℝ^p × ℝ^p : ‖x‖₀ ≤ k, ‖y‖₀ ≤ k, ‖x − y‖₀ ≤ k}. (34)

Definition B.5 (Restricted Strong Convexity [29]). A function f : ℝ^p → ℝ is restricted strongly convex with parameter α > 0 on the k-sparse vector pair domain Ω_k if it satisfies
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (α/2)‖y − x‖²₂ (35)
for all (x, y) ∈ Ω_k.

Definition B.6 (Restricted Smoothness [29]). A function f : ℝ^p → ℝ is restricted smooth with parameter β > 0 on the k-sparse vector pair domain Ω_k if it satisfies
f(x) + ∇f(x)^T (y − x) + (β/2)‖y − x‖²₂ ≥ f(y) (36)
for all (x, y) ∈ Ω_k.

Lemma B.7 (Gradient of Ridge Regression). At β ∈ ℝ^p, the gradient of our ridge regression loss is given by:
∇L_ridge(β) = 2X^T (Xβ − y) + 2λ₂β. (37)
Indeed,
∇L_ridge(β) = ∇(β^T X^T Xβ − 2y^T Xβ + λ₂β^T β)
= ∇(β^T X^T Xβ) − ∇(2y^T Xβ) + λ₂∇(β^T β)
= 2X^T Xβ − 2X^T y + 2λ₂β
= 2X^T (Xβ − y) + 2λ₂β.

C Derivations and Proofs

C.1 Proof of Theorem 3.1

Theorem 3.1.
If we reparameterize c = (2/(λ₂ + λ))(X^T y − Q_λγ) with a new parameter γ, then Problem (10) is equivalent to the following saddle point optimization problem:
max_γ min_z L^saddle_ridge,λ(γ, z) s.t. Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1], (38)
where
L^saddle_ridge,λ(γ, z) := −γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ), (39)
and diag(z) is a diagonal matrix with z on the diagonal.

Before we give the proof of Theorem 3.1, we first state one useful lemma:

Lemma C.1. For an optimal solution c* to Problem (40), there exists some β* such that c* = (2/(λ₂ + λ))(X^T y − Q_λβ*).

Without this lemma, there would be a concern with reparameterizing c through a new parameter γ when λ = λ_min(X^T X). In this case, Q_λ = X^T X − λI becomes singular, and our reparameterization trick could miss some subspace, which would raise the concern that Problem (39) is only a relaxation of Problem (40), not an equivalent problem. Fortunately, due to this lemma, the subspace missed by the reparameterization trick does not prevent Problem (39) from achieving the same optimal value as Problem (40). We prove the lemma by contradiction.

Proof. Suppose the optimal solution c* to Problem (40) cannot be written in the form of Equation (41). By the rank-nullity theorem, we can find d₁ ∈ col(Q_λ) and d₂ ∈ ker(Q_λ) with d₂ ≠ 0 so that c* = (2/(λ₂ + λ))(X^T y − d₁ + d₂), where col(·) denotes the column space of a matrix and ker(·) denotes its kernel. If we let β = αd₂ for some real value α, then the inner objective of Problem (40) (ignoring the last term involving the z_j's for now, since β and z are separable in the objective) becomes
L^Fenchel_ridge,λ(β, z, c*) = β^T Q_λβ + ((λ₂ + λ)c* − 2X^T y)^T β
= β^T Q_λβ + ((λ₂ + λ)·(2/(λ₂ + λ))(X^T y − d₁ + d₂) − 2X^T y)^T β
= β^T Q_λβ + (2(X^T y − d₁ + d₂) − 2X^T y)^T β
= β^T Q_λβ − 2(d₁ − d₂)^T β
= β^T Q_λβ − 2d₁^T β + 2d₂^T β.
Because β = αd₂, we have β ∈ ker(Q_λ) and β ⊥ d₁. Thus the first two terms vanish, and we are left with L^Fenchel_ridge,λ = 2d₂^T β = 2α‖d₂‖²₂.
For the inner optimization of Problem (40), because we minimize over β, we can let α → −∞, which drives L^Fenchel_ridge,λ → −∞. This is obviously not the optimal value of Problem (40) (a simple counterexample is to let c = (2/(λ₂ + λ))X^T y, for which we can easily show that min_{β,z} L^Fenchel_ridge,λ(β, z, c) is finite). Therefore, d₂ must be 0, and the optimal solution c* to Problem (40) must be representable by some β through the equation c* = (2/(λ₂ + λ))(X^T y − Q_λβ).

Now that we have this lemma, we can prove Theorem 3.1 rigorously. We provide two alternative proofs so that readers can understand the logic from different angles. The first proof uses the lemma directly; the second is based on a sandwich argument.

Proof. For notational convenience, let D_z := {z | Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]} denote the domain of z. Recall that Problem (10) is the following optimization problem:
max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c) s.t. Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1], (40)
where L^Fenchel_ridge,λ(β, z, c) = β^T Q_λβ − 2y^T Xβ + (λ₂ + λ) Σ_{j=1}^p (β_j c_j − (c_j²/4) z_j).
For any fixed c, if we take the derivative of L^Fenchel_ridge,λ(β, z, c) with respect to β and set it to 0, we obtain the optimality condition for β:
c = (2/(λ₂ + λ))[X^T y − (X^T X − λI)β]. (41)
If we reparameterize c with a new parameter γ and let c = (2/(λ₂ + λ))[X^T y − (X^T X − λI)γ], then β = γ satisfies the optimality condition (41). Therefore, whenever we reparameterize c in this way, we can simply set β = γ, which guarantees that we achieve the inner minimum of Problem (40) with respect to β.
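The optimality condition (41) is easy to sanity-check numerically: under the reparameterization c = (2/(λ₂+λ))(X^T y − Q_λγ), the gradient of L^Fenchel with respect to β vanishes at β = γ. Below is a hedged pure-Python sketch on a tiny p = 2 example; all variable names are ours, not the paper's code.

```python
# Check: grad_beta L^Fenchel at beta = gamma is 2*Q*gamma - 2*X^T y + (lam2+lam)*c,
# which should be exactly zero when c = (2/(lam2+lam)) * (X^T y - Q*gamma).
import math

X = [[1.0, 0.5], [0.2, 1.3], [-0.7, 0.9]]   # n = 3 samples, p = 2 features
y = [1.0, -0.5, 0.3]
lam2 = 0.1                                   # ell_2 regularization coefficient

# X^T X and X^T y for p = 2, written out by hand.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
Xty = [sum(X[i][0] * y[i] for i in range(3)),
       sum(X[i][1] * y[i] for i in range(3))]

# lam = lambda_min(X^T X) for the symmetric 2x2 matrix [[a, b], [b, d]].
lam = (a + d - math.sqrt((a - d) ** 2 + 4 * b * b)) / 2
Q = [[a - lam, b], [b, d - lam]]             # Q_lam = X^T X - lam * I

gamma = [0.7, -0.4]                          # arbitrary test point
Qg = [Q[0][0] * gamma[0] + Q[0][1] * gamma[1],
      Q[1][0] * gamma[0] + Q[1][1] * gamma[1]]
c = [2.0 / (lam2 + lam) * (Xty[j] - Qg[j]) for j in range(2)]

# Gradient of L^Fenchel w.r.t. beta, evaluated at beta = gamma (z drops out):
grad = [2 * Qg[j] - 2 * Xty[j] + (lam2 + lam) * c[j] for j in range(2)]
```

The gradient is zero by construction, which is exactly why setting β = γ attains the inner minimum in the derivation that follows.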
Then, we can reduce Problem (40) to Problem (38) in the following way:
max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c)   # subject to Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]
= max_c min_{z∈D_z, β} L^Fenchel_ridge,λ(β, z, c)   # use D_z as the shorthand for the domain of z
= max_c min_{z∈D_z, β} β^T Q_λβ − 2y^T Xβ + (λ₂ + λ) Σ_{j=1}^p (β_j c_j − (c_j²/4) z_j)   # plug in the details of L^Fenchel_ridge,λ(β, z, c)
= max_c min_{z∈D_z} γ^T Q_λγ − 2y^T Xγ + (λ₂ + λ) Σ_{j=1}^p (γ_j c_j − (c_j²/4) z_j)   # letting β = γ achieves the inner minimum over β
= max_c min_{z∈D_z} γ^T Q_λγ − 2y^T Xγ + (λ₂ + λ)(γ^T c − (1/4) c^T diag(z) c)   # write the summation in vector form; diag(z) denotes a diagonal matrix with z on the diagonal
= max_γ min_{z∈D_z} γ^T Q_λγ − 2y^T Xγ + (λ₂ + λ)·γ^T·(2/(λ₂ + λ))(X^T y − Q_λγ) − ((λ₂ + λ)/4)·(2/(λ₂ + λ))²·(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ)   # apply the reparameterization c = (2/(λ₂ + λ))(X^T y − Q_λγ); by Lemma C.1 we do not miss the optimal solution c*
= max_γ min_{z∈D_z} γ^T Q_λγ − 2y^T Xγ + 2γ^T (X^T y − Q_λγ) − (1/(λ₂ + λ))(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ)   # simplify the (λ₂ + λ) factors
= max_γ min_{z∈D_z} γ^T Q_λγ − 2y^T Xγ + 2γ^T X^T y − 2γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ)   # distribute 2γ^T inside the parentheses
= max_γ min_{z∈D_z} −γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ)   # −2y^T Xγ + 2γ^T X^T y = 0
= max_γ min_z L^saddle_ridge,λ(γ, z).   # subject to Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]; apply the definition of L^saddle_ridge,λ(γ, z)
This completes the derivation for Theorem 3.1.

Proof. We will show max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c) ≤ max_γ min_z L^saddle_ridge,λ(γ, z) and max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c) ≥ max_γ min_z L^saddle_ridge,λ(γ, z) (both under the constraints Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]), and thereby establish the equivalence of the optimal values. For the first inequality, let β*, c* be the optimal solution to Problem (40) with the relation c* = (2/(λ₂ + λ))(X^T y − Q_λβ*) (this is always possible by Lemma C.1). Furthermore, let γ* = β*.
Then we have
max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c) = min_z L^Fenchel_ridge,λ(β*, z, c*) = min_z L^saddle_ridge,λ(γ*, z) ≤ max_γ min_z L^saddle_ridge,λ(γ, z).
The derivation for the second equality roughly follows the derivation in the first proof we gave. For the second inequality, let γ̂ be the optimal solution to Problem (38), let ĉ = (2/(λ₂ + λ))(X^T y − Q_λγ̂), and let β̂ = γ̂ (notice that this β̂ satisfies the optimality condition of Problem (40)). Then we have
max_γ min_z L^saddle_ridge,λ(γ, z) = min_z L^saddle_ridge,λ(γ̂, z) = min_z L^Fenchel_ridge,λ(β̂, z, ĉ) = min_{β,z} L^Fenchel_ridge,λ(β, z, ĉ) ≤ max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c).
Similarly, the derivation for the second equality roughly follows the derivation in our first proof, this time in reverse order. Combining the two inequalities establishes the equivalence of the optimal values of Problems (40) and (38), i.e., max_c min_{β,z} L^Fenchel_ridge,λ(β, z, c) = max_γ min_z L^saddle_ridge,λ(γ, z).

C.2 Proof of Theorem 3.2

Theorem 3.2. The function h(γ) defined in Equation (14) is lower bounded by
h(γ) ≥ −γ^T Q_λγ − (1/(λ₂ + λ))‖X^T y − Q_λγ‖²₂. (42)
Furthermore, the right-hand side of Equation (42) is maximized at γ = γ̂ = argmin_α L_ridge(α), in which case h(γ̂) on the left-hand side of Equation (42) becomes
h(γ̂) = L_ridge(γ̂) + (λ₂ + λ)·SumBottom_{p−k}({γ̂_j²}),
where SumBottom_{p−k}(·) denotes the sum of the smallest p − k terms of a given set.

Proof. We first derive Inequality (42).
h(γ) = min_{z∈D_z} L^saddle_ridge,λ(γ, z)   # D_z := {z | Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]}
= min_{z∈D_z} −γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T diag(z)(X^T y − Q_λγ)   # apply the definition of L^saddle_ridge,λ(γ, z)
= min_{z∈D_z} −γ^T Q_λγ − (1/(λ₂ + λ)) d^T diag(z) d   # let d := X^T y − Q_λγ
= min_{z∈D_z} −γ^T Q_λγ − (1/(λ₂ + λ)) Σ_{j=1}^p d_j² z_j   # use summation notation instead of matrix notation
≥ −γ^T Q_λγ − (1/(λ₂ + λ)) Σ_{j=1}^p d_j²   # since z_j ∈ [0, 1], setting every z_j = 1 gives a lower bound
= −γ^T Q_λγ − (1/(λ₂ + λ)) d^T d   # use matrix notation instead of summation notation
= −γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T (X^T y − Q_λγ).   # apply d := X^T y − Q_λγ
This completes the derivation of Inequality (42). The right-hand side of Inequality (42) is maximized when its gradient with respect to γ is 0. The gradient can be derived and simplified as:
∇_γ [−γ^T Q_λγ − (1/(λ₂ + λ))(X^T y − Q_λγ)^T (X^T y − Q_λγ)]
= ∇_γ [−γ^T Q_λγ − (1/(λ₂ + λ))(y^T XX^T y − 2y^T XQ_λγ + γ^T Q_λ^T Q_λγ)]   # expand the quadratic term
= −2Q_λγ − (1/(λ₂ + λ))(−2Q_λ^T X^T y + 2Q_λ^T Q_λγ)   # apply the differential operator to each term
= −2Q_λ^T γ − (1/(λ₂ + λ))(−2Q_λ^T X^T y + 2Q_λ^T Q_λγ)   # Q_λ = X^T X − λI is symmetric
= −(2/(λ₂ + λ)) Q_λ^T [(λ₂ + λ)γ − X^T y + Q_λγ]   # pull the common factor (2/(λ₂ + λ))Q_λ^T out
= −(2/(λ₂ + λ)) Q_λ^T [(λ₂ + λ)Iγ − X^T y + (X^T X − λI)γ]   # substitute Q_λ = X^T X − λI in the last term
= −(2/(λ₂ + λ)) Q_λ^T [(X^T X + λ₂I)γ − X^T y].   # aggregate the coefficients of γ
Therefore, if we set (X^T X + λ₂I)γ = X^T y, the gradient of the right-hand side of Inequality (42) is 0, and the optimality condition is achieved. This means γ = (X^T X + λ₂I)^{-1} X^T y, which is precisely the minimizer of the ridge regression loss L_ridge(·).
Lastly, we show that when γ̂ = argmin_γ L_ridge(γ), h(γ̂) = L_ridge(γ̂) + (λ₂ + λ)·SumBottom_{p−k}({γ̂_j²}). First, let us derive a useful property. We claim that when γ̂ = argmin_γ L_ridge(γ), X^T y − Q_λγ̂ = (λ₂ + λ)γ̂.
Below, we prove this by showing that the difference of the two sides is 0:
X^T y − Q_λγ̂ − (λ₂ + λ)γ̂
= X^T y − (X^T X − λI)γ̂ − (λ₂ + λ)γ̂   # substitute Q_λ = X^T X − λI
= X^T y − (X^T X + λ₂I)γ̂   # aggregate the terms in front of γ̂
= X^T y − (X^T X + λ₂I)(X^T X + λ₂I)^{-1} X^T y   # γ̂ = argmin_γ L_ridge(γ) = (X^T X + λ₂I)^{-1} X^T y
= X^T y − X^T y   # cancel out (X^T X + λ₂I)
= 0.
Therefore, we have X^T y − Q_λγ̂ = (λ₂ + λ)γ̂ when γ̂ = argmin_γ L_ridge(γ). With this property, we can derive the final simplified formula for h(γ̂):
h(γ̂) = min_{z∈D_z} L^saddle_ridge,λ(γ̂, z)   # D_z := {z | Σ_{j=1}^p z_j ≤ k, z_j ∈ [0, 1]}
= min_{z∈D_z} −γ̂^T Q_λγ̂ − (1/(λ₂ + λ))(X^T y − Q_λγ̂)^T diag(z)(X^T y − Q_λγ̂)   # apply the definition of L^saddle_ridge,λ(γ̂, z)
= min_{z∈D_z} −γ̂^T Q_λγ̂ − (1/(λ₂ + λ))·(λ₂ + λ)²·γ̂^T diag(z)γ̂   # apply the property X^T y − Q_λγ̂ = (λ₂ + λ)γ̂
= min_{z∈D_z} −γ̂^T Q_λγ̂ − (λ₂ + λ)·γ̂^T diag(z)γ̂   # cancel one factor of λ₂ + λ
= min_{z∈D_z} −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ) Σ_{j=1}^p γ̂_j² z_j   # apply Q_λ = X^T X − λI; use summation notation
= −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ) max_{z∈D_z} Σ_{j=1}^p γ̂_j² z_j   # move the min inside the negative sign, converting it to a max
= −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)·SumTop_k({γ̂_j²})   # SumTop_k(·) denotes summation over the largest k terms
= −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)(Σ_{j=1}^p γ̂_j² − SumBottom_{p−k}({γ̂_j²}))   # SumBottom_{p−k}(·) denotes summation over the smallest p − k terms
= −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ + (λ₂ + λ)·SumBottom_{p−k}({γ̂_j²}).   # distribute and use matrix notation
We note that −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ = L_ridge(γ̂).
To show this, we subtract the two sides and prove that the difference is equal to 0:
−γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ − L_ridge(γ̂)
= −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ − (γ̂^T X^T Xγ̂ − 2y^T Xγ̂ + λ₂γ̂^T γ̂)   # plug in the formula for L_ridge(γ̂)
= −2(γ̂^T (X^T X + λ₂I)γ̂ − y^T Xγ̂)   # simplify the linear algebra
= −2(γ̂^T X^T y − y^T Xγ̂)   # plug in γ̂ = (X^T X + λ₂I)^{-1} X^T y, since γ̂ is the minimizer of L_ridge(·)
= 0.
Using the property −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ = L_ridge(γ̂), we have:
h(γ̂) = −γ̂^T (X^T X − λI)γ̂ − (λ₂ + λ)γ̂^T γ̂ + (λ₂ + λ)·SumBottom_{p−k}({γ̂_j²}) = L_ridge(γ̂) + (λ₂ + λ)·SumBottom_{p−k}({γ̂_j²}).
This completes the proof of Theorem 3.2.

C.3 Proof of Theorem 3.3

Theorem 3.3. Let F(γ) = γ^T Q_λγ and G(p) = (1/(λ₂ + λ))·SumTop_k({p_j²}). The solution of the subproblem γ^{t+1} = argmin_γ F(γ) + (ρ/2)‖Q_λγ + p^t − X^T y + q^t‖²₂ is
γ^{t+1} = ((2/ρ)I + Q_λ)^{-1}(X^T y − p^t − q^t). (43)
Furthermore, let a = X^T y − θ^{t+1} − q^t, and let J be the set of indices of the top k terms of {|a_j|}. The solution of the subproblem p^{t+1} = argmin_p G(p) + (ρ/2)‖θ^{t+1} + p − X^T y + q^t‖²₂ is p^{t+1}_j = sign(a_j)·v̂_j, where
v̂ = argmin_v Σ_{j=1}^p w_j(v_j − b_j)² s.t. v_i ≥ v_l if |a_i| ≥ |a_l|, (44)
with w_j = 1 if j ∉ J, w_j = 1 + 2/(ρ(λ₂ + λ)) otherwise, and b_j = |a_j|/w_j. Problem (44) is an isotonic regression problem and can be efficiently solved in linear time [12, 21].

Proof. Part I. We first show why Equation (43) is the analytic solution of the first subproblem γ^{t+1} = argmin_γ F(γ) + (ρ/2)‖Q_λγ + p^t − X^T y + q^t‖²₂. Writing out F(γ) = γ^T Q_λγ explicitly, the loss function we want to minimize is
γ^T Q_λγ + (ρ/2)‖Q_λγ + p^t − X^T y + q^t‖²₂ = γ^T Q_λγ + (ρ/2)(γ^T Q_λ^T Q_λγ + 2(p^t − X^T y + q^t)^T Q_λγ + ‖p^t − X^T y + q^t‖²₂).   # write out the squared term as a sum of three terms
This loss function is minimized by taking the gradient with respect to γ and setting it to 0 (the last term is a constant and drops out):
0 = 2Q_λγ + (ρ/2)(2Q_λ^T Q_λγ + 2Q_λ^T (p^t − X^T y + q^t))
= 2Q_λγ + ρ(Q_λ^T Q_λγ + Q_λ^T (p^t − X^T y + q^t))   # distribute the factor of 2
= 2Q_λ^T γ + ρ(Q_λ^T Q_λγ + Q_λ^T (p^t − X^T y + q^t))   # Q_λ is symmetric
= Q_λ^T (2γ + ρQ_λγ + ρ(p^t − X^T y + q^t))   # take out the common factor Q_λ^T
= ρQ_λ^T (((2/ρ)I + Q_λ)γ + (p^t − X^T y + q^t)).   # take out the common factor ρ and aggregate the coefficient of γ
A sufficient condition for the gradient to be 0 is ((2/ρ)I + Q_λ)γ + (p^t − X^T y + q^t) = 0. Solving this equation yields Equation (43): γ^{t+1} = ((2/ρ)I + Q_λ)^{-1}(X^T y − p^t − q^t).

Part II. Next, we show why p^{t+1}_j = sign(a_j)·v̂_j together with Equation (44) gives the solution of the second subproblem p^{t+1} = argmin_p G(p) + (ρ/2)‖θ^{t+1} + p − X^T y + q^t‖²₂. Writing out G(p) = (1/(λ₂ + λ))·SumTop_k({p_j²}) explicitly and using a = X^T y − θ^{t+1} − q^t as shorthand for the constant, the objective can be rewritten as:
p^{t+1} = argmin_p (1/(λ₂ + λ))·SumTop_k({p_j²}) + (ρ/2)‖p − a‖²₂
= argmin_p (2/(ρ(λ₂ + λ)))·SumTop_k({p_j²}) + ‖p − a‖²₂.   # multiplying by 2/ρ does not change the optimal solution
According to Proposition 3.1(b) in [30], the optimal solution p^{t+1} has the property that sign(p^{t+1}_j) = sign(a_j). Inspired by this sign-preserving property, let us make a change of variable by writing p_j = sign(a_j)·v_j. Due to this change of variable, our optimal solution p^{t+1} can be obtained by solving the following optimization problem:
p^{t+1}_j = sign(a_j)·v̂_j, where v̂ = argmin_v (2/(ρ(λ₂ + λ)))·SumTop_k({v_j²}) + Σ_{j=1}^p (v_j − |a_j|)². (45)
Note that in Problem (45), adding the extra constraints v_j ≥ 0 does not change the optimal solution. We can show this by contradiction: suppose there exists an optimal solution v̂ and an index j with v̂_j < 0. Then flipping the sign of v̂_j decreases the objective of Problem (45). This contradicts the assumption that v̂ is the optimal solution.
Therefore, we are safe to add the extra constraints vj 0 into Problem (45) without cutting off the optimal solution. Thus, our optimal solution pt+1 becomes: pt+1 j = sign(aj) ˆvj where ˆv = argmin v 2 ρ(λ2 + λ)Sum Topk({v2 j }) + j=1 (vj |aj|)2 s.t. vj 0 for j. (46) Next, according to Proposition 3.1 (c) in [30], we have ˆvi ˆvl if |ai| |al| in Problem (46). Thus, we add these extra constraints into Problem (46) without changing the optimal solution and pt+1 becomes: pt+1 j = sign(aj) ˆvj where ˆv = argmin v 2 ρ(λ2 + λ)Sum Topk({v2 j }) + j=1 (vj |aj|)2 (47) s.t. vj 0 for j, and vi vj if |ai| |aj|. for i, j. Note that in Problem (47), because of the extra constraints |ai| |aj| for i, j, we can rewrite Sum Topk({v2 j }) = P j J v2 j where J is the set of indices of the top k terms of {|aj|} s. Then the optimal solution pt+1 becomes: pt+1 j = sign(aj) ˆvj where ˆv = argmin v 2 ρ(λ2 + λ) j=1 (vj |aj|)2 (48) s.t. vj 0 for j, and vi vj if |ai| |aj|. for i, j. Next, we drop the constraints vj 0 for j as this does not change the optimal solution, using the same argument of contradiction we have shown before. Therefore, our optimal solution pt+1 can be obtained in the following way: pt+1 j = sign(aj) ˆvj where ˆv = argmin v 2 ρ(λ2 + λ) j=1 (vj |aj|)2 (49) s.t. vi vj if |ai| |aj|. for i, j. We can obtain the formulation in Problem (44) by performing some linear algebra manipulations as follows: argmin v 2 ρ(λ2 + λ) j=1 (vj |aj|)2 2 ρ(λ2 + λ)v2 j + (vj |aj|)2 + X j / J (vj |aj|)2 # divide the summation into two groups based on the set J 2 ρ(λ2 + λ) + 1 v2 j 2|aj|vj + |aj|2 + X j / J (vj |aj|)2 # combine the coefficients for v2 j 2 ρ(λ2 + λ) + 1 vj |aj| 2 ρ(λ2+λ) + 1 2 ρ(λ2+λ) + 1 + |aj|2 j / J (vj |aj|)2 # complete the square 2 ρ(λ2 + λ) + 1 vj |aj| 2 ρ(λ2+λ) + 1 j / J (vj |aj|)2 # get rid of the constant terms which do not affect the optimal solution # wj = 2 ρ(λ2+λ) + 1 if j J and wj = 1 if j / J j=1 wj(vj bj)2. 
# let b_j = |a_j|/w_j and combine the two summations into a single summation
Using this new form, we obtain the final result stated in the theorem for the optimal solution p^{t+1} of the second optimization problem:
p^{t+1}_j = sign(a_j) v̂_j, where v̂ = argmin_v Σ_{j=1}^p w_j (v_j − b_j)^2 s.t. v_i ≥ v_j if |a_i| ≥ |a_j| for all i, j, (50)
with w_j = 1 if j ∉ J, w_j = 2/(ρ(λ2+λ)) + 1 otherwise, and b_j = |a_j|/w_j.
Therefore, the task of evaluating the proximal operator reduces to solving an isotonic regression problem. We would like to acknowledge that the connection between SumTopk(·) and isotonic regression was also recently discovered by [55]. However, the context and applications are entirely different: [55] finds this relationship in the context of training neural networks, whereas we find it in the context of ADMM and the perspective relaxation of k-sparse ridge regression. Mathematically, our objective function also differs, as it contains a linear term in p in the proximal operator evaluation. C.4 Proof of Theorem 3.4 Theorem 3.4. Define the k-sparse vector pair domain Ω_k := {(x, y) ∈ R^p × R^p : ‖x‖_0 ≤ k, ‖y‖_0 ≤ k, ‖x − y‖_0 ≤ k}. Any M_1 satisfying f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M_1/2)‖y − x‖_2^2 for all (x, y) ∈ Ω_1 is called a restricted smoothness parameter with support size 1, and any m_{2k} satisfying f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m_{2k}/2)‖y − x‖_2^2 for all (x, y) ∈ Ω_{2k} is called a restricted strong convexity parameter with support size 2k. If β̂ is the heuristic solution found by our beam-search method and β* is the optimal solution, then:
L_ridge(β*) ≤ L_ridge(β̂) ≤ (1 − e^{−m_{2k}/M_1}) L_ridge(β*). (51)
Proof. The left-hand inequality L_ridge(β*) ≤ L_ridge(β̂) is immediate, because the optimal loss is always less than or equal to the loss found by our heuristic method. Therefore, we focus on the right-hand inequality L_ridge(β̂) ≤ (1 − e^{−m_{2k}/M_1}) L_ridge(β*).
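As a concrete illustration of the isotonic-regression reduction that closes Appendix C.3 (Problem (50)): the weighted problem min_v Σ_j w_j (v_j − b_j)^2 under a chain ordering can be solved by the pool-adjacent-violators algorithm (PAVA). The sketch below is a minimal pure-Python illustration, not the paper's implementation; `coef` stands for the hypothetical quantity 2/(ρ(λ2+λ)), and the weights/targets follow the w_j, b_j pattern of the theorem:

```python
def pava_weighted(b, w):
    """Weighted isotonic regression: minimize sum_j w[j]*(v[j]-b[j])^2
    subject to v nondecreasing (pool-adjacent-violators)."""
    blocks = []  # each block: [weighted mean, total weight, count]
    for bj, wj in zip(b, w):
        blocks.append([bj, wj, 1])
        # merge the last two blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    v = []
    for mean, _, count in blocks:
        v.extend([mean] * count)
    return v

def prox_sketch(a, k, coef):
    """Sketch of the p-update: sort by |a_j|, weight the top-k entries,
    solve weighted isotonic regression, then restore signs."""
    p = len(a)
    order = sorted(range(p), key=lambda j: abs(a[j]))  # ascending |a_j|
    topk = set(order[-k:])                             # set J: top-k |a_j|
    w = [coef + 1.0 if j in topk else 1.0 for j in order]
    b = [abs(a[j]) / wj for j, wj in zip(order, w)]    # b_j = |a_j| / w_j
    v_sorted = pava_weighted(b, w)
    v = [0.0] * p
    for pos, j in enumerate(order):
        v[j] = v_sorted[pos]
    sign = lambda x: (x > 0) - (x < 0)
    return [sign(a[j]) * v[j] for j in range(p)]       # p_j = sign(a_j) * v_j

result = prox_sketch([3.0, -1.0, 0.5, -2.0], k=2, coef=0.5)
```

The ordering constraint v_i ≥ v_j for |a_i| ≥ |a_j| becomes simple monotonicity after sorting by |a_j|, which is exactly the setting PAVA handles in linear time.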
We first give an overview of our proof strategy, which is similar to the proof in [29], but our decay factor m2k/M1 in the exponent is much larger than theirs, which is m2k/M2k (we can obtain the definition of M2k by extending the definition of M1 to the support size 2k). Let Lridge(βt) and Lridge(βt+1) be the losses at the t-th and (t + 1)-th iterations of our beam search algorithm. We first derive a lower bound formula on the decrease of the loss function Lridge( ) between two successive iterations, i.e., Lridge(βt) Lridge(βt+1) . We will then show that this quantity of decrease can be expressed in terms of the parameters M1 and m2k defined in the statement of the theorem. Finally, we use the relationship between the decrease of the loss and M1 and m2k to derive the right-hand inequality in Equation (51). Step 1: Decrease of Loss between Two Successive Iterations: For an arbitrary coordinate j that is not part of the support of βt, or more explicitly βt j = 0, if we change the coefficient on the j-th coordinate βt from 0 to α, the loss becomes: Lridge(βt + αej) # ej denotes a vector with j-th entry to be 1 and 0 otherwise =(y X(βt + αej))T (y X(βt + αej)) + λ2(βt + αej)T (βt + αej) # apply the definition of ridge regression loss =(y Xβt αXej)T (y Xβt αXej) + λ2(βt + αej)T (βt + αej) # distribute into the parentheses =(y Xβt αX:j)T (y Xβt αX:j) + λ2(βt + αej)T (βt + αej) # X:j denotes the j-th column of X =(y Xβt)T (y Xβt) 2α(y Xβt)T X:j + α2XT :j X:j + λ2 (βt)T βt + 2α(βt)T ej + α2 # expand each quadratic term into three summation terms =(y Xβt)T (y Xβt) 2α(y Xβt)T X:j + α2XT :j X:j + λ2 (βt)T βt + 2αβt j + α2 # (βt)T ej = βt j; ej takes the j-th component of βt =(y Xβt)T (y Xβt) 2α(y Xβt)T X:j + α2XT :j X:j + λ2 (βt)T βt + α2 # βt j = 0 because we are optimizing on a coordinate with coefficient equal to 0 =(y Xβt)T (y Xβt) + λ2(βt)T βt + ( X:j 2 2 + λ2)α2 2((y Xβt)T X:j)α # collect relevant coefficients for α2 and α =Lridge(βt) + ( X:j 2 2 + λ2)α2 2((y Xβt)T 
X:j)α # simplify the first two terms using Lridge( ) =Lridge(βt) + ( X:j 2 2 + λ2)α2 + j Lridge(βt)α # the j-th partial derivative is j Lridge(βt) = 2(Xβt y)T X:j according to Lemma B.7 with βt j = 0 =Lridge(βt) + ( X:j 2 2 + λ2) α2 + j Lridge(βt) X:j 2 2 + λ2 α # extract out a coefficient =Lridge(βt) + ( X:j 2 2 + λ2) 2 j Lridge(βt) X:j 2 2 + λ2 α + 1 2 j Lridge(βt) X:j 2 2 + λ2 2 j Lridge(βt) X:j 2 2 + λ2 # add and subtract the same term =Lridge(βt) + ( X:j 2 2 + λ2) 2 j Lridge(βt) X:j 2 2 + λ2 2 j Lridge(βt) X:j 2 2 + λ2 # complete the square =Lridge(βt) 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 + ( X:j 2 2 + λ2) α + 1 2 j Lridge(βt) X:j 2 2 + λ2 # distribute into the parentheses The first two terms are constant, and the third term is a quadratic function of α. The minimum loss we can achieve for Lridge(βt + αej) by optimizing on the j-th coefficient is: min α Lridge(βt + αej) = Lridge(βt) 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 , and the maximum decrease of our ridge regression loss by optimizing a single coefficient is then: max α (Lridge(βt) Lridge(βt + αej)) = Lridge(βt) min α Lridge(βt + αej) = 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 , where X:j is the j-th column of the design matrix X. Now, for our beam-search method, the decrease of loss between two successive iterations will be greater than the decrease of loss by optimizing on a single coordinate (because we are fine-tuning the coefficients on the newly expanded support and picking the best expanded support among all support candidates in the beam search), we have Lridge(βt) Lridge(βt+1) max j max α (Lridge(βt) Lridge(βt + αej)) = max j 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 . For an arbitrary set of coordinates J , we have the following inequality: max j 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 1 |J | 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 . (54) Inequality 54 holds because the largest decrease of loss among all possible coordinates is greater than or equal to the average decrease of loss among any arbitrary set of coordinates J . 
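The single-coordinate decrease derived in Step 1 can be checked numerically. A minimal sketch on synthetic data (not from the paper), verifying that optimizing one zero coefficient decreases the loss by exactly (∂_j L_ridge(β^t))^2 / (4(‖X_{:j}‖_2^2 + λ2)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam2 = 30, 6, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
beta = np.zeros(d)  # current iterate with empty support

def ridge_loss(b):
    r = y - X @ b
    return r @ r + lam2 * (b @ b)

grad = -2 * (y - X @ beta) @ X          # partial derivatives at beta (beta_j = 0)
col_norms = (X ** 2).sum(axis=0) + lam2  # ||X_{:j}||^2 + lam2 per coordinate
decrease = grad ** 2 / (4 * col_norms)   # predicted loss drop per coordinate
j = int(np.argmax(decrease))             # greedy pick, as in the beam search
alpha = -grad[j] / (2 * col_norms[j])    # optimal step on coordinate j

# the realized decrease matches the closed-form prediction
realized = ridge_loss(beta) - ridge_loss(beta + alpha * np.eye(d)[j])
assert np.isclose(realized, decrease[j])
```

This `decrease` vector is precisely the score the beam search can rank candidate features by when expanding a support by one coordinate.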
If we choose J = SR := S \ supp(βt), where S is the set of coordinates for the optimal solution β and supp( ) denotes the support of the solution, we have Lridge(βt) Lridge(βt+1) max j 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 1 4 ( j Lridge(βt))2 X:j 2 2 + λ2 ( j Lridge(βt))2 X:j 2 2 + λ2 # pull 1 4 to the front ( j Lridge(βt))2 X:j 2 2 + λ2 # because SR S , |SR| |S | = k 4k 1 maxj SR( X:j 2 2 + λ2) j SR ( j Lridge(βt))2 # take the maximum of the denominator and pull it to the front Lridge(βt) Lridge(βt+1) 1 4k 1 maxj SR( X:j 2 2 + λ2) j SR ( j Lridge(βt))2. (55) Inequality (55) is the lower bound of decrease of loss between two successive iterations of our beam-search algorithm. In Step 2 and Step 3 below, we will provide bounds on maxj SR( X:j 2 2 + λ2) and P j SR( j Lridge(βt))2 in terms of M1 and m2k, respectively. Step 2: Upper bound on maxj SR( X:j 2 2 + λ2) We will show below that M1 2 maxj SR( X:j 2 2 + λ2). According to the definition of the restricted smoothness of M1 (See Background Concepts in Appendix B), we have that 2 w2 w1 2 2 Lridge(w2) Lridge(w1) Lridge(w1)T (w2 w1) (56) for w1, w2 Rp with w1 0 1, w2 0 1, and w2 w1 0 1. 
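Step 2's bound can also be sanity-checked numerically: since the ridge loss is quadratic, the Bregman gap is exact, so a 1-sparse difference t·e_j yields exactly (‖X_{:j}‖_2^2 + λ2) t^2, confirming that M_1 = 2 max_j(‖X_{:j}‖_2^2 + λ2) is a valid restricted smoothness parameter with support size 1. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam2 = 25, 5, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def loss(b):  # L_ridge as used in this proof
    r = X @ b - y
    return r @ r + lam2 * (b @ b)

def grad(b):
    return 2 * X.T @ (X @ b - y) + 2 * lam2 * b

# 1-sparse difference w2 - w1 = t * e_j
j, t = 3, 0.7
w1 = np.zeros(d)
w2 = w1.copy()
w2[j] = t

# Bregman gap = (w2-w1)^T (X^T X + lam2 I) (w2-w1) = (||X_{:j}||^2 + lam2) t^2
gap = loss(w2) - loss(w1) - grad(w1) @ (w2 - w1)
assert np.isclose(gap, (np.sum(X[:, j] ** 2) + lam2) * t ** 2)
```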
The right-hand side can be rewritten as Lridge(w2) Lridge(w1) Lridge(w1)T (w2 w1) =( Xw2 y 2 2 + λ2 w2 2 2) ( Xw1 y 2 2 + λ2 w1 2 2) 2(XT (Xw1 y) + λ2w1)T (w2 w1) # write out Lridge( ) explicitly and substitute Lridge(w1) = 2 XT (Xw1 y) + w1 according to Lemma B.7 =( (Xw2 Xw1) + (Xw1 y) 2 2 + λ2 (w2 w1) + w1 2 2) ( Xw1 y 2 2 + λ2 w1 2 2) 2(XT (Xw1 y) + λ2w1)T (w2 w1) # We subtract and add identical terms inside the first parentheses = Xw2 Xw1 2 2 + Xw1 y 2 2 + 2(Xw2 Xw1)T (Xw1 y) + λ2 w2 w1 2 2 + λ2 w1 2 2 + 2λ2(w2 w1)T w1 ( Xw1 y 2 2 + λ2 w1 2 2) 2(XT (Xw1 y) + λ2w1)T (w2 w1) # expand the terms inside for the first line into two lines =( Xw2 Xw1 2 2 + Xw1 y 2 2 + 2(XT (Xw1 y))T (w2 w1)) + λ2 w2 w1 2 2 + λ2 w1 2 2 + 2(λ2w1)T (w2 w1) ( Xw1 y 2 2 + λ2 w1 2 2) 2(XT (Xw1 y) + λ2w1)T (w2 w1) # rewrite (Xw1 Xw2)T (Xw1 y) = (XT (Xw1 y))T (w2 w1) = Xw2 Xw1 2 2 + λ2 w2 w1 2 2 # cancel out terms =(w2 w1)T (λ2I + XT X)(w2 w1) Lridge(w2) Lridge(w1) Lridge(w1)T (w2 w1) = (w2 w1)T (λ2I + XT X)(w2 w1). Therefore, together with Inequality (56), we have that 2 w2 w1 2 2 (w2 w1)T (λ2I + XT X)(w2 w1) (57) for any w2 w1 Rp and w2 w1 0 1. If we let w2 w1 = ej where ej is a vector with 1 on the j-th entry and 0 otherwise, we have 2 λ2 + X:j 2 2 for j [1, ..., p]. Thus, we can derive our upper bound at the beginning of Step 2: 2 max j SR( X:j 2 2 + λ2) (58) Step 3: Lower bound on P j SR( j Lridge(βt))2 We will show that P j J ( j Lridge(βt))2 2m2k(Lridge(βt) Lridge(β )). 
According to the definition of the restricted strong convexity parameter m2k, we have: Lridge(βt)T (β βt) + m2k 2 β βt 2 2 Lridge(β ) Lridge(βt), where βt Rp is our heuristic solution at the t-th iteration, β is the optimal k-sparse ridge regression solution, and βt 0 = t k < 2k, β 0 = k < 2k, and β βt 0 2k By rearranging some terms, we have: Lridge(βt) Lridge(β ) Lridge(βt)T (β βt) m2k 2 β βt 2 2 # always holds because of restricted strong convexity = Lridge(βt)T (βt + j=1 αjej βt) m2k j=1 αjej βt 2 2 # ej is a vector with 1 on the j-th entry and 0 otherwise; αj is a fixed number with αj = β j βt j = Lridge(βt)T ( j=1 αjej) m2k j=1 αjej 2 2 # αj is a fixed number with αj = β j βt j j Lridge(βt)αj m2k 2 α2 j # Combine into one summation j Lridge(βt)αj m2k j Lridge(βt)αj m2k j / SR supp(βt) j Lridge(βt)αj m2k # Recall SR = S \ supp(βt). We divide the indices {1, 2, ..., p} into three groups: SR, supp(βt), and the rest. j Lridge(βt)αj m2k # j Lridge(βt) = 0 for j supp(βt), and αj = 0 for j / SR supp(βt) j Lridge(βt)αj m2k 2 α2 j # Change the index notation from j to j j Lridge(βt)νj m2k # Allow α to take any values with a new variable ν. The inequality holds if we take the maximum with respect to ν j J ( j Lridge(βt))2 # each term can be maximized because it is a quadratic function This gives us the lower bound: X j J ( j Lridge(βt))2 2m2k(Lridge(βt) Lridge(β )) (59) Step 4: Final Bound for the Loss of Our Heuristic Solution If we plug Inequalities (58) and (59) into the Inequality (55), we have: Lridge(βt) Lridge(βt+1) 1 4k 1 maxj SR( X:j 2 2 + λ2) j SR ( j Lridge(βt))2 Lridge(βt) Lridge(βt+1) m2k k M1 (Lridge(βt) Lridge(β )) (60) Lastly, let s use Inequality (60) to derive the right-hand-side of Inequality (51) stated in the Theorem. 
Inequality (60) can be rewritten as:
(L_ridge(β^t) − L_ridge(β*)) − (L_ridge(β^{t+1}) − L_ridge(β*)) ≥ (m_{2k}/(k M_1)) (L_ridge(β^t) − L_ridge(β*)),
which can be rearranged to the following expression:
(1 − m_{2k}/(k M_1)) (L_ridge(β^t) − L_ridge(β*)) ≥ L_ridge(β^{t+1}) − L_ridge(β*).
Therefore, by applying the above inequality k times, we have:
(1 − m_{2k}/(k M_1))^k (L_ridge(0) − L_ridge(β*)) ≥ L_ridge(β^k) − L_ridge(β*).
Note that L_ridge(0) = 0. This gives us
L_ridge(β^k) ≤ (1 − (1 − m_{2k}/(k M_1))^k) L_ridge(β*).
To get the final inequality, we start from the well-known inequality 1 + x ≤ e^x for any x ∈ R. If we let x = −a/k with a = m_{2k}/M_1 (note that x > −1 for any k ≥ 1 because m_{2k}/M_1 < 1 [29]), then 1 − a/k ≤ e^{−a/k}. Raising both sides to the power k, we have (1 − a/k)^k ≤ e^{−a}, i.e.,
(1 − m_{2k}/(k M_1))^k ≤ e^{−m_{2k}/M_1}.
Because L_ridge(β*) is a negative number, we finally have
L_ridge(β^k) ≤ (1 − (1 − m_{2k}/(k M_1))^k) L_ridge(β*) ≤ (1 − e^{−m_{2k}/M_1}) L_ridge(β*).
C.5 Derivation of g(a, b) = max_c (ac − (c^2/4) b) via the Fenchel Conjugate Recall that our function g : R × R → R ∪ {+∞} is defined as:
g(a, b) = a^2/b if b > 0; 0 if a = b = 0; +∞ otherwise. (62)
The convex conjugate of g(·,·), denoted g*(·,·), is defined as [8]:
g*(c, d) = sup_{a,b} (ac + bd − g(a, b)). (63)
Since g(a, b) is convex and lower semi-continuous, the biconjugate of g(a, b) equals the original function, i.e., (g*)* = g [8]. Writing this explicitly, we have:
g(a, b) = sup_{c,d} (ac + bd − g*(c, d)). (64)
For now, let us derive an analytic formula for the convex conjugate g*(·,·). According to the definition of g(a, b) in Equation (62), we have
h(a, b, c, d) = ac + bd − g(a, b) = { ac + bd − a^2/b if b > 0; 0 if a = b = 0; −∞ otherwise }.
For any fixed c and d, a stationary point of the first case is achieved when the first derivatives of ac + bd − a^2/b with respect to a and b are both 0, i.e., c − 2a/b = 0 and d + a^2/b^2 = 0. If c^2/4 + d = 0, the stationarity condition is achieved by letting a = bc/2, and then ac + bd − a^2/b = (bc/2)c + b(−c^2/4) − bc^2/4 = 0. If c^2/4 + d ≠ 0, the stationarity condition cannot be achieved. However, if we still let a = bc/2, we have ac + bd − a^2/b = (bc/2)c + bd − bc^2/4 = (c^2/4 + d) b.
Then, we have two specific cases to discuss: 1) if c^2/4 + d > 0, the supremum is +∞ (let b → ∞); 2) if c^2/4 + d < 0, the supremum is 0 (let b → 0). With all these cases discussed, we have
g*(c, d) = sup_{a,b} h(a, b, c, d) = { 0 if c^2/4 + d ≤ 0; +∞ otherwise }.
Now let us revisit Equation (64), where g(a, b) = sup_{c,d} (ac + bd − g*(c, d)). Below, we discuss three cases and derive the alternative formulas for g(a, b) under each scenario: (1) c^2/4 + d = 0, (2) c^2/4 + d < 0, and (3) c^2/4 + d > 0:
g1(a, b) = sup_{c^2/4+d=0} (ac + bd − g*(c, d)) = sup_c (ac + b(−c^2/4)),
g2(a, b) = sup_{c^2/4+d<0} (ac + bd − g*(c, d)) = sup_{c^2/4+d<0} (ac + bd),
g3(a, b) = sup_{c^2/4+d>0} (ac + bd − g*(c, d)) = −∞.
After discussing these three cases, we obtain the final result by taking the maximum of the three optimal values, i.e., g(a, b) = max(g1(a, b), g2(a, b), g3(a, b)). The third case c^2/4 + d > 0 is not interesting and can be eliminated immediately, since g*(c, d) = +∞ there. The second case c^2/4 + d < 0 is also not interesting because ac + bd = ac + b(−c^2/4) + b(c^2/4 + d) ≤ ac + b(−c^2/4) for any b ≥ 0, which is the domain we are interested in. This gives us g2(a, b) ≤ g1(a, b), with equality only when b = 0. Thus,
g(a, b) = max(g1(a, b), g2(a, b), g3(a, b)) = g1(a, b) (65)
= sup_c (ac + b(−c^2/4)) = max_c (ac − bc^2/4),
where in the last step we convert sup(·) to max(·) because the function is quadratic in c and the optimum is attained. D Existing MIP Formulations for k-sparse Ridge Regression We want to solve the following k-sparse ridge regression problem:
min_{β,z} β^T X^T X β − 2y^T Xβ + λ2 Σ_{j=1}^p β_j^2 (67)
subject to (1 − z_j)β_j = 0, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1}.
However, the constraint (1 − z_j)β_j = 0 cannot be handled directly by a mixed-integer programming (MIP) solver. Currently in the research community, there are four main formulations one can write before sending the problem to a MIP solver: (1) SOS1, (2) big-M, (3) perspective (and its eigen-perspective variant), and (4) optimal perspective. We write each formulation explicitly below: 1.
SOS1 formulation:
min_{β,z} β^T X^T X β − 2y^T Xβ + λ2 Σ_{j=1}^p β_j^2 (68)
subject to (1 − z_j, β_j) ∈ SOS1, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1},
where (a, b) ∈ SOS1 means that at most one of the two variables (either a or b but not both) can be nonzero. 2. big-M formulation:
min_{β,z} β^T X^T X β − 2y^T Xβ + λ2 Σ_{j=1}^p β_j^2 (69)
subject to −M z_j ≤ β_j ≤ M z_j, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1},
where M > 0. For the big-M formulation, as mentioned in the main paper, it is very challenging to choose the right value for M: if M is too big, the lower bound given by the relaxed problem is too loose; if M is too small, we risk cutting off the optimal solution. 3. perspective formulation:
min_{β,z,s} β^T X^T X β − 2y^T Xβ + λ2 Σ_{j=1}^p s_j (70)
subject to β_j^2 ≤ s_j z_j, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1},
where the s_j's are new variables which ensure that when z_j = 0, the corresponding coefficient β_j = 0. This is a mixed-integer quadratically-constrained quadratic program (QCQP), and it can be solved by any commercial solver that accepts QCQP formulations. 4. eigen-perspective formulation:
min_{β,z,s} β^T (X^T X − λ_min(X^T X) I) β − 2y^T Xβ + (λ2 + λ_min(X^T X)) Σ_{j=1}^p s_j (71)
subject to β_j^2 ≤ s_j z_j, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1},
where λ_min(·) denotes the smallest eigenvalue of a matrix. This formulation is similar to the QCQP formulation above, but we increase the ℓ2 coefficient by subtracting λ_min(X^T X) I from the diagonal of X^T X. 5. optimal perspective formulation:
min_{B,β,z} ⟨X^T X, B⟩ − 2y^T Xβ (72)
subject to B − ββ^T ⪰ 0, β_j^2 ≤ B_jj z_j, Σ_{j=1}^p z_j ≤ k, z_j ∈ {0, 1},
where A ⪰ 0 means that A is positive semidefinite. This formulation is a mixed-integer semidefinite program (MI-SDP) and was proposed in [62, 28, 37]. Currently, no commercial solver can handle MI-SDP problems: Gurobi does not support SDP, and MOSEK can solve SDPs but does not support MI-SDP. As shown by [28], SDP does not scale to high dimensions. We confirm this observation in our experiments by solving the relaxed convex SDP problem (relaxing {0, 1} to [0, 1]) via MOSEK, the state-of-the-art conic solver.
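For any of these relaxations, a feasible incumbent can be recovered by rounding the relaxed z to its top-k coordinates and refitting ridge regression on that support. A minimal numpy sketch with a hypothetical fractional z (illustrative only, not tied to a particular solver):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, lam2 = 40, 8, 3, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
z_relaxed = rng.uniform(size=d)  # hypothetical fractional z from a relaxation

support = np.argsort(z_relaxed)[-k:]  # indices of the top-k largest z_j
Xs = X[:, support]
# ridge refit restricted to the chosen support
beta_s = np.linalg.solve(Xs.T @ Xs + lam2 * np.eye(k), Xs.T @ y)
beta = np.zeros(d)
beta[support] = beta_s  # feasible k-sparse solution (an upper bound)

assert np.count_nonzero(beta) <= k
```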
After we get the solution from the relaxed problem, we find the indices corresponding to the top k largest zj s and retrieve a feasible solution on these features. E Algorithmic Charts E.1 Overall Branch-and-bound Framework We first provide the overall branch-and-bound framework. Then we go into details on how to calculate the lower bound, how to obtain a near-optimal solution and upper bound (via beam search), and how to do branching. A visualization of the branch-and-bound tree can be found in Figure 5. Figure 5: Branch-and-bound diagram. Algorithm 1 OKRidge(D,k,λ2, ϵ, Titer, Tmax) β Input: dataset D, sparsity constraint k, ℓ2 coefficient λ2, optimality gap ϵ, iteration limit Titer for subgradient ascent, and time limit Tmax. Output: a sparse continuous coefficient vector β Rp with β 0 k. 1: ϵbest, Lbest lower, βbest upper, Lbest upper = , , 0, 0 Initialize best optimality gap to , best lower bound to , best solution to 0 and its loss to 0. 2: Q initialize an empty queue (FIFO) structure FIFO corresponds to a breadth-first search. 3: Nselect = , Navoid = Coordinates in Nselect can be selected during heuristic search and lower bound calculation for a node; Coordinates in Navoid must be avoided. 4: Root Node = Create Node(D, k, λ2, Nselect, Navoid) Create the first node, which is the root node in the Bn B tree. 5: Q = Q.push(Root Node) Put the root node in the queue. 6: lunsolved = 1 Smallest depth where at least one node is unsolved in the Bn B tree 7: WHILE Q = and Telapsed < Tmax do Keep searching until the time limit is reached or there is no node in the queue. 8: Node = Q.get Node() 9: Llower = Node.lower Solve Fast() Get the lower bound for this node through the fast method 10: if Llower < Lupper best then Llower = Node.lower Solve Tight() Get a tighter lower bound for this node through the subgradient ascent 11: if Llower Lupper best then continue Prune Node if its lower bound is higher than or equal to the current best upper bound. 
12: if All nodes in depth lunsolved of the Bn B tree are solved then 13: Lbest lower = maxq({Lq, for q {1, 2, ..., lunsolved}) Tighten the best lower bound; Lq is the smallest lower bound of all nodes in depth q 14: lunsolved = lunsolved + 1 Raise up the smallest unsolved depth by 1 15: ϵbest = (Lbest upper Lbest lower)/|Lbest upper| 16: if ϵbest < ϵ then return βbest Optimality gap tolerance is reached, return the current best solution 17: βupper, Lupper = Node.upper Solve() Get feasible sparse solution and its loss from beam search, which is an upper bound for our Bn B tree. 18: if Lupper < Lbest upper then 19: Lbest upper, βbest upper = Lupper, βupper Update our best solution and best upper bound 20: ϵbest = (Lbest upper Lbest lower)/|Lbest upper| 21: if ϵbest < ϵ then return βbest Optimality gap tolerance is reached, return the current best solution 22: j = Node.get Branch Coord(βupper) Get which feature coordinate to branch 23: Child Node1 = Create Node(D, k, λ2, (Node Nselect) {j}, Node Navoid) New child node where feature j must be selected 24: Child Node2 = Create Node(D, k, λ2, Node Nselect, (Node Navoid) {j}) New child node where feature j must be avoided 25: Q.push(Child Node1, Child Node2) Put the two newly generated nodes into our queue. 26: return βbest Return the current best solution when time limit is reached or there is no node in the queue E.2 Fast Lower Bound Calculation Algorithm 2 Node.lower Solve Fast() Llower Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, set of features that must be selected Nselect, set of features that must be avoided Navoid. Output: a lower bound for the node Llower. 1: γ = argminβ,βj=0 if j Navoid βT XT Xβ 2y T Xβ + λ2 β 2 2 Solve the ridge regression problem. We avoid using coordinates belonging to Navoid by fixing the coefficients on those coordinates to be 0 . 2: Lridge = γT XT Xγ 2y T Xγ + λ2 γ 2 2 Loss for the ridge regression. 
3: Lextra = (λ2 + λmin(XT X)) Sum Bottomp |Navoid| k({β2 j , for j [1, ..., p] \ (Navoid Nselect)}) First of all, we are restricted to the coordinates not belonging to Navoid. Second, coordinates in the set Nselect must be selected. If we work through the strong convexity argument in Section 3.1 in the main paper, we would get this formula. 4: Llower = Lridge + Lextra 5: return Llower E.3 Refine the Lower Bound through the ADMM method Next, we show how to get a tighter lower bound through the ADMM method. Algorithm 3 Node.lower Solve Tight ADMM() Llower Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, set of features that must be selected Nselect, and iteration limit Titer for ADMM (For simplicity of presentation, we assume Navoid is empty. If Navoid = , we can do a selection and relabeling step on X, y, Qλ, Nselect to focus solely on coordinates not belonging to Navoid). Output: a lower bound for the node Llower. 1: γ1 = argminβ βT XT Xβ 2y T Xβ + λ2 β 2 2 Initialize γ by solving the ridge regression problem. This will be used as a warm start for the ADMM method below. To avoid redundant computation, we can use the value for γ from Node.lower Solve Fast(). 2: p1 = XT y Qλγ Initialize p as stated in Problem 18. 3: q1 = 0 Initialize the scaled dual variable q in the ADMM. 4: ρ = 2/ p λmax(Qλ)λmin>0(Qλ) Calculate the step size used in the ADMM 5: for t = 2, 3, ...Titer. do 6: γt = ( 2 ρI + Qλ) 1(XT y pt 1 qt 1) Update γ in the ADMM. 7: θt = 2Qλγt + pt 1 XT y Update θ in the ADMM. 8: a = XT y θt qt 1 This line and the next 3 lines are used to calculate variables used for setting up the isotonic regression problem. 9: Let J be the indices of the top k |Nselect| terms of |aj| with the constraint that any element in J does not belong to Nselect. 10: Let wj = 1 if j / J and j / Nselect and let wj = 2 ρ(λ2+λ)+1 otherwise. 
11: Let bj = |aj| 12: Solve the problem ˆv = argminv P j wj(vj bj)2 such that for i, l / Nselect, vi vl if |ai| |al|. We can decompose this optimization problem into two independent problems with coordinates belonging to Nselect (can be solved individually) and coordinates not belonging to Nselect (can be solved via standard isotonic regression). 13: pt = sign(a) ˆv Update p in the ADMM. The symbol denotes the element-wise (Hadamard) product. 14: qt = qt 1 + θt + pt XT y Update q in the ADMM. 15: end for 16: Llower = h(γTiter) Calculate the lower bound. 17: return Llower E.4 (Optional) An alternative to the Lower Bound Calculation via Subgradient Ascent and the CMF method In addition to the ADMM method, we also provide an alternative algorithm showing how to get a tighter lower bound through the subgradient ascent algorithm using the CMF method. Note that this subgradient method has been superseded by the ADMM method, which converges faster. We provide this method in the Appendix for the sake of completeness and for future readers to consult. Algorithm 4 Node.lower Solve Tight Subgradient() Llower Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, set of features that must be selected Nselect, set of features that must be avoided Navoid, and iteration limit Titer for subgradient ascent Output: a lower bound for the node Llower. 1: γ = argminβ,βj=0 if j Navoid βT XT Xβ 2y T Xβ + λ2 β 2 2 Solve the ridge regression problem and avoid using coordinates in Navoid. This will be used as a warm start for the subgradient ascent procedure below. To avoid redundant computation, we can use the value for γ from Node.lower Solve Fast() 2: v = γ Initialization for vt at iteration step t = 1 3: h = Lridge(βroot upper) get an estimate of maxγh(γ); if βroot upper has not been solved at the root node, solve βroot upper by calling the upper Solve() function at the root node. 
4: Llower = h(γ) 5: for t = 2, 3, ...Titer do 6: γ, v = Node.subgrad Ascent CMF(γ, v) Perform one step of subgradient ascent using the CMF method 7: Llower = max(Llower, h(γ)) Loss decreases stochastically during the subgradient ascent procedure. We keep the largest value of h(γ) 8: end for 9: return Llower In the above algorithm, we call a new sub algorithm called sub Grad Ascent CMF. We provide the details on how to do this below. We try to maximize h(γ) through iterative subgradient ascent. A more sophisticated algorithm (CMF method [23]) can be applied to adaptively choose the step size, which has been shown to achieve faster convergence in practice. The computational procedure for the subgradient ascent algorithm using the CMF method [18, 23] is as follows. Given γt 1 at iteration t 1, γt can be computed as: wt = Lsaddle ridge λ(γ, Z(γt 1)) γ=γt 1 # get the subgradient st = max(0, dt(vt 1)T wt/ vt 1 2 2) # st will be used as a smoothing factor between vt 1 and wt vt = wt + stvt 1 # combine vt 1 and wt using the smoothing factor αt = (h h(γt 1))/ vt 2 2 # calculate the step size; h is an estimate of maxγ h(γ); here we let h = Lridge(βroot upper) γt = γt 1 + αtvt # perform subgradient ascent where at the beginning of the iteration (t = 1), v1 = 0 and we choose dt = 1.5 as a constant for all iterations as recommended by the original paper. Below, we provide the detailed algorithm to perform subgradient ascent using the CMF method. Algorithm 5 Node.subgrad Ascent CMF(γ, v, h ) γ, v Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, set of features that must be selected Nselect, set of features that must be avoided Navoid Input: Current variable γ for the lower bound function h( ) for Equation 14, current ascent vector v, and h is an estimate of maxγ h(γ). Output: updated variable γ and updated ascent vector v. 
1: ˆz = argminz Lsaddle ridge λ(γ, z) get the minimizer for z at the current value for γ; optimization is performed under the constraint Pp j=1 zj k, zj [0, 1], zj = 1 for j Nselect and zj = 0 for j Nselect 2: w = Lsaddle ridge λ(γ, ˆz) / ( γ) get the subgradient 3: wj = 0 for j Navoid We don t perform any ascent steps on the coordinates belong to Navoid. 4: s = max(0, 1.5v T w/ v 2 2) Calculate the smoothing factor 5: v = w + sv update the ascent vector by combining the current subgradient and previous ascent vector 6: α = (h h(γ)) / v 2 2 calculate the step size for the ascent step 7: γ = γ + αv Perform one step of ascent along the direction of the ascent vector 8: return γ, v E.5 Beam Search Algorithm We provide a visualization of the beam search algorithm in Figure 6 Figure 6: We add into our support one feature at a time by picking the feature which would result in the largest decrease in the objective function. After that, we finetune all coefficients on the support. We keep the top B solutions during each stage of support expansion. Algorithm 6 Node.upper Solve(B = 50) βupper, Lupper Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, a set of feature coordinates that must be selected Nselect, and a set of feature coordinates that must be avoided Navoid. Input: beam search size B, which is set to be 50 as the default value. Output: a near-optimal solution βupper with βupper 0 = k and the corresponding loss Lupper. 1: β = argminβ,βj=0 for j / Nselect βT XT Xβ 2y T Xβ + λ2 β 2 2 Initialize the coefficients by fitting on the must-be-selected coordinates. 2: W = {β } collection of solutions at each iteration (initially we only have one solution) 3: F = Initialize the collection of found supports as an empty set 4: for t = |Nselect| + 1, ..., k do 5: Wtmp 6: for β W do Each of these has support t 1 7: (W , F) Expand Supp By1(D, β , F, B, Navoid). Returns B models with supp. 
t 8: Wtmp Wtmp W 9: end for 10: Reset W to be the B solutions in Wtmp with the smallest ridge regression loss values. 11: end for 12: Pick βupper from W with the smallest ridge regression loss. 13: Lupper = βT upper XT Xβupper 2y T Xβupper + λ2 βupper 2 2 14: return βupper, Lupper. Algorithm 7 Expand Supp By1(D, β , F, B, Navoid) W, F Input: dataset D = {(xi, yi)}n i=1, current solution β , set of found supports F, beam search size B, and a set of feature coordinates that must be avoided Navoid. Output: a collection of solutions W = {βt} with βt 0 = β 0 + 1 for t. All of these solutions include the support of β plus one more feature. None of the solutions have the same support as any element of F, meaning we do not discover the same support set multiple times. We also output the updated F. 1: Let Sc {j | β j = 0 and j / Navoid} Non-support of the given solution 2: Pick up to B coords (j s) in Sc with largest decreases in j Lridge( ), call this set J . We will use these supports, which include the support of w plus one more. 3: W 4: for j J do Optimize on the top B coordinates 5: if (supp(w) {j}) F then 6: continue We ve already seen this support, so skip. 7: end if 8: F F {supp(w) {j}}. Add new support to F. 9: u argminu Lridge(D, u) with supp(u) = supp(w ) {j} . Update coefficients on the newly expanded support 10: W W {u } cache the solution to the central pool 11: end for 12: return W and F. E.6 Branching Algorithm 8 Node.get Branch Coord(β) j Attributes in the Node class: dataset D = {(xi, yi)}n i=1, sparsity constraint k, ℓ2 coefficient λ2, a set of feature coordinates that must be selected Nselect, and a set of feature coordinates that must be avoided Navoid. Output: a single coordinate j on which the node should branch in the Bn B tree. 
1: j = argmax_{j ∉ Nselect} [−2β^T (X^T X + λ2 I)(β_j e_j) + ((X^T X)_{jj} + λ2) β_j^2 + 2y^T X (β_j e_j)] select the coordinate whose coefficient, if set to 0, would result in the greatest increase in the ridge regression loss 2: return j E.7 A Note on the Connection with the Interlacing Eigenvalue Property During the branch-and-bound procedure, finding the smallest eigenvalue of every principal submatrix of X^T X is computationally expensive if done for every node. Fortunately, we can get a lower bound on the smallest eigenvalue from the following interlacing eigenvalue property [36]: Theorem E.1. Suppose A ∈ R^{n×n} is symmetric. Let B ∈ R^{m×m} with m < n be a principal submatrix of A (obtained by deleting the i-th row and i-th column for some set of indices i). Suppose A has eigenvalues γ_1 ≤ ... ≤ γ_n and B has eigenvalues α_1 ≤ ... ≤ α_m. Then γ_k ≤ α_k ≤ γ_{k+n−m} for k = 1, ..., m, and if m = n − 1, γ_1 ≤ α_1 ≤ γ_2 ≤ α_2 ≤ ... ≤ α_{n−1} ≤ γ_n. Using this theorem, we can let λ = λ_min(X^T X) for every node in the branch-and-bound tree of our algorithm. This saves a lot of computational time because we only have to calculate the smallest eigenvalue once, at the root node. However, if the minimum eigenvalue could be recalculated efficiently at each node, the lower bounds would be even tighter. F More Experimental Setup Details F.1 Computing Platform All experiments were run on a 10x Tensor EX TS2-673917-DPN machine with Intel Xeon Gold 6226 processors (2.7 GHz). We set the memory limit to 100 GB. F.2 Baselines, Licenses, and Links Commercial Solvers For the MIP formulations listed in Appendix D, we used Gurobi and MOSEK. We implemented the SOS1, big-M, perspective, and eigen-perspective formulations in Gurobi. The Gurobi version is 10.0, which can be installed through conda (https://anaconda.org/gurobi/gurobi). We used the Academic Site License. We implemented the perspective formulation and the relaxed convex optimal perspective formulation in MOSEK. The MOSEK version is 10.0, which can be installed through conda (https://anaconda.org/MOSEK/mosek).
We used the Personal Academic License. PySINDy, STLSQ, SSR, E-STLSQ We used the PySINDy library (https://github.com/dynamicslab/pysindy) to perform the differential equation experiments. The license is MIT. The heuristic methods STLSQ, SSR, and E-STLSQ are implemented in PySINDy. MIOSR For the differential equation experiments, the MIOSR code is available on GitHub (https://github.com/wesg52/sindy_mio_paper). The license for this code repository is MIT. SubsetSelectionCIO The SubsetSelectionCIO code is available on GitHub (https://github.com/jeanpauphilet/SubsetSelectionCIO.jl) with commit version 8b03d1fba9262b150d6c0f230c6bf9e8ee57f92e. The license for this code repository is the MIT Expat License (https://github.com/jeanpauphilet/SubsetSelectionCIO.jl/blob/master/LICENSE.md). Due to package compatibility, we use Gurobi 9.0 for SubsetSelectionCIO; see the SubsetSelectionCIO package version specification at https://github.com/jeanpauphilet/SubsetSelectionCIO.jl/blob/master/Project.toml. l0bnb The l0bnb code is available on GitHub (https://github.com/alisaab/l0bnb) with commit version 54375f8baeb64baf751e3d0effc86f8b05a386ce. The license for this code repository is MIT. Sparse Simplex The Sparse Simplex code is available at the author's website (https://sites.google.com/site/amirbeck314/software?authuser=0). We ran the greedy sparse simplex algorithm. The license for this code is the GNU General Public License 2.0. Block Decomposition The block decomposition code is available at the author's website (https://yuangzh.github.io). The author did not specify a license for this code repository; however, we did not modify the code and only used it for benchmarking purposes. Besides the block decomposition algorithm, the code repository also contains reimplementations of other algorithms, such as GP, SSP, OMP, proximal gradient l0c, QPM, and ROMP.
We used these reimplementations in the differential equation experiments as well. Please see [60] for details of these baselines.

F.3 Differential Equation Experiments

We compare the certifiable method MIOSR and other baselines on dynamical system identification problems. In particular, for each dynamical system, we perform 20 trials with random initial conditions for 12 different trajectory lengths. Each dynamical system is observed every 0.002 seconds, so a larger trajectory length corresponds to a larger sample size n. Derivatives were approximated using a smoothed finite difference with a window size of 9 points; the derivatives were calculated automatically by the PySINDy library. Each trajectory has 0.2% noise added to it to test robustness to noise. For the SINDy approach, the true dynamics lie within the finite-dimensional set of candidate functions/features. We investigate which method recovers the correct dynamics more effectively.

F.3.1 Model Selection

Model selection was performed via cross-validation, with the training data encompassing the first two-thirds of a trajectory and the final third used for validation. Predicted derivatives were compared to the validation set, with the best model being the one that minimizes AICc, a standard criterion for sparse regression in the setting of dynamical systems [45].

F.3.2 Hyperparameter Choices

For each trajectory, each algorithm was trained with various combinations of hyperparameters. For both MIOSR and our method OKRidge, the ridge regression hyperparameter choices are λ2 ∈ {10^−5, 10^−3, 10^−2, 0.05, 0.2}, and the sparsity level choices are k ∈ {1, 2, 3, 4, 5}. As in the experiments of MIOSR [9], we set the time limit for each optimization to 30 seconds for both MIOSR and OKRidge. For SSR and STLSQ, the ridge regression hyperparameter choices are λ2 ∈ {10^−5, 10^−3, 10^−2, 0.05, 0.2}. SSR and STLSQ can automatically pick a sparsity level, but they require a thresholding hyperparameter.
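The AICc criterion used for model selection above can be sketched as follows (a minimal sketch assuming the standard corrected-AIC form, n·ln(RSS/n) + 2k plus a small-sample correction; the exact variant follows [45], and the candidate (RSS, k) pairs below are hypothetical):

```python
import math

def aicc(rss, n, k):
    # AIC = n*ln(RSS/n) + 2k, plus the small-sample correction
    # 2k(k+1)/(n - k - 1); lower is better.
    aic = n * math.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

# Hypothetical (validation RSS, sparsity k) pairs for candidate fits.
candidates = [(12.4, 1), (3.1, 2), (3.05, 5)]
# Pick the AICc-minimizing model: extra terms must buy a real RSS reduction.
best = min(candidates, key=lambda c: aicc(c[0], n=200, k=c[1]))
```

Here the 5-term model barely improves the validation RSS over the 2-term model, so AICc selects the sparser fit.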
We choose 50 thresholding hyperparameter values between 10^−3 and 10^1.5, depending on the system being fit. Together, there are 250 different hyperparameter combinations to run on each trajectory length for SSR and STLSQ. When fitting E-STLSQ, cross-validation is not performed; the hyperparameters chosen for STLSQ are used instead. For SSR, MIOSR, and OKRidge, we cross-validate on each dimension of the data and combine the strongest results into a final model. For STLSQ and E-STLSQ, no post-processing is done and the same hyperparameters are used for each dimension.

F.3.3 Evaluation Metrics

Three metrics are used to evaluate each algorithm: true positivity rate, L2 coefficient error, and root-mean-squared error (RMSE). The true positivity rate is defined as the number of correctly chosen features divided by the sum of the number of true features and the number of incorrectly chosen features. This metric equals 1 if and only if the correct form of the dynamical system is recovered. L2 coefficient error is defined as the sum of squared errors between the predicted and the true coefficients. Finally, for each model fit, we produce 10 simulations of the true model with different initial conditions and calculate the root-mean-squared error (RMSE) between the predicted derivatives and the true derivatives. Besides these three metrics, we also report running times, comparing OKRidge with other optimal ridge regression methods.

G More Experimental Results on Synthetic Benchmarks

G.1 Results Summary

ℓ2 Perturbation Study We first compare with the SOS1, Big-M (Big-M50 means that the big-M value is set to 50), Perspective, and Eigen-Perspective formulations under different ℓ2 regularizations, with and without beam search as the warm start. We use three levels of ℓ2 regularization: 0.001, 0.1, and 10.0. The results are shown in Tables 1-60 in Appendix G.2.
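The evaluation metrics defined in Appendix F.3.3 can be computed as in the following minimal pure-Python sketch (function names are ours, for illustration only):

```python
def true_positivity_rate(chosen, true_features):
    # TPR = (# correctly chosen) / (# true + # incorrectly chosen);
    # equals 1 iff the recovered support matches the true support exactly.
    chosen, true_features = set(chosen), set(true_features)
    correct = len(chosen & true_features)
    incorrect = len(chosen - true_features)
    return correct / (len(true_features) + incorrect)

def l2_coef_error(beta_hat, beta_true):
    # Sum of squared errors between estimated and true coefficients.
    return sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true))

# Two of three true terms found plus one spurious term: TPR = 2/(3+1) = 0.5
tpr = true_positivity_rate({"x", "y", "xy"}, {"x", "y", "z"})
```

Note that the TPR denominator counts both missed true features (via the full true-feature count) and spurious ones, so any deviation from the exact support drops the metric below 1.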
When a method has "-" in the upper bound or gap column, it means that the method did not produce any nontrivial solution within the time limit. These extensive results show that OKRidge outperforms all existing MIP formulations solved by the state-of-the-art MIP solver Gurobi. With the beam search solutions (usually already the optimal solutions) as warm starts, Gurobi is faster than without warm starts; however, OKRidge still outperforms all existing MIP formulations. When Gurobi can solve the problem within the time limit, OKRidge is usually orders of magnitude faster than all baselines. When Gurobi cannot solve the problem within the time limit, OKRidge either solves the problem to optimality or achieves a much smaller optimality gap than all baselines, especially in the presence of high feature correlation and/or high feature dimension.

Big-M Perturbation Study We also conduct a perturbation study on the big-M values. We pick three values: 50, 20, and 5. In practice, it is difficult to select the right value, because setting the big-M value too small will accidentally cut off the optimal solutions. Nonetheless, we still pick a small big-M value (5; the optimal coefficients are all 1's in the synthetic benchmarks) to do a thorough comparison. The results are shown in Tables 66-120 in Appendix G.3. Again, "-" in the upper bound or gap column means that the method did not produce any nontrivial solution within the time limit. The results show that although choosing a small big-M value can often lead to shorter running times, OKRidge is still orders of magnitude faster than the big-M formulation, and achieves the smallest optimality gap when all algorithms reach the time limit. Additionally, we find that a small big-M value sometimes increases the time needed to solve the root relaxation.
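The cut-off risk of an undersized big-M value can be seen in a one-dimensional toy example (pure Python, hypothetical numbers): the box constraint |β| ≤ M excludes the unconstrained least-squares minimizer, so the best feasible point has strictly higher loss.

```python
def ridge_loss_1d(beta, x, y, lam=0.0):
    # Squared-error loss with optional ridge penalty, single coefficient.
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) + lam * beta ** 2

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]          # data generated with true coefficient exactly 1
beta_ls = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

M = 0.5                       # big-M chosen too small: the box excludes beta_ls
# In 1-D the constrained minimizer is the projection onto [-M, M].
beta_boxed = max(-M, min(M, beta_ls))
```

With `M = 0.5` the "optimal" constrained coefficient is 0.5 rather than 1, and its loss is strictly larger; in the MIP setting this means the solver certifies a suboptimal point as optimal.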
Sometimes, when the root relaxation is not finished within the 1-hour time limit, Gurobi will report a loose lower bound, resulting in large optimality gaps. Aside from accidentally cutting off the optimal solution, this is another reason against choosing a very small big-M value.

Comparison with MOSEK, SubsetSelectionCIO, and l0bnb For a thorough comparison, we also solve the problem with other MIP solvers/formulations. The results are shown in Tables 121-135 in Appendix G.4. When a method has "-" in the upper bound or gap column, it means that the method did not produce any nontrivial solution within the time limit. For the MOSEK solver, when the upper bound and gap columns have "-" values and the time is exactly 3600.00, MOSEK ran out of memory (100 GB). First, we solve the problem using the commercial conic solver MOSEK. We use MOSEK to solve the perspective formulation, with and without beam search solutions as warm starts. The results indicate that Gurobi is in general much faster and more memory-efficient than MOSEK in solving the perspective formulation. We also use MOSEK to solve the relaxed convex optimal perspective formulation (relaxing the binary constraint zj ∈ {0, 1} to the interval constraint zj ∈ [0, 1]). We only solve the relaxed convex SDP problem because MOSEK does not support MI-SDP. The results show that MOSEK has limited capacity for solving the perspective formulation, and it also runs out of memory on large-scale problems (n = 100000, p = 3000 or p = 5000). Moreover, the experimental results show that the convex SDP problem is not scalable to high dimensions: for the high-dimensional cases, OKRidge solving the MIP problem is usually much faster than MOSEK solving a single convex SDP problem. Second, we use the SubsetSelectionCIO method, which solves the original sparse regression problem using branch-and-cut (lazy-constraint callbacks) in Gurobi.
However, as the results indicate, SubsetSelectionCIO produces incorrect "optimal" solutions when the ℓ2 regularization is small and the number of samples is large (n = 100000). When SubsetSelectionCIO does not have such errors, its running times or optimality gaps are much worse than those of OKRidge. Lastly, we solve the sparse ridge regression problem using l0bnb. In contrast to the other methods, l0bnb controls the sparsity level implicitly through an ℓ0 regularization term instead of an ℓ0 constraint, so it needs to try several different ℓ0 regularization coefficients to find the desired sparsity level. For each ℓ0 regularization coefficient, we set the time limit to 180 seconds, and we stop the l0bnb algorithm after a total time limit of 7200 seconds (longer than 3600 seconds because we want to give l0bnb enough time to try different ℓ0 coefficients). Moreover, l0bnb imposes a box constraint (similar to the big-M method) on all coefficients. The box constraint tightens the lower bound and potentially accelerates pruning, but it can also cut off the optimal solution if the M value is not chosen properly or is not large enough. To the best of our knowledge, there is no reliable way to calculate the M value; the l0bnb package estimates it internally in a heuristic way. Despite the box constraint giving l0bnb an extra advantage, we still compare this baseline with OKRidge. As the results indicate, l0bnb often takes a long time to find the desired sparsity level by trying different ℓ0 regularizations. In some cases, l0bnb finds a solution with a support size different from the required sparsity level (demonstrating the difficulty of finding a good ℓ0 coefficient). In other cases, l0bnb still has an optimality gap after reaching the time limit, while OKRidge finishes certification within seconds.

Discussion on the Case p ≫ n For the application of discovering differential equations, the datasets usually have n ≫ p. Therefore, the case p ≫ n is not the main focus of this paper.
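The sparsity-level search that ℓ0-regularized solvers such as l0bnb must perform can be sketched as a bisection over the regularization coefficient. The `solve_l0` callback below is a hypothetical stand-in (the real package computes its own path of λ0 values); the sketch also shows why the search can fail: the support size jumps in steps, so some sparsity levels are unreachable for any λ0.

```python
def find_lambda0_for_sparsity(solve_l0, target_k, lo=1e-6, hi=1e6, iters=50):
    # Bisect on the l0 penalty until the returned support size equals
    # target_k.  Larger lambda0 -> sparser solution.
    for _ in range(iters):
        mid = (lo * hi) ** 0.5          # geometric midpoint (lambda0 > 0)
        k = solve_l0(mid)
        if k == target_k:
            return mid
        if k > target_k:                # too dense -> increase the penalty
            lo = mid
        else:                           # too sparse -> decrease the penalty
            hi = mid
    return None                         # desired sparsity level not found

# Mock solver: support size shrinks as the penalty grows (illustration only).
mock = lambda lam: 10 if lam < 1.0 else (5 if lam < 100.0 else 1)
```

With this mock, sparsity 5 is found quickly, while sparsity 7 is never attained for any λ0, mirroring the cases in our experiments where l0bnb returns a support size different from the required one.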
Nonetheless, we provide additional experiments to give a more complete view of our algorithm. We set the number of features to p = 3000 and the number of samples to n = 100, 200, 500. The feature correlation is set to ρ = 0.1, since this is already a very challenging case. The results are shown in Tables 151-153 in Appendix G.5. The results show that OKRidge is competitive with other MIP methods. l0bnb is the only method that can solve the problem within the time limit when n = 500; however, it has difficulty finding the desired sparsity level for n = 100 and n = 200, even with a large time limit of 7200 seconds. In some cases, MOSEK gives the best optimality gap; however, as shown in the previous subsections, MOSEK is not scalable to high dimensions when n ≫ p. Additionally, the differential equation results show that MOSEK and l0bnb are not competitive in discovering differential equations (datasets with n ≫ p), which is the main focus of this paper. In the future, we plan to extend OKRidge to a more scalable algorithm for the case p ≫ n.

Fast Solve vs. ADMM In the main paper, we proposed two ways to solve the saddle point problem. The first approach, fast solve, is based on an analytical formula. The second approach is based on the ADMM method. Intuitively, since the fast solve method only maximizes a lower bound on the saddle point objective, while ADMM maximizes the saddle point objective directly, we expect ADMM to give much tighter lower bounds than the fast solve approach. Here, we give empirical evidence to support this claim; the results can be seen in Figures 7 and 8.

G.2 Comparison with SOS1, Big-M, Perspective, and Eigen-Perspective Formulations with Different ℓ2 Regularizations

G.2.1 Synthetic 1 Benchmark with λ2 = 0.001 with No Warmstart

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 2.10878×10^6 | 1.68 | 0.00
100 | big-M50 | 2.10874×10^6 | 0.39 | 0.00
100 | persp. | 2.10878×10^6 | 0.08 | 0.00
100 | eig. persp. | 2.10878×10^6 | 0.01 | 0.00
100 | ours | 2.10878×10^6 | 0.29 | 0.00
500 | SOS1 | 2.12833×10^6 | 8.04 | 0.00
500 | big-M50 | 2.12828×10^6 | 36.80 | 0.00
500 | persp. | 2.12833×10^6 | 20.29 | 0.00
500 | eig. persp. | 2.12832×10^6 | 0.69 | 0.00
500 | ours | 2.12833×10^6 | 0.37 | 0.00
1000 | SOS1 | 2.11005×10^6 | 82.53 | 0.00
1000 | big-M50 | 2.11005×10^6 | 1911.41 | 0.00
1000 | persp. | 2.11005×10^6 | 1143.11 | 0.00
1000 | eig. persp. | 2.11005×10^6 | 4.11 | 0.00
1000 | ours | 2.11005×10^6 | 0.58 | 0.00
3000 | SOS1 | 2.10230×10^6 | 3600.15 | 0.59
3000 | big-M50 | 7.83869×10^5 | 3601.20 | 169.78
3000 | persp. | 2.10231×10^6 | 2288.93 | 0.00
3000 | eig. persp. | 2.10230×10^6 | 102.18 | 0.00
3000 | ours | 2.10230×10^6 | 1.73 | 0.00
5000 | SOS1 | 1.98939×10^6 | 3600.13 | 6.43
5000 | big-M50 | 1.87965×10^6 | 3603.25 | 12.65
5000 | persp. | 2.09637×10^6 | 3600.62 | 1.00
5000 | eig. persp. | - | 3880.65 | -
5000 | ours | 2.09635×10^6 | 4.22 | 0.00
Table 1: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.001).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 5.28726×10^6 | 3.37 | 0.00
100 | big-M50 | 5.28725×10^6 | 0.42 | 0.00
100 | persp. | 5.28726×10^6 | 0.07 | 0.00
100 | eig. persp. | 5.28726×10^6 | 0.02 | 0.00
100 | ours | 5.28726×10^6 | 0.29 | 0.00
500 | SOS1 | 5.31852×10^6 | 9.97 | 0.00
500 | big-M50 | 5.31819×10^6 | 89.04 | 0.00
500 | persp. | 5.31852×10^6 | 8.44 | 0.00
500 | eig. persp. | 5.31852×10^6 | 0.99 | 0.00
500 | ours | 5.31852×10^6 | 0.41 | 0.00
1000 | SOS1 | 5.26974×10^6 | 88.42 | 0.00
1000 | big-M50 | 5.26974×10^6 | 1528.98 | 0.00
1000 | persp. | 5.26974×10^6 | 54.96 | 0.00
1000 | eig. persp. | 5.26974×10^6 | 300.72 | 0.00
1000 | ours | 5.26974×10^6 | 0.52 | 0.00
3000 | SOS1 | 5.16793×10^6 | 3600.38 | 2.71
3000 | big-M50 | 3.72978×10^6 | 3602.55 | 42.31
3000 | persp. | 5.27668×10^6 | 3600.33 | 0.59
3000 | eig. persp. | 5.27668×10^6 | 100.58 | 0.00
3000 | ours | 5.27668×10^6 | 1.75 | 0.00
5000 | SOS1 | 5.16088×10^6 | 3600.14 | 3.16
5000 | big-M50 | 4.16126×10^6 | 3602.53 | 27.95
5000 | persp. | 5.27126×10^6 | 3600.72 | 1.00
5000 | eig. persp. | - | 3890.12 | -
5000 | ours | 5.27126×10^6 | 4.32 | 0.00
Table 2: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.001).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 1.10084×10^7 | 2.97 | 0.00
100 | big-M50 | 1.10084×10^7 | 0.41 | 0.00
100 | persp. | 1.10084×10^7 | 0.09 | 0.00
100 | eig. persp. | 1.10084×10^7 | 0.11 | 0.00
100 | ours | 1.10084×10^7 | 0.29 | 0.00
500 | SOS1 | 1.10518×10^7 | 9.65 | 0.00
500 | big-M50 | 1.10515×10^7 | 89.59 | 0.00
500 | persp. | 1.10518×10^7 | 3600.01 | 0.08
500 | eig. persp. | 1.10518×10^7 | 25.13 | 0.00
500 | ours | 1.10518×10^7 | 0.40 | 0.00
1000 | SOS1 | 1.09508×10^7 | 92.16 | 0.00
1000 | big-M50 | 1.09506×10^7 | 1757.09 | 0.00
1000 | persp. | 1.09508×10^7 | 3600.04 | 0.20
1000 | eig. persp. | 1.09508×10^7 | 226.82 | 0.00
1000 | ours | 1.09508×10^7 | 0.59 | 0.00
3000 | SOS1 | 1.10012×10^7 | 3600.10 | 0.52
3000 | big-M50 | - | 3615.51 | -
3000 | persp. | 1.10012×10^7 | 3600.91 | 0.59
3000 | eig. persp. | 1.10012×10^7 | 100.85 | 0.00
3000 | ours | 1.10012×10^7 | 1.74 | 0.00
5000 | SOS1 | 1.08838×10^7 | 3600.13 | 2.05
5000 | big-M50 | - | 3612.62 | -
5000 | persp. | 1.09966×10^7 | 3600.17 | 1.00
5000 | eig. persp. | - | 3732.98 | -
5000 | ours | 1.09966×10^7 | 4.00 | 0.00
Table 3: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.001).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 2.43575×10^7 | 2.83 | 0.00
100 | big-M50 | 2.43573×10^7 | 0.46 | 0.00
100 | persp. | 2.43575×10^7 | 0.21 | 0.00
100 | eig. persp. | 2.43575×10^7 | 0.02 | 0.00
100 | ours | 2.43575×10^7 | 0.32 | 0.00
500 | SOS1 | 2.44185×10^7 | 10.25 | 0.00
500 | big-M50 | 2.44178×10^7 | 92.53 | 0.00
500 | persp. | 2.44185×10^7 | 114.80 | 0.00
500 | eig. persp. | 2.44185×10^7 | 26.49 | 0.00
500 | ours | 2.44185×10^7 | 0.46 | 0.00
1000 | SOS1 | 2.41989×10^7 | 88.38 | 0.00
1000 | big-M50 | 2.41987×10^7 | 1361.05 | 0.00
1000 | persp. | 2.41989×10^7 | 3600.08 | 0.20
1000 | eig. persp. | 2.41989×10^7 | 232.96 | 0.00
1000 | ours | 2.41989×10^7 | 0.52 | 0.00
3000 | SOS1 | 2.43713×10^7 | 3600.47 | 0.59
3000 | big-M50 | - | 3640.12 | -
3000 | persp. | 2.43713×10^7 | 3600.70 | 0.59
3000 | eig. persp. | - | 3606.15 | -
3000 | ours | 2.43713×10^7 | 1.77 | 0.00
5000 | SOS1 | 2.42551×10^7 | 3600.13 | 1.48
5000 | big-M50 | 2.43694×10^7 | 3643.28 | 1.00
5000 | persp. | 2.43694×10^7 | 3604.07 | 1.00
5000 | eig. persp. | - | 3711.01 | -
5000 | ours | 2.43694×10^7 | 4.15 | 0.00
Table 4: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.001).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 9.11026×10^7 | 0.05 | 0.00
100 | big-M50 | 9.11026×10^7 | 0.13 | 0.00
100 | persp. | 9.11026×10^7 | 0.27 | 0.00
100 | eig. persp. | 9.11019×10^7 | 1.63 | 0.00
100 | ours | 9.11026×10^7 | 0.30 | 0.00
500 | SOS1 | 9.12047×10^7 | 21.43 | 0.00
500 | big-M50 | 9.12047×10^7 | 94.03 | 0.00
500 | persp. | 9.12047×10^7 | 176.04 | 0.00
500 | eig. persp. | 9.12047×10^7 | 85.92 | 0.00
500 | ours | 9.12047×10^7 | 0.38 | 0.00
1000 | SOS1 | 9.04075×10^7 | 1085.91 | 0.00
1000 | big-M50 | 9.04075×10^7 | 3600.01 | 0.07
1000 | persp. | 9.04075×10^7 | 3600.01 | 0.20
1000 | eig. persp. | 9.04075×10^7 | 341.28 | 0.00
1000 | ours | 9.04075×10^7 | 0.52 | 0.00
3000 | SOS1 | 9.09247×10^7 | 3600.05 | 0.98
3000 | big-M50 | - | 3605.36 | -
3000 | persp. | 9.12761×10^7 | 3600.05 | 0.59
3000 | eig. persp. | - | 3600.41 | -
3000 | ours | 9.12761×10^7 | 20.74 | 0.00
5000 | SOS1 | 9.00628×10^7 | 3600.16 | 2.38
5000 | big-M50 | - | 3607.68 | -
5000 | persp. | 9.12274×10^7 | 3600.13 | 1.07
5000 | eig. persp. | - | 3810.32 | -
5000 | ours | 9.12916×10^7 | 58.29 | 0.00
Table 5: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.001).

G.2.2 Synthetic 1 Benchmark with λ2 = 0.001 with Warmstart

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 2.10878×10^6 | 0.05 | 0.00
100 | big-M50 + warm start | 2.10878×10^6 | 0.08 | 0.00
100 | persp. + warm start | 2.10878×10^6 | 0.08 | 0.00
100 | eig. persp. + warm start | 2.10878×10^6 | 0.01 | 0.00
100 | ours | 2.10878×10^6 | 0.29 | 0.00
500 | SOS1 + warm start | 2.12833×10^6 | 6.31 | 0.00
500 | big-M50 + warm start | 2.12833×10^6 | 37.62 | 0.00
500 | persp. + warm start | 2.12833×10^6 | 19.60 | 0.00
500 | eig. persp. + warm start | 2.12833×10^6 | 0.61 | 0.00
500 | ours | 2.12833×10^6 | 0.37 | 0.00
1000 | SOS1 + warm start | 2.11005×10^6 | 77.82 | 0.00
1000 | big-M50 + warm start | 2.11005×10^6 | 28.51 | 0.00
1000 | persp. + warm start | 2.11005×10^6 | 1148.21 | 0.00
1000 | eig. persp. + warm start | 2.11005×10^6 | 4.08 | 0.00
1000 | ours | 2.11005×10^6 | 0.58 | 0.00
3000 | SOS1 + warm start | 2.10230×10^6 | 2100.58 | 0.00
3000 | big-M50 + warm start | 2.10230×10^6 | 407.92 | 0.00
3000 | persp. + warm start | 2.10231×10^6 | 2275.23 | 0.00
3000 | eig. persp. + warm start | 2.10230×10^6 | 102.19 | 0.00
3000 | ours | 2.10230×10^6 | 1.73 | 0.00
5000 | SOS1 + warm start | 2.09635×10^6 | 3601.23 | 0.95
5000 | big-M50 + warm start | 2.09635×10^6 | 1380.38 | 0.00
5000 | persp. + warm start | 2.09637×10^6 | 3600.68 | 1.00
5000 | eig. persp. + warm start | 2.09635×10^6 | 483.38 | 0.00
5000 | ours | 2.09635×10^6 | 4.22 | 0.00
Table 6: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.001). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 5.28726×10^6 | 0.05 | 0.00
100 | big-M50 + warm start | 5.28726×10^6 | 0.26 | 0.00
100 | persp. + warm start | 5.28726×10^6 | 0.07 | 0.00
100 | eig. persp. + warm start | 5.28726×10^6 | 0.02 | 0.00
100 | ours | 5.28726×10^6 | 0.29 | 0.00
500 | SOS1 + warm start | 5.31852×10^6 | 9.61 | 0.00
500 | big-M50 + warm start | 5.31852×10^6 | 3.00 | 0.00
500 | persp. + warm start | 5.31852×10^6 | 8.06 | 0.00
500 | eig. persp. + warm start | 5.31852×10^6 | 0.63 | 0.00
500 | ours | 5.31852×10^6 | 0.41 | 0.00
1000 | SOS1 + warm start | 5.26974×10^6 | 88.36 | 0.00
1000 | big-M50 + warm start | 5.26974×10^6 | 13.79 | 0.00
1000 | persp. + warm start | 5.26974×10^6 | 54.26 | 0.00
1000 | eig. persp. + warm start | 5.26974×10^6 | 4.18 | 0.00
1000 | ours | 5.26974×10^6 | 0.52 | 0.00
3000 | SOS1 + warm start | 5.27668×10^6 | 2740.42 | 0.00
3000 | big-M50 + warm start | 5.27668×10^6 | 313.22 | 0.00
3000 | persp. + warm start | 5.27668×10^6 | 144.09 | 0.00
3000 | eig. persp. + warm start | 5.27668×10^6 | 101.95 | 0.00
3000 | ours | 5.27668×10^6 | 1.75 | 0.00
5000 | SOS1 + warm start | 5.27126×10^6 | 3600.51 | 1.00
5000 | big-M50 + warm start | 5.27126×10^6 | 1705.37 | 0.00
5000 | persp. + warm start | 5.27126×10^6 | 543.39 | 0.00
5000 | eig. persp. + warm start | 5.27126×10^6 | 485.74 | 0.00
5000 | ours | 5.27126×10^6 | 4.32 | 0.00
Table 7: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.001). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 1.10084×10^7 | 0.05 | 0.00
100 | big-M50 + warm start | 1.10084×10^7 | 0.06 | 0.00
100 | persp. + warm start | 1.10084×10^7 | 0.09 | 0.00
100 | eig. persp. + warm start | 1.10084×10^7 | 0.02 | 0.00
100 | ours | 1.10084×10^7 | 0.29 | 0.00
500 | SOS1 + warm start | 1.10518×10^7 | 7.68 | 0.00
500 | big-M50 + warm start | 1.10518×10^7 | 11.59 | 0.00
500 | persp. + warm start | 1.10518×10^7 | 3600.02 | 0.08
500 | eig. persp. + warm start | 1.10518×10^7 | 0.63 | 0.00
500 | ours | 1.10518×10^7 | 0.40 | 0.00
1000 | SOS1 + warm start | 1.09508×10^7 | 90.16 | 0.00
1000 | big-M50 + warm start | 1.09508×10^7 | 21.16 | 0.00
1000 | persp. + warm start | 1.09508×10^7 | 9.85 | 0.00
1000 | eig. persp. + warm start | 1.09508×10^7 | 4.04 | 0.00
1000 | ours | 1.09508×10^7 | 0.59 | 0.00
3000 | SOS1 + warm start | 1.10012×10^7 | 3044.33 | 0.00
3000 | big-M50 + warm start | 1.10012×10^7 | 243.42 | 0.00
3000 | persp. + warm start | 1.10012×10^7 | 141.68 | 0.00
3000 | eig. persp. + warm start | 1.10012×10^7 | 100.50 | 0.00
3000 | ours | 1.10012×10^7 | 1.74 | 0.00
5000 | SOS1 + warm start | 1.09966×10^7 | 3603.69 | 0.99
5000 | big-M50 + warm start | 1.09966×10^7 | 3600.95 | 1.00
5000 | persp. + warm start | 1.09966×10^7 | 3603.33 | 1.00
5000 | eig. persp. + warm start | 1.09966×10^7 | 445.26 | 0.00
5000 | ours | 1.09966×10^7 | 4.00 | 0.00
Table 8: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.001). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 2.43575×10^7 | 0.05 | 0.00
100 | big-M50 + warm start | 2.43575×10^7 | 0.05 | 0.00
100 | persp. + warm start | 2.43575×10^7 | 0.19 | 0.00
100 | eig. persp. + warm start | 2.43575×10^7 | 0.02 | 0.00
100 | ours | 2.43575×10^7 | 0.32 | 0.00
500 | SOS1 + warm start | 2.44185×10^7 | 8.94 | 0.00
500 | big-M50 + warm start | 2.44185×10^7 | 4.02 | 0.00
500 | persp. + warm start | 2.44185×10^7 | 271.52 | 0.00
500 | eig. persp. + warm start | 2.44185×10^7 | 0.60 | 0.00
500 | ours | 2.44185×10^7 | 0.46 | 0.00
1000 | SOS1 + warm start | 2.41989×10^7 | 83.96 | 0.00
1000 | big-M50 + warm start | 2.41989×10^7 | 11.63 | 0.00
1000 | persp. + warm start | 2.41989×10^7 | 3600.13 | 0.20
1000 | eig. persp. + warm start | 2.41989×10^7 | 4.02 | 0.00
1000 | ours | 2.41989×10^7 | 0.52 | 0.00
3000 | SOS1 + warm start | 2.43713×10^7 | 3601.04 | 0.59
3000 | big-M50 + warm start | 2.43713×10^7 | 3608.31 | 0.59
3000 | persp. + warm start | 2.43713×10^7 | 3600.34 | 0.59
3000 | eig. persp. + warm start | 2.43713×10^7 | 110.94 | 0.00
3000 | ours | 2.43713×10^7 | 1.77 | 0.00
5000 | SOS1 + warm start | 2.43694×10^7 | 3600.18 | 1.00
5000 | big-M50 + warm start | 2.43694×10^7 | 3670.94 | 1.00
5000 | persp. + warm start | 2.43694×10^7 | 3600.18 | 1.00
5000 | eig. persp. + warm start | 2.43694×10^7 | 442.81 | 0.00
5000 | ours | 2.43694×10^7 | 4.15 | 0.00
Table 9: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.001). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 9.11026×10^7 | 0.04 | 0.00
100 | big-M50 + warm start | 9.11026×10^7 | 0.03 | 0.00
100 | persp. + warm start | 9.11026×10^7 | 0.26 | 0.00
100 | eig. persp. + warm start | 9.11026×10^7 | 0.02 | 0.00
100 | ours | 9.11026×10^7 | 0.30 | 0.00
500 | SOS1 + warm start | 9.12047×10^7 | 48.24 | 0.00
500 | big-M50 + warm start | 9.12047×10^7 | 34.86 | 0.00
500 | persp. + warm start | 9.12047×10^7 | 3600.02 | 0.09
500 | eig. persp. + warm start | 9.12047×10^7 | 0.59 | 0.00
500 | ours | 9.12047×10^7 | 0.38 | 0.00
1000 | SOS1 + warm start | 9.04075×10^7 | 1090.45 | 0.00
1000 | big-M50 + warm start | 9.04075×10^7 | 3600.01 | 0.08
1000 | persp. + warm start | 9.04075×10^7 | 3600.08 | 0.20
1000 | eig. persp. + warm start | 9.04075×10^7 | 4.15 | 0.00
1000 | ours | 9.04075×10^7 | 0.52 | 0.00
3000 | SOS1 + warm start | 9.12761×10^7 | 3600.07 | 0.59
3000 | big-M50 + warm start | 9.12761×10^7 | 3612.88 | 0.59
3000 | persp. + warm start | 9.12761×10^7 | 3600.07 | 0.59
3000 | eig. persp. + warm start | 9.12761×10^7 | 100.26 | 0.00
3000 | ours | 9.12761×10^7 | 20.74 | 0.00
5000 | SOS1 + warm start | 9.12916×10^7 | 3601.57 | 1.00
5000 | big-M50 + warm start | 9.12916×10^7 | 3622.52 | 1.00
5000 | persp. + warm start | 9.12916×10^7 | 3602.87 | 1.00
5000 | eig. persp. + warm start | 9.12916×10^7 | 439.82 | 0.00
5000 | ours | 9.12916×10^7 | 58.29 | 0.00
Table 10: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.001). All baselines use our beam search solution as a warm start.

G.2.3 Synthetic 1 Benchmark with λ2 = 0.1 with No Warmstart

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 2.10878×10^6 | 3.04 | 0.00
100 | big-M50 | 2.10878×10^6 | 0.86 | 0.00
100 | persp. | 2.10878×10^6 | 0.07 | 0.00
100 | eig. persp. | 2.10878×10^6 | 0.02 | 0.00
100 | ours | 2.10878×10^6 | 0.36 | 0.00
500 | SOS1 | 2.12833×10^6 | 8.13 | 0.00
500 | big-M50 | 2.12833×10^6 | 37.96 | 0.00
500 | persp. | 2.12833×10^6 | 3.87 | 0.00
500 | eig. persp. | 2.12833×10^6 | 0.63 | 0.00
500 | ours | 2.12833×10^6 | 0.37 | 0.00
1000 | SOS1 | 2.11005×10^6 | 81.40 | 0.00
1000 | big-M50 | 2.11005×10^6 | 1007.27 | 0.00
1000 | persp. | 2.11005×10^6 | 1195.64 | 0.00
1000 | eig. persp. | 2.11005×10^6 | 4.07 | 0.00
1000 | ours | 2.11005×10^6 | 0.57 | 0.00
3000 | SOS1 | 2.10230×10^6 | 3601.25 | 0.59
3000 | big-M50 | 7.99504×10^5 | 3608.36 | 164.51
3000 | persp. | 2.10230×10^6 | 2055.12 | 0.00
3000 | eig. persp. | 2.10230×10^6 | 103.29 | 0.00
3000 | ours | 2.10230×10^6 | 1.79 | 0.00
5000 | SOS1 | 1.98939×10^6 | 3600.15 | 6.43
5000 | big-M50 | 9.69753×10^5 | 3619.50 | 118.35
5000 | persp. | 2.09635×10^6 | 3600.72 | 1.00
5000 | eig. persp. | 2.09635×10^6 | 584.03 | 0.00
5000 | ours | 2.09635×10^6 | 4.20 | 0.00
Table 11: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.1).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 5.28726×10^6 | 3.37 | 0.00
100 | big-M50 | 4.60269×10^6 | 1.83 | 0.00
100 | persp. | 5.28726×10^6 | 0.07 | 0.00
100 | eig. persp. | 5.28726×10^6 | 0.11 | 0.00
100 | ours | 5.28726×10^6 | 0.27 | 0.00
500 | SOS1 | 5.31852×10^6 | 10.00 | 0.00
500 | big-M50 | 5.31827×10^6 | 85.52 | 0.00
500 | persp. | 5.31852×10^6 | 8.41 | 0.00
500 | eig. persp. | 5.31852×10^6 | 1.07 | 0.00
500 | ours | 5.31852×10^6 | 0.41 | 0.00
1000 | SOS1 | 5.26974×10^6 | 88.50 | 0.00
1000 | big-M50 | 5.26974×10^6 | 1576.75 | 0.00
1000 | persp. | 5.26974×10^6 | 91.53 | 0.00
1000 | eig. persp. | 5.26974×10^6 | 4.11 | 0.00
1000 | ours | 5.26974×10^6 | 0.52 | 0.00
3000 | SOS1 | 5.16793×10^6 | 3600.25 | 2.71
3000 | big-M50 | 3.73049×10^6 | 3603.09 | 42.28
3000 | persp. | 5.27668×10^6 | 3600.09 | 0.59
3000 | eig. persp. | 5.27668×10^6 | 103.60 | 0.00
3000 | ours | 5.27668×10^6 | 1.84 | 0.00
5000 | SOS1 | 5.16087×10^6 | 3600.13 | 3.16
5000 | big-M50 | 3.98069×10^6 | 3606.79 | 33.75
5000 | persp. | 5.27126×10^6 | 3601.72 | 1.00
5000 | eig. persp. | 5.27126×10^6 | 611.45 | 0.00
5000 | ours | 5.27126×10^6 | 4.42 | 0.00
Table 12: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.1).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 1.10084×10^7 | 2.97 | 0.00
100 | big-M50 | 1.10084×10^7 | 0.44 | 0.00
100 | persp. | 1.10084×10^7 | 0.96 | 0.00
100 | eig. persp. | 1.10084×10^7 | 0.13 | 0.00
100 | ours | 1.10084×10^7 | 0.32 | 0.00
500 | SOS1 | 1.10518×10^7 | 9.66 | 0.00
500 | big-M50 | 1.10515×10^7 | 86.52 | 0.00
500 | persp. | 1.10518×10^7 | 3600.03 | 0.08
500 | eig. persp. | 1.10518×10^7 | 0.62 | 0.00
500 | ours | 1.10518×10^7 | 0.40 | 0.00
1000 | SOS1 | 1.09508×10^7 | 87.74 | 0.00
1000 | big-M50 | 1.09506×10^7 | 1539.25 | 0.00
1000 | persp. | 1.09508×10^7 | 3600.12 | 0.19
1000 | eig. persp. | 1.09508×10^7 | 335.72 | 0.00
1000 | ours | 1.09508×10^7 | 0.53 | 0.00
3000 | SOS1 | 1.10012×10^7 | 3600.36 | 0.51
3000 | big-M50 | - | 3608.09 | -
3000 | persp. | 1.10012×10^7 | 3600.08 | 0.59
3000 | eig. persp. | 1.10012×10^7 | 103.03 | 0.00
3000 | ours | 1.10012×10^7 | 1.67 | 0.00
5000 | SOS1 | 1.08838×10^7 | 3602.31 | 2.05
5000 | big-M50 | - | 3649.14 | -
5000 | persp. | 1.09966×10^7 | 3601.44 | 1.00
5000 | eig. persp. | - | 3758.62 | -
5000 | ours | 1.09966×10^7 | 4.13 | 0.00
Table 13: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.1).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 2.43575×10^7 | 0.05 | 0.00
100 | big-M50 | 2.43573×10^7 | 0.41 | 0.00
100 | persp. | 2.43575×10^7 | 0.43 | 0.00
100 | eig. persp. | 2.43575×10^7 | 0.13 | 0.00
100 | ours | 2.43575×10^7 | 0.34 | 0.00
500 | SOS1 | 2.44185×10^7 | 9.10 | 0.00
500 | big-M50 | 2.44167×10^7 | 96.59 | 0.00
500 | persp. | 2.44185×10^7 | 148.21 | 0.00
500 | eig. persp. | 2.44185×10^7 | 23.81 | 0.00
500 | ours | 2.44185×10^7 | 0.44 | 0.00
1000 | SOS1 | 2.41989×10^7 | 89.99 | 0.00
1000 | big-M50 | 2.41986×10^7 | 1608.58 | 0.00
1000 | persp. | 2.41989×10^7 | 3600.10 | 0.19
1000 | eig. persp. | 2.41989×10^7 | 4.42 | 0.00
1000 | ours | 2.41989×10^7 | 0.53 | 0.00
3000 | SOS1 | 2.42629×10^7 | 3601.14 | 1.04
3000 | big-M50 | - | 3605.20 | -
3000 | persp. | 2.43713×10^7 | 3601.16 | 0.59
3000 | eig. persp. | - | 3658.27 | -
3000 | ours | 2.43713×10^7 | 1.75 | 0.00
5000 | SOS1 | 2.42551×10^7 | 3600.14 | 1.48
5000 | big-M50 | 2.43694×10^7 | 3650.52 | 1.00
5000 | persp. | 2.43694×10^7 | 3606.01 | 1.00
5000 | eig. persp. | - | 3750.44 | -
5000 | ours | 2.43694×10^7 | 4.22 | 0.00
Table 14: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.1).

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 9.11026×10^7 | 0.05 | 0.00
100 | big-M50 | 9.11026×10^7 | 0.18 | 0.00
100 | persp. | 9.11026×10^7 | 0.30 | 0.00
100 | eig. persp. | 9.11026×10^7 | 0.13 | 0.00
100 | ours | 9.11026×10^7 | 0.31 | 0.00
500 | SOS1 | 9.12047×10^7 | 22.15 | 0.00
500 | big-M50 | 9.12047×10^7 | 398.02 | 0.00
500 | persp. | 9.12047×10^7 | 3600.01 | 0.08
500 | eig. persp. | 9.12046×10^7 | 84.16 | 0.00
500 | ours | 9.12047×10^7 | 0.38 | 0.00
1000 | SOS1 | 9.04075×10^7 | 1090.30 | 0.00
1000 | big-M50 | 9.04075×10^7 | 3600.05 | 0.01
1000 | persp. | 9.04075×10^7 | 3600.08 | 0.20
1000 | eig. persp. | 9.04074×10^7 | 881.89 | 0.00
1000 | ours | 9.04075×10^7 | 0.52 | 0.00
3000 | SOS1 | 9.11671×10^7 | 3600.55 | 0.71
3000 | big-M50 | - | 3619.63 | -
3000 | persp. | 9.12761×10^7 | 3600.05 | 0.59
3000 | eig. persp. | - | 3600.21 | -
3000 | ours | 9.12761×10^7 | 20.88 | 0.00
5000 | SOS1 | 9.02495×10^7 | 3602.25 | 2.17
5000 | big-M50 | - | 3620.37 | -
5000 | persp. | 9.12900×10^7 | 3600.18 | 1.00
5000 | eig. persp. | - | 4028.26 | -
5000 | ours | 9.12916×10^7 | 58.07 | 0.00
Table 15: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.1).
G.2.4 Synthetic 1 Benchmark with λ2 = 0.1 with Warmstart

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 2.10878×10^6 | 0.05 | 0.00
100 | big-M50 + warm start | 2.10878×10^6 | 0.08 | 0.00
100 | persp. + warm start | 2.10878×10^6 | 0.07 | 0.00
100 | eig. persp. + warm start | 2.10878×10^6 | 0.02 | 0.00
100 | ours | 2.10878×10^6 | 0.36 | 0.00
500 | SOS1 + warm start | 2.12833×10^6 | 5.04 | 0.00
500 | big-M50 + warm start | 2.12833×10^6 | 3.78 | 0.00
500 | persp. + warm start | 2.12833×10^6 | 3.84 | 0.00
500 | eig. persp. + warm start | 2.12833×10^6 | 0.60 | 0.00
500 | ours | 2.12833×10^6 | 0.37 | 0.00
1000 | SOS1 + warm start | 2.11005×10^6 | 74.29 | 0.00
1000 | big-M50 + warm start | 2.11005×10^6 | 26.94 | 0.00
1000 | persp. + warm start | 2.11005×10^6 | 1175.58 | 0.00
1000 | eig. persp. + warm start | 2.11005×10^6 | 4.18 | 0.00
1000 | ours | 2.11005×10^6 | 0.57 | 0.00
3000 | SOS1 + warm start | 2.10230×10^6 | 2144.33 | 0.00
3000 | big-M50 + warm start | 2.10230×10^6 | 193.85 | 0.00
3000 | persp. + warm start | 2.10230×10^6 | 2108.24 | 0.00
3000 | eig. persp. + warm start | 2.10230×10^6 | 110.85 | 0.00
3000 | ours | 2.10230×10^6 | 1.79 | 0.00
5000 | SOS1 + warm start | 2.09635×10^6 | 3600.66 | 0.95
5000 | big-M50 + warm start | 2.09635×10^6 | 1366.14 | 0.00
5000 | persp. + warm start | 2.09635×10^6 | 3601.28 | 1.00
5000 | eig. persp. + warm start | 2.09635×10^6 | 460.24 | 0.00
5000 | ours | 2.09635×10^6 | 4.20 | 0.00
Table 16: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 5.28726×10^6 | 0.05 | 0.00
100 | big-M50 + warm start | 5.28726×10^6 | 0.07 | 0.00
100 | persp. + warm start | 5.28726×10^6 | 0.07 | 0.00
100 | eig. persp. + warm start | 5.28726×10^6 | 0.02 | 0.00
100 | ours | 5.28726×10^6 | 0.27 | 0.00
500 | SOS1 + warm start | 5.31852×10^6 | 9.77 | 0.00
500 | big-M50 + warm start | 5.31852×10^6 | 39.88 | 0.00
500 | persp. + warm start | 5.31852×10^6 | 8.26 | 0.00
500 | eig. persp. + warm start | 5.31852×10^6 | 0.63 | 0.00
500 | ours | 5.31852×10^6 | 0.41 | 0.00
1000 | SOS1 + warm start | 5.26974×10^6 | 89.01 | 0.00
1000 | big-M50 + warm start | 5.26974×10^6 | 14.33 | 0.00
1000 | persp. + warm start | 5.26974×10^6 | 91.61 | 0.00
1000 | eig. persp. + warm start | 5.26974×10^6 | 4.09 | 0.00
1000 | ours | 5.26974×10^6 | 0.52 | 0.00
3000 | SOS1 + warm start | 5.27668×10^6 | 2743.09 | 0.00
3000 | big-M50 + warm start | 5.27668×10^6 | 317.41 | 0.00
3000 | persp. + warm start | 5.27668×10^6 | 139.45 | 0.00
3000 | eig. persp. + warm start | 5.27668×10^6 | 111.91 | 0.00
3000 | ours | 5.27668×10^6 | 1.84 | 0.00
5000 | SOS1 + warm start | 5.27126×10^6 | 3600.80 | 1.00
5000 | big-M50 + warm start | 5.27126×10^6 | 1658.42 | 0.00
5000 | persp. + warm start | 5.27126×10^6 | 546.50 | 0.00
5000 | eig. persp. + warm start | 5.27126×10^6 | 454.67 | 0.00
5000 | ours | 5.27126×10^6 | 4.42 | 0.00
Table 17: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 1.10084×10^7 | 0.05 | 0.00
100 | big-M50 + warm start | 1.10084×10^7 | 0.06 | 0.00
100 | persp. + warm start | 1.10084×10^7 | 0.90 | 0.00
100 | eig. persp. + warm start | 1.10084×10^7 | 0.02 | 0.00
100 | ours | 1.10084×10^7 | 0.32 | 0.00
500 | SOS1 + warm start | 1.10518×10^7 | 9.26 | 0.00
500 | big-M50 + warm start | 1.10518×10^7 | 2.28 | 0.00
500 | persp. + warm start | 1.10518×10^7 | 3600.02 | 0.08
500 | eig. persp. + warm start | 1.10518×10^7 | 0.64 | 0.00
500 | ours | 1.10518×10^7 | 0.40 | 0.00
1000 | SOS1 + warm start | 1.09508×10^7 | 89.44 | 0.00
1000 | big-M50 + warm start | 1.09508×10^7 | 14.11 | 0.00
1000 | persp. + warm start | 1.09508×10^7 | 10.09 | 0.00
1000 | eig. persp. + warm start | 1.09508×10^7 | 4.05 | 0.00
1000 | ours | 1.09508×10^7 | 0.53 | 0.00
3000 | SOS1 + warm start | 1.10012×10^7 | 3046.03 | 0.00
3000 | big-M50 + warm start | 1.10012×10^7 | 222.56 | 0.00
3000 | persp. + warm start | 1.10012×10^7 | 141.64 | 0.00
3000 | eig. persp. + warm start | 1.10012×10^7 | 110.14 | 0.00
3000 | ours | 1.10012×10^7 | 1.67 | 0.00
5000 | SOS1 + warm start | 1.09966×10^7 | 3600.73 | 0.99
5000 | big-M50 + warm start | 1.09966×10^7 | 3606.43 | 1.00
5000 | persp. + warm start | 1.09966×10^7 | 3600.38 | 1.00
5000 | eig. persp. + warm start | 1.09966×10^7 | 455.87 | 0.00
5000 | ours | 1.09966×10^7 | 4.13 | 0.00
Table 18: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 2.43575×10^7 | 0.05 | 0.00
100 | big-M50 + warm start | 2.43575×10^7 | 0.06 | 0.00
100 | persp. + warm start | 2.43575×10^7 | 0.43 | 0.00
100 | eig. persp. + warm start | 2.43575×10^7 | 0.02 | 0.00
100 | ours | 2.43575×10^7 | 0.34 | 0.00
500 | SOS1 + warm start | 2.44185×10^7 | 9.20 | 0.00
500 | big-M50 + warm start | 2.44185×10^7 | 4.00 | 0.00
500 | persp. + warm start | 2.44185×10^7 | 146.41 | 0.00
500 | eig. persp. + warm start | 2.44185×10^7 | 0.62 | 0.00
500 | ours | 2.44185×10^7 | 0.44 | 0.00
1000 | SOS1 + warm start | 2.41989×10^7 | 88.02 | 0.00
1000 | big-M50 + warm start | 2.41989×10^7 | 12.62 | 0.00
1000 | persp. + warm start | 2.41989×10^7 | 1117.62 | 0.00
1000 | eig. persp. + warm start | 2.41989×10^7 | 4.05 | 0.00
1000 | ours | 2.41989×10^7 | 0.53 | 0.00
3000 | SOS1 + warm start | 2.43713×10^7 | 3600.80 | 0.59
3000 | big-M50 + warm start | 2.43713×10^7 | 3600.43 | 0.59
3000 | persp. + warm start | 2.43713×10^7 | 3600.68 | 0.59
3000 | eig. persp. + warm start | 2.43713×10^7 | 101.46 | 0.00
3000 | ours | 2.43713×10^7 | 1.75 | 0.00
5000 | SOS1 + warm start | 2.43694×10^7 | 3600.86 | 1.00
5000 | big-M50 + warm start | 2.43694×10^7 | 3616.71 | 1.00
5000 | persp. + warm start | 2.43694×10^7 | 3603.18 | 1.00
5000 | eig. persp. + warm start | 2.43694×10^7 | 455.57 | 0.00
5000 | ours | 2.43694×10^7 | 4.22 | 0.00
Table 19: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 + warm start | 9.11026×10^7 | 0.04 | 0.00
100 | big-M50 + warm start | 9.11026×10^7 | 0.04 | 0.00
100 | persp. + warm start | 9.11026×10^7 | 0.29 | 0.00
100 | eig. persp. + warm start | 9.11026×10^7 | 0.02 | 0.00
100 | ours | 9.11026×10^7 | 0.31 | 0.00
500 | SOS1 + warm start | 9.12047×10^7 | 52.72 | 0.00
500 | big-M50 + warm start | 9.12047×10^7 | 14.34 | 0.00
500 | persp. + warm start | 9.12047×10^7 | 166.73 | 0.00
500 | eig. persp. + warm start | 9.12047×10^7 | 0.60 | 0.00
500 | ours | 9.12047×10^7 | 0.38 | 0.00
1000 | SOS1 + warm start | 9.04075×10^7 | 1068.59 | 0.00
1000 | big-M50 + warm start | 9.04075×10^7 | 3600.16 | 0.01
1000 | persp. + warm start | 9.04075×10^7 | 3600.01 | 0.20
1000 | eig. persp. + warm start | 9.04075×10^7 | 4.17 | 0.00
1000 | ours | 9.04075×10^7 | 0.52 | 0.00
3000 | SOS1 + warm start | 9.12761×10^7 | 3600.39 | 0.59
3000 | big-M50 + warm start | 9.12761×10^7 | 3611.02 | 0.59
3000 | persp. + warm start | 9.12761×10^7 | 3600.23 | 0.59
3000 | eig. persp. + warm start | 9.12761×10^7 | 101.65 | 0.00
3000 | ours | 9.12761×10^7 | 20.88 | 0.00
5000 | SOS1 + warm start | 9.12916×10^7 | 3603.57 | 1.00
5000 | big-M50 + warm start | 9.12916×10^7 | 3606.64 | 1.00
5000 | persp. + warm start | 9.12916×10^7 | 3601.56 | 1.00
5000 | eig. persp. + warm start | 9.12916×10^7 | 454.63 | 0.00
5000 | ours | 9.12916×10^7 | 58.07 | 0.00
Table 20: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.1). All baselines use our beam search solution as a warm start.

G.2.5 Synthetic 1 Benchmark with λ2 = 10.0 with No Warmstart

# of features | method | upper bound | time(s) | gap(%)
100 | SOS1 | 2.10868×10^6 | 3.37 | 0.00
100 | big-M50 | 2.10868×10^6 | 0.28 | 0.00
100 | persp. | 2.10868×10^6 | 2.06 | 0.00
100 | eig. persp. | 2.10868×10^6 | 0.02 | 0.00
100 | ours | 2.10868×10^6 | 0.30 | 0.00
500 | SOS1 | 2.12823×10^6 | 6.58 | 0.00
500 | big-M50 | 2.12819×10^6 | 42.85 | 0.00
500 | persp. | 2.12823×10^6 | 143.68 | 0.00
500 | eig. persp. | 2.12823×10^6 | 0.65 | 0.00
500 | ours | 2.12823×10^6 | 0.38 | 0.00
1000 | SOS1 | 2.10995×10^6 | 70.57 | 0.00
1000 | big-M50 | 2.10995×10^6 | 1072.15 | 0.00
1000 | persp. | 2.10995×10^6 | 452.04 | 0.00
1000 | eig. persp. | 2.10995×10^6 | 4.22 | 0.00
1000 | ours | 2.10995×10^6 | 0.57 | 0.00
3000 | SOS1 | 1.87788×10^6 | 3600.81 | 12.60
3000 | big-M50 | 7.92671×10^5 | 3605.01 | 166.77
3000 | persp. | 2.10220×10^6 | 3600.04 | 0.55
3000 | eig. persp.
2.10220 106 102.51 0.00 3000 ours 2.10220 106 1.68 0.00 5000 SOS1 1.98929 106 3600.16 6.43 5000 big-M50 7.89596 105 3617.52 168.16 5000 persp. 2.09584 106 3601.36 1.00 5000 eig. persp. 2.09625 106 475.37 0.00 5000 ours 2.09625 106 5.09 0.00 Table 21: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 10.0). # of features method upper bound time(s) gap(%) 100 SOS1 5.28716 106 2.98 0.00 100 big-M50 5.28713 106 0.40 0.00 100 persp. 5.28716 106 0.77 0.00 100 eig. persp. 5.28716 106 0.10 0.00 100 ours 5.28716 106 0.30 0.00 500 SOS1 5.31842 106 10.06 0.00 500 big-M50 5.31819 106 89.34 0.00 500 persp. 5.31841 106 8.73 0.00 500 eig. persp. 5.31842 106 0.60 0.00 500 ours 5.31842 106 0.41 0.00 1000 SOS1 5.26964 106 87.65 0.00 1000 big-M50 5.26964 106 1869.66 0.00 1000 persp. 5.26964 106 97.81 0.00 1000 eig. persp. 5.26964 106 226.32 0.00 1000 ours 5.26964 106 0.51 0.00 3000 SOS1 5.27658 106 3600.46 0.55 3000 big-M50 3.96610 106 3600.32 33.83 3000 persp. 5.27658 106 3600.06 0.58 3000 eig. persp. - 3615.07 - 3000 ours 5.27658 106 1.76 0.00 5000 SOS1 5.16077 106 3603.93 3.16 5000 big-M50 3.96757 106 3607.78 34.19 5000 persp. 5.27116 106 3600.43 1.00 5000 eig. persp. 5.27116 106 614.83 0.00 5000 ours 5.27116 106 4.32 0.00 Table 22: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 10.0). # of features method upper bound time(s) gap(%) 100 SOS1 1.10083 107 2.83 0.00 100 big-M50 1.10083 107 0.44 0.00 100 persp. 1.10083 107 0.46 0.00 100 eig. persp. 1.10083 107 0.13 0.00 100 ours 1.10083 107 0.32 0.00 500 SOS1 1.10517 107 11.73 0.00 500 big-M50 1.10515 107 91.37 0.00 500 persp. 1.10517 107 3600.01 0.08 500 eig. persp. 1.10517 107 0.63 0.00 500 ours 1.10517 107 0.41 0.00 1000 SOS1 1.09507 107 90.27 0.00 1000 big-M50 1.09505 107 1534.37 0.00 1000 persp. 1.09507 107 3600.04 0.19 1000 eig. persp. 
1.09507 107 3.92 0.00 1000 ours 1.09507 107 0.55 0.00 3000 SOS1 1.10011 107 3600.92 0.51 3000 big-M50 - 3603.26 - 3000 persp. 1.10011 107 3601.24 0.58 3000 eig. persp. - 3646.60 - 3000 ours 1.10011 107 1.73 0.00 5000 SOS1 1.08825 107 3600.12 2.06 5000 big-M50 9.62941 106 3608.37 15.34 5000 persp. 1.09965 107 3602.42 1.00 5000 eig. persp. - 3600.21 - 5000 ours 1.09965 107 4.24 0.00 Table 23: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 10.0). # of features method upper bound time(s) gap(%) 100 SOS1 2.43574 107 0.05 0.00 100 big-M50 2.43574 107 0.42 0.00 100 persp. 2.43574 107 0.55 0.00 100 eig. persp. 2.43574 107 0.12 0.00 100 ours 2.43574 107 0.30 0.00 500 SOS1 2.44184 107 10.39 0.00 500 big-M50 2.44178 107 85.12 0.00 500 persp. 2.44184 107 248.24 0.00 500 eig. persp. 2.44184 107 0.61 0.00 500 ours 2.44184 107 0.44 0.00 1000 SOS1 2.41988 107 91.02 0.00 1000 big-M50 2.41986 107 1415.19 0.00 1000 persp. 2.41988 107 3600.04 0.16 1000 eig. persp. 2.41988 107 9.28 0.00 1000 ours 2.41988 107 0.52 0.00 3000 SOS1 2.43712 107 3601.13 0.59 3000 big-M50 - 3615.34 - 3000 persp. 2.43712 107 3600.26 0.58 3000 eig. persp. - 3669.39 - 3000 ours 2.43712 107 1.81 0.00 5000 SOS1 2.42550 107 3600.13 1.48 5000 big-M50 2.43693 107 3607.06 1.00 5000 persp. 2.43693 107 3601.61 1.00 5000 eig. persp. 2.43693 107 619.11 0.00 5000 ours 2.43693 107 4.08 0.00 Table 24: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 10.0). # of features method upper bound time(s) gap(%) 100 SOS1 9.11025 107 0.05 0.00 100 big-M50 9.11025 107 0.17 0.00 100 persp. 9.11025 107 0.29 0.00 100 eig. persp. 9.11018 107 11.35 0.00 100 ours 9.11025 107 0.32 0.00 500 SOS1 9.12046 107 13.62 0.00 500 big-M50 9.12046 107 117.86 0.00 500 persp. 9.12046 107 997.83 0.00 500 eig. persp. 9.12046 107 84.29 0.00 500 ours 9.12046 107 0.38 0.00 1000 SOS1 9.04074 107 1201.62 0.00 1000 big-M50 9.04074 107 3600.04 0.01 1000 persp. 
9.04074 107 3600.03 0.20 1000 eig. persp. 9.04074 107 343.13 0.00 1000 ours 9.04074 107 0.52 0.00 3000 SOS1 9.12760 107 3601.23 0.59 3000 big-M50 - 3611.90 - 3000 persp. 9.12755 107 3601.40 0.59 3000 eig. persp. - 3625.51 - 3000 ours 9.12760 107 20.84 0.00 5000 SOS1 8.98304 107 3600.14 2.65 5000 big-M50 - 3615.45 - 5000 persp. 9.12241 107 3600.13 1.07 5000 eig. persp. - 3819.54 - 5000 ours 9.12915 107 57.41 0.00 Table 25: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 10.0). G.2.6 Synthetic 1 Benchmark with λ2 = 10.0 with Warmstart # of features method upper bound time(s) gap(%) 100 SOS1 + warm start 2.10868 106 0.05 0.00 100 big-M50 + warm start 2.10868 106 0.25 0.00 100 persp. + warm start 2.10868 106 2.05 0.00 100 eig. persp. + warm start 2.10868 106 0.01 0.00 100 ours 2.10868 106 0.30 0.00 500 SOS1 + warm start 2.12823 106 6.54 0.00 500 big-M50 + warm start 2.12823 106 3.76 0.00 500 persp. + warm start 2.12823 106 144.06 0.00 500 eig. persp. + warm start 2.12823 106 0.61 0.00 500 ours 2.12823 106 0.38 0.00 1000 SOS1 + warm start 2.10995 106 67.85 0.00 1000 big-M50 + warm start 2.10995 106 23.00 0.00 1000 persp. + warm start 2.10995 106 436.41 0.00 1000 eig. persp. + warm start 2.10995 106 4.18 0.00 1000 ours 2.10995 106 0.57 0.00 3000 SOS1 + warm start 2.10220 106 2171.96 0.00 3000 big-M50 + warm start 2.10220 106 241.10 0.00 3000 persp. + warm start 2.10220 106 3571.06 0.00 3000 eig. persp. + warm start 2.10220 106 99.06 0.00 3000 ours 2.10220 106 1.68 0.00 5000 SOS1 + warm start 2.09625 106 3602.96 0.95 5000 big-M50 + warm start 2.09625 106 1309.75 0.00 5000 persp. + warm start 2.09625 106 3601.45 0.98 5000 eig. persp. + warm start 2.09625 106 442.61 0.00 5000 ours 2.09625 106 5.09 0.00 Table 26: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 10.0). All baselines use our beam search solution as a warm start. 
# of features method upper bound time(s) gap(%) 100 SOS1 + warm start 5.28716 106 0.05 0.00 100 big-M50 + warm start 5.28716 106 0.25 0.00 100 persp. + warm start 5.28716 106 0.89 0.00 100 eig. persp. + warm start 5.28716 106 0.02 0.00 100 ours 5.28716 106 0.30 0.00 500 SOS1 + warm start 5.31842 106 7.34 0.00 500 big-M50 + warm start 5.31842 106 2.93 0.00 500 persp. + warm start 5.31842 106 8.30 0.00 500 eig. persp. + warm start 5.31842 106 0.62 0.00 500 ours 5.31842 106 0.41 0.00 1000 SOS1 + warm start 5.26964 106 84.96 0.00 1000 big-M50 + warm start 5.26964 106 15.10 0.00 1000 persp. + warm start 5.26964 106 97.38 0.00 1000 eig. persp. + warm start 5.26964 106 4.10 0.00 1000 ours 5.26964 106 0.51 0.00 3000 SOS1 + warm start 5.27658 106 2809.30 0.00 3000 big-M50 + warm start 5.27658 106 315.30 0.00 3000 persp. + warm start 5.27658 106 152.21 0.00 3000 eig. persp. + warm start 5.27658 106 100.20 0.00 3000 ours 5.27658 106 1.76 0.00 5000 SOS1 + warm start 5.27116 106 3601.86 0.99 5000 big-M50 + warm start 5.27116 106 1100.87 0.00 5000 persp. + warm start 5.27116 106 633.11 0.00 5000 eig. persp. + warm start 5.27116 106 460.01 0.00 5000 ours 5.27116 106 4.32 0.00 Table 27: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 10.0). All baselines use our beam search solution as a warm start. # of features method upper bound time(s) gap(%) 100 SOS1 + warm start 1.10083 107 0.05 0.00 100 big-M50 + warm start 1.10083 107 0.06 0.00 100 persp. + warm start 1.10083 107 0.46 0.00 100 eig. persp. + warm start 1.10083 107 0.02 0.00 100 ours 1.10083 107 0.32 0.00 500 SOS1 + warm start 1.10517 107 10.26 0.00 500 big-M50 + warm start 1.10517 107 8.51 0.00 500 persp. + warm start 1.10517 107 3600.01 0.08 500 eig. persp. + warm start 1.10517 107 0.64 0.00 500 ours 1.10517 107 0.41 0.00 1000 SOS1 + warm start 1.09507 107 88.05 0.00 1000 big-M50 + warm start 1.09507 107 14.03 0.00 1000 persp. + warm start 1.09507 107 10.25 0.00 1000 eig. 
persp. + warm start 1.09507 107 4.13 0.00 1000 ours 1.09507 107 0.55 0.00 3000 SOS1 + warm start 1.10011 107 3008.66 0.00 3000 big-M50 + warm start 1.10011 107 201.32 0.00 3000 persp. + warm start 1.10011 107 153.29 0.00 3000 eig. persp. + warm start 1.10011 107 102.13 0.00 3000 ours 1.10011 107 1.73 0.00 5000 SOS1 + warm start 1.09965 107 3601.18 0.99 5000 big-M50 + warm start 1.09965 107 3601.20 1.00 5000 persp. + warm start 1.09965 107 3602.12 1.00 5000 eig. persp. + warm start 1.09965 107 499.65 0.00 5000 ours 1.09965 107 4.24 0.00 Table 28: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 10.0). All baselines use our beam search solution as a warm start. # of features method upper bound time(s) gap(%) 100 SOS1 + warm start 2.43574 107 0.05 0.00 100 big-M50 + warm start 2.43574 107 0.16 0.00 100 persp. + warm start 2.43574 107 0.51 0.00 100 eig. persp. + warm start 2.43574 107 0.02 0.00 100 ours 2.43574 107 0.30 0.00 500 SOS1 + warm start 2.44184 107 8.95 0.00 500 big-M50 + warm start 2.44184 107 2.11 0.00 500 persp. + warm start 2.44184 107 215.86 0.00 500 eig. persp. + warm start 2.44184 107 0.61 0.00 500 ours 2.44184 107 0.44 0.00 1000 SOS1 + warm start 2.41988 107 86.91 0.00 1000 big-M50 + warm start 2.41988 107 12.39 0.00 1000 persp. + warm start 2.41988 107 3600.03 0.08 1000 eig. persp. + warm start 2.41988 107 4.14 0.00 1000 ours 2.41988 107 0.52 0.00 3000 SOS1 + warm start 2.43712 107 3601.60 0.59 3000 big-M50 + warm start 2.43712 107 3600.43 0.59 3000 persp. + warm start 2.43712 107 3601.09 0.58 3000 eig. persp. + warm start 2.43712 107 113.06 0.00 3000 ours 2.43712 107 1.81 0.00 5000 SOS1 + warm start 2.43693 107 3601.29 1.00 5000 big-M50 + warm start 2.43693 107 3624.24 1.00 5000 persp. + warm start 2.43693 107 3600.19 1.00 5000 eig. persp. 
+ warm start 2.43693 107 457.54 0.00 5000 ours 2.43693 107 4.08 0.00 Table 29: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 10.0). All baselines use our beam search solution as a warm start. # of features method upper bound time(s) gap(%) 100 SOS1 + warm start 9.11025 107 0.04 0.00 100 big-M50 + warm start 9.11025 107 0.04 0.00 100 persp. + warm start 9.11025 107 0.29 0.00 100 eig. persp. + warm start 9.11025 107 0.03 0.00 100 ours 9.11025 107 0.32 0.00 500 SOS1 + warm start 9.12046 107 54.77 0.00 500 big-M50 + warm start 9.12046 107 36.63 0.00 500 persp. + warm start 9.12046 107 306.17 0.00 500 eig. persp. + warm start 9.12046 107 0.59 0.00 500 ours 9.12046 107 0.38 0.00 1000 SOS1 + warm start 9.04074 107 1178.82 0.00 1000 big-M50 + warm start 9.04074 107 3600.19 0.06 1000 persp. + warm start 9.04074 107 3600.04 0.19 1000 eig. persp. + warm start 9.04074 107 4.15 0.00 1000 ours 9.04074 107 0.52 0.00 3000 SOS1 + warm start 9.12760 107 3600.52 0.59 3000 big-M50 + warm start 9.12760 107 3606.60 0.59 3000 persp. + warm start 9.12760 107 3600.05 0.59 3000 eig. persp. + warm start 9.12760 107 103.22 0.00 3000 ours 9.12760 107 20.84 0.00 5000 SOS1 + warm start 9.12915 107 3602.40 1.00 5000 big-M50 + warm start 9.12915 107 3606.47 1.00 5000 persp. + warm start 9.12915 107 3600.16 1.00 5000 eig. persp. + warm start 9.12915 107 490.05 0.00 5000 ours 9.12915 107 57.41 0.00 Table 30: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 10.0). All baselines use our beam search solution as a warm start. G.2.7 Synthetic 2 Benchmark with λ2 = 0.001 with No Warmstart # of samples method upper bound time(s) gap(%) 3000 SOS1 5.42178 104 3600.10 39.82 3000 big-M50 6.34585 104 3600.07 19.16 3000 persp. 4.05327 104 3600.06 87.02 3000 eig. persp. 
5.42702 104 3600.07 39.69 3000 ours 6.34585 104 3600.49 19.45 4000 SOS1 8.64294 104 3601.19 15.37 4000 big-M50 8.64294 104 3600.07 15.32 4000 persp. 8.29421 104 3600.06 20.22 4000 eig. persp. 8.29301 104 3600.08 6.80 4000 ours 8.64294 104 20.25 0.00 5000 SOS1 9.85324 104 3600.10 24.46 5000 big-M50 1.09576 105 3600.81 10.25 5000 persp. 1.09576 105 3600.07 11.91 5000 eig. persp. 1.09576 105 3600.07 0.05 5000 ours 1.09576 105 20.51 0.00 6000 SOS1 1.15043 105 3600.09 21.96 6000 big-M50 1.27643 105 3600.08 9.93 6000 persp. 1.27643 105 3600.14 9.93 6000 eig. persp. 1.27643 105 3600.06 12.64 6000 ours 1.27643 105 1.70 0.00 7000 SOS1 1.50614 105 3600.86 8.32 7000 big-M50 1.50614 105 3600.09 8.29 7000 persp. 1.50614 105 3600.21 8.33 7000 eig. persp. 1.50614 105 106.73 0.00 7000 ours 1.50614 105 1.67 0.00 Table 31: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001). # of samples method upper bound time(s) gap(%) 3000 SOS1 1.50509 105 3601.34 26.68 3000 big-M50 1.59509 105 3600.08 19.07 3000 persp. 1.48091 105 3600.99 28.74 3000 eig. persp. 1.35700 105 3600.07 40.51 3000 ours 1.59509 105 3600.67 19.53 4000 SOS1 2.09237 105 3600.08 19.87 4000 big-M50 2.17574 105 3600.07 14.78 4000 persp. 2.17574 105 3600.06 14.91 4000 eig. persp. 2.02936 105 3600.06 23.67 4000 ours 2.17574 105 100.13 0.00 5000 SOS1 2.71986 105 3600.29 12.00 5000 big-M50 2.71986 105 3600.07 11.22 5000 persp. 2.71986 105 3600.40 11.98 5000 eig. persp. 2.71986 105 2152.71 0.00 5000 ours 2.71986 105 20.80 0.00 6000 SOS1 3.16761 105 3600.70 9.96 6000 big-M50 3.16761 105 3600.09 9.96 6000 persp. 3.16761 105 3600.06 9.96 6000 eig. persp. 3.16761 105 1968.93 0.00 6000 ours 3.16761 105 20.70 0.00 7000 SOS1 3.72610 105 3601.25 8.33 7000 big-M50 3.72610 105 3601.19 8.33 7000 persp. 3.72610 105 3600.14 8.33 7000 eig. persp. 
3.72610 105 2906.29 0.00 7000 ours 3.72610 105 1.98 0.00 Table 32: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.001). # of samples method upper bound time(s) gap(%) 3000 SOS1 3.24748 105 3600.06 22.83 3000 big-M50 3.33604 105 3600.08 18.79 3000 persp. 3.03254 105 3600.06 31.54 3000 eig. persp. 3.04485 105 3600.80 31.01 3000 ours 3.33604 105 3600.58 19.57 4000 SOS1 4.49670 105 3600.55 16.16 4000 big-M50 4.53261 105 3600.07 14.98 4000 persp. 4.50269 105 3600.06 15.99 4000 eig. persp. 4.30194 105 3600.06 21.59 4000 ours 4.53261 105 3600.25 0.63 5000 SOS1 5.57342 105 3600.12 13.18 5000 big-M50 5.63007 105 3601.19 11.68 5000 persp. 5.63007 105 3600.57 11.75 5000 eig. persp. 5.63007 105 3600.05 0.29 5000 ours 5.63007 105 36.97 0.00 6000 SOS1 6.51710 105 3600.08 11.01 6000 big-M50 6.57799 105 3600.34 9.98 6000 persp. 6.57799 105 3600.15 9.97 6000 eig. persp. 6.57799 105 1685.67 0.00 6000 ours 6.57799 105 20.75 0.00 7000 SOS1 7.46135 105 3600.71 12.13 7000 big-M50 7.72267 105 3600.11 8.33 7000 persp. 7.72267 105 3600.50 8.32 7000 eig. persp. 7.72267 105 3600.27 10.37 7000 ours 7.72267 105 21.02 0.00 Table 33: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.001). # of samples method upper bound time(s) gap(%) 3000 SOS1 7.34855 105 3600.26 20.63 3000 big-M50 7.41187 105 3600.05 18.15 3000 persp. 6.86643 105 3600.74 29.10 3000 eig. persp. 7.02765 105 3600.06 26.14 3000 ours 7.41187 105 3601.14 19.60 4000 SOS1 1.00137 106 3601.49 15.39 4000 big-M50 1.00103 106 3600.07 15.19 4000 persp. 9.96124 105 3600.10 15.98 4000 eig. persp. 9.90959 105 3600.06 17.00 4000 ours 1.00287 106 3600.37 1.88 5000 SOS1 1.23058 106 3600.25 12.96 5000 big-M50 1.24027 106 3601.10 11.88 5000 persp. 1.23565 106 3601.00 12.49 5000 eig. persp. 1.22403 106 3600.05 2.31 5000 ours 1.24027 106 466.42 0.00 6000 SOS1 1.44944 106 3600.09 10.36 6000 big-M50 1.45417 106 3600.11 9.78 6000 persp. 
1.45417 106 3600.55 9.82 6000 eig. persp. 1.45417 106 3430.19 0.00 6000 ours 1.45417 106 57.67 0.00 7000 SOS1 1.68757 106 3600.20 9.45 7000 big-M50 1.70483 106 3601.55 8.34 7000 persp. 1.70483 106 3600.96 8.13 7000 eig. persp. 1.70483 106 3432.42 0.00 7000 ours 1.70483 106 21.42 0.00 Table 34: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.001). # of samples method upper bound time(s) gap(%) 3000 SOS1 2.77903 106 3600.07 19.87 3000 big-M50 2.78521 106 3600.15 15.90 3000 persp. 2.76075 106 3600.05 20.66 3000 eig. persp. 2.74468 106 3600.81 21.37 3000 ours 2.78521 106 3600.01 19.60 4000 SOS1 3.75203 106 3600.12 15.10 4000 big-M50 3.75202 106 3600.06 13.33 4000 persp. 3.72754 106 3600.07 15.88 4000 eig. persp. 3.72921 106 3600.51 16.56 4000 ours 3.75203 106 3600.42 3.50 5000 SOS1 4.61204 106 3600.15 12.27 5000 big-M50 4.61877 106 3600.19 11.37 5000 persp. 4.60317 106 3600.36 12.48 5000 eig. persp. 4.60063 106 3600.26 13.32 5000 ours 4.61877 106 3600.57 0.93 6000 SOS1 5.41709 106 3600.27 10.45 6000 big-M50 5.43826 106 3600.06 9.77 6000 persp. 5.41956 106 3600.06 10.39 6000 eig. persp. 5.40204 106 3600.06 11.80 6000 ours 5.43826 106 3600.65 0.33 7000 SOS1 6.35797 106 3600.07 8.51 7000 big-M50 6.36767 106 3600.38 8.22 7000 persp. 6.35116 106 3600.24 8.62 7000 eig. persp. 6.36156 106 3610.76 0.53 7000 ours 6.36767 106 3600.78 0.11 Table 35: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.001). G.2.8 Synthetic 2 Benchmark with λ2 = 0.001 with Warmstart # of samples method upper bound time(s) gap(%) 3000 SOS1 + warm start 6.34585 104 3600.96 19.46 3000 big-M50 + warm start 6.34585 104 3601.20 19.16 3000 persp. + warm start 6.34585 104 3600.10 19.45 3000 eig. persp. + warm start 6.34585 104 3600.13 19.46 3000 ours 6.34585 104 3600.49 19.45 4000 SOS1 + warm start 8.64294 104 3600.56 15.37 4000 big-M50 + warm start 8.64294 104 3600.87 15.32 4000 persp. 
+ warm start 8.64294 104 3600.07 14.33 4000 eig. persp. + warm start 8.64294 104 3600.12 1.64 4000 ours 8.64294 104 20.25 0.00 5000 SOS1 + warm start 1.09576 105 3600.38 11.91 5000 big-M50 + warm start 1.09576 105 3601.86 10.25 5000 persp. + warm start 1.09576 105 3600.07 11.91 5000 eig. persp. + warm start 1.09576 105 3389.68 0.00 5000 ours 1.09576 105 20.51 0.00 6000 SOS1 + warm start 1.27643 105 3600.46 9.92 6000 big-M50 + warm start 1.27643 105 3601.14 9.93 6000 persp. + warm start 1.27643 105 3600.05 9.93 6000 eig. persp. + warm start 1.27643 105 3600.07 12.65 6000 ours 1.27643 105 1.70 0.00 7000 SOS1 + warm start 1.50614 105 3600.07 5.84 7000 big-M50 + warm start 1.50614 105 3600.13 8.29 7000 persp. + warm start 1.50614 105 3600.17 8.33 7000 eig. persp. + warm start 1.50614 105 101.37 0.00 7000 ours 1.50614 105 1.67 0.00 Table 36: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001). All baselines use our beam search solution as a warm start. # of samples method upper bound time(s) gap(%) 3000 SOS1 + warm start 1.59509 105 3600.74 19.53 3000 big-M50 + warm start 1.59509 105 3600.60 19.07 3000 persp. + warm start 1.59509 105 3601.09 19.53 3000 eig. persp. + warm start 1.59509 105 3601.11 19.54 3000 ours 1.59509 105 3600.67 19.53 4000 SOS1 + warm start 2.17574 105 3600.07 15.28 4000 big-M50 + warm start 2.17574 105 3600.10 14.78 4000 persp. + warm start 2.17574 105 3601.12 14.91 4000 eig. persp. + warm start 2.17574 105 3600.11 15.34 4000 ours 2.17574 105 100.13 0.00 5000 SOS1 + warm start 2.71986 105 3600.44 12.00 5000 big-M50 + warm start 2.71986 105 3600.06 11.22 5000 persp. + warm start 2.71986 105 3600.14 11.35 5000 eig. persp. + warm start 2.71986 105 2118.55 0.00 5000 ours 2.71986 105 20.80 0.00 6000 SOS1 + warm start 3.16761 105 3600.73 9.96 6000 big-M50 + warm start 3.16761 105 3600.07 9.96 6000 persp. + warm start 3.16761 105 3601.28 9.96 6000 eig. persp. 
+ warm start 3.16761 105 3600.50 11.79 6000 ours 3.16761 105 20.70 0.00 7000 SOS1 + warm start 3.72610 105 3600.61 8.33 7000 big-M50 + warm start 3.72610 105 3600.08 8.33 7000 persp. + warm start 3.72610 105 3600.08 8.33 7000 eig. persp. + warm start 3.72610 105 3600.81 9.45 7000 ours 3.72610 105 1.98 0.00 Table 37: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.001). All baselines use our beam search solution as a warm start. # of samples method upper bound time(s) gap(%) 3000 SOS1 + warm start 3.33604 105 3600.69 19.57 3000 big-M50 + warm start 3.33604 105 3600.17 18.79 3000 persp. + warm start 3.33604 105 3600.06 19.57 3000 eig. persp. + warm start 3.33604 105 3600.06 19.58 3000 ours 3.33604 105 3600.58 19.57 4000 SOS1 + warm start 4.53261 105 3600.07 15.24 4000 big-M50 + warm start 4.53261 105 3600.17 14.98 4000 persp. + warm start 4.53261 105 3600.08 15.23 4000 eig. persp. + warm start 4.53261 105 3600.09 15.41 4000 ours 4.53261 105 3600.25 0.63 5000 SOS1 + warm start 5.63007 105 3600.25 12.04 5000 big-M50 + warm start 5.63007 105 3600.29 11.70 5000 persp. + warm start 5.63007 105 3600.76 11.75 5000 eig. persp. + warm start 5.63007 105 3600.39 0.37 5000 ours 5.63007 105 36.97 0.00 6000 SOS1 + warm start 6.57799 105 3600.29 9.98 6000 big-M50 + warm start 6.57799 105 3600.94 9.98 6000 persp. + warm start 6.57799 105 3601.41 9.57 6000 eig. persp. + warm start 6.57799 105 1420.54 0.00 6000 ours 6.57799 105 20.75 0.00 7000 SOS1 + warm start 7.72267 105 3600.06 8.33 7000 big-M50 + warm start 7.72267 105 3600.55 8.33 7000 persp. + warm start 7.72267 105 3600.39 7.86 7000 eig. persp. + warm start 7.72267 105 3600.81 10.37 7000 ours 7.72267 105 21.02 0.00 Table 38: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.001). All baselines use our beam search solution as a warm start. 
# of samples method upper bound time(s) gap(%) 3000 SOS1 + warm start 7.41187 105 3600.08 19.60 3000 big-M50 + warm start 7.41187 105 3600.76 18.15 3000 persp. + warm start 7.41187 105 3600.32 19.60 3000 eig. persp. + warm start 7.41187 105 3600.07 19.60 3000 ours 7.41187 105 3601.14 19.60 4000 SOS1 + warm start 1.00287 106 3600.07 15.22 4000 big-M50 + warm start 1.00287 106 3600.07 14.98 4000 persp. + warm start 1.00287 106 3600.06 15.20 4000 eig. persp. + warm start 1.00287 106 3600.07 15.61 4000 ours 1.00287 106 3600.37 1.88 5000 SOS1 + warm start 1.24027 106 3600.06 12.08 5000 big-M50 + warm start 1.24027 106 3600.05 11.88 5000 persp. + warm start 1.24027 106 3600.07 12.07 5000 eig. persp. + warm start 1.24027 106 3600.20 0.97 5000 ours 1.24027 106 466.42 0.00 6000 SOS1 + warm start 1.45417 106 3601.20 10.00 6000 big-M50 + warm start 1.45417 106 3600.45 9.78 6000 persp. + warm start 1.45417 106 3600.04 9.82 6000 eig. persp. + warm start 1.45417 106 2727.13 0.00 6000 ours 1.45417 106 57.67 0.00 7000 SOS1 + warm start 1.70483 106 3600.69 8.34 7000 big-M50 + warm start 1.70483 106 3601.38 8.34 7000 persp. + warm start 1.70483 106 3601.09 8.13 7000 eig. persp. + warm start 1.70483 106 1517.88 0.00 7000 ours 1.70483 106 21.42 0.00 Table 39: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.001). All baselines use our beam search solution as a warm start. # of samples method upper bound time(s) gap(%) 3000 SOS1 + warm start 2.78521 106 3600.34 19.60 3000 big-M50 + warm start 2.78521 106 3600.74 15.90 3000 persp. + warm start 2.78521 106 3600.06 19.60 3000 eig. persp. + warm start 2.78521 106 3600.44 19.61 3000 ours 2.78521 106 3600.01 19.60 4000 SOS1 + warm start 3.75203 106 3600.06 15.10 4000 big-M50 + warm start 3.75203 106 3600.07 13.33 4000 persp. + warm start 3.75203 106 3600.06 15.12 4000 eig. persp. 
+ warm start 3.75203 106 3600.29 15.85 4000 ours 3.75203 106 3600.42 3.50 5000 SOS1 + warm start 4.61877 106 3600.07 12.11 5000 big-M50 + warm start 4.61877 106 3601.36 11.37 5000 persp. + warm start 4.61877 106 3600.06 12.10 5000 eig. persp. + warm start 4.61877 106 3604.20 12.88 5000 ours 4.61877 106 3600.57 0.93 6000 SOS1 + warm start 5.43826 106 3600.07 10.02 6000 big-M50 + warm start 5.43826 106 3600.22 9.77 6000 persp. + warm start 5.43826 106 3601.43 10.01 6000 eig. persp. + warm start 5.43826 106 3600.95 0.85 6000 ours 5.43826 106 3600.65 0.33 7000 SOS1 + warm start 6.36767 106 3600.06 8.35 7000 big-M50 + warm start 6.36767 106 3600.12 8.22 7000 persp. + warm start 6.36767 106 3600.06 8.34 7000 eig. persp. + warm start 6.36767 106 3600.28 0.35 7000 ours 6.36767 106 3600.78 0.11 Table 40: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.001). All baselines use our beam search solution as a warm start. G.2.9 Synthetic 2 Benchmark with λ2 = 0.1 with No Warmstart # of samples method upper bound time(s) gap(%) 3000 SOS1 6.34575 104 3600.08 19.36 3000 big-M50 6.34575 104 3600.07 19.14 3000 persp. 4.81335 104 3600.14 57.24 3000 eig. persp. 4.80471 104 3600.06 57.76 3000 ours 6.34575 104 3600.33 19.31 4000 SOS1 8.64284 104 3600.20 15.37 4000 big-M50 8.64284 104 3600.23 15.33 4000 persp. 8.29378 104 3600.27 20.19 4000 eig. persp. 8.64284 104 3600.49 1.61 4000 ours 8.64284 104 20.81 0.00 5000 SOS1 1.09575 105 3600.96 11.91 5000 big-M50 1.09575 105 3600.08 11.60 5000 persp. 1.09575 105 3600.07 11.90 5000 eig. persp. 1.09575 105 3600.18 0.23 5000 ours 1.09575 105 20.65 0.00 6000 SOS1 1.27642 105 3600.09 9.92 6000 big-M50 1.27642 105 3601.02 9.92 6000 persp. 1.27642 105 3600.93 9.92 6000 eig. persp. 1.27642 105 3600.74 12.65 6000 ours 1.27642 105 1.70 0.00 7000 SOS1 1.50613 105 3600.12 8.32 7000 big-M50 1.50613 105 3600.45 8.32 7000 persp. 1.50613 105 3600.20 8.29 7000 eig. persp. 
# of samples  method                    upper bound    time(s)  gap(%)
…             …                         1.50613×10^5   113.48   0.00
7000          ours                      1.50613×10^5   1.67     0.00

Table 41: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      1.52666×10^5   3600.76  24.80
3000          big-M50                   1.59508×10^5   3601.50  19.07
3000          persp.                    1.28190×10^5   3600.48  48.57
3000          eig. persp.               1.34180×10^5   3600.42  42.09
3000          ours                      1.59508×10^5   3600.27  19.40
4000          SOS1                      2.04336×10^5   3600.51  22.74
4000          big-M50                   2.17573×10^5   3600.23  14.78
4000          persp.                    2.17573×10^5   3600.79  14.88
4000          eig. persp.               2.09228×10^5   3600.20  19.95
4000          ours                      2.17573×10^5   99.53    0.00
5000          SOS1                      2.66411×10^5   3600.59  14.34
5000          big-M50                   2.71985×10^5   3600.09  11.22
5000          persp.                    2.71985×10^5   3600.21  11.96
5000          eig. persp.               2.71985×10^5   2019.93  0.00
5000          ours                      2.71985×10^5   21.01    0.00
6000          SOS1                      3.16760×10^5   3600.27  9.96
6000          big-M50                   3.16760×10^5   3600.07  9.96
6000          persp.                    3.16760×10^5   3600.07  9.96
6000          eig. persp.               3.16760×10^5   1196.41  0.00
6000          ours                      3.16760×10^5   20.91    0.00
7000          SOS1                      3.55314×10^5   3600.32  13.60
7000          big-M50                   3.72609×10^5   3600.80  8.33
7000          persp.                    3.72609×10^5   3600.06  8.31
7000          eig. persp.               3.72609×10^5   3600.07  9.46
7000          ours                      3.72609×10^5   1.73     0.00

Table 42: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.1).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      3.23979×10^5   3601.13  23.04
3000          big-M50                   3.33603×10^5   3601.02  18.78
3000          persp.                    2.94042×10^5   3600.41  35.54
3000          eig. persp.               3.07151×10^5   3600.07  29.86
3000          ours                      3.33603×10^5   3600.68  19.44
4000          SOS1                      4.35290×10^5   3600.08  20.00
4000          big-M50                   4.53260×10^5   3600.12  14.95
4000          persp.                    4.50184×10^5   3600.07  15.99
4000          eig. persp.               4.41599×10^5   3600.06  18.45
4000          ours                      4.53260×10^5   3600.41  0.62
5000          SOS1                      5.63006×10^5   3600.09  12.04
5000          big-M50                   5.63006×10^5   3600.07  11.69
5000          persp.                    5.63006×10^5   3600.16  11.74
5000          eig. persp.               5.63006×10^5   3600.43  0.46
5000          ours                      5.63006×10^5   37.26    0.00
6000          SOS1                      6.57798×10^5   3600.98  9.98
6000          big-M50                   6.57798×10^5   3600.53  9.98
6000          persp.                    6.57798×10^5   3600.13  9.96
6000          eig. persp.               6.57798×10^5   1737.91  0.00
6000          ours                      6.57798×10^5   20.64    0.00
7000          SOS1                      7.46134×10^5   3600.11  12.13
7000          big-M50                   7.72266×10^5   3600.96  8.33
7000          persp.                    7.72266×10^5   3600.23  8.31
7000          eig. persp.               7.72266×10^5   3601.08  10.36
7000          ours                      7.72266×10^5   20.91    0.00

Table 43: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.1).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      7.36308×10^5   3601.07  20.30
3000          big-M50                   7.41186×10^5   3600.75  18.15
3000          persp.                    7.11307×10^5   3600.38  24.54
3000          eig. persp.               7.14103×10^5   3600.06  24.13
3000          ours                      7.41186×10^5   3600.59  19.46
4000          SOS1                      1.00107×10^6   3600.08  15.42
4000          big-M50                   1.00287×10^6   3600.07  14.98
4000          persp.                    9.98563×10^5   3600.06  15.67
4000          eig. persp.               9.88486×10^5   3600.06  17.29
4000          ours                      1.00287×10^6   3600.08  1.88
5000          SOS1                      1.23516×10^6   3600.08  12.54
5000          big-M50                   1.24027×10^6   3600.07  11.88
5000          persp.                    1.23565×10^6   3600.08  12.47
5000          eig. persp.               1.22304×10^6   3600.46  14.19
5000          ours                      1.24027×10^6   469.43   0.00
6000          SOS1                      1.45416×10^6   3600.07  10.00
6000          big-M50                   1.45416×10^6   3601.17  9.78
6000          persp.                    1.45416×10^6   3601.30  9.82
6000          eig. persp.               1.45416×10^6   3600.05  0.03
6000          ours                      1.45416×10^6   59.16    0.00
7000          SOS1                      1.69945×10^6   3600.08  8.68
7000          big-M50                   1.70483×10^6   3602.06  8.34
7000          persp.                    1.70483×10^6   3600.92  8.13
7000          eig. persp.               1.70483×10^6   1537.97  0.00
7000          ours                      1.70483×10^6   21.12    0.00

Table 44: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.1).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      2.77826×10^6   3600.41  19.82
3000          big-M50                   2.78468×10^6   3600.31  15.93
3000          persp.                    2.76199×10^6   3600.06  20.50
3000          eig. persp.               2.76520×10^6   3600.06  20.47
3000          ours                      2.78521×10^6   3600.23  19.46
4000          SOS1                      3.74434×10^6   3600.35  15.33
4000          big-M50                   3.75203×10^6   3600.08  13.35
4000          persp.                    3.73180×10^6   3600.06  15.73
4000          eig. persp.               3.74150×10^6   3600.69  16.17
4000          ours                      3.75203×10^6   3601.10  3.50
5000          SOS1                      4.60905×10^6   3600.07  12.34
5000          big-M50                   4.61877×10^6   3600.08  11.40
5000          persp.                    4.54788×10^6   3600.79  13.84
5000          eig. persp.               4.59650×10^6   3600.06  13.42
5000          ours                      4.61877×10^6   3600.40  0.93
6000          SOS1                      5.40939×10^6   3600.07  10.60
6000          big-M50                   5.43826×10^6   3600.23  9.79
6000          persp.                    5.42336×10^6   3600.65  10.30
6000          eig. persp.               5.43107×10^6   3600.99  11.20
6000          ours                      5.43826×10^6   3600.26  0.33
7000          SOS1                      6.33762×10^6   3600.07  8.86
7000          big-M50                   6.36767×10^6   3600.07  8.22
7000          persp.                    6.26888×10^6   3600.06  10.04
7000          eig. persp.               6.34367×10^6   3600.05  10.10
7000          ours                      6.36767×10^6   3600.58  0.11

Table 45: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.1).

G.2.10 Synthetic 2 Benchmark with λ2 = 0.1 with Warmstart

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         6.34575×10^4   3600.07  19.36
3000          big-M50 + warm start      6.34575×10^4   3600.12  19.14
3000          persp. + warm start       6.34575×10^4   3600.14  19.27
3000          eig. persp. + warm start  6.34575×10^4   3600.13  19.45
3000          ours                      6.34575×10^4   3600.33  19.31
4000          SOS1 + warm start         8.64284×10^4   3600.36  15.36
4000          big-M50 + warm start      8.64284×10^4   3600.05  15.33
4000          persp. + warm start       8.64284×10^4   3600.52  14.30
4000          eig. persp. + warm start  8.64284×10^4   3600.17  1.64
4000          ours                      8.64284×10^4   20.81    0.00
5000          SOS1 + warm start         1.09575×10^5   3600.99  11.91
5000          big-M50 + warm start      1.09575×10^5   3601.88  11.60
5000          persp. + warm start       1.09575×10^5   3600.14  11.90
5000          eig. persp. + warm start  1.09575×10^5   2971.48  0.00
5000          ours                      1.09575×10^5   20.65    0.00
6000          SOS1 + warm start         1.27642×10^5   3600.07  9.92
6000          big-M50 + warm start      1.27642×10^5   3600.62  9.93
6000          persp. + warm start       1.27642×10^5   3600.06  9.92
6000          eig. persp. + warm start  1.27642×10^5   3601.19  12.65
6000          ours                      1.27642×10^5   1.70     0.00
7000          SOS1 + warm start         1.50613×10^5   3600.13  5.84
7000          big-M50 + warm start      1.50613×10^5   3600.07  8.32
7000          persp. + warm start       1.50613×10^5   3600.19  8.29
7000          eig. persp. + warm start  1.50613×10^5   103.79   0.00
7000          ours                      1.50613×10^5   1.67     0.00

Table 46: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         1.59508×10^5   3600.06  19.45
3000          big-M50 + warm start      1.59508×10^5   3600.22  19.07
3000          persp. + warm start       1.59508×10^5   3600.06  19.40
3000          eig. persp. + warm start  1.59508×10^5   3600.14  19.52
3000          ours                      1.59508×10^5   3600.27  19.40
4000          SOS1 + warm start         2.17573×10^5   3600.40  15.28
4000          big-M50 + warm start      2.17573×10^5   3600.10  14.78
4000          persp. + warm start       2.17573×10^5   3600.06  14.88
4000          eig. persp. + warm start  2.17573×10^5   3600.10  15.34
4000          ours                      2.17573×10^5   99.53    0.00
5000          SOS1 + warm start         2.71985×10^5   3601.08  12.00
5000          big-M50 + warm start      2.71985×10^5   3600.07  11.22
5000          persp. + warm start       2.71985×10^5   3600.06  11.34
5000          eig. persp. + warm start  2.71985×10^5   1642.59  0.00
5000          ours                      2.71985×10^5   21.01    0.00
6000          SOS1 + warm start         3.16760×10^5   3600.83  9.96
6000          big-M50 + warm start      3.16760×10^5   3600.07  9.96
6000          persp. + warm start       3.16760×10^5   3600.05  9.96
6000          eig. persp. + warm start  3.16760×10^5   3600.06  11.77
6000          ours                      3.16760×10^5   20.91    0.00
7000          SOS1 + warm start         3.72609×10^5   3600.67  8.33
7000          big-M50 + warm start      3.72609×10^5   3601.58  8.33
7000          persp. + warm start       3.72609×10^5   3601.01  8.31
7000          eig. persp. + warm start  3.72609×10^5   3600.55  9.45
7000          ours                      3.72609×10^5   1.73     0.00

Table 47: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         3.33603×10^5   3600.06  19.49
3000          big-M50 + warm start      3.33603×10^5   3601.53  18.78
3000          persp. + warm start       3.33603×10^5   3600.07  19.47
3000          eig. persp. + warm start  3.33603×10^5   3600.06  19.56
3000          ours                      3.33603×10^5   3600.68  19.44
4000          SOS1 + warm start         4.53260×10^5   3600.06  15.24
4000          big-M50 + warm start      4.53260×10^5   3600.07  15.04
4000          persp. + warm start       4.53260×10^5   3600.07  15.20
4000          eig. persp. + warm start  4.53260×10^5   3601.61  15.40
4000          ours                      4.53260×10^5   3600.41  0.62
5000          SOS1 + warm start         5.63006×10^5   3600.47  12.04
5000          big-M50 + warm start      5.63006×10^5   3601.70  11.69
5000          persp. + warm start       5.63006×10^5   3600.06  11.74
5000          eig. persp. + warm start  5.63006×10^5   3600.50  0.47
5000          ours                      5.63006×10^5   37.26    0.00
6000          SOS1 + warm start         6.57798×10^5   3601.18  9.98
6000          big-M50 + warm start      6.57798×10^5   3600.18  9.98
6000          persp. + warm start       6.57798×10^5   3600.06  9.56
6000          eig. persp. + warm start  6.57798×10^5   1376.14  0.00
6000          ours                      6.57798×10^5   20.64    0.00
7000          SOS1 + warm start         7.72266×10^5   3600.10  8.33
7000          big-M50 + warm start      7.72266×10^5   3600.63  8.33
7000          persp. + warm start       7.72266×10^5   3600.34  7.86
7000          eig. persp. + warm start  7.72266×10^5   3600.44  10.37
7000          ours                      7.72266×10^5   20.91    0.00

Table 48: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         7.41186×10^5   3600.06  19.51
3000          big-M50 + warm start      7.41186×10^5   3601.56  18.15
3000          persp. + warm start       7.41186×10^5   3600.87  19.52
3000          eig. persp. + warm start  7.41186×10^5   3600.05  19.59
3000          ours                      7.41186×10^5   3600.59  19.46
4000          SOS1 + warm start         1.00287×10^6   3600.07  15.21
4000          big-M50 + warm start      1.00287×10^6   3600.09  14.98
4000          persp. + warm start       1.00287×10^6   3600.07  15.18
4000          eig. persp. + warm start  1.00287×10^6   3600.06  15.61
4000          ours                      1.00287×10^6   3600.08  1.88
5000          SOS1 + warm start         1.24027×10^6   3601.15  12.08
5000          big-M50 + warm start      1.24027×10^6   3600.40  11.88
5000          persp. + warm start       1.24027×10^6   3601.42  12.05
5000          eig. persp. + warm start  1.24027×10^6   3600.19  0.97
5000          ours                      1.24027×10^6   469.43   0.00
6000          SOS1 + warm start         1.45416×10^6   3600.09  10.00
6000          big-M50 + warm start      1.45416×10^6   3600.05  9.78
6000          persp. + warm start       1.45416×10^6   3600.10  9.82
6000          eig. persp. + warm start  1.45416×10^6   3600.16  0.22
6000          ours                      1.45416×10^6   59.16    0.00
7000          SOS1 + warm start         1.70483×10^6   3600.74  8.34
7000          big-M50 + warm start      1.70483×10^6   3600.07  8.34
7000          persp. + warm start       1.70483×10^6   3600.06  8.13
7000          eig. persp. + warm start  1.70483×10^6   1417.14  0.00
7000          ours                      1.70483×10^6   21.12    0.00

Table 49: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         2.78521×10^6   3600.07  19.52
3000          big-M50 + warm start      2.78521×10^6   3600.07  15.91
3000          persp. + warm start       2.78521×10^6   3600.06  19.50
3000          eig. persp. + warm start  2.78521×10^6   3600.06  19.60
3000          ours                      2.78521×10^6   3600.23  19.46
4000          SOS1 + warm start         3.75203×10^6   3600.07  15.10
4000          big-M50 + warm start      3.75203×10^6   3600.27  13.35
4000          persp. + warm start       3.75203×10^6   3600.51  15.10
4000          eig. persp. + warm start  3.75203×10^6   3600.07  15.85
4000          ours                      3.75203×10^6   3601.10  3.50
5000          SOS1 + warm start         4.61877×10^6   3600.07  12.11
5000          big-M50 + warm start      4.61877×10^6   3600.12  11.40
5000          persp. + warm start       4.61877×10^6   3600.06  12.09
5000          eig. persp. + warm start  4.61877×10^6   3600.35  12.88
5000          ours                      4.61877×10^6   3600.40  0.93
6000          SOS1 + warm start         5.43826×10^6   3600.37  10.02
6000          big-M50 + warm start      5.43826×10^6   3600.07  9.79
6000          persp. + warm start       5.43826×10^6   3600.07  10.00
6000          eig. persp. + warm start  5.43826×10^6   3600.08  11.05
6000          ours                      5.43826×10^6   3600.26  0.33
7000          SOS1 + warm start         6.36767×10^6   3600.50  8.35
7000          big-M50 + warm start      6.36767×10^6   3600.07  8.22
7000          persp. + warm start       6.36767×10^6   3600.06  8.33
7000          eig. persp. + warm start  6.36767×10^6   3600.05  0.38
7000          ours                      6.36767×10^6   3600.58  0.11

Table 50: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.1). All baselines use our beam search solution as a warm start.

G.2.11 Synthetic 2 Benchmark with λ2 = 10.0 with No Warmstart

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      6.33576×10^4   3600.08  18.27
3000          big-M50                   6.33576×10^4   3600.31  18.47
3000          persp.                    6.33576×10^4   3600.20  17.53
3000          eig. persp.               4.46130×10^4   3601.33  69.85
3000          ours                      6.33576×10^4   3601.41  4.54
4000          SOS1                      8.63287×10^4   3600.28  15.23
4000          big-M50                   8.63287×10^4   3600.85  15.21
4000          persp.                    8.63287×10^4   3600.15  12.69
4000          eig. persp.               8.58997×10^4   3600.10  1.74
4000          ours                      8.63287×10^4   20.33    0.00
5000          SOS1                      1.09474×10^5   3600.08  11.86
5000          big-M50                   1.09474×10^5   3600.28  11.85
5000          persp.                    1.09474×10^5   3600.80  10.44
5000          eig. persp.               1.09474×10^5   3600.06  13.55
5000          ours                      1.09474×10^5   20.71    0.00
6000          SOS1                      1.27542×10^5   3600.19  9.90
6000          big-M50                   1.27542×10^5   3600.37  9.90
6000          persp.                    1.27542×10^5   3600.14  9.03
6000          eig. persp.               1.27542×10^5   2609.62  0.00
6000          ours                      1.27542×10^5   1.69     0.00
7000          SOS1                      1.42695×10^5   3600.12  14.24
7000          big-M50                   1.50512×10^5   3600.56  8.31
7000          persp.                    1.50512×10^5   3600.15  7.73
7000          eig. persp.               1.50512×10^5   107.90   0.00
7000          ours                      1.50512×10^5   1.70     0.00

Table 51: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      1.52806×10^5   3601.20  23.64
3000          big-M50                   1.59409×10^5   3600.72  18.52
3000          persp.                    1.59409×10^5   3600.14  17.84
3000          eig. persp.               1.25897×10^5   3600.88  51.38
3000          ours                      1.59409×10^5   3600.62  4.68
4000          SOS1                      2.17472×10^5   3600.70  15.13
4000          big-M50                   2.17472×10^5   3600.07  14.64
4000          persp.                    2.17472×10^5   3600.05  12.93
4000          eig. persp.               2.03767×10^5   3600.20  23.07
4000          ours                      2.17472×10^5   86.90    0.00
5000          SOS1                      2.71885×10^5   3600.32  11.94
5000          big-M50                   2.71885×10^5   3603.74  11.16
5000          persp.                    2.71885×10^5   3600.05  10.84
5000          eig. persp.               2.71885×10^5   1969.80  0.00
5000          ours                      2.71885×10^5   20.75    0.00
6000          SOS1                      3.10076×10^5   3600.08  12.27
6000          big-M50                   3.16660×10^5   3600.07  9.93
6000          persp.                    3.16660×10^5   3600.06  9.24
6000          eig. persp.               3.16660×10^5   2020.73  0.00
6000          ours                      3.16660×10^5   21.05    0.00
7000          SOS1                      3.72509×10^5   3600.17  8.31
7000          big-M50                   3.72509×10^5   3601.17  8.31
7000          persp.                    3.72509×10^5   3600.13  7.78
7000          eig. persp.               3.72509×10^5   2139.72  0.00
7000          ours                      3.72509×10^5   1.71     0.00

Table 52: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 10.0).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      3.31006×10^5   3600.07  19.45
3000          big-M50                   3.33503×10^5   3600.07  18.43
3000          persp.                    3.30991×10^5   3600.07  18.99
3000          eig. persp.               2.97162×10^5   3600.08  34.18
3000          ours                      3.33503×10^5   3600.25  6.21
4000          SOS1                      4.36975×10^5   3600.35  19.35
4000          big-M50                   4.53159×10^5   3600.16  14.84
4000          persp.                    4.50480×10^5   3600.38  12.96
4000          eig. persp.               4.38770×10^5   3600.11  19.13
4000          ours                      4.53159×10^5   3600.95  0.35
5000          SOS1                      5.37846×10^5   3600.98  17.20
5000          big-M50                   5.62906×10^5   3600.16  11.59
5000          persp.                    5.62906×10^5   3600.07  10.92
5000          eig. persp.               5.62906×10^5   3600.19  12.71
5000          ours                      5.62906×10^5   31.47    0.00
6000          SOS1                      6.51605×10^5   3600.45  10.98
6000          big-M50                   6.57698×10^5   3600.16  9.95
6000          persp.                    6.57698×10^5   3601.04  9.39
6000          eig. persp.               6.57698×10^5   2550.81  0.00
6000          ours                      6.57698×10^5   21.24    0.00
7000          SOS1                      7.65939×10^5   3600.07  9.20
7000          big-M50                   7.72166×10^5   3600.12  8.31
7000          persp.                    7.72166×10^5   3600.13  7.90
7000          eig. persp.               7.72166×10^5   3600.06  10.13
7000          ours                      7.72166×10^5   20.37    0.00

Table 53: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 10.0).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      7.35981×10^5   3600.07  19.40
3000          big-M50                   7.41086×10^5   3600.65  17.88
3000          persp.                    7.25990×10^5   3600.38  20.81
3000          eig. persp.               7.18339×10^5   3600.06  23.35
3000          ours                      7.41086×10^5   3600.48  8.32
4000          SOS1                      9.99525×10^5   3600.07  15.44
4000          big-M50                   1.00277×10^6   3600.09  14.85
4000          persp.                    9.91636×10^5   3601.08  14.29
4000          eig. persp.               9.92065×10^5   3600.10  16.78
4000          ours                      1.00277×10^6   3600.62  1.66
5000          SOS1                      1.23548×10^6   3600.08  12.44
5000          big-M50                   1.24017×10^6   3600.25  11.82
5000          persp.                    1.23555×10^6   3600.42  11.18
5000          eig. persp.               1.21712×10^6   3600.05  14.73
5000          ours                      1.24017×10^6   386.72   0.00
6000          SOS1                      1.42629×10^6   3600.81  12.11
6000          big-M50                   1.45406×10^6   3600.07  9.75
6000          persp.                    1.45406×10^6   3600.07  9.44
6000          eig. persp.               1.45406×10^6   2941.57  0.00
6000          ours                      1.45406×10^6   54.13    0.00
7000          SOS1                      1.69139×10^6   3600.49  9.17
7000          big-M50                   1.70473×10^6   3600.81  8.32
7000          persp.                    1.70473×10^6   3600.06  7.85
7000          eig. persp.               1.70473×10^6   1585.19  0.00
7000          ours                      1.70473×10^6   21.30    0.00

Table 54: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 10.0).

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1                      2.78026×10^6   3600.08  18.78
3000          big-M50                   2.78507×10^6   3600.17  15.95
3000          persp.                    2.75178×10^6   3600.06  20.10
3000          eig. persp.               2.73809×10^6   3600.11  21.60
3000          ours                      2.78507×10^6   3600.78  9.79
4000          SOS1                      3.74696×10^6   3600.08  15.12
4000          big-M50                   3.75192×10^6   3600.06  13.19
4000          persp.                    3.73412×10^6   3600.07  14.42
4000          eig. persp.               3.73458×10^6   3600.81  16.31
4000          ours                      3.75192×10^6   3600.32  3.23
5000          SOS1                      4.60115×10^6   3600.29  12.47
5000          big-M50                   4.61867×10^6   3600.17  11.32
5000          persp.                    4.60427×10^6   3600.07  11.73
5000          eig. persp.               4.58981×10^6   3600.07  13.56
5000          ours                      4.61867×10^6   3600.30  0.89
6000          SOS1                      5.42828×10^6   3600.08  10.18
6000          big-M50                   5.43816×10^6   3600.12  9.75
6000          persp.                    5.41942×10^6   3600.07  9.90
6000          eig. persp.               5.41142×10^6   3600.42  11.59
6000          ours                      5.43816×10^6   3600.22  0.32
7000          SOS1                      6.33201×10^6   3600.05  8.93
7000          big-M50                   6.36757×10^6   3600.32  8.20
7000          persp.                    6.35270×10^6   3600.06  8.23
7000          eig. persp.               6.36342×10^6   3600.05  0.46
7000          ours                      6.36757×10^6   3600.06  0.10

Table 55: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 10.0).
G.2.12 Synthetic 2 Benchmark with λ2 = 10.0 with Warmstart

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         6.33576×10^4   3601.40  18.27
3000          big-M50 + warm start      6.33576×10^4   3600.12  18.47
3000          persp. + warm start       6.33576×10^4   3601.92  17.53
3000          eig. persp. + warm start  6.33576×10^4   3600.22  19.61
3000          ours                      6.33576×10^4   3601.41  4.54
4000          SOS1 + warm start         8.63287×10^4   3600.08  15.23
4000          big-M50 + warm start      8.63287×10^4   3600.29  15.21
4000          persp. + warm start       8.63287×10^4   3600.30  12.69
4000          eig. persp. + warm start  8.63287×10^4   3600.38  1.32
4000          ours                      8.63287×10^4   20.33    0.00
5000          SOS1 + warm start         1.09474×10^5   3600.06  11.86
5000          big-M50 + warm start      1.09474×10^5   3600.15  11.85
5000          persp. + warm start       1.09474×10^5   3601.20  10.44
5000          eig. persp. + warm start  1.09474×10^5   3601.16  13.51
5000          ours                      1.09474×10^5   20.71    0.00
6000          SOS1 + warm start         1.27542×10^5   3600.07  9.90
6000          big-M50 + warm start      1.27542×10^5   3601.39  9.89
6000          persp. + warm start       1.27542×10^5   3600.21  9.03
6000          eig. persp. + warm start  1.27542×10^5   3601.11  12.70
6000          ours                      1.27542×10^5   1.69     0.00
7000          SOS1 + warm start         1.50512×10^5   3600.28  5.83
7000          big-M50 + warm start      1.50512×10^5   3601.02  8.31
7000          persp. + warm start       1.50512×10^5   3600.16  7.73
7000          eig. persp. + warm start  1.50512×10^5   104.66   0.00
7000          ours                      1.50512×10^5   1.70     0.00

Table 56: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         1.59409×10^5   3601.55  18.52
3000          big-M50 + warm start      1.59409×10^5   3600.67  18.52
3000          persp. + warm start       1.59409×10^5   3600.26  17.84
3000          eig. persp. + warm start  1.59409×10^5   3600.07  19.55
3000          ours                      1.59409×10^5   3600.62  4.68
4000          SOS1 + warm start         2.17472×10^5   3600.06  15.13
4000          big-M50 + warm start      2.17472×10^5   3600.08  14.64
4000          persp. + warm start       2.17472×10^5   3600.78  12.93
4000          eig. persp. + warm start  2.17472×10^5   3600.35  2.79
4000          ours                      2.17472×10^5   86.90    0.00
5000          SOS1 + warm start         2.71885×10^5   3601.07  11.94
5000          big-M50 + warm start      2.71885×10^5   3600.07  11.16
5000          persp. + warm start       2.71885×10^5   3600.22  10.25
5000          eig. persp. + warm start  2.71885×10^5   3109.70  0.00
5000          ours                      2.71885×10^5   20.75    0.00
6000          SOS1 + warm start         3.16660×10^5   3600.92  9.93
6000          big-M50 + warm start      3.16660×10^5   3600.06  9.93
6000          persp. + warm start       3.16660×10^5   3600.04  9.24
6000          eig. persp. + warm start  3.16660×10^5   3601.01  11.80
6000          ours                      3.16660×10^5   21.05    0.00
7000          SOS1 + warm start         3.72509×10^5   3600.29  8.31
7000          big-M50 + warm start      3.72509×10^5   3601.61  8.31
7000          persp. + warm start       3.72509×10^5   3600.12  7.78
7000          eig. persp. + warm start  3.72509×10^5   3600.07  9.46
7000          ours                      3.72509×10^5   1.71     0.00

Table 57: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 10.0). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         3.33503×10^5   3600.95  18.55
3000          big-M50 + warm start      3.33503×10^5   3600.08  18.43
3000          persp. + warm start       3.33503×10^5   3600.40  18.09
3000          eig. persp. + warm start  3.33503×10^5   3600.06  19.55
3000          ours                      3.33503×10^5   3600.25  6.21
4000          SOS1 + warm start         4.53159×10^5   3600.06  15.09
4000          big-M50 + warm start      4.53159×10^5   3600.06  14.84
4000          persp. + warm start       4.53159×10^5   3673.11  12.29
4000          eig. persp. + warm start  4.53159×10^5   3600.09  15.34
4000          ours                      4.53159×10^5   3600.95  0.35
5000          SOS1 + warm start         5.62906×10^5   3600.07  11.98
5000          big-M50 + warm start      5.62906×10^5   3600.08  11.59
5000          persp. + warm start       5.62906×10^5   3600.06  10.92
5000          eig. persp. + warm start  5.62906×10^5   3600.31  0.24
5000          ours                      5.62906×10^5   31.47    0.00
6000          SOS1 + warm start         6.57698×10^5   3600.88  9.95
6000          big-M50 + warm start      6.57698×10^5   3600.14  9.95
6000          persp. + warm start       6.57698×10^5   3600.06  9.02
6000          eig. persp. + warm start  6.57698×10^5   1703.35  0.00
6000          ours                      6.57698×10^5   21.24    0.00
7000          SOS1 + warm start         7.72166×10^5   3600.06  8.32
7000          big-M50 + warm start      7.72166×10^5   3601.52  8.31
7000          persp. + warm start       7.72166×10^5   3600.16  7.90
7000          eig. persp. + warm start  7.72166×10^5   3600.41  10.39
7000          ours                      7.72166×10^5   20.37    0.00

Table 58: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 10.0). All baselines use our beam search solution as a warm start.

# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         7.41086×10^5   3600.63  18.58
3000          big-M50 + warm start      7.41086×10^5   3601.32  17.88
3000          persp. + warm start       7.41086×10^5   3600.59  18.35
3000          eig. persp. + warm start  7.41086×10^5   3600.49  19.55
3000          ours                      7.41086×10^5   3600.48  8.32
4000          SOS1 + warm start         1.00277×10^6   3600.31  15.06
4000          big-M50 + warm start      1.00277×10^6   3600.44  14.85
4000          persp. + warm start       1.00277×10^6   3600.06  13.03
4000          eig. persp. + warm start  1.00277×10^6   3600.21  15.54
4000          ours                      1.00277×10^6   3600.62  1.66
5000          SOS1 + warm start         1.24017×10^6   3600.99  12.01
5000          big-M50 + warm start      1.24017×10^6   3600.27  11.82
5000          persp. + warm start       1.24017×10^6   3600.75  10.76
5000          eig. persp. + warm start  1.24017×10^6   3600.07  12.59
5000          ours                      1.24017×10^6   386.72   0.00
6000          SOS1 + warm start         1.45406×10^6   3600.07  9.97
6000          big-M50 + warm start      1.45406×10^6   3600.26  9.75
6000          persp. + warm start       1.45406×10^6   3600.31  9.44
6000          eig. persp. + warm start  1.45406×10^6   2855.45  0.00
6000          ours                      1.45406×10^6   54.13    0.00
7000          SOS1 + warm start         1.70473×10^6   3601.18  8.32
7000          big-M50 + warm start      1.70473×10^6   3600.20  8.32
7000          persp. + warm start       1.70473×10^6   3600.18  7.85
7000          eig. persp. + warm start  1.70473×10^6   1264.32  0.00
7000          ours                      1.70473×10^6   21.30    0.00

Table 59: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of samples  method                    upper bound    time(s)  gap(%)
3000          SOS1 + warm start         2.78507×10^6   3601.20  18.57
3000          big-M50 + warm start      2.78507×10^6   3600.07  15.95
3000          persp. + warm start       2.78507×10^6   3600.06  18.67
3000          eig. persp. + warm start  2.78507×10^6   3600.36  19.55
3000          ours                      2.78507×10^6   3600.78  9.79
4000          SOS1 + warm start         3.75192×10^6   3600.30  14.97
4000          big-M50 + warm start      3.75192×10^6   3600.06  13.19
4000          persp. + warm start       3.75192×10^6   3600.05  13.88
4000          eig. persp. + warm start  3.75192×10^6   3600.06  15.77
4000          ours                      3.75192×10^6   3600.32  3.23
5000          SOS1 + warm start         4.61867×10^6   3601.14  12.04
5000          big-M50 + warm start      4.61867×10^6   3600.06  11.32
5000          persp. + warm start       4.61867×10^6   3600.06  11.39
5000          eig. persp. + warm start  4.61867×10^6   3600.07  12.85
5000          ours                      4.61867×10^6   3600.30  0.89
6000          SOS1 + warm start         5.43816×10^6   3600.06  9.98
6000          big-M50 + warm start      5.43816×10^6   3600.39  9.75
6000          persp. + warm start       5.43816×10^6   3600.10  9.52
6000          eig. persp. + warm start  5.43816×10^6   3600.13  11.05
6000          ours                      5.43816×10^6   3600.22  0.32
7000          SOS1 + warm start         6.36757×10^6   3600.54  8.33
7000          big-M50 + warm start      6.36757×10^6   3600.10  8.20
7000          persp. + warm start       6.36757×10^6   3600.13  7.98
7000          eig. persp. + warm start  6.36757×10^6   3600.25  0.34
7000          ours                      6.36757×10^6   3600.06  0.10

Table 60: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 10.0). All baselines use our beam search solution as a warm start.
G.3 Comparison with the Big-M Formulation with Different M Values

G.3.1 Synthetic 1 Benchmark with λ2 = 0.001 with No Warmstart

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   2.10874×10^6   0.39     0.00
100            big-M20                   2.10878×10^6   0.44     0.00
100            big-M5                    2.10878×10^6   0.32     0.00
100            ours                      2.10878×10^6   0.29     0.00
500            big-M50                   2.12828×10^6   36.80    0.00
500            big-M20                   2.12833×10^6   81.36    0.00
500            big-M5                    2.12833×10^6   52.95    0.00
500            ours                      2.12833×10^6   0.37     0.00
1000           big-M50                   2.11005×10^6   1911.41  0.00
1000           big-M20                   2.11005×10^6   1260.21  0.00
1000           big-M5                    2.11005×10^6   771.63   0.00
1000           ours                      2.11005×10^6   0.58     0.00
3000           big-M50                   7.83869×10^5   3601.20  169.78
3000           big-M20                   1.14989×10^6   3603.59  83.91
3000           big-M5                    2.10230×10^6   1998.40  0.00
3000           ours                      2.10230×10^6   1.73     0.00
5000           big-M50                   1.87965×10^6   3603.25  12.65
5000           big-M20                   9.85913×10^5   3602.87  114.78
5000           big-M5                    -              3618.45  -
5000           ours                      2.09635×10^6   4.22     0.00

Table 61: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.001).

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   5.28725×10^6   0.42     0.00
100            big-M20                   5.28726×10^6   0.45     0.00
100            big-M5                    5.28726×10^6   0.36     0.00
100            ours                      5.28726×10^6   0.29     0.00
500            big-M50                   5.31819×10^6   89.04    0.00
500            big-M20                   5.31845×10^6   75.86    0.00
500            big-M5                    5.31852×10^6   51.98    0.00
500            ours                      5.31852×10^6   0.41     0.00
1000           big-M50                   5.26974×10^6   1528.98  0.00
1000           big-M20                   5.26973×10^6   1180.33  0.00
1000           big-M5                    5.26974×10^6   794.14   0.00
1000           ours                      5.26974×10^6   0.52     0.00
3000           big-M50                   3.72978×10^6   3602.55  42.31
3000           big-M20                   3.97937×10^6   3604.47  33.38
3000           big-M5                    -              3660.43  -
3000           ours                      5.27668×10^6   1.75     0.00
5000           big-M50                   4.16126×10^6   3602.53  27.95
5000           big-M20                   4.20081×10^6   3601.38  26.74
5000           big-M5                    5.27126×10^6   3600.14  1.00
5000           ours                      5.27126×10^6   4.32     0.00

Table 62: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.001).
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   1.10084×10^7   0.41     0.00
100            big-M20                   1.10084×10^7   0.37     0.00
100            big-M5                    1.10084×10^7   0.33     0.00
100            ours                      1.10084×10^7   0.29     0.00
500            big-M50                   1.10515×10^7   89.59    0.00
500            big-M20                   1.10518×10^7   79.56    0.00
500            big-M5                    1.10518×10^7   54.59    0.00
500            ours                      1.10518×10^7   0.40     0.00
1000           big-M50                   1.09506×10^7   1757.09  0.00
1000           big-M20                   1.09508×10^7   1133.80  0.00
1000           big-M5                    1.09508×10^7   744.15   0.00
1000           ours                      1.09508×10^7   0.59     0.00
3000           big-M50                   -              3615.51  -
3000           big-M20                   -              3635.15  -
3000           big-M5                    -              3648.05  -
3000           ours                      1.10012×10^7   1.74     0.00
5000           big-M50                   -              3612.62  -
5000           big-M20                   1.01062×10^7   3612.19  9.90
5000           big-M5                    -              3605.20  -
5000           ours                      1.09966×10^7   4.00     0.00

Table 63: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.001).

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   2.43573×10^7   0.46     0.00
100            big-M20                   2.43575×10^7   0.38     0.00
100            big-M5                    2.43575×10^7   0.31     0.00
100            ours                      2.43575×10^7   0.32     0.00
500            big-M50                   2.44178×10^7   92.53    0.00
500            big-M20                   2.44185×10^7   75.08    0.00
500            big-M5                    2.44185×10^7   52.27    0.00
500            ours                      2.44185×10^7   0.46     0.00
1000           big-M50                   2.41987×10^7   1361.05  0.00
1000           big-M20                   2.41988×10^7   974.43   0.00
1000           big-M5                    2.41989×10^7   681.46   0.00
1000           ours                      2.41989×10^7   0.52     0.00
3000           big-M50                   -              3640.12  -
3000           big-M20                   -              3637.70  -
3000           big-M5                    2.43713×10^7   3600.23  0.56
3000           ours                      2.43713×10^7   1.77     0.00
5000           big-M50                   2.43694×10^7   3643.28  1.00
5000           big-M20                   2.43694×10^7   3602.42  1.00
5000           big-M5                    -              3631.91  -
5000           ours                      2.43694×10^7   4.15     0.00

Table 64: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.001).
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   9.11026×10^7   0.13     0.00
100            big-M20                   9.11026×10^7   0.13     0.00
100            big-M5                    9.11026×10^7   0.10     0.00
100            ours                      9.11026×10^7   0.30     0.00
500            big-M50                   9.12047×10^7   94.03    0.00
500            big-M20                   9.12047×10^7   90.67    0.00
500            big-M5                    9.12047×10^7   185.31   0.00
500            ours                      9.12047×10^7   0.38     0.00
1000           big-M50                   9.04075×10^7   3600.01  0.07
1000           big-M20                   9.04075×10^7   3600.50  0.20
1000           big-M5                    9.04075×10^7   2081.96  0.00
1000           ours                      9.04075×10^7   0.52     0.00
3000           big-M50                   -              3605.36  -
3000           big-M20                   -              3623.06  -
3000           big-M5                    -              3611.34  -
3000           ours                      9.12761×10^7   20.74    0.00
5000           big-M50                   -              3607.68  -
5000           big-M20                   -              3616.13  -
5000           big-M5                    -              3627.90  -
5000           ours                      9.12916×10^7   58.29    0.00

Table 65: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.001).

G.3.2 Synthetic 1 Benchmark with λ2 = 0.001 with Warmstart

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50 + warm start      2.10878×10^6   0.08     0.00
100            big-M20 + warm start      2.10878×10^6   0.07     0.00
100            big-M5 + warm start       2.10878×10^6   0.05     0.00
100            ours                      2.10878×10^6   0.29     0.00
500            big-M50 + warm start      2.12833×10^6   37.62    0.00
500            big-M20 + warm start      2.12833×10^6   2.82     0.00
500            big-M5 + warm start       2.12833×10^6   3.00     0.00
500            ours                      2.12833×10^6   0.37     0.00
1000           big-M50 + warm start      2.11005×10^6   28.51    0.00
1000           big-M20 + warm start      2.11005×10^6   17.68    0.00
1000           big-M5 + warm start       2.11005×10^6   25.46    0.00
1000           ours                      2.11005×10^6   0.58     0.00
3000           big-M50 + warm start      2.10230×10^6   407.92   0.00
3000           big-M20 + warm start      2.10230×10^6   255.31   0.00
3000           big-M5 + warm start       2.10230×10^6   400.99   0.00
3000           ours                      2.10230×10^6   1.73     0.00
5000           big-M50 + warm start      2.09635×10^6   1380.38  0.00
5000           big-M20 + warm start      2.09635×10^6   1107.68  0.00
5000           big-M5 + warm start       2.09635×10^6   1673.77  0.00
5000           ours                      2.09635×10^6   4.22     0.00

Table 66: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50 + warm start      5.28726×10^6   0.26     0.00
100            big-M20 + warm start      5.28726×10^6   0.17     0.00
100            big-M5 + warm start       5.28726×10^6   0.05     0.00
100            ours                      5.28726×10^6   0.29     0.00
500            big-M50 + warm start      5.31852×10^6   3.00     0.00
500            big-M20 + warm start      5.31852×10^6   12.76    0.00
500            big-M5 + warm start       5.31852×10^6   1.94     0.00
500            ours                      5.31852×10^6   0.41     0.00
1000           big-M50 + warm start      5.26974×10^6   13.79    0.00
1000           big-M20 + warm start      5.26974×10^6   18.04    0.00
1000           big-M5 + warm start       5.26974×10^6   12.19    0.00
1000           ours                      5.26974×10^6   0.52     0.00
3000           big-M50 + warm start      5.27668×10^6   313.22   0.00
3000           big-M20 + warm start      5.27668×10^6   461.86   0.00
3000           big-M5 + warm start       5.27668×10^6   245.59   0.00
3000           ours                      5.27668×10^6   1.75     0.00
5000           big-M50 + warm start      5.27126×10^6   1705.37  0.00
5000           big-M20 + warm start      5.27126×10^6   1666.20  0.00
5000           big-M5 + warm start       5.27126×10^6   3601.58  1.00
5000           ours                      5.27126×10^6   4.32     0.00

Table 67: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50 + warm start      1.10084×10^7   0.06     0.00
100            big-M20 + warm start      1.10084×10^7   0.06     0.00
100            big-M5 + warm start       1.10084×10^7   0.04     0.00
100            ours                      1.10084×10^7   0.29     0.00
500            big-M50 + warm start      1.10518×10^7   11.59    0.00
500            big-M20 + warm start      1.10518×10^7   6.55     0.00
500            big-M5 + warm start       1.10518×10^7   1.97     0.00
500            ours                      1.10518×10^7   0.40     0.00
1000           big-M50 + warm start      1.09508×10^7   21.16    0.00
1000           big-M20 + warm start      1.09508×10^7   14.95    0.00
1000           big-M5 + warm start       1.09508×10^7   12.99    0.00
1000           ours                      1.09508×10^7   0.59     0.00
3000           big-M50 + warm start      1.10012×10^7   243.42   0.00
3000           big-M20 + warm start      1.10012×10^7   301.69   0.00
3000           big-M5 + warm start       1.10012×10^7   242.64   0.00
3000           ours                      1.10012×10^7   1.74     0.00
5000           big-M50 + warm start      1.09966×10^7   3600.95  1.00
5000           big-M20 + warm start      1.09966×10^7   3600.28  1.00
5000           big-M5 + warm start       1.09966×10^7   3602.10  570161330.64
5000           ours                      1.09966×10^7   4.00     0.00

Table 68: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.001). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50 + warm start      2.43575×10^7   0.05     0.00
100            big-M20 + warm start      2.43575×10^7   0.26     0.00
100            big-M5 + warm start       2.43575×10^7   0.05     0.00
100            ours                      2.43575×10^7   0.32     0.00
500            big-M50 + warm start      2.44185×10^7   4.02     0.00
500            big-M20 + warm start      2.44185×10^7   2.15     0.00
500            big-M5 + warm start       2.44185×10^7   20.44    0.00
500            ours                      2.44185×10^7   0.46     0.00
1000           big-M50 + warm start      2.41989×10^7   11.63    0.00
1000           big-M20 + warm start      2.41989×10^7   13.11    0.00
1000           big-M5 + warm start       2.41989×10^7   10.49    0.00
1000           ours                      2.41989×10^7   0.52     0.00
3000           big-M50 + warm start      2.43713×10^7   3608.31  0.59
3000           big-M20 + warm start      2.43713×10^7   3600.14  0.59
3000           big-M5 + warm start       2.43713×10^7   3600.95  0.56
3000           ours                      2.43713×10^7   1.77     0.00
5000           big-M50 + warm start      2.43694×10^7   3670.94  1.00
5000           big-M20 + warm start      2.43694×10^7   3604.91  1.00
5000           big-M5 + warm start       2.43694×10^7   3646.34  600317552.81
5000           ours                      2.43694×10^7   4.15     0.00

Table 69: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.001). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                    upper bound    time(s)  gap(%)
100            big-M50 + warm start      9.11026×10^7   0.03     0.00
100            big-M20 + warm start      9.11026×10^7   0.03     0.00
100            big-M5 + warm start       9.11026×10^7   0.04     0.00
100            ours                      9.11026×10^7   0.30     0.00
500            big-M50 + warm start      9.12047×10^7   34.86    0.00
500            big-M20 + warm start      9.12047×10^7   28.75    0.00
500            big-M5 + warm start       9.12047×10^7   18.67    0.00
500            ours                      9.12047×10^7   0.38     0.00
1000           big-M50 + warm start      9.04075×10^7   3600.01  0.08
1000           big-M20 + warm start      9.04075×10^7   3600.04  0.20
1000           big-M5 + warm start       9.04075×10^7   1662.95  0.00
1000           ours                      9.04075×10^7   0.52     0.00
3000           big-M50 + warm start      9.12761×10^7   3612.88  0.59
3000           big-M20 + warm start      9.12761×10^7   3600.12  0.59
3000           big-M5 + warm start       9.12761×10^7   3613.07  222997912.50
3000           ours                      9.12761×10^7   20.74    0.00
5000           big-M50 + warm start      9.12916×10^7   3622.52  1.00
5000           big-M20 + warm start      9.12916×10^7   3606.87  1.00
5000           big-M5 + warm start       9.12916×10^7   3600.86  618133887.09
5000           ours                      9.12916×10^7   58.29    0.00

Table 70: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.001). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
G.3.3 Synthetic 1 Benchmark with λ2 = 0.1 with No Warmstart

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   2.10878×10^6   0.86     0.00
100            big-M20                   2.10878×10^6   0.41     0.00
100            big-M5                    2.10878×10^6   0.32     0.00
100            ours                      2.10878×10^6   0.36     0.00
500            big-M50                   2.12833×10^6   37.96    0.00
500            big-M20                   2.12833×10^6   27.57    0.00
500            big-M5                    2.12833×10^6   55.79    0.00
500            ours                      2.12833×10^6   0.37     0.00
1000           big-M50                   2.11005×10^6   1007.27  0.00
1000           big-M20                   2.11005×10^6   373.44   0.00
1000           big-M5                    2.11005×10^6   74.44    0.00
1000           ours                      2.11005×10^6   0.57     0.00
3000           big-M50                   7.99504×10^5   3608.36  164.51
3000           big-M20                   1.15295×10^6   3603.16  83.42
3000           big-M5                    2.10230×10^6   1289.57  0.00
3000           ours                      2.10230×10^6   1.79     0.00
5000           big-M50                   9.69753×10^5   3619.50  118.35
5000           big-M20                   9.85912×10^5   3611.26  114.78
5000           big-M5                    -              3610.20  -
5000           ours                      2.09635×10^6   4.20     0.00

Table 71: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.1).

# of features  method                    upper bound    time(s)  gap(%)
100            big-M50                   4.60269×10^6   1.83     0.00
100            big-M20                   5.28726×10^6   0.40     0.00
100            big-M5                    5.28726×10^6   0.34     0.00
100            ours                      5.28726×10^6   0.27     0.00
500            big-M50                   5.31827×10^6   85.52    0.00
500            big-M20                   5.31844×10^6   72.87    0.00
500            big-M5                    5.31852×10^6   52.81    0.00
500            ours                      5.31852×10^6   0.41     0.00
1000           big-M50                   5.26974×10^6   1576.75  0.00
1000           big-M20                   5.26974×10^6   1112.97  0.00
1000           big-M5                    5.26974×10^6   791.17   0.00
1000           ours                      5.26974×10^6   0.52     0.00
3000           big-M50                   3.73049×10^6   3603.09  42.28
3000           big-M20                   3.98562×10^6   3602.15  33.17
3000           big-M5                    -              3634.94  -
3000           ours                      5.27668×10^6   1.84     0.00
5000           big-M50                   3.98069×10^6   3606.79  33.75
5000           big-M20                   4.58351×10^6   3615.50  16.16
5000           big-M5                    5.27126×10^6   3600.14  1.00
5000           ours                      5.27126×10^6   4.42     0.00

Table 72: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.1).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  1.10084e+07  0.44     0.00
100   big-M20  1.10084e+07  0.37     0.00
100   big-M5   1.10084e+07  0.33     0.00
100   ours     1.10084e+07  0.32     0.00
500   big-M50  1.10515e+07  86.52    0.00
500   big-M20  1.10518e+07  77.61    0.00
500   big-M5   1.10518e+07  51.24    0.00
500   ours     1.10518e+07  0.40     0.00
1000  big-M50  1.09506e+07  1539.25  0.00
1000  big-M20  1.09508e+07  1118.41  0.00
1000  big-M5   1.09508e+07  893.44   0.00
1000  ours     1.09508e+07  0.53     0.00
3000  big-M50  -            3608.09  -
3000  big-M20  -            3614.30  -
3000  big-M5   -            3610.50  -
3000  ours     1.10012e+07  1.67     0.00
5000  big-M50  -            3649.14  -
5000  big-M20  1.01130e+07  3600.71  9.83
5000  big-M5   -            3628.44  -
5000  ours     1.09966e+07  4.13     0.00
Table 73: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.1).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  2.43573e+07  0.41     0.00
100   big-M20  2.43575e+07  0.39     0.00
100   big-M5   2.43575e+07  0.30     0.00
100   ours     2.43575e+07  0.34     0.00
500   big-M50  2.44167e+07  96.59    0.00
500   big-M20  2.44185e+07  72.41    0.00
500   big-M5   2.44185e+07  54.87    0.00
500   ours     2.44185e+07  0.44     0.00
1000  big-M50  2.41986e+07  1608.58  0.00
1000  big-M20  2.41988e+07  984.16   0.00
1000  big-M5   2.41989e+07  639.70   0.00
1000  ours     2.41989e+07  0.53     0.00
3000  big-M50  -            3605.20  -
3000  big-M20  -            3617.57  -
3000  big-M5   2.43713e+07  3600.08  0.56
3000  ours     2.43713e+07  1.75     0.00
5000  big-M50  2.43694e+07  3650.52  1.00
5000  big-M20  2.43694e+07  3624.42  1.00
5000  big-M5   -            3650.91  -
5000  ours     2.43694e+07  4.22     0.00
Table 74: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.1).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  9.11026e+07  0.18     0.00
100   big-M20  9.11026e+07  0.11     0.00
100   big-M5   9.11026e+07  0.12     0.00
100   ours     9.11026e+07  0.31     0.00
500   big-M50  9.12047e+07  398.02   0.00
500   big-M20  9.12047e+07  113.34   0.00
500   big-M5   9.12047e+07  187.63   0.00
500   ours     9.12047e+07  0.38     0.00
1000  big-M50  9.04075e+07  3600.05  0.01
1000  big-M20  9.04075e+07  3600.24  0.20
1000  big-M5   9.04075e+07  2116.59  0.00
1000  ours     9.04075e+07  0.52     0.00
3000  big-M50  -            3619.63  -
3000  big-M20  -            3613.39  -
3000  big-M5   -            3609.80  -
3000  ours     9.12761e+07  20.88    0.00
5000  big-M50  -            3620.37  -
5000  big-M20  -            3603.08  -
5000  big-M5   -            3623.71  -
5000  ours     9.12916e+07  58.07    0.00
Table 75: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.1).
G.3.4 Synthetic 1 Benchmark with λ2 = 0.1 with Warmstart
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  2.10878e+06  0.08     0.00
100   big-M20 + warm start  2.10878e+06  0.17     0.00
100   big-M5 + warm start   2.10878e+06  0.05     0.00
100   ours                  2.10878e+06  0.36     0.00
500   big-M50 + warm start  2.12833e+06  3.78     0.00
500   big-M20 + warm start  2.12833e+06  2.75     0.00
500   big-M5 + warm start   2.12833e+06  3.10     0.00
500   ours                  2.12833e+06  0.37     0.00
1000  big-M50 + warm start  2.11005e+06  26.94    0.00
1000  big-M20 + warm start  2.11005e+06  18.08    0.00
1000  big-M5 + warm start   2.11005e+06  16.21    0.00
1000  ours                  2.11005e+06  0.57     0.00
3000  big-M50 + warm start  2.10230e+06  193.85   0.00
3000  big-M20 + warm start  2.10230e+06  255.55   0.00
3000  big-M5 + warm start   2.10230e+06  283.37   0.00
3000  ours                  2.10230e+06  1.79     0.00
5000  big-M50 + warm start  2.09635e+06  1366.14  0.00
5000  big-M20 + warm start  2.09635e+06  1114.05  0.00
5000  big-M5 + warm start   2.09635e+06  1660.23  0.00
5000  ours                  2.09635e+06  4.20     0.00
Table 76: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.1). All baselines use our beam search solution as a warm start.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  5.28726e+06  0.07     0.00
100   big-M20 + warm start  5.28726e+06  0.07     0.00
100   big-M5 + warm start   5.28726e+06  0.05     0.00
100   ours                  5.28726e+06  0.27     0.00
500   big-M50 + warm start  5.31852e+06  39.88    0.00
500   big-M20 + warm start  5.31852e+06  3.02     0.00
500   big-M5 + warm start   5.31852e+06  1.86     0.00
500   ours                  5.31852e+06  0.41     0.00
1000  big-M50 + warm start  5.26974e+06  14.33    0.00
1000  big-M20 + warm start  5.26974e+06  14.92    0.00
1000  big-M5 + warm start   5.26974e+06  13.35    0.00
1000  ours                  5.26974e+06  0.52     0.00
3000  big-M50 + warm start  5.27668e+06  317.41   0.00
3000  big-M20 + warm start  5.27668e+06  470.35   0.00
3000  big-M5 + warm start   5.27668e+06  249.02   0.00
3000  ours                  5.27668e+06  1.84     0.00
5000  big-M50 + warm start  5.27126e+06  1658.42  0.00
5000  big-M20 + warm start  5.27126e+06  3607.98  1.00
5000  big-M5 + warm start   5.27126e+06  3600.14  1.00
5000  ours                  5.27126e+06  4.42     0.00
Table 77: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.1). All baselines use our beam search solution as a warm start.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  1.10084e+07  0.06     0.00
100   big-M20 + warm start  1.10084e+07  0.06     0.00
100   big-M5 + warm start   1.10084e+07  0.04     0.00
100   ours                  1.10084e+07  0.32     0.00
500   big-M50 + warm start  1.10518e+07  2.28     0.00
500   big-M20 + warm start  1.10518e+07  6.61     0.00
500   big-M5 + warm start   1.10518e+07  13.48    0.00
500   ours                  1.10518e+07  0.40     0.00
1000  big-M50 + warm start  1.09508e+07  14.11    0.00
1000  big-M20 + warm start  1.09508e+07  13.83    0.00
1000  big-M5 + warm start   1.09508e+07  12.46    0.00
1000  ours                  1.09508e+07  0.53     0.00
3000  big-M50 + warm start  1.10012e+07  222.56   0.00
3000  big-M20 + warm start  1.10012e+07  290.65   0.00
3000  big-M5 + warm start   1.10012e+07  251.83   0.00
3000  ours                  1.10012e+07  1.67     0.00
5000  big-M50 + warm start  1.09966e+07  3606.43  1.00
5000  big-M20 + warm start  1.09966e+07  3611.47  1.00
5000  big-M5 + warm start   1.09966e+07  3631.72  570161381.99
5000  ours                  1.09966e+07  4.13     0.00
Table 78: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.1). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  2.43575e+07  0.06     0.00
100   big-M20 + warm start  2.43575e+07  0.24     0.00
100   big-M5 + warm start   2.43575e+07  0.04     0.00
100   ours                  2.43575e+07  0.34     0.00
500   big-M50 + warm start  2.44185e+07  4.00     0.00
500   big-M20 + warm start  2.44185e+07  8.60     0.00
500   big-M5 + warm start   2.44185e+07  19.75    0.00
500   ours                  2.44185e+07  0.44     0.00
1000  big-M50 + warm start  2.41989e+07  12.62    0.00
1000  big-M20 + warm start  2.41989e+07  12.12    0.00
1000  big-M5 + warm start   2.41989e+07  9.30     0.00
1000  ours                  2.41989e+07  0.53     0.00
3000  big-M50 + warm start  2.43713e+07  3600.43  0.59
3000  big-M20 + warm start  2.43713e+07  3600.05  0.59
3000  big-M5 + warm start   2.43713e+07  3600.15  0.56
3000  ours                  2.43713e+07  1.75     0.00
5000  big-M50 + warm start  2.43694e+07  3616.71  1.00
5000  big-M20 + warm start  2.43694e+07  3615.82  1.00
5000  big-M5 + warm start   2.43694e+07  3626.45  600317577.18
5000  ours                  2.43694e+07  4.22     0.00
Table 79: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.1). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  9.11026e+07  0.04     0.00
100   big-M20 + warm start  9.11026e+07  0.03     0.00
100   big-M5 + warm start   9.11026e+07  0.03     0.00
100   ours                  9.11026e+07  0.31     0.00
500   big-M50 + warm start  9.12047e+07  14.34    0.00
500   big-M20 + warm start  9.12047e+07  28.63    0.00
500   big-M5 + warm start   9.12047e+07  18.60    0.00
500   ours                  9.12047e+07  0.38     0.00
1000  big-M50 + warm start  9.04075e+07  3600.16  0.01
1000  big-M20 + warm start  9.04075e+07  3600.11  0.20
1000  big-M5 + warm start   9.04075e+07  1687.23  0.00
1000  ours                  9.04075e+07  0.52     0.00
3000  big-M50 + warm start  9.12761e+07  3611.02  0.59
3000  big-M20 + warm start  9.12761e+07  3601.67  0.59
3000  big-M5 + warm start   9.12761e+07  3602.09  222997914.92
3000  ours                  9.12761e+07  20.88    0.00
5000  big-M50 + warm start  9.12916e+07  3606.64  1.00
5000  big-M20 + warm start  9.12916e+07  3610.82  1.00
5000  big-M5 + warm start   9.12916e+07  3632.24  618133893.79
5000  ours                  9.12916e+07  58.07    0.00
Table 80: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.1). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
G.3.5 Synthetic 1 Benchmark with λ2 = 10.0 with No Warmstart
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  2.10868e+06  0.28     0.00
100   big-M20  2.10868e+06  0.44     0.00
100   big-M5   2.10868e+06  0.32     0.00
100   ours     2.10868e+06  0.30     0.00
500   big-M50  2.12819e+06  42.85    0.00
500   big-M20  2.12823e+06  28.73    0.00
500   big-M5   2.12823e+06  57.10    0.00
500   ours     2.12823e+06  0.38     0.00
1000  big-M50  2.10995e+06  1072.15  0.00
1000  big-M20  2.10995e+06  451.79   0.00
1000  big-M5   2.10995e+06  904.01   0.00
1000  ours     2.10995e+06  0.57     0.00
3000  big-M50  7.92671e+05  3605.01  166.77
3000  big-M20  9.76712e+05  3602.25  116.51
3000  big-M5   -            3623.24  -
3000  ours     2.10220e+06  1.68     0.00
5000  big-M50  7.89596e+05  3617.52  168.16
5000  big-M20  9.77684e+05  3612.02  116.57
5000  big-M5   1.87959e+06  3600.23  12.65
5000  ours     2.09625e+06  5.09     0.00
Table 81: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 10.0).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  5.28713e+06  0.40     0.00
100   big-M20  5.28716e+06  0.39     0.00
100   big-M5   5.28716e+06  0.35     0.00
100   ours     5.28716e+06  0.30     0.00
500   big-M50  5.31819e+06  89.34    0.00
500   big-M20  5.31835e+06  73.89    0.00
500   big-M5   5.31842e+06  52.55    0.00
500   ours     5.31842e+06  0.41     0.00
1000  big-M50  5.26964e+06  1869.66  0.00
1000  big-M20  5.26964e+06  1241.04  0.00
1000  big-M5   5.26964e+06  711.84   0.00
1000  ours     5.26964e+06  0.51     0.00
3000  big-M50  3.96610e+06  3600.32  33.83
3000  big-M20  3.98698e+06  3601.87  33.13
3000  big-M5   -            3604.15  -
3000  ours     5.27658e+06  1.76     0.00
5000  big-M50  3.96757e+06  3607.78  34.19
5000  big-M20  4.37277e+06  3610.74  21.76
5000  big-M5   5.27116e+06  3600.14  1.00
5000  ours     5.27116e+06  4.32     0.00
Table 82: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 10.0).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  1.10083e+07  0.44     0.00
100   big-M20  1.10083e+07  0.37     0.00
100   big-M5   1.10083e+07  0.33     0.00
100   ours     1.10083e+07  0.32     0.00
500   big-M50  1.10515e+07  91.37    0.00
500   big-M20  1.10517e+07  74.71    0.00
500   big-M5   1.10517e+07  99.37    0.00
500   ours     1.10517e+07  0.41     0.00
1000  big-M50  1.09505e+07  1534.37  0.00
1000  big-M20  1.09507e+07  1102.78  0.00
1000  big-M5   1.09507e+07  814.92   0.00
1000  ours     1.09507e+07  0.55     0.00
3000  big-M50  -            3603.26  -
3000  big-M20  -            3606.10  -
3000  big-M5   -            3606.02  -
3000  ours     1.10011e+07  1.73     0.00
5000  big-M50  9.62941e+06  3608.37  15.34
5000  big-M20  1.00932e+07  3602.48  10.04
5000  big-M5   -            3633.80  -
5000  ours     1.09965e+07  4.24     0.00
Table 83: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 10.0).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  2.43574e+07  0.42     0.00
100   big-M20  2.43574e+07  0.36     0.00
100   big-M5   2.43574e+07  0.37     0.00
100   ours     2.43574e+07  0.30     0.00
500   big-M50  2.44178e+07  85.12    0.00
500   big-M20  2.44184e+07  70.81    0.00
500   big-M5   2.44184e+07  53.85    0.00
500   ours     2.44184e+07  0.44     0.00
1000  big-M50  2.41986e+07  1415.19  0.00
1000  big-M20  2.41987e+07  878.61   0.00
1000  big-M5   2.41988e+07  613.98   0.00
1000  ours     2.41988e+07  0.52     0.00
3000  big-M50  -            3615.34  -
3000  big-M20  -            3632.97  -
3000  big-M5   2.43712e+07  3600.25  0.56
3000  ours     2.43712e+07  1.81     0.00
5000  big-M50  2.43693e+07  3607.06  1.00
5000  big-M20  2.43693e+07  3603.18  1.00
5000  big-M5   -            3626.02  -
5000  ours     2.43693e+07  4.08     0.00
Table 84: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 10.0).
# of features  method   upper bound  time(s)  gap(%)
100   big-M50  9.11025e+07  0.17     0.00
100   big-M20  9.11025e+07  0.12     0.00
100   big-M5   9.11025e+07  0.10     0.00
100   ours     9.11025e+07  0.32     0.00
500   big-M50  9.12046e+07  117.86   0.00
500   big-M20  9.12046e+07  128.18   0.00
500   big-M5   9.12046e+07  184.18   0.00
500   ours     9.12046e+07  0.38     0.00
1000  big-M50  9.04074e+07  3600.04  0.01
1000  big-M20  9.04074e+07  3600.17  0.20
1000  big-M5   9.04074e+07  2000.72  0.00
1000  ours     9.04074e+07  0.52     0.00
3000  big-M50  -            3611.90  -
3000  big-M20  -            3625.57  -
3000  big-M5   -            3607.35  -
3000  ours     9.12760e+07  20.84    0.00
5000  big-M50  -            3615.45  -
5000  big-M20  -            3613.78  -
5000  big-M5   -            3613.28  -
5000  ours     9.12915e+07  57.41    0.00
Table 85: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 10.0).
G.3.6 Synthetic 1 Benchmark with λ2 = 10.0 with Warmstart
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  2.10868e+06  0.25     0.00
100   big-M20 + warm start  2.10868e+06  0.17     0.00
100   big-M5 + warm start   2.10868e+06  0.05     0.00
100   ours                  2.10868e+06  0.30     0.00
500   big-M50 + warm start  2.12823e+06  3.76     0.00
500   big-M20 + warm start  2.12823e+06  2.68     0.00
500   big-M5 + warm start   2.12823e+06  2.17     0.00
500   ours                  2.12823e+06  0.38     0.00
1000  big-M50 + warm start  2.10995e+06  23.00    0.00
1000  big-M20 + warm start  2.10995e+06  18.30    0.00
1000  big-M5 + warm start   2.10995e+06  26.24    0.00
1000  ours                  2.10995e+06  0.57     0.00
3000  big-M50 + warm start  2.10220e+06  241.10   0.00
3000  big-M20 + warm start  2.10220e+06  302.29   0.00
3000  big-M5 + warm start   2.10220e+06  273.21   0.00
3000  ours                  2.10220e+06  1.68     0.00
5000  big-M50 + warm start  2.09625e+06  1309.75  0.00
5000  big-M20 + warm start  2.09625e+06  1142.75  0.00
5000  big-M5 + warm start   2.09625e+06  1636.89  0.00
5000  ours                  2.09625e+06  5.09     0.00
Table 86: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  5.28716e+06  0.25     0.00
100   big-M20 + warm start  5.28716e+06  0.05     0.00
100   big-M5 + warm start   5.28716e+06  0.05     0.00
100   ours                  5.28716e+06  0.30     0.00
500   big-M50 + warm start  5.31842e+06  2.93     0.00
500   big-M20 + warm start  5.31842e+06  2.99     0.00
500   big-M5 + warm start   5.31842e+06  1.86     0.00
500   ours                  5.31842e+06  0.41     0.00
1000  big-M50 + warm start  5.26964e+06  15.10    0.00
1000  big-M20 + warm start  5.26964e+06  14.76    0.00
1000  big-M5 + warm start   5.26964e+06  11.79    0.00
1000  ours                  5.26964e+06  0.51     0.00
3000  big-M50 + warm start  5.27658e+06  315.30   0.00
3000  big-M20 + warm start  5.27658e+06  390.22   0.00
3000  big-M5 + warm start   5.27658e+06  249.80   0.00
3000  ours                  5.27658e+06  1.76     0.00
5000  big-M50 + warm start  5.27116e+06  1100.87  0.00
5000  big-M20 + warm start  5.27116e+06  1014.36  0.00
5000  big-M5 + warm start   5.27116e+06  3627.13  1.00
5000  ours                  5.27116e+06  4.32     0.00
Table 87: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  1.10083e+07  0.06     0.00
100   big-M20 + warm start  1.10083e+07  0.05     0.00
100   big-M5 + warm start   1.10083e+07  0.05     0.00
100   ours                  1.10083e+07  0.32     0.00
500   big-M50 + warm start  1.10517e+07  8.51     0.00
500   big-M20 + warm start  1.10517e+07  6.01     0.00
500   big-M5 + warm start   1.10517e+07  13.97    0.00
500   ours                  1.10517e+07  0.41     0.00
1000  big-M50 + warm start  1.09507e+07  14.03    0.00
1000  big-M20 + warm start  1.09507e+07  15.08    0.00
1000  big-M5 + warm start   1.09507e+07  11.81    0.00
1000  ours                  1.09507e+07  0.55     0.00
3000  big-M50 + warm start  1.10011e+07  201.32   0.00
3000  big-M20 + warm start  1.10011e+07  292.08   0.00
3000  big-M5 + warm start   1.10011e+07  241.63   0.00
3000  ours                  1.10011e+07  1.73     0.00
5000  big-M50 + warm start  1.09965e+07  3601.20  1.00
5000  big-M20 + warm start  1.09965e+07  3623.71  1.00
5000  big-M5 + warm start   1.09965e+07  3629.56  570166517.91
5000  ours                  1.09965e+07  4.24     0.00
Table 88: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 10.0). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  2.43574e+07  0.16     0.00
100   big-M20 + warm start  2.43574e+07  0.17     0.00
100   big-M5 + warm start   2.43574e+07  0.04     0.00
100   ours                  2.43574e+07  0.30     0.00
500   big-M50 + warm start  2.44184e+07  2.11     0.00
500   big-M20 + warm start  2.44184e+07  11.80    0.00
500   big-M5 + warm start   2.44184e+07  1.69     0.00
500   ours                  2.44184e+07  0.44     0.00
1000  big-M50 + warm start  2.41988e+07  12.39    0.00
1000  big-M20 + warm start  2.41988e+07  13.36    0.00
1000  big-M5 + warm start   2.41988e+07  10.59    0.00
1000  ours                  2.41988e+07  0.52     0.00
3000  big-M50 + warm start  2.43712e+07  3600.43  0.59
3000  big-M20 + warm start  2.43712e+07  3600.08  0.59
3000  big-M5 + warm start   2.43712e+07  3600.72  0.56
3000  ours                  2.43712e+07  1.81     0.00
5000  big-M50 + warm start  2.43693e+07  3624.24  1.00
5000  big-M20 + warm start  2.43693e+07  3603.69  1.00
5000  big-M5 + warm start   2.43693e+07  3646.36  600320013.98
5000  ours                  2.43693e+07  4.08     0.00
Table 89: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 10.0). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
# of features  method                upper bound  time(s)  gap(%)
100   big-M50 + warm start  9.11025e+07  0.04     0.00
100   big-M20 + warm start  9.11025e+07  0.04     0.00
100   big-M5 + warm start   9.11025e+07  0.04     0.00
100   ours                  9.11025e+07  0.32     0.00
500   big-M50 + warm start  9.12046e+07  36.63    0.00
500   big-M20 + warm start  9.12046e+07  28.85    0.00
500   big-M5 + warm start   9.12046e+07  18.35    0.00
500   ours                  9.12046e+07  0.38     0.00
1000  big-M50 + warm start  9.04074e+07  3600.19  0.06
1000  big-M20 + warm start  9.04074e+07  3600.33  0.20
1000  big-M5 + warm start   9.04074e+07  1545.53  0.00
1000  ours                  9.04074e+07  0.52     0.00
3000  big-M50 + warm start  9.12760e+07  3606.60  0.59
3000  big-M20 + warm start  9.12760e+07  3603.80  0.59
3000  big-M5 + warm start   9.12760e+07  3605.53  222998156.52
3000  ours                  9.12760e+07  20.84    0.00
5000  big-M50 + warm start  9.12915e+07  3606.47  1.00
5000  big-M20 + warm start  9.12915e+07  3602.91  1.00
5000  big-M5 + warm start   9.12915e+07  3634.46  618134564.44
5000  ours                  9.12915e+07  57.41    0.00
Table 90: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 10.0). All baselines use our beam search solution as a warm start. The method big-M5 has some large optimality gaps because Gurobi couldn't finish solving the root relaxation within 1h.
G.3.7 Synthetic 2 Benchmark with λ2 = 0.001 with No Warmstart
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  6.34585e+04  3600.07  19.16
3000  big-M20  6.34585e+04  3600.07  18.11
3000  big-M5   6.34585e+04  2252.12  0.00
3000  ours     6.34585e+04  3600.49  19.45
4000  big-M50  8.64294e+04  3600.07  15.32
4000  big-M20  8.64294e+04  3600.09  14.37
4000  big-M5   8.64294e+04  1536.42  0.00
4000  ours     8.64294e+04  20.25    0.00
5000  big-M50  1.09576e+05  3600.81  10.25
5000  big-M20  1.09576e+05  3600.09  11.72
5000  big-M5   1.09576e+05  1154.40  0.00
5000  ours     1.09576e+05  20.51    0.00
6000  big-M50  1.27643e+05  3600.08  9.93
6000  big-M20  1.27643e+05  3600.59  9.92
6000  big-M5   1.27643e+05  1018.24  0.00
6000  ours     1.27643e+05  1.70     0.00
7000  big-M50  1.50614e+05  3600.09  8.29
7000  big-M20  1.50614e+05  3600.07  5.79
7000  big-M5   1.50614e+05  851.91   0.00
7000  ours     1.50614e+05  1.67     0.00
Table 91: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  1.59509e+05  3600.08  19.07
3000  big-M20  1.59509e+05  3600.67  17.35
3000  big-M5   1.59509e+05  3600.06  6.40
3000  ours     1.59509e+05  3600.67  19.53
4000  big-M50  2.17574e+05  3600.07  14.78
4000  big-M20  2.17574e+05  3600.14  14.34
4000  big-M5   2.17574e+05  3600.40  4.63
4000  ours     2.17574e+05  100.13   0.00
5000  big-M50  2.71986e+05  3600.07  11.22
5000  big-M20  2.71986e+05  3600.07  11.12
5000  big-M5   2.71986e+05  2568.73  0.00
5000  ours     2.71986e+05  20.80    0.00
6000  big-M50  3.16761e+05  3600.09  9.96
6000  big-M20  3.16761e+05  3600.06  9.95
6000  big-M5   3.16761e+05  1834.08  0.00
6000  ours     3.16761e+05  20.70    0.00
7000  big-M50  3.72610e+05  3601.19  8.33
7000  big-M20  3.72610e+05  3600.40  8.13
7000  big-M5   3.72610e+05  1682.96  0.00
7000  ours     3.72610e+05  1.98     0.00
Table 92: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.001).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  3.33604e+05  3600.08  18.79
3000  big-M20  3.31111e+05  3600.31  16.99
3000  big-M5   3.33604e+05  3600.23  7.53
3000  ours     3.33604e+05  3600.58  19.57
4000  big-M50  4.53261e+05  3600.07  14.98
4000  big-M20  4.53261e+05  3600.48  13.50
4000  big-M5   4.53261e+05  3600.17  6.23
4000  ours     4.53261e+05  3600.25  0.63
5000  big-M50  5.63007e+05  3601.19  11.68
5000  big-M20  5.63007e+05  3600.05  11.25
5000  big-M5   5.63007e+05  3600.07  3.85
5000  ours     5.63007e+05  36.97    0.00
6000  big-M50  6.57799e+05  3600.34  9.98
6000  big-M20  6.57799e+05  3600.43  9.36
6000  big-M5   6.57799e+05  3600.05  2.87
6000  ours     6.57799e+05  20.75    0.00
7000  big-M50  7.72267e+05  3600.11  8.33
7000  big-M20  7.72267e+05  3600.96  7.85
7000  big-M5   7.72267e+05  3600.06  2.40
7000  ours     7.72267e+05  21.02    0.00
Table 93: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.001).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  7.41187e+05  3600.05  18.15
3000  big-M20  7.41187e+05  3600.20  14.15
3000  big-M5   7.41187e+05  3600.22  5.94
3000  ours     7.41187e+05  3601.14  19.60
4000  big-M50  1.00103e+06  3600.07  15.19
4000  big-M20  1.00287e+06  3600.21  11.81
4000  big-M5   1.00287e+06  3600.11  5.06
4000  ours     1.00287e+06  3600.37  1.88
5000  big-M50  1.24027e+06  3601.10  11.88
5000  big-M20  1.24027e+06  3601.01  10.26
5000  big-M5   1.24027e+06  3600.19  4.34
5000  ours     1.24027e+06  466.42   0.00
6000  big-M50  1.45417e+06  3600.11  9.78
6000  big-M20  1.45417e+06  3600.08  9.11
6000  big-M5   1.45417e+06  3600.19  3.94
6000  ours     1.45417e+06  57.67    0.00
7000  big-M50  1.70483e+06  3601.55  8.34
7000  big-M20  1.70483e+06  3600.10  7.77
7000  big-M5   1.70483e+06  3600.38  3.39
7000  ours     1.70483e+06  21.42    0.00
Table 94: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.001).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  2.78521e+06  3600.15  15.90
3000  big-M20  2.78521e+06  3600.05  10.37
3000  big-M5   2.78521e+06  3600.36  3.69
3000  ours     2.78521e+06  3600.01  19.60
4000  big-M50  3.75202e+06  3600.06  13.33
4000  big-M20  3.75203e+06  3600.28  8.64
4000  big-M5   3.75132e+06  3600.37  3.18
4000  ours     3.75203e+06  3600.42  3.50
5000  big-M50  4.61877e+06  3600.19  11.37
5000  big-M20  4.61877e+06  3600.05  7.62
5000  big-M5   4.61877e+06  3600.06  2.72
5000  ours     4.61877e+06  3600.57  0.93
6000  big-M50  5.43826e+06  3600.06  9.77
6000  big-M20  5.43826e+06  3600.07  6.92
6000  big-M5   5.43826e+06  3600.15  2.50
6000  ours     5.43826e+06  3600.65  0.33
7000  big-M50  6.36767e+06  3600.38  8.22
7000  big-M20  6.36767e+06  3600.05  6.06
7000  big-M5   6.36767e+06  3600.21  2.23
7000  ours     6.36767e+06  3600.78  0.11
Table 95: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.001).
G.3.8 Synthetic 2 Benchmark with λ2 = 0.001 with Warmstart
# of samples  method                upper bound  time(s)  gap(%)
3000  big-M50 + warm start  6.34585e+04  3601.20  19.16
3000  big-M20 + warm start  6.34585e+04  3600.06  18.11
3000  big-M5 + warm start   6.34585e+04  2435.86  0.00
3000  ours                  6.34585e+04  3600.49  19.45
4000  big-M50 + warm start  8.64294e+04  3600.87  15.32
4000  big-M20 + warm start  8.64294e+04  3600.12  14.37
4000  big-M5 + warm start   8.64294e+04  1618.73  0.00
4000  ours                  8.64294e+04  20.25    0.00
5000  big-M50 + warm start  1.09576e+05  3601.86  10.25
5000  big-M20 + warm start  1.09576e+05  3601.59  11.72
5000  big-M5 + warm start   1.09576e+05  1530.18  0.00
5000  ours                  1.09576e+05  20.51    0.00
6000  big-M50 + warm start  1.27643e+05  3601.14  9.93
6000  big-M20 + warm start  1.27643e+05  3600.64  9.82
6000  big-M5 + warm start   1.27643e+05  972.83   0.00
6000  ours                  1.27643e+05  1.70     0.00
7000  big-M50 + warm start  1.50614e+05  3600.13  8.29
7000  big-M20 + warm start  1.50614e+05  3600.94  5.81
7000  big-M5 + warm start   1.50614e+05  824.89   0.00
7000  ours                  1.50614e+05  1.67     0.00
Table 96: Comparison of time and optimality gap on the
synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of samples  method                upper bound  time(s)  gap(%)
3000  big-M50 + warm start  1.59509e+05  3600.60  19.07
3000  big-M20 + warm start  1.59509e+05  3600.07  17.35
3000  big-M5 + warm start   1.59509e+05  3600.05  6.40
3000  ours                  1.59509e+05  3600.67  19.53
4000  big-M50 + warm start  2.17574e+05  3600.10  14.78
4000  big-M20 + warm start  2.17574e+05  3600.68  14.34
4000  big-M5 + warm start   2.17574e+05  3600.04  4.63
4000  ours                  2.17574e+05  100.13   0.00
5000  big-M50 + warm start  2.71986e+05  3600.06  11.22
5000  big-M20 + warm start  2.71986e+05  3601.05  11.12
5000  big-M5 + warm start   2.71986e+05  2571.91  0.00
5000  ours                  2.71986e+05  20.80    0.00
6000  big-M50 + warm start  3.16761e+05  3600.07  9.96
6000  big-M20 + warm start  3.16761e+05  3600.65  9.95
6000  big-M5 + warm start   3.16761e+05  1861.02  0.00
6000  ours                  3.16761e+05  20.70    0.00
7000  big-M50 + warm start  3.72610e+05  3600.08  8.33
7000  big-M20 + warm start  3.72610e+05  3600.41  8.13
7000  big-M5 + warm start   3.72610e+05  1752.62  0.00
7000  ours                  3.72610e+05  1.98     0.00
Table 97: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of samples  method                upper bound  time(s)  gap(%)
3000  big-M50 + warm start  3.33604e+05  3600.17  18.79
3000  big-M20 + warm start  3.33604e+05  3600.12  16.11
3000  big-M5 + warm start   3.33604e+05  3600.20  7.53
3000  ours                  3.33604e+05  3600.58  19.57
4000  big-M50 + warm start  4.53261e+05  3600.17  14.98
4000  big-M20 + warm start  4.53261e+05  3600.77  13.50
4000  big-M5 + warm start   4.53261e+05  3600.23  6.23
4000  ours                  4.53261e+05  3600.25  0.63
5000  big-M50 + warm start  5.63007e+05  3600.29  11.70
5000  big-M20 + warm start  5.63007e+05  3600.07  11.25
5000  big-M5 + warm start   5.63007e+05  3600.32  3.38
5000  ours                  5.63007e+05  36.97    0.00
6000  big-M50 + warm start  6.57799e+05  3600.94  9.98
6000  big-M20 + warm start  6.57799e+05  3600.28  9.36
6000  big-M5 + warm start   6.57799e+05  3600.06  2.89
6000  ours                  6.57799e+05  20.75    0.00
7000  big-M50 + warm start  7.72267e+05  3600.55  8.33
7000  big-M20 + warm start  7.72267e+05  3600.07  7.85
7000  big-M5 + warm start   7.72267e+05  3600.65  2.40
7000  ours                  7.72267e+05  21.02    0.00
Table 98: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of samples  method                upper bound  time(s)  gap(%)
3000  big-M50 + warm start  7.41187e+05  3600.76  18.15
3000  big-M20 + warm start  7.41187e+05  3600.29  14.15
3000  big-M5 + warm start   7.41187e+05  3600.04  5.94
3000  ours                  7.41187e+05  3601.14  19.60
4000  big-M50 + warm start  1.00287e+06  3600.07  14.98
4000  big-M20 + warm start  1.00287e+06  3600.10  11.81
4000  big-M5 + warm start   1.00287e+06  3600.04  5.06
4000  ours                  1.00287e+06  3600.37  1.88
5000  big-M50 + warm start  1.24027e+06  3600.05  11.88
5000  big-M20 + warm start  1.24027e+06  3600.49  10.26
5000  big-M5 + warm start   1.24027e+06  3600.24  4.34
5000  ours                  1.24027e+06  466.42   0.00
6000  big-M50 + warm start  1.45417e+06  3600.45  9.78
6000  big-M20 + warm start  1.45417e+06  3600.26  9.06
6000  big-M5 + warm start   1.45417e+06  3600.07  3.94
6000  ours                  1.45417e+06  57.67    0.00
7000  big-M50 + warm start  1.70483e+06  3601.38  8.34
7000  big-M20 + warm start  1.70483e+06  3600.06  7.77
7000  big-M5 + warm start   1.70483e+06  3600.05  3.39
7000  ours                  1.70483e+06  21.42    0.00
Table 99: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.001). All baselines use our beam search solution as a warm start.
# of samples  method                upper bound  time(s)  gap(%)
3000  big-M50 + warm start  2.78521e+06  3600.74  15.90
3000  big-M20 + warm start  2.78521e+06  3600.24  10.37
3000  big-M5 + warm start   2.78521e+06  3600.32  3.69
3000  ours                  2.78521e+06  3600.01  19.60
4000  big-M50 + warm start  3.75203e+06  3600.07  13.33
4000  big-M20 + warm start  3.75203e+06  3600.24  8.64
4000  big-M5 + warm start   3.75203e+06  3600.16  3.16
4000  ours                  3.75203e+06  3600.42  3.50
5000  big-M50 + warm start  4.61877e+06  3601.36  11.37
5000  big-M20 + warm start  4.61877e+06  3600.23  7.62
5000  big-M5 + warm start   4.61877e+06  3600.07  2.72
5000  ours                  4.61877e+06  3600.57  0.93
6000  big-M50 + warm start  5.43826e+06  3600.22  9.77
6000  big-M20 + warm start  5.43826e+06  3601.07  6.92
6000  big-M5 + warm start   5.43826e+06  3600.43  2.50
6000  ours                  5.43826e+06  3600.65  0.33
7000  big-M50 + warm start  6.36767e+06  3600.12  8.22
7000  big-M20 + warm start  6.36767e+06  3600.57  6.06
7000  big-M5 + warm start   6.36767e+06  3600.04  2.23
7000  ours                  6.36767e+06  3600.78  0.11
Table 100: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.001). All baselines use our beam search solution as a warm start.
G.3.9 Synthetic 2 Benchmark with λ2 = 0.1 with No Warmstart
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  6.34575e+04  3600.07  19.14
3000  big-M20  6.34575e+04  3600.11  18.10
3000  big-M5   6.34575e+04  2452.71  0.00
3000  ours     6.34575e+04  3600.33  19.31
4000  big-M50  8.64284e+04  3600.23  15.33
4000  big-M20  8.64284e+04  3600.11  14.04
4000  big-M5   8.64284e+04  1675.19  0.00
4000  ours     8.64284e+04  20.81    0.00
5000  big-M50  1.09575e+05  3600.08  11.60
5000  big-M20  1.09575e+05  3600.07  9.94
5000  big-M5   1.09575e+05  1316.91  0.00
5000  ours     1.09575e+05  20.65    0.00
6000  big-M50  1.27642e+05  3601.02  9.92
6000  big-M20  1.27642e+05  3600.06  9.92
6000  big-M5   1.27642e+05  1040.84  0.00
6000  ours     1.27642e+05  1.70     0.00
7000  big-M50  1.50613e+05  3600.45  8.32
7000  big-M20  1.50613e+05  3600.10  5.74
7000  big-M5   1.50613e+05  901.46   0.00
7000  ours     1.50613e+05  1.67     0.00
Table 101: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  1.59508e+05  3601.50  19.07
3000  big-M20  1.59508e+05  3600.41  17.35
3000  big-M5   1.59508e+05  3600.05  6.40
3000  ours     1.59508e+05  3600.27  19.40
4000  big-M50  2.17573e+05  3600.23  14.78
4000  big-M20  2.17573e+05  3600.11  14.34
4000  big-M5   2.17573e+05  3600.36  4.63
4000  ours     2.17573e+05  99.53    0.00
5000  big-M50  2.71985e+05  3600.09  11.22
5000  big-M20  2.71985e+05  3600.06  11.29
5000  big-M5   2.71985e+05  2986.61  0.00
5000  ours     2.71985e+05  21.01    0.00
6000  big-M50  3.16760e+05  3600.07  9.96
6000  big-M20  3.16760e+05  3601.06  9.95
6000  big-M5   3.16760e+05  2025.19  0.00
6000  ours     3.16760e+05  20.91    0.00
7000  big-M50  3.72609e+05  3600.80  8.33
7000  big-M20  3.72609e+05  3600.81  8.12
7000  big-M5   3.72609e+05  1452.72  0.00
7000  ours     3.72609e+05  1.73     0.00
Table 102: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.1).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  3.33603e+05  3601.02  18.78
3000  big-M20  3.33603e+05  3600.70  16.11
3000  big-M5   3.33603e+05  3600.22  7.55
3000  ours     3.33603e+05  3600.68  19.44
4000  big-M50  4.53260e+05  3600.12  14.95
4000  big-M20  4.53260e+05  3600.25  13.47
4000  big-M5   4.53260e+05  3600.05  6.23
4000  ours     4.53260e+05  3600.41  0.62
5000  big-M50  5.63006e+05  3600.07  11.69
5000  big-M20  5.63006e+05  3601.43  11.26
5000  big-M5   5.63006e+05  3600.31  3.51
5000  ours     5.63006e+05  37.26    0.00
6000  big-M50  6.57798e+05  3600.53  9.98
6000  big-M20  6.57798e+05  3600.62  9.36
6000  big-M5   6.57798e+05  3600.05  2.82
6000  ours     6.57798e+05  20.64    0.00
7000  big-M50  7.72266e+05  3600.96  8.33
7000  big-M20  7.72266e+05  3600.57  7.80
7000  big-M5   7.72266e+05  3600.14  2.34
7000  ours     7.72266e+05  20.91    0.00
Table 103: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.1).
# of samples  method   upper bound  time(s)  gap(%)
3000  big-M50  7.41186e+05  3600.75  18.15
3000  big-M20  7.41186e+05  3600.78  14.15
3000  big-M5   7.41186e+05  3600.34  5.92
3000  ours     7.41186e+05  3600.59  19.46
4000  big-M50  1.00287e+06  3600.07  14.98
4000  big-M20  1.00287e+06  3600.06  12.06
4000  big-M5   1.00287e+06  3600.36  5.06
4000  ours     1.00287e+06  3600.08  1.88
5000  big-M50  1.24027e+06  3600.07  11.88
5000  big-M20  1.24027e+06  3600.08  10.26
5000  big-M5   1.24027e+06  3600.22  4.34
5000  ours     1.24027e+06  469.43   0.00
6000  big-M50  1.45416e+06  3601.17  9.78
6000  big-M20  1.45416e+06  3600.60  9.11
6000  big-M5   1.45416e+06  3600.20  3.94
6000  ours     1.45416e+06  59.16    0.00
7000  big-M50  1.70483e+06  3602.06  8.34
7000  big-M20  1.70483e+06  3600.61  7.77
7000  big-M5   1.70483e+06  3600.22  3.39
7000  ours     1.70483e+06  21.12    0.00
Table 104: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.1).
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  2.78468×10^6  3600.31  15.93
3000  big-M20  2.78521×10^6  3600.05  10.59
3000  big-M5  2.78521×10^6  3600.09  3.69
3000  ours  2.78521×10^6  3600.23  19.46
4000  big-M50  3.75203×10^6  3600.08  13.35
4000  big-M20  3.75203×10^6  3600.04  8.60
4000  big-M5  3.75203×10^6  3600.14  3.16
4000  ours  3.75203×10^6  3601.10  3.50
5000  big-M50  4.61877×10^6  3600.08  11.40
5000  big-M20  4.61877×10^6  3600.05  7.62
5000  big-M5  4.61877×10^6  3600.16  2.72
5000  ours  4.61877×10^6  3600.40  0.93
6000  big-M50  5.43826×10^6  3600.23  9.79
6000  big-M20  5.43826×10^6  3600.11  6.92
6000  big-M5  5.43826×10^6  3600.17  2.50
6000  ours  5.43826×10^6  3600.26  0.33
7000  big-M50  6.36767×10^6  3600.07  8.22
7000  big-M20  6.36767×10^6  3600.24  6.06
7000  big-M5  6.36767×10^6  3600.05  2.25
7000  ours  6.36767×10^6  3600.58  0.11

Table 105: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.1).

G.3.10 Synthetic 2 Benchmark with λ2 = 0.1 with Warmstart

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  6.34575×10^4  3600.12  19.14
3000  big-M20 + warm start  6.34575×10^4  3600.22  18.10
3000  big-M5 + warm start  6.34575×10^4  2424.12  0.00
3000  ours  6.34575×10^4  3600.33  19.31
4000  big-M50 + warm start  8.64284×10^4  3600.05  15.33
4000  big-M20 + warm start  8.64284×10^4  3600.18  14.04
4000  big-M5 + warm start  8.64284×10^4  1702.95  0.00
4000  ours  8.64284×10^4  20.81  0.00
5000  big-M50 + warm start  1.09575×10^5  3601.88  11.60
5000  big-M20 + warm start  1.09575×10^5  3600.31  9.94
5000  big-M5 + warm start  1.09575×10^5  1284.19  0.00
5000  ours  1.09575×10^5  20.65  0.00
6000  big-M50 + warm start  1.27642×10^5  3600.62  9.93
6000  big-M20 + warm start  1.27642×10^5  3600.15  9.92
6000  big-M5 + warm start  1.27642×10^5  1004.13  0.00
6000  ours  1.27642×10^5  1.70  0.00
7000  big-M50 + warm start  1.50613×10^5  3600.07  8.32
7000  big-M20 + warm start  1.50613×10^5  3600.06  5.74
7000  big-M5 + warm start  1.50613×10^5  794.91  0.00
7000  ours  1.50613×10^5  1.67  0.00

Table 106: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1). All baselines use our beam search solution as a warm start.

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  1.59508×10^5  3600.22  19.07
3000  big-M20 + warm start  1.59508×10^5  3600.22  17.30
3000  big-M5 + warm start  1.59508×10^5  3600.05  6.35
3000  ours  1.59508×10^5  3600.27  19.40
4000  big-M50 + warm start  2.17573×10^5  3600.10  14.78
4000  big-M20 + warm start  2.17573×10^5  3600.06  14.34
4000  big-M5 + warm start  2.17573×10^5  3600.04  4.63
4000  ours  2.17573×10^5  99.53  0.00
5000  big-M50 + warm start  2.71985×10^5  3600.07  11.22
5000  big-M20 + warm start  2.71985×10^5  3600.10  11.29
5000  big-M5 + warm start  2.71985×10^5  2980.53  0.00
5000  ours  2.71985×10^5  21.01  0.00
6000  big-M50 + warm start  3.16760×10^5  3600.07  9.96
6000  big-M20 + warm start  3.16760×10^5  3600.74  9.95
6000  big-M5 + warm start  3.16760×10^5  2034.39  0.00
6000  ours  3.16760×10^5  20.91  0.00
7000  big-M50 + warm start  3.72609×10^5  3601.58  8.33
7000  big-M20 + warm start  3.72609×10^5  3600.38  8.12
7000  big-M5 + warm start  3.72609×10^5  1513.64  0.00
7000  ours  3.72609×10^5  1.73  0.00

Table 107: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.1). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  3.33603×10^5  3601.53  18.78
3000  big-M20 + warm start  3.33603×10^5  3600.81  16.11
3000  big-M5 + warm start  3.33603×10^5  3600.22  7.55
3000  ours  3.33603×10^5  3600.68  19.44
4000  big-M50 + warm start  4.53260×10^5  3600.07  15.04
4000  big-M20 + warm start  4.53260×10^5  3600.26  13.47
4000  big-M5 + warm start  4.53260×10^5  3600.21  6.23
4000  ours  4.53260×10^5  3600.41  0.62
5000  big-M50 + warm start  5.63006×10^5  3601.70  11.69
5000  big-M20 + warm start  5.63006×10^5  3600.25  11.26
5000  big-M5 + warm start  5.63006×10^5  3600.05  3.43
5000  ours  5.63006×10^5  37.26  0.00
6000  big-M50 + warm start  6.57798×10^5  3600.18  9.98
6000  big-M20 + warm start  6.57798×10^5  3600.55  9.36
6000  big-M5 + warm start  6.57798×10^5  3600.04  2.89
6000  ours  6.57798×10^5  20.64  0.00
7000  big-M50 + warm start  7.72266×10^5  3600.63  8.33
7000  big-M20 + warm start  7.72266×10^5  3600.06  7.80
7000  big-M5 + warm start  7.72266×10^5  3600.15  2.46
7000  ours  7.72266×10^5  20.91  0.00

Table 108: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.1). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  7.41186×10^5  3601.56  18.15
3000  big-M20 + warm start  7.41186×10^5  3600.35  14.15
3000  big-M5 + warm start  7.41186×10^5  3600.27  5.92
3000  ours  7.41186×10^5  3600.59  19.46
4000  big-M50 + warm start  1.00287×10^6  3600.09  14.98
4000  big-M20 + warm start  1.00287×10^6  3600.12  12.06
4000  big-M5 + warm start  1.00287×10^6  3600.07  5.06
4000  ours  1.00287×10^6  3600.08  1.88
5000  big-M50 + warm start  1.24027×10^6  3600.40  11.88
5000  big-M20 + warm start  1.24027×10^6  3600.98  10.26
5000  big-M5 + warm start  1.24027×10^6  3600.22  4.34
5000  ours  1.24027×10^6  469.43  0.00
6000  big-M50 + warm start  1.45416×10^6  3600.05  9.78
6000  big-M20 + warm start  1.45416×10^6  3600.06  9.06
6000  big-M5 + warm start  1.45416×10^6  3600.19  3.94
6000  ours  1.45416×10^6  59.16  0.00
7000  big-M50 + warm start  1.70483×10^6  3600.07  8.34
7000  big-M20 + warm start  1.70483×10^6  3600.28  7.77
7000  big-M5 + warm start  1.70483×10^6  3600.24  3.39
7000  ours  1.70483×10^6  21.12  0.00

Table 109: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.1). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  2.78521×10^6  3600.07  15.91
3000  big-M20 + warm start  2.78521×10^6  3600.71  10.37
3000  big-M5 + warm start  2.78521×10^6  3600.08  3.69
3000  ours  2.78521×10^6  3600.23  19.46
4000  big-M50 + warm start  3.75203×10^6  3600.27  13.35
4000  big-M20 + warm start  3.75203×10^6  3600.05  8.60
4000  big-M5 + warm start  3.75203×10^6  3600.09  3.16
4000  ours  3.75203×10^6  3601.10  3.50
5000  big-M50 + warm start  4.61877×10^6  3600.12  11.40
5000  big-M20 + warm start  4.61877×10^6  3600.06  7.62
5000  big-M5 + warm start  4.61877×10^6  3600.09  2.72
5000  ours  4.61877×10^6  3600.40  0.93
6000  big-M50 + warm start  5.43826×10^6  3600.07  9.79
6000  big-M20 + warm start  5.43826×10^6  3600.33  6.97
6000  big-M5 + warm start  5.43826×10^6  3600.24  2.50
6000  ours  5.43826×10^6  3600.26  0.33
7000  big-M50 + warm start  6.36767×10^6  3600.07  8.22
7000  big-M20 + warm start  6.36767×10^6  3600.67  6.06
7000  big-M5 + warm start  6.36767×10^6  3600.07  2.25
7000  ours  6.36767×10^6  3600.58  0.11

Table 110: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.1). All baselines use our beam search solution as a warm start.
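Throughout these tables, gap(%) is the relative optimality gap between the incumbent upper bound and the best certified lower bound at termination; an entry of 0.00 means optimality was certified within the time limit. As a point of reference, here is a minimal sketch of the common MIP convention gap = (UB − LB)/UB × 100; the function name and the exact convention are illustrative assumptions, not the paper's code:

```python
def optimality_gap(upper_bound: float, lower_bound: float) -> float:
    """Relative optimality gap in percent, assuming the common MIP
    convention (UB - LB) / UB * 100 (illustrative, not the paper's
    exact implementation)."""
    if upper_bound == 0:
        return 0.0
    return (upper_bound - lower_bound) / upper_bound * 100.0

# A certified-optimal solve has matching bounds, hence a 0.00% gap.
print(round(optimality_gap(6.34575e4, 6.34575e4), 2))  # 0.0
# A time-limited solve leaves a residual gap between the two bounds.
print(round(optimality_gap(6.34575e4, 5.13095e4), 2))  # 19.14
```

Under this reading, a dash in a table entry indicates that the solver returned no usable incumbent or bound within the time limit.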
G.3.11 Synthetic 2 Benchmark with λ2 = 10.0 with No Warmstart

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  6.33576×10^4  3600.31  18.47
3000  big-M20  6.33576×10^4  3600.43  17.82
3000  big-M5  6.33576×10^4  2349.33  0.00
3000  ours  6.33576×10^4  3601.41  4.54
4000  big-M50  8.63287×10^4  3600.85  15.21
4000  big-M20  8.63287×10^4  3600.06  13.93
4000  big-M5  8.63287×10^4  1577.17  0.00
4000  ours  8.63287×10^4  20.33  0.00
5000  big-M50  1.09474×10^5  3600.28  11.85
5000  big-M20  1.09474×10^5  3600.55  10.17
5000  big-M5  1.09474×10^5  1627.54  0.00
5000  ours  1.09474×10^5  20.71  0.00
6000  big-M50  1.27542×10^5  3600.37  9.90
6000  big-M20  1.27542×10^5  3600.09  9.81
6000  big-M5  1.27542×10^5  1071.17  0.00
6000  ours  1.27542×10^5  1.69  0.00
7000  big-M50  1.50512×10^5  3600.56  8.31
7000  big-M20  1.50512×10^5  3600.07  8.30
7000  big-M5  1.50512×10^5  856.04  0.00
7000  ours  1.50512×10^5  1.70  0.00

Table 111: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0).

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  1.59409×10^5  3600.72  18.52
3000  big-M20  1.59409×10^5  3600.07  17.23
3000  big-M5  1.59409×10^5  3600.44  5.92
3000  ours  1.59409×10^5  3600.62  4.68
4000  big-M50  2.17472×10^5  3600.07  14.64
4000  big-M20  2.17472×10^5  3600.82  14.15
4000  big-M5  2.17472×10^5  3600.05  4.05
4000  ours  2.17472×10^5  86.90  0.00
5000  big-M50  2.71885×10^5  3603.74  11.16
5000  big-M20  2.71885×10^5  3601.16  11.14
5000  big-M5  2.71885×10^5  2522.90  0.00
5000  ours  2.71885×10^5  20.75  0.00
6000  big-M50  3.16660×10^5  3600.07  9.93
6000  big-M20  3.16660×10^5  3600.26  9.92
6000  big-M5  3.16660×10^5  1893.58  0.00
6000  ours  3.16660×10^5  21.05  0.00
7000  big-M50  3.72509×10^5  3601.17  8.31
7000  big-M20  3.72509×10^5  3600.31  8.29
7000  big-M5  3.72509×10^5  1472.84  0.00
7000  ours  3.72509×10^5  1.71  0.00

Table 112: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 10.0).
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  3.33503×10^5  3600.07  18.43
3000  big-M20  3.33503×10^5  3600.06  16.05
3000  big-M5  3.33503×10^5  3600.22  7.54
3000  ours  3.33503×10^5  3600.25  6.21
4000  big-M50  4.53159×10^5  3600.16  14.84
4000  big-M20  4.53159×10^5  3600.90  13.44
4000  big-M5  4.53159×10^5  3600.21  6.24
4000  ours  4.53159×10^5  3600.95  0.35
5000  big-M50  5.62906×10^5  3600.16  11.59
5000  big-M20  5.62906×10^5  3600.07  11.13
5000  big-M5  5.62906×10^5  3600.36  3.65
5000  ours  5.62906×10^5  31.47  0.00
6000  big-M50  6.57698×10^5  3600.16  9.95
6000  big-M20  6.57698×10^5  3600.52  9.33
6000  big-M5  6.57698×10^5  3600.37  2.90
6000  ours  6.57698×10^5  21.24  0.00
7000  big-M50  7.72166×10^5  3600.12  8.31
7000  big-M20  7.72166×10^5  3600.05  7.81
7000  big-M5  7.72166×10^5  3600.05  2.33
7000  ours  7.72166×10^5  20.37  0.00

Table 113: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 10.0).

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  7.41086×10^5  3600.65  17.88
3000  big-M20  7.41086×10^5  3600.14  14.12
3000  big-M5  7.41086×10^5  3600.35  5.94
3000  ours  7.41086×10^5  3600.48  8.32
4000  big-M50  1.00277×10^6  3600.09  14.85
4000  big-M20  1.00277×10^6  3600.47  11.77
4000  big-M5  1.00277×10^6  3600.05  5.04
4000  ours  1.00277×10^6  3600.62  1.66
5000  big-M50  1.24017×10^6  3600.25  11.82
5000  big-M20  1.24017×10^6  3600.06  10.25
5000  big-M5  1.24017×10^6  3600.22  4.34
5000  ours  1.24017×10^6  386.72  0.00
6000  big-M50  1.45406×10^6  3600.07  9.75
6000  big-M20  1.44919×10^6  3600.53  9.46
6000  big-M5  1.45406×10^6  3600.20  3.95
6000  ours  1.45406×10^6  54.13  0.00
7000  big-M50  1.70473×10^6  3600.81  8.32
7000  big-M20  1.70473×10^6  3600.06  7.68
7000  big-M5  1.70473×10^6  3600.20  3.41
7000  ours  1.70473×10^6  21.30  0.00

Table 114: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 10.0).
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50  2.78507×10^6  3600.17  15.95
3000  big-M20  2.78507×10^6  3600.56  10.36
3000  big-M5  2.78507×10^6  3600.09  3.69
3000  ours  2.78507×10^6  3600.78  9.79
4000  big-M50  3.75192×10^6  3600.06  13.19
4000  big-M20  3.75192×10^6  3600.05  8.60
4000  big-M5  3.75192×10^6  3600.41  3.16
4000  ours  3.75192×10^6  3600.32  3.23
5000  big-M50  4.61867×10^6  3600.17  11.32
5000  big-M20  4.61867×10^6  3600.56  7.66
5000  big-M5  4.61867×10^6  3600.22  2.72
5000  ours  4.61867×10^6  3600.30  0.89
6000  big-M50  5.43816×10^6  3600.12  9.75
6000  big-M20  5.43816×10^6  3600.33  6.91
6000  big-M5  5.43816×10^6  3600.20  2.50
6000  ours  5.43816×10^6  3600.22  0.32
7000  big-M50  6.36757×10^6  3600.32  8.20
7000  big-M20  6.36757×10^6  3600.05  6.05
7000  big-M5  6.36757×10^6  3600.37  2.25
7000  ours  6.36757×10^6  3600.06  0.10

Table 115: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 10.0).

G.3.12 Synthetic 2 Benchmark with λ2 = 10.0 with Warmstart

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  6.33576×10^4  3600.12  18.47
3000  big-M20 + warm start  6.33576×10^4  3600.44  17.82
3000  big-M5 + warm start  6.33576×10^4  2370.86  0.00
3000  ours  6.33576×10^4  3601.41  4.54
4000  big-M50 + warm start  8.63287×10^4  3600.29  15.21
4000  big-M20 + warm start  8.63287×10^4  3600.09  13.93
4000  big-M5 + warm start  8.63287×10^4  1870.35  0.00
4000  ours  8.63287×10^4  20.33  0.00
5000  big-M50 + warm start  1.09474×10^5  3600.15  11.85
5000  big-M20 + warm start  1.09474×10^5  3600.43  10.17
5000  big-M5 + warm start  1.09474×10^5  1146.28  0.00
5000  ours  1.09474×10^5  20.71  0.00
6000  big-M50 + warm start  1.27542×10^5  3601.39  9.89
6000  big-M20 + warm start  1.27542×10^5  3600.08  9.81
6000  big-M5 + warm start  1.27542×10^5  1038.32  0.00
6000  ours  1.27542×10^5  1.69  0.00
7000  big-M50 + warm start  1.50512×10^5  3601.02  8.31
7000  big-M20 + warm start  1.50512×10^5  3600.99  7.98
7000  big-M5 + warm start  1.50512×10^5  853.59  0.00
7000  ours  1.50512×10^5  1.70  0.00

Table 116: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0). All baselines use our beam search solution as a warm start.

# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  1.59409×10^5  3600.67  18.52
3000  big-M20 + warm start  1.59409×10^5  3601.34  17.23
3000  big-M5 + warm start  1.59409×10^5  3600.06  5.92
3000  ours  1.59409×10^5  3600.62  4.68
4000  big-M50 + warm start  2.17472×10^5  3600.08  14.64
4000  big-M20 + warm start  2.17472×10^5  3600.35  14.15
4000  big-M5 + warm start  2.17472×10^5  3600.06  3.77
4000  ours  2.17472×10^5  86.90  0.00
5000  big-M50 + warm start  2.71885×10^5  3600.07  11.16
5000  big-M20 + warm start  2.71885×10^5  3600.19  11.09
5000  big-M5 + warm start  2.71885×10^5  2700.47  0.00
5000  ours  2.71885×10^5  20.75  0.00
6000  big-M50 + warm start  3.16660×10^5  3600.06  9.93
6000  big-M20 + warm start  3.16660×10^5  3600.74  9.92
6000  big-M5 + warm start  3.16660×10^5  1742.17  0.00
6000  ours  3.16660×10^5  21.05  0.00
7000  big-M50 + warm start  3.72509×10^5  3601.61  8.31
7000  big-M20 + warm start  3.72509×10^5  3600.25  8.11
7000  big-M5 + warm start  3.72509×10^5  1466.07  0.00
7000  ours  3.72509×10^5  1.71  0.00

Table 117: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  3.33503×10^5  3600.08  18.43
3000  big-M20 + warm start  3.33503×10^5  3600.16  16.05
3000  big-M5 + warm start  3.33503×10^5  3600.22  7.54
3000  ours  3.33503×10^5  3600.25  6.21
4000  big-M50 + warm start  4.53159×10^5  3600.06  14.84
4000  big-M20 + warm start  4.53159×10^5  3600.21  13.44
4000  big-M5 + warm start  4.53159×10^5  3600.22  6.24
4000  ours  4.53159×10^5  3600.95  0.35
5000  big-M50 + warm start  5.62906×10^5  3600.08  11.59
5000  big-M20 + warm start  5.62906×10^5  3601.71  11.22
5000  big-M5 + warm start  5.62906×10^5  3600.13  3.66
5000  ours  5.62906×10^5  31.47  0.00
6000  big-M50 + warm start  6.57698×10^5  3600.14  9.95
6000  big-M20 + warm start  6.57698×10^5  3600.97  9.33
6000  big-M5 + warm start  6.57698×10^5  3600.14  2.91
6000  ours  6.57698×10^5  21.24  0.00
7000  big-M50 + warm start  7.72166×10^5  3601.52  8.31
7000  big-M20 + warm start  7.72166×10^5  3600.30  7.81
7000  big-M5 + warm start  7.72166×10^5  3600.46  2.33
7000  ours  7.72166×10^5  20.37  0.00

Table 118: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  7.41086×10^5  3601.32  17.88
3000  big-M20 + warm start  7.41086×10^5  3600.23  14.06
3000  big-M5 + warm start  7.41086×10^5  3600.11  5.94
3000  ours  7.41086×10^5  3600.48  8.32
4000  big-M50 + warm start  1.00277×10^6  3600.44  14.85
4000  big-M20 + warm start  1.00277×10^6  3600.09  11.77
4000  big-M5 + warm start  1.00277×10^6  3600.26  5.04
4000  ours  1.00277×10^6  3600.62  1.66
5000  big-M50 + warm start  1.24017×10^6  3600.27  11.82
5000  big-M20 + warm start  1.24017×10^6  3600.69  10.25
5000  big-M5 + warm start  1.24017×10^6  3600.22  4.34
5000  ours  1.24017×10^6  386.72  0.00
6000  big-M50 + warm start  1.45406×10^6  3600.26  9.75
6000  big-M20 + warm start  1.45406×10^6  3600.07  9.09
6000  big-M5 + warm start  1.45406×10^6  3600.18  3.95
6000  ours  1.45406×10^6  54.13  0.00
7000  big-M50 + warm start  1.70473×10^6  3600.20  8.32
7000  big-M20 + warm start  1.70473×10^6  3600.12  7.68
7000  big-M5 + warm start  1.70473×10^6  3600.04  3.41
7000  ours  1.70473×10^6  21.30  0.00

Table 119: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 10.0). All baselines use our beam search solution as a warm start.
# of samples  method  upper bound  time(s)  gap(%)
3000  big-M50 + warm start  2.78507×10^6  3600.07  15.95
3000  big-M20 + warm start  2.78507×10^6  3600.04  10.36
3000  big-M5 + warm start  2.78507×10^6  3600.09  3.69
3000  ours  2.78507×10^6  3600.78  9.79
4000  big-M50 + warm start  3.75192×10^6  3600.06  13.19
4000  big-M20 + warm start  3.75192×10^6  3600.11  8.61
4000  big-M5 + warm start  3.75192×10^6  3600.15  3.16
4000  ours  3.75192×10^6  3600.32  3.23
5000  big-M50 + warm start  4.61867×10^6  3600.06  11.32
5000  big-M20 + warm start  4.61867×10^6  3600.79  7.64
5000  big-M5 + warm start  4.61867×10^6  3600.30  2.72
5000  ours  4.61867×10^6  3600.30  0.89
6000  big-M50 + warm start  5.43816×10^6  3600.39  9.75
6000  big-M20 + warm start  5.43816×10^6  3600.05  6.91
6000  big-M5 + warm start  5.43816×10^6  3600.05  2.50
6000  ours  5.43816×10^6  3600.22  0.32
7000  big-M50 + warm start  6.36757×10^6  3600.10  8.20
7000  big-M20 + warm start  6.36757×10^6  3600.28  6.05
7000  big-M5 + warm start  6.36757×10^6  3600.37  2.25
7000  ours  6.36757×10^6  3600.06  0.10

Table 120: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 10.0). All baselines use our beam search solution as a warm start.

G.4 Comparison with MOSEK Solver, Subset Selection CIO, and L0BnB

G.4.1 Synthetic 1 Benchmark with λ2 = 0.001

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  4.79406×10^5  2  2.46  0.00
100  Subset Select CIO + warm start  4.79406×10^5  2  2.72  0.00
100  L0BnB  2.10878×10^6  10  1420.85  0.00
100  MSK persp.  2.10878×10^6  10  932.06  0.00
100  MSK persp. + warm start  2.10878×10^6  10  898.00  0.00
100  Convex MSK opt. persp.  2.10878×10^6  10  0.10  0.00
100  ours  2.10878×10^6  10  0.29  0.00
500  Subset Select CIO  4.79201×10^5  2  3.91  0.00
500  Subset Select CIO + warm start  4.79201×10^5  2  3.59  0.00
500  L0BnB  2.12833×10^6  10  1754.22  0.00
500  MSK persp.  2.12833×10^6  10  3606.30  0.09
500  MSK persp. + warm start  2.12833×10^6  10  3600.23  0.09
500  Convex MSK opt. persp.  2.12833×10^6  10  6.79  0.01
500  ours  2.12833×10^6  10  0.37  0.00
1000  Subset Select CIO  4.71433×10^5  2  4.36  0.00
1000  Subset Select CIO + warm start  4.71433×10^5  2  4.39  0.00
1000  L0BnB  2.11005×10^6  10  2114.74  0.00
1000  MSK persp.  -  -  3673.42  -
1000  MSK persp. + warm start  2.11005×10^6  10  3608.90  0.00
1000  Convex MSK opt. persp.  2.11005×10^6  10  54.02  0.01
1000  ours  2.11005×10^6  10  0.58  0.00
3000  Subset Select CIO  4.72136×10^5  2  5.51  0.00
3000  Subset Select CIO + warm start  4.72136×10^5  2  9.19  0.00
3000  L0BnB  2.10230×10^6  10  2302.34  1.20
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  -  -  3600.00  -
3000  Convex MSK opt. persp.  2.10230×10^6  10  1759.45  0.02
3000  ours  2.10230×10^6  10  1.73  0.00
5000  Subset Select CIO  4.62647×10^5  2  12.04  0.00
5000  Subset Select CIO + warm start  4.62647×10^5  2  12.66  0.00
5000  L0BnB  2.09635×10^6  10  2643.23  5.18
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.09635×10^6  10  3760.81  0.05
5000  ours  2.09635×10^6  10  4.22  0.00

Table 121: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.001).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  2.52792×10^6  2  2.52  0.00
100  Subset Select CIO + warm start  2.52792×10^6  2  2.71  0.00
100  L0BnB  5.28726×10^6  10  3182.66  0.00
100  MSK persp.  5.28726×10^6  10  3600.08  0.01
100  MSK persp. + warm start  5.28726×10^6  10  3600.67  0.01
100  Convex MSK opt. persp.  5.28726×10^6  10  0.11  0.00
100  ours  5.28726×10^6  10  0.29  0.00
500  Subset Select CIO  2.52985×10^6  2  3.43  0.00
500  Subset Select CIO + warm start  2.52985×10^6  2  3.26  0.00
500  L0BnB  5.31852×10^6  10  3977.43  1.65
500  MSK persp.  5.31852×10^6  10  3631.98  0.10
500  MSK persp. + warm start  5.31852×10^6  10  3607.49  0.10
500  Convex MSK opt. persp.  5.31852×10^6  10  7.77  0.01
500  ours  5.31852×10^6  10  0.41  0.00
1000  Subset Select CIO  2.50201×10^6  2  4.67  0.00
1000  Subset Select CIO + warm start  2.50201×10^6  2  4.99  0.00
1000  L0BnB  5.26974×10^6  10  4224.81  2.82
1000  MSK persp.  -  -  3607.94  -
1000  MSK persp. + warm start  5.26974×10^6  10  3601.79  0.20
1000  Convex MSK opt. persp.  5.26974×10^6  10  54.48  0.01
1000  ours  5.26974×10^6  10  0.52  0.00
3000  Subset Select CIO  2.51861×10^6  2  8.73  0.00
3000  Subset Select CIO + warm start  2.51861×10^6  2  8.50  0.00
3000  L0BnB  5.27668×10^6  10  4945.04  0.35
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  -  -  3600.00  -
3000  Convex MSK opt. persp.  5.27668×10^6  10  1724.71  0.02
3000  ours  5.27668×10^6  10  1.75  0.00
5000  Subset Select CIO  2.49167×10^6  2  7.60  0.00
5000  Subset Select CIO + warm start  2.49167×10^6  2  13.32  0.00
5000  L0BnB  5.27126×10^6  10  5589.24  0.49
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  5.27126×10^6  10  3660.81  0.05
5000  ours  5.27126×10^6  10  4.32  0.00

Table 122: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.001).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  7.42601×10^6  2  2.18  0.00
100  Subset Select CIO + warm start  7.42601×10^6  2  2.61  0.00
100  L0BnB  1.10084×10^7  10  3464.13  0.00
100  MSK persp.  1.10084×10^7  10  3605.37  0.01
100  MSK persp. + warm start  1.10084×10^7  10  3602.75  0.01
100  Convex MSK opt. persp.  1.10084×10^7  10  0.14  0.00
100  ours  1.10084×10^7  10  0.29  0.00
500  Subset Select CIO  7.42485×10^6  2  3.33  0.00
500  Subset Select CIO + warm start  7.42485×10^6  2  3.59  0.00
500  L0BnB  1.10518×10^7  10  4143.57  0.77
500  MSK persp.  1.10518×10^7  10  3606.04  0.10
500  MSK persp. + warm start  1.10518×10^7  10  3600.96  0.10
500  Convex MSK opt. persp.  1.10518×10^7  10  7.89  0.01
500  ours  1.10518×10^7  10  0.40  0.00
1000  Subset Select CIO  7.35715×10^6  2  3.71  0.00
1000  Subset Select CIO + warm start  7.35715×10^6  2  4.39  0.00
1000  L0BnB  1.09508×10^7  10  1931.45  0.65
1000  MSK persp.  -  -  3708.29  -
1000  MSK persp. + warm start  1.09508×10^7  10  3692.66  0.00
1000  Convex MSK opt. persp.  1.09508×10^7  10  56.04  0.01
1000  ours  1.09508×10^7  10  0.59  0.00
3000  Subset Select CIO  7.42176×10^6  2  8.32  0.00
3000  Subset Select CIO + warm start  7.42176×10^6  2  8.23  0.00
3000  L0BnB  1.10012×10^7  10  5903.24  1.27
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  1.10012×10^7  10  4174.40  0.00
3000  Convex MSK opt. persp.  1.10012×10^7  10  1848.58  0.02
3000  ours  1.10012×10^7  10  1.74  0.00
5000  Subset Select CIO  7.37395×10^6  2  8.42  0.00
5000  Subset Select CIO + warm start  7.37395×10^6  2  11.87  0.00
5000  L0BnB  1.09966×10^7  10  6029.87  1.11
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  1.09966×10^7  10  6700.63  0.00
5000  Convex MSK opt. persp.  1.09966×10^7  10  3657.70  0.06
5000  ours  1.09966×10^7  10  4.00  0.00

Table 123: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.001).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  2.01567×10^7  2  2.50  0.00
100  Subset Select CIO + warm start  2.01567×10^7  2  2.64  0.00
100  L0BnB  2.43575×10^7  10  2716.56  0.00
100  MSK persp.  2.43575×10^7  10  3602.86  0.01
100  MSK persp. + warm start  2.43575×10^7  10  3600.41  0.01
100  Convex MSK opt. persp.  2.43575×10^7  10  0.12  0.00
100  ours  2.43575×10^7  10  0.32  0.00
500  Subset Select CIO  2.01420×10^7  2  3.26  0.00
500  Subset Select CIO + warm start  2.01420×10^7  2  3.40  0.00
500  L0BnB  2.44185×10^7  10  3747.19  0.00
500  MSK persp.  2.24367×10^7  10  3608.22  8.94
500  MSK persp. + warm start  2.44185×10^7  10  3623.31  0.10
500  Convex MSK opt. persp.  2.44185×10^7  10  8.07  0.01
500  ours  2.44185×10^7  10  0.46  0.00
1000  Subset Select CIO  1.99762×10^7  2  3.67  0.00
1000  Subset Select CIO + warm start  1.99762×10^7  2  4.71  0.00
1000  L0BnB  2.41994×10^7  11  4179.51  0.01
1000  MSK persp.  -  -  3620.44  -
1000  MSK persp. + warm start  2.41989×10^7  10  3686.70  0.20
1000  Convex MSK opt. persp.  2.41989×10^7  10  57.29  0.01
1000  ours  2.41989×10^7  10  0.52  0.00
3000  Subset Select CIO  2.01741×10^7  2  9.04  0.00
3000  Subset Select CIO + warm start  2.01741×10^7  2  9.24  0.00
3000  L0BnB  2.42599×10^7  9  7445.35  0.00
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  2.43713×10^7  10  3888.61  0.00
3000  Convex MSK opt. persp.  2.43713×10^7  10  1929.94  0.02
3000  ours  2.43713×10^7  10  1.77  0.00
5000  Subset Select CIO  2.00985×10^7  2  13.61  0.00
5000  Subset Select CIO + warm start  2.00985×10^7  2  12.21  0.00
5000  L0BnB  2.43694×10^7  10  4485.99  23.68
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.43694×10^7  10  3748.60  0.06
5000  ours  2.43694×10^7  10  4.15  0.00

Table 124: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.001).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  8.64519×10^7  2  2.27  0.00
100  Subset Select CIO + warm start  8.64519×10^7  2  2.79  0.00
100  L0BnB  9.11185×10^7  93  6954.00  0.04
100  MSK persp.  8.96852×10^7  10  3601.43  1.59
100  MSK persp. + warm start  9.11026×10^7  10  3601.28  0.01
100  Convex MSK opt. persp.  9.11026×10^7  10  0.13  0.00
100  ours  9.11026×10^7  10  0.30  0.00
500  Subset Select CIO  8.63791×10^7  2  2.89  0.00
500  Subset Select CIO + warm start  8.63791×10^7  2  3.47  0.00
500  L0BnB  9.10885×10^7  9  7562.05  0.14
500  MSK persp.  -  -  3605.36  -
500  MSK persp. + warm start  9.12047×10^7  10  3630.75  0.10
500  Convex MSK opt. persp.  9.12047×10^7  10  7.26  0.01
500  ours  9.12047×10^7  10  0.38  0.00
1000  Subset Select CIO  8.56943×10^7  2  4.58  0.00
1000  Subset Select CIO + warm start  8.56943×10^7  2  4.73  0.00
1000  L0BnB  8.23906×10^7  1  7300.29  9.17
1000  MSK persp.  -  -  3646.29  -
1000  MSK persp. + warm start  9.04075×10^7  10  3613.37  0.00
1000  Convex MSK opt. persp.  9.04075×10^7  10  67.12  0.01
1000  ours  9.04075×10^7  10  0.52  0.00
3000  Subset Select CIO  8.66243×10^7  2  8.50  0.00
3000  Subset Select CIO + warm start  8.66243×10^7  2  8.12  0.00
3000  L0BnB  9.12781×10^7  12  1706.18  7.80
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  9.12761×10^7  10  3998.60  0.00
3000  Convex MSK opt. persp.  9.12761×10^7  10  2038.29  0.02
3000  ours  9.12761×10^7  10  20.74  0.00
5000  Subset Select CIO  8.64926×10^7  2  12.63  0.00
5000  Subset Select CIO + warm start  8.64926×10^7  2  11.94  0.00
5000  L0BnB  8.32072×10^7  1  7472.42  9.32
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  9.12916×10^7  10  3722.78  3.21
5000  ours  9.12916×10^7  10  58.29  0.00

Table 125: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.001).

G.4.2 Synthetic 1 Benchmark with λ2 = 0.1

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  -  -  3600.00  -
100  Subset Select CIO + warm start  -  -  3600.00  -
100  L0BnB  2.10878×10^6  10  1415.23  0.00
100  MSK persp.  2.10878×10^6  10  928.86  0.00
100  MSK persp. + warm start  2.10878×10^6  10  912.78  0.00
100  Convex MSK opt. persp.  2.10878×10^6  10  0.10  0.00
100  ours  2.10878×10^6  10  0.36  0.00
500  Subset Select CIO  1.19239×10^5  1  3.17  0.00
500  Subset Select CIO + warm start  1.19239×10^5  1  3.39  0.00
500  L0BnB  2.12833×10^6  10  1973.90  0.00
500  MSK persp.  2.12830×10^6  10  3606.82  0.10
500  MSK persp. + warm start  2.12833×10^6  10  3604.01  0.10
500  Convex MSK opt. persp.  2.12833×10^6  10  6.49  0.01
500  ours  2.12833×10^6  10  0.37  0.00
1000  Subset Select CIO  4.71433×10^5  2  4.47  0.00
1000  Subset Select CIO + warm start  4.71433×10^5  2  4.68  0.00
1000  L0BnB  2.11005×10^6  10  2101.63  0.00
1000  MSK persp.  -  -  3602.15  -
1000  MSK persp. + warm start  2.11005×10^6  10  3600.98  0.20
1000  Convex MSK opt. persp.  2.11005×10^6  10  58.36  0.01
1000  ours  2.11005×10^6  10  0.57  0.00
3000  Subset Select CIO  4.72135×10^5  2  8.81  0.00
3000  Subset Select CIO + warm start  4.72135×10^5  2  8.98  0.00
3000  L0BnB  2.10230×10^6  10  2515.21  0.00
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  2.10230×10^6  10  3660.82  0.00
3000  Convex MSK opt. persp.  2.10230×10^6  10  1765.70  0.02
3000  ours  2.10230×10^6  10  1.79  0.00
5000  Subset Select CIO  4.62647×10^5  2  12.26  0.00
5000  Subset Select CIO + warm start  4.62647×10^5  2  12.06  0.00
5000  L0BnB  2.09635×10^6  10  2666.21  3.42
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.09635×10^6  10  3685.08  0.06
5000  ours  2.09635×10^6  10  4.20  0.00

Table 126: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 0.1).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  2.52792×10^6  2  2.32  0.00
100  Subset Select CIO + warm start  2.52792×10^6  2  2.46  0.00
100  L0BnB  5.28726×10^6  10  3182.16  0.00
100  MSK persp.  5.28724×10^6  10  3601.53  0.01
100  MSK persp. + warm start  5.28726×10^6  10  3600.02  0.01
100  Convex MSK opt. persp.  5.28726×10^6  10  0.10  0.00
100  ours  5.28726×10^6  10  0.27  0.00
500  Subset Select CIO  2.52985×10^6  2  3.54  0.00
500  Subset Select CIO + warm start  2.52985×10^6  2  3.26  0.00
500  L0BnB  5.31852×10^6  10  3918.77  1.65
500  MSK persp.  5.31852×10^6  10  3601.88  0.10
500  MSK persp. + warm start  5.31852×10^6  10  3602.35  0.10
500  Convex MSK opt. persp.  5.31852×10^6  10  8.26  0.01
500  ours  5.31852×10^6  10  0.41  0.00
1000  Subset Select CIO  2.50201×10^6  2  4.49  0.00
1000  Subset Select CIO + warm start  2.50201×10^6  2  4.12  0.00
1000  L0BnB  5.26974×10^6  10  4052.64  1.67
1000  MSK persp.  -  -  3600.90  -
1000  MSK persp. + warm start  5.26974×10^6  10  3600.10  0.20
1000  Convex MSK opt. persp.  5.26974×10^6  10  55.50  0.01
1000  ours  5.26974×10^6  10  0.52  0.00
3000  Subset Select CIO  2.51861×10^6  2  5.94  0.00
3000  Subset Select CIO + warm start  2.51861×10^6  2  8.61  0.00
3000  L0BnB  5.27668×10^6  10  4807.82  2.18
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  5.27668×10^6  10  3744.12  0.00
3000  Convex MSK opt. persp.  5.27668×10^6  10  1795.54  0.02
3000  ours  5.27668×10^6  10  1.84  0.00
5000  Subset Select CIO  2.49167×10^6  2  13.36  0.00
5000  Subset Select CIO + warm start  2.49167×10^6  2  12.61  0.00
5000  L0BnB  5.27126×10^6  10  5667.77  0.00
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  5.27126×10^6  10  3606.96  0.05
5000  ours  5.27126×10^6  10  4.42  0.00

Table 127: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 0.1).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  7.42601×10^6  2  2.59  0.00
100  Subset Select CIO + warm start  7.42601×10^6  2  2.80  0.00
100  L0BnB  1.10084×10^7  10  3398.37  0.00
100  MSK persp.  1.09018×10^7  10  3599.98  0.99
100  MSK persp. + warm start  1.10084×10^7  10  3603.80  0.01
100  Convex MSK opt. persp.  1.10084×10^7  10  0.11  0.00
100  ours  1.10084×10^7  10  0.32  0.00
500  Subset Select CIO  7.42485×10^6  2  2.97  0.00
500  Subset Select CIO + warm start  7.42485×10^6  2  3.22  0.00
500  L0BnB  1.10518×10^7  10  4364.09  0.77
500  MSK persp.  -  -  3605.36  -
500  MSK persp. + warm start  1.10518×10^7  10  3610.84  0.00
500  Convex MSK opt. persp.  1.10518×10^7  10  7.25  0.01
500  ours  1.10518×10^7  10  0.40  0.00
1000  Subset Select CIO  7.35715×10^6  2  4.35  0.00
1000  Subset Select CIO + warm start  7.35715×10^6  2  4.97  0.00
1000  L0BnB  1.09508×10^7  10  1882.56  0.65
1000  MSK persp.  -  -  3609.88  -
1000  MSK persp. + warm start  1.09508×10^7  10  3616.06  0.00
1000  Convex MSK opt. persp.  1.09508×10^7  10  63.89  0.01
1000  ours  1.09508×10^7  10  0.53  0.00
3000  Subset Select CIO  7.42176×10^6  2  5.59  0.00
3000  Subset Select CIO + warm start  7.42176×10^6  2  8.42  0.00
3000  L0BnB  1.10012×10^7  10  5588.51  1.24
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  1.10012×10^7  10  4173.41  0.00
3000  Convex MSK opt. persp.  1.10012×10^7  10  1779.87  0.02
3000  ours  1.10012×10^7  10  1.67  0.00
5000  Subset Select CIO  7.37395×10^6  2  8.72  0.00
5000  Subset Select CIO + warm start  7.37395×10^6  2  12.24  0.00
5000  L0BnB  1.09966×10^7  10  6428.17  1.09
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  1.09966×10^7  10  6106.85  0.00
5000  Convex MSK opt. persp.  1.09966×10^7  10  3665.75  0.06
5000  ours  1.09966×10^7  10  4.13  0.00

Table 128: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 0.1).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  2.01567×10^7  2  2.23  0.00
100  Subset Select CIO + warm start  2.01567×10^7  2  3.02  0.00
100  L0BnB  2.43575×10^7  10  2673.71  0.00
100  MSK persp.  2.41233×10^7  10  3601.54  0.98
100  MSK persp. + warm start  2.43575×10^7  10  3600.03  0.01
100  Convex MSK opt. persp.  2.43575×10^7  10  0.14  0.00
100  ours  2.43575×10^7  10  0.34  0.00
500  Subset Select CIO  2.01420×10^7  2  3.58  0.00
500  Subset Select CIO + warm start  2.01420×10^7  2  3.34  0.00
500  L0BnB  2.44185×10^7  10  3602.76  0.00
500  MSK persp.  2.44185×10^7  10  3603.68  0.10
500  MSK persp. + warm start  2.44185×10^7  10  3604.92  0.10
500  Convex MSK opt. persp.  2.44185×10^7  10  7.37  0.01
500  ours  2.44185×10^7  10  0.44  0.00
1000  Subset Select CIO  1.99762×10^7  2  4.47  0.00
1000  Subset Select CIO + warm start  1.99762×10^7  2  4.56  0.00
1000  L0BnB  2.41994×10^7  11  4204.09  0.01
1000  MSK persp.  -  -  3610.81  -
1000  MSK persp. + warm start  2.41989×10^7  10  3609.15  0.00
1000  Convex MSK opt. persp.  2.41989×10^7  10  56.82  0.01
1000  ours  2.41989×10^7  10  0.53  0.00
3000  Subset Select CIO  2.01741×10^7  2  5.42  0.00
3000  Subset Select CIO + warm start  2.01741×10^7  2  8.40  0.00
3000  L0BnB  2.42599×10^7  9  7207.55  0.00
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  2.43713×10^7  10  3980.12  0.00
3000  Convex MSK opt. persp.  2.43713×10^7  10  1799.08  0.02
3000  ours  2.43713×10^7  10  1.75  0.00
5000  Subset Select CIO  2.00985×10^7  2  12.64  0.00
5000  Subset Select CIO + warm start  2.00985×10^7  2  13.96  0.00
5000  L0BnB  2.43694×10^7  10  4979.94  21.63
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.43694×10^7  10  3606.45  0.06
5000  ours  2.43694×10^7  10  4.22  0.00

Table 129: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 0.1).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  8.11196×10^7  1  2.42  0.00
100  Subset Select CIO + warm start  8.11196×10^7  1  2.56  0.00
100  L0BnB  9.11185×10^7  93  6919.82  0.04
100  MSK persp.  8.94241×10^7  10  3600.44  1.89
100  MSK persp. + warm start  9.11026×10^7  10  3600.46  0.01
100  Convex MSK opt. persp.  9.11026×10^7  10  0.18  0.00
100  ours  9.11026×10^7  10  0.31  0.00
500  Subset Select CIO  8.63791×10^7  2  3.47  0.00
500  Subset Select CIO + warm start  8.63791×10^7  2  3.56  0.00
500  L0BnB  9.10885×10^7  9  7359.92  0.14
500  MSK persp.  -  -  3601.63  -
500  MSK persp. + warm start  9.12047×10^7  10  3602.72  0.00
500  Convex MSK opt. persp.  9.12047×10^7  10  8.27  0.01
500  ours  9.12047×10^7  10  0.38  0.00
1000  Subset Select CIO  8.56943×10^7  2  4.83  0.00
1000  Subset Select CIO + warm start  8.56943×10^7  2  4.74  0.00
1000  L0BnB  8.23906×10^7  1  7326.90  9.06
1000  MSK persp.  -  -  3641.35  -
1000  MSK persp. + warm start  9.04075×10^7  10  3611.36  0.00
1000  Convex MSK opt. persp.  9.04075×10^7  10  61.25  0.01
1000  ours  9.04075×10^7  10  0.52  0.00
3000  Subset Select CIO  8.66243×10^7  2  8.37  0.00
3000  Subset Select CIO + warm start  8.66243×10^7  2  9.68  0.00
3000  L0BnB  9.12781×10^7  12  2205.35  5.43
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  9.12761×10^7  10  3661.26  0.00
3000  Convex MSK opt. persp.  9.12761×10^7  10  2087.06  0.02
3000  ours  9.12761×10^7  10  20.88  0.00
5000  Subset Select CIO  8.64926×10^7  2  12.48  0.00
5000  Subset Select CIO + warm start  8.64926×10^7  2  12.73  0.00
5000  L0BnB  8.32072×10^7  1  7201.18  9.32
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  9.12916×10^7  10  8314.78  0.00
5000  Convex MSK opt. persp.  9.12916×10^7  10  3691.61  3.40
5000  ours  9.12916×10^7  10  58.07  0.00

Table 130: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 0.1).

G.4.3 Synthetic 1 Benchmark with λ2 = 10.0

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  1.62628×10^6  10  3693.06  55.44
100  Subset Select CIO + warm start  1.62806×10^6  10  3672.85  55.27
100  L0BnB  2.10868×10^6  10  1432.58  0.00
100  MSK persp.  2.10868×10^6  10  926.33  0.00
100  MSK persp. + warm start  2.10868×10^6  10  919.61  0.00
100  Convex MSK opt. persp.  2.10868×10^6  10  0.10  0.00
100  ours  2.10868×10^6  10  0.30  0.00
500  Subset Select CIO  1.33331×10^6  10  3776.44  91.62
500  Subset Select CIO + warm start  4.87330×10^5  7  1654.44  0.00
500  L0BnB  2.12823×10^6  10  1929.01  0.00
500  MSK persp.  2.12822×10^6  10  3601.80  0.09
500  MSK persp. + warm start  2.12823×10^6  10  3609.69  0.09
500 Convex MSK opt. persp.
2.12823×10⁶  10  7.05  0.01
500  ours  2.12823×10⁶  10  0.38  0.00
1000  Subset Select CIO  7.92745×10⁵  10  3709.35  219.40
1000  Subset Select CIO + warm start  1.33080×10⁶  10  3702.48  90.26
1000  L0BnB  2.10995×10⁶  10  2118.04  0.00
1000  MSK persp.  -  -  3601.31  -
1000  MSK persp. + warm start  2.10995×10⁶  10  3600.48  0.20
1000  Convex MSK opt. persp.  2.10995×10⁶  10  54.57  0.01
1000  ours  2.10995×10⁶  10  0.57  0.00
3000  Subset Select CIO  7.89940×10⁵  10  3633.47  219.21
3000  Subset Select CIO + warm start  1.31406×10⁶  10  4260.43  91.89
3000  L0BnB  2.10220×10⁶  10  2318.67  4.73
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  -  -  3600.00  -
3000  Convex MSK opt. persp.  2.10220×10⁶  10  1669.36  0.02
3000  ours  2.10220×10⁶  10  1.68  0.00
5000  Subset Select CIO  9.76808×10⁵  10  3838.13  157.56
5000  Subset Select CIO + warm start  1.30652×10⁶  10  4063.53  92.56
5000  L0BnB  2.09625×10⁶  10  2700.89  3.42
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.09625×10⁶  10  3669.10  0.06
5000  ours  2.09625×10⁶  10  5.09  0.00

Table 131: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.1, λ2 = 10.0).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  4.75666×10⁶  10  3695.63  33.24
100  Subset Select CIO + warm start  3.72965×10⁶  10  1008.71  0.00
100  L0BnB  5.28716×10⁶  10  3186.48  0.00
100  MSK persp.  5.28716×10⁶  10  3600.15  0.01
100  MSK persp. + warm start  5.28716×10⁶  10  3600.07  0.01
100  Convex MSK opt. persp.  5.28716×10⁶  10  0.16  0.00
100  ours  5.28716×10⁶  10  0.30  0.00
500  Subset Select CIO  4.61560×10⁶  10  3761.41  38.34
500  Subset Select CIO + warm start  4.00602×10⁶  10  3621.18  59.39
500  L0BnB  5.31842×10⁶  10  3958.53  1.65
500  MSK persp.  5.31842×10⁶  10  3604.61  0.09
500  MSK persp. + warm start  5.31842×10⁶  10  3604.65  0.09
500  Convex MSK opt. persp.  5.31842×10⁶  10  7.05  0.01
500  ours  5.31842×10⁶  10  0.41  0.00
1000  Subset Select CIO  3.05644×10⁶  6  3348.60  0.00
1000  Subset Select CIO + warm start  3.95760×10⁶  10  3658.16  59.79
1000  L0BnB  5.26964×10⁶  10  4043.93  1.62
1000  MSK persp.  -  -  3600.44  -
1000  MSK persp. + warm start  5.26964×10⁶  10  3600.14  0.20
1000  Convex MSK opt. persp.  5.26964×10⁶  10  54.61  0.01
1000  ours  5.26964×10⁶  10  0.51  0.00
3000  Subset Select CIO  4.74347×10⁶  10  3927.16  33.42
3000  Subset Select CIO + warm start  -  -  3600.00  -
3000  L0BnB  5.27658×10⁶  10  4906.10  0.35
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  -  -  3600.00  -
3000  Convex MSK opt. persp.  5.27658×10⁶  10  1962.60  0.02
3000  ours  5.27658×10⁶  10  1.76  0.00
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  -  -  3600.00  -
5000  L0BnB  -  -  7200.00  -
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  5.27116×10⁶  10  3651.58  0.05
5000  ours  5.27116×10⁶  10  4.32  0.00

Table 132: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.3, λ2 = 10.0).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  1.00851×10⁷  10  3728.97  30.84
100  Subset Select CIO + warm start  1.02880×10⁷  10  3654.11  28.26
100  L0BnB  1.10083×10⁷  10  3417.76  0.26
100  MSK persp.  1.10083×10⁷  10  1286.52  0.00
100  MSK persp. + warm start  1.10083×10⁷  10  1786.69  0.00
100  Convex MSK opt. persp.  1.10083×10⁷  10  0.16  0.00
100  ours  1.10083×10⁷  10  0.32  0.00
500  Subset Select CIO  9.42612×10⁶  10  3685.22  40.77
500  Subset Select CIO + warm start  9.92278×10⁶  10  3685.15  33.72
500  L0BnB  1.10517×10⁷  10  4308.04  0.77
500  MSK persp.  1.10517×10⁷  10  3601.09  0.09
500  MSK persp. + warm start  1.10517×10⁷  10  3602.10  0.09
500  Convex MSK opt. persp.  1.10517×10⁷  10  7.72  0.01
500  ours  1.10517×10⁷  10  0.41  0.00
1000  Subset Select CIO  1.00437×10⁷  10  3649.62  30.84
1000  Subset Select CIO + warm start  6.62765×10⁶  2  3147.50  0.00
1000  L0BnB  1.09507×10⁷  10  1834.54  0.65
1000  MSK persp.  -  -  3600.46  -
1000  MSK persp. + warm start  1.09507×10⁷  10  3601.83  0.20
1000  Convex MSK opt. persp.  1.09507×10⁷  10  57.71  0.01
1000  ours  1.09507×10⁷  10  0.55  0.00
3000  Subset Select CIO  -  -  3600.00  -
3000  Subset Select CIO + warm start  -  -  3600.00  -
3000  L0BnB  1.10011×10⁷  10  5263.44  1.10
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  1.10011×10⁷  10  4076.24  0.00
3000  Convex MSK opt. persp.  1.10011×10⁷  10  1538.48  0.02
3000  ours  1.10011×10⁷  10  1.73  0.00
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  -  -  3600.00  -
5000  L0BnB  1.09965×10⁷  10  6222.74  1.11
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  1.09965×10⁷  10  3742.63  0.05
5000  ours  1.09965×10⁷  10  4.24  0.00

Table 133: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.5, λ2 = 10.0).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  2.36231×10⁷  10  3732.54  23.59
100  Subset Select CIO + warm start  2.32147×10⁷  10  3623.89  25.77
100  L0BnB  2.43574×10⁷  10  2671.45  0.00
100  MSK persp.  2.39754×10⁷  10  3601.78  1.61
100  MSK persp. + warm start  2.43574×10⁷  10  3604.40  0.01
100  Convex MSK opt. persp.  2.43574×10⁷  10  0.13  0.00
100  ours  2.43574×10⁷  10  0.30  0.00
500  Subset Select CIO  2.27569×10⁷  10  3601.02  28.83
500  Subset Select CIO + warm start  2.32447×10⁷  10  3624.13  26.12
500  L0BnB  2.44184×10⁷  10  3822.46  0.00
500  MSK persp.  -  -  3605.48  -
500  MSK persp. + warm start  2.44184×10⁷  10  3603.64  0.00
500  Convex MSK opt. persp.  2.44184×10⁷  10  7.40  0.01
500  ours  2.44184×10⁷  10  0.44  0.00
1000  Subset Select CIO  2.18726×10⁷  7  2623.29  0.00
1000  Subset Select CIO + warm start  2.32752×10⁷  10  4344.29  24.76
1000  L0BnB  2.41993×10⁷  11  3983.89  0.31
1000  MSK persp.  -  -  3606.66  -
1000  MSK persp. + warm start  2.41988×10⁷  10  3608.58  0.00
1000  Convex MSK opt. persp.  2.41988×10⁷  10  63.01  0.01
1000  ours  2.41988×10⁷  10  0.52  0.00
3000  Subset Select CIO  2.34295×10⁷  10  3767.28  24.75
3000  Subset Select CIO + warm start  -  -  3600.00  -
3000  L0BnB  2.42598×10⁷  9  7332.12  0.00
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  2.43712×10⁷  10  4135.58  0.00
3000  Convex MSK opt. persp.  2.43712×10⁷  10  1968.86  0.02
3000  ours  2.43712×10⁷  10  1.81  0.00
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  -  -  3600.00  -
5000  L0BnB  2.43693×10⁷  10  4661.71  23.68
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  -  -  3600.00  -
5000  Convex MSK opt. persp.  2.43693×10⁷  10  3769.92  0.06
5000  ours  2.43693×10⁷  10  4.08  0.00

Table 134: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.7, λ2 = 10.0).

# of features  method  upper bound  support size  time(s)  gap(%)
100  Subset Select CIO  8.97109×10⁷  10  3602.39  21.73
100  Subset Select CIO + warm start  9.05496×10⁷  10  3626.15  20.60
100  L0BnB  9.11184×10⁷  93  7102.10  0.04
100  MSK persp.  8.99265×10⁷  10  3600.18  1.32
100  MSK persp. + warm start  9.11025×10⁷  10  3600.47  0.01
100  Convex MSK opt. persp.  9.11025×10⁷  10  0.13  0.00
100  ours  9.11025×10⁷  10  0.32  0.00
500  Subset Select CIO  9.00037×10⁷  10  4088.17  21.67
500  Subset Select CIO + warm start  9.02231×10⁷  10  3817.48  21.37
500  L0BnB  9.11085×10⁷  11  843.53  6.21
500  MSK persp.  -  -  3604.01  -
500  MSK persp. + warm start  9.12046×10⁷  10  3604.99  0.00
500  Convex MSK opt. persp.  9.12046×10⁷  10  7.57  0.01
500  ours  9.12046×10⁷  10  0.38  0.00
1000  Subset Select CIO  -  -  3600.00  -
1000  Subset Select CIO + warm start  -  -  3600.00  -
1000  L0BnB  8.23898×10⁷  1  7369.81  9.08
1000  MSK persp.  -  -  3604.50  -
1000  MSK persp. + warm start  9.04074×10⁷  10  3688.88  0.00
1000  Convex MSK opt. persp.  9.04074×10⁷  10  62.54  0.01
1000  ours  9.04074×10⁷  10  0.52  0.00
3000  Subset Select CIO  -  -  3600.00  -
3000  Subset Select CIO + warm start  -  -  3600.00  -
3000  L0BnB  9.12760×10⁷  10  1883.50  6.26
3000  MSK persp.  -  -  3600.00  -
3000  MSK persp. + warm start  9.12760×10⁷  10  3764.99  0.00
3000  Convex MSK opt. persp.  9.12760×10⁷  10  1999.52  0.02
3000  ours  9.12760×10⁷  10  20.84  0.00
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  -  -  3600.00  -
5000  L0BnB  8.32064×10⁷  1  7664.70  9.32
5000  MSK persp.  -  -  3600.00  -
5000  MSK persp. + warm start  9.12915×10⁷  10  8218.76  0.00
5000  Convex MSK opt. persp.  9.12915×10⁷  10  3654.46  3.94
5000  ours  9.12915×10⁷  10  57.41  0.00

Table 135: Comparison of time and optimality gap on the synthetic datasets (n = 100000, k = 10, ρ = 0.9, λ2 = 10.0).

G.4.4 Synthetic 2 Benchmark with λ2 = 0.001

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  1.26607×10⁴  2  1.93  0.00
3000  Subset Select CIO + warm start  1.26607×10⁴  2  2.02  0.00
3000  L0BnB  6.34585×10⁴  10  1200.55  0.00
3000  MSK persp.  5.27912×10⁴  10  3601.00  43.41
3000  MSK persp. + warm start  6.34585×10⁴  10  3601.81  19.30
3000  Convex MSK opt. persp.  4.55044×10⁴  10  1573.44  66.10
3000  ours  6.34585×10⁴  10  3600.49  19.45
4000  Subset Select CIO  1.88907×10⁴  2  1.90  0.00
4000  Subset Select CIO + warm start  1.88907×10⁴  2  2.06  0.00
4000  L0BnB  8.64294×10⁴  10  579.39  0.00
4000  MSK persp.  8.64294×10⁴  10  3602.16  15.37
4000  MSK persp. + warm start  8.64294×10⁴  10  3600.84  15.37
4000  Convex MSK opt. persp.  8.64294×10⁴  10  1398.74  2.40
4000  ours  8.64294×10⁴  10  20.25  0.00
5000  Subset Select CIO  2.49366×10⁴  2  2.08  0.00
5000  Subset Select CIO + warm start  2.49366×10⁴  2  1.96  0.00
5000  L0BnB  1.09576×10⁵  10  1292.85  0.00
5000  MSK persp.  1.09576×10⁵  10  3601.60  11.91
5000  MSK persp. + warm start  1.09576×10⁵  10  3600.55  11.91
5000  Convex MSK opt. persp.  1.09576×10⁵  10  1496.25  1.03
5000  ours  1.09576×10⁵  10  20.51  0.00
6000  Subset Select CIO  -  -  3600.00  -
6000  Subset Select CIO + warm start  2.83063×10⁴  2  2.22  0.00
6000  L0BnB  1.27643×10⁵  10  1379.36  0.00
6000  MSK persp.  -  -  3600.07  -
6000  MSK persp. + warm start  1.27643×10⁵  10  3600.30  9.93
6000  Convex MSK opt. persp.
1.27643×10⁵  10  1318.81  0.62
6000  ours  1.27643×10⁵  10  1.70  0.00
7000  Subset Select CIO  3.24781×10⁴  2  2.06  0.00
7000  Subset Select CIO + warm start  3.24781×10⁴  2  2.22  0.00
7000  L0BnB  1.50614×10⁵  10  1105.58  0.00
7000  MSK persp.  2.07297×10⁴  3  3600.06  687.08
7000  MSK persp. + warm start  1.50614×10⁵  10  3599.93  8.33
7000  Convex MSK opt. persp.  1.50614×10⁵  10  1455.27  0.45
7000  ours  1.50614×10⁵  10  1.67  0.00

Table 136: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  7.20755×10⁴  2  1.99  0.00
3000  Subset Select CIO + warm start  7.20755×10⁴  2  2.08  0.00
3000  L0BnB  1.59509×10⁵  10  3023.58  0.00
3000  MSK persp.  1.48826×10⁵  10  3601.42  27.95
3000  MSK persp. + warm start  1.59509×10⁵  10  3601.71  19.38
3000  Convex MSK opt. persp.  1.13813×10⁵  10  1627.92  67.04
3000  ours  1.59509×10⁵  10  3600.67  19.53
4000  Subset Select CIO  1.03099×10⁵  2  2.07  0.00
4000  Subset Select CIO + warm start  1.03099×10⁵  2  1.81  0.00
4000  L0BnB  2.17574×10⁵  10  3396.85  0.00
4000  MSK persp.  2.17574×10⁵  10  3600.26  15.27
4000  MSK persp. + warm start  2.17574×10⁵  10  3601.91  15.27
4000  Convex MSK opt. persp.  2.17574×10⁵  10  1509.12  2.41
4000  ours  2.17574×10⁵  10  100.13  0.00
5000  Subset Select CIO  1.30339×10⁵  2  1.98  0.00
5000  Subset Select CIO + warm start  1.30339×10⁵  2  2.01  0.00
5000  L0BnB  2.71986×10⁵  10  531.44  0.00
5000  MSK persp.  2.71986×10⁵  10  3600.74  12.00
5000  MSK persp. + warm start  2.71986×10⁵  10  3600.26  12.00
5000  Convex MSK opt. persp.  2.71986×10⁵  10  1514.31  1.04
5000  ours  2.71986×10⁵  10  20.80  0.00
6000  Subset Select CIO  1.50202×10⁵  2  2.02  0.00
6000  Subset Select CIO + warm start  1.50202×10⁵  2  1.96  0.00
6000  L0BnB  3.16761×10⁵  10  3098.88  0.00
6000  MSK persp.  7.83273×10⁴  1  3603.79  344.72
6000  MSK persp. + warm start  3.16761×10⁵  10  3604.35  9.96
6000  Convex MSK opt. persp.  3.16761×10⁵  10  1445.46  0.62
6000  ours  3.16761×10⁵  10  20.70  0.00
7000  Subset Select CIO  1.75217×10⁵  2  2.12  0.00
7000  Subset Select CIO + warm start  1.75217×10⁵  2  2.12  0.00
7000  L0BnB  3.72610×10⁵  10  3130.56  0.00
7000  MSK persp.  -  -  3599.99  -
7000  MSK persp. + warm start  3.72610×10⁵  10  3600.24  8.33
7000  Convex MSK opt. persp.  3.72610×10⁵  10  1520.22  0.45
7000  ours  3.72610×10⁵  10  1.98  0.00

Table 137: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.001).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  2.19279×10⁵  2  1.96  0.00
3000  Subset Select CIO + warm start  2.19279×10⁵  2  1.85  0.00
3000  L0BnB  3.27442×10⁵  8  7379.57  5.16
3000  MSK persp.  2.83751×10⁵  10  3601.66  40.41
3000  MSK persp. + warm start  3.33604×10⁵  10  3601.38  19.43
3000  Convex MSK opt. persp.  2.77416×10⁵  10  1878.24  43.38
3000  ours  3.33604×10⁵  10  3600.58  19.57
4000  Subset Select CIO  3.03679×10⁵  2  1.87  0.00
4000  Subset Select CIO + warm start  3.03679×10⁵  2  1.89  0.00
4000  L0BnB  4.53261×10⁵  10  4149.90  32.94
4000  MSK persp.  4.53261×10⁵  10  3602.72  15.24
4000  MSK persp. + warm start  4.53261×10⁵  10  3599.91  15.24
4000  Convex MSK opt. persp.  4.53261×10⁵  10  1445.07  2.53
4000  ours  4.53261×10⁵  10  3600.25  0.63
5000  Subset Select CIO  4.12520×10⁵  3  1.79  0.00
5000  Subset Select CIO + warm start  4.12520×10⁵  3  2.03  0.00
5000  L0BnB  5.63007×10⁵  10  3522.00  0.00
5000  MSK persp.  5.63007×10⁵  10  3600.76  12.04
5000  MSK persp. + warm start  5.63007×10⁵  10  3603.42  12.04
5000  Convex MSK opt. persp.  5.63007×10⁵  10  1618.41  1.04
5000  ours  5.63007×10⁵  10  36.97  0.00
6000  Subset Select CIO  4.42065×10⁵  2  1.94  0.00
6000  Subset Select CIO + warm start  4.42065×10⁵  2  2.12  0.00
6000  L0BnB  6.57799×10⁵  10  3327.45  0.00
6000  MSK persp.  -  -  3605.66  -
6000  MSK persp. + warm start  6.57799×10⁵  10  3605.89  9.98
6000  Convex MSK opt. persp.  6.57799×10⁵  10  1523.01  0.62
6000  ours  6.57799×10⁵  10  20.75  0.00
7000  Subset Select CIO  5.16405×10⁵  2  2.11  0.00
7000  Subset Select CIO + warm start  5.16405×10⁵  2  2.14  0.00
7000  L0BnB  7.72267×10⁵  10  3567.05  0.00
7000  MSK persp.  -  -  3600.27  -
7000  MSK persp. + warm start  7.72267×10⁵  10  3600.18  8.33
7000  Convex MSK opt. persp.  7.72267×10⁵  10  1681.35  0.45
7000  ours  7.72267×10⁵  10  21.02  0.00

Table 138: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.001).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  6.06571×10⁵  2  1.88  0.00
3000  Subset Select CIO + warm start  6.06571×10⁵  2  1.88  0.00
3000  L0BnB  5.52986×10⁵  1  7308.45  34.21
3000  MSK persp.  6.89430×10⁵  10  3600.74  28.43
3000  MSK persp. + warm start  7.41187×10⁵  10  3601.61  19.46
3000  Convex MSK opt. persp.  6.84471×10⁵  10  1652.72  29.14
3000  ours  7.41187×10⁵  10  3601.14  19.60
4000  Subset Select CIO  8.24785×10⁵  2  1.85  0.00
4000  Subset Select CIO + warm start  8.24785×10⁵  2  1.86  0.00
4000  L0BnB  7.53893×10⁵  1  7293.04  32.01
4000  MSK persp.  9.95041×10⁵  10  3601.67  16.12
4000  MSK persp. + warm start  1.00287×10⁶  10  3602.00  15.21
4000  Convex MSK opt. persp.  1.00287×10⁶  10  1442.72  2.82
4000  ours  1.00287×10⁶  10  3600.37  1.88
5000  Subset Select CIO  1.02490×10⁶  2  2.05  0.00
5000  Subset Select CIO + warm start  1.02490×10⁶  2  1.92  0.00
5000  L0BnB  1.23555×10⁶  9  7387.59  4.90
5000  MSK persp.  1.24027×10⁶  10  3604.54  12.07
5000  MSK persp. + warm start  1.24027×10⁶  10  3600.77  12.07
5000  Convex MSK opt. persp.  1.24027×10⁶  10  1654.86  1.06
5000  ours  1.24027×10⁶  10  466.42  0.00
6000  Subset Select CIO  1.20146×10⁶  2  1.96  0.00
6000  Subset Select CIO + warm start  1.20146×10⁶  2  2.05  0.00
6000  L0BnB  1.44860×10⁶  9  7265.23  4.30
6000  MSK persp.  1.14100×10⁶  2  3603.46  40.19
6000  MSK persp. + warm start  1.45417×10⁶  10  3604.24  10.00
6000  Convex MSK opt. persp.  1.45417×10⁶  10  1667.19  0.63
6000  ours  1.45417×10⁶  10  57.67  0.00
7000  Subset Select CIO  1.46875×10⁶  3  1.86  0.00
7000  Subset Select CIO + warm start  1.46875×10⁶  3  2.09  0.00
7000  L0BnB  1.69804×10⁶  9  7320.83  2.92
7000  MSK persp.  1.32733×10⁶  2  3600.13  39.15
7000  MSK persp. + warm start  1.70483×10⁶  10  3600.05  8.34
7000  Convex MSK opt. persp.  1.70483×10⁶  10  1760.17  0.45
7000  ours  1.70483×10⁶  10  21.42  0.00

Table 139: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.001).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  2.63352×10⁶  2  1.90  0.00
3000  Subset Select CIO + warm start  2.63352×10⁶  2  2.04  0.00
3000  L0BnB  2.55637×10⁶  1  7236.54  18.84
3000  MSK persp.  2.72590×10⁶  10  3599.87  22.06
3000  MSK persp. + warm start  2.72590×10⁶  10  3600.80  22.06
3000  Convex MSK opt. persp.  2.72998×10⁶  10  2311.76  21.67
3000  ours  2.78521×10⁶  10  3600.01  19.60
4000  Subset Select CIO  3.38912×10⁶  1  1.79  0.00
4000  Subset Select CIO + warm start  3.38912×10⁶  1  1.89  0.00
4000  L0BnB  3.45201×10⁶  1  7291.93  15.48
4000  MSK persp.  -  -  3609.13  -
4000  MSK persp. + warm start  3.75203×10⁶  10  3605.54  15.12
4000  Convex MSK opt. persp.  3.75202×10⁶  10  1481.77  3.23
4000  ours  3.75203×10⁶  10  3600.42  3.50
5000  Subset Select CIO  4.37679×10⁶  2  1.99  0.00
5000  Subset Select CIO + warm start  4.37679×10⁶  2  2.07  0.00
5000  L0BnB  4.23095×10⁶  1  7316.72  13.65
5000  MSK persp.  4.60534×10⁶  10  3602.67  12.43
5000  MSK persp. + warm start  4.61877×10⁶  10  3601.35  12.11
5000  Convex MSK opt. persp.  4.61877×10⁶  10  1537.61  1.22
5000  ours  4.61877×10⁶  10  3600.57  0.93
6000  Subset Select CIO  5.15897×10⁶  2  2.23  0.00
6000  Subset Select CIO + warm start  5.15897×10⁶  2  2.24  0.00
6000  L0BnB  4.97742×10⁶  1  7330.86  13.05
6000  MSK persp.  -  -  3603.01  -
6000  MSK persp. + warm start  5.43826×10⁶  10  3602.04  0.00
6000  Convex MSK opt. persp.  5.43826×10⁶  10  1652.62  0.65
6000  ours  5.43826×10⁶  10  3600.65  0.33
7000  Subset Select CIO  6.02832×10⁶  2  1.89  0.00
7000  Subset Select CIO + warm start  6.02832×10⁶  2  2.27  0.00
7000  L0BnB  5.82321×10⁶  1  7356.27  11.85
7000  MSK persp.  -  -  3600.06  -
7000  MSK persp. + warm start  6.36767×10⁶  10  3600.21  8.35
7000  Convex MSK opt. persp.  6.36767×10⁶  10  1821.91  0.45
7000  ours  6.36767×10⁶  10  3600.78  0.11

Table 140: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.001).

G.4.5 Synthetic 2 Benchmark with λ2 = 0.1

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  1.83147×10⁴  10  4229.80  313.93
3000  Subset Select CIO + warm start  5.70956×10⁴  10  3602.11  32.77
3000  L0BnB  6.34575×10⁴  10  1201.41  0.00
3000  MSK persp.  6.34575×10⁴  10  3601.57  17.90
3000  MSK persp. + warm start  6.34575×10⁴  10  3600.12  17.89
3000  Convex MSK opt. persp.  6.34575×10⁴  10  1444.07  17.85
3000  ours  6.34575×10⁴  10  3600.33  19.31
4000  Subset Select CIO  3.10931×10⁴  10  3872.26  235.85
4000  Subset Select CIO + warm start  7.28764×10⁴  10  4101.05  43.29
4000  L0BnB  8.64284×10⁴  10  566.25  0.00
4000  MSK persp.  8.64284×10⁴  10  3602.29  15.03
4000  MSK persp. + warm start  8.64284×10⁴  10  3601.14  15.05
4000  Convex MSK opt. persp.  8.64284×10⁴  10  1458.63  2.40
4000  ours  8.64284×10⁴  10  20.81  0.00
5000  Subset Select CIO  4.06480×10⁴  10  3939.75  222.71
5000  Subset Select CIO + warm start  9.16380×10⁴  10  3804.66  43.14
5000  L0BnB  1.09575×10⁵  10  1314.94  0.00
5000  MSK persp.  1.09575×10⁵  10  3603.48  11.78
5000  MSK persp. + warm start  1.09575×10⁵  10  3602.79  11.78
5000  Convex MSK opt. persp.  1.09575×10⁵  10  1478.60  1.03
5000  ours  1.09575×10⁵  10  20.65  0.00
6000  Subset Select CIO  -  -  3600.00  -
6000  Subset Select CIO + warm start  1.21549×10⁵  10  3850.54  26.10
6000  L0BnB  1.27642×10⁵  10  1536.59  0.00
6000  MSK persp.  -  -  3606.50  -
6000  MSK persp. + warm start  1.27642×10⁵  10  3603.98  9.85
6000  Convex MSK opt. persp.
1.27642×10⁵  10  1303.39  0.62
6000  ours  1.27642×10⁵  10  1.70  0.00
7000  Subset Select CIO  5.54763×10⁴  10  4067.63  224.61
7000  Subset Select CIO + warm start  1.42693×10⁵  10  3615.01  26.20
7000  L0BnB  1.50613×10⁵  10  1064.72  0.00
7000  MSK persp.  -  -  3599.90  -
7000  MSK persp. + warm start  1.50613×10⁵  10  3600.13  8.28
7000  Convex MSK opt. persp.  1.50613×10⁵  10  1405.96  0.45
7000  ours  1.50613×10⁵  10  1.67  0.00

Table 141: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  1.13982×10⁵  10  3755.86  67.28
3000  Subset Select CIO + warm start  1.59508×10⁵  10  3857.37  19.54
3000  L0BnB  1.59508×10⁵  10  1491.90  0.00
3000  MSK persp.  1.59508×10⁵  10  3600.12  17.99
3000  MSK persp. + warm start  1.59508×10⁵  10  3600.37  17.99
3000  Convex MSK opt. persp.  1.59508×10⁵  10  1454.84  17.99
3000  ours  1.59508×10⁵  10  3600.27  19.40
4000  Subset Select CIO  -  -  3600.00  -
4000  Subset Select CIO + warm start  2.17573×10⁵  10  3711.79  20.70
4000  L0BnB  2.17573×10⁵  10  3399.35  0.00
4000  MSK persp.  2.17573×10⁵  10  3601.17  14.95
4000  MSK persp. + warm start  2.17573×10⁵  10  3602.25  14.95
4000  Convex MSK opt. persp.  2.17573×10⁵  10  1510.77  2.40
4000  ours  2.17573×10⁵  10  99.53  0.00
5000  Subset Select CIO  1.92108×10⁵  10  3602.76  69.69
5000  Subset Select CIO + warm start  2.71985×10⁵  10  3733.49  19.85
5000  L0BnB  2.71985×10⁵  10  532.76  0.00
5000  MSK persp.  2.71985×10⁵  10  3601.74  11.88
5000  MSK persp. + warm start  2.71985×10⁵  10  3602.31  11.88
5000  Convex MSK opt. persp.  2.71985×10⁵  10  1576.92  1.04
5000  ours  2.71985×10⁵  10  21.01  0.00
6000  Subset Select CIO  2.03206×10⁵  10  4031.62  87.30
6000  Subset Select CIO + warm start  3.10827×10⁵  10  3780.98  22.45
6000  L0BnB  3.16760×10⁵  10  3082.87  0.00
6000  MSK persp.  -  -  3600.17  -
6000  MSK persp. + warm start  3.16760×10⁵  10  3599.98  9.90
6000  Convex MSK opt. persp.  3.16760×10⁵  10  1487.94  0.62
6000  ours  3.16760×10⁵  10  20.91  0.00
7000  Subset Select CIO  2.44730×10⁵  10  3992.94  82.05
7000  Subset Select CIO + warm start  3.37947×10⁵  10  4164.11  31.83
7000  L0BnB  3.72609×10⁵  10  3126.74  0.00
7000  MSK persp.  -  -  3600.16  -
7000  MSK persp. + warm start  3.72609×10⁵  10  3604.63  8.28
7000  Convex MSK opt. persp.  3.72609×10⁵  10  1555.38  0.45
7000  ours  3.72609×10⁵  10  1.73  0.00

Table 142: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 0.1).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  2.79591×10⁵  9  3686.43  42.67
3000  Subset Select CIO + warm start  3.11853×10⁵  10  3738.09  27.91
3000  L0BnB  3.27441×10⁵  8  7347.92  4.34
3000  MSK persp.  3.33603×10⁵  10  3600.83  18.03
3000  MSK persp. + warm start  3.33603×10⁵  10  3600.62  18.03
3000  Convex MSK opt. persp.  3.18968×10⁵  10  1341.73  23.47
3000  ours  3.33603×10⁵  10  3600.68  19.44
4000  Subset Select CIO  -  -  3600.00  -
4000  Subset Select CIO + warm start  4.44307×10⁵  10  3651.78  23.07
4000  L0BnB  4.53260×10⁵  10  4150.43  32.94
4000  MSK persp.  4.53260×10⁵  10  3602.80  14.94
4000  MSK persp. + warm start  4.53260×10⁵  10  3601.24  14.94
4000  Convex MSK opt. persp.  4.53260×10⁵  10  1444.17  2.53
4000  ours  4.53260×10⁵  10  3600.41  0.62
5000  Subset Select CIO  4.69039×10⁵  8  3951.25  43.95
5000  Subset Select CIO + warm start  5.44684×10⁵  9  4020.24  23.96
5000  L0BnB  5.63006×10⁵  10  3516.88  0.00
5000  MSK persp.  5.63006×10⁵  10  3603.31  11.92
5000  MSK persp. + warm start  5.63006×10⁵  10  3601.57  11.92
5000  Convex MSK opt. persp.  5.63006×10⁵  10  1572.75  1.04
5000  ours  5.63006×10⁵  10  37.26  0.00
6000  Subset Select CIO  2.80001×10⁵  1  2.28  0.00
6000  Subset Select CIO + warm start  6.44963×10⁵  10  3724.51  22.59
6000  L0BnB  6.57798×10⁵  10  3317.18  0.00
6000  MSK persp.  -  -  3604.37  -
6000  MSK persp. + warm start  6.57798×10⁵  10  3606.26  9.92
6000  Convex MSK opt. persp.  6.57798×10⁵  10  1657.28  0.62
6000  ours  6.57798×10⁵  10  20.64  0.00
7000  Subset Select CIO  3.48812×10⁵  1  2.12  0.00
7000  Subset Select CIO + warm start  3.45235×10⁵  1  2.06  0.00
7000  L0BnB  7.72266×10⁵  10  3589.79  0.00
7000  MSK persp.  -  -  3599.94  -
7000  MSK persp. + warm start  7.72266×10⁵  10  3600.05  8.29
7000  Convex MSK opt. persp.  7.72266×10⁵  10  1603.99  0.45
7000  ours  7.72266×10⁵  10  20.91  0.00

Table 143: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 0.1).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  6.90645×10⁵  10  4096.61  28.35
3000  Subset Select CIO + warm start  7.31489×10⁵  10  3621.77  21.19
3000  L0BnB  5.52981×10⁵  1  7310.65  34.21
3000  MSK persp.  7.39018×10⁵  10  3600.41  18.44
3000  MSK persp. + warm start  7.41186×10⁵  10  3600.40  18.10
3000  Convex MSK opt. persp.  7.07443×10⁵  10  1698.29  23.74
3000  ours  7.41186×10⁵  10  3600.59  19.46
4000  Subset Select CIO  9.43345×10⁵  10  3677.02  28.22
4000  Subset Select CIO + warm start  9.93033×10⁵  10  3684.08  21.81
4000  L0BnB  7.53887×10⁵  1  7333.72  31.99
4000  MSK persp.  9.98256×10⁵  10  3600.92  15.46
4000  MSK persp. + warm start  1.00287×10⁶  10  3602.18  14.92
4000  Convex MSK opt. persp.  1.00287×10⁶  10  1574.91  2.82
4000  ours  1.00287×10⁶  10  3600.08  1.88
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  1.22972×10⁶  10  3997.81  21.00
5000  L0BnB  1.23555×10⁶  9  7414.89  5.14
5000  MSK persp.  1.24027×10⁶  10  3601.11  11.96
5000  MSK persp. + warm start  1.24027×10⁶  10  3601.56  11.96
5000  Convex MSK opt. persp.  1.24027×10⁶  10  1638.36  1.06
5000  ours  1.24027×10⁶  10  469.43  0.00
6000  Subset Select CIO  9.41611×10⁵  1  59.04  0.00
6000  Subset Select CIO + warm start  9.78724×10⁵  1  2.14  0.00
6000  L0BnB  1.44860×10⁶  9  7352.27  4.07
6000  MSK persp.  -  -  3602.12  -
6000  MSK persp. + warm start  1.45416×10⁶  10  3600.04  9.94
6000  Convex MSK opt. persp.  1.45416×10⁶  10  1653.65  0.63
6000  ours  1.45416×10⁶  10  59.16  0.00
7000  Subset Select CIO  1.14528×10⁶  1  2.13  0.00
7000  Subset Select CIO + warm start  1.14528×10⁶  1  2.11  0.00
7000  L0BnB  1.69804×10⁶  9  7503.37  3.09
7000  MSK persp.  -  -  3600.28  -
7000  MSK persp. + warm start  1.70483×10⁶  10  3600.18  8.30
7000  Convex MSK opt. persp.  1.70483×10⁶  10  1715.12  0.45
7000  ours  1.70483×10⁶  10  21.12  0.00

Table 144: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 0.1).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  2.70507×10⁶  5  3696.31  23.15
3000  Subset Select CIO + warm start  2.76623×10⁶  7  4030.96  20.43
3000  L0BnB  2.55637×10⁶  1  7339.86  18.82
3000  MSK persp.  2.75384×10⁶  10  3600.44  19.50
3000  MSK persp. + warm start  2.75384×10⁶  10  3600.07  19.50
3000  Convex MSK opt. persp.  2.72647×10⁶  10  1999.54  20.67
3000  ours  2.78521×10⁶  10  3600.23  19.46
4000  Subset Select CIO  3.65950×10⁶  9  3841.51  23.56
4000  Subset Select CIO + warm start  3.74433×10⁶  10  3649.94  20.76
4000  L0BnB  3.45200×10⁶  1  7481.26  15.51
4000  MSK persp.  3.72627×10⁶  10  3601.44  15.63
4000  MSK persp. + warm start  3.75203×10⁶  10  3601.81  14.84
4000  Convex MSK opt. persp.  3.75202×10⁶  10  1543.65  3.23
4000  ours  3.75203×10⁶  10  3601.10  3.50
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  4.61877×10⁶  10  3681.64  20.02
5000  L0BnB  4.23094×10⁶  1  7531.59  13.78
5000  MSK persp.  4.60533×10⁶  10  3605.23  12.32
5000  MSK persp. + warm start  4.61877×10⁶  10  3601.43  11.99
5000  Convex MSK opt. persp.  4.61877×10⁶  10  1633.96  1.22
5000  ours  4.61877×10⁶  10  3600.40  0.93
6000  Subset Select CIO  5.38482×10⁶  10  3762.10  21.45
6000  Subset Select CIO + warm start  5.42848×10⁶  10  3891.96  20.47
6000  L0BnB  4.97741×10⁶  1  7211.19  12.81
6000  MSK persp.  4.81401×10⁶  1  3601.53  24.21
6000  MSK persp. + warm start  5.43826×10⁶  10  3601.18  9.95
6000  Convex MSK opt. persp.
5.43826×10⁶  10  1535.09  0.65
6000  ours  5.43826×10⁶  10  3600.26  0.33
7000  Subset Select CIO  5.66568×10⁶  1  2.34  0.00
7000  Subset Select CIO + warm start  5.66568×10⁶  1  2.16  0.00
7000  L0BnB  5.82321×10⁶  1  7336.27  11.85
7000  MSK persp.  -  -  3600.04  -
7000  MSK persp. + warm start  6.36767×10⁶  10  3599.94  8.31
7000  Convex MSK opt. persp.  6.36767×10⁶  10  1779.25  0.45
7000  ours  6.36767×10⁶  10  3600.58  0.11

Table 145: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 0.1).

G.4.6 Synthetic 2 Benchmark with λ2 = 10.0

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  4.92601×10⁴  10  3915.82  53.90
3000  Subset Select CIO + warm start  3.58868×10⁴  10  3913.17  111.25
3000  L0BnB  6.33576×10⁴  10  1198.54  0.00
3000  MSK persp.  6.33576×10⁴  10  3601.44  8.58
3000  MSK persp. + warm start  6.33576×10⁴  10  3600.20  8.58
3000  Convex MSK opt. persp.  6.33576×10⁴  10  1334.48  9.04
3000  ours  6.33576×10⁴  10  3601.41  4.54
4000  Subset Select CIO  7.36488×10⁴  10  4165.15  41.79
4000  Subset Select CIO + warm start  5.42345×10⁴  10  3789.71  92.55
4000  L0BnB  8.63287×10⁴  10  1579.27  0.00
4000  MSK persp.  8.63287×10⁴  10  3600.31  7.64
4000  MSK persp. + warm start  8.63287×10⁴  10  3601.55  7.64
4000  Convex MSK opt. persp.  8.63287×10⁴  10  1478.13  2.24
4000  ours  8.63287×10⁴  10  20.33  0.00
5000  Subset Select CIO  -  -  3600.00  -
5000  Subset Select CIO + warm start  6.16010×10⁴  10  3623.84  112.94
5000  L0BnB  1.09474×10⁵  10  1296.29  0.00
5000  MSK persp.  1.09474×10⁵  10  3601.04  6.75
5000  MSK persp. + warm start  1.09474×10⁵  10  3602.25  6.75
5000  Convex MSK opt. persp.  1.09474×10⁵  10  1532.63  1.00
5000  ours  1.09474×10⁵  10  20.71  0.00
6000  Subset Select CIO  9.91460×10⁴  10  3945.11  54.60
6000  Subset Select CIO + warm start  8.28165×10⁴  10  3702.54  85.08
6000  L0BnB  1.27542×10⁵  10  1268.61  0.00
6000  MSK persp.  1.27542×10⁵  10  3602.43  6.47
6000  MSK persp. + warm start  1.27542×10⁵  10  3607.80  6.47
6000  Convex MSK opt. persp.  1.27542×10⁵  10  1314.11  0.61
6000  ours  1.27542×10⁵  10  1.69  0.00
7000  Subset Select CIO  1.17343×10⁵  10  3661.02  53.46
7000  Subset Select CIO + warm start  9.54749×10⁴  10  3758.42  88.62
7000  L0BnB  1.50512×10⁵  10  1063.19  0.00
7000  MSK persp.  1.50512×10⁵  10  3600.60  5.68
7000  MSK persp. + warm start  1.50512×10⁵  10  3600.79  5.68
7000  Convex MSK opt. persp.  1.50512×10⁵  10  1480.64  0.45
7000  ours  1.50512×10⁵  10  1.70  0.00

Table 146: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  1.39974×10⁵  10  3764.95  36.22
3000  Subset Select CIO + warm start  1.44268×10⁵  10  3937.05  32.16
3000  L0BnB  1.59409×10⁵  10  1563.82  0.00
3000  MSK persp.  1.59409×10⁵  10  3601.01  9.02
3000  MSK persp. + warm start  1.59409×10⁵  10  3600.73  9.02
3000  Convex MSK opt. persp.  1.59409×10⁵  10  1511.63  9.58
3000  ours  1.59409×10⁵  10  3600.62  4.68
4000  Subset Select CIO  1.82403×10⁵  10  3763.38  43.97
4000  Subset Select CIO + warm start  2.08718×10⁵  10  4023.55  25.82
4000  L0BnB  2.17472×10⁵  10  3126.91  0.00
4000  MSK persp.  2.17472×10⁵  10  3605.75  7.91
4000  MSK persp. + warm start  2.17472×10⁵  10  3601.48  7.91
4000  Convex MSK opt. persp.  2.17472×10⁵  10  1478.15  2.24
4000  ours  2.17472×10⁵  10  86.90  0.00
5000  Subset Select CIO  2.60617×10⁵  10  3976.98  25.08
5000  Subset Select CIO + warm start  2.19315×10⁵  10  3615.40  48.64
5000  L0BnB  2.71885×10⁵  10  1571.26  0.00
5000  MSK persp.  2.71885×10⁵  10  3602.81  7.26
5000  MSK persp. + warm start  2.71885×10⁵  10  3603.13  7.26
5000  Convex MSK opt. persp.  2.71885×10⁵  10  1509.28  1.01
5000  ours  2.71885×10⁵  10  20.75  0.00
6000  Subset Select CIO  2.62606×10⁵  10  3828.76  44.93
6000  Subset Select CIO + warm start  2.84305×10⁵  10  3662.07  33.87
6000  L0BnB  3.16660×10⁵  10  3085.01  0.00
6000  MSK persp.  3.16660×10⁵  10  3601.61  6.73
6000  MSK persp. + warm start  3.16660×10⁵  10  3605.65  6.73
6000  Convex MSK opt. persp.  3.16660×10⁵  10  1471.81  0.61
6000  ours  3.16660×10⁵  10  21.05  0.00
7000  Subset Select CIO  2.96322×10⁵  10  3700.24  50.35
7000  Subset Select CIO + warm start  2.97673×10⁵  10  3747.76  49.67
7000  L0BnB  3.72509×10⁵  10  3125.79  0.00
7000  MSK persp.  3.72509×10⁵  10  3605.45  5.87
7000  MSK persp. + warm start  3.72509×10⁵  10  3601.55  5.87
7000  Convex MSK opt. persp.  3.72509×10⁵  10  1549.05  0.45
7000  ours  3.72509×10⁵  10  1.71  0.00

Table 147: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.3, λ2 = 10.0).

# of samples  method  upper bound  support size  time(s)  gap(%)
3000  Subset Select CIO  3.12174×10⁵  10  3636.16  27.78
3000  Subset Select CIO + warm start  3.27648×10⁵  10  3759.74  21.75
3000  L0BnB  3.27321×10⁵  8  7307.65  4.14
3000  MSK persp.  3.33503×10⁵  10  3600.22  9.25
3000  MSK persp. + warm start  3.33503×10⁵  10  3600.89  9.25
3000  Convex MSK opt. persp.  3.33503×10⁵  10  1475.73  9.93
3000  ours  3.33503×10⁵  10  3600.25  6.21
4000  Subset Select CIO  3.96358×10⁵  10  3655.16  37.96
4000  Subset Select CIO + warm start  4.44111×10⁵  10  3679.12  23.13
4000  L0BnB  4.53159×10⁵  10  4152.36  32.87
4000  MSK persp.  4.53159×10⁵  10  3602.35  8.06
4000  MSK persp. + warm start  4.53159×10⁵  10  3600.09  8.06
4000  Convex MSK opt. persp.  4.53159×10⁵  10  1513.22  2.34
4000  ours  4.53159×10⁵  10  3600.95  0.35
5000  Subset Select CIO  4.93471×10⁵  10  3709.44  36.82
5000  Subset Select CIO + warm start  5.27817×10⁵  10  3730.92  27.92
5000  L0BnB  5.62906×10⁵  10  3521.21  0.00
5000  MSK persp.  5.62906×10⁵  10  3602.11  7.22
5000  MSK persp. + warm start  5.62906×10⁵  10  3603.85  7.44
5000  Convex MSK opt. persp.  5.62906×10⁵  10  1530.90  1.02
5000  ours  5.62906×10⁵  10  31.47  0.00
6000  Subset Select CIO  5.71812×10⁵  10  3620.92  38.27
6000  Subset Select CIO + warm start  6.34165×10⁵  10  3675.07  24.67
6000  L0BnB  6.57698×10⁵  10  3313.82  0.00
6000  MSK persp.  6.57698×10⁵  10  3605.13  6.87
6000  MSK persp. + warm start  6.57698×10⁵  10  3608.94  6.87
6000  Convex MSK opt. persp.
6.57698 105 10 1642.87 0.62 6000 ours 6.57698 105 10 21.24 0.00 7000 Subset Select CIO 7.32796 105 10 3784.82 26.02 7000 Subset Select CIO + warm start 7.45541 105 10 3662.49 23.87 7000 L0Bn B 7.72166 105 10 3081.93 0.68 7000 MSK persp. 7.72166 105 10 3604.29 5.98 7000 MSK persp. + warm start 7.72166 105 10 3606.92 5.98 7000 Convex MSK opt. persp. 7.72166 105 10 1526.71 0.45 7000 ours 7.72166 105 10 20.37 0.00 Table 148: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.5, λ2 = 10.0). # of samples method upper bound support size time(s) gap(%) 3000 Subset Select CIO 6.97357 105 10 3727.41 27.12 3000 Subset Select CIO + warm start 7.35099 105 10 3678.19 20.59 3000 L0Bn B 5.52435 105 1 7287.00 33.44 3000 MSK persp. 7.41086 105 10 3602.90 9.56 3000 MSK persp. + warm start 7.41086 105 10 3600.62 9.56 3000 Convex MSK opt. persp. 7.41086 105 10 1525.68 10.22 3000 ours 7.41086 105 10 3600.48 8.32 4000 Subset Select CIO 9.65426 105 10 3756.83 25.29 4000 Subset Select CIO + warm start 9.86629 105 10 3731.48 22.60 4000 L0Bn B 7.53327 105 1 7269.26 31.51 4000 MSK persp. 1.00277 106 10 3600.55 8.17 4000 MSK persp. + warm start 1.00277 106 10 3601.59 8.17 4000 Convex MSK opt. persp. 1.00277 106 10 1555.56 2.60 4000 ours 1.00277 106 10 3600.62 1.66 5000 Subset Select CIO 1.14948 106 10 3626.73 29.45 5000 Subset Select CIO + warm start 1.23402 106 10 3607.12 20.58 5000 L0Bn B 1.23544 106 9 7434.60 4.95 5000 MSK persp. 1.24017 106 10 3602.18 7.35 5000 MSK persp. + warm start 1.24017 106 10 3604.42 7.35 5000 Convex MSK opt. persp. 1.24017 106 10 1606.64 1.03 5000 ours 1.24017 106 10 386.72 0.00 6000 Subset Select CIO 1.36967 106 10 3743.26 27.64 6000 Subset Select CIO + warm start 1.43335 106 10 3802.62 21.97 6000 L0Bn B 1.44849 106 9 7430.07 4.18 6000 MSK persp. 1.45406 106 10 3608.33 6.98 6000 MSK persp. + warm start 1.45406 106 10 3604.65 6.98 6000 Convex MSK opt. persp. 
1.45406 106 10 1665.16 0.62 6000 ours 1.45406 106 10 54.13 0.00 7000 Subset Select CIO 1.60283 106 10 3661.63 27.20 7000 Subset Select CIO + warm start 1.68426 106 10 3897.62 21.05 7000 L0Bn B 1.69016 106 8 7280.69 3.21 7000 MSK persp. 1.70473 106 10 3603.26 6.07 7000 MSK persp. + warm start 1.70473 106 10 3603.98 6.07 7000 Convex MSK opt. persp. 1.70473 106 10 1677.45 0.45 7000 ours 1.70473 106 10 21.30 0.00 Table 149: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.7, λ2 = 10.0). # of samples method upper bound support size time(s) gap(%) 3000 Subset Select CIO 2.76095 106 10 3644.99 20.66 3000 Subset Select CIO + warm start 2.78489 106 10 3685.28 19.62 3000 L0Bn B 2.55553 106 1 7230.61 17.51 3000 MSK persp. 2.78334 106 10 3601.82 10.13 3000 MSK persp. + warm start 2.78334 106 10 3600.26 10.13 3000 Convex MSK opt. persp. 2.77971 106 10 1638.17 10.70 3000 ours 2.78507 106 10 3600.78 9.79 4000 Subset Select CIO 3.71695 106 10 3601.35 21.65 4000 Subset Select CIO + warm start 3.74577 106 10 3638.03 20.71 4000 L0Bn B 3.45116 106 1 7514.37 15.03 4000 MSK persp. 3.74752 106 10 3602.22 8.70 4000 MSK persp. + warm start 3.75192 106 10 3602.31 8.57 4000 Convex MSK opt. persp. 3.75191 106 10 1564.01 3.01 4000 ours 3.75192 106 10 3600.32 3.23 5000 Subset Select CIO 4.57617 106 10 3864.92 21.14 5000 Subset Select CIO + warm start 4.60856 106 10 3764.48 20.29 5000 L0Bn B 4.23012 106 1 7233.86 13.08 5000 MSK persp. 4.61866 106 10 3603.72 7.81 5000 MSK persp. + warm start 4.61867 106 10 3603.77 7.90 5000 Convex MSK opt. persp. 4.61867 106 10 1492.40 1.19 5000 ours 4.61867 106 10 3600.30 0.89 6000 Subset Select CIO 5.31865 106 10 3718.93 22.96 6000 Subset Select CIO + warm start 5.42880 106 10 3662.21 20.47 6000 L0Bn B 4.97660 106 1 7210.21 12.38 6000 MSK persp. 5.43816 106 10 3606.21 7.09 6000 MSK persp. + warm start 5.43816 106 10 3602.03 7.09 6000 Convex MSK opt. persp. 
5.43816 106 10 1527.55 0.64 6000 ours 5.43816 106 10 3600.22 0.32 7000 Subset Select CIO 6.30345 106 10 3670.08 20.83 7000 Subset Select CIO + warm start 6.34815 106 10 3714.08 19.98 7000 L0Bn B 5.82238 106 1 7236.55 11.40 7000 MSK persp. - - 3604.74 - 7000 MSK persp. + warm start 6.36757 106 10 3607.94 6.15 7000 Convex MSK opt. persp. 6.36757 106 10 1747.23 0.45 7000 ours 6.36757 106 10 3600.06 0.10 Table 150: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.9, λ2 = 10.0). G.5 Experiments on the Case p >> n # of samples method upper bound support size time(s) gap(%) 100 SOS1 + warm start 2.76902 103 10 3600.94 13.02 100 big-M50 + warm start 2.76902 103 10 3600.58 13.02 100 big-M20 + warm start 2.76902 103 10 3600.47 13.02 100 big-M5 + warm start 2.76902 103 10 3600.17 13.02 100 persp. + warm start 2.76902 103 10 3602.22 13.02 100 eig. persp. + warm start 2.76902 103 10 3600.65 13.02 100 Convex opt. persp. + warm start 2.51480 103 10 1706.56 24.44 100 MSK persp. + warm start 2.71927 103 10 3599.92 15.08 100 Subset Select CIO + warm start 1.68157 103 10 4350.20 86.11 100 L0Bn B 7.78758 102 1 7217.66 255.64 100 ours 2.76902 103 10 3600.22 13.02 200 SOS1 + warm start 4.15945 103 10 3600.08 17.20 200 big-M50 + warm start 4.15945 103 10 3600.05 17.20 200 big-M20 + warm start 4.15945 103 10 3600.11 17.20 200 big-M5 + warm start 4.15945 103 10 3600.22 17.20 200 persp. + warm start 4.15945 103 10 3600.20 17.20 200 eig. persp. + warm start 4.15945 103 10 3600.24 17.20 200 Convex opt. persp. + warm start 4.11140 103 10 1800.00 18.57 200 MSK persp. 
+ warm start 4.15945 103 10 3599.97 17.20 200 Subset Select CIO + warm start 2.74363 103 10 3819.10 77.69 200 L0Bn B 1.20299 103 1 7208.65 240.10 200 ours 4.15945 103 10 3601.81 17.20 500 SOS1 + warm start 1.00615 104 10 3600.51 20.46 500 big-M50 + warm start 1.00615 104 10 3600.28 20.46 500 big-M20 + warm start 1.00615 104 10 3600.07 20.46 500 big-M5 + warm start 1.00615 104 10 3600.11 20.46 500 persp. + warm start 1.00615 104 10 3600.37 20.46 500 eig. persp. + warm start 1.00615 104 10 3600.10 20.46 500 Convex opt. persp. + warm start 1.00615 104 10 1744.05 20.46 500 MSK persp. + warm start 1.00615 104 10 3600.21 20.46 500 Subset Select CIO + warm start - - 2.29 - 500 L0Bn B 1.00615 104 10 1834.75 0.00 500 ours 1.00615 104 10 3600.23 20.46 Table 151: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.001). All baselines use our beam search solution as a warm start whenever possible. The results in the case of p >> n show that OKRidge is competitive with other MIP methods. l0bnb can solve the problem when n = 500 but has difficulty in finding the desired sparsity level in the case of n = 100 and n = 200. MOSEK sometimes gives the best optimality gap. However, as we have shown in the previous subsections, MOSEK is not scalable to high dimensions. Additionally, the results in differential equations show that MOSEK and l0bnb are not competitive in discovering differential equations (datasets have n p), which is the main focus of this paper. In the future, we plan to extend OKRidge to develop a more scalable algorithm for the case p >> n. # of samples method upper bound support size time(s) gap(%) 100 SOS1 + warm start 2.76721 103 10 3601.59 13.09 100 big-M50 + warm start 2.76721 103 10 3600.66 13.09 100 big-M20 + warm start 2.76721 103 10 3600.32 13.09 100 big-M5 + warm start 2.76721 103 10 3602.22 13.09 100 persp. + warm start 2.76721 103 10 3600.09 13.09 100 eig. persp. 
+ warm start 2.76721 103 10 3600.03 13.09 100 Convex opt. persp. + warm start 2.51378 103 10 2128.32 24.44 100 MSK persp. + warm start 2.71018 103 10 3599.92 15.36 100 Subset Select CIO + warm start 7.59815 102 10 3651.10 311.88 100 L0Bn B 7.78035 102 1 7218.43 255.36 100 ours 2.76721 103 10 3600.35 13.08 200 SOS1 + warm start 4.15848 103 10 3600.46 17.23 200 big-M50 + warm start 4.15848 103 10 3600.42 17.23 200 big-M20 + warm start 4.15848 103 10 3600.30 17.23 200 big-M5 + warm start 4.15848 103 10 3600.10 17.23 200 persp. + warm start 4.15848 103 10 3600.12 17.23 200 eig. persp. + warm start 4.15848 103 10 3600.47 17.23 200 Convex opt. persp. + warm start 4.11040 103 10 2159.71 18.55 200 MSK persp. + warm start 4.15848 103 10 3599.93 17.11 200 Subset Select CIO + warm start 3.14066 103 10 3632.89 55.23 200 L0Bn B 1.20245 103 1 7208.71 237.38 200 ours 4.15848 103 10 3600.13 17.22 500 SOS1 + warm start 1.00605 104 10 3600.08 20.47 500 big-M50 + warm start 1.00605 104 10 3600.08 20.47 500 big-M20 + warm start 1.00605 104 10 3600.80 20.47 500 big-M5 + warm start 1.00605 104 10 3600.52 20.47 500 persp. + warm start 1.00605 104 10 3600.04 20.47 500 eig. persp. + warm start 1.00605 104 10 3600.06 20.47 500 Convex opt. persp. + warm start 1.00605 104 10 2059.28 20.41 500 MSK persp. + warm start 1.00605 104 10 3600.08 20.33 500 Subset Select CIO + warm start 7.47048 103 10 3748.65 62.24 500 L0Bn B 1.00605 104 10 1835.06 0.00 500 ours 1.00605 104 10 3601.58 20.46 Table 152: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 0.1). All baselines use our beam search solution as a warm start whenever possible. The results in the case of p >> n show that OKRidge is competitive with other MIP methods. l0bnb can solve the problem when n = 500 but has difficulty in finding the desired sparsity level in the case of n = 100 and n = 200. MOSEK sometimes gives the best optimality gap. 
However, as we have shown in the previous subsections, MOSEK is not scalable to high dimensions. Additionally, the results in differential equations show that MOSEK and l0bnb are not competitive in discovering differential equations (datasets have n p), which is the main focus of this paper. In the future, we plan to extend OKRidge to develop a more scalable algorithm for the case p >> n. # of samples method upper bound support size time(s) gap(%) 100 SOS1 + warm start 2.61800 103 10 3600.36 19.31 100 big-M50 + warm start 2.61800 103 10 3600.18 19.31 100 big-M20 + warm start 2.61800 103 10 3600.40 19.31 100 big-M5 + warm start 2.61800 103 10 3602.50 19.31 100 persp. + warm start 2.61800 103 10 3600.47 19.54 100 eig. persp. + warm start 2.61800 103 10 3600.19 19.54 100 Convex opt. persp. + warm start 2.40846 103 10 1968.34 25.24 100 MSK persp. + warm start 2.61800 103 10 3599.93 10.09 100 Subset Select CIO + warm start 1.79729 103 10 3646.28 74.12 100 L0Bn B 7.11936 102 1 7212.43 236.77 100 ours 2.61806 103 10 3600.78 19.09 200 SOS1 + warm start 4.06441 103 10 3600.07 19.69 200 big-M50 + warm start 4.06441 103 10 3600.92 19.69 200 big-M20 + warm start 4.06441 103 10 3600.61 19.69 200 big-M5 + warm start 4.06441 103 10 3600.07 19.69 200 persp. + warm start 4.06441 103 10 3600.03 11.80 200 eig. persp. + warm start 4.06441 103 10 3600.79 11.80 200 Convex opt. persp. + warm start 4.01212 103 10 2175.61 16.91 200 MSK persp. + warm start 4.06441 103 10 3599.92 9.05 200 Subset Select CIO + warm start 2.68289 103 10 3704.01 81.71 200 L0Bn B 1.15079 103 1 7207.25 219.55 200 ours 4.06441 103 10 3600.92 19.45 500 SOS1 + warm start 9.96316 103 10 3601.58 21.39 500 big-M50 + warm start 9.96316 103 10 3602.06 21.39 500 big-M20 + warm start 9.96316 103 10 3602.58 21.39 500 big-M5 + warm start 9.96316 103 10 3600.48 21.28 500 persp. + warm start 9.96316 103 10 3600.14 21.65 500 eig. persp. + warm start 9.96316 103 10 3600.12 13.16 500 Convex opt. persp. 
+ warm start 9.96316 103 10 2145.98 16.65 500 MSK persp. + warm start 9.96316 103 10 3600.03 11.45 500 Subset Select CIO + warm start 7.48575 103 10 3793.20 61.91 500 L0Bn B 9.96316 103 10 1483.07 0.00 500 ours 9.96316 103 10 3600.96 21.14 Table 153: Comparison of time and optimality gap on the synthetic datasets (p = 3000, k = 10, ρ = 0.1, λ2 = 10.0). All baselines use our beam search solution as a warm start whenever possible. The results in the case of p >> n show that OKRidge is competitive with other MIP methods. When the number of samples is 100, OKRidge is able to find a slightly better solution through the branch-and-bound process, but other solvers are stuck with the warm start solution. l0bnb can solve the problem when n = 500 but has difficulty in finding the desired sparsity level in the case of n = 100 and n = 200. MOSEK sometimes gives the best optimality gap. However, as we have shown in the previous subsections, MOSEK is not scalable to high dimensions. Additionally, the results in differential equations show that MOSEK and l0bnb are not competitive in discovering differential equations (datasets have n p), which is the main focus of this paper. In the future, we plan to extend OKRidge to develop a more scalable algorithm for the case p >> n. G.6 Comparison between Fast Solve and ADMM In this subsection, we provide the empirical evidence that the ADMM method is more effective than the fast solve method in terms of the computing a tighter lower bound, which helps speed up the certification process. Figure 7: Comparison of running time (top row) and optimality gap (bottom row) between using fast solve (Equation (16)) alone and using fast solve and ADMM (Equations (19)- (22)) to calculate the lower bounds in the Bn B tree with three correlation levels ρ = 0.1, 0.5, 0.9 (n = 10000, k = 10). 
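As context for the ADMM lower bound (Equations (19)–(22) are not reproduced in this appendix): the paper notes that one of the ADMM proximal operators reduces to an isotonic regression problem, which the classic pool-adjacent-violators (PAV) algorithm solves in linear time after sorting. The following is a minimal, generic PAV sketch, not the solver's actual code; the function name `pav_isotonic` is ours.

```python
import numpy as np

def pav_isotonic(y):
    """Project y onto the set of non-decreasing vectors, minimizing
    squared error, via pool-adjacent-violators (PAV)."""
    blocks = []  # list of (block mean, block weight)
    for yi in y:
        mean, weight = float(yi), 1.0
        # Merge backwards while the monotonicity constraint is violated.
        while blocks and blocks[-1][0] > mean:
            m_prev, w_prev = blocks.pop()
            mean = (m_prev * w_prev + mean * weight) / (w_prev + weight)
            weight += w_prev
        blocks.append((mean, weight))
    out = []
    for m, w in blocks:
        out.extend([m] * int(round(w)))  # expand each block back out
    return np.array(out)

# Example: [1, 3, 2] violates monotonicity at the end;
# PAV pools the last two entries into their mean 2.5.
print(pav_isotonic(np.array([1.0, 3.0, 2.0])))
```

PAV also preserves the total sum of the input, which is a useful sanity check when embedding it inside a proximal step.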
Figure 8: Comparison of running time (top row) and optimality gap (bottom row) between using fast solve (Equation (16)) alone and using fast solve and ADMM (Equations (19)–(22)) to calculate the lower bounds in the BnB tree with three correlation levels ρ = 0.1, 0.5, 0.9 (p = 3000, k = 10).

H More Experimental Results on Dynamical Systems

H.1 Results Summary

Comparison with MIOSR. We first show a direct comparison between OKRidge and MIOSR (based on the SOS1 formulation). The results are in Appendix H.2, where we show that OKRidge outperforms MIOSR in terms of both solution quality and running time.

Comparison with Different MIP Formulations. Next, we compare OKRidge with other MIP formulations (SOS1, big-M, perspective, and eigen-perspective), as done on the synthetic benchmarks. These formulations have never been applied to the differential equation experiments. As on the synthetic benchmarks, OKRidge obtains better solution quality with shorter running times. The results are shown in Appendix H.3.

Comparison between Beam Search and More Heuristics. In addition, we compare our beam search method with many more heuristic methods from the sparse regression literature. Beam search outperforms all of them in solution quality in the presence of highly correlated features. The Block Decomposition algorithm also achieves high-quality solutions, but is much slower than beam search. The results are shown in Appendix H.4.

Comparison with Different MIP Formulations Warm Started by Beam Search. Due to the superiority of beam search, we give beam search solutions as warm starts to Gurobi with the different MIP formulations and repeat the experiments. Our warm starts improve solution quality and running time, but OKRidge still outperforms the other MIP formulations in running time. The results are shown in Appendix H.5.
Comparison with Other MIP Solvers and Formulations. Lastly, we solve the same problem using the MOSEK solver, l0bnb, and Subset Selection CIO. For this application, l0bnb has a time limit of 10 seconds for each ℓ0 regularization and a 60-second total time limit. All other algorithms have a total time limit of 30 seconds, as mentioned in the experimental setup subsection. Although these methods are not as competitive as Gurobi, as we have shown on the synthetic benchmarks, we still include them for the purpose of a thorough comparison. Results in Appendix H.6 show that OKRidge outperforms these MIP solvers/formulations.

H.2 Direct Comparison of OKRidge with MIOSR

In this appendix, we directly compare MIOSR (based on the SOS1 formulation) and our method on the differential equation experiments. The results are shown in Figure 9. Although MIOSR is faster than OKRidge on the Lorenz system, OKRidge is able to find better solutions. See Appendix H.3 and Figure 10, where the eigen-perspective formulation doesn't have this problem and can produce high-quality solutions. On the Hopf system, MIOSR is faster than OKRidge; however, the number of features is small (p = 21), and the running time is negligible compared to the Lorenz and MHD systems. On the MHD system (p = 462), OKRidge outperforms MIOSR in terms of both solution quality and running time. This shows that OKRidge is much more scalable to high dimensions than MIOSR.

Figure 9: Comparison between our method and MIOSR on the experiments of discovering sparse differential equations. We obtain better performance on the true positivity rate, L2 coefficient error, and test RMSE.

H.3 Comparison of OKRidge with SOS1, Big-M, Perspective, and Eigen-Perspective Formulations

Besides the MIOSR baseline (which is based on the SOS1 formulation), we also compare with other MIP formulations, including big-M (M = 50), perspective, and eigen-perspective formulations. The results are shown in Figure 10.
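For reference, the big-M baseline can be written as the following mixed-integer program (a standard textbook form, with big-M constant M, e.g. M = 50; the exact constraint layout used by the baselines may differ):

```latex
\begin{aligned}
\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \quad
  & \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 \\
\text{s.t.} \quad
  & -M z_j \le \beta_j \le M z_j, \qquad j = 1, \dots, p, \\
  & \sum_{j=1}^{p} z_j \le k.
\end{aligned}
```

Setting $z_j = 0$ forces $\beta_j = 0$, so the cardinality constraint $\|\beta\|_0 \le k$ is enforced through the binary variables; the quality of the continuous relaxation, and hence the lower bounds, depends heavily on the choice of M.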
The big-M method produces errors on the Lorenz system, so its result is not shown in the left column. We discussed the SOS1 formulation in the last subsection. On the Lorenz system, OKRidge is faster than the perspective and eigen-perspective formulations. On the Hopf system (p = 21), OKRidge is slower than Gurobi, but OKRidge takes only 20 seconds. On the high-dimensional MHD system, OKRidge is significantly faster than all MIP formulations.

Figure 10: Comparison between our method and other MIP formulations solved by Gurobi on the experiments of discovering sparse differential equations. We obtain better performance on the true positivity rate, L2 coefficient error, and test RMSE. SOS1 is faster than ours on the Lorenz system, but its solution quality is worse. OKRidge is slower than Gurobi on the Hopf system (p = 21), but the time difference is not significant. On the high-dimensional MHD system, OKRidge is significantly faster than Gurobi with all existing MIP formulations.

H.4 Comparison of Our Beam Search with More Heuristic Baselines

Next, we compare our proposed beam search method with more heuristic methods from the sparse regression literature. The results are shown in Figure 11. Since many heuristic method implementations do not support ridge regression, we set the ℓ2 regularization for all algorithms to 0 for a fair comparison. Since the features are highly correlated, most of the existing heuristic methods do not produce high-quality solutions, with the exception of the Block Decomposition algorithm. However, our beam search method is significantly faster than the Block Decomposition algorithm. Moreover, for completeness, we also compare with LASSO, the classical sparse learning method. Other researchers have found that LASSO performs poorly in this context; we confirm this claim experimentally. The result is shown in Figure 12.
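To make the comparison concrete, the idea behind beam search for support selection can be illustrated as follows. This is a rough, generic sketch under our own assumptions, not the authors' implementation (which may differ in scoring, pruning, and efficiency); the names `beam_search_support` and `ridge_loss` are hypothetical:

```python
import numpy as np

def ridge_loss(X, y, support, lam=0.0):
    """Ridge objective restricted to a given support set."""
    cols = sorted(support)
    XS = X[:, cols]
    # Solve (XS^T XS + lam I) b = XS^T y for the restricted coefficients.
    A = XS.T @ XS + lam * np.eye(len(cols))
    b = np.linalg.solve(A, XS.T @ y)
    r = y - XS @ b
    return float(r @ r + lam * b @ b)

def beam_search_support(X, y, k, beam_width=5, lam=0.0):
    """Grow supports one feature at a time, keeping the best
    `beam_width` candidates at each sparsity level."""
    p = X.shape[1]
    beam = [frozenset()]
    for _ in range(k):
        scores = {}
        for S in beam:
            for j in range(p):
                if j not in S:
                    T = S | frozenset([j])
                    if T not in scores:
                        scores[T] = ridge_loss(X, y, T, lam)
        beam = sorted(scores, key=scores.get)[:beam_width]
    return sorted(beam[0])
```

Unlike a purely greedy forward selection (beam width 1), keeping several candidate supports alive at each level lets the search recover from locally suboptimal choices among correlated features, which matches the behavior observed in Figure 11.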
Figure 11: Comparison between our method and other heuristic methods on the experiments of discovering sparse differential equations. We obtain better performance on the true positivity rate, L2 coefficient error, and test RMSE. The Block Decomposition algorithm also produces high-quality solutions but is significantly slower than our beam search.

Figure 12: Results on discovering sparse differential equations. LASSO does not do well in this application. Beam search lags behind OKRidge on the MHD system. This shows that beam search alone is not sufficient and that it is necessary to use BnB to obtain the optimal solution.

H.5 Comparison of OKRidge with MIPs Warm-Started by Our Beam Search

A natural question is whether Gurobi would be faster than OKRidge if we gave it the high-quality solutions from the last subsection as warm starts. Here, we compare with the other formulations warm-started by our beam search solutions. The results are shown in Figure 13. Comparing Figure 10 with Figure 13, we see that our beam search solutions, when used as warm starts, improve Gurobi's speed. However, OKRidge is still significantly faster than all MIP formulations on the high-dimensional MHD system.

Figure 13: Comparison between our method and other MIP formulations solved by Gurobi, warm-started by our beam search solutions, on the experiments of discovering sparse differential equations. OKRidge is slower than Gurobi on the Hopf system (p = 21), but the running time is only 20 seconds. On the Lorenz and MHD systems, OKRidge is faster than the other MIP formulations, especially on the high-dimensional MHD system (p = 462).

H.6 Comparison of OKRidge with the MOSEK Solver, Subset Selection CIO, and L0BNB

Lastly, we compare with other MIP solvers/formulations, including the MOSEK solver, Subset Selection CIO, and l0bnb. The results are shown in Figure 14.
MOSEK (perspective formulation) produces good solutions on the Hopf dataset but could not produce optimal solutions on the Lorenz system. MOSEK (optimal formulation) produces suboptimal solutions on the Lorenz and Hopf systems. This is because we are using MOSEK to solve the relaxed convex SDP problem, which demonstrates why we need to tackle the NP-hard problem rather than the relaxed convex problem. MOSEK runs out of memory on the MHD dataset, so we do not show results there. Subset Selection CIO does not do well on the Lorenz and MHD systems; on many instances, it errors out. l0bnb performs poorly on these datasets. One major reason is that l0bnb cannot target an exact support size (k = 1, 2, 3, 4, 5), because sparsity is controlled only implicitly through the ℓ0 regularization: within the time limit, l0bnb could not find an ℓ0 regularization that produces the right sparsity level, and running these differential equation experiments takes a long time. Please see Appendix G.1 for a detailed discussion of the drawbacks of using MOSEK, Subset Selection CIO, and l0bnb. Figure 14: Comparison between our method and other MIP solvers or formulations on the experiments of discovering sparse differential equations. We obtain better performance on the true positivity rate, L2 coefficient error, and test RMSE.