Journal of Machine Learning Research 26 (2025) 1-38. Submitted 1/24; Revised 2/25; Published 6/25.

Simplex Constrained Sparse Optimization via Tail Screening

Peng Chen (chenpeng1@mail.ustc.edu.cn), Department of Statistics and Finance, School of Management, University of Science and Technology of China
Jin Zhu (j.zhu69@lse.ac.uk), Department of Statistics, London School of Economics and Political Science
Junxian Zhu (junxian@nus.edu.sg), Saw Swee Hock School of Public Health, National University of Singapore
Xueqin Wang (wangxq20@ustc.edu.cn), Department of Statistics and Finance/International Institute of Finance, School of Management, University of Science and Technology of China

Editor: Ali Shojaie

Abstract

We consider the probabilistic simplex-constrained sparse recovery problem. The commonly used Lasso-type penalty for promoting sparsity is ineffective in this context, since it is constant on the simplex. Fortunately, despite this challenge, the simplex constraint itself brings a self-regularization property: the empirical risk minimizer, without any sparsity-promoting procedure, attains the usual Lasso-type estimation error. Moreover, we analyze the iterates of a projected gradient descent method and show that they converge to the sparse ground truth at a geometric rate until a satisfactory statistical precision is attained. Although the estimation error is statistically optimal, the resulting solution is usually denser than the sparse ground truth. To further sparsify the iterates, we propose a method called PERMITS, which embeds a tail screening procedure (identifying negligible components and discarding them during the iterations) into the projected gradient descent method. Furthermore, we combine tail screening with the special information criterion to balance the trade-off between fitness and complexity.
Theoretically, the proposed PERMITS method can exactly recover the ground truth support set under mild conditions and thus obtains the oracle property. We demonstrate the statistical and computational efficiency of PERMITS on both synthetic and real data. The implementation of the proposed method can be found at https://github.com/abess-team/PERMITS.

Keywords: simplex constrained sparse recovery, projected gradient descent, self-regularization, tail screening, special information criterion

1. Introduction

In widespread applications, we observe $n$ samples $(x_i, y_i)$ subject to the following generation process
$$y_i = x_i^\top w^* + \xi_i, \quad i = 1, \dots, n,$$
where $x_i \in \mathbb{R}^p$ is the feature vector, $y_i \in \mathbb{R}$ is the response value, $w^* \in \Delta := \{w \in \mathbb{R}^p : 1^\top w = 1,\ w \ge 0\}$ is the unknown coefficient vector lying in the probabilistic simplex, and $\xi_i$ is random noise. The objective is to recover the unknown coefficient $w^*$. Furthermore, this model can be succinctly represented as
$$y = X w^* + \xi,$$
where $(X, y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$ are the observed feature matrix and response vector and $\xi \in \mathbb{R}^n$ is the noise vector. This modeling framework finds extensive applications in various domains such as economics, finance (Benidis et al., 2017; Zheng et al., 2020), and machine learning (Keshava, 2003; Limmer and Stańczak, 2018). Typically, the coefficient vector $w^*$ embodies weights. For example, in finance, $w^*$ corresponds to a market index or portfolio weights assigned to $p$ assets (Du et al., 2022). Nowadays, researchers often collect high-dimensional data, in which $p$ is of the same or even larger order than $n$.

* Peng Chen, Jin Zhu, and Junxian Zhu contributed equally. Xueqin Wang is the corresponding author.
©2025 Peng Chen, Jin Zhu, Junxian Zhu, Xueqin Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0010.html.
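The generation process above can be simulated in a few lines. The following sketch is our own illustration (not part of the paper's code); the variable names and the choice of a Dirichlet draw for the nonzero weights are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 300, 5  # sample size, dimension, true sparsity level

# Sparse coefficient vector on the probabilistic simplex:
# s nonzero entries, each positive, summing to one.
w_star = np.zeros(p)
w_star[:s] = rng.dirichlet(np.ones(s))

X = rng.standard_normal((n, p))    # feature matrix
xi = 0.1 * rng.standard_normal(n)  # light-tailed (sub-Gaussian) noise
y = X @ w_star + xi                # y = X w* + xi
```

Note that `w_star` is feasible by construction: it is non-negative and sums to one, so it lies in the simplex $\Delta$.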
In high-dimensional statistics, a common assumption is that the unknown coefficient vector $w^*$ is sparse. In other words, although the dimension $p$ of $w^*$ greatly surpasses the sample size $n$, the majority of its components $w^*_i$ are exactly zero. Therefore, the true sparsity level, defined as $s^* := \|w^*\|_0$, is relatively small compared to $n$. This assumption is rational in the contemporary era of big data, where an extensive array of features $X_j \in \mathbb{R}^n$ can be collected, yet only a fraction of them significantly influences $y$. Such a prevalent scenario enables the derivation of effective estimators for $w^*$ and potentially allows for the recovery of the corresponding support set $S^* := \mathrm{supp}(w^*)$. In this paper, our primary interest is to recover $S^*$ and estimate $w^*$. To estimate the unknown $w^*$, a common approach is to solve the following constrained least squares problem
$$\min_{w \in \mathbb{R}^p} f(w) = \frac{1}{2n}\|Xw - y\|_2^2 \quad \text{s.t. } w \in \Delta. \qquad (1)$$
Let $\hat w_n$ be any solution of (1); it may not be unique since $f$ is not necessarily strictly convex over $\Delta$. The simplex constraint prevents us from obtaining an explicit formula for $\hat w_n$, and it can only be approximated through iterative optimization algorithms (Jaggi, 2013; Xiao and Bai, 2022; Li et al., 2023) producing iterates $w^t$. In real-world scenarios, especially those involving high dimensions ($p \gg n$), two significant challenges emerge. First, the estimation error $\|\hat w_n - w^*\|_2$ can be substantial due to the presence of the noise $\xi$. Second, optimization becomes challenging because $f(w)$ loses its strong convexity when $p \ge n$. Consequently, achieving a desirable linear convergence rate of $\|w^t - \hat w_n\|_2$ is difficult for most algorithms under these circumstances.

1.1 Regularized methods

In the high-dimensional literature, regularization (Tibshirani, 1996; Meier et al., 2008; Tibshirani et al., 2005) is commonly used to promote sparsity and alleviate noise accumulation.
However, in the presence of a simplex constraint, regularization is not as simple as in the unconstrained setting. The most explicit way to promote sparsity is to directly penalize the cardinality of $w$ and solve
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \|w\|_0 \quad \text{s.t. } w \in \Delta,$$
where $\|w\|_0$ counts the number of nonzero components and $\lambda$ is a penalty coefficient specified by the user or determined by some data-driven method. However, this problem is computationally intractable since the $\ell_0$ penalty is non-convex, especially when both $n$ and $p$ are large. Iterative hard thresholding (IHT) is a famous greedy method for such sparse recovery problems (Blumensath and Davies, 2009). It performs a hard thresholding step after each gradient descent step to enforce sparsity of the iterates. Kyrillidis et al. (2013) extended this method to solve (1) by replacing the original hard thresholding step with a sparse projection onto the simplex, which enforces sparsity and feasibility simultaneously. Motivated by Lasso-type methods, one may consider approximating the non-convex $\ell_0$ penalty with a convex $\ell_1$ penalty:
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \|w\|_1 \quad \text{s.t. } w \in \Delta,$$
where $\lambda > 0$ is again a hyper-parameter controlling the degree of penalization (possibly slightly different from the one above). Unfortunately, this method fails in the presence of a simplex constraint, since the added penalty term $\lambda \|w\|_1 = \lambda$ is constant (independent of $w$) for any $w \in \Delta$. That is, we cannot promote sparsity of the solution by tuning $\lambda$ as Lasso-type methods do. To deal with the undesirable property that $\|w\|_1 = 1$ for any $w \in \Delta$, some heuristic methods first ignore the equality constraint $1^\top w = 1$ in the simplex and solve the non-negative Lasso
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \|w\|_1 \quad \text{s.t. } w \ge 0,$$
or a modified version (Du et al., 2022)
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \|w\|_1 \quad \text{s.t. } w \ge 0,\ \|w\|_1 \le R,$$
and then rescale the solution so that $1^\top w = 1$ holds (dividing the solution by its $\ell_1$-norm).
In the context of portfolio selection, this method was shown to behave well from both theoretical and practical perspectives. However, the theoretical guarantee relied on additional conditions, such as the single crossing assumption, which is hard to check in practice; see Du et al. (2022) for details. Besides these heuristic methods, Pilanci et al. (2012) proposed replacing the $\ell_0$ norm by the inverse $\ell_\infty$ norm:
$$\min_{w \in \mathbb{R}^p} f(w) + \frac{\lambda}{\|w\|_\infty} \quad \text{s.t. } w \in \Delta,$$
which is motivated by the fact that $1/\|w\|_\infty$ is a lower bound of $\|w\|_0$ for any $w \in \Delta$. They also showed that, although non-convex, this problem can be exactly solved via convex programming. Xiao and Bai (2022) considered the Hadamard reparameterization $w = u \circ u$, under which the probabilistic simplex constraint $w \in \Delta$ reduces to the unit sphere constraint $u \in \mathbb{S}^{p-1} := \{u \in \mathbb{R}^p : \|u\|_2 = 1\}$. This reparameterization technique enables them to add the $\ell_1$ penalty $\lambda \|u\|_1$ to promote sparsity of $u$ and thus of $w$. Specifically, they considered the following $\ell_1$-regularized non-convex problem:
$$\min_{u \in \mathbb{R}^p} \frac{1}{2n}\|X(u \circ u) - y\|_2^2 + \lambda \|u\|_1 \quad \text{s.t. } u \in \mathbb{S}^{p-1}.$$
To solve this non-convex problem, they proposed a geometric projected gradient method with global convergence to a critical point. In fact, this reparameterization technique can also be applied directly to $\ell_0$-type sparse optimization. Combined with the reparameterization $w = u \circ u$, an existing fast sparse solver (see, e.g., Wang et al., 2024) can solve the following sparsity-constrained problem:
$$\min_{u \in \mathbb{R}^p} \frac{1}{2n}\Big\|\frac{1}{\|u\|_2^2} X(u \circ u) - y\Big\|_2^2 \quad \text{s.t. } \|u\|_0 \le s,$$
where $s$ is a pre-specified sparsity level. That is, it directly optimizes over the normalized vector $\bar u := (u \circ u)/\|u\|_2^2$, which is guaranteed to be feasible, i.e., $\bar u \in \Delta$. All the above methods are motivated by different devices, such as heuristics, norm approximation, reparameterization, and sparse projection. Most of these procedures either lack statistical theory or pose an additional computational burden.
Although the simplex constraint leads to the flaw described above, it also brings some interesting blessings, as described in the next subsection.

1.2 Regularization-free methods

In this section, we introduce the notion of self-regularization and some interesting properties specific to problem (1). By self-regularization, we mean that an estimator, or the output of an optimization algorithm, can obtain performance comparable to the Lasso (with respect to the $\ell_2$ error) without an additional regularization term like $\lambda \|w\|_1$. Specifically, under commonly used conditions, the minimizer of the regularization-free problem (1) actually attains the Lasso-type $\ell_2$ error $O(\sqrt{\log(p)/n})$ with high probability (see, e.g., Meinshausen, 2013; Slawski and Hein, 2013; Li et al., 2020). Self-regularization was first studied in Meinshausen (2013) and Slawski and Hein (2013). They considered non-negativity constrained least squares (NNLS) and the corresponding minimizer $\hat w_n^{\mathrm{NNLS}}$; that is, the constraint set was the non-negative orthant $\{w \in \mathbb{R}^p : w \ge 0\}$ rather than the simplex in (1). Under certain conditions, they proved that $\hat w_n^{\mathrm{NNLS}}$ is consistent in the high-dimensional case $p \gg n$. Based on this self-regularization property, Slawski and Hein (2013) further considered recovering the support set by thresholding $\hat w_n^{\mathrm{NNLS}}$, taking the positions of its $\hat s$ largest components $[\hat w_n^{\mathrm{NNLS}}]_j$ as the estimated support set $\hat S$. A theoretical guarantee was established in their work, but the choice of $\hat s$ required prior knowledge of the noise level $\sigma$. These studies deviated from the paradigm in which sparsity-promoting regularization is regarded as a necessity in high-dimensional estimation, and provided a theoretical understanding of the earlier empirical success of NNLS. Li et al. (2020) recently extended the self-regularization property from the NNLS problem to the simplex constraint.
They proved that the optimal solution $\hat w_n$ of (1) has a desirable statistical property similar to the Lasso: $\hat w_n$ is consistent as long as $s^* \log(p)/n \to 0$. Beyond this property, and observing that $\hat w_n$ is usually much denser than $w^*$ (i.e., $\|\hat w_n\|_0 \gg s^*$), Li et al. (2020) proposed some modified techniques, including thresholding, weighted $\ell_1$, and negative $\ell_2$-norm penalties, to further sparsify $\hat w_n$. The corresponding theoretical guarantee for support recovery was established only for the special case where $X^\top X$ is an identity matrix, i.e., the least squares denoising problem. Although $\hat w_n$ possesses the self-regularization property, some questions remain from the practical perspective. Notably, $\hat w_n$ is the theoretically optimal solution to (1), and it lacks a closed-form expression when $X^\top X$ is not an identity matrix. As a result, an optimization algorithm must be recruited to compute an approximation $\bar w$ of $\hat w_n$. However, this introduces an optimization error $\|\bar w - \hat w_n\|_2$, whose impact remains unclear. To the best of our knowledge, the existing theoretical analysis does not address the properties of $\bar w$, the output of an optimization algorithm. For instance, it is not evident whether the optimization error is negligible, especially in high-dimensional ($p \gg n$) settings where $f(w)$ is not strongly convex. This raises several important questions: (i) Does $\bar w$, as an approximation of $\hat w_n$, retain the self-regularization property? (ii) Since optimization algorithms are typically iterative, can we quantitatively assess the quality of the $t$-th iterate $w^t$ of the algorithm, say via $\|w^t - w^*\|_2$? (iii) Can we establish a linear convergence rate if we care about $\|w^t - w^*\|_2$ rather than $\|w^t - \hat w_n\|_2$? (iv) In terms of variable selection, under what conditions does the algorithm's solution accurately recover the true support set $S^*$? This paper aims to address these questions by bridging the gap between statistical and optimization properties.
The answers to the above questions motivate us to propose a computationally feasible algorithm with rigorous statistical guarantees.

1.3 Proposal and contribution

In this paper, we mainly deal with problem (1), with the aim of estimating $w^*$ and recovering $S^*$. Specifically, we directly analyze, from both statistical and optimization perspectives, the $t$-th iterate $w^t$ of the projected gradient (PG) descent method applied to (1). The main contribution of this paper is two-fold and summarized as follows:

1. We fill the gap between statistics and optimization. In particular, we prove a linear convergence rate of $\|w^t - w^*\|_2$ down to a statistically negligible error, and thus establish for the $t$-th iterate $w^t$ of the projected gradient descent method a self-regularization property similar to that of $\hat w_n$. Note that this result is entirely different from its counterpart in Li et al. (2020), which is proved only for the theoretical minimizer. The basic idea is that reducing the optimization error below the order of the statistical error is not only sufficient for self-regularization but also easy to achieve computationally.

2. We establish variable selection consistency. Combined with an information criterion, we propose a new algorithm referred to as PERMITS (short for projected gradient method with tail screening), obtained by embedding a tail screening procedure into PG. This algorithm fixes the smallest component of $w^t$ to zero once $t$ is appropriately large and then keeps optimizing the remaining components, which helps sparsify the iterates $w^t$. Under mild conditions, we prove that PERMITS is statistically guaranteed to recover the unknown support set, i.e., $\mathrm{supp}(\bar w) = S^*$ for the output $\bar w$ of PERMITS. The pivotal factor influencing the algorithmic performance and theoretical properties is the special information criterion (SIC), which ultimately yields the variable selection consistency.
Far from being a straightforward combination of the PG algorithm and a variable selection procedure, PERMITS necessitates a comprehensive understanding and analysis of the self-regularization property inherent in the PG algorithm.

1.4 Outline of the paper

The remainder of this paper is organized as follows. In the next section, we introduce the self-regularization property of the PG algorithm and propose the PERMITS algorithm. In Section 3, we study the theoretical properties of the PERMITS algorithm from the perspectives of both computation and statistics. Extensive numerical experiments using both synthetic and real data are presented in Section 4 to validate our theoretical results. Section 5 closes this paper with concluding remarks. In the appendix, detailed proofs of all theoretical results are provided, together with additional numerical results.

1.5 Notations

Capital bold letters such as $X$ represent matrices, and lowercase bold letters such as $w$, $y$, and $\xi$ represent column vectors. The set $\{1, 2, \dots, p\}$ is denoted by $[p]$. For any matrix $X \in \mathbb{R}^{n \times p}$ and a subset $S \subseteq [p]$, $X_S \in \mathbb{R}^{n \times |S|}$ is the sub-matrix consisting of the columns indexed by $S$. In particular, $X_j$ is the $j$-th column of $X$. For any vector $w$, let $[w]_i$ or $w_i$ be its $i$-th coordinate, and let $w_S \in \mathbb{R}^{|S|}$ be the sub-vector of components indexed by $S$. We denote by $\|w\|_1$, $\|w\|_2$, $\|w\|_\infty$ the usual $\ell_1$, $\ell_2$, and $\ell_\infty$ norms of $w$. The vector $x_+$ has $i$-th coordinate $[x_+]_i = x_i \vee 0$, where $a \vee b$ denotes the larger of $a$ and $b$. Any binary operation between a vector $x$ and a scalar $a$ is applied coordinate-wise, e.g., $[x + a]_i = x_i + a$. $C$ and $c$ denote absolute constants. The symbol $\lesssim$ means an inequality holding up to some hidden absolute constant $C$.

2. Methodology

In this section, we first introduce the projected gradient descent method used to solve (1) and elucidate the self-regularization property of its iterates $w^t$ in terms of statistical and computational performance.
Based on this property, we further develop a variable selection procedure using the special information criterion.

2.1 Projected gradient method and self-regularization property

The constrained minimization (1) can be reformulated as the following unconstrained composite optimization problem
$$\min_{w \in \mathbb{R}^p} f(w) + \chi_\Delta(w) \qquad (2)$$
where $\chi_\Delta(\cdot)$ is the characteristic function of $\Delta$, defined as
$$\chi_\Delta(w) = \begin{cases} 0, & w \in \Delta, \\ \infty, & w \notin \Delta. \end{cases}$$
The projected gradient (PG) method (Beck and Teboulle, 2009; Nesterov, 2013; Beck, 2017) is an efficient tool to solve (2). The basic idea of PG is as follows. We first replace the differentiable part $f(u)$ with its quadratic approximation evaluated at the current solution $w$,
$$m_M(u) := f(w) + \langle \nabla f(w), u - w \rangle + \frac{M}{2}\|u - w\|_2^2,$$
where $f(w) = (2n)^{-1}\|Xw - y\|_2^2$, $\nabla f(w) = n^{-1} X^\top (Xw - y)$, and $M$ is a guess of the unknown Lipschitz constant $L_f$ of $\nabla f$. Suppose the current solution is $w$; we then make an update and obtain a new solution $w^+$ by solving
$$w^+ = \arg\min_{u \in \mathbb{R}^p} \{m_M(u) + \chi_\Delta(u)\} = P_\Delta\big(w - M^{-1}\nabla f(w)\big),$$
where $P_\Delta(\cdot) : \mathbb{R}^p \to \Delta$ is a projection operator that maps a vector onto the $(p-1)$-dimensional simplex. The quadratic approximation reduces the minimization problem to a projection onto the simplex, for which efficient exact algorithms exist (Duchi et al., 2008; Wang and Carreira-Perpiñán, 2013; Condat, 2016; Perez et al., 2020). We iteratively perform quadratic approximation and projection until the difference between two consecutive iterates is small. This iterative procedure is summarized in Algorithm 1, which provides a meta PG algorithm.

Algorithm 1 Projected Gradient (PG) Method
Input: $w^0 \in \Delta$, tolerance $\epsilon > 0$.
1: Initialize $t \leftarrow 0$.
2: while $\|w^{t+1} - w^t\|_2 \ge \epsilon$ do
3:   Select step size $M_t > 0$ (see Algorithm 3 for details)
4:   Set $w^{t+1} \leftarrow P_\Delta\big(w^t - \frac{1}{M_t}\nabla f(w^t)\big)$
5:   $t \leftarrow t + 1$
Output: $w^t$.

Remark 1 The convergence of Algorithm 1 relies on the choice of the step size parameter $M_t$ in Step 3.
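The projection $P_\Delta$ and the PG update can be sketched in a few lines of Python. The sketch below is our own illustration rather than the paper's implementation: `project_simplex` follows the sort-based exact algorithm of Duchi et al. (2008) cited above, and `pg_step` performs one update with a fixed step-size guess $M$.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (sort-based exact algorithm in the spirit of Duchi et al., 2008)."""
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    js = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / js > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)       # shift so result sums to 1
    return np.maximum(v + theta, 0.0)

def pg_step(X, y, w, M):
    """One projected-gradient update for f(w) = ||Xw - y||_2^2 / (2n)."""
    n = X.shape[0]
    grad = X.T @ (X @ w - y) / n               # gradient of f at w
    return project_simplex(w - grad / M)       # w+ = P_Delta(w - grad/M)
```

For example, projecting $v = (0.3, 1.2, -0.5)$ yields $(0.05, 0.95, 0)$: the shift $\theta$ is chosen so the positive part sums to one, and the negative coordinate is clipped to zero.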
It is usually required that the selected $M_t$ satisfy the following sufficient descent property:
$$f(w^{t+1}) \le f(w^t) + \langle \nabla f(w^t), w^{t+1} - w^t \rangle + \frac{M_t}{2}\|w^{t+1} - w^t\|_2^2, \qquad (3)$$
which can be fulfilled via a backtracking procedure (Nesterov, 2013). The details of this procedure are provided in Algorithm 3 in the Appendix. The backtracking method introduces two hyperparameters: (i) $L_0$, the initial step size, and (ii) $\gamma$, the decay rate of the step size. Given an input pair $(L_0, \gamma)$, the backtracking strategy iteratively adjusts the step size until condition (3) holds. The values of $L_0 > 0$ and $\gamma > 1$ are arbitrary and do not affect the theoretical guarantees. In our implementation, we set $L_0 = 1$, $\gamma = 2$, which performed well across all numerical experiments. Furthermore, for Algorithm 1, the choice of the initial point $w^0$ does not influence the theoretical results, provided that the tolerance parameter $\epsilon$ is selected appropriately. The criteria for setting $\epsilon$ will be discussed in Sections 2.3 and 3. Henceforth, for simplicity, we also write PG($w^0$, $\epsilon$) for using Algorithm 1 to solve problem (2). Under general convexity assumptions, standard analyses of the projected gradient (PG) method (Beck, 2017) guarantee convergence of the iterates $w^t$ to the solution $\hat w_n$ of problem (1). Specifically, the geometric convergence rate $\|w^t - \hat w_n\|_2 \le O(e^{-ct})$ holds for some $c > 0$, provided that $f(w)$ is strongly convex. However, this strong convexity condition is hard to guarantee in high-dimensional cases where $p \ge n$. In fact, the strong convexity parameter of $f(w)$ equals the least eigenvalue of $n^{-1}X^\top X \in \mathbb{R}^{p \times p}$, which is exactly 0 when $p > n$. This raises an apparent gap between $w^t$ and $\hat w_n$, so $w^t$ may not inherit the self-regularization property of $\hat w_n$ in high-dimensional cases. Departing from the common analysis, we directly analyze the term $\|w^t - w^*\|_2$, which merges the statistical error $\|\hat w_n - w^*\|_2$ and the optimization error $\|w^t - \hat w_n\|_2$.
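The backtracking strategy for condition (3) can be sketched as follows. This is a minimal illustration under our own naming (the paper's Algorithm 3 may differ in detail): starting from the guess $L_0$, the candidate step size is multiplied by $\gamma$ until the sufficient descent inequality holds.

```python
import numpy as np

def backtracking_pg_step(f, grad_f, project, w, L0=1.0, gamma=2.0):
    """Select M_t by backtracking until the sufficient-descent
    condition (3) holds, then return (w_next, M_t).

    f, grad_f: loss and its gradient; project: projection onto the
    feasible set (e.g., the simplex)."""
    g = grad_f(w)
    M = L0
    while True:
        w_next = project(w - g / M)
        d = w_next - w
        # condition (3): f(w+) <= f(w) + <g, d> + (M/2)||d||^2
        if f(w_next) <= f(w) + g @ d + 0.5 * M * (d @ d) + 1e-12:
            return w_next, M
        M *= gamma  # decrease the step size 1/M and retry
```

Since condition (3) is guaranteed once $M \ge L_f$, the loop terminates after at most $O(\log_\gamma(L_f/L_0))$ trials.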
Then, an interesting self-regularization property shows that $w^t$ converges to $w^*$ up to an error of order $O(\sqrt{s^* \sigma^2 \log(p)/n})$, as illustrated in Figure 1. Specifically, the orange triangle denotes the feasible region, i.e., the unit simplex $\Delta$. The red region $B(w^*)$ is a ball centered at $w^*$ whose radius is the statistical error. The blue ellipse $G(w^*)$ denotes the sub-level set whose elements have a loss value smaller than $f(w^*)$. Definition 6 provides details of the two intersecting regions $B(w^*)$ and $G(w^*)$. In Section 3, we prove two critical results: (i) $G(w^*) \subseteq B(w^*)$, and (ii) $w^t$ enters $G(w^*)$ at a geometric rate and then stays there. In other words, $\|w^t - w^*\|_2 \le O(\sqrt{s^* \sigma^2 \log(p)/n})$ is attained at a geometric rate. This explicitly reveals the self-regularization property of $w^t$. This property is highly useful for variable selection since it suggests that $w^t$ in the PG algorithm converges to $w^*$ at a fast rate. Specifically, the convergence rate matches that of the oracle estimator, $O(\sqrt{s^*/n})$, up to logarithmic and variance terms. In other words, a large component $w^t_i$ corresponds to a large $w^*_i$, and a small $w^t_i$ corresponds to a small $w^*_i$. This follows from the bound $\|w^t - w^*\|_\infty \le \|w^t - w^*\|_2 \le O(\sqrt{s^* \sigma^2 \log(p)/n})$, which indicates that the gap between $w^t_i$ and $w^*_i$ is small when $n$ is sufficiently large. More importantly, this property helps distinguish between nonzero and zero components of $w^*$, thereby aiding the identification of the true support set $S^*$.

2.2 Projected gradient method with tail screening

To adapt the PG method to our sparse learning setting, we propose a screened PG algorithm: PERMITS, shorthand for projected gradient method with tail screening. The complete algorithmic procedure is summarized in Algorithm 2.
[Figure 1 appears here.] Figure 1: Illustration of the two-stage convergence property. (a) Illustrative trajectory: the convergence trajectory of $w^t$ to $\hat w_n$ can be divided into two stages, $[w^0, w^{T_0}]$ and $[w^{T_0}, \hat w_n]$, for some $T_0$. The convergence rate of the first stage is geometric, and every point of the second stage is at distance at most $O(\sqrt{s^* \sigma^2 \log(p)/n})$ from $w^*$, i.e., the statistical error. The whole second stage $[w^{T_0}, \hat w_n]$ is included in the statistically satisfactory region $G(w^*)$. (b) Two measurements of convergence, plotted against the iteration number: the upper panel shows the estimation error $\|w^t - w^*\|_2$; in the lower panel, the optimization error $\|w^t - \hat w_n\|_2$ decreases linearly and then exhibits an oscillating pattern, which may be caused by the lack of strong convexity.

The core idea of PERMITS is simple: we recursively remove one element from the current support estimate to get a new support estimate $S$, and fit the reduced data $(X_S, y)$ to obtain an updated coefficient vector $w$, whose quality is measured by the special information criterion
$$\mathrm{SIC}(w) = n \log f(w) + \|w\|_0 \log(p) \log\log n.$$
The output of PERMITS is the candidate that minimizes the SIC. Removing one element of the estimated support set leverages the self-regularization property of $w^t$. Specifically, we use $\|w^t - w^{t-1}\|_2$ as a surrogate for $\|w^t - w^*\|_2$. When $\|w^t - w^{t-1}\|_2$ is tiny, we expect the self-regularization property $\|w^t - w^*\|_\infty \lesssim \sqrt{s^* \sigma^2 \log(p)/n}$ to hold. At this point, we discard a tail component of $w^t$ and fix its value at 0. If the minimal signal $b^* = \min_{j \in S^*} w^*_j$ is large and $\mathrm{supp}(w^t) \supseteq S^*$, discarding a tail component of $w^t$ is safe, i.e., it removes a component in $(S^*)^c$. We refer to this process as tail screening. Tail screening produces a sequence of nested support sets $S_p \supseteq S_{p-1} \supseteq \cdots \supseteq S_2 \supseteq S_1$, where $S_p = [p]$ is the full model, and $S_1 = \{j\}$ for some $j \in [p]$ is a singleton model.
Notably, one of these sets coincides exactly with the true support set $S^*$. In other words, identifying $S^*$ only requires evaluating $p$ models rather than all $2^p$ possible models. To select the correct model, we use the SIC, which adaptively determines the unknown sparsity level $s^*$. The penalty term $\|w\|_0 \log(p) \log\log(n)$ in the SIC balances under-fitting and over-fitting: increasing the model complexity (i.e., larger $\|w\|_0$) reduces the empirical loss $f(w)$ but increases the risk of fitting noise, thereby degrading generalization. By minimizing the SIC over the candidate models $\{S_p, S_{p-1}, \dots, S_1\}$, we obtain a final estimate of the true support set $S^*$.

Algorithm 2 ProjEcted gRadient Method wIth Tail Screening (PERMITS)
Input: $(X, y)$, $w^{\mathrm{init}} \in \Delta$, $T_{\min} \ge 1$, $\epsilon > 0$.
1: Initialize $S = [p]$, $\mathrm{SIC}_{\mathrm{best}} = \infty$
2: while $|S| \ge 1$ do
3:   $w^{\mathrm{init}} \leftarrow w_S / (1^\top w_S)$  (normalization for warm start)
4:   $w_S \leftarrow$ PG($w^{\mathrm{init}}$, $\epsilon$) with data $(X_S, y)$ and at least $T_{\min}$ iterations
6:   $\mathrm{SIC}(w) \leftarrow n \log f(w) + |S| \log(p) \log\log n$
7:   if $\mathrm{SIC}(w) < \mathrm{SIC}_{\mathrm{best}}$ then
8:     $\mathrm{SIC}_{\mathrm{best}} \leftarrow \mathrm{SIC}(w)$, $w^{\mathrm{best}} \leftarrow w$
9:   $j \leftarrow \arg\min_{i \in S} w_i$, $S \leftarrow S \setminus \{j\}$
Output: $w^{\mathrm{best}}$

Remark 2 In the fourth line of Algorithm 2, we apply the PG algorithm over the current support $S$. $T_{\min}$ is an additional hyper-parameter for the minimum number of iterations needed in the proof of the statistical properties, and a small value (e.g., $T_{\min} = 5$ in our numerical experiments) is usually sufficient.

Remark 3 In the ninth line of Algorithm 2, we remove from the current estimated support $S$ the component of $w_S$ with the smallest value. If multiple components share the smallest value, we can randomly select one of them and discard it. In practice, we can also discard more components at each iteration; for instance, if there are many zero components in $w_S$, we can discard all of them.

Remark 4 Intuitively speaking, PERMITS is better suited to the sparse learning problem because it integrates the standard PG algorithm with a tail screening procedure.
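The tail-screening loop of Algorithm 2 can be sketched as follows. This is a simplified illustration under our own naming, not the package's implementation: `pg_solver` stands in for the PG call of line 4 and is assumed to return a simplex-feasible coefficient vector, and `sic` assumes the residual is nonzero so that $\log f(w)$ is defined.

```python
import numpy as np

def sic(X_S, y, w_S, p):
    """Special information criterion: n*log f(w) + |S|*log(p)*loglog(n).
    Assumes f(w) > 0 (nonzero residual)."""
    n = X_S.shape[0]
    resid = X_S @ w_S - y
    f_val = resid @ resid / (2 * n)
    return n * np.log(f_val) + w_S.size * np.log(p) * np.log(np.log(n))

def permits(X, y, pg_solver):
    """Tail-screening loop of Algorithm 2 (sketch).
    pg_solver(X_S, y, w_init) runs the PG method on the reduced data."""
    n, p = X.shape
    S = list(range(p))
    w_S = np.full(p, 1.0 / p)
    best_score, best_S, best_w = np.inf, None, None
    while len(S) >= 1:
        w_init = w_S / w_S.sum()          # renormalize for warm start
        w_S = pg_solver(X[:, S], y, w_init)
        score = sic(X[:, S], y, w_S, p)
        if score < best_score:            # keep the SIC-minimizing model
            best_score, best_S, best_w = score, list(S), w_S.copy()
        j = int(np.argmin(w_S))           # position of the tail component
        del S[j]                          # screen it out: S <- S \ {j}
        w_S = np.delete(w_S, j)
    return best_S, best_w
```

The loop visits exactly $p$ nested candidate models, matching the observation above that only $p$ models (not $2^p$) need to be evaluated.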
Additionally, it offers two key advantages. First, the dimension of the optimization problem decreases as the projected gradient iterations proceed, since more and more components are excluded from the iteration. This can lead to acceleration, as will be confirmed by our numerical experiments. Second, PERMITS induces sparsity in the iterates $w^t$ by setting more and more negligibly small components exactly to zero during the iterations. This also explains why PERMITS has the capability to perform variable selection.

Remark 5 The information criterion typically has the form $n \log f(w) + \lambda_{n,p}\|w\|_0$, where $\lambda_{n,p}$ is a term that scales with $n$ and $p$. Selecting an appropriate value for $\lambda_{n,p}$ is critical to identify the true model $S^*$. Several choices of $\lambda_{n,p}$ have been proposed in the literature, corresponding to different information criteria, including AIC, BIC, and others for the high-dimensional setting (Akaike, 1998; Schwarz, 1978; Wang et al., 2013; Zhu et al., 2020). Algorithm 2 sets $\lambda_{n,p} = \log p \log\log n$, which corresponds to the SIC proposed in Zhu et al. (2020).

2.3 Practical guidance for the tolerance parameter

In PERMITS, we terminate each sub-problem when the successive difference $r_t = \|w^t - w^{t-1}\|_2$ is smaller than $\epsilon$. In fact, the parameter $\epsilon$ balances computation time and estimation accuracy: as $\epsilon$ decreases, the estimation accuracy improves, while more computation time is needed. Hence, a moderate value of $\epsilon$ may be the more appropriate choice. From the theoretical perspective, we provide an upper bound for consistency that contains some unknown quantities such as $\kappa_1$, $\mu_f$, $s^*$, and $\sigma^2$. These unknown quantities complicate the choice of $\epsilon$, and this problem is inevitable if we simultaneously take the statistical and optimization issues into account.
However, compared to these unknown quantities, the sample size $n$ and feature dimension $p$ may change more dramatically across different practical tasks, and thus we are mainly concerned with the order in $n$ and $p$. In Fan et al. (2023), compared to the choice of $\epsilon$ here, a variety of hyper-parameters (e.g., the number of iterations) also need to be chosen manually. In their work, they suggested replacing the unknown part of the theoretical result (similarly, $\kappa_1$, $\mu_f$ in our problem) with a known small constant (e.g., $10^{-4}$), and this method achieved good performance in practice. Hence, in our paper, we follow their suggestion and take $\epsilon$ to be a small multiple of $\sqrt{\log(p)/n}$. In our implementation, $\epsilon = 10^{-4}\sqrt{\log(p)/n}$ is the default choice. Furthermore, from the practical perspective, we also perform additional experiments in Appendix A to show the robustness of PERMITS with respect to this tolerance parameter.

3. Theoretical Properties

3.1 Conditions

We list the conditions needed for establishing the main theoretical results.

(C1) $f$ is $L_f$-smooth over the simplex:
$$\|\nabla f(w) - \nabla f(u)\|_2 \le L_f \|w - u\|_2, \quad \forall\, w, u \in \Delta.$$
Condition (C1) only requires that the gradient $\nabla f(w)$ be $L_f$-Lipschitz over the simplex, which is weaker than the counterpart over the whole of $\mathbb{R}^p$ (Beck, 2017).

(C2) $f$ is restricted strongly convex with parameter $\mu_f$:
$$f(w) \ge f(u) + \langle \nabla f(u), w - u \rangle + \frac{\mu_f}{2}\|w - u\|_2^2$$
holds for any $w, u \in \Delta$ with $w - u \in C(S^*)$, where
$$C(S) := \big\{\delta \in \mathbb{R}^p : 1^\top \delta = 0,\ \delta_{S^c} \ge 0 \text{ or } \delta_{S^c} \le 0\big\}.$$
The restricted set $C(S)$ consists of elements summing to 0 with a common sign on $S^c$, and Condition (C2) only requires that $f$ be $\mu_f$-strongly convex along directions $\delta \in C(S^*)$.
In the linear model setting, this condition is equivalent to the following statement:
$$\frac{1}{n}\|X\delta\|_2^2 \ge \mu_f \|\delta\|_2^2, \quad \forall\, \delta \in C(S^*),$$
which is weaker than the usual restricted eigenvalue condition (Bickel et al., 2009; van de Geer and Bühlmann, 2009; Bühlmann and van de Geer, 2011; Wainwright, 2019) requiring the above inequality to hold over the cone $C_\alpha(S) := \{\delta \in \mathbb{R}^p : \|\delta_{S^c}\|_1 \le \alpha \|\delta_S\|_1\}$ for some $\alpha \ge 1$. In fact, in this case, our restricted set is much smaller: $C(S^*) \subseteq C_1(S^*)$.

(C3) The i.i.d. random errors $\xi_1, \dots, \xi_n$ have zero mean and sub-Gaussian tails: there exists a constant $\sigma > 0$ such that $P(|\xi_i| \ge t) \le 2\exp(-t^2/\sigma^2)$ for all $t \ge 0$. Condition (C3) is a widespread condition in the high-dimensional literature (Wainwright, 2019) and is weaker than the conventional normality condition.

(C4) $b^* := \min_{j \in S^*} w^*_j \ge \sqrt{\dfrac{C\sigma^2 s^* \log p \log\log n}{n \mu_f}}$ for some constant $C > 0$. Condition (C4) is necessary for the theoretical guarantee in the high-dimensional setting. It actually allows $b^*$ to be small as the sample size $n$ increases. Specifically, if we are only concerned with the sample efficiency and ignore constants, this condition says that the minimal signal $b^*$ can decrease at the order $O(\sqrt{\log p \log\log n / n})$. This condition is crucial for support recovery since, if the minimal signal is smaller than the noise level, we cannot expect to recover that weak signal.

(C5) $\rho := \max_{i \ne j} \dfrac{|X_i^\top X_j|}{\|X_i\|_2 \|X_j\|_2} < \dfrac{c}{s^*}$ for some small enough constant $0 < c < 1$. Condition (C5), as discussed in Wainwright (2019), is stronger than Condition (C2), but it is only needed for proving variable selection consistency. It is a technical assumption that might be relaxed in future work.

3.2 Self-regularization of projected gradient descent method

In our theoretical analysis, we directly analyze the quantity $\|w^t - w^*\|_2$, which merges the statistical error $\|\hat w_n - w^*\|_2$ and the optimization error $\|w^t - \hat w_n\|_2$; only the weaker restricted strong convexity condition (C2) is needed for establishing its geometric rate.
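The coherence quantity $\rho$ in Condition (C5) is directly computable from the design matrix. The helper below is our own illustration of that formula (not part of the paper's code), and it assumes every column of $X$ is nonzero.

```python
import numpy as np

def coherence(X):
    """Mutual coherence rho = max_{i != j} |X_i^T X_j| / (||X_i|| ||X_j||),
    the quantity appearing in Condition (C5)."""
    Xn = X / np.linalg.norm(X, axis=0)  # normalize each column to unit norm
    G = np.abs(Xn.T @ Xn)               # absolute cosine similarities
    np.fill_diagonal(G, 0.0)            # exclude the i == j terms
    return G.max()
```

For an orthogonal design the coherence is zero, so (C5) holds trivially; checking $\rho < c/s^*$ on a given data set indicates how plausible the support-recovery guarantee is there.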
First, we formally define two sets, $G(w^*)$ and $B(w^*)$, which represent computationally optimal points and statistically optimal points, respectively.

Definition 6 Define two useful sets as follows:
$G(w^*) := \{w \in \Delta : f(w) \le f(w^*)\}$,
$B(w^*) := \{w \in \Delta : \|w - w^*\|_2 \le 4\sqrt{s^*}\|n^{-1}X^\top\xi\|_\infty/\mu_f\}$,
where $\mu_f$ is the restricted strong convexity constant defined in Condition (C2).

Remark 7 On the one hand, the set $G(w^*)$ is a sub-level set containing all $w$ with loss value no larger than that of the true parameter $w^*$. Note that $f(w^*)$ is usually much larger than the minimum value $f(\hat{w}_n)$. Thus, for an iterate $w^t$ of the PG algorithm, entering $G(w^*)$ is considerably easier than approaching $\hat{w}_n$, and this is why we refer to it as a computationally satisfied set. On the other hand, the set $B(w^*)$ is a ball with center $w^*$ and radius $4\sqrt{s^*}\|n^{-1}X^\top\xi\|_\infty/\mu_f$, which can be viewed as the statistical error. If $\xi$ is $\sigma^2$ sub-Gaussian, this statistical error, with high probability, is of order $O(\sqrt{s^*\sigma^2\log p/n})$, which depends on $p$ only on a logarithmic scale. That is, any point $w \in B(w^*)$ is a satisfactory estimate of $w^*$ even in the high-dimensional case $p \gg n$. This demonstrates the meaning of the statistically optimal set.

We first claim the self-regularization property of the simplex-constrained minimization problem in the following proposition, which is an extension of the result in Li et al. (2020).

Proposition 8 If Condition (C2) holds, then we have $G(w^*) \subseteq B(w^*)$.

Remark 9 As a special case, for any minimizer $\hat{w}_n$ of (1), since $f(\hat{w}_n) \le f(w^*)$, we have $\hat{w}_n \in B(w^*)$ and thus $\|\hat{w}_n - w^*\|_2 \le 4\sqrt{s^*}\|\nabla f(w^*)\|_\infty/\mu_f$, which was proved in Li et al. (2020). This property shows the benefit of the simplex constraint in that it helps prevent noise accumulation in the high-dimensional case ($p \gg n$).
Theorem 10 (Linear Convergence Rate) Suppose Conditions (C1)-(C3) hold. Let $w^t$ be the $t$-th iterate of Algorithm 1 with any $w^0 \in \Delta$, $L_0 > 0$, $\gamma > 1$. Then, with probability at least $1 - O(p^{-3})$, the inequality
$\|w^t - w^*\|_2 \le \sqrt{2}\exp\Big(-\frac{t\mu_f}{2(L_0 \vee \gamma L_f)}\Big) + \frac{C}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$
holds for all $t > 0$, where $C$ is an absolute constant.

Remark 11 The term $\|w^t - w^*\|_2$ encompasses the optimization error $\|w^t - \hat{w}_n\|_2$ and the estimation error $\|\hat{w}_n - w^*\|_2$ simultaneously. It is well known that a global linear convergence rate of the optimization error cannot be achieved under the above conditions. However, if we only care about $\|w^t - w^*\|_2$ and a tolerance of order $\frac{C}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$ is admissible, the corresponding linear convergence rate can be established. To the best of our knowledge, this is the first time such a theoretical result has been derived. As an immediate result of Theorem 10, we obtain the following corollary.

Corollary 12 In the setting of Theorem 10, if we set $T \gtrsim \frac{L_0 \vee \gamma L_f}{\mu_f}\log\Big(\frac{n\mu_f^2}{s^*\sigma^2\log p}\Big)$, then we have $\max\{\|w^T - \hat{w}_n\|_2, \|w^T - w^*\|_2\} \le \frac{C}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$.

In fact, this corollary also provides valuable insights from an optimization perspective. For example, even though a full-stage geometric convergence rate of $\|w^t - \hat{w}_n\|_2$ cannot be obtained, geometric convergence during the first stage is feasible under mild conditions. That is, the number of iterations required is still of logarithmic order, provided that the target precision is of the same order as the statistical error. Similar results were established for general unconstrained and regularized M-estimators in Agarwal et al. (2010). The following theorem claims that, based on the stopping rule $r_t = \|w^t - w^{t-1}\|_2 \le \epsilon$, the PG algorithm can also output a good approximation of the true $w^*$.

Theorem 13 (ℓ2 Error Bound) Suppose Conditions (C1)-(C3) hold. Let $\kappa_1 := \frac{\mu_f}{2(L_0 \vee \gamma L_f)} > 0$. If we set the tolerance parameter $\epsilon$ such that $\epsilon \le \frac{C(e^{\kappa_1} - 1)}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$, then Algorithm 1 outputs a solution $\bar{w}$ satisfying
$\|\bar{w} - w^*\|_2 \le \frac{C}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$,
where $C > 0$ is an absolute constant.

Remark 14 Theorem 13 provides an ℓ2 error bound for the output of the PG algorithm, which aligns with the Lasso-type bounds.
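The projected gradient iteration analyzed above, $w^{t+1} = P_\Delta(w^t - M^{-1}\nabla f(w^t))$, can be sketched in a few lines of NumPy. This is a simplified stand-in for Algorithm 1: it uses a fixed step $1/L$ rather than the paper's backtracking rule, and `project_simplex` is the standard sort-based Euclidean projection onto the probability simplex.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    via the classical sort-and-threshold algorithm."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = idx[u - css / idx > 0][-1]
    theta = css[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def pg_simplex_ls(X, y, T=500, step=None):
    """Plain projected gradient for min_w (1/2n)||Xw - y||^2 over the simplex,
    with a constant step 1/L where L is the largest eigenvalue of X^T X / n."""
    n, p = X.shape
    if step is None:
        step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.full(p, 1.0 / p)              # uniform start, inside the simplex
    for _ in range(T):
        grad = X.T @ (X @ w - y) / n
        w = project_simplex(w - step * grad)
    return w

# Small noiseless demo: the iterates approach the sparse ground truth.
rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.standard_normal((n, p))
w_star = np.zeros(p)
w_star[[0, 1, 2]] = [0.5, 0.3, 0.2]      # sparse point on the simplex
y = X @ w_star
w_hat = pg_simplex_ls(X, y, T=2000)
```

In the noiseless, well-conditioned regime the iterates converge geometrically all the way to $w^*$; with noise, per Theorem 10, the geometric phase stops at the statistical precision.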
Therefore, our theoretical analysis integrates both statistical and optimization perspectives, demonstrating the self-regularization property of the PG algorithm used to solve problem (1).

3.3 Variable selection consistency of PERMITS

We have demonstrated useful computational and statistical properties of the PG algorithm. Motivated by these advantageous properties, PERMITS integrates a straightforward tail screening procedure into the PG algorithm to seek a sparse solution and recover the true support $S^*$. The following theorem characterizes the support set of the solution.

Theorem 15 (Variable Selection Consistency) Suppose Conditions (C1) and (C3)-(C5) hold. Let $w$ be the output of the PERMITS algorithm with $n$ sufficiently large and $T_{\min} \gtrsim \log s^* / \log(\kappa_2^{-1})$, where $\kappa_1 = \frac{\mu_f}{2(L_0 \vee \gamma L_f)}$ and $\kappa_2 = 1 - \mu_f/L_f$. Then, with probability at least $1 - O(p^{-3})$, the following event holds: $\mathrm{supp}(w) = S^*$.

Remark 16 To facilitate the proof, we replace Condition (C2) with the incoherence condition (C5), which is also a common requirement in high-dimensional data analysis (Wainwright, 2019; Wright and Ma, 2022; Tang et al., 2023). To the best of our knowledge, Theorem 15 is the first result providing variable selection consistency using the self-regularization property of problem (1). It is important to note that our result considers the output of a specific algorithm rather than a theoretical minimizer. This distinction is crucial: the former is attainable in practice, while the latter is not.

4. Numerical Experiments

In this section, we perform numerical experiments using both synthetic and real data to verify the self-regularization property and the advantages of PERMITS. The experiments on synthetic data and real-world data are presented in Section 4.1 and Section 4.2, respectively. The results show that the proposed PERMITS algorithm performs better than state-of-the-art methods.
We compare the performance of PERMITS with the following methods: 1) Oracle: the oracle estimator, obtained by solving (1) subject to the constraint that $\mathrm{supp}(w) = S^*$. 2) IHT*: the iterative hard thresholding estimator proposed by Kyrillidis et al. (2013). 3) H-Lasso: the heuristic Lasso estimator that first solves the non-negative Lasso problem and then divides the solution by its ℓ1 norm to fulfill the simplex constraint. Note that Oracle is actually the best estimator for this sparsity-constrained regression problem, and we include it just to exhibit the gap between the other estimators and Oracle. The penalty parameter $\lambda$ of H-Lasso is selected via cross-validation. For IHT*, the true sparsity $s^*$ is provided since, otherwise, selecting the sparsity is time-consuming.

4.1 Synthetic data

In this subsection, we consider both statistical and computational performance. To validate the consistency of support set recovery, we fix $p = 1000$ and vary the sample size $n$ from 100 to 600. The synthetic data is generated as follows. The rows of $X \in \mathbb{R}^{n \times p}$ are i.i.d. draws from a Gaussian distribution $N_p(0, \Sigma)$, where $\Sigma$ is a covariance matrix with exponentially decaying entries ($\Sigma_{ij} = \rho^{|i-j|}$). We consider three correlation cases with $\rho \in \{-0.5, 0, 0.5\}$, shown in different columns of Figures 2 and 3. The true coefficient $w^* \in \mathbb{R}^p$ takes the value 0.1 on $s^* = 10$ randomly chosen positions $S^*$ and 0 on the remaining positions $(S^*)^c$. The additive noise $\xi$ is generated from $N_n(0, \sigma^2 I)$, where $\sigma^2$ is the noise level. Here, $\sigma^2$ is chosen such that the signal-to-noise ratio (SNR), defined by $\mathrm{Var}(Xw^*)/\sigma^2$, belongs to $\{0.5, 1, 5\}$. Results with different SNRs are shown in different rows of Figures 2 and 3. Finally, the response $y$ is generated through the linear model $y = Xw^* + \xi$. Our experiments present the results of 50 repeated simulations, where mean metric (accuracy, error, and time) values are plotted as $n$ ranges from 100 to 600.
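The data-generating process just described can be sketched as follows; `make_simplex_data` is our own helper name, and the weight $1/s^*$ per active coordinate matches the 0.1-on-10-positions setup above.

```python
import numpy as np

def make_simplex_data(n=300, p=100, s=10, rho=0.5, snr=5.0, seed=0):
    """Generate data per Section 4.1: rows of X are N_p(0, Sigma) with
    Sigma_ij = rho^|i-j|, w* puts weight 1/s on s random coordinates, and
    sigma^2 is calibrated so that Var(X w*) / sigma^2 = snr."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    support = rng.choice(p, size=s, replace=False)
    w_star = np.zeros(p)
    w_star[support] = 1.0 / s                             # point on the simplex
    signal = X @ w_star
    sigma2 = signal.var() / snr                           # SNR calibration
    y = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, w_star, support

X, y, w_star, support = make_simplex_data(n=200, p=40, s=5, rho=-0.5, seed=3)
```

Note that $\rho = -0.5$ also yields a valid (positive definite) AR(1)-type covariance, so the same generator covers all three correlation cases.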
Statistical performance. We first present the statistical performance of the methods. Two metrics are considered here: accuracy and error. For any given estimator $w$, we define its accuracy and error as follows:
$\text{Accuracy} := \frac{|\mathrm{supp}(w) \cap \mathrm{supp}(w^*)|}{|\mathrm{supp}(w) \cup \mathrm{supp}(w^*)|}, \qquad \text{Error} := \|w - w^*\|_2.$
Firstly, we visualize the variation of accuracy as the sample size $n$ increases in Figure 2. As $n$ increases, the accuracy of PERMITS (the pink line in each sub-figure) approaches 1, which validates the variable selection consistency shown theoretically in Theorem 15. As the SNR increases, the sample size needed for high accuracy decreases; this can be seen by comparing the panels within each column of Figure 2. Similarly, each row shows the effect of different correlations for a fixed SNR, and the highly correlated case ($\rho = 0.5$) needs a larger sample size to attain high accuracy. Figure 2 also reveals that the case with $\rho = -0.5$ seems to be easier than that with $\rho = 0.5$. This is actually due to an informal fact: the constraint $\Delta = \{w : w \ge 0, \mathbf{1}^\top w = 1\}$ excludes negatively correlated features, so only positively correlated features are effectively under consideration. Thus, the case with $\rho = -0.5$ essentially has a lower effective feature dimension than that with $\rho = 0.5$. As the sample size $n$ increases, the accuracy of each method except H-Lasso tends to 1, and PERMITS universally outperforms the other methods at every SNR and correlation. H-Lasso fails to be consistent since it includes too many irrelevant features, a defect also validated in the later real data experiment. We then show the ℓ2 error in Figure 3, which demonstrates that PERMITS closely approaches the oracle ℓ2 error when $n$ is sufficiently large. As the sample size $n$ increases, each method tends to attain the same ℓ2 error as Oracle, and PERMITS outperforms the other methods in the sense that it attains the same ℓ2 error with a smaller sample size.
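The two metrics above translate directly into code; the function names below are ours for illustration.

```python
import numpy as np

def support_accuracy(w_hat, w_star):
    """Jaccard overlap |supp(w_hat) & supp(w*)| / |supp(w_hat) | supp(w*)|."""
    a = set(np.flatnonzero(w_hat))
    b = set(np.flatnonzero(w_star))
    return len(a & b) / len(a | b) if (a | b) else 1.0

def l2_error(w_hat, w_star):
    """The estimation error ||w_hat - w*||_2."""
    return float(np.linalg.norm(np.asarray(w_hat) - np.asarray(w_star)))

w_star = np.array([0.5, 0.5, 0.0, 0.0])
w_hat = np.array([0.6, 0.0, 0.4, 0.0])
acc = support_accuracy(w_hat, w_star)    # overlap {0}, union {0, 1, 2}
err = l2_error(w_hat, w_star)
```

Accuracy equal to 1 corresponds to exact support recovery, the event guaranteed by Theorem 15.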
The effects of SNR and correlation are similar to those in Figure 2. Note that, in contrast to accuracy, H-Lasso outperforms IHT* in terms of ℓ2 error, although it tends to select more features.

Computational performance. With respect to computational performance, we mainly focus on the running time of the different methods and its dependence on the dimension $p$. Here, we compare the mean running time over 50 repeated experiments with $p$ ranging from 100 to 1000 and $n$ fixed at 500. Figure 4 shows the time demanded by these four methods. We aggregate the three different SNRs by their means, and thus only one row appears in Figure 4, in contrast to the three rows of Figures 2 and 3. Note that H-Lasso is much more time-consuming than the other methods due to the cross-validation procedure.

4.2 Real data: DJIA constituents detection

In this subsection, we consider a real-world application that aims to detect the constituents of the Dow Jones Industrial Average (DJIA) index based on the daily returns of the DJIA index and a pool of stocks. The daily return of the DJIA index $y_t$ is a price-weighted daily return of $s^* = 30$ prominent companies. Here, we choose the stock pool to be the constituents of the S&P 500, which contains $p = 490$ stocks after dropping those with missing data. The resulting stock pool contains the constituents of the DJIA as a subset, and we want to recover this subset using daily return data only.

Figure 2: The sample size (x-axis) versus variable selection accuracy (y-axis). We fix $p = 1000$, $s^* = 10$. Each sub-figure corresponds to a different choice of SNR (in different rows) and correlation structure (in different columns). The sample size $n$ ranges from 100 to 600 with step 50. 50 repeated experiments are performed, and the mean values are plotted.
We collect daily return data for the year 2022, which contains $n = 251$ samples. Denote the daily return of the DJIA by $y_t \in \mathbb{R}$ and the returns of the pool by $x_t \in \mathbb{R}^p$; then we have $y_t = x_t^\top w_t$, $t = 1, 2, \dots, 251$, where $w_t \in \mathbb{R}^p$ gives the weights of the 490 stocks on the $t$-th day. Although the 30 prominent constituents of the DJIA are fixed across $t$ (i.e., $\mathrm{supp}(w_t)$ is fixed), the daily weights of these 30 stocks change every day (i.e., $w_t \neq w_{t'}$ for $t \neq t' \in \{1, \dots, 251\}$). To put this application into the usual setting of a signal-plus-noise linear model, we can construct a ground-truth coefficient $w^*$ and view $\xi_t = x_t^\top(w_t - w^*)$ as the noise. Then we have the equivalent model $y = Xw^* + \xi$, where $X \in \mathbb{R}^{251 \times 490}$ and $y \in \mathbb{R}^{251}$ are observed data and $(w^*, \xi)$ are unobservable. The task is to recover the support set of $w^*$ based on $(X, y)$, i.e., to detect the constituents of the DJIA based only on the daily returns of the DJIA index and the 490 stocks. The right panel of Figure 5 shows the weight $w^*$ defined as above. We emphasize that $w^*$ is constructed manually and is indeed in the simplex; furthermore, we know the 30 prominent constituents, i.e., the support set $\mathrm{supp}(w^*)$. Therefore, this DJIA dataset serves as an appropriate benchmark that enables the evaluation and comparison of methods for simplex-constrained sparse optimization. The optimization is affected by the correlation structure of $X$, which is visualized in the left panel of Figure 5.

Figure 3: The sample size (x-axis) versus ℓ2 error of parameter estimation (y-axis). The remaining settings are the same as in Figure 2.

Figure 4: The dimension (x-axis) versus running time (y-axis). We fix $n = 500$, and $p$ ranges from 100 to 1000 with step size 50. The average running time over 50 repeated experiments is plotted.
As we can see from Figure 5, the correlation structure differs from that of the synthetic data, as the correlation need not decay exponentially. Hence, this real-world dataset complements the synthetic data analysis and helps us understand the empirical performance of PERMITS under different and more challenging correlation settings.

Figure 5: Basic information of the DJIA index. The left panel shows the correlation matrix of the 490 stocks in our pool, and the right panel shows the constituents of the DJIA, whose sizes are determined by the weight $w^*$.

Table 1 presents the performance of the three methods, evaluated using four metrics: (i) correctly detected stocks, i.e., stocks in the DJIA identified by the method; (ii) wrongly detected stocks, i.e., stocks not in the DJIA but selected by the method; (iii) accuracy, i.e., the proportion of correctly detected stocks relative to the total number of detected and missed stocks; and (iv) ℓ2 error in estimation.

Table 1: Performance of detection of DJIA constituents.
Method    Correctly detected    Wrongly detected    Accuracy       ℓ2 Error
H-Lasso   30                    58                  34% (30/88)    0.040
IHT*      9                     21                  18% (9/51)     0.226
PERMITS   28                    0                   93% (28/30)    0.038

As shown in Table 1, PERMITS detects 28 constituents of the DJIA and misses 2 constituents, KO (with weight 1.17%) and WBA (with weight 0.97%), whose weights are too small to be distinguished from noise; see Figure 5. Although H-Lasso detects all 30 stocks, it also includes 58 wrong ones, which is not an ideal result. Besides, IHT* correctly detects only a few constituents, reflecting that it has less power to identify stocks in the DJIA; this may be because the signal is too weak. IHT* also wrongly selects some stocks, though fewer than H-Lasso (21 versus 58).
In summary, our method is the only one that finds almost all constituents without any false discovery. Notably, PERMITS also achieves the smallest estimation error among these methods.

4.3 Efficiency of the tail screening procedure

In this subsection, we perform numerical experiments to show the efficiency of the tail screening procedure. Specifically, we compare the outputs of the following two methods. i) Without tail screening (TS): the empirical risk minimizer of the original problem
$\min_{w \in \mathbb{R}^p} f(w) = \frac{1}{2n}\|Xw - y\|_2^2 \quad \text{s.t.} \quad w \in \Delta.$
The solution of this problem is obtained numerically via CVXPY (Diamond and Boyd, 2016), an open-source Python library for convex optimization. We solve the problem using CVXPY with its default settings. By default, CVXPY calls the solver most specialized to the problem type; specifically, for the simplex-constrained least squares problem here, CVXPY calls the Operator Splitting Quadratic Program (OSQP) solver (Stellato et al., 2020). We use the OSQP solver's default hyperparameters: a maximum of $10^4$ iterations, an absolute accuracy tolerance of $10^{-5}$, and a relative accuracy tolerance of $10^{-5}$. ii) With TS: the output of our proposed PERMITS algorithm.

Table 2: Comparison of methods with and without the tail screening procedure across different values of $n$, $p$, and $s^*$.
(n, p, s*)         Method       Accuracy      Sparsity          ℓ2 Error     Time
(500, 500, 5)      Without TS   0.13 (0.09)   148.44 (185.92)   0.10 (0.02)  1.52 (0.32)
                   With TS      0.97 (0.06)   5.16 (0.37)       0.05 (0.02)  0.02 (0.00)
(500, 1000, 10)    Without TS   0.11 (0.08)   249.58 (230.69)   0.11 (0.02)  3.95 (1.04)
                   With TS      0.98 (0.05)   10.26 (0.53)      0.05 (0.02)  0.30 (0.07)
(1000, 1000, 20)   Without TS   0.12 (0.09)   302.98 (183.42)   0.07 (0.01)  8.00 (2.16)
                   With TS      0.99 (0.02)   20.10 (0.30)      0.03 (0.01)  0.32 (0.06)
(1000, 2000, 30)   Without TS   0.08 (0.05)   487.34 (168.99)   0.07 (0.01)  31.11 (7.68)
                   With TS      0.97 (0.04)   29.38 (0.85)      0.04 (0.01)  0.57 (0.26)
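The contrast between the two variants can be reproduced in miniature. The sketch below is our own illustrative rendition of the "with TS" idea, not the released PERMITS code: it alternates projected-gradient passes with a screening step that zeroes components below a threshold, and the threshold $\sqrt{\log p / n}$ is a hypothetical stand-in for the paper's screening rule.

```python
import numpy as np

def project_simplex(v):
    """Sort-based Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def pg_tail_screening(X, y, T=200, tau=None, rounds=5):
    """Sketch of tail screening: run projected-gradient passes, then discard
    components whose weight falls below tau and continue on the reduced support.
    tau = sqrt(log p / n) is an illustrative noise-level-scale choice."""
    n, p = X.shape
    if tau is None:
        tau = np.sqrt(np.log(p) / n)
    support = np.arange(p)
    w = np.full(p, 1.0 / p)
    for _ in range(rounds):
        Xs = X[:, support]
        L = np.linalg.eigvalsh(Xs.T @ Xs / n).max()
        ws = w[support]
        for _ in range(T):                   # projected-gradient pass
            grad = Xs.T @ (Xs @ ws - y) / n
            ws = project_simplex(ws - grad / L)
        keep = ws > tau                      # screening: drop the negligible tail
        support = support[keep]
        ws = ws[keep] / ws[keep].sum()       # renormalize back onto the simplex
        w = np.zeros(p)
        w[support] = ws
    return w

# Miniature demo: 5 true coordinates with weight 0.2 each, noiseless response.
rng = np.random.default_rng(2)
n, p = 300, 50
truth = np.array([0, 10, 20, 30, 40])
w_star = np.zeros(p)
w_star[truth] = 0.2
X = rng.standard_normal((n, p))
y = X @ w_star
w_hat = pg_tail_screening(X, y)
```

Even this crude variant recovers the support exactly in easy regimes, while the unscreened iterates would remain dense; the paper's PERMITS additionally couples screening with the special information criterion.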
The comparisons are conducted on synthetic datasets, following the same data-generating process as Section 4.1. Four criteria are used to assess the two methods. Two of them (accuracy and ℓ2 error) are adopted from Section 4.1. As for the remaining two, sparsity denotes the ℓ0 norm of the output, and time is the runtime of the method, measured in seconds. The mean values (with standard deviations in parentheses) over 50 replicated simulated datasets are reported in Table 2. Table 2 demonstrates that the proposed PERMITS algorithm is significantly faster than a standard convex optimization solver. This efficiency arises from two key factors: (i) leveraging the self-regularization property to eliminate unnecessary iterations and (ii) reducing the effective dimension of the iterates $w^t$ through the tail screening procedure. Beyond the computational benefits, the tail screening procedure also enhances estimation quality. By sparsifying the solution, it ensures that the estimated sparsity level closely matches the true support size $s^*$. As a result, both the accuracy of recovering the support $S^*$ and the ℓ2 error in estimating $w^*$ improve significantly. These findings highlight that tail screening not only accelerates computation but also strengthens statistical estimation.

5. Conclusion and Discussion

In this work, we study the statistical and computational properties of the PG method applied to the simplex-constrained least squares problem. By bridging the gap between statistics and optimization, we extend the self-regularization property from the minimizer to the iterates of the PG algorithm. Without any regularization, the iterates of the PG algorithm can approach the optimal statistical error at a geometric rate. Furthermore, we propose the PERMITS algorithm, which can accurately recover the true support set, demonstrating its broad applicability in real-world tasks.
Numerical experiments validate the effectiveness of both the statistical and computational performance of the PERMITS method. Several potential extensions of this work may be considered. First, variable selection consistency is proven under the mutual incoherence condition, which may be overly stringent. In the numerical experiments, we observe that PERMITS still performs well in the highly correlated setting, and thus we anticipate relaxing this condition to a weaker one, such as the restricted strong convexity condition used in the proof of the ℓ2 error. Second, the self-regularization in our work is actually due to the geometric property of the simplex. We expect to extend self-regularization to more general constraint sets, such as polyhedra. Lastly, similar statistical properties may be extended to other models, such as the generalized linear model and the group linear model (Zhu et al., 2022; Zhang et al., 2023).

Acknowledgments

We would like to thank the two referees for their constructive comments and valuable suggestions, which have substantially improved the paper. Wang's research is partially supported by National Natural Science Foundation of China grants No. 12231017, 72171216, 71921001, and 71991474.

Appendix A. Additional Experiments

In this part, we perform some extended experiments with the same settings as Section 4 to show the effect of the tolerance parameter $\epsilon$. In Section 4, our proposed PERMITS method is implemented with the default value $\epsilon = 10^{-4}\sqrt{\log(p)/n}$. Here, we show the results for two other choices, $\epsilon = 10^{-3}\sqrt{\log(p)/n}$ and $\epsilon = 10^{-5}\sqrt{\log(p)/n}$; the three methods are referred to as PERMITS(e-3), PERMITS(e-4), and PERMITS(e-5), respectively. Three criteria (support accuracy, ℓ2 error, and running time), compared against some benchmark methods, are shown in Figures 6-8. We conclude that all three choices of $\epsilon$ are appropriate across all simulation settings, with only a slight difference in running time.

Appendix B.
Details of the Backtracking Procedure

Recall that, given the current iterate $w$ and a guess value $M > 0$ of the Lipschitz constant $L_f$, we can make a projected gradient iteration by substituting $f(u)$ with a quadratic approximation at $w$ and solving the following minimization problem:
$w^+ = \arg\min_{u \in \mathbb{R}^p} f(w) + \langle \nabla f(w), u - w \rangle + \frac{M}{2}\|u - w\|_2^2 + \chi_\Delta(u) = P_\Delta(w - M^{-1}\nabla f(w)).$
The quality of this new iterate $w^+$ is affected by the step size $M$, i.e., the guess value of the Lipschitz constant. To obtain a sufficient decrease, we need to check whether $M$ satisfies
$f(w^+) \le f(w) + \langle \nabla f(w), w^+ - w \rangle + \frac{M}{2}\|w^+ - w\|_2^2. \quad (4)$
If inequality (4) holds, we adopt $w^+$ as the next iterate. Otherwise, we increase the guess $M$ by a factor $\gamma > 1$ to a better guess $\gamma M$ and repeat the above projected gradient iteration until inequality (4) holds. This repetition terminates in finitely many steps by the Lipschitz smoothness condition (C1), since (4) holds whenever $M$ is larger than the Lipschitz constant $L_f$. We summarize the iterative procedure for finding $M$ in Algorithm 3.

Figure 6: The sample size (x-axis) versus variable selection accuracy (y-axis). We fix $p = 1000$, $s^* = 10$. Each sub-figure corresponds to a specific choice of SNR (in different rows) and correlation structure (in different columns). The sample size $n$ ranges from 100 to 600 with step 50. We perform 50 repeated experiments, and the mean values are plotted. PERMITS with three different tolerance parameters is compared with benchmark methods and the oracle method.

Figure 7: The sample size (x-axis) versus ℓ2 error of parameter estimation (y-axis).
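Algorithm 3 translates directly into code. The sketch below is our own rendition: one projected-gradient step with the sufficient-decrease check of inequality (4) inlined, using the standard sort-based simplex projection.

```python
import numpy as np

def project_simplex(v):
    """Sort-based Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def backtracking_pg_step(f, grad_f, w, M, gamma=2.0):
    """One projected-gradient step with backtracking (Algorithm 3 sketch):
    grow the Lipschitz guess M by gamma until inequality (4) holds, then
    return the accepted iterate w+ and the final M."""
    g = grad_f(w)
    while True:
        w_plus = project_simplex(w - g / M)
        d = w_plus - w
        # sufficient decrease: f(w+) <= f(w) + <g, w+ - w> + (M/2)||w+ - w||^2
        if f(w_plus) <= f(w) + g @ d + 0.5 * M * (d @ d):
            return w_plus, M
        M *= gamma                      # (4) failed: try a larger guess

# Demo on a small least squares loss over the simplex.
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 8))
y = rng.standard_normal(50)
f = lambda w: 0.5 * np.sum((X @ w - y) ** 2) / 50
grad = lambda w: X.T @ (X @ w - y) / 50
w0 = np.full(8, 1.0 / 8.0)
w1, M1 = backtracking_pg_step(f, grad, w0, M=1e-3)
```

Note that the accepted step never increases the loss: by the minimality of $w^+$, $\langle \nabla f(w), w^+ - w \rangle + \frac{M}{2}\|w^+ - w\|_2^2 \le 0$, so (4) implies $f(w^+) \le f(w)$.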
The remaining settings are the same as in Figure 6.

Figure 8: The dimension (x-axis) versus running time (y-axis). We fix $n = 500$, and $p$ ranges from 100 to 1000 with step size 50. The average running time over 50 repeated experiments is plotted.

Algorithm 3 Backtracking procedure for finding $M$
Input: $M > 0$, $\gamma > 1$.
1: while the sufficient decrease condition (4) is not satisfied do
2:   $w^+ \leftarrow P_\Delta(w - M^{-1}\nabla f(w))$
3:   $M \leftarrow \gamma M$
Output: $M$

Remark 17 The input parameter $M$ of Algorithm 3 is the initial guess at the $t$-th iteration and may vary across iterations. We set it to $\max\{L_0, \bar{M}/\gamma\}$, where $\bar{M}$ is the last value of $M$ in the $(t-1)$-th iteration of Algorithm 1; more details can be found in Nesterov (2013).

Appendix C. Technical Proofs

In this part, we prove the corresponding computational and statistical properties. The main body of this paper studies the case where $f$ is the least squares loss function. Here, we give the proofs of the properties on the ℓ2 error and the linear convergence rate for a general differentiable convex function $f$. So, for the least squares loss, i.e., $f(w) = \frac{1}{2n}\|Xw - y\|_2^2$, the properties hold as special cases. Recall that $w^*$ is the parameter of interest that we want to recover, and $f$ satisfies Conditions (C1) and (C2), where the latter implies that for any $w \in \Delta$ (note that $w - w^* \in \mathcal{C}(S^*)$ automatically), it holds that
$f(w^*) \ge f(w) + \langle \nabla f(w), w^* - w \rangle + \frac{\mu_f}{2}\|w^* - w\|_2^2$ and
$f(w) \ge f(w^*) + \langle \nabla f(w^*), w - w^* \rangle + \frac{\mu_f}{2}\|w - w^*\|_2^2.$

C.1 Proof of Proposition 8

The following proof extends Proposition 2 of Li et al. (2020). Proof. Let $w$ be any point in $G(w^*)$, that is, $f(w) \le f(w^*)$.
Then
$0 \ge f(w) - f(w^*) \ge \langle \nabla f(w^*), w - w^* \rangle + \frac{\mu_f}{2}\|w - w^*\|_2^2 \ge -\|\nabla f(w^*)\|_\infty \|w - w^*\|_1 + \frac{\mu_f}{2}\|w - w^*\|_2^2 \ge -2\sqrt{s^*}\|\nabla f(w^*)\|_\infty \|w - w^*\|_2 + \frac{\mu_f}{2}\|w - w^*\|_2^2,$
where the second inequality follows from Condition (C2), the third inequality is due to Hölder's inequality, and the last one follows from the following property of the simplex: for any $w \in \Delta$, letting $S$ be the support set of $w^*$,
$\|w - w^*\|_1 = \sum_{i \in S}|w_i - w^*_i| + \sum_{i \notin S}|w_i - w^*_i| = \sum_{i \in S}|w_i - w^*_i| + 1 - \sum_{i \in S} w_i = \sum_{i \in S}|w_i - w^*_i| + \sum_{i \in S}(w^*_i - w_i) \le 2\|w_S - w^*_S\|_1 \le 2\sqrt{s^*}\|w_S - w^*_S\|_2.$
Thus, $\|w - w^*\|_2 \le \frac{4\sqrt{s^*}}{\mu_f}\|\nabla f(w^*)\|_\infty$. Using the fact that $\nabla f(w^*) = n^{-1}X^\top(Xw^* - y) = -n^{-1}X^\top\xi$, we have $w \in B(w^*)$. Since $w \in G(w^*)$ is arbitrary, it holds that $G(w^*) \subseteq B(w^*)$.

C.2 Proof of Theorem 10

Proof. First, we prove that before entering the region $G(w^*)$, the convergence of $w^t$ is geometric. Assume that $w^t$ is outside of $G(w^*)$, that is, $f(w^t) > f(w^*)$. The projected gradient iteration performs the backtracking strategy for selecting the step length, or equivalently $M_t$, to guarantee the sufficient decrease
$f(w^t) \le f(w^{t-1}) + \langle \nabla f(w^{t-1}), w^t - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2.$
Thus, denoting $z^t = w^{t-1} - \frac{1}{M_t}\nabla f(w^{t-1})$, we have
$f(w^t) - f(w^{t-1}) \le \langle \nabla f(w^{t-1}), w^t - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2$
$= \langle \nabla f(w^{t-1}), w^t - w^* \rangle + \langle \nabla f(w^{t-1}), w^* - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2$
$= M_t\langle w^{t-1} - z^t, w^t - w^* \rangle + \langle \nabla f(w^{t-1}), w^* - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2$
$\le M_t\langle w^{t-1} - w^t, w^t - w^* \rangle + \langle \nabla f(w^{t-1}), w^* - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2$
$= \frac{M_t}{2}\big(\|w^{t-1} - w^*\|_2^2 - \|w^{t-1} - w^t\|_2^2 - \|w^t - w^*\|_2^2\big) + \langle \nabla f(w^{t-1}), w^* - w^{t-1} \rangle + \frac{M_t}{2}\|w^t - w^{t-1}\|_2^2$
$= \langle \nabla f(w^{t-1}), w^* - w^{t-1} \rangle + \frac{M_t}{2}\big(\|w^{t-1} - w^*\|_2^2 - \|w^t - w^*\|_2^2\big)$
$\le f(w^*) - f(w^{t-1}) - \frac{\mu_f}{2}\|w^{t-1} - w^*\|_2^2 + \frac{M_t}{2}\big(\|w^{t-1} - w^*\|_2^2 - \|w^t - w^*\|_2^2\big).$
The first inequality is guaranteed by the sufficient decrease property of PG, and the second equality follows from the definition of $z^t$. The second inequality is due to the convex projection property that $\langle w^t - z^t, w^t - w^* \rangle \le 0$. The third equality uses the fact that $2\langle a, b \rangle = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$. The last inequality follows from the restricted strong convexity condition (C2) and the facts that $w^*, w^{t-1} \in \Delta$, $\mathbf{1}^\top(w^* - w^{t-1}) = 1 - 1 = 0$, and $(w^* - w^{t-1})_{(S^*)^c} = -w^{t-1}_{(S^*)^c} \le 0$.
By the assumption $f(w^t) > f(w^*)$, we have
$0 < f(w^t) - f(w^*) \le -\frac{\mu_f}{2}\|w^{t-1} - w^*\|_2^2 + \frac{M_t}{2}\big(\|w^{t-1} - w^*\|_2^2 - \|w^t - w^*\|_2^2\big) = \frac{M_t}{2}\Big[\Big(1 - \frac{\mu_f}{M_t}\Big)\|w^{t-1} - w^*\|_2^2 - \|w^t - w^*\|_2^2\Big].$
Therefore,
$\|w^t - w^*\|_2^2 < \Big(1 - \frac{\mu_f}{M_t}\Big)\|w^{t-1} - w^*\|_2^2 \le \exp\Big(-\frac{\mu_f}{M_t}\Big)\|w^{t-1} - w^*\|_2^2.$
Since the sufficient decrease condition holds for any $M \ge L_f$, by the backtracking strategy we have $M_t \le L_0 \vee (\gamma L_f)$ for any $t \ge 0$. Hence,
$\|w^t - w^*\|_2 \le \exp\Big(-\frac{1}{2}\cdot\frac{\mu_f}{L_0 \vee (\gamma L_f)}\Big)\|w^{t-1} - w^*\|_2.$
Applying the above result recursively gives
$\|w^t - w^*\|_2 \le \exp\Big(-\frac{1}{2}\cdot\frac{t\mu_f}{L_0 \vee (\gamma L_f)}\Big)\|w^0 - w^*\|_2 \le \sqrt{2}\exp\Big(-\frac{1}{2}\cdot\frac{t\mu_f}{L_0 \vee (\gamma L_f)}\Big),$
where the last inequality uses the fact that $\|w - u\|_2^2 \le \|w - u\|_1 \le 2$ for any $w, u \in \Delta$. This means that when $f(w^t) > f(w^*)$, that is, when $w^t$ is outside of $G(w^*)$, $\|w^t - w^*\|_2$ converges at a linear rate. Since PG is a descent algorithm, $w^t$ approaches $G(w^*)$ at a linear rate and will stay in $G(w^*)$ henceforth. Therefore, there exists $T_0 > 0$ such that $w^t \in G(w^*) \subseteq B(w^*)$ for any $t \ge T_0$. Combining the above arguments, we have for any $t = 0, 1, 2, \dots$ that
$\|w^t - w^*\|_2 \le \sqrt{2}\exp\Big(-\frac{1}{2}\cdot\frac{t\mu_f}{L_0 \vee (\gamma L_f)}\Big) + \frac{4\sqrt{s^*}}{\mu_f}\|n^{-1}X^\top\xi\|_\infty.$
When $f$ is the linear least squares loss, Lemma 18 gives that $\|n^{-1}X^\top\xi\|_\infty \lesssim \sqrt{\sigma^2\log p/n}$ holds with probability at least $1 - O(p^{-3})$. The proof is completed.

C.3 Proof of Theorem 13

Proof. Before the statistical precision is attained, we can lower bound the difference $r_t := \|w^t - w^{t-1}\|_2$ via $\|w^{t-1} - w^*\|_2$ as follows. From the proof of Theorem 10, when $w^t$ is outside of $G(w^*)$, we have the one-step progress
$\|w^t - w^*\|_2 \le e^{-\kappa_1}\|w^{t-1} - w^*\|_2,$
where $\kappa_1 = \frac{1}{2}\cdot\frac{\mu_f}{L_0 \vee (\gamma L_f)} > 0$. By the triangle inequality, we have
$r_t = \|w^t - w^{t-1}\|_2 \ge \|w^{t-1} - w^*\|_2 - \|w^t - w^*\|_2 \ge (1 - e^{-\kappa_1})\|w^{t-1} - w^*\|_2.$
For any tolerance parameter $\epsilon > 0$, if the PG algorithm terminates ($r_t \le \epsilon$), it must hold that either $w^t$ is outside $G(w^*)$, in which case
$\|w^t - w^*\|_2 \le e^{-\kappa_1}\|w^{t-1} - w^*\|_2 \le \frac{r_t}{e^{\kappa_1} - 1} \le \frac{\epsilon}{e^{\kappa_1} - 1},$
or $w^t$ is inside $G(w^*) \subseteq B(w^*)$. Hence, if we choose $\epsilon \le \frac{C(e^{\kappa_1} - 1)}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$, the PG algorithm will terminate and output a solution $\bar{w}$ satisfying $\|\bar{w} - w^*\|_2 \le \frac{C}{\mu_f}\sqrt{\frac{s^*\sigma^2\log p}{n}}$. This completes the proof of Theorem 13.

C.4 Proof of Theorem 15

Proof. Let $\check{w} := \arg\min_{w \in \Delta} f(w)$ s.t. $w_{(S^*)^c} = 0$ and $\bar{w} := \arg\min_{w \in \Delta} f(w)$ s.t. $w_{S^c} = 0$ be the minimizers of the original optimization problem constrained on $S^*$ and $S$, respectively.
Correspondingly, denote by $\tilde{w}$ and $\hat{w}$ the outputs of the PG algorithm supported on $S^*$ and $S$, respectively. By the definition of the minimizers, we have $f(\tilde{w}) \ge f(\check{w})$ and $f(\hat{w}) \ge f(\bar{w})$. From the ℓ2 error bound in Theorem 13 and the minimal signal condition (C4), we have that $S_t := \mathrm{supp}(w^t) \supseteq S^*$ for small $t$ and that $S_t$ shrinks as $t$ increases. To simplify the notation, we write $S$ for $S_t$. Hence, only three cases can happen along the iteration path of PERMITS: $S \supsetneq S^*$, $S = S^*$, or $S \subsetneq S^*$. To prove this theorem, it is sufficient to show that $\mathrm{SIC}(S^*) < \mathrm{SIC}(S)$ for any $S \neq S^*$. We split our proof into the following two cases: (I) $|S| > |S^*|$ and (II) $|S| < |S^*|$. We only prove the first case (I) since similar arguments hold for case (II). Now, suppose that $|S| > |S^*|$ holds. From Theorem 13, we have that all true variables are included in the current support set, that is, $S \supseteq S^*$. Denote $S = S^* \cup B$ with $|B| > 0$ and $S^* \cap B = \emptyset$. We only need to prove that
$f(\tilde{w}) - f(\hat{w}) \lesssim \frac{|B|\sigma^2\log p}{n}.$
From Lemma 20, we have for the optimal solutions $\check{w}$ and $\bar{w}$ that
$f(\check{w}) - f(\bar{w}) \lesssim \frac{|B|\sigma^2\log p}{n}.$
By the definition of the minimizer $\bar{w}$, it holds that $f(\bar{w}) \le f(\hat{w})$, and thus
$f(\check{w}) - f(\hat{w}) \lesssim \frac{|B|\sigma^2\log p}{n}.$
Hence, it remains to bound the optimization error over $S^*$ such that
$f(\tilde{w}) - f(\check{w}) \le O\Big(\frac{\sigma^2\log p}{n}\Big).$
By the warm start in the PERMITS algorithm, for each $t > 1$, the initial error is no more than $O\big(\frac{s^*\sigma^2\log p}{n}\big)$. Note that $\tilde{w}$ and $\check{w}$ are now constrained on the support $S^*$ of size $|S^*| = s^* \ll n$. That is, the design matrix $X$ has only $s^*$, rather than $p \gg n$, columns. Therefore, Condition (C5) implies that $f$ is globally strongly convex over this low-dimensional support $S^*$. Consequently, the iterate $\tilde{w}$ within this subproblem converges at a linear rate exactly to $\check{w}$, rather than only to a neighbourhood (in contrast to the result of Theorem 10, where only convergence to a neighbourhood is guaranteed since only the weaker restricted strong convexity is required there).
Thus, the initial optimization error $O\big(\frac{s^*\sigma^2\log p}{n}\big)$ decreases to 0 linearly, and after $T_{\min} \gtrsim \log s^* / \log(\kappa_2^{-1})$ iterations, the optimization error decreases to $O\big(\frac{\sigma^2\log p}{n}\big)$. Hence, we have shown that
$f(\tilde{w}) - f(\hat{w}) \lesssim \frac{|B|\sigma^2\log p}{n}.$
Using the inequality $\log(x) \le x - 1$ and the fact $f(\hat{w}) > 0$, we have
$\mathrm{SIC}(S^*) - \mathrm{SIC}(S) = n\log\frac{f(\tilde{w})}{f(\hat{w})} - |B|\log(p)\log\log n \le n\frac{f(\tilde{w}) - f(\hat{w})}{f(\hat{w})} - |B|\log(p)\log\log n \le O(|B|\sigma^2\log p) - |B|\log(p)\log\log n < 0$
for sufficiently large $n$. The proof is completed.

C.5 Technical Lemmas

Without loss of generality, we assume that $\|X_j\|_2/\sqrt{n} \le C$ holds for some constant $C > 0$ throughout this paper.

Lemma 18 Suppose that $y = Xw^* + \xi$, where $\xi_i$, $i \in [n]$, are i.i.d. sub-Gaussian random variables, and $f(w) = (2n)^{-1}\|Xw - y\|_2^2$ is the least squares loss function. Then, with probability at least $1 - (8p^3)^{-1}$,
$\|\nabla f(w^*)\|_\infty = \|n^{-1}X^\top\xi\|_\infty \le 2\sqrt{2C^2\sigma^2\log(2p)/n}.$
Proof. By definition, $\frac{1}{n}X_j^\top\xi$ is a sub-Gaussian random variable. Using the union bound, we have
$\mathbb{P}\big(\|n^{-1}X^\top\xi\|_\infty \ge t\big) \le \sum_{j=1}^p \mathbb{P}\big(|n^{-1}X_j^\top\xi| \ge t\big) \le 2p\exp\Big(-\frac{nt^2}{2C^2\sigma^2}\Big).$
Setting $t = 2\sqrt{2C^2\sigma^2\log(2p)/n}$, we have with probability at least $1 - (8p^3)^{-1}$ that
$\|\nabla f(w^*)\|_\infty = \|n^{-1}X^\top\xi\|_\infty \le 2\sqrt{2C^2\sigma^2\log(2p)/n}.$
From Proposition 8, we can derive an ℓ∞ bound $\|\hat{w}_n - w^*\|_\infty \le \|\hat{w}_n - w^*\|_2 \lesssim \sqrt{s^*\sigma^2\log p/n}$, which still depends on $s^*$. The following lemma says that this bound can be improved to $\lesssim \sqrt{\sigma^2\log p/n}$ if the mutual incoherence condition (C5) holds.

Lemma 19 (ℓ∞ error bound) Suppose Conditions (C3)-(C5) hold. Let $(\tilde{w}, \tilde{\alpha}, \tilde{\beta})$ be a KKT triple of problem (1). Then, with probability at least $1 - O(p^{-3})$, we have $\|\tilde{w} - w^*\|_\infty \lesssim \|n^{-1}X^\top\xi\|_\infty \lesssim \sqrt{\sigma^2\log p/n}$.

Proof of Lemma 19. We claim that a mutual incoherence parameter $\rho \le \frac{1}{16s^*}$ in (C5) is sufficient. Our proof is divided into two parts.
Part I: $\|\tilde{w}_{S^*} - w^*_{S^*}\|_\infty \le 8\|n^{-1}X^\top\xi\|_\infty$.
Denote $J := \mathrm{supp}(\tilde{w})$, and consider the following three cases: (a) $J \cap (S^*)^c \neq \emptyset$, $J^c \cap S^* \neq \emptyset$; (b) $J \cap (S^*)^c \neq \emptyset$, $J^c \cap S^* = \emptyset$; (c) $J \cap (S^*)^c = \emptyset$. We only provide a detailed proof for case (a); similar arguments apply to cases (b) and (c).
We prove case (a) by showing that $\|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty \ge 8 \|n^{-1} X^\top \xi\|_\infty$ leads to a contradiction. Since $J \cap (S^*)^c \neq \emptyset$ and $J^c \cap S^* \neq \emptyset$, i.e., $J$ includes some noise and excludes some signals, there exist two indexes $j, k \in [p]$ such that $j \in J \cap (S^*)^c$ and $k \in J^c \cap S^*$. The KKT point $(\bar{w}, \bar{\alpha}, \bar{\beta})$ necessarily satisfies the following conditions:
\[ n^{-1} X^\top X (\bar{w} - w^\star) - n^{-1} X^\top \xi - \bar{\alpha} \mathbf{1} - \bar{\beta} = 0, \quad \text{(stationarity)} \]
\[ \bar{w} \ge 0, \quad \mathbf{1}^\top \bar{w} = 1, \quad \text{(primal feasibility)} \]
\[ \bar{\beta} \ge 0, \quad \text{(dual feasibility)} \]
\[ \bar{w} \odot \bar{\beta} = 0. \quad \text{(complementary slackness)} \]
Since $j \in J \cap (S^*)^c \subseteq J$, we have $\bar{w}_j > 0$, $w^\star_j = 0$, and thus $\bar{\beta}_j = 0$ by the complementary slackness condition. Then, the $j$-th stationarity condition reads
\[ (\bar{w}_j - w^\star_j) + \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi - \bar{\alpha} = 0 \]
\[ \Longrightarrow \quad \bar{\alpha} = \bar{w}_j + \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi > \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi. \]
Owing to $k \in J^c \cap S^*$, we have $\bar{w}_k = 0$, $w^\star_k > 0$, and $\bar{\beta}_k \ge 0$ by the dual feasibility condition. Then, the $k$-th stationarity condition reads
\[ (\bar{w}_k - w^\star_k) + \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi - \bar{\alpha} - \bar{\beta}_k = 0 \]
\[ \Longrightarrow \quad \bar{\alpha} = -w^\star_k + \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi - \bar{\beta}_k \le -w^\star_k + \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi. \]
Therefore, by combining the above two inequalities, we have
\[ \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi < -w^\star_k + \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi. \]
Therefore, we can conclude that
\[ w^\star_k < \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi - \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) + n^{-1} X_j^\top \xi \le 2\rho \|\bar{w} - w^\star\|_1 + 2 \|n^{-1} X^\top \xi\|_\infty \le 4\rho \|\bar{w}_{S^*} - w^\star_{S^*}\|_1 + 2 \|n^{-1} X^\top \xi\|_\infty \le \frac{1}{4} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty + \frac{1}{4} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty = \frac{1}{2} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty, \]
where the third inequality uses $\|\bar{w} - w^\star\|_1 \le 2 \|\bar{w}_{S^*} - w^\star_{S^*}\|_1$ (a consequence of the simplex constraint), and the last uses $\rho \le \frac{1}{16 s^*}$ together with the contradiction hypothesis. This says that for any $k \in J^c \cap S^*$,
\[ |\bar{w}_k - w^\star_k| = w^\star_k < \frac{1}{2} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty < \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty. \]
So, the maximum $\|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty$ is necessarily attained at some $l \in J \cap S^* \neq \emptyset$ such that $|\bar{w}_l - w^\star_l| = \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty$. For any $m \in J$, by the stationarity and complementary slackness conditions we have that
\[ \bar{\alpha} = (\bar{w}_m - w^\star_m) + \sum_{i \neq m} n^{-1} X_m^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_m^\top \xi = (\bar{w}_l - w^\star_l) + \sum_{i \neq l} n^{-1} X_l^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_l^\top \xi. \]
If $\bar{w}_l - w^\star_l < 0$, we have $\|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty = |\bar{w}_l - w^\star_l| = -(\bar{w}_l - w^\star_l)$. Then, for any $m \in J$,
\[ \bar{w}_m - w^\star_m = (\bar{w}_l - w^\star_l) + \sum_{i \neq l} n^{-1} X_l^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_l^\top \xi - \sum_{i \neq m} n^{-1} X_m^\top X_i (\bar{w}_i - w^\star_i) + n^{-1} X_m^\top \xi \le (\bar{w}_l - w^\star_l) + 2\rho \|\bar{w} - w^\star\|_1 + 2 \|n^{-1} X^\top \xi\|_\infty \le (\bar{w}_l - w^\star_l) - \frac{1}{2} (\bar{w}_l - w^\star_l) = \frac{1}{2} (\bar{w}_l - w^\star_l). \]
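The step above that bounds $2\rho \|\bar{w} - w^\star\|_1$ by $4\rho \|\bar{w}_{S^*} - w^\star_{S^*}\|_1$ relies only on the simplex constraint; a short derivation of this fact (implicit in the proof) reads:

```latex
% Both \bar w and w^\star lie in the probabilistic simplex,
% so \mathbf{1}^\top \bar w = \mathbf{1}^\top w^\star = 1 and \bar w \ge 0.
\begin{aligned}
\|\bar w - w^\star\|_1
  &= \|(\bar w - w^\star)_{S^*}\|_1 + \sum_{i \notin S^*} \bar w_i
     && (w^\star_i = 0 \text{ for } i \notin S^*,\ \bar w \ge 0) \\
  &= \|(\bar w - w^\star)_{S^*}\|_1 + \sum_{i \in S^*} (w^\star_i - \bar w_i)
     && (\mathbf{1}^\top \bar w = \mathbf{1}^\top w^\star = 1) \\
  &\le 2\,\|(\bar w - w^\star)_{S^*}\|_1.
\end{aligned}
```

This is the self-regularization effect of the simplex: the off-support mass of $\bar{w}$ is automatically controlled by the on-support error, with no explicit penalty.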
Summing $m$ over $J$, we have
\[ \frac{|J|}{2} (\bar{w}_l - w^\star_l) \ge \sum_{m \in J} (\bar{w}_m - w^\star_m) = 1 - \sum_{m \in J} w^\star_m \ge 0, \]
which contradicts the fact that $\bar{w}_l - w^\star_l < 0$. Hence, $\bar{w}_l - w^\star_l \ge 0$, i.e., $\|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty = |\bar{w}_l - w^\star_l| = \bar{w}_l - w^\star_l$. By a similar argument, we have for any $m \in J$ that
\[ \bar{w}_m - w^\star_m \ge \frac{1}{2} (\bar{w}_l - w^\star_l). \]
Again, summing $m$ over $J$, we have
\[ \frac{|J|}{2} (\bar{w}_l - w^\star_l) \le \sum_{m \in J} (\bar{w}_m - w^\star_m) = 1 - \sum_{m \in J} w^\star_m = \sum_{m \in J^c \cap S^*} w^\star_m. \]
Meanwhile, we have proved for any $k \in J^c \cap S^*$ that
\[ w^\star_k < \frac{1}{2} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty = \frac{1}{2} (\bar{w}_l - w^\star_l). \]
Therefore, we have
\[ \frac{|J|}{2} (\bar{w}_l - w^\star_l) < \frac{|J^c \cap S^*|}{2} (\bar{w}_l - w^\star_l) \quad \Longrightarrow \quad |J| < |J^c \cap S^*| \le |S^*| = s^*. \]
That is, at least one parameter in $S^*$ is estimated as $0$; then there exists $k \in S^*$ such that $|\bar{w}_k - w^\star_k| = w^\star_k \ge b$. However, Proposition 8 implies that
\[ \sqrt{s^*} \, \|n^{-1} X^\top \xi\|_\infty \gtrsim \|\bar{w} - w^\star\|_2 \ge |\bar{w}_k - w^\star_k| = w^\star_k \ge b, \]
which contradicts condition (C4). The proof of Part I is completed.

Part II: $\|\bar{w} - w^\star\|_\infty \lesssim \|n^{-1} X^\top \xi\|_\infty$ and $|\bar{\alpha}| \lesssim \|n^{-1} X^\top \xi\|_\infty$. It remains to prove that $|\bar{w}_k - w^\star_k| \lesssim \|n^{-1} X^\top \xi\|_\infty$ for any $k \notin S^*$. Since $\|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty \le 8 \|n^{-1} X^\top \xi\|_\infty$, there exists at least one $j \in S^* \cap J$ such that
\[ (\bar{w}_j - w^\star_j) + \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi - \bar{\alpha} = 0. \]
For any $k \notin S^*$, either $\bar{w}_k = 0$ (then the error $|\bar{w}_k - w^\star_k| = 0$) or $\bar{w}_k > 0$ (then $\bar{\beta}_k = 0$), and thus
\[ (\bar{w}_k - w^\star_k) + \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_k^\top \xi - \bar{\alpha} = 0 \]
\[ \Longrightarrow \quad |\bar{w}_k - w^\star_k| = \bar{w}_k = \Big| (\bar{w}_j - w^\star_j) + \sum_{i \neq j} n^{-1} X_j^\top X_i (\bar{w}_i - w^\star_i) - n^{-1} X_j^\top \xi - \sum_{i \neq k} n^{-1} X_k^\top X_i (\bar{w}_i - w^\star_i) + n^{-1} X_k^\top \xi \Big| \le \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty + 2\rho \|\bar{w} - w^\star\|_1 + 2 \|n^{-1} X^\top \xi\|_\infty \le \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty + \frac{1}{4} \|\bar{w}_{S^*} - w^\star_{S^*}\|_\infty + 2 \|n^{-1} X^\top \xi\|_\infty, \]
where the last inequality is a direct implication of Part I. Therefore, it holds that $\|\bar{w} - w^\star\|_\infty \lesssim \|n^{-1} X^\top \xi\|_\infty$. Similarly, the optimal dual variable satisfies $|\bar{\alpha}| \lesssim \|n^{-1} X^\top \xi\|_\infty$. Finally, the probabilistic result holds since the event $\{ \|n^{-1} X^\top \xi\|_\infty \lesssim \sqrt{\sigma^2 \log p / n} \}$ holds with probability at least $1 - O(p^{-3})$ by Lemma 18.

Lemma 20 Suppose that the conditions in Theorem 15 hold. Let $w^*$ and $\bar{w}$ be the minimizers of $f(w) = (2n)^{-1} \|Xw - y\|_2^2$ constrained on $S^*$ and $S$, respectively.
With probability at least $1 - O(p^{-3})$, the following two statements hold:
(I) If $S = S^* \cup B$ for some nonempty set $B$, then we have for some constant $C > 0$ that
\[ 0 \le f(w^*) - f(\bar{w}) \le \frac{C |B| \sigma^2 \log p}{n}. \]
(II) If $S = S^* \setminus B$ for some nonempty set $B$, then we have for some constant $c > 0$ that
\[ f(\bar{w}) - f(w^*) \ge c |B| b^2. \]

Proof of Lemma 20. Part (I): $S = S^* \cup B$. Note that all arguments in the proof of Lemma 19 hold true after substituting $X$ with the lower-dimensional $X_S$. Meanwhile, the probability $1 - O(p^{-3})$ is uniform over $S$ since the upper bound for the noise level is uniform: $\|n^{-1} X_S^\top \xi\|_\infty \le \|n^{-1} X^\top \xi\|_\infty$ for all $S$. Let $(w^*, \alpha^*, \beta^*)$ and $(\bar{w}, \bar{\alpha}, \bar{\beta})$ be the KKT pairs of the constrained problems corresponding to $S^*$ and $S$, respectively. The task is to bound the difference $f_{S^*} - f_S := f(w^*) - f(\bar{w})$. Note that $\bar{w}_{S^*} > 0$ is trivial by Lemma 19 and the minimal signal condition (C4). Moreover, without loss of generality, we assume that $\bar{w}_B > 0$. In fact, otherwise, we can replace $S = S^* \cup B$ with a smaller set $S' = S^* \cup B'$, where $B' \subsetneq B$, $\bar{w}_{B'} > 0$, and $f_S = f_{S'}$, until $B' = \emptyset$. Hence, as long as $S = S^* \cup B$ for some nonempty set $B$, we have
\[ f_{S^*} - f_S = \frac{1}{2n} \|X w^* - y\|_2^2 - \frac{1}{2n} \|X \bar{w} - y\|_2^2 = \frac{1}{2n} (X w^* - X \bar{w})^\top (X w^* - y + X \bar{w} - y) = \frac{1}{2n} (w^*_S - \bar{w}_S)^\top X_S^\top \big[ X(w^* - w^\star) - \xi + X(\bar{w} - w^\star) - \xi \big] = \frac{1}{2n} (w^*_{S^*} - \bar{w}_{S^*})^\top X_{S^*}^\top \big[ X(w^* - w^\star) - \xi + X(\bar{w} - w^\star) - \xi \big] + \frac{1}{2n} (w^*_B - \bar{w}_B)^\top X_B^\top \big[ X(w^* - w^\star) - \xi + X(\bar{w} - w^\star) - \xi \big]. \]
In the following, we will bound these two terms respectively. The complementary slackness implies that $\beta^*_{S^*} = 0$ and $\bar{\beta}_S = 0$. Therefore, by the stationarity condition, we have
\[ n^{-1} X_{S^*}^\top X (\bar{w} - w^\star) - n^{-1} X_{S^*}^\top \xi = \bar{\alpha} \mathbf{1}, \qquad n^{-1} X_B^\top X (\bar{w} - w^\star) - n^{-1} X_B^\top \xi = \bar{\alpha} \mathbf{1}, \qquad n^{-1} X_{S^*}^\top X (w^* - w^\star) - n^{-1} X_{S^*}^\top \xi = \alpha^* \mathbf{1}. \]
As claimed in the proof of Lemma 19 (applied to each subproblem), we have $|\alpha^*| \lesssim \|n^{-1} X^\top \xi\|_\infty$, $|\bar{\alpha}| \lesssim \|n^{-1} X^\top \xi\|_\infty$, and $\|w^* - w^\star\|_\infty \vee \|\bar{w} - w^\star\|_\infty \lesssim \|n^{-1} X^\top \xi\|_\infty$. Hence, on one hand, we can bound the first term as follows:
\[ \frac{1}{2n} (w^*_{S^*} - \bar{w}_{S^*})^\top X_{S^*}^\top \big[ X(w^* - w^\star) - \xi + X(\bar{w} - w^\star) - \xi \big] = \frac{\alpha^*}{2} \mathbf{1}^\top (w^*_{S^*} - \bar{w}_{S^*}) + \frac{\bar{\alpha}}{2} \mathbf{1}^\top (w^*_{S^*} - \bar{w}_{S^*}) \le |\alpha^*| \|\bar{w}_B\|_1 + |\bar{\alpha}| \|\bar{w}_B\|_1 \lesssim \frac{|B| \sigma^2 \log p}{n}, \]
where we use the identity $\mathbf{1}^\top (w^*_{S^*} - \bar{w}_{S^*}) = \mathbf{1}^\top \bar{w}_B = \|\bar{w}_B\|_1$ (both $w^*$ and $\bar{w}$ sum to one) and the bound $\|\bar{w}_B\|_1 \le |B| \|\bar{w}_B - w^\star_B\|_\infty \lesssim |B| \sqrt{\sigma^2 \log p / n}$. On the other hand, the second term is bounded as follows:
\[ \frac{1}{2n} (w^*_B - \bar{w}_B)^\top X_B^\top \big[ X(w^* - w^\star) - \xi + X(\bar{w} - w^\star) - \xi \big] = \frac{1}{2n} (w^*_B - \bar{w}_B)^\top X_B^\top X_{S^*} (w^*_{S^*} - w^\star_{S^*}) - \frac{1}{2n} (w^*_B - \bar{w}_B)^\top X_B^\top \xi + \frac{\bar{\alpha}}{2} \mathbf{1}^\top (w^*_B - \bar{w}_B) \le \|w^*_B - \bar{w}_B\|_1 \|n^{-1} X_B^\top X_{S^*} (w^*_{S^*} - w^\star_{S^*})\|_\infty + \|w^*_B - \bar{w}_B\|_1 \|n^{-1} X_B^\top \xi\|_\infty + |\bar{\alpha}| \|w^*_B - \bar{w}_B\|_1 \lesssim \frac{|B| \sigma^2 \log p}{n}, \]
where we use the facts that
\[ \|w^*_B - \bar{w}_B\|_1 \le |B| \big[ \|w^*_B - w^\star_B\|_\infty + \|\bar{w}_B - w^\star_B\|_\infty \big] \lesssim |B| \sqrt{\frac{\sigma^2 \log p}{n}} \]
and
\[ \|n^{-1} X_B^\top X_{S^*} (w^*_{S^*} - w^\star_{S^*})\|_\infty = \max_{j \in B} \big| n^{-1} X_j^\top X_{S^*} (w^*_{S^*} - w^\star_{S^*}) \big| \le \max_{j \in B} \|n^{-1} X_j^\top X_{S^*}\|_\infty \|w^*_{S^*} - w^\star_{S^*}\|_1 \le \rho s^* \|w^*_{S^*} - w^\star_{S^*}\|_\infty \lesssim \sqrt{\frac{\sigma^2 \log p}{n}}. \]
Hence, we prove the first part that
\[ f_{S^*} - f_S \lesssim \frac{|B| \sigma^2 \log p}{n}. \]

Part (II): $S = S^* \setminus B$. We lower bound $f(\bar{w}) - f(w^*) := f_S - f_{S^*}$ as follows:
\[ f_S - f_{S^*} = \frac{1}{2n} \|X \bar{w} - y\|_2^2 - \frac{1}{2n} \|X w^* - y\|_2^2 = \frac{1}{2n} \|X_{S^*} (\bar{w}_{S^*} - w^\star_{S^*})\|_2^2 - n^{-1} \xi^\top X_{S^*} (\bar{w}_{S^*} - w^\star_{S^*}) - \frac{1}{2n} \|X_{S^*} (w^*_{S^*} - w^\star_{S^*})\|_2^2 + n^{-1} \xi^\top X_{S^*} (w^*_{S^*} - w^\star_{S^*}) \ge c \|\bar{w}_{S^*} - w^\star_{S^*}\|_2^2 - \|n^{-1} X_{S^*}^\top \xi\|_2 \|\bar{w}_{S^*} - w^\star_{S^*}\|_2 - C \|w^*_{S^*} - w^\star_{S^*}\|_2^2 - \|n^{-1} X_{S^*}^\top \xi\|_2 \|w^*_{S^*} - w^\star_{S^*}\|_2. \]
In the first inequality, we use the implication of condition (C5) that $n^{-1} X_{S^*}^\top X_{S^*}$ has bounded eigenvalues. Moreover, we use the facts that $\|w^*_{S^*} - w^\star_{S^*}\|_2 \lesssim \sqrt{s^* \sigma^2 \log p / n}$, $\|n^{-1} X_{S^*}^\top \xi\|_2 \le \sqrt{s^*} \|n^{-1} X^\top \xi\|_\infty \lesssim \sqrt{s^* \sigma^2 \log p / n}$, and $\|\bar{w}_{S^*} - w^\star_{S^*}\|_2 \ge \|w^\star_B\|_2 \ge \sqrt{|B|} \, b$, since $\bar{w}_B = 0$. Combining these bounds with the minimal signal condition (C4), we obtain $f_S - f_{S^*} \ge c |B| b^2$ for some constant $c > 0$, which completes the proof.

References

Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. Advances in Neural Information Processing Systems, 23, 2010.
Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199-213. Springer, 1998.
Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Konstantinos Benidis, Yiyong Feng, and Daniel P. Palomar. Sparse portfolios for high-dimensional financial index tracking. IEEE Transactions on Signal Processing, 66(1):155-170, 2017.
Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, pages 1705-1732, 2009.
Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.
Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, 2011.
Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575-585, 2016.
Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
Jin-Hong Du, Yifeng Guo, and Xueqin Wang. High-dimensional portfolio selection with cardinality constraints. Journal of the American Statistical Association, pages 1-13, 2022.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279. PMLR, 2008.
Jianqing Fan, Yongyi Guo, and Kaizheng Wang. Communication-efficient accurate statistical estimation. Journal of the American Statistical Association, 118(542):1000-1010, 2023.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427-435. PMLR, 2013.
Nirmal Keshava. A survey of spectral unmixing algorithms. Lincoln Laboratory Journal, 14(1):55-78, 2003.
Anastasios Kyrillidis, Stephen Becker, Volkan Cevher, and Christoph Koch. Sparse projections onto the simplex. In Proceedings of the 30th International Conference on Machine Learning, pages 235-243. PMLR, 2013.
Ping Li, Syama Sundar Rangapuram, and Martin Slawski. Methods for sparse and low-rank recovery under simplex constraints.
Statistica Sinica, 30(2):557-577, 2020.
Qiuwei Li, Daniel McKenzie, and Wotao Yin. From the simplex to the sphere: Faster constrained optimization using the Hadamard parametrization. Information and Inference: A Journal of the IMA, 12(3):1898-1937, 2023.
Steffen Limmer and Sławomir Stańczak. A neural architecture for Bayesian compressive sensing over the simplex via Laplace techniques. IEEE Transactions on Signal Processing, 66(22):6002-6015, 2018.
Lukas Meier, Sara van de Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(1):53-71, 2008.
Nicolai Meinshausen. Sign-constrained least squares estimation for high-dimensional regression. Electronic Journal of Statistics, 7:1607-1631, 2013.
Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.
Guillaume Perez, Michel Barlaud, Lionel Fillatre, and Jean-Charles Régin. A filtered bucket-clustering method for projection onto the simplex and the ℓ1 ball. Mathematical Programming, 182(1-2):445-464, 2020.
Mert Pilanci, Laurent El Ghaoui, and Venkat Chandrasekaran. Recovery of sparse probability measures via convex programming. Advances in Neural Information Processing Systems, 25, 2012.
Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, pages 461-464, 1978.
Martin Slawski and Matthias Hein. Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electronic Journal of Statistics, 7:3004-3056, 2013.
B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637-672, 2020.
Borui Tang, Jin Zhu, Junxian Zhu, Xueqin Wang, and Heping Zhang. A consistent and scalable algorithm for best subset selection in single index models.
arXiv preprint arXiv:2309.06230, 2023.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267-288, 1996.
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(1):91-108, 2005.
Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Lan Wang, Yongdai Kim, and Runze Li. Calibrating non-convex penalized regression in ultra-high dimension. Annals of Statistics, 41(5):2505, 2013.
Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
Zezhi Wang, Junxian Zhu, Xueqin Wang, Jin Zhu, Huiyang Pen, Peng Chen, Anran Wang, and Xiaoke Zhang. skscope: Fast sparsity-constrained optimization in Python. Journal of Machine Learning Research, 25(290):1-9, 2024.
John Wright and Yi Ma. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge University Press, 2022.
Guiyun Xiao and Zheng-Jian Bai. A geometric proximal gradient method for sparse least squares regression with probabilistic simplex constraint. Journal of Scientific Computing, 92(1):22, 2022.
Yanhang Zhang, Junxian Zhu, Jin Zhu, and Xueqin Wang. A splicing approach to best subset of groups selection. INFORMS Journal on Computing, 35(1):104-119, 2023.
Yu Zheng, Timothy M. Hospedales, and Yongxin Yang. Diversity and sparsity: A new perspective on index tracking.
In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1768-1772. IEEE, 2020.
Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. abess: A fast best-subset selection library in Python and R. Journal of Machine Learning Research, 23(202):1-7, 2022.
Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, and Xueqin Wang. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117-33123, 2020.