# Generalized Linear Model Regression under Distance-to-set Penalties

Jason Xu, University of California, Los Angeles (jqxu@ucla.edu) · Eric C. Chi, North Carolina State University (eric_chi@ncsu.edu) · Kenneth Lange, University of California, Los Angeles (klange@ucla.edu)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Estimation in generalized linear models (GLM) is complicated by the presence of constraints. One can handle constraints by maximizing a penalized log-likelihood. Penalties such as the lasso are effective in high dimensions, but often lead to unwanted shrinkage. This paper explores instead penalizing the squared distance to constraint sets. Distance penalties are more flexible than algebraic and regularization penalties, and avoid the drawback of shrinkage. To optimize distance penalized objectives, we make use of the majorization-minimization principle. Resulting algorithms constructed within this framework are amenable to acceleration and come with global convergence guarantees. Applications to shape constraints, sparse regression, and rank-restricted matrix regression on synthetic and real data showcase strong empirical performance, even under non-convex constraints.

## 1 Introduction and Background

In classical linear regression, the response variable y follows a Gaussian distribution whose mean $x^t\beta$ depends linearly on a parameter vector β through a vector of predictors x. Generalized linear models (GLMs) extend classical linear regression by allowing y to follow any exponential family distribution, and the conditional mean of y to be a nonlinear function $h(x^t\beta)$ of $x^t\beta$ [24]. This encompasses a broad class of important models in statistics and machine learning. For instance, count data and binary classification come within the purview of generalized linear regression.

In many settings, it is desirable to impose constraints on the regression coefficients. Sparse regression is a prominent example. In high-dimensional problems where the number of predictors n exceeds the number of cases m, inference is possible provided the regression function lies in a low-dimensional manifold [11]. In this case, the coefficient vector β is sparse, and just a few predictors explain the response y. The goals of sparse regression are to correctly identify the relevant predictors and to estimate their effect sizes. One approach, best subset regression, is known to be NP hard. Penalizing the likelihood by including an ℓ0 penalty $\|\beta\|_0$ (the number of nonzero coefficients) is a possibility, but the resulting objective function is nonconvex and discontinuous. The convex relaxation of ℓ0 regression replaces $\|\beta\|_0$ by the ℓ1 norm $\|\beta\|_1$. This LASSO proxy for $\|\beta\|_0$ restores convexity and continuity [31]. While LASSO regression has been a great success, it has the downside of simultaneously inducing both sparsity and parameter shrinkage. Unfortunately, shrinkage often has the undesirable side effect of including spurious predictors (false positives) with the true predictors.

Motivated by sparse regression, we now consider the alternative of penalizing the log-likelihood by the squared distance from the parameter vector β to the constraint set. If there are several constraints, then we add a distance penalty for each constraint set. Our approach is closely related to the proximal distance algorithm [19, 20] and proximity function approaches to convex feasibility problems [5].
Neither of these prior algorithm classes explicitly considers generalized linear models. Beyond sparse regression, distance penalization applies to a wide class of statistically relevant constraint sets, including isotonic constraints and matrix rank constraints. To maximize distance penalized log-likelihoods, we advocate the majorization-minimization (MM) principle [2, 18, 19]. MM algorithms are increasingly popular in solving the large-scale optimization problems arising in statistics and machine learning [22]. Although distance penalization preserves convexity when it already exists, neither the objective function nor the constraint sets need be convex to carry out estimation. The capacity to project onto each constraint set is all that is necessary. Fortunately, many projection operators are known. Even in the absence of convexity, we are able to prove that our algorithm converges to a stationary point. In the presence of convexity, the stationary points are global minima.

In subsequent sections, we begin by briefly reviewing GLM regression and shrinkage penalties. We then present our distance penalty method and a sample of statistically relevant problems that it can address. Next we lay out in detail our distance penalized GLM algorithm, discuss how it can be accelerated, summarize our convergence results, and compare its performance to that of competing methods on real and simulated data. We close with a summary and a discussion of future directions.

GLMs and Exponential Families: In linear regression, the vector of responses y is normally distributed with mean vector E(y) = Xβ and covariance matrix V(y) = σ²I. A GLM preserves the independence of the responses $y_i$ but assumes that they are generated from a shared exponential family distribution. The response $y_i$ is postulated to have mean $\mu_i(\beta) = E[y_i \mid \beta] = h(x_i^t\beta)$, where $x_i$ is the ith row of a design matrix X, and the inverse link function h(s) is smooth and strictly increasing [24]. The functional inverse $h^{-1}(s)$ of h(s) is called the link function. The likelihood of any exponential family can be written in the canonical form

$$p(y_i \mid \theta_i, \tau) = c_1(y_i, \tau)\,\exp\left\{\frac{y_i\theta_i - \psi(\theta_i)}{c_2(\tau)}\right\}. \qquad (1)$$

Here τ is a fixed scale parameter, and the positive functions $c_1$ and $c_2$ are constant with respect to the natural parameter $\theta_i$. The function ψ is smooth and convex; a brief calculation shows that $\mu_i = \psi'(\theta_i)$. The canonical link function $h^{-1}(s)$ is defined by the condition $h^{-1}(\mu_i) = x_i^t\beta = \theta_i$. In this case, $h(\theta_i) = \psi'(\theta_i)$, and the log-likelihood $\ln p(y \mid \beta, X, \tau)$ is concave in β. Because $c_1$ and $c_2$ are not functions of θ, we may drop these terms and work with the log-likelihood up to proportionality. We denote this by $L(\beta \mid y, X) \propto \ln p(y \mid \beta, X, \tau)$. The gradient and second differential of $L(\beta \mid y, X)$ amount to

$$\nabla L = \sum_{i=1}^m \big[y_i - \psi'(x_i^t\beta)\big]\,x_i \quad\text{and}\quad d^2L = -\sum_{i=1}^m \psi''(x_i^t\beta)\,x_i x_i^t. \qquad (2)$$

As an example, when $\psi(\theta) = \theta^2/2$ and $c_2(\tau) = \tau^2$, the density (1) is the Gaussian likelihood, and GLM regression under the identity link coincides with standard linear regression. Choosing $\psi(\theta) = \ln[1 + \exp(\theta)]$ and $c_2(\tau) = 1$ corresponds to logistic regression under the canonical link $h^{-1}(s) = \ln\frac{s}{1-s}$ with inverse link $h(s) = \frac{e^s}{1+e^s}$. GLMs unify a range of regression settings, including Poisson, logistic, gamma, and multinomial regression.

Shrinkage penalties: The least absolute shrinkage and selection operator (LASSO) [12, 31] solves

$$\hat{\beta} = \arg\min_\beta \Big[\lambda \|\beta\|_1 - \frac{1}{m}\sum_{j=1}^m L(\beta \mid y_j, x_j)\Big], \qquad (3)$$

where λ > 0 is a tuning constant that controls the strength of the ℓ1 penalty.
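As a concrete illustration of equation (2), here is a minimal NumPy sketch (our own, not the authors' code; the function name is hypothetical) that evaluates the score and information for logistic regression under the canonical link. These quantities reappear in the algorithm of Section 3.

```python
import numpy as np

def logistic_score_information(beta, X, y):
    """Score and information of the logistic log-likelihood under the
    canonical link, as in equation (2):
        grad = sum_i [y_i - psi'(x_i' beta)] x_i,
        info = sum_i psi''(x_i' beta) x_i x_i'  (= -d^2 L),
    with psi(t) = log(1 + exp(t)), psi'(t) = sigmoid(t), and
    psi''(t) = sigmoid(t) * (1 - sigmoid(t))."""
    eta = X @ beta                       # linear predictors x_i' beta
    mu = 1.0 / (1.0 + np.exp(-eta))      # fitted means psi'(eta)
    w = mu * (1.0 - mu)                  # psi''(eta)
    grad = X.T @ (y - mu)                # score vector of length n
    info = (X * w[:, None]).T @ X        # n-by-n information matrix
    return grad, info

# toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(100, 20))
beta_true = np.zeros(20); beta_true[:3] = 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
g, H = logistic_score_information(np.zeros(20), X, y)
```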
The ℓ1 relaxation is a popular approach to promote a sparse solution, but there is no obvious map between λ and the sparsity level k. In practice, a suitable value of λ is found by cross-validation. Relying on global shrinkage towards zero, LASSO notoriously leads to biased estimates. This bias can be ameliorated by re-estimating under the model containing only the selected variables, known as the relaxed LASSO [25], but the success of this two-stage procedure relies on correct support recovery in the first step. In many cases, LASSO shrinkage is known to introduce false positives [30], resulting in spurious covariates that cannot be corrected. To combat these shortcomings, one may replace the LASSO penalty by a non-convex penalty with milder effects on large coefficients. The smoothly clipped absolute deviation (SCAD) penalty [10] and minimax concave penalty (MCP) [34] are even functions defined through their derivatives

$$q'_\gamma(\beta_i, \lambda) = \lambda\Big[1_{\{|\beta_i| \le \lambda\}} + \frac{(\gamma\lambda - |\beta_i|)_+}{(\gamma - 1)\lambda}\, 1_{\{|\beta_i| > \lambda\}}\Big] \quad\text{and}\quad q'_\gamma(\beta_i, \lambda) = \lambda\Big(1 - \frac{|\beta_i|}{\gamma\lambda}\Big)_+$$

for $\beta_i > 0$. Both penalties reduce bias, interpolate between hard thresholding and LASSO shrinkage, and significantly outperform the LASSO in some settings, especially in problems with extreme sparsity. SCAD, MCP, as well as the relaxed lasso come with the disadvantage of requiring an extra tuning parameter γ > 0 to be selected.

## 2 Regression with distance-to-constraint set penalties

As an alternative to shrinkage, we consider penalizing the distance between the parameter vector β and constraints defined by sets $C_i$. Penalized estimation seeks the solution

$$\hat{\beta} = \arg\min_\beta \Big[\frac{1}{2}\sum_i v_i\,\mathrm{dist}(\beta, C_i)^2 - \frac{1}{m}\sum_{j=1}^m L(\beta \mid y_j, x_j)\Big] := \arg\min_\beta f(\beta), \qquad (4)$$

where the $v_i$ are weights on the distance penalties to the constraint sets $C_i$. The Euclidean distance can also be written as $\mathrm{dist}(\beta, C_i) = \|\beta - P_{C_i}(\beta)\|_2$, where $P_{C_i}(\beta)$ denotes the projection of β onto $C_i$. The projection operator is uniquely defined when $C_i$ is closed and convex. If $C_i$ is merely closed, then $P_{C_i}(\beta)$ may be multi-valued for a few unusual external points β. Notice the distance penalty $\mathrm{dist}(\beta, C_i)^2$ is 0 precisely when $\beta \in C_i$. The solution (4) represents a tradeoff between maximizing the log-likelihood and satisfying the constraints. When each $C_i$ is convex, the objective function is convex as a whole. Sending all of the penalty constants $v_i$ to ∞ produces in the limit the constrained maximum likelihood estimate. This is the philosophy behind the proximal distance algorithm [19, 20]. In practice, it often suffices to find the solution of (4) for fixed, large $v_i$.

The reader may wonder why we employ squared distances rather than distances. The advantage is that squaring renders the penalties differentiable. Indeed, $\nabla \tfrac{1}{2}\mathrm{dist}(x, C_i)^2 = x - P_{C_i}(x)$ whenever $P_{C_i}(x)$ is single valued, which is almost always the case. In contrast, $\mathrm{dist}(x, C_i)$ is typically nondifferentiable at boundary points of $C_i$ even when $C_i$ is convex. The following examples motivate distance penalization by considering constraint sets and their projections for several important models.

Sparse regression: Sparsity can be imposed directly through the constraint set $C_k = \{z \in \mathbb{R}^n : \|z\|_0 \le k\}$. Projecting a point β onto $C_k$ is trivially accomplished by setting all but the k largest entries of β in magnitude equal to 0, the same operation behind iterative hard thresholding algorithms. Instead of solving the ℓ1-relaxation (3), our algorithm approximately solves the original ℓ0-constrained problem by repeatedly projecting onto the sparsity set $C_k$.
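To make this projection concrete, the following short NumPy sketch (ours; the helper names are hypothetical) projects onto $C_k$ by hard thresholding and evaluates the corresponding weighted squared-distance penalty.

```python
import numpy as np

def project_sparsity(beta, k):
    """Project beta onto C_k = {z : ||z||_0 <= k} by keeping the k
    largest-magnitude entries and zeroing out the rest."""
    proj = np.zeros_like(beta)
    if k > 0:
        keep = np.argpartition(np.abs(beta), -k)[-k:]  # indices of k largest |beta_i|
        proj[keep] = beta[keep]
    return proj

def sq_distance_penalty(beta, projection, weight=1.0):
    """Weighted squared Euclidean distance (1/2) * v * ||beta - P_C(beta)||^2."""
    return 0.5 * weight * np.sum((beta - projection) ** 2)

beta = np.array([0.1, -3.0, 0.5, 2.2, -0.05])
p = project_sparsity(beta, k=2)          # keeps -3.0 and 2.2
penalty = sq_distance_penalty(beta, p)   # distance-to-C_2 penalty term
```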
Unlike LASSO regression, this strategy enables one to directly incorporate prior knowledge of the sparsity level k in an interpretable manner. When no such information is available, k can be selected by cross validation just as the LASSO tuning constant λ is selected. Distance penalization escapes the NP hard dilemma of best subset regression at the cost of possible convergence to a local minimum.

Shape and order constraints: As an example of shape and order restrictions, consider isotonic regression [1]. For data $y \in \mathbb{R}^n$, isotonic regression seeks to minimize $\frac{1}{2}\|y - \beta\|_2^2$ subject to the condition that the $\beta_i$ are non-decreasing. In this case, the relevant constraint set is the isotone convex cone $C = \{\beta : \beta_1 \le \beta_2 \le \cdots \le \beta_n\}$. Projection onto C is straightforward and efficiently accomplished using the pooled adjacent violators algorithm [1, 8]. More complicated order constraints can be imposed analogously: for instance, $\beta_i \le \beta_j$ might be required for all edges i → j in a directed graph model. Notably, isotonic linear regression applies to changepoint problems [32]; our approach allows isotonic constraints in GLM estimation. One noteworthy application is Poisson regression where the intensity parameter is assumed to be nondecreasing with time.

Rank restriction: Consider GLM regression where the predictors $X_i$ and regression coefficients B are matrix-valued. To impose structure in high-dimensional settings, rank restriction serves as an appropriate matrix counterpart to sparsity for vector parameters. Prior work suggests that imposing matrix sparsity is much less effective than restricting the rank of B in achieving model parsimony [37]. The matrix analog of the LASSO penalty is the nuclear norm penalty. The nuclear norm of a matrix B is defined as the sum of its singular values, $\|B\|_* = \sum_j \sigma_j(B) = \mathrm{trace}(\sqrt{B^t B})$. Notice $\|B\|_*$ is a convex relaxation of rank(B). Including a nuclear norm penalty entails shrinkage and induces low-rankness by proxy. Distance penalization of rank involves projecting onto the set $C_r = \{Z \in \mathbb{R}^{n \times n} : \mathrm{rank}(Z) \le r\}$ for a given rank r. Despite sacrificing convexity, distance penalization of rank is, in our view, both more natural and more effective than nuclear norm penalization. Avoiding shrinkage works to the advantage of distance penalization, as we will see empirically in Section 4. According to the Eckart-Young theorem, the projection of a matrix B onto $C_r$ is achieved by extracting the singular value decomposition of B and truncating all but the top r singular values. Truncating the singular value decomposition is a standard numerical task best computed by Krylov subspace methods [14].

Simple box constraints, hyperplanes, and balls: Many relevant set constraints reduce to closed convex sets with trivial projections. For instance, enforcing non-negative parameter values is accomplished by projecting onto the non-negative orthant. This is an example of a box constraint. Specifying linear equality and inequality constraints entails projecting onto a hyperplane or half-space, respectively. A Tikhonov or ridge penalty constraint $\|\beta\|_2 \le r$ requires spherical projection. Finally, we stress that it is straightforward to consider combinations of the aforementioned constraints. Multiple norm penalties are already in common use. To encourage selection of correlated variables [38], the elastic net includes both ℓ1 and ℓ2 regularization terms.
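Two of the projections invoked above have standard implementations: the pooled adjacent violators algorithm for the isotone cone and truncated SVD for the rank-r set. The sketch below (ours, with hypothetical function names) illustrates both in NumPy.

```python
import numpy as np

def project_isotone(y):
    """Project y onto the isotone cone {b : b_1 <= b_2 <= ... <= b_n}
    via the pooled adjacent violators algorithm (PAVA)."""
    vals, counts = [], []                 # block means and block sizes
    for v in np.asarray(y, dtype=float):
        vals.append(v)
        counts.append(1)
        # pool adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            total = counts[-2] + counts[-1]
            merged = (counts[-2] * vals[-2] + counts[-1] * vals[-1]) / total
            vals[-2:] = [merged]
            counts[-2:] = [total]
    return np.repeat(vals, counts)

def project_rank(B, r):
    """Project B onto {Z : rank(Z) <= r} by truncating its SVD
    (Eckart-Young); for large matrices a Krylov-based truncated SVD
    would be preferable, as noted in the text."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

print(project_isotone([3.0, 1.0, 2.0, 5.0]))   # -> [2. 2. 2. 5.]
```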
Further examples include matrix fitting subject to both sparse and low-rank matrix constraints [29] and LASSO regression subject to linear equality and inequality constraints [13]. In our setting the relative importance of different constraints can be controlled via the weights $v_i$.

## 3 Majorization-minimization

Figure 1: Illustrative example of two MM iterates with surrogates $g(x \mid x_k)$ majorizing $f(x) = \cos(x)$.

To solve the minimization problem (4), we exploit the principle of majorization-minimization. An MM algorithm successively minimizes a sequence of surrogate functions $g(\beta \mid \beta_k)$ majorizing the objective function f(β) around the current iterate $\beta_k$. See Figure 1. Forcing $g(\beta \mid \beta_k)$ downhill automatically drives f(β) downhill as well [19, 22]. Every expectation-maximization (EM) algorithm [9] for maximum likelihood estimation is an MM algorithm. Majorization requires two conditions: tangency at the current iterate, $g(\beta_k \mid \beta_k) = f(\beta_k)$, and domination, $g(\beta \mid \beta_k) \ge f(\beta)$ for all β. The iterates of the MM algorithm are defined by

$$\beta_{k+1} := \arg\min_\beta\, g(\beta \mid \beta_k),$$

although all that is absolutely necessary is that $g(\beta_{k+1} \mid \beta_k) < g(\beta_k \mid \beta_k)$. Whenever this holds, the descent property

$$f(\beta_{k+1}) \le g(\beta_{k+1} \mid \beta_k) \le g(\beta_k \mid \beta_k) = f(\beta_k)$$

follows. This simple principle is widely applicable and converts many hard optimization problems (non-convex or non-smooth) into a sequence of simpler problems.

To majorize the objective (4), it suffices to majorize each distance penalty $\mathrm{dist}(\beta, C_i)^2$. The majorization $\mathrm{dist}(\beta, C_i)^2 \le \|\beta - P_{C_i}(\beta_k)\|_2^2$ is an immediate consequence of the definitions of the set distance $\mathrm{dist}(\beta, C_i)$ and the projection operator $P_{C_i}(\beta)$ [8]. The surrogate function

$$g(\beta \mid \beta_k) = \frac{1}{2}\sum_i v_i \|\beta - P_{C_i}(\beta_k)\|_2^2 - \frac{1}{m}\sum_{j=1}^m L(\beta \mid y_j, x_j)$$

has gradient

$$\nabla g(\beta \mid \beta_k) = \sum_i v_i\big[\beta - P_{C_i}(\beta_k)\big] - \frac{1}{m}\sum_{j=1}^m \nabla L(\beta \mid y_j, x_j)$$

and second differential

$$d^2 g(\beta \mid \beta_k) = \sum_i v_i I_n - \frac{1}{m}\sum_{j=1}^m d^2 L(\beta \mid y_j, x_j) := H_k. \qquad (5)$$

The score $\nabla L(\beta \mid y_j, x_j)$ and second differential $d^2L(\beta \mid y_j, x_j)$ appear in equation (2). Note that for GLMs under the canonical link, the observed and expected information matrices coincide, and their common value is thus positive semidefinite. Adding a multiple of the identity $I_n$ to the information matrix is analogous to the Levenberg-Marquardt maneuver against ill-conditioning in ordinary regression [26]. Our algorithm therefore naturally benefits from this safeguard.

Since solving the stationarity equation $\nabla g(\beta \mid \beta_k) = 0$ is not analytically feasible in general, we employ one step of Newton's method in the form

$$\beta_{k+1} = \beta_k - \eta_k\, d^2g(\beta_k \mid \beta_k)^{-1} \nabla f(\beta_k),$$

where $\eta_k \in (0, 1]$ is a stepsize multiplier chosen via backtracking. Note here our application of the gradient identity $\nabla f(\beta_k) = \nabla g(\beta_k \mid \beta_k)$, valid for all smooth surrogate functions. Because the Newton increment is a descent direction, some value of $\eta_k$ is bound to produce a decrease in the surrogate and therefore in the objective. The following theorem, proved in the Supplement, establishes global convergence of our algorithm under simple Armijo backtracking for choosing $\eta_k$:

Theorem 3.1. Consider the algorithm map $M(\beta) = \beta - \eta_\beta H(\beta)^{-1} \nabla f(\beta)$, where the step size $\eta_\beta$ has been selected by Armijo backtracking. Assume that f(β) is coercive in the sense $\lim_{\|\beta\| \to \infty} f(\beta) = +\infty$. Then the limit points of the sequence $\beta_{k+1} = M(\beta_k)$ are stationary points of f(β). Moreover, the set of limit points is compact and connected.

We remark that stationary points are necessarily global minimizers when f(β) is convex. Furthermore, coercivity of f(β) is a very mild assumption, and is satisfied whenever either the distance penalty or the negative log-likelihood is coercive.
For instance, the negative log-likelihoods of the Poisson and Gaussian distributions are coercive functions. While this is not the case for the Bernoulli distribution, adding a small ℓ2 penalty $\omega\|\beta\|_2^2$ restores coerciveness. Including such a penalty in logistic regression is a common remedy for the well-known problem of numerical instability in parameter estimates caused by a poorly conditioned design matrix X [27]. Since L(β) is concave in β, the compactness of one or more of the constraint sets $C_i$ is another sufficient condition for coerciveness.

Generalization to Bregman divergences: Although we have focused on penalizing GLM likelihoods with Euclidean distance penalties, this approach holds more generally for objectives containing non-Euclidean measures of distance. As reviewed in the Supplement, the Bregman divergence

$$D_\phi(v, u) = \phi(v) - \phi(u) - d\phi(u)(v - u)$$

generated by a convex function φ(v) provides a general notion of directed distance [4]. The Bregman divergence associated with the choice $\phi(v) = \frac{1}{2}\|v\|_2^2$, for instance, is the squared Euclidean distance. One can rewrite the GLM penalized likelihood as a sum of multiple Bregman divergences

$$\sum_i v_i\, D_\phi\Big[P^\phi_{C_i}(\beta),\, \beta\Big] + \sum_{j=1}^m w_j\, D_\zeta\Big[y_j,\, \tilde{h}_j(\beta)\Big]. \qquad (6)$$

The first sum in equation (6) represents the distance penalty to the constraint sets $C_i$. The projection $P^\phi_{C_i}(\beta)$ denotes the closest point to β in $C_i$ measured under $D_\phi$. The second sum generalizes the GLM log-likelihood term, where $\tilde{h}_j(\beta) = h^{-1}(x_j^t\beta)$. Every exponential family likelihood uniquely corresponds to a Bregman divergence $D_\zeta$ generated by the conjugate of its cumulant function, $\zeta = \psi^*$ [28]. Hence, up to an additive constant, $-L(\beta \mid y, X)$ is proportional to $\frac{1}{m}\sum_{j=1}^m D_\zeta\big[y_j, \tilde{h}_j(\beta)\big]$. The functional form (6) immediately broadens the class of objectives to include quasi-likelihoods and distances to constraint sets measured under a broad range of divergences. Objective functions of this form are closely related to proximity function minimization in the convex feasibility literature [5, 6, 7, 33]. The MM principle makes possible the extension of the projection algorithms of [7] to minimize this general objective.

Our MM algorithm for distance penalized GLM regression is summarized in Algorithm 1. Although for the sake of clarity the algorithm is written for vector-valued arguments, it holds more generally for matrix-variate regression. In this setting the regression coefficients B and predictors $X_i$ are matrix valued, and response $y_i$ has mean $h[\mathrm{trace}(X_i^t B)] = h[\mathrm{vec}(X_i)^t \mathrm{vec}(B)]$. Here the vec operator stacks the columns of its matrix argument. Thus, the algorithm immediately applies if we replace B by vec(B) and $X_1, \ldots, X_m$ by $X = [\mathrm{vec}(X_1), \ldots, \mathrm{vec}(X_m)]^t$. Projections requiring the matrix structure are performed by reshaping vec(B) into matrix form. In contrast to shrinkage approaches, these maneuvers obviate the need for new algorithms in matrix regression [37].

Algorithm 1: MM algorithm to solve the distance-penalized objective (4)
1: Initialize k = 0, starting point $\beta_0$, initial step size $\alpha \in (0, 1)$, and halving parameter $\sigma \in (0, 1)$
2: repeat
3: $\quad \nabla f_k \leftarrow \sum_i v_i[\beta_k - P_{C_i}(\beta_k)] - \frac{1}{m}\sum_{j=1}^m \nabla L(\beta_k \mid y_j, x_j)$
4: $\quad H_k \leftarrow \sum_i v_i I_n - \frac{1}{m}\sum_{j=1}^m d^2L(\beta_k \mid y_j, x_j)$
5: $\quad v \leftarrow -H_k^{-1} \nabla f_k$
6: $\quad \eta \leftarrow 1$
7: $\quad$ while $f(\beta_k + \eta v) > f(\beta_k) + \alpha\eta\, \nabla f_k^t v$ do
8: $\qquad \eta \leftarrow \sigma\eta$
9: $\quad$ end while
10: $\beta_{k+1} \leftarrow \beta_k + \eta v$
11: $k \leftarrow k + 1$
12: until convergence
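As a companion to Algorithm 1, the following Python sketch (our own illustration, not the authors' released implementation; all names are hypothetical) instantiates the method for logistic regression with a single sparsity constraint.

```python
import numpy as np

def mm_distance_glm(X, y, k, v=100.0, alpha=0.1, sigma=0.5,
                    max_iter=200, tol=1e-8):
    """Algorithm 1 sketch: distance-penalized logistic regression with one
    sparsity constraint C_k = {b : ||b||_0 <= k}, minimizing
    f(b) = (v/2) dist(b, C_k)^2 - (1/m) sum_j L(b | y_j, x_j)."""
    m, n = X.shape

    def project(b):                          # projection onto C_k
        p = np.zeros_like(b)
        keep = np.argpartition(np.abs(b), -k)[-k:]
        p[keep] = b[keep]
        return p

    def neg_loglik(b):                       # -(1/m) * logistic log-likelihood
        eta = X @ b
        return np.mean(np.logaddexp(0.0, eta) - y * eta)

    def f(b):                                # penalized objective (4)
        return 0.5 * v * np.sum((b - project(b)) ** 2) + neg_loglik(b)

    beta = np.zeros(n)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = v * (beta - project(beta)) - X.T @ (y - mu) / m          # step 3
        H = v * np.eye(n) + (X * (mu * (1 - mu))[:, None]).T @ X / m    # step 4
        direction = -np.linalg.solve(H, grad)                           # step 5
        step, fb = 1.0, f(beta)                                         # step 6
        while f(beta + step * direction) > fb + alpha * step * grad @ direction:
            step *= sigma                                               # steps 7-9
        new_beta = beta + step * direction                              # step 10
        if np.linalg.norm(new_beta - beta) < tol:
            return new_beta
        beta = new_beta                                                 # step 11
    return beta
```

When m ≪ n, the n-by-n linear solve in step 5 would benefit from the Woodbury identity discussed next.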
Acceleration: Here we mention two modifications to the MM algorithm that translate to large practical differences in computational cost. Inverting the n-by-n matrix $d^2g(\beta_k \mid \beta_k)$ naively requires $O(n^3)$ flops. When the number of cases $m \ll n$, invoking the Woodbury formula instead requires solving a substantially smaller m × m linear system at each iteration. This computational savings is crucial in the analysis of the EEG data of Section 4. The Woodbury formula says

$$(vI_n + UV)^{-1} = v^{-1}I_n - v^{-2}U\big(I_m + v^{-1}VU\big)^{-1}V$$

when U and V are n × m and m × n matrices, respectively. Inspection of equations (2) and (5) shows that $d^2g(\beta_k \mid \beta_k)$ takes the required form. Under Woodbury's formula the dominant computation is the matrix-matrix product VU, which requires only $O(nm^2)$ flops.

The second modification to the MM algorithm is quasi-Newton acceleration. This technique exploits secant approximations derived from iterates of the algorithm map to approximate the differential of the map. As few as two secant approximations can lead to orders of magnitude reduction in the number of iterations until convergence. We refer the reader to [36] for a detailed description of quasi-Newton acceleration and a summary of its performance on various high-dimensional problems.

## 4 Results and performance

We first compare the performance of our distance penalization method to leading shrinkage methods in sparse regression. Our simulations involve a sparse length n = 2000 coefficient vector β with 10 nonzero entries. Nonzero coefficients have uniformly random effect sizes. The entries of the design matrix X are N(0, 0.1) Gaussian random deviates. We then recover β from undersampled responses $y_j$ following Poisson and Bernoulli distributions with canonical links.

Figure 2: The left figure displays relative errors among nonzero predictors (plotted against the support indices of the true coefficients) in underdetermined Poisson and logistic regression with m = 1000 cases, comparing MM, MCP, SCAD, and LASSO. It is clear that LASSO suffers the most shrinkage and bias, while MM appears to outperform MCP and SCAD. The right figure displays mean squared error as a function of the number of samples m, favoring MM most notably for logistic regression.

Figure 2 compares solutions obtained using our distance penalties (MM) to those obtained under MCP, SCAD, and LASSO penalties. Relative errors (left) with m = 1000 cases clearly show that LASSO suffers the most shrinkage and bias; MM seems to outperform MCP and SCAD. For a more detailed comparison, the right side of the figure plots mean squared error (MSE) as a function of the number of cases averaged over 50 trials. All methods significantly outperform LASSO, which is omitted for scale, with MM achieving lower MSE than competitors, most noticeably in logistic regression. As suggested by an anonymous reviewer, similar results from additional experiments for Gaussian (linear) regression with comparison to the relaxed lasso are included in the Supplement.

Figure 3 (panels: (a) sparsity constraint; (b) regularize $\|B\|_*$; (c) restrict rank(B) = 2; (d) vary rank(B) = 1, . . . , 8): True $B_0$ in the top left of each set of 9 images has rank 2. The other 8 images in (a)–(c) display solutions as ϵ varies over the set {0, 0.1, . . . , 0.7}. Figure (a) applies our MM algorithm with sparsity rather than rank constraints to illustrate how failing to account for matrix structure misses the true signal; Zhou and Li [37] report similar findings comparing spectral regularization to ℓ1 regularization.
Figure (b) performs spectral shrinkage [37] and displays solutions under optimal λ values chosen via BIC, while (c) uses our MM algorithm restricting rank(B) = 2. Figure (d) fixes ϵ = 0.1 and uses MM with rank(B) ∈ {1, . . . , 8} to illustrate robustness to rank over-specification.

For underdetermined matrix regression, we compare to the spectral regularization method developed by Zhou and Li [37]. We generate their cross-shaped 32 × 32 true signal $B_0$ and in all trials sample m = 300 responses $y_i \sim N[\mathrm{tr}(X_i^t B_0), \epsilon]$. Here the design tensor X is generated with standard normal entries. Figure 3 demonstrates that imposing sparsity alone fails to recover $B_0$ and that rank-set projections visibly outperform spectral norm shrinkage as ϵ varies. The rightmost panel also shows that our method is robust to over-specification of the rank of the true signal to an extent.

We consider two real datasets. We apply our method to data on global temperature anomalies relative to the 1961–1990 average, collected by the Climate Research Unit [17]. We assume a non-decreasing solution, illustrating an instance of isotonic regression.

Figure 4: The leftmost plot shows our isotonic fit to the global temperature anomaly data [17], with anomalies plotted against year (1850–2000). The right figures display the estimated coefficient matrix B on the EEG alcoholism data using distance penalization, nuclear norm shrinkage [37], and LASSO shrinkage, respectively.

The fitted solution displayed in Figure 4 has mean squared error 0.009, clearly obeys the isotonic constraint, and is consistent with that obtained on a previous version of the data [32]. We next focus on rank constrained matrix regression for electroencephalography (EEG) data, collected by [35] to study the association between alcoholism and voltage patterns over times and channels. The study consists of 77 individuals with alcoholism and 45 controls, providing 122 binary responses $y_i$ indicating whether subject i has alcoholism. The EEG measurements are contained in 256 × 64 predictor matrices $X_i$; the dimension of the regression problem is thus greater than 16,000. Further details about the data appear in the Supplement. Previous studies apply dimension reduction [21] and propose algorithms to seek the optimal rank 1 solution [16]. These methods could not handle the size of the original data directly, and the spectral shrinkage approach proposed in [37] is the first to consider the full EEG data.

Figure 4 shows that our regression solution is qualitatively similar to that obtained under nuclear norm penalization [37], revealing similar time-varying patterns among channels 20–30 and 50–60. In contrast, ignoring matrix structure and penalizing the ℓ1 norm of B yields no useful information, consistent with findings in [37]. However, our distance penalization approach achieves a lower misclassification error of 0.1475. The lowest misclassification rate reported in previous analyses is 0.139 by [16]. As their approach is strictly more restrictive than ours in seeking a rank 1 solution, we agree with [37] in concluding that the lower misclassification error can be largely attributed to benefits from data preprocessing and dimension reduction. While not visually distinguishable, we also note that shrinking the eigenvalues via nuclear norm penalization [37] fails to produce a low-rank solution on this dataset. We omit detailed timing comparisons throughout since the various methods were run across platforms and depend heavily on implementation.
We note that MCP regression relies on the MM principle, and the LQA and LLA algorithms used to fit models with SCAD penalties are also instances of MM algorithms [11]. Almost all MM algorithms share an overall linear rate of convergence. While these methods require several seconds of compute time on a standard laptop, coordinate-descent implementations of LASSO outstrip our algorithm in terms of computational speed. Our MM algorithm required 31 seconds to converge on the EEG data, the largest example we considered.

## 5 Discussion

GLM regression is one of the most widely employed tools in statistics and machine learning. Imposing constraints upon the solution is integral to parameter estimation in many settings. This paper considers GLM regression under distance-to-set penalties when seeking a constrained solution. Such penalties allow a flexible range of constraints, and are competitive with standard shrinkage methods for sparse and low-rank regression in high dimensions. The MM principle yields a reliable solution method with theoretical guarantees and strong empirical results over a number of practical examples. These examples emphasize promising performance under non-convex constraints, and demonstrate how distance penalization avoids the disadvantages of shrinkage approaches.

Several avenues for future work may be pursued. The primary computational bottleneck we face is matrix inversion, which limits the algorithm when faced with extremely large and high-dimensional datasets. Further improvements may be possible using modifications of the algorithm tailored to specific problems, such as coordinate or block descent variants. Since the linear systems encountered in our parameter updates are well conditioned, a conjugate gradient algorithm may be preferable to direct methods of solution in such cases. The updates within our algorithm can be recast as weighted least squares minimization, and a re-examination of this classical problem may suggest even better iterative solvers. As the methods apply to a generalized objective comprised of multiple Bregman divergences, it will be fruitful to study penalties under alternate measures of distance, and settings beyond GLM regression such as quasi-likelihood estimation.

While our experiments primarily compare against shrinkage approaches, an anonymous referee points us to recent work revisiting best subset selection using modern advances in mixed integer optimization [3]. These exciting developments make best subset regression feasible for much larger problems than previously thought possible. As [3] focus on the linear case, it is of interest to consider how ideas in this paper may offer extensions to GLMs, and to compare the performance of such generalizations. Best subsets constitutes a gold standard for sparse estimation in the noiseless setting; whether it outperforms shrinkage methods seems to depend on the noise level and is a topic of much recent discussion [15, 23]. Finally, these studies as well as our present paper focus on estimation, and it will be fruitful to examine variable selection properties in future work. Recent work evidences an inevitable trade-off between false and true positives under LASSO shrinkage in the linear sparsity regime [30]. The authors demonstrate that this need not be the case with ℓ0 methods, remarking that computationally efficient methods that also enjoy good model performance would be highly desirable, as ℓ0 and ℓ1 approaches each possess one property but not the other [30].
Our results suggest that distance penalties, together with the MM principle, seem to enjoy benefits from both worlds on a number of statistical tasks.

Acknowledgements: We would like to thank Hua Zhou for helpful discussions about matrix regression and the EEG data. JX was supported by NSF MSPRF #1606177.

## References

[1] Barlow, R. E., Bartholomew, D. J., Bremner, J., and Brunk, H. D. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.
[2] Becker, M. P., Yang, I., and Lange, K. EM algorithms without missing data. Statistical Methods in Medical Research, 6:38–54, 1997.
[3] Bertsimas, D., King, A., and Mazumder, R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
[4] Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[5] Byrne, C. and Censor, Y. Proximity function minimization using multiple Bregman projections, with applications to split feasibility and Kullback-Leibler distance minimization. Annals of Operations Research, 105(1-4):77–98, 2001.
[6] Censor, Y. and Elfving, T. A multiprojection algorithm using Bregman projections in a product space. Numerical Algorithms, 8(2):221–239, 1994.
[7] Censor, Y., Elfving, T., Kopf, N., and Bortfeld, T. The multiple-sets split feasibility problem and its applications for inverse problems. Inverse Problems, 21(6):2071–2084, 2005.
[8] Chi, E. C., Zhou, H., and Lange, K. Distance majorization and its applications. Mathematical Programming Series A, 146(1-2):409–436, 2014.
[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), pages 1–38, 1977.
[10] Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[11] Fan, J. and Lv, J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.
[12] Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
[13] Gaines, B. R. and Zhou, H. Algorithms for fitting the constrained lasso. arXiv preprint arXiv:1611.01511, 2016.
[14] Golub, G. H. and Van Loan, C. F. Matrix Computations, volume 3. JHU Press, 2012.
[15] Hastie, T., Tibshirani, R., and Tibshirani, R. J. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.
[16] Hung, H. and Wang, C.-C. Matrix variate logistic regression model with application to EEG data. Biostatistics, 14(1):189–202, 2013.
[17] Jones, P., Parker, D., Osborn, T., and Briffa, K. Global and hemispheric temperature anomalies: land and marine instrumental records. Trends: A Compendium of Data on Global Change, 2016.
[18] Lange, K., Hunter, D. R., and Yang, I. Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics, 9:1–20, 2000.
[19] Lange, K. MM Optimization Algorithms. SIAM, 2016.
[20] Lange, K. and Keys, K. L. The proximal distance algorithm. arXiv preprint arXiv:1507.07598, 2015.
[21] Li, B., Kim, M. K., and Altman, N. On dimension folding of matrix- or array-valued statistical objects. The Annals of Statistics, pages 1094–1121, 2010.
[22] Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
[23] Mazumder, R., Radchenko, P., and Dedieu, A. Subset selection with shrinkage: Sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288, 2017.
[24] McCullagh, P. and Nelder, J. A. Generalized Linear Models, volume 37. CRC Press, 1989.
[25] Meinshausen, N. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
[26] Moré, J. J. The Levenberg-Marquardt algorithm: Implementation and theory. In Numerical Analysis, pages 105–116. Springer, 1978.
[27] Park, M. Y. and Hastie, T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological), 69(4):659–677, 2007.
[28] Polson, N. G., Scott, J. G., and Willard, B. T. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4):559–581, 2015.
[29] Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1351–1358, 2012.
[30] Su, W., Bogdan, M., and Candès, E. False discoveries occur early on the lasso path. The Annals of Statistics, 45(5), 2017.
[31] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288, 1996.
[32] Wu, W. B., Woodroofe, M., and Mentz, G. Isotonic regression: Another look at the changepoint problem. Biometrika, pages 793–804, 2001.
[33] Xu, J., Chi, E. C., Yang, M., and Lange, K. A majorization-minimization algorithm for split feasibility problems. arXiv preprint arXiv:1612.05614, 2017.
[34] Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[35] Zhang, X. L., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.
[36] Zhou, H., Alexander, D., and Lange, K. A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273, 2011.
[37] Zhou, H. and Li, L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Methodological), 76(2):463–483, 2014.
[38] Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Methodological), 67(2):301–320, 2005.