# semianalytic_resampling_in_lasso__48a672c7.pdf

Journal of Machine Learning Research 20 (2019) 1-33 Submitted 2/18; Revised 4/19; Published 4/19

Semi-Analytic Resampling in Lasso

Tomoyuki Obuchi obuchi@c.titech.ac.jp

Yoshiyuki Kabashima kaba@c.titech.ac.jp Department of Mathematical and Computing Science Tokyo Institute of Technology 2-12-1, Ookayama, Meguro-ku, Tokyo, Japan

Editor: Jon Mc Auliﬀe

An approximate method for conducting resampling in Lasso, the ℓ1 penalized linear regression, in a semi-analytic manner is developed, whereby the average over the resampled datasets is directly computed without repeated numerical sampling, thus enabling an inference free of the statistical ﬂuctuations due to sampling ﬁniteness, as well as a signiﬁcant reduction of computational time. The proposed method is based on a message passing type algorithm, and its fast convergence is guaranteed by the state evolution analysis, when covariates are provided as zero-mean independently and identically distributed Gaussian random variables. It is employed to implement bootstrapped Lasso (Bolasso) and stability selection, both of which are variable selection methods using resampling in conjunction with Lasso, and resolves their disadvantage regarding computational cost. To examine approximation accuracy and eﬃciency, numerical experiments were carried out using simulated datasets. Moreover, an application to a real-world dataset, the wine quality dataset, is presented. To process such real-world datasets, an objective criterion for determining the relevance of selected variables is also introduced by the addition of noise variables and resampling. MATLAB codes implementing the proposed method are distributed in (Obuchi, 2018).

Keywords: bootstrap method, Lasso, variable selection, message passing algorithm, replica method

1. Introduction

Variable selection is an important problem in statistics, signal processing, and machine learning. A desire for useful techniques of variable selection has recently been growing, as cumulated advances in measurement and information technologies have started to steadily produce a large amount of high-dimensional data in science and engineering. A naive method for selecting relevant variables requires solving discrete optimization problems. This involves a serious computational diﬃculty as the dimensionality of the variables increases, even in the simplest case of linear models (Natarajan, 1995). Hence, certain relaxations or approximations are required for handling such large high-dimensional datasets. A great deal of progress has been made with regard to relaxation techniques by introducing an ℓ1 penalty (Tibshirani, 1996; Meinshausen and B uhlmann, 2004; Banerjee et al., 2006; Friedman et al., 2008). Its model consistency, that is, whether the estimated model

c 2019 Tomoyuki Obuchi and Yoshiyuki Kabashima.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-109.html.

Obuchi and Kabashima

converges to the true model in the large size limit of data, has been extensively studied, particularly in the case of linear models (Knight and Fu, 2000; Zhao and Yu, 2006; Yuan and Lin, 2007; Wainwright, 2009; Meinshausen and Yu, 2009). These studies show that a naive usage of the ℓ1 penalty aﬀects model consistency in realistic settings: The resultant estimator cannot completely reject variables not present in the true model if there exist non-trivial correlations between covariates, although the normal consistency in the ℓ2 sense is retained. This fact motivated the development of further techniques for recovering model consistency in the variable selection context, particularly in the last decade, including traditional approaches using conﬁdence interval and hypothesis testing (Zou, 2006; Bach, 2008; Meinshausen and B uhlmann, 2010; Javanmard and Montanari, 2014a,b, 2015; Lockhart et al., 2014; G Sell et al., 2013; Takahashi and Kabashima, 2018). If the distribution of the estimator is accurately ﬁgured out and the corresponding P-value is eﬃciently computed, those traditional approaches are quite eﬃcient and powerful enough to correctly perform variable selection. However in many practical situations, the functional form of the distribution is not clear. In such cases, certain numerical approaches are employed to approximately estimate the distribution. Resampling is one of the representative methods and is focused on in the present paper.

Resampling is a versatile idea applicable to a wide range of problems in statistical modeling and is broadly used in machine learning algorithms; some examples can be found in boosting and bagging (Trevor et al., 2009). In statistics, Efron s bootstrap method is a pioneering example in which resampling is eﬃciently and systematically used (Efron and Tibshirani, 1994). In the case of the ℓ1-penalized linear regression or Lasso (Tibshirani, 1996), through a ﬁne evaluation of each variable s probability to be selected (positive probability) in a fairly general setting, Bach showed that it is possible to perform statistically consistent variable selection by utilizing the bootstrap method (Bach, 2008); the associated algorithm is called bootstrapped Lasso (Bolasso). Another algorithm based on a similar idea, called stability selection (SS), was also implemented by randomizing the penalty coefﬁcient in Lasso (Meinshausen and B uhlmann, 2010). All these examples demonstrate that resampling is powerful and versatile. A common disadvantage of such resampling approaches is their computational cost. For example, in the Bolasso case, the ℓ1-penalized linear regression should be recursively solved according to the number of resampled datasets, which should be suﬃciently large so that the positive probability of variables may be estimated. This multiple computational cost precludes the application of such resampling techniques to large datasets. The aim of this study is to avoid this problem by developing an approximation whereby the resampling is conducted in a semi-analytic manner, and to implement it in Lasso. Such an approximation was already obtained by Malzahn and Opper (2003), where a general framework of semi-analytic resampling was developed using the replica method from statistical mechanics, and was demonstrated in Gaussian process regression. We follow their idea and pursue how it works in Lasso with certain resampling manners. The remaining of the paper is organized as follows. In the next section, the general framework for semi-analytic resampling using the replica method is reviewed. Concrete formulas in the Lasso case are also provided. After taking the average with respect to the resampling, there remains an intractable statistical model. Handling this model is another key issue, and certain approximations (expected to be exact under certain conditions) should

Semi-Analytic Resampling in Lasso

be adopted. In this study, Gaussian approximation is used in conjunction with the so-called cavity method, providing a message passing type algorithm. Its dynamical behavior is analyzed by the so-called state evolution, which guarantees that the algorithm convergence is free from the model dimensionality and the dataset size when covariates are provided as zero-mean independently and identically distributed (i.i.d.) Gaussian random variables. They are explained in Section 3. In Section 4, numerical experiments are carried out on both simulated and real-world datasets to examine the accuracy and eﬃciency of the proposed semi-analytic method. For processing real-world datasets, an objective criterion for determining the relevance of selected variables is also proposed, based on the addition of noise variables and resampling. The last section concludes the paper.

2. Formulation for semi-analytic resampling

Let us start from preparing notations and deﬁnitions. We denote by D a given dataset, which usually consists of a set of inputs xµ and outputs yµ as D = {(xµ, yµ)}M µ=1, and denote our basic statistics by ˆβ(D) = (ˆβ1(D), , ˆβN(D)) . They are interpreted as an estimator of true parameters in the true model that is inferred. Based on the Bayesian inference framework, it is assumed that the basic estimator can be expressed as an average over a certain posterior distribution P(β|λ, D), that is,

ˆβ(λ, D) = Z dβ βP(β|λ, D) β , (1)

where represents the average over the posterior distribution which is supposed to be constructed as follows:

P(β|λ, D) = 1 Z(λ, D)P0(β|λ)P(D|β). (2)

Here, P0(β|λ) is the prior distribution characterized by the parameters λ, and P(D|β) represents the likelihood. The normalization constant or the partition function can be explicitly expressed by

Z(λ, D) = Z dβ P0(β|λ)P(D|β). (3)

Considering the construction of a subsample by resampling from the full data D of size M with replacement, let an indicator vector c = (c1, , c M) specify any such subsample; each cµ is a non-negative integer counting the number of occurrences of the µ-th datapoint of D in the subsample. A subsample identiﬁed by c is denoted by Dc. Specifying a resampling deﬁnes the distribution P(c) of c, and the question is the behavior of the basic estimators and their functions with respect to the average over P(c). In some resampling techniques, additional randomization is introduced in the prior distribution (Meinshausen and B uhlmann, 2010). Hence, an average over the parameters λ is further considered, and the distribution P(λ) of λ is introduced. For notational simplicity, the average over these distributions is denoted by square brackets with appropriate subscripts as follows:

Z dλ( )P(c)P(λ).

The average of this type is called conﬁgurational average throughout this paper.

Obuchi and Kabashima

2.1. General theory of semi-analytic resampling

The purpose of resampling is to obtain the distribution of the basic estimator

P(βi) = h δ βi ˆβi(λ, Dc) i

which is reduced to computing the moments h ˆβr i (λ, Dc) i

c,λ of arbitrary degree r N. We

thus show that these moments can be evaluated in the following manner. By deﬁnition, the moment h ˆβr i (λ, Dc) i

c,λ with r N can be written as

h ˆβr i (λ, Dc) i

" 1 Zr(λ, Dc)

a=1 dβa βa i P0(βa|λ)P(Dc|βa)

Zn r(λ, Dc) Z ( r Y

a=1 dβa βa i P0(βa|λ)P(Dc|βa)

b=1 dβb ) ( r Y

b=1 P0(βb|λ)P(Dc|βb)

The evaluation of (5) is technically diﬃcult owing to the existence of Zr(λ, Dc) in the denominator: The negative power of the partition function does not allow the analytical evaluation of the average over c and λ, even if P(c) and P(λ) take reasonable forms. To overcome this diﬃculty, an auxiliary parameter n is introduced in (6). The exponent n is assumed to be a positive integer larger than r in (7); thus, the power of the partition function can be expanded in the integral form (3). Integral variables {β1, β2, . . . , βn} are termed replicas since they are regarded as constituting n copies of the original system. In (7), the conﬁgurational average can be analytically computed under appropriate approximations. This yields an expression as a function of n that is analytically continuable from N to R. The n 0 limit is taken by employing the analytically continued expression after all computations. The evaluation technique based on these procedures is often called the replica method (M ezard et al., 1987; Nishimori, 2001; Dotsenko, 2005). Evidently, this technique is not justiﬁed in a strict sense, but it is known that the replica method gives correct results for many problems. Justiﬁcation of the replica method is known to be diﬃcult (Talagrand, 2003), and here we leave it as a future work and just employ the method for our purpose. In the present problem, it is far from trivial to obtain an analytically continuable expression, and below we concentrate on its derivation. Accordingly, the conﬁgurational average of any power of the basic estimator can be evaluated from the following quantities:

Ξ ({βa}n a=1 | D) Z ( n Y

a=1 dβa ) " n Y

a=1 P0(βa|λ)P(Dc|βa)

P ({βa}n a=1 | D) 1 Ξ ({βa}n a=1 | D)

a=1 P0(βa|λ)P(Dc|βa)

Semi-Analytic Resampling in Lasso

where the former is called replicated partition function and the latter is called replicated Boltzmann distribution. The average over the replicated Boltzmann distribution is denoted by

( ) Z ( n Y

( ) P ({βa}n a=1 | D) . (8)

It follows from (5) (7) that this average converges to the desired total average over the posterior and the conﬁgurational variables c and λ as n 0. The next step is to compute this average. The nontrivial technical diﬃculties are in the integrations with respect to replicas {βa}n a=1 and in deriving a functional form that is analytically continuable with respect to n. These will be handled after considering the speciﬁc case of Lasso.

2.2. Speciﬁcs to Lasso with independent sampling

The above general theory is now rephrased in the context of Lasso. Given a set of covariates xµ RN and responses yµ R, D = {(xµ, yµ)}M µ=1, the usual estimator by Lasso is

ˆβ(λ, D) = arg min β

!2 + λ||β||1

In contrast to this, considering Bolasso and SS, given the resampled data Dc and the randomized penalty coeﬃcients λ = (λ1, , λN) , the following estimator is introduced:

ˆβ(λ, Dc) = arg min β

For representing this in terms of the posterior average, the following quantities are introduced:

H(β|λ, Dc) = 1

i=1 λi|βi|,

Zγ(λ, Dc) = Z dβ e γH(β|λ,Dc),

Pγ(β|λ, Dc) = 1

Zγ e γH(β|λ,Dc),

where these quantities are called (in the order they appear) Hamiltonian, partition function, and Boltzmann distribution, in accordance with physics terminology. As γ , the Boltzmann distribution converges to a pointwise measure at ˆβ(λ, Dc) and thereby becomes the desired posterior distribution, allowing the identiﬁcation of the Boltzmann distribution with the posterior distribution; thus, the average over the Boltzmann distribution is thereafter denoted by introduced in (1)1. The prior distribution P0(β|λ) and the likelihood

1. It is assumed that the order of the integration and the γ limit may be changed.

Obuchi and Kabashima

P(D|β) in (2) correspond to the factors e γ P i λi|βi| and e γ

2 PM µ=1 cµ(yµ P i xµiβi) 2 in the Boltzmann distribution, respectively. Considering Bolasso and SS, we draw the subsample c of a ﬁxed size m in an unbiased manner. Each sample comes out with a probability of M mm!/(c1! c M!), which has a weak dependency among {cµ}µ. However, this dependency is not essential for large M and m; thus, it is ignored. Using Stirling s formula m! (m

e )m in conjunction with the relation m = PM µ=1 cµ, cµ is approximately regarded as an i.i.d. variable from a Poisson distribution of mean τ = m/M, namely,

It is also natural to require that P(λ) is factorized into a batch of identical distributions as follows:

At this point, the explicit form of P(λi) is not speciﬁed. These factorized natures allow us to express the replicated Boltzmann distribution as

Pγ({βa}n a=1|D) h e γ Pn a=1 H(βa|λ,Dc)i

µ=1 Φµ {βi}N i=1 N Y

i=1 Ψ (βi) , (10)

Φµ {βi}N i=1 = h e 1

2 γc P a(yµ P i xµiβa i )2i

Ψ (βi) = h e γλ Pn a=1 |βa i |i

and the vector notation βi = (βa i )a was introduced for later convenience. Φµ is hereafter called µ-th potential function. To proceed further, an approximation should be introduced to make the right-hand side of (10) tractable. Although there are several ways for that, we here make an approximation based on the cavity method from statistical physics. The details are in the next section.

3. Handling the replicated system

We below work with the cavity method, or the belief propagation (BP) in computer science, and use a Gaussian approximation. This treatment is essentially identical to that used in deriving the known approximate message passing (AMP) (Kabashima, 2003; Donoho et al., 2009), and can be justiﬁed if each covariate is i.i.d. from Gaussian distributions (Bayati and Montanari, 2011; Barbier et al., 2017). For treating nontrivial correlations between covariates, more sophisticated approximations (Opper and Winther, 2001a,b, 2005; Kabashima and Vehkapera, 2014; C akmak et al., 2014; Cespedes et al., 2014; Rangan et al., 2016; Ma and Ping, 2017; Takeuchi, 2017) are required. Such extensions are left as future work.

Semi-Analytic Resampling in Lasso

(10) implies that the replicated Boltzmann distribution naturally has a factor graph structure. Hence, by the cavity method, (10) can be approximately decomposed into two messages as follows:

eφµ i(βi) = 1 Zµ i

j( =i) dβj Φµ {βi}N i=1 Y

j( =i) φj µ(βj), (11)

φi µ(βi) = 1 Zi µ Ψ (βi) Y

eφν i(βi), (12)

where Zµ i, Zi µ are normalization factors that are not relevant and will be discarded below. The average of (8) can be computed by employing this set of equations.

3.1. Gaussian approximation on cavity method

A crucial observation for assessing (11),(12) is that the residual Ra µ = yµ P j xµjβa j appearing in Φµ has a sum of a large number of random variables; thus, the central limit theorem justiﬁes treating it as a Gaussian variable with the appropriate mean and variance. This consideration leads to the following decomposition in (11)

Ra µ + xµiβa i yµ X

j( =i) xµjβa j yµ X

j( =i) xµjβ \µ j + Za µi rµi + Za µi,

where Za µi is a zero-mean Gaussian variable whose covariance is equivalent to that of P j( =i) xµjβa j , and ( ) \µ is the average over Q j φj µ(βj) or the replicated Boltzmann distribution without µ-th potential function. This shows that it is diﬃcult to consider the correlation between βi and βj with diﬀerent i, j in the present framework. This could be overcome even in the cavity method framework (Opper and Winther, 2001a,b); however, it is beyond the scope of this study.

The so-called replica symmetry is now assumed in the messages. This symmetry implies that any permutation of the replicas {β1 i , , βn i } yields an identical message, which in turn implies, by De Finetti s theorem (Hewitt and Savage, 1955), that the messages can be expressed as

φi µ(βi) = Z d Fρi µ(F)

a=1 F(βa i ),

eφµ i(βi) = Z d Gρµ i(G)

a=1 G(βa i ),

Obuchi and Kabashima

where d Fρi µ(F) and d Gρi µ(G) are probability measures over the probability distribution functions F and G, respectively. From this form, two diﬀerent variances naturally emerge:

V \µ i (βa i )2 βa i βb i

\µ = Z dβi (βa i )2 βa i βb i Z d Fρi µ(F)

c=1 F(βc i )

= Z d Fρi µ(F)

(Z F(β)β2dβ Z F(β)βdβ 2)

W \µ i βa i βb i \µ βa i \µβb i \µ

= Z d Fρi µ(F) Z F(β)βdβ 2 Z d Fρi µ(F) Z F(β)βdβ 2 ,

where it was assumed that a = b. Below, the µ-dependence of V \µ i and W \µ i is ignored, as it is small, and let

V \µ i Vi = (βa i )2 βa i βb i ,

W \µ i Wi = βa i βb i βa i βb i .

It can be implied that Vi describes the average of the variance inside a ﬁxed resampled dataset, whereas Wi represents the inter-sample variance. Using Vi and Wi, the covariance of βi is generally written as

Cov\µ βa i , βb i βa i βb i \µ βa i \µβb i \µ = W \µ i + V \µ i δab Wi + Viδab.

Using this relation, the covariance of {Za µi}a can be expressed as

Cov\µ Za µi, Zb µi = X

j,k( =i) xµixµj Cov\µ βa j , βb k X

j x2 µj Cov\µ βa j , βb j

j x2 µj(Wj + Vjδab) Wµ + Vµδab.

The correlation between βj and βk for j = k is neglected in the second sum because it is not taken into account in the present framework, as explained above. Moreover, the addition of the i-th term in the same sum does not aﬀect the following discussion because it is suﬃciently small in the summation. Based on these observations, Za µi is decomposed as follows:

Za µi = Dµi + a µi,

where Dµi and a µi are zero-mean Gaussian variables whose covariances are Cov\µ (Dµi, Dµi) =

Wµ, Cov\µ Dµi, a µi = 0, Cov\µ a µi, b µi = Vµδab. eφµ i can now be computed. The integration with respect to {βj}j( =i) in (11) is replaced by that over Di and a i . The result is

eφµ i(βi) Z d Dµi P(Dµi) Z Y

a d a µi P( a µi) h e 1

2 γc P a(rµi xµiβa i +Dµi+ a µi)2i

2 (1 + cγVµ) n

2 S(c) P a(rµi xµiβa i ) 2+ 1

2 T(c)( P a(rµi xµiβa i )) 2i

Semi-Analytic Resampling in Lasso

S(c) = cγ 1 + cγVµ ,

T(c) = c2γ2Wµ (1 + cγVµ)(1 + cγ(Vµ + n Wµ)),

U(c) = 1 + cγ(Vµ + n Wµ)

ln eφµ i(βi) can be expanded with respect to xµiβa i , and the expansion up to the second order leads to an eﬀective Gaussian approximation of the message, yielding

eφµ i(βi) e 1

2 γ Aµ i P a(βa i ) 2+ 1

2 γ2 Cµ i( P a βa i ) 2+γ Bµ i P a βa i ,

Aµ i = c 1 + cγVµ

c x2 µi, (16a)

Bµ i = c 1 + cγVµ

c rµixµi, (16b)

(" c 1 + cγVµ

" c 1 + cγVµ

c c 1 + cγVµ

x2 µi.(16c)

It should be noted that the n 0 limit was already taken in these coeﬃcients. Owing to the Gaussian approximation of eφµ i(βi), the marginal distribution of the replicated Boltzmann distribution can be simply written as

Pi(βi|D) Z Y

j( =i) dβj P({βi}i|D) Ψ(βi) Y

µ eφµ i(βi)

2 Ai P a(βa i ) 2+(Bi+ Ciz) P a βa i λ P a |βa i |

where Dz = dze 1

2π, and the following identity was used:

e 1 2 Cx2 = Z Dz e

Moreover, Ai = P µ Aµ i, Bi = P µ Bµ i, and Ci = P µ Cµ i. All replicas are now factorized, and the average can be taken for each replica independently, allowing passing to the n 0 limit by analytic continuation from N to R with respect to n. For example, the mean of βa i is computed as

βa i = Z Dz Z dβ eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|) n 1

"Z Dz Z dβ βeγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|) Z dβ eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|) n 1#

R dβ βeγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|) R dβ eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|)

γ Z Dz Sλ(Bi + p

Obuchi and Kabashima

where Sλ is the so-called soft thresholding function

Sλ(x; A) = 1

A(x λsgn (x))θ(|x| λ),

and θ(x) is the step function that is equal to 1 if x > 0 and 0 otherwise. Thus, the mean of the basic estimator takes a reasonable form. In the above average, it is assumed that the coeﬃcients Ai, Bi, Ci remain ﬁnite as γ . This is the case if the following holds:

i x2 µiγVi O(1), (γ ).

This scaling is consistent, which can be shown by

χi γVi = γ n

(βa i )2 βa i βb i o

R dβ β2eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|) R dβ eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|)

R dβ βeγ( 1

2 Aiβ+(Bi+ Ciz)βi λ|β|) R dβ eγ( 1

2 Aiβ2+(Bi+ Ciz)βi λ|β|)

Z Dzθ |Bi + p

In contrast to Vi, the inter-sample ﬂuctuation Wi takes a ﬁnite value even as γ . Its explicit form is

Wi = βa i βb i βa i βb i

n 0, γ Z Dz S2 λ Bi + p

λ Z Dz Sλ Bi + p

It follows that any moment of the basic estimator can be computed from (17), once the coeﬃcients {Ai, Bi, Ci}N i=1 are correctly estimated. Hence, the next task is to derive a set of self-consistent equations for the coeﬃcients and to construct an algorithm for solving it.

3.2. Self-consistent equations and a message passing algorithm

Using (12), we obtain

φi µ(βi) Ψ (βi) Y

2 Ai µ P a(βa i ) 2+(Bi µ+

Ci µz) P a βa i λ P a |βa i |

where Ai µ = P ν( =µ) Aν i, Bi µ = P ν( =µ) Bν i, and Ci µ = P ν( =µ) Cν i. Inserting this into (11), we can derive a set of self-consistent equations determining all the cavity coeﬃcients {Ai µ, Bi µ, Ci µ, Aµ i, Bµ i, Cµ i}i,µ. This procedure suggests an iterative

Semi-Analytic Resampling in Lasso

algorithm, conventionally called BP algorithm, which is schematically described as follows:

{ Aµ i, Bµ i, Cµ i}(t) {βi \µ, Vi, Wi}(t), (18a)

{Ai µ, Bi µ, Ci µ}(t+1) { Aµ i, Bµ i, Cµ i}(t), (18b)

{βi \µ, Vi, Wi}(t+1) {Ai µ, Bi µ, Ci µ}(t+1), (18c)

where t = 0, 1, , denotes the algorithm time step. If the BP algorithm converges, the full coeﬃcients {Ai, Bi, Ci}N i=1 are given from the converged values of { Aµ i, Bµ i, Cµ i}i,µ. However, this algorithm is not particularly eﬃcient because its computational cost is O(NM2). A more eﬃcient algorithm is derived by approximately rewriting the cavity coeﬃcients and the cavity mean β \µ using the full coeﬃcients {Ai, Bi, Ci}i. To this end, φi µ(βi) is connected with Pi(βi|D) in a perturbative manner. Comparing the coeﬃcients, it follows that the diﬀerence between Ai and Ai µ is negligibly small, as it is proportional to x2 µi = O(1/N). The same is true between Ci and Ci µ. Hence, the relevant diﬀerence is only B(t) i = B(t) i B(t) i µ and is expressed as

1 + cχ(t 1) µ

c r(t 1) µi xµi = a(t 1) µ xµi +

1 + cχ(t 1) µ

βi \µ (t 1) x2 µi

a(t 1) µ xµi,

1 + cχ(t) µ

1 + cχ(t) µ

r(t) µi xµi

βi \µ (t) . (19)

Accordingly, the diﬀerence between βi \µ and βi is computed as

βi \µ (t) βi (t) βi (t)

B(t) i B(t) i = βi (t) χ(t) i a(t 1) µ xµi, (20)

Inserting (20) into (19) yields

1 + cχ(t) µ

j xµjβj (t) + χ(t) µ a(t 1) µ

where the last term is interpreted as the Onsager reaction term in physics. Rewriting rµi using aµ in the right-hand sides of (16) and collecting several factors, a simpliﬁed message

Obuchi and Kabashima

passing algorithm corresponding to (18) is obtained as follows:

i x2 µiχ(t) i , (21a)

W (t) µ = X

i x2 µi W (t) i , (21b)

f(t) 1µ , f(t) 2µ =

1 + cχ(t) µ

1 + cχ(t) µ

a(t) µ = f(t) 1µ

j xµjβj (t) + χ(t) µ a(t 1) µ

A(t+1) i = X

µ x2 µif(t) 1µ , (21e)

B(t+1) i = X

µ xµia(t) µ +

µ x2 µif(t) 1µ

βi (t), (21f)

C(t+1) i = X

f(t) 2µ W (t) µ + f(t) 2µ f(t) 1µ 2 a(t) µ

βi (t+1) = Z Dz Sλ

B(t+1) i + q

C(t+1) i z; A(t+1) i

χ(t+1) i = 1

Z Dz θ B(t+1) i + q

C(t+1) i z λ

W (t+1) i = Z Dz S2 λ

B(t+1) i + q

C(t+1) i z; A(t+1) i

βi (t+1) 2 . (21j)

We call the algorithm (21) AMPR (Approximate Message Passing with Resampling) because it can be regarded as an extension of the AMP in the usual Lasso to the resampling case. The computational cost is O(NM) per iteration and is signiﬁcantly reduced compared with the BP algorithm. This yields the main result of this study. There is an ambiguity in the initial condition for AMPR. Here we assume that we are given an initial estimate {βi (0), χi(0), Wi(0)}i and conduct the iteration based on (21) until convergence. Still, there is an ambiguity in computing a(0) µ , due to the presence of a( 1) µ in (21d). To resolve this, we assume a( 1) µ = 0, yielding

1 + cχ(0) µ

j xµjβj (0)

These completely determine the initial condition. As explained at the end of Section 3.1, the convergent solution of AMPR, {A i , B i , C i }i, enables the computation of any moment of the basic estimator as follows:

[ βi r]c,λ = lim n 0

a=1 βa i = Z Dz Sr λ B i + p

Semi-Analytic Resampling in Lasso

which indicates that the marginal distribution of the basic estimator is obtained as

P(βi) = Z Dz δ βi Sλ B i + p

This yields the positive probability, which is important for variable selection techniques, as follows:

Πi Prob |ˆβi| = 0 = Z Dz θ B i + p

Applications using these relations are provided in Section 4.

3.3. State evolution for AMPR

A beneﬁt of the AMP type algorithms is that it is possible to track the macroscopic dynamical behavior of the algorithm. This can be done by using the so-called state evolution (SE) equations. Here we derive the SE equations associated with AMPR. The derivation relies on the i.i.d. assumption of the covariates, and hence we assume each xµi is i.i.d. from the zero-mean Gaussian as

xµi N(0, σ2 x). (24)

The variance value is arbitrary in general, but for notational simplicity it is chosen as σ2 x = 1/N in this section. Furthermore, we assume the data is generated from the following linear process:

y = Xβ0 + ξ,

where ξ is the noise vector whose component is i.i.d. from N(0, σ2 ξ) and β0 is the true parameters whose component is also i.i.d. from a certain distribution Pβ0( ). Under the assumption (24) with σ2 x = 1/N, the intraand inter-sample variances can be simpliﬁed as

i x2 µiχi X

i E x2 µiχi 1

i χi χ, (25)

i x2 µi Wi X

i E x2 µi Wi 1

i Wi W, (26)

where we have neglected the correlations between xµi and the variances2. Accordingly, many quantities appearing in (21) become independent of the subscripts µ and i. Terms retaining the dependence are only the linear terms with respect to xµi such as βi, Bi and aµ. To derive the SE equations, we need to handle those terms.

2. Remembering the discussion in Section 3.2, these variances are actually the ones computed in the absence of the µ-th potential function, χ\µ i and W \µ i , and hence this neglect can be justiﬁed. The same discussion can be applied to the following.

Obuchi and Kabashima

We start from the following form of B(t+1) i :

B(t+1) i = X

c r(t) νi xνi f(t) 1 X

j( =i) xνj β\ν j (t)

The righthand side is the sum of a large number of random variables, and hence we can treat it as a Gaussian variable with appropriate mean and variance. The mean is

j( =i) xνj β\ν j (t)

E x2 νi β0i + X

j( =i) E [xνixνj] β0j β\µ j (t) + E [xνiξν]

1 N β0i = αft 1β0i, (28)

where α = M/N is the ratio of the dataset size to the dimensionality. In the same way, the variance becomes

j( =i) xνi β\ν j (t)

α ft 1 2 MSE(t) + σ2 ξ v(t+1) 0 , (29)

where MSE(t) denotes the mean-squared error (MSE) between the true and averaged parameters:

β0i β (t) i 2 1

β \µ i (t) 2 . (30)

Hence, we may write

B(t+1) i = αft 1β0i + q

v(t+1) 0 ui, (31)

where ui N(0, 1). Besides, in the computation of C(t+1) i , we have X

ν x2 νif(t) 2ν W (t) ν αf(t) 2 W (t), (32)

ν x2 νi r(t) νi 2 α MSE(t) + σ2 ξ . (33)

This yields

C(t+1) i C(t+1) = αf(t) 2 W (t) + α f(t) 2 f(t) 1 2 MSE(t) + σ2 ξ . (34)

In the same level of approximation, we get A(t+1) i A(t+1) = αf(t) 1 .

Semi-Analytic Resampling in Lasso

To derive a closed set of equations, we have to compute χ(t+1), W (t+1) and MSE(t+1)

from {A(t+1), {B(t+1) i }i, C(t+1)}. As an example, we show the derivation of χ(t+1) from (21i) in the following:

i=1 χ(t+1) i 1

Z Dz θ B(t+1) i + p

Z dβPβ0(β) Z Du Z Dz θ αf(t) 1 β + q

v(t+1) 0 u + p

where we have applied the law of large numbers. The other quantities W, MSE are computed in the same manner. Overall, we reach the following set of equations:

f(t) 1 , f(t) 2 =

c 1 + c χ(t)

" c 1 + c χ(t)

A(t+1) = αf(t) 1 , (36b)

C(t+1) = αf(t) 2 W (t) + α f(t) 2 f(t) 1 2 MSE(t) + σ2 ξ , (36c)

v(t+1) 0 = α f(t) 1 2 MSE(t) + σ2 ξ , (36d)

χ(t+1) = 1 A(t+1)

Z dβPβ0(β) Z Du

Z Dz θ A(t+1)β + q

v(t+1) 0 u + p

W (t+1) = Z dβPβ0(β) Z Du

( Z Dz S2 λ

A(t+1)β + q

v(t+1) 0 u + p

C(t+1)z; A(t+1)

A(t+1)β + q

v(t+1) 0 u + p

C(t+1)z; A(t+1) 2

MSE(t+1) = Z dβPβ0(β) Z Du

A(t+1)β + q

v(t+1) 0 u + p

C(t+1)z; A(t+1)

Given an initial condition { χ(0), W (0), MSE(0)}, we can track the dynamical evolution of those quantities according to (36). This is the SE equations for AMPR. A direct consequence of the SE equations is the convergence property of AMPR: Its convergence depends on neither the dataset size M nor the model dimensionality N. Hence, we can assume the iteration steps required for convergence is O(1) and the total computational cost of AMPR is thus guaranteed to be O(NM). This reinforces the superiority of the present approach. We, however, warn that the discussion based on the SE equations strongly relies on the i.i.d. assumption among covariates, which is not necessarily satisﬁed in real-world datasets. For covariates with non-trivial correlations and heterogeneity, it is known that

Obuchi and Kabashima

AMP type algorithms tend to show slow convergence, or even not to converge in particular cases (Caltagirone et al., 2014). A common prescription to overcome this diﬃculty is to introduce a damping factor in the update of the messages (Rangan et al., 2014), which is also employed in our implementation (Obuchi, 2018). In Sections 4.1.5 and 4.2, we see how this prescription works for datasets with nontrivial covariates in numerical simulations.

4. Numerical experiments

In this section, the accuracy and the computational time of the proposed semi-analytic method based on AMPR is examined by a comparison with direct numerical resampling. We also check how nontrivial correlations among covariates aﬀect the performance of AMPR. Both simulated and real-world datasets (from UCI machine learning repository, Lichman, 2013) are used. For all experiments involving numerical resampling, Glmnet (Friedman et al., 2010), implemented as an MEX subroutine in MATLAB R , was employed for solving (9), given a sample {λ, Dc}. Moreover, the proposed AMPR algorithm was implemented as raw code in MATLAB. This is not the most optimized approach because AMPR uses a number of for and while loops which are slow in MATLAB; hence, the comparison of computational time is not necessarily fair. However, even in this comparison, there is a signiﬁcant diﬀerence in the computational time between the proposed semi-analytic method and the numerical resampling approach. For reference, it should be noted that all experiments below were conducted in a single thread on a single CPU of Intel(R) Xeon(R) E5-2630 v3 2.4GHz. For actual computations, the distribution P(λ) should be speciﬁed. In SS, the following distribution is used (Meinshausen and B uhlmann, 2010):

i=1 {pwδ (λi λ/w) + (1 pw)δ (λi λ)} ,

with 0 < w 1 and 0 < pw < 1. The case of the non-random penalty coeﬃcient, in which Bolasso is included (Bach, 2008), is recovered at w = 1, irrespective of the value of pw. This distribution is adopted below.

4.1. Simulated dataset

Here, simulated datasets are treated. The data is supposed to be generated from the following linear model:

y = Xβ0 + ξ,

where each component of the design matrix X = (x1, x2, , x N) is i.i.d. from N 0, N 1 , and ξ is the noise vector, whose component is i.i.d. from N(0, σ2 ξ). The ratio of the dataset size M to the model dimensionality N is denoted as α M/N hereafter. These settings are identical to the ones assumed in Section 3.3. The true signal β0 RN is assumed to be K0(= Nρ0)-sparse vector, and the non-zero components are i.i.d. from N(0, 1/ρ0), setting the power of the signal unity. The index set of non-zero components is denoted by S0 = {i||β0i| = 0} and is called true support. Any estimator of the true support is simply called support and is hereafter denoted by S.

Semi-Analytic Resampling in Lasso

For simplicity, the number of resampling times is ﬁxed at Nres = 1000 in the experiments with numerical resampling.

4.1.1. Accuracy of the semi-analytic method

Let us ﬁrst check the consistency between the results of our semi-analytic method and of the direct numerical resampling. Figure 1 shows the plots of the experimental values of the quantities {βi, Wi, Πi}i against their semi-analytic counterparts for the non-random penalty case w = 1 with the bootstrap resampling τ = 1, which is the situation considered in Bolasso. The same plots in the SS situation, w = 0.5(< 1), pw = 0.5, and τ = 0.5, are shown in Figure 2. Other detailed parameters are provided in the captions. These results show that the proposed semi-analytic

-5 0 5 Semi-analytic

0 0.2 0.4 0.6 0.8 1 Semi-analytic

0 0.2 0.4 0.6 0.8 1 Semi-analytic

-4 -3 -2 -1 0 1 2 3 4 Semi-analytic

0 0.2 0.4 0.6 0.8 1 1.2 Semi-analytic

0 0.2 0.4 0.6 0.8 1 Semi-analytic

Figure 1: Experimental values of βi (left), Wi (middle), and Πi (right) are plotted against those computed by the semi-analytic method in the non-random penalty w = 1 and τ = 1 case. The upper panels are for λ = 0.01 and the lower are for λ = 1. The other parameters are set to be (N, α, ρ0, σ2 ξ) = (1000, 0.5, 0.2, 0.01).

method reproduces the numerical results fairly accurately. As far as it was examined, results of similar accuracy were obtained for a very wide range of parameters. These validate the proposed semi-analytic method. In the above experiments using Glmnet, we set a threshold ϵ to judge the algorithm convergence as ϵ = 10 10, which is rather tighter than the default value. This is necessary for examining consistency with the proposed semi-analytic method. For example, for λ = 0.01 (the upper panels in Figures 1 and 2), a systematic deviation from the proposed semianalytic method (βi tends to be underestimated) emerges at the default value ϵ = 10 7.

Obuchi and Kabashima

-3 -2 -1 0 1 2 3 Semi-analytic

0 0.5 1 1.5 2 2.5 3 3.5 4 Semi-analytic

0 0.2 0.4 0.6 0.8 Semi-analytic

-1 -0.5 0 0.5 1 Semi-analytic

0 0.5 1 1.5 2 2.5 Semi-analytic

0 0.1 0.2 0.3 0.4 Semi-analytic

Figure 2: Same plots as in Figure 1 in the random penalty case with w = 0.5, pw = 0.5, and τ = 0.5. The parameters of each panel are identical to the corresponding parameters in Figure 1. In comparison to Figure 1, |βi| and Πi tend to be smaller, whereas Wi tends to be larger. This is probably due to the additional stochastic variation coming from the randomization on the penalty coeﬃcient and the diﬀerence between τ = 1 and 1/2.

Semi-Analytic Resampling in Lasso

This implies that a rather tight threshold is required for microscopic quantities such as βi and Wi. Unless explicitly mentioned, this value ϵ = 10 10 is used below.

4.1.2. Comparison with state evolution

To examine the convergence properties of AMPR, we next see the dynamical behavior of the macroscopic quantities ( χ(t), W (t), MSE(t)) as the algorithm step t proceeds, in comparison with the SE equations (36). Figure 3 shows the plots of them against t for diﬀerent parameters. In all the cases, these macroscopic quantities rapidly take stable values and

0 5 10 15 20 25 30 Iteration step t

0 5 10 15 20 25 30 Iteration step t

0 5 10 15 20 25 30 Iteration step t

0 5 10 15 20 25 30 Iteration step t

Figure 3: Dynamical behavior of macroscopic parameters ( χ(t), W (t), MSE(t)) initialized at ( χ(0), W (0)) = (0, 0) and MSE(0) 1. The result computed by SE is denoted by lines while the one by AMPR is represented by markers, and the agreement is fairly good. (Left) The non-randomized penalty case (w = 1, pw = 0, τ = 1). (Right) The randomized penalty case (w = pw = τ = 1/2). The upper panels are for λ = 1, while the lower ones are of λ = 0.01 for which insets are given to show the MSE values in visible scales. The AMPR result is obtained at N = 20000. The other parameters are ﬁxed at (α, ρ0, σ2 ξ) = (0.5, 0.2, 0.01).

the agreement between the AMPR and SE results is excellent. This demonstrates the fast convergence of AMPR, which does not depend on both N and M. To see a good agreement

Obuchi and Kabashima

with the SE result by just one sample of {β0, X, ξ}, the model dimensionality N is chosen as a rather large value of N = 20000 in the AMPR experiment. Although in Figure 3 the initial condition is ﬁxed at χ(0) = W (0) = β (0) = 0 corresponding to ( χ(0), W (0)) = (0, 0) and MSE(0) 1, we also examined several other initial conditions and conﬁrmed the good agreement between the AMPR and SE results in all the cases.

4.1.3. Application to Bolasso

Bolasso is a variable selection method utilizing the positive probability (23) evaluated by the bootstrap resampling τ = 1 with no penalty randomization w = 1. Its soft version, abbreviated as Bolasso-S (Bach, 2008), selects variables with Πi 0.9 as active variables and other variables are rejected from the support. We here adopt this manner and see the performance of AMPR used for implementing this. The left panel of Figure 4 shows the plot of the true positive ratio (TP) against the false positive ratio (FP) as the values of α change. Bolasso requires scaling the regularization parameter as λ α, and here, it is set to λ = (1/2) α. The other parameters are ﬁxed at (N, ρ0, σ2 ξ) = (1000, 0.2, 0.01). As was theoretically shown (Bach, 2008), TP converges to unity, whereas FP tends to zero in this setup. This demonstrates that the

0 0.05 0.1 0.15 0.2 FP

0 50 100 150 200 250 300 350 400 0

Figure 4: Bolasso experiment at (N, ρ0, σ2 ξ) = (1000, 0.2, 0.01). The semi-analytic result (circle) is plotted with the numerical resampling result (asterisk). (Left) TP is plotted against FP as α increases: TP approaches unity, whereas FP tends to zero. The semi-analytic result completely overlaps that of the numerical resampling. Error bars are omitted for clarity. (Right) Computational time plotted against α. The observed large diﬀerence is attributed to the numerical resampling cost.

model consistency in the variable selection context is recovered. The contribution of this study is a signiﬁcant reduction of the computational time: In the right panel, the actual computational time is compared between the semi-analytic method using AMPR and the direct numerical resampling, by plotting it against α. The computational costs of AMPR and Glmnet are both scaled as O(NM) and thus should be scaled linearly with respect to α for ﬁxed N. Figure 4 clearly shows this linearity. The signiﬁcant diﬀerence in compu-

Semi-Analytic Resampling in Lasso

tational time is fully attributed to the numerical resampling cost. This demonstrates the eﬃciency of AMPR. It should be noted that the average over 10 diﬀerent samples of D is taken in Figure 4 to obtain smooth curves and error bars.

4.1.4. Application to stability selection

SS is another variable selection method utilizing positive probability. The diﬀerence from Bolasso is the presence of the penalty coeﬃcient randomization (w < 1) and that τ is set to 0.5. These introduce a further stochastic variation in the method, and consequently they tend to show a clearer discrimination in the positive probabilities between the variables in and outside the true support. AMPR can easily implement this, and here, it is compared with numerical resampling. Figure 5 shows the plots of the positive probability values against λ, the so-called stability path (Meinshausen and B uhlmann, 2010), and the computational time for obtaining the stability path against the covariate dimensionality N. The other parameters are ﬁxed at (α, ρ0, σ2 ξ, w, pw) = (2, 0.2, 0.01, 0.5, 0.5). Drawing all stability paths {Πi}i leads to an

10-2 10-1 100 101 0

Positive probabilities

102 103 104 100

Figure 5: SS experiment at (α, ρ0, σ2 ξ, w, pw) = (2, 0.2, 0.01, 0.5, 0.5). The semi-analytic result (circle) is plotted with the numerical resampling result (asterisk). (Left) Plots of the medians (points) and q-percentiles (bars) of {Πi}i S0 (TP) and {Πi}i/ S0 (FP) with q = 16 and 84 against λ for N = 8000. The semi-analytic result well overlaps that of the numerical resampling. There is a clear gap between TP and FP, suggesting that an accurate variable selection is possible. (Right) Computational time plotted against N on double-logarithmic scale. The dominant reason for the diﬀerence is again the numerical resampling cost.

unclear plot; thus, only the median and q-percentiles of {Πi}i S0 and those of {Πi}i/ S0 are plotted, which are denoted by TP and FP in the left panel of Figure 5, respectively. The medians are represented by points, and the percentiles of q = 16 and 84, approximately corresponding to one-sigma points in normal distributions, are represented by bars. There is a clear gap between TP and FP, suggesting that an accurate variable selection is possible by setting an appropriate threshold of probability value, as discussed by Meinshausen and

Obuchi and Kabashima

B uhlmann (2010). The right panel shows a plot of the computational time by AMPR and by the numerical resampling. Both are supposed to be scaled as O(NM = N2). Although such a scaling is not observed in the AMPR result, it is expected to appear for larger N. Again, a signiﬁcant diﬀerence in computational time is present between the semi-analytic and the numerical resampling methods, demonstrating the eﬀectiveness of the proposed method.

4.1.5. Correlated covariates

In the derivation of AMPR and the associated SE equations, the weakness of correlations between covariates is assumed, but this assumption does not necessarily hold in realistic situations. Hence, it is important to check the accuracy of results obtained by AMPR for datasets with correlated covariates. Here we numerically examine this point. To introduce correlations into the simulated dataset described above in a systematic way, we generate our covariates {xi}N i=1 in the following manner: As a common component we ﬁrst generate a vector xcom RM each component of which is i.i.d. from N(0, 1/N); choose a number 0 rcom < 1 controlling the ratio of the common component and generate a binary vector mi each component of which independently takes 1 with probability rcom; take another vector xi RM each component of which is i.i.d. from N(0, 1/N), and generate a covariate vector xi as a linear combination between xcom and xi with using mi as a mask. These operations can be summarized in the following equation:

xi = xcom mi(rcom) + xi (1 mi(rcom)), (37)

where denotes Hadamard (component-wise) product. The covariates correlations monotonically increase as rcom grows. To quantify this, we compute an overlap between the co-

variates, overlapij = x i xj/ q

, and plot its mean value over i and j ( = i) in

Figure 6. Keeping in mind this quantitative information, we check the accuracy of AMPR below. To directly see the accuracy of AMPR for correlated covariates, we plot the experimental values of the quantities {βi, Wi, Πi}i against those of AMPR in Figure 7 for two diﬀerent values of the common component ratio, rcom = 0.4 and rcom = 0.8. For rcom = 0.8, the AMPR result shows a clear deviation from the experimental one, while for rcom = 0.4 it still exhibits a good agreement with the experiment. Although Figure 7 is the speciﬁc result to the non-random penalty case (w = 1, τ = 1) with parameters (N, α, ρ0, σ2 ξ, λ) = (1000, 0.5, 0.2, 0.01, 1), we have tested several diﬀerent cases and observed similar tendency. To capture more global and quantitative information, we introduce a normalized MSE of β between the experimental β exp and AMPR β AMPR values as

β exp i β AMPR i 2

β AMPR i 2 . (38)

The counterparts for W and Π are deﬁned in the same way. Plots of the normalized MSEs for these quantities against rcom are given in Figure 8. Here the results for both the randomized (w = pw = τ = 0.5) and non-randomized (w = τ = 1, pw = 0) penalty

Semi-Analytic Resampling in Lasso

0 0.2 0.4 0.6 0.8 1 0

Mean overlap

Figure 6: Plot of mean overlap of covariates against the common component ratio rcom. This is a result of a single numerical simulation at (N, α) = (1000, 0.5), but the invariability of the result against changing the parameters has been also checked.

-4 -3 -2 -1 0 1 2 Semi-analitic

0 0.5 1 1.5 2 Semi-analitic

0 0.2 0.4 0.6 0.8 1 Semi-analitic

-2 -1.5 -1 -0.5 0 0.5 Semi-analitic

0 0.5 1 1.5 2 2.5 Semi-analitic

0 0.2 0.4 0.6 0.8 1 Semi-analitic

Figure 7: Experimental values of βi (left), Wi (middle), and Πi (right) are plotted against those computed by AMPR for correlated covariates with rcom = 0.4 (upper) and rcom = 0.8 (lower). For the lower panels, there exists a systematic deviation in the AMPR result. Here the non-random penalty case (w = 1, τ = 1) is treated and the other parameters are (N, α, ρ0, σ2 ξ, λ) = (1000, 0.5, 0.2, 0.01, 1).

Obuchi and Kabashima

0 0.2 0.4 0.6 0.8 1 0

Normalized MSE

0 0.2 0.4 0.6 0.8 1 0

Normalized MSE

0 0.2 0.4 0.6 0.8 1 0

Normalized MSE

0 0.2 0.4 0.6 0.8 1 0

Normalized MSE

Figure 8: Plots of normalized MSEs of β, W, Π against rcom of the non-randomized (left, w = 1, pw = 0, τ = 1) and randomized (right, w = pw = τ = 1/2) penalty cases. The upper panels are for λ = 1 while the lower ones are of λ = 0.01. The other parameters are (N, α, ρ0, σ2 ξ) = (1000, 0.5, 0.2, 0.01).

Semi-Analytic Resampling in Lasso

cases are presented, with two diﬀerent values of λ. This suggests that AMPR is practically reliable up to a certain level of correlations: For example if rcom 0.6, the normalized MSE of β is well suppressed and is commonly less than 0.2. According to Figure 6, rcom = 0.6 corresponds to the mean overlap value of about 0.36 which is not small. This speaks for the eﬀectiveness of AMPR even for real-world datasets, as long as the correlations are not too large. The above discussion clariﬁes the accuracy of AMPR for correlated covariates, but a more crucial issue is in its convergent property. For the non-correlated case of rcom = 0, AMPR converges very rapidly as shown in Figure 3, but for correlated cases it tends to badly converge and can even diverge. A common way to overcome this is to introduce a damping factor γ in the update (Caltagirone et al., 2014; Rangan et al., 2014), and we adopted this in the above experiments. The update with the damping factor can be symbolized as

θ(t+1) = (1 γ)θ(t) + γG(θ(t)), (39)

where θ is a variable summarizing (β, χ, W ) and G represents the AMPR operations. The original message passing update corresponds to γ = 1. Smaller values of γ are better for the stability of the algorithm but takes a longer time until the convergence. Our experiments on the simulated dataset seemingly show that a smaller value of γ is needed for larger rcom

and smaller λ. For example for obtaining Figure 7, we needed to set γ < 0.1 for avoiding divergence. This value is found experimentally, and it is desired to ﬁnd a more principled way to choose an appropriate value of γ or a more eﬀective way to better control the convergence of AMP type algorithms. Some earlier studies tackled this problem (Caltagirone et al., 2014; Rangan et al., 2014), but the complete understanding of the convergent property is still missing. Investigation along this direction is an important topic but is beyond the purpose of the present paper.

4.2. Real-world dataset

In this subsection, AMPR is applied to a real-world dataset, and the stability path is computed for examining the relevant covariates or variables. The dataset treated here is the wine quality dataset (Lichman, 2013): The data size is M = 4898 (only white wine is treated), and the number of covariates, which represent physicochemical aspects of wine, is N0 = 11. The basic aim of this dataset is to model wine quality in terms of the physicochemical aspects only, and the response is an integer-valued quality score from zero to ten obtained by human expert evaluation. This dataset can be used in both classiﬁcation and regression, and the linear model was also tested in earlier studies (Cortez et al., 2009; Melkumova and Shatskikh, 2017). Lasso and SS are applied to this dataset. As preprocessing, Nnoise noise variables are added into the dataset. Each component of the noise variables is i.i.d. from N(0, 1/N). The usual standardization, zeroing the mean of all variables and responses and normalizing the variables to be of unit norm, is also conducted. The reason of the introduction of the noise variables is to make an objective criterion for judging relevant stability paths. The noise variables are by deﬁnition irrelevant for describing the responses and hence if some stability paths of original variables behave similarly to the ones of the noise variables then those variables can be regarded as irrelevant. A distribution of stability paths of the noise variables thus deﬁnes a kind of rejection region ,

Obuchi and Kabashima

resolving the arbitrariness of judging criterion of the positive probability both in Bolasso and SS. As far as we have searched, this kind of active construction of rejection regions has never been observed in literatures, and hence this proposition itself can be regarded as a part of new results of this paper. This sort of manipulation to dataset is usually unfavored because it increases the computational cost and tends to contaminate the dataset. However thanks to AMPR, the ﬁrst computational issue becomes less serious since the computational time of AMPR increases just linearly with respect to the number of added noise variables Nnoise. The second issue does not seem to be serious either for the wine quality dataset, because the dataset size M = 4898 is very large compared to the number of original variables N0 = 11. Below, we experimentally check if this strategy works well or not. Let us start from checking how the noise variables inﬂuence on the dataset. To this end, we plot the solution paths and the generalization errors estimated by 10-fold cross validation (CV) in Figure 9 in the cases with and without the noise variables. This ﬁgure

10-1 100 101

Coefficients

10-1 100 101

Coefficients

10-1 100 101

10-1 100 101

Figure 9: Solution paths (top) and CV errors plotted against λ (bottom), for Nnoise = 0 (left) and Nnoise = 689 (right). The vertical dotted line denotes the location of the CV error minimum, λmin, whereas the vertical straight line represents the location selected by the one-standard-error rule that deﬁnes the optimal λ value, λopt. The eﬀect of the noise variables addition is weak around λopt.

Semi-Analytic Resampling in Lasso

Table 1: Overlap of the 3rd covariate (citric acid) and the others index i 1 2 3 4 5 6 7 8 9 10 11 overlap x i x3 0.29 -0.15 1.00 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.07

indicates that the solution paths of the original variables and the CV error value are stable against the introduction of noise variables, particularly in the relevant range of λ. At the optimal λ (λopt), chosen by the one-standard-error rule (Trevor et al., 2009), the variables in the support are common in both cases: They are indexed as 1, 2, 4, 5, 6, 10, and 11, which represent ﬁxed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide, sulphates, and alcohol, respectively. Hence, we can conclude that the introduction of the noise variables hardly aﬀects the estimates of the original variables, and we can eﬀectively use the noise variables to judge the signiﬁcance of each original variable. Subsequently, the result of applying SS with (τ, w, pw) = (0.5, 0.5, 0.5) to this dataset is examined. Figure 10 shows the plots of the stability paths according to the three categories of variables: The important variables are deﬁned as being in the support, at λopt in the Lasso analysis, and thus their indices are 1, 2, 4, 5, 6, 10, 11; the neutral variables are the other variables in the original dataset, and the corresponding indices are 3, 7, 8, 9; the remaining are the noise variables. Both the results of AMPR (circle) and the direct numerical resampling (asterisk) are shown (the same variable is depicted in the same color). Consequences of Figure 10 are three-fold. The ﬁrst one is consistent agreement between the results of AMPR and the direct numerical resampling. For all categories, the two curves of each stability path Πi(λ) by AMPR and the direct numerical resampling are very close to each other. This is an additional evidence supporting the accuracy of the proposed semianalytic method in real-world datasets with correlated covariates. The computational time for obtaining these results in an experiment is 1077 s by AMPR and 2859 s by the numerical resampling, demonstrating the eﬃciency of AMPR. It should be stressed that this eﬃciency can be enhanced by optimizing the implementation of AMPR. The second consequence is the diﬀerent behaviors of stability paths in diﬀerent categories. The positive probabilities of the important variables (upper left) are growing largely even for λ > λmin, while those of the noise variables (lower left) do not grow well unless λ drops below λmin. The behavior of the neutral variables (upper right) is somewhat elusive, and a better interpretation is provided by utilizing the noise variables, which is the third consequence: As discussed above, we can deﬁne a rejection region from the distribution of stability paths of the noise variables; an actual deﬁnition here is the q-percentiles with q = 16 and 84 of the distribution, which is in the lower right panel depicted by red bars with the median (red markers) of the distribution, the legend of which is given as FP, in the same manner as Figure 5. This analysis shows that citric acid and total sulfur dioxide of the neutral variables tend to be in the rejection region, implying that they are irrelevant for modeling wine quality. Moreover, density and p H are well beyond the interval of the rejection region, and they can be regarded as relevant, even though the behavior of density s path is rather tricky. The above conclusion of citric acid s irrelevance contradicts that in Cortez et al. (2009), where citric acid was concluded to be the fourth most important variable. This may be explained by the collinearity of the covariates. In Table 1, the covariates overlap between citric acid and the others is summarized. This shows a strong collinearity of the 3rd variable

Obuchi and Kabashima

10-1 100 101 0

1 Important variables

1 fixed acidity 2 volatile acidity 4 residual sugar 5 chlorides 6 free sulfur dioxide 10 sulphates 11 alcohol

10-1 100 101 0

1 Neutral variables

3 citric acid 7 total sulfur dioxide 8 density 9 p H

10-1 100 101 0

1 Noise variables

10-1 100 101 0

1 Neutral variables and 16% interval

3 citric acid 7 total sulfur dioxide 8 density 9 p H FP (16%)

Figure 10: Stability paths for the wine quality data with Nnoise = 689. The paths of the important variables are in the upper left, those of the neutral variables are in the upper right, and those of the noise variables are in the lower left panels. For the lower left panel, the paths of only a part of the noise variables are shown. The vertical dotted and straight lines denote λmin and λopt, respectively, as in Figure 9. Circles denote the result by AMPR and asterisks represent that by the direct numerical resampling, and the same variable is depicted in the same color; the semi-analytic and numerical results show a consistent agreement for all cases. The lower right panel is the simultaneous plot of the stability paths of the neutral variables and the FP interval computed from paths of the noise variables.

Semi-Analytic Resampling in Lasso

with several others, particularly with the 1st variable. This implies that citric acid can be replaced by the 1st variable or ﬁxed acidity. This is a plausible explanation because the proposed model puts considerably more weight on ﬁxed acidity than on citric acid, whereas the opposite is the case in Cortez et al. (2009). Further thorough comparison will be required to determine which model is better. Table 1 also implies why AMPR works well for the wine quality dataset: The maximum value of the covariates s overlap is about 0.29 which corresponds rcom 0.5 in Figure 6; Figure 8 shows that the accuracy of AMPR is commonly good around that value of rcom, explaining the accuracy of AMPR. This consideration indicates that given a new dataset we can judge whether AMPR will give an accurate result or not for the dataset from the covariates overlap and Figures 6 and 8. Overall, the analysis using stability paths provides richer information that cannot be obtained by solely using Lasso. A new objective criterion could be proposed for determining the relevance of variables by utilizing the distribution of stability paths of the added noise variables. These facts highlight the eﬀectiveness of the resampling strategy in variable selection, and the proposed semi-analytic method can implement this strategy in a computationally eﬃcient manner.

5. Conclusion

An approximate method was developed for performing resampling in Lasso in a semianalytic manner. The replica method enables us to analytically take the resampling average over the given data and the average over the penalty coeﬃcient randomness, and the resultant replicated model is approximately handled by a Gaussian approximation using the cavity method. A message passing algorithm named AMPR is thus derived, the computational cost of which is O(NM) per iteration and is suﬃciently reasonable. Its convergence in O(1) iterations is guaranteed by a state evolution analysis when covariates are given as zero-mean i.i.d. Gaussian random variables. We demonstrated how it actually works through numerical experiments using simulated and real-world datasets. Comparison with direct numerical resampling has evidenced its approximation accuracy and eﬃciency in terms of computational cost, even for covariates with correlations of a moderate level. AMPR was also employed to approximately perform Bolasso and SS, and it was applied to the wine quality dataset (Lichman, 2013; Cortez et al., 2009). To provide a ﬁner quantitative analysis of the dataset, an objective criterion was proposed for determining the relevance of the stability paths by processing the added noise variables, yielding reasonable results in satisfactory computational time. An advantage of the present framework is its generality. For example, its extension to a generalized linear model is straightforward. This is an immediate future research direction. Extensions to other resampling techniques, such as the multiscale bootstrap method (Shimodaira et al., 2004), would also be interesting. An unsatisfactory aspect of the present AMPR is that the correlations between covariates are neglected by the approximation. This is a clear drawback, and certain issues arise when the present AMPR is applied to problems involving signiﬁcantly correlated covariates, although numerical experiments showed that the accuracy of AMPR is still good in the presence of correlations of a moderate level. This drawback may be overcome using more

Obuchi and Kabashima

sophisticated approximations, such as the expectation propagation or the adaptive TAP method (Opper and Winther, 2001a,b, 2005; Kabashima and Vehkapera, 2014; C akmak et al., 2014; Cespedes et al., 2014; Rangan et al., 2016; Ma and Ping, 2017; Takeuchi, 2017). Applying those approximations with retaining the beneﬁt of message passing algorithms, the low computational cost, is still a nontrivial challenge and promising future work. Resampling is a very versatile framework applicable to various contexts and models in statistics and machine learning. Reducing its computational cost by extending the present method will thus be beneﬁcial in various ﬁelds, and can even be imperative, as available data in society will continue to increase rapidly.

Acknowledgement

This work was supported by JSPS KAKENHI Nos. 25120013 and 17H00764. TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation. The authors thank Hideitsu Hino for useful comments.

Francis R Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th international conference on Machine learning, pages 33 40. ACM, 2008.

Onureena Banerjee, Laurent El Ghaoui, Alexandre d Aspremont, and Georges Natsoulis. Convex optimization techniques for ﬁtting sparse gaussian graphical models. In Proceedings of the 23rd international conference on Machine learning, pages 89 96. ACM, 2006.

Jean Barbier, Nicolas Macris, Mohamad Dia, and Florent Krzakala. Mutual information and optimality of approximate message-passing in random linear estimation. ar Xiv preprint ar Xiv:1701.05823, 2017.

Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57 (2):764 785, 2011.

Francesco Caltagirone, Lenka Zdeborov a, and Florent Krzakala. On convergence of approximate message passing. In 2014 IEEE International Symposium on Information Theory, pages 1812 1816, June 2014. doi: 10.1109/ISIT.2014.6875146.

Burak C akmak, Ole Winther, and Bernard H Fleury. S-amp: Approximate message passing for general matrix ensembles. In Information Theory Workshop (ITW), 2014 IEEE, pages 192 196. IEEE, 2014.

Javier Cespedes, Pablo M Olmos, Matilde S anchez-Fern andez, and Fernando Perez-Cruz. Expectation propagation detection for high-order high-dimensional mimo systems. IEEE Transactions on Communications, 62(8):2840 2849, 2014.

Semi-Analytic Resampling in Lasso

Paulo Cortez, Ant onio Cerdeira, Fernando Almeida, Telmo Matos, and Jos e Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547 553, 2009.

David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914 18919, 2009.

Viktor Dotsenko. Introduction to the replica theory of disordered statistical systems, volume 4. Cambridge University Press, 2005.

Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.

Max Grazier G Sell, Jonathan Taylor, and Robert Tibshirani. Adaptive testing for the graphical lasso. ar Xiv preprint ar Xiv:1307.4765, 2013.

Edwin Hewitt and Leonard J Savage. Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80(2):470 501, 1955.

Adel Javanmard and Andrea Montanari. Conﬁdence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869 2909, 2014a.

Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522 6554, 2014b.

Adel Javanmard and Andrea Montanari. De-biasing the lasso: Optimal sample size for gaussian designs. ar Xiv preprint ar Xiv:1508.02757, 2015.

Yoshiyuki Kabashima. A cdma multiuser detection algorithm on the basis of belief propagation. Journal of Physics A: Mathematical and General, 36(43):11111, 2003.

Yoshiyuki Kabashima and Mikko Vehkapera. Signal recovery using expectation consistent approximation for linear observations. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 226 230. IEEE, 2014.

Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of statistics, pages 1356 1378, 2000.

M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ ml.

Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A signiﬁcance test for the lasso. Annals of statistics, 42(2):413, 2014.

Obuchi and Kabashima

Junjie Ma and Li Ping. Orthogonal amp. IEEE Access, 5:2020 2033, 2017.

D orthe Malzahn and Manfred Opper. An approximate analytical approach to resampling averages. Journal of Machine Learning Research, 4(Dec):1151 1173, 2003.

Nicolai Meinshausen and Peter B uhlmann. Consistent neighbourhood selection for sparse high-dimensional graphs with the lasso. Seminar f ur Statistik, Eidgen ossische Technische Hochschule (ETH), Z urich, 2004.

Nicolai Meinshausen and Peter B uhlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417 473, 2010.

Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for highdimensional data. The Annals of Statistics, pages 246 270, 2009.

LE Melkumova and S Ya Shatskikh. Comparing ridge and lasso estimators for data analysis. Procedia Engineering, 201:746 755, 2017.

Marc M ezard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientiﬁc Publishing Company, 1987.

Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM journal on computing, 24(2):227 234, 1995.

Hidetoshi Nishimori. Statistical physics of spin glasses and information processing: an introduction, volume 111. Clarendon Press, 2001.

Tomoyuki Obuchi. Matlab package of AMPR. https://github.com/T-Obuchi/AMPR_ lasso_matlab, 2018.

Manfred Opper and Ole Winther. Adaptive and self-averaging thouless-anderson-palmer mean-ﬁeld theory for probabilistic modeling. Physical Review E, 64(5):056131, 2001a.

Manfred Opper and Ole Winther. Tractable approximations for probabilistic models: The adaptive thouless-anderson-palmer mean ﬁeld approach. Physical Review Letters, 86(17): 3695, 2001b.

Manfred Opper and Ole Winther. Expectation consistent approximate inference. Journal of Machine Learning Research, 6(Dec):2177 2204, 2005.

Sundeep Rangan, Philip Schniter, and Alyson Fletcher. On the convergence of approximate message passing with arbitrary matrices. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 236 240. IEEE, 2014.

Sundeep Rangan, Philip Schniter, and Alyson Fletcher. Vector approximate message passing. ar Xiv preprint ar Xiv:1610.03082, 2016.

Hidetoshi Shimodaira et al. Approximately unbiased tests of regions using multistepmultiscale bootstrap resampling. The Annals of Statistics, 32(6):2616 2641, 2004.

Semi-Analytic Resampling in Lasso

Takashi Takahashi and Yoshiyuki Kabashima. A statistical mechanics approach to debiasing and uncertainty estimation in lasso for random measurements. Journal of Statistical Mechanics: Theory and Experiment, 2018(7):073405, 2018.

Keigo Takeuchi. Rigorous dynamics of expectation-propagation-based signal recovery from unitarily invariant measurements. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 501 505. IEEE, 2017.

Michel Talagrand. Spin glasses: a challenge for mathematicians: cavity and mean ﬁeld models, volume 46. Springer Science & Business Media, 2003.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267 288, 1996.

Hastie Trevor, Tibshirani Robert, and Friedman Jerome. The Elements of Statistical Learning; Data Mining, Inference, and Prediction. Springer-Verlag New York, 2009. doi: 10.1007/978-0-387-84858-7.

Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1 constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183 2202, 2009.

Ming Yuan and Yi Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):143 161, 2007.

Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine learning research, 7(Nov):2541 2563, 2006.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418 1429, 2006.