# scalable_spikeandslab__5986cdcf.pdf Scalable Spike-and-Slab Niloy Biswas 1 Lester Mackey 2 Xiao-Li Meng 1 Spike-and-slab priors are commonly used for Bayesian variable selection, due to their interpretability and favorable statistical properties. However, existing samplers for spike-and-slab posteriors incur prohibitive computational costs when the number of variables is large. In this article, we propose Scalable Spike-and-Slab (S3), a scalable Gibbs sampling implementation for high-dimensional Bayesian regression with the continuous spike-and-slab prior of George & Mc Culloch (1993). For a dataset with n observations and p covariates, S3 has order max{n2pt, np} computational cost at iteration t where pt never exceeds the number of covariates switching spikeand-slab states between iterations t and t 1 of the Markov chain. This improves upon the order n2p per-iteration cost of state-of-the-art implementations as, typically, pt is substantially smaller than p. We apply S3 on synthetic and real-world datasets, demonstrating orders of magnitude speed-ups over existing exact samplers and significant gains in inferential quality over approximate samplers with comparable cost. 1. Introduction 1.1. Bayesian computation in high dimensions We consider linear, logistic, and probit regression in high dimensions, where the number of observations n is smaller than the number of covariates p. This setting is common in modern applications such as genome-wide association studies (Guan & Stephens, 2011; Zhou et al., 2013) and astronomy (Kelly, 2007; Sereno, 2015). In the non-Bayesian paradigm, sparse point estimates such as the LASSO (Tibshirani, 1996), Elastic Net (Zou & Hastie, 2005) and SLOPE (Bogdan et al., 2015) offer a route to variable selection. 1Department of Statistics, Harvard University 2Microsoft Research New England. Correspondence to: Niloy Biswas . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). These estimates are based on optimization based approaches, which are computationally efficient and scale to datasets with hundreds of thousands of covariates. In the Bayesian paradigm, which will be our focus, one places a prior on the unknown parameters of interest and considers the corresponding posterior distribution. Sampling algorithms such as Markov chain Monte Carlo (MCMC) are then used to simulate from the posterior distribution. In modern high dimensional settings, general-purpose MCMC algorithms can have high computational cost per iteration. This has kindled a line of work on tailored algorithms for Bayesian regression (e.g., Polson et al., 2013; Yang et al., 2016; Narisetty et al., 2019; Johndrow et al., 2020; Biswas et al., 2022). Our manuscript participates in this wider effort to scale Bayesian inference to large data applications. Specifically, we propose computationally efficient MCMC algorithms for high-dimensional Bayesian linear, logistic and probit regression with spike-and-slab priors. 1.2. Variable selection with spike-and-slab priors Consider Gaussian linear regression, logistic regression, and probit regression with n observations and p covariates. The respective likelihoods are given by Llin(β, σ2; X, y) = 1 (2πσ2)n/2 exp Pn i=1(yi x i β)2 Llog(β; X, y) = Qn i=1 exp(yix i β) 1+exp(x i β), and Lprob(β; X, y) = Qn i=1 Φ(x i β)yi(1 Φ(x i β))1 yi. 
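To make the three likelihoods concrete, the following minimal NumPy/SciPy sketch evaluates the corresponding log-likelihoods. The function names and the {0, 1} response coding are illustrative conventions only and are not part of the ScaleSpikeSlab package.

```python
import numpy as np
from scipy.stats import norm

def log_lik_linear(beta, sigma2, X, y):
    # Gaussian linear regression: -n/2 log(2 pi sigma^2) - ||y - X beta||^2 / (2 sigma^2).
    resid = y - X @ beta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

def log_lik_logistic(beta, X, y):
    # Logistic regression with y in {0, 1}: sum_i [ y_i x_i'beta - log(1 + exp(x_i'beta)) ].
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def log_lik_probit(beta, X, y):
    # Probit regression with y in {0, 1}: sum_i [ y_i log Phi(x_i'beta) + (1 - y_i) log Phi(-x_i'beta) ].
    eta = X @ beta
    return np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
```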
Here X Rn p is the design matrix with rows x i , y Rn (for linear regression) or y {0, 1}n (for logistic and probit regression) is the response vector, β Rp is the unknown signal, σ2 (0, ) is the unknown Gaussian noise variance, and Φ is the cumulative density function of N(0, 1). We focus on the high-dimensional setting with n p, where β Rp is assumed to be sparse. We use a continuous spike-and-slab prior on β to capture sparsity: σ2 Inv Gamma a0 , (for linear regression) σ2 = 1, (for logistic and probit regression) zj i.i.d. j=1,...,p Bernoulli(q), βj|zj,σ2 ind j=1,...,p(1 zj)N(0,σ2τ 2 0 )+zj N(0,σ2τ 2 1 ), (1) where q (0, 1), τ 2 1 τ 2 0 > 0, and a0, b0 > 0 are hyperparameters. Here, N(0, σ2τ 2 0 ) and N(0, σ2τ 2 1 ) correspond to the spike and slab parts of the prior respectively. In the high-dimensional setting, a small constant q 1 is often chosen, but the algorithms in this manuscript also readily extend to hierarchical variants of (1) with a hyperprior placed on q (Scott & Berger, 2010; Castillo & van der Vaart, 2012). Catalyzed by the works of George & Mc Culloch (1993; 1997), continuous spike-and-slab priors are now a mainstay of Bayesian variable selection (see the recent reviews of Tadesse & Vannucci (2021, Section I) and Banerjee et al. (2021)). The posterior probabilities P(zj = 1|y) provide a natural interpretable approach to variable selection. The median probability model selects all covariates j such that P(zj = 1|y) > 1/2, and it is easily fitted using Monte Carlo samples and provides the optimal predictive model in the case with orthogonal design matrix, as well as extensions with certain correlated matrices (Barbieri & Berger, 2004; Barbieri et al., 2021). Narisetty & He (2014) have further fine-tuned the optimal scaling of q, τ 2 0 , and τ 2 1 with respect to the number of covariates and sample size to establish model selection consistency for linear regression with general design matrices in high dimensions. One could alternatively consider point-mass spike-and-slab priors (e.g., Mitchell & Beauchamp, 1988; Johnson & Rossell, 2012), where βj|zj (1 zj)δ0( )+zj N(0, σ2τ 2 1 ) such that a degenerate Dirac distribution about zero is chosen for the spike part. Point-mass priors have favorable statistical properties (e.g., Johnstone & Silverman, 2004; Castillo & van der Vaart, 2012) but can be computationally prohibitive (Bai et al., 2021). We hope to extend our algorithms to point-mass priors in follow-up work. 1.3. Our contributions Our contributions are summarized below. Throughout, we use O and Ωto respectively denote asymptotic upper and lower bounds on computational complexity growth rates. Section 2 introduces Scalable Spike-and-Slab (S3), a computationally efficient implementation of Gibbs samplers for linear and logistic regression with the prior given in (1). Section 2.1 investigates the computational bottlenecks of state-of-the-art (SOTA) implementations, which require Ω(n2p) computational cost per iteration for datasets with n observations and p covariates. Section 2.2 develops S3, which overcomes existing computational bottlenecks by employing a pre-computation based strategy and requires O(max{n2pt, np}) computational cost at iteration t, where pt is no greater than the number of covariates switching spike-and-slab states between iterations t and t 1 of the Markov chain. 
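As a concrete illustration of the hierarchy in (1), the sketch below draws (σ², z, β) from the prior. The hyperparameter values are placeholders, and the Inverse-Gamma(a0/2, b0/2) parameterization for σ² is inferred from the full conditional appearing later in Algorithm 2, so this should be read as a sketch rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (placeholders, not the settings used in the paper's experiments).
p, q = 1000, 0.01
tau0_sq, tau1_sq = 1e-4, 1.0
a0, b0 = 1.0, 1.0

# Linear regression: sigma^2 ~ InvGamma(a0/2, b0/2); logistic and probit regression fix sigma^2 = 1.
sigma2 = 1.0 / rng.gamma(shape=a0 / 2, scale=2.0 / b0)

# z_j ~ Bernoulli(q) i.i.d., then beta_j | z_j, sigma^2 is N(0, sigma^2 tau0^2) (spike)
# or N(0, sigma^2 tau1^2) (slab).
z = rng.binomial(1, q, size=p)
beta = rng.normal(0.0, np.sqrt(sigma2 * np.where(z == 1, tau1_sq, tau0_sq)))
```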
Section 2.3 analyzes the favourable computational complexity of S3, showing that pt is typically much smaller than p and that it can remain constant and even approach zero under various limiting regimes as p increases. Section 3 compares S3 with the SOTA exact MCMC sampler and a recently proposed approximate MCMC sampler, which does not converge to the posterior distribution of interest. We demonstrate that S3 offers substantially faster numerical runtimes compared to the SOTA exact MCMC sampler, reporting 50 speedups on synthetic datasets. In the same experiment, S3 and the approximate sampler have comparable runtimes, but the asymptotically exact S3 procedure provides more accurate variable selection. Section 4 demonstrates the benefits of S3 on a diverse suite of datasets, including two synthetic datasets and eight realworld experiments. For example, on a genome-wide association study (GWAS) dataset with with n 2000 and p 100000, we again observe 50 computational speedups over the SOTA exact MCMC sampler. Finally, Section 5 discusses directions for future work. The open-source packages Scale Spike Slab in R and Python (www.github.com/ niloyb/Scale Spike Slab) implement our methods and recreate the experiments in this paper. 2. Scalable Spike-and-Slab 2.1. Status quo and computational bottlenecks Gibbs samplers have long been employed to sample from the posterior distributions corresponding to the prior in (1) (e.g., George & Mc Culloch, 1993; O Brien & Dunson, 2004; Held & Holmes, 2006; Polson et al., 2013). The computational bottleneck of existing Gibbs samplers is linked to sampling from the full conditional of β Rp. This is given by βt+1|zt, σ2 t N Σ 1 t X y, σ2 t Σ 1 t (2) for Σt =X X + Dt, where t indexes the iteration of the Markov chain, and Dt is the diagonal matrix with the vector ztτ 2 1 + (1p zt)τ 2 0 populating its diagonal elements. Sampling from (2) using standard matrix multiplication and a generic Cholesky decomposition that ignores the specific structure of Σt requires Ω(p3) computational cost, which quickly becomes prohibitive for large p. Hereafter we will refer to this generic method as the Na ıve Sampler. Ishwaran & Rao (2005) recommend separating the components of β into B blocks of size p/B each, and then updating each block using Gibbs sampling, which gives a reduced Ω(p3B 2) computational cost. However, this cost remains prohibitive for large p, and using a larger number of blocks B induces higher auto-correlation between successive iterations of the Gibbs sampler. Recently, Bhattacharya et al. (2016) developed an algorithm based on the Woodbury matrix identity (Hager, 1989) to sample from multivariate Gaussian distributions of the form in (2), which requires a more favourable Ω(n2p) computational cost and is given in Algorithm 1. Algorithm 1 An Ω(n2p) sampler of (2) (Bhattacharya et al., 2016) Sample r N(0, Ip), ξ N(0, In). Set u = D 1 2 t r and calculate v = Xu + ξ. Set v = M 1 t ( 1 σt y v) for Mt = In + XD 1 t X . Return β = σt(u + D 1 t X v ). For large-scale datasets with n in the thousands and p in the hundreds of thousands, as found in modern scientific applications, the Ω(n2p) cost per iteration is still too high. This has spurred recent work on approximate MCMC (Narisetty et al., 2019) for the continuous spike-and-slab prior on logistic regression and on variational inference methods for the point-mass spike-and-slab prior (Titsias & L azaro-Gredilla, 2011; Ray et al., 2020; Ray & Szab o, 2021). 
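A minimal NumPy sketch of Algorithm 1 is given below. It assumes the diagonal of D_t^{-1} is passed as a vector `d_inv` of per-coordinate prior variances (τ0² or τ1²) and that M_t^{-1} is supplied; forming M_t = I_n + X D_t^{-1} X^T from scratch is exactly the Ω(n²p) step that S3 avoids.

```python
import numpy as np

def sample_beta(X, y, d_inv, sigma, M_inv, rng):
    """One draw from N(Sigma^{-1} X^T y, sigma^2 Sigma^{-1}) with Sigma = X^T X + D,
    following Bhattacharya et al. (2016). Here d_inv holds the diagonal of D^{-1}
    (the per-coordinate prior variances) and M_inv is the inverse of M = I_n + X D^{-1} X^T."""
    n, p = X.shape
    r, xi = rng.standard_normal(p), rng.standard_normal(n)
    u = np.sqrt(d_inv) * r              # u = D^{-1/2} r ~ N(0, D^{-1})
    v = X @ u + xi                      # v ~ N(0, X D^{-1} X^T + I_n) = N(0, M)
    v_star = M_inv @ (y / sigma - v)    # solve M v* = y / sigma - v
    return sigma * (u + d_inv * (X.T @ v_star))
```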
Such approximate samplers can provide improved computational speeds but do not converge to the posterior distribution of interest. 2.2. A scalable Gibbs sampler We now develop S3. Our key insight is that successive precomputation can be used to reduce the computational cost of Algorithm 1. In Algorithm 1, the Ω(n2p) computational cost per iteration arises from the calculation of the matrix Mt. Current algorithms calculate Mt = In + XD 1 t X from scratch every iteration at Ω(n2p) cost under standard matrix multiplication and then solve an n by n linear system to obtain v at Ω(n3) cost. We propose instead to use the previous state zt 1 and the pre-computed matrices Mt 1 and M 1 t 1 at each step to aid the calculation of M 1 t . Our strategy is given below. Specifically, denote X = [X1, ..., Xp] Rn p and its submatrices XAt [Xj : j At] Rn zt and XAc t [Xj : j Ac t] Rn (p zt ), where At {j : zj,t = 1} and Ac t = {j : zj,t = 0}, are the ordered index sets of covariates corresponding to slab states and to spike states respectively at iteration t, and 1 is the L1 norm on {0, 1}p. Also denote t {j : zj,t = zj,t 1}, the ordered index set of covariates which switch spike-and-slab states between iterations t and t 1; δt zt zt 1 1 = | t|, the number of switches; D t Diag((Dt)j,j : j t) Diag(Rδt δt), the diagonal sub-matrix of Dt composed of the diagonal entries with ordered indices in t; and C t D 1 t D 1 t 1. Finally, let Mτ0 In +τ 2 0 XXT Rn n and Mτ1 In + τ 2 1 XXT Rn n, which are fixed for all iterations. Under this notation, there are three expressions for Mt: Mt = Mτ0 + (τ 2 1 τ 2 0 )XAt X At (3a) = Mτ1 + (τ 2 0 τ 2 1 )XAc t X Ac t (3b) = Mt 1 + X t C t X t. (3c) In (3a) (3c), calculating the matrix products XAt X At, XAc t X Ac t, and X t C t X t requires O(n2 zt 1), O(n2(p zt 1)), and O(n2δt) cost respectively. Given Mτ0, Mτ1, Mt 1, and zt 1, we evaluate whichever matrix product in (3a) (3c) has minimal computational cost and thereby calculate Mt at the reduced cost of O(n2pt) where pt min{ zt 1, p zt 1, δt}. To calculate M 1 t , we consider the cases n pt and pt < n separately. When n pt, we calculate M 1 t by directly inverting the calculated matrix Mt from (3), which requires O(n3) cost. When pt < n, we apply the Woodbury matrix identity on (3). This gives M 1 t = M 1 τ0 (4a) M 1 τ0XAt 1 τ 2 1 τ 2 0 I zt 1 +X At M 1 τ0XAt 1X At M 1 τ0 = M 1 τ1 (4b) M 1 τ1XAc t 1 τ 2 0 τ 2 1 Ip zt 1 +X Ac t M 1 τ1XAc t 1X Ac t M 1 τ1 = M 1 t 1 (4c) M 1 t 1X t C 1 t + X t M 1 t 1X t 1X t M 1 t 1. Given M 1 τ0 , M 1 τ1 , M 1 t 1 and zt 1, we evaluate whichever expression in (4a) (4c) has minimal computational cost to calculate M 1 t . Similar to (3), this requires O(n2pt) computational cost, which arises from matrix inversion and multiplication. Overall, this strategy of using the previous state zt 1 and the pre-computed matrices Mτ0, M 1 τ0 , Mτ1, M 1 τ1 , Mt 1 and M 1 t 1, reduces the computational cost of calculating the matrices Mt and M 1 t from Ω(n2p) (as in all current implementations of Algorithm 1) to O(n2pt). As we show in Sections 2.3 and 3, in many large-scale applications pt is orders of magnitude smaller than both n and p, yielding substantial improvements in computational efficiency. Furthermore, we emphasize that the matrices Mτ0, M 1 τ0 , Mτ1, M 1 τ1 are fixed for all iterations, and the state zt 1 and matrices Mt 1 and M 1 t 1 only need to be stored temporarily to generate samples for iteration t and can be deleted after. 
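The sketch below illustrates the pre-computation strategy in (3) and the Δ_t case of the Woodbury update in (4c). It follows the convention, recovered from the definitions of M_{τ0} and M_{τ1}, that D_t^{-1} has j-th diagonal entry τ1² when z_{j,t} = 1 and τ0² otherwise; it is a simplified illustration under these assumptions, not the ScaleSpikeSlab implementation.

```python
import numpy as np

def update_M(X, z, z_prev, tau0_sq, tau1_sq, M_tau0, M_tau1, M_prev):
    """Form M_t = I_n + X D_t^{-1} X^T from whichever of (3a)-(3c) is cheapest.
    M_tau0 = I_n + tau0^2 X X^T and M_tau1 = I_n + tau1^2 X X^T are fixed pre-computed
    matrices; M_prev is M_{t-1}."""
    slab = np.flatnonzero(z == 1)            # A_t
    spike = np.flatnonzero(z == 0)           # A_t^c
    delta = np.flatnonzero(z != z_prev)      # Delta_t, the covariates that switched states
    p_t = min(len(slab), len(spike), len(delta))

    if p_t == len(delta):                    # (3c): rank-|Delta_t| correction of M_{t-1}
        c = np.where(z[delta] == 1, tau1_sq - tau0_sq, tau0_sq - tau1_sq)
        X_d = X[:, delta]
        return M_prev + (X_d * c) @ X_d.T
    if p_t == len(slab):                     # (3a): correction of M_{tau0} over slab covariates
        X_a = X[:, slab]
        return M_tau0 + (tau1_sq - tau0_sq) * (X_a @ X_a.T)
    X_ac = X[:, spike]                       # (3b): correction of M_{tau1} over spike covariates
    return M_tau1 + (tau0_sq - tau1_sq) * (X_ac @ X_ac.T)

def update_M_inv_delta(X, z, z_prev, tau0_sq, tau1_sq, M_inv_prev):
    """Woodbury update (4c): M_t^{-1} from M_{t-1}^{-1} at O(n^2 |Delta_t|) cost plus an
    O(|Delta_t|^3) inverse, assuming the Delta_t expression is the cheapest of (4a)-(4c)."""
    delta = np.flatnonzero(z != z_prev)
    if len(delta) == 0:
        return M_inv_prev
    c = np.where(z[delta] == 1, tau1_sq - tau0_sq, tau0_sq - tau1_sq)
    X_d = X[:, delta]
    B = M_inv_prev @ X_d                                    # n x |Delta_t|
    core = np.linalg.inv(np.diag(1.0 / c) + X_d.T @ B)      # |Delta_t| x |Delta_t|
    return M_inv_prev - B @ core @ B.T
```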
Therefore S3 requires minimal additional memory compared to current implementations. The full Gibbs samplers for Bayesian linear, logistic, and probit regression which make use of this pre-computation are given in Algorithms 2 and 3. The Gibbs samplers for logistic and probit regression are based on data augmentation strategies (see, e.g., O Brien & Dunson (2004); Narisetty et al. (2019)), and the Gibbs sampler for logistic regression requires an adjusted pre-computation strategy with O max{n2pt, n3, np} cost. Appendix B contains derivations and details of Algorithms 2 and 3 (the implementation of logistic regression is based on a scaled t-distribution approximation to the logistic distribution, as commonly done in the literature; see (Narisetty et al., 2019)). Algorithm 2 Bayesian linear regression with S3 Input: State Ct (βt, zt, σ2 t ) Rp {0, 1}p (0, ), state zt 1, and matrices Mt 1, M 1 t 1. 1: Calculate pt and use (3) to calculate Mt. if pt n then invert Mt to calculate M 1 t else use (4) to calculate M 1 t . 2: Sample βt+1|zt, σ2 t using Algorithm 1 from N Σ 1 t X y, σ2 t Σ 1 t for Σt = X X + Dt. 3: Sample each zj,t+1|βt+1, σ2 t independently from Bernoulli q N(βj,t+1;0,σ2 t τ 2 1 ) q N(βj,t+1;0,σ2 t τ 2 1 )+(1 q)N(βj,t+1;0,σ2 t τ 2 0 ) . for j = 1, ..., p. 4: Sample σ2 t+1|βt+1, zt+1 from Inv Gamma a0+n+p 2 , b0+ y Xβt+1 2 2+β t+1Dt+1βt+1 2 . Output: Ct+1 = (βt+1, zt+1, σ2 t+1), zt, Mt, M 1 t . Algorithm 3 Bayesian logistic & probit regression with S3 Input: State Ct (βt, zt, yt, σ2 t ) Rp {0, 1}p Rn (0, )n, states zt 1, σ2 t 1, and matrices Mt 1, M 1 t 1. 1: Logistic regression: Use pre-computation (see Appendix B.3) to calculate Mt In + W 1/2 t XD 1 t X W 1/2 t for Wt = Diag( σ2 t ). Invert Mt to calculate M 1 t . Probit regression: Calculate pt and use (3) to calculate Mt In + XD 1 t X . if pt n then invert Mt to calculate M 1 t else use (4) to calculate M 1 t . 2: Sample βt+1|zt, yt, σ2 t using Algorithm 1 from N Σ 1 t X W 1 t yt, Σ 1 t for Σt =X W 1 t X+Dt. 3: Sample each zj,t+1|βt+1, yt, σ2 t independently from Bernoulli q N(βj,t+1;0,τ 2 1 ) q N(βj,t+1;0,τ 2 1 )+(1 q)N(βj,t+1;0,τ 2 0 ) for j = 1, ..., p. 4: Sample each yi,t+1|βt+1, zt+1, σ2 t independently from N(x i βt+1, σ2 i,t)I[0,+ ) if yi = 1, N(x i βt+1, σ2 i,t)I( ,0) if yi = 0 for i = 1, ..., n. 5: Logistic regression: Sample each σ2 i,t+1|βt+1, zt+1, yt+1 independently from Inv Gamma v+1 2 , w2ν+( yi,t+1 x i βt+1)2 for i = 1, ..., n, where ν 7.3 and w2 π2(ν 2)/(3ν) are constants. Probit regression: Set each σ2 i,t+1 = 1 for i = 1, ..., n. Output: Ct+1 =(βt+1, zt+1, yt+1, σ2 t+1), zt, Mt, M 1 t 1 2.3. Analysis of computational complexity We now investigate the favorable computational complexity of Algorithms 2 and 3. Proposition 2.1, proved in Appendix A, gives the computational cost of these Gibbs samplers for linear and logistic regression, showing an improvement over existing implementations which have Ω(n2p) cost. Proposition 2.1 (Computational cost). Algorithm 2 and Algorithm 3 for probit regression both have a computational cost of O max{n2pt, np} at iteration t, and Algorithm 3 for logistic regression has a computational cost of O max{n2pt, n3, np} at iteration t, where pt = min{ zt 1, p zt 1, δt} for δt = zt zt 1 1. In Proposition 2.1, zt 1 and p zt 1 are the number of slab covariates and the number of spike covariates respectively at iteration t of the Markov chain, and δt is the number of covariates switching spike-and-slab states between iterations t and t 1 of the Markov chain. 
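For concreteness, here is a naive single iteration of the linear-regression Gibbs sampler in Algorithm 2, with M_t formed and solved from scratch; in S3 the M_t and M_t^{-1} computations would instead reuse the pre-computed quantities of Section 2.2. The D_t convention (diagonal entries 1/τ²) and the Inverse-Gamma parameterization follow the conditionals stated above, and the sketch is not the ScaleSpikeSlab implementation.

```python
import numpy as np

def gibbs_step_linear(X, y, beta, z, sigma2, q, tau0_sq, tau1_sq, a0, b0, rng):
    """One naive iteration of the spike-and-slab Gibbs sampler for linear regression."""
    n, p = X.shape
    # Steps 1-2: beta | z, sigma^2 ~ N(Sigma^{-1} X^T y, sigma^2 Sigma^{-1}), Sigma = X^T X + D,
    # where D^{-1} has j-th entry tau1^2 if z_j = 1 and tau0^2 otherwise.
    d_inv = np.where(z == 1, tau1_sq, tau0_sq)
    M = np.eye(n) + (X * d_inv) @ X.T                    # M_t = I_n + X D^{-1} X^T
    r, xi = rng.standard_normal(p), rng.standard_normal(n)
    u = np.sqrt(d_inv) * r
    v_star = np.linalg.solve(M, y / np.sqrt(sigma2) - (X @ u + xi))
    beta = np.sqrt(sigma2) * (u + d_inv * (X.T @ v_star))
    # Step 3: z_j | beta, sigma^2 ~ Bernoulli of the slab responsibility.
    log_slab = np.log(q) - 0.5 * np.log(sigma2 * tau1_sq) - beta**2 / (2 * sigma2 * tau1_sq)
    log_spike = np.log(1 - q) - 0.5 * np.log(sigma2 * tau0_sq) - beta**2 / (2 * sigma2 * tau0_sq)
    z = rng.binomial(1, 1.0 / (1.0 + np.exp(log_spike - log_slab)))
    # Step 4: sigma^2 | beta, z ~ InvGamma((a0 + n + p)/2, (b0 + ||y - X beta||^2 + beta^T D beta)/2).
    d_new = np.where(z == 1, 1.0 / tau1_sq, 1.0 / tau0_sq)
    resid = y - X @ beta
    rate = (b0 + resid @ resid + beta**2 @ d_new) / 2.0
    sigma2 = 1.0 / rng.gamma((a0 + n + p) / 2.0, 1.0 / rate)
    return beta, z, sigma2
```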
Note that pt min{ zt 1, p zt 1} p/2 directly. In practice, there are a variety of scenarios under which pt is significantly smaller than p/2. Sparse zt. Whenever zt is sparse relative to the full dimensionality p, we have pt zt 1 p. Sparsity in zt is a common occurrence in high-dimensional regression with a sparse signal vector β Rp, as zt 1 often closely approximates the true sparsity s β 0 with high probability. This occurs, for instance, in the settings of Narisetty & He (2014) and Narisetty et al. (2019), where strong model selection consistency of the continuous spike-and-slab posterior is established for linear and logistic regression respectively. Posterior concentration. Even when min{ zt 1, p zt 1} is comparable to p/2, concentration of the spikeand-slab posterior targeted by the Gibbs sampler can lead to δt and hence pt remaining much smaller than p/2. Proposition 2.2, proved in Appendix A, calculates the expectation of δt explicitly in terms of the posterior distribution that is targeted by our algorithms and the auto-correlation of the states (zt)t 0. Proposition 2.2 (Expected spike-and-slab swap count). For δt as given in Proposition 2.1, j=1 P(zj,t = 1)P(zj,t 1 = 0)+ P(zj,t = 0)P(zj,t 1 = 1) 2cov(zj,t, zj,t 1). (5) Suppose the Markov chain generated by Algorithm 2 or 3 is at its stationary distribution π at iteration t 1. Then, j=1 varπ(zj,t)(1 corrπ(zj,t, zj,t 1)). (6) In (6), note varπ(zj,t) = Pπ(zj,t = 0)Pπ(zj,t = 1) 1/4 for each component j, with equality only when zj,t Bernoulli(1/2). Therefore all components j with Pπ(zj,t = 1) close to 0 or 1 do not contribute significantly towards δt in expectation. Such posterior concentration is guaranteed whenever zt is convergent, be it to the true model selection vector as in Narisetty & He (2014) and Narisetty et al. (2019) or to any other value. In such circumstances we can have δt = zt zt 1 1 = o(p) and even δt = o( zt 1), regardless of the magnitude of zt 1. High auto-correlation. High auto-correlation of the Gibbs sampler can also lead to smaller values of δt and hence pt. We already see from Proposition 2.2 that, even for components j with bimodal marginal posterior distributions, high auto-correlation between successive states zj,t 1 and zj,t can yield lower δt in expectation. Proposition 2.3, proved in Appendix A, provides an additional exact expression for δt in terms of the empirical correlation between zt and zt 1. Proposition 2.3 (Swap count decomposition). Let τt = p zt 1(p zt 1) and ρt be the empirical correlation between zt and zt 1 (that is, the correlation between z J,t and z J,t 1 when J is uniform on {1, . . . , p}). Then, for δt as given in Proposition 2.1, δt = zt 1 + zt 1 1 2 zt 1 zt 1 1 + 2ρtτtτt 1 Since |ρt| 1, Proposition 2.3 implies zt 1 zt 1 1 2 p + (τt τt 1)2 zt 1 zt 1 1 2 p + (τt + τt 1)2 The lower bound in (8) is a good approximation to δt when ρt is close to one, which is the case either when the Gibbs sampler is converging (such that zt becomes stable) or when it gets stuck (such that zt changes slowly with t). In either case, τt exhibits similar stable/stuck behavior, implying that the lower bound itself will be close to zero. This suggests δt and hence pt is close to zero when ρt is close to 1, even if min{ zt 1, p zt 1} is not negligible. Motivated by such discussions and theoretical analysis, we now empirically examine how pt grows as the number of observations n, the number of covariates p, and the sparsity s of the true signal varies. Figure 1 is based on synthetic linear regression datasets. 
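Proposition 2.3 is an exact identity, so it can be checked directly. The short sketch below verifies it on a randomly generated pair (z_{t-1}, z_t); the dimension and the number of flipped coordinates are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 1000
z_prev = rng.binomial(1, 0.3, size=p)
z = z_prev.copy()
flip = rng.choice(p, size=25, replace=False)   # flip a few spike-and-slab states
z[flip] = 1 - z[flip]

delta = np.sum(z != z_prev)                    # delta_t = ||z_t - z_{t-1}||_1

# Right-hand side of Proposition 2.3.
s_t, s_prev = z.sum(), z_prev.sum()
tau_t = np.sqrt(s_t * (p - s_t))
tau_prev = np.sqrt(s_prev * (p - s_prev))
rho = np.corrcoef(z, z_prev)[0, 1]             # empirical correlation of (z_{J,t}, z_{J,t-1})
rhs = s_t + s_prev - 2 * (s_t * s_prev + rho * tau_t * tau_prev) / p

assert np.isclose(delta, rhs)                  # the identity holds exactly
```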
For each dataset, one Markov chain is generated using Algorithm 2 to target the corresponding spike-and-slab posterior, from which the mean and one standard error bars of (pt)10000 t=5000 are plotted. For Figure 1 (Left), we consider datasets with n = 100, varying p with 0 1000 2000 3000 Dimension p Average cost parameter pt 100 400 700 1000 No. observations n 0 10 20 Sparsity s pt Sparsity s Figure 1. The S3 cost parameter pt for averaged over iterations 5000 < t 10000 with one standard error bars, for synthetic linear regression datasets with varying number of covariates p (Left), varying number of observations n (Center), and varying sparsity s (Right). The ground truth sparsity s is also plotted for comparison. See Section 2.3 for details. p n, and a sparse true signal β Rp with components β j = 2I{j s} for sparsity s = 10, and noise standard deviation σ = 2. For Figure 1 (Center), we consider datasets with p = 1000, varying n with n p, s = 10, and σ = 2. For Figure 1 (Right), we consider datasets with n = 10s, p = 1000, and σ = 2 for varying s 1. Details of the synthetically generated datasets are in Appendix D. Figure 1 (Left) shows that both pt is substantially smaller than both p and n and that it does not increase with the number of covariates p. Figure 1 (Center) shows that pt tends to zero as n increases. Figure 1 (Right) shows that pt decreases as the sparsity s increases. All figures suggest that pt is controlled by δt in these settings, because zt 1 takes values close to s, and p zt 1 tends to be much larger than zt 1. Overall, Figure 1 highlights that not only does pt tend to be substantially smaller than p, but it also tends to be smaller than both n and s. By Proposition 2.1, this showcases the substantially lower computational cost of S3 compared to current implementations which cost Ω(n2p) per iteration. 2.4. Extensions to Scalable Spike-and-Slab With additional memory capacity and pre-computation, we can further improve the per-iteration costs of S3. For the matrices M 1 τ0 and M 1 τ1 in Section 2.2, suppose the matrices X X, X M 1 τ0 X, and X M 1 τ1 X are pre-computed. This initial step requires O(np2) computational cost and O(p2) memory. Then the matrices X At XAt and X Ac t XAc t in (3a) (3b) correspond to pre-computed sub-matrices of X X, and calculating Mt using (3a) (3b) at iteration t involves matrix addition which only requires O(n2) cost. Similarly, matrices X At M 1 τ0 XAt and Table 1. Comparison of S3 with alternatives (pt = min{ zt 1, p zt 1, δt} for δt = zt zt 1 1). MCMC SAMPLER COST CONVERGES TO POSTERIOR NA IVE Ω(p3) STATE-OF-THE-ART Ω(n2p) SKINNY GIBBS Ω(max{n zt 2 1, np}) S3 (LINEAR AND PROBIT) O(max{n2pt, np}) S3 (LOGISTIC) O(max{n2pt, n3, np}) X Ac t M 1 τ1 XAc t in (4a) (4b) correspond to pre-computed sub-matrices of X M 1 τ0 X and X M 1 τ1 X respectively and do not need to be recalculated at iteration t. Therefore calculating (τ 2 1 τ 2 0 ) 1I zt 1 + X At M 1 τ0 XAt 1 or (τ 2 0 τ 2 1 ) 1Ip zt 1+X Ac t M 1 τ1 XAc t 1 in (4a) (4b) at each iteration t only requires O( zt 3 1) or O((p zt 1)3) cost respectively. To sample from (2), consider the cases n min{ zt 1, p zt 1} and min{ zt 1, p zt 1} < n separately. When n min{ zt 1, p zt 1}, we calculate M 1 t by directly inverting the calculated matrix Mt from (3), which requires O(n3) cost. 
When min{ zt 1, p zt 1} < n, we avoid calculating M 1 t explicitly and instead calculate the matrix vector product M 1 t ( 1 σt y v) in Algorithm 1 right-to-left, using whichever expression in (4a) (4b) has minimal computational cost. Overall, now the Gibbs samplers for linear and probit regression require only O(max{min{ zt 1, p zt 1, n}3, np}) computational cost at iteration t. This provides lower computational cost for S3 linear and probit regression whenever min{ zt 1, p zt 1, n}3 < n2pt. A similar extension for logistic regression requires only O(max{n3, np}) computational cost at iteration t and is given in Appendix B.3. 3. Comparison with Alternatives In this section we compare S3 with the na ıve sampler, the SOTA exact MCMC sampler based on the sampling algorithm of Bhattacharya et al. (2016), and the Skinny Gibbs approximate MCMC sampler of Narisetty et al. (2019) for logistic regression. Table 1 highlights the favorable computational cost of S3 compared to the na ıve and SOTA samplers. The Skinny Gibbs sampler typically has lower computational cost compared to S3 for logistic regression, and can have lower or higher computational cost than S3 for probit regression depending on whether zt 2 1 npt or not. However, unlike S3, the Skinny Gibbs sampler does not converge to the correct posterior distribution. To assess the practical impact of computational cost and asymptotic bias, Figures 2 and 3 compares the numerical 10000 20000 30000 40000 Dimension p Time per iteration (ms) S 3 Logistic SOTA Logistic Skinny Gibbs Logistic S 3 Probit SOTA Probit Figure 2. Comparison of time per iteration between S3, state-ofthe-art (SOTA) exact MCMC sampler and the Skinny Gibbs approximate sampler of Narisetty et al. (2019) on synthetic binary classification datasets. See Section 3 for details. runtimes and statistical performance of S3 with the SOTA sampler and the Skinny Gibbs sampler. For the Skinny Gibbs sampler, we use the skinnybasad R package of Narisetty et al. (2019), which implements only logistic regression. We consider synthetically generated datasets with a true signal β Rp where β j = 2I{j s} for sparsity s. We consider datasets with n = 10s observations and p = 100s covariates for varying sparsity s 1. Details of the synthetically generated dataset are in Appendix D. For each synthetic dataset, we run S3 for logistic and probit regression and the Skinny Gibbs sampler for 1000 iterations, run the SOTA sampler for 100 iterations, and record the average time taken per iteration. All timings were obtained using a single core of an Apple M1 chip on a Macbook Air 2020 laptop with 16 GB RAM. Figure 2 highlights that the numerical runtimes of S3 are orders of magnitude faster than the SOTA sampler and comparable to the Skinny Gibbs sampler. For example, for n = 4000 observations, p = 40000 covariates, and sparsity s = 400, S3 for logistic regression requires 21500ms per iteration on average, which is approximately 15 times faster than the SOTA sampler for probit regression (which requires 335000ms per iteration on average) and 2.5 times faster than the Skinny Gibbs sampler (which requires 55600ms per iteration on average), and S3 for probit regression requires 1100ms per iteration on average, which is approximately 50 times faster than the SOTA sampler for probit regression (which requires 55800ms per iteration on average). For larger real-life datasets with hundreds of thousands of covariates, the numerical runtimes of S3 are similarly favorable compared to the SOTA sampler. 
This is showcased in Section 4, where for a genetics dataset, S3 is 50 times faster than the SOTA sampler. Figure 3 plots the true positive rate (TPR) and the false discovery rate (FDR) of variable selection based on samples 200 400 600 800 1000 Dimension p 200 400 600 800 1000 Dimension p Sampler S3 Logistic S3 Probit Skinny Gibbs Logistic Figure 3. Average true positive rate (TPR) and false discovery rate (FDR) of S3 and the Skinny Gibbs approximate sampler (Narisetty et al., 2019) across 20 independently generated datasets with one standard error bars. See Section 3 for details. from S3 for logistic and probit regression and Skinny Gibbs on synthetic binary classification datasets. To assess variable selection for signals of varying magnitude, we consider an exponentially decaying sparse true signal β Rp such that β j = 2 9 j 4 for j s and β j = 0 for j > s for sparsity s. The corresponding synthetically generated datasets have n = 100 observations, varying number of covariates p with p n, and sparsity s = 5. For each synthetic dataset, we implement S3 for logistic and probit regression and the Skinny Gibbs sampler for 5000 iterations with a burn-in of 1000 iterations and calculate the TPR and FDR from the samples. We use the same prior hyperparameters for all the algorithms, which are chosen according to Narisetty et al. (2019). Additional experimental details are included in Appendices C. The SOTA sampler is not shown in Figure 3, as SOTA and S3 are alternative implementations of the same Gibbs sampler and by definition have the same statistical performance. Figure 3 shows that in higher dimensions, samples from S3 yield significantly higher TPR and lower FDR than the Skinny Gibbs sampler. We observe similar results for other choices of prior hyperparameters, which give S3 to either have comparable or more favorable statistical performance to the Skinny Gibbs sampler. Overall, Figures 2 and 3 highlight that S3 can have comparable or even favorable computational cost to the Skinny Gibbs sampler, whilst having the correct stationary distribution and more favorable statistical properties in higher dimensions. Appendix E contains additional simulation results showcasing S3 performance for individual datasets as the chain length and the total time elapsed varies. 4. Applications We now examine the benefits of S3 on a diverse suite of regression and binary classification datasets. Table 2 sum- Table 2. Synthetic and real-life datasets considered in Section 4. DATASET n p RESPONSE TYPE BOROVECKI 31 22283 BINARY CHIN 118 22215 BINARY CHOWDARY 104 22283 BINARY GORDON 181 12533 BINARY LYMPH 148 4514 BINARY MAIZE 2266 98385 CONTINUOUS MALWARE 373 503 BINARY PCR 60 22575 CONTINUOUS SYNTHETIC BINARY 1000 50000 CONTINUOUS SYNTHETIC CONTINUOUS 1000 50000 BINARY marizes the two synthetic and eight real-world datasets considered, with further details in Appendix D. We first consider the Gordon microarray dataset (Gordon et al., 2002) with n = 181 observations (corresponding to a binary response vector indicating presence of lung cancer) and p = 12533 covariates (corresponding to genes expression levels). Figure 4 shows the marginal posterior probabilities estimated using samples from S3 and the SOTA sampler for logistic and probit regression and the Skinny Gibbs sampler for logistic regression, as well as the corresponding average runtimes per iteration. 
The marginal posterior probabilities πj Pπ(zj = 1) are estimated by ˆπj = 1 I(T S) PI i=1 PT t=S+1 z(i) j,t, where (z(i) t )t 0 are samples from i = 1, ..., I independent Markov chains generated using S3. We sample I = 5 independent chains of length T = 5000 iterations with a burn-in of 1000 iterations for both S3 and the SOTA sampler. The average runtime per iteration with one standard error bars are calculated based on these independent chains. Figure 4 (Left) plots ˆπj against j in the decreasing order of ˆπjs. It shows ˆπjs based on samples from both S3 and the SOTA sampler. We simulate both S3 and the SOTA sampler with the same random numbers at each iteration, so that any differences will be due to numerical imprecision. Figure 4 (Left) shows that the estimates using S3 and the SOTA sampler are indistinguishable. Furthermore, in this example all components of zt are identical between S3 and the SOTA sampler chains for all iterations t. Despite producing Markov chains with indistinguishable marginal distributions and hence statistical properties, Figure 4 (Right) shows that S3 has approximately 20 and 6 times faster runtime per iteration than SOTA for logistic and probit regression respectively. Furthermore, S3 for logistic regression has approximately 100 times faster runtime per iteration than the Skinny Gibbs sampler. Overall, Figure 4 highlights the practical value of S3 over the SOTA sampler and the Skinny Gibbs sampler. Figure 5, plotted with the y-axis on the log-scale, compares the runtimes of S3 and the SOTA sampler for linear and 1 10 100 100010000 Gordon Covariates Marginal posterior probabilities 1 10 100 100010000 Gordon Covariates Marginal posterior probabilities S3 Logistic SOTA Logistic Skinny Gibbs Logistic S3 Probit SOTA Probit Time per iteration (ms) Sampler S3 Logistic SOTA Logistic Skinny Gibbs Logistic S3 Probit SOTA Probit Figure 4. Comparing Bayesian logistic and probit regression with S3, SOTA, and Skinny Gibbs on the Gordon microarray dataset with n = 181 observations and p = 12533 covariates. (Left and Middle) Marginal posterior probabilities Pπ(zj = 1) estimated using samples from each chain: the recovered S3 and SOTA probabilities are indistinguishable but differ significantly from the Skinny Gibbs probabilities. (Right) Average runtime per sampler iteration with one standard error bars. See Section 4 for details. probit regression on ten regression and binary classification datasets respectively. It plots the average runtimes with one standard error bars based on 10 independent chains each of length 1000 and 100 for S3 and the SOTA sampler respectively. Figure 5 shows that S3 has lower runtimes per iteration compared to the SOTA sampler for all the datasets considered, with the most substantial speedups for larger datasets. For example, for the Maize GWAS dataset (Romay et al., 2013; Liu et al., 2016; Zeng & Zhou, 2017) with n = 2266 observations (corresponding to average number of days taken for silk emergence in different maize lines) and p = 98385 covariates (corresponding to single nucleotide polymorphisms (SNPs) in the genome), S3 requires 650ms per iteration on average, which is 48 times faster than the SOTA sampler requiring 31300ms per iteration. For researchers, such speedups can reduce algorithm runtime from days to hours, giving substantial time and computational cost savings at no compromise to inferential quality. 
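For reference, estimating the marginal posterior probabilities ˆπ_j used above and reading off the median probability model amounts to a simple average over stored z-samples. The (I, T, p) array layout below is an assumption made for illustration, not the output format of the ScaleSpikeSlab package.

```python
import numpy as np

def estimate_inclusion_probs(chains, burn_in):
    """Estimate pi_j = P(z_j = 1 | y) by averaging post-burn-in z draws across chains.
    `chains` is an array of shape (I, T, p) holding the z_t samples from I independent chains."""
    return chains[:, burn_in:, :].mean(axis=(0, 1))

def median_probability_model(chains, burn_in):
    """Covariates selected by the median probability model (estimated pi_j > 1/2)."""
    return np.flatnonzero(estimate_inclusion_probs(chains, burn_in) > 0.5)
```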
Appendix E contains additional results of S3 applied to these datasets including effective sample size (ESS) calculations, marginal posterior probabilities, and performance under 10-fold cross-validation. 5. Further Work The following questions arise from our work. Extensions of S3 to point-mass spike-and-slab priors, as well as to non-Gaussian tails. Whilst priors given in (1) are one of the most common formulations employed in practice, a number of alternatives are available. This includes point-mass spike-and-slab priors (e.g., Mitchell Synthetic Binary Synthetic Continuous Time per iteration (ms) Sampler S3 SOTA Figure 5. Average runtime per iteration with one standard error bars for S3 and the SOTA sampler for linear and probit regression applied to the ten continuous and binary response datasets. See Section 4 for details. & Beauchamp, 1988; Johnson & Rossell, 2012), where a degenerate Dirac distribution about zero is chosen for the spike part, and extensions which consider the heavier-tailed Laplace distribution for the slab part instead of a Gaussian distribution (Castillo et al., 2015; Roˇckov a, 2018; Ray et al., 2020; Ray & Szab o, 2021). An extension of S3 would to be employ similar pre-computation based strategy of Section 2 to MCMC samplers for these alternative formulations. Convergence complexity analysis of S3. An important question that is not addressed in this article is the number of iterations required for S3 or other similar samplers to converge to the their target posterior distributions. For Gibbs samplers targeting posteriors corresponding to continuous shrinkage priors (e.g., Carvalho et al., 2010; Bhattacharya et al., 2015; Bhadra et al., 2019), much theoretical progress has been made (Pal & Khare, 2014; Qin & Hobert, 2019; Bhattacharya et al., 2022; Biswas et al., 2022). Convergence of Gibbs samplers targeting spike-and-slab posteriors has been less extensively studied and requires more attention. Diagnostics to assess the convergence of and asymptotic variance of S3. Given some time and computational budget constraints, an immediate benefit of S3 is the ability to run longer Markov chains targeting spike-and-slab posteriors. This can alleviate some concerns linked to burn-in and asymptotic variance, but convergence and effective sample size diagnostics (Johnson, 1998; Biswas et al., 2019; Vats & Knudson, 2021; Vehtari et al., 2021) remain an important consideration particularly in high-dimensional settings. We hope to investigate convergence diagnostics in future work. Acknowledgments. We thank Juan Shen for sharing the PCR and the Lymph Node datasets, as well as Xiaolei Liu and Xiang Zhou for sharing the Maize GWAS dataset. NB was supported by the NSF grant DMS-1844695, a GSAS Merit Fellowship, and a Two Sigma Fellowship Award. XLM was partially supported by the NSF grant DMS-1811308. Bai, R., Roˇckov a, V., and George, E. I. Spike-and-slab meets LASSO: A review of the spike-and-slab LASSO. Handbook of Bayesian Variable Selection, 2021. Banerjee, S., Castillo, I., and Ghosal, S. Bayesian inference in high-dimensional models. Springer volume on Data Science, 2021. Barbieri, M. M. and Berger, J. O. Optimal predictive model selection. Annals of Statistics, 32(3):870 897, 2004. doi: 10.1214/009053604000000238. URL https:// doi.org/10.1214/009053604000000238. Barbieri, M. M., Berger, J. O., George, E. I., and Roˇckov a, V. The Median Probability Model and Correlated Variables. Bayesian Analysis, 16(4):1085 1112, 2021. doi: 10.1214/20-BA1249. URL https://doi.org/10. 
1214/20-BA1249. Bhadra, A., Datta, J., Polson, N. G., and Willard, B. Lasso Meets Horseshoe: A Survey. Statistical Science, 34 (3):405 427, 2019. doi: 10.1214/19-STS700. URL https://doi.org/10.1214/19-STS700. Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. Dirichlet Laplace Priors for Optimal Shrinkage. Journal of the American Statistical Association, 110(512):1479 1490, 2015. doi: 10.1080/01621459. 2014.960967. URL https://doi.org/10.1080/ 01621459.2014.960967. Bhattacharya, A., Chakraborty, A., and Mallick, B. K. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103 (4):985 991, 2016. ISSN 0006-3444. doi: 10. 1093/biomet/asw042. URL https://doi.org/10. 1093/biomet/asw042. Bhattacharya, S., Khare, K., and Pal, S. Geometric ergodicity of Gibbs samplers for the Horseshoe and its regularized variants. Electronic Journal of Statistics, 16(1):1 57, 2022. doi: 10.1214/21-EJS1932. URL https://doi.org/10.1214/21-EJS1932. Biswas, N., Jacob, P. E., and Vanetti, P. Estimating convergence of markov chains with l-lag couplings. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings. neurips.cc/paper/2019/file/ aec851e565646f6835e915293381e20a-Paper. pdf. Biswas, N., Bhattacharya, A., Jacob, P. E., and Johndrow, J. E. Coupling-based convergence assessment of some gibbs samplers for high-dimensional bayesian regression with shrinkage priors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2022. doi: 10.1111/rssb.12495. URL https://doi.org/10. 1111/rssb.12495. Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Cand es, E. J. SLOPE Adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9 (3):1103 1140, 2015. doi: 10.1214/15-AOAS842. URL https://doi.org/10.1214/15-AOAS842. Carvalho, C. M., Polson, N. G., and Scott, J. G. The horseshoe estimator for sparse signals. Biometrika, 97 (2):465 480, 04 2010. ISSN 0006-3444. doi: 10. 1093/biomet/asq017. URL https://doi.org/10. 1093/biomet/asq017. Castillo, I. and van der Vaart, A. Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences. Annals of Statistics, 40(4):2069 2101, 2012. doi: 10.1214/12-AOS1029. URL https:// doi.org/10.1214/12-AOS1029. Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. Bayesian linear regression with sparse priors. Annals of Statistics, 43(5):1986 2018, 2015. doi: 10.1214/ 15-AOS1334. URL https://doi.org/10.1214/ 15-AOS1334. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Flegal, J. M., Hughes, J., Vats, D., Dai, N., Gupta, K., and Maji, U. mcmcse: Monte Carlo Standard Errors for MCMC. Riverside, CA, and Kanpur, India, 2021. R package version 1.5-0. George, E. I. and Mc Culloch, R. E. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881 889, 1993. doi: 10.1080/01621459.1993.10476353. URL https://www.tandfonline.com/doi/abs/ 10.1080/01621459.1993.10476353. George, E. I. and Mc Culloch, R. E. Approaches for Bayesian Variable Selection. Statistica Sinica, 7(2): 339 373, 1997. ISSN 10170405, 19968507. URL http://www.jstor.org/stable/24306083. Gordon, G. J. G., Jensen, R. V. R., Hsiao, L.-L. L., Gullans, S. R. S., Blumenstock, J. E. J., Ramaswamy, S. S., Richards, W. G. W., Sugarbaker, D. J. D., and Bueno, R. R. 
Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research, 62(17):4963 4967, September 2002. Guan, Y. and Stephens, M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5 (3):1780 1815, 2011. doi: 10.1214/11-AOAS455. URL https://doi.org/10.1214/11-AOAS455. Hager, W. W. Updating the inverse of a matrix. SIAM Review, 31(2):221 239, 1989. ISSN 00361445. URL http://www.jstor.org/stable/2030425. Hans, C., Dobra, A., and West, M. Shotgun Stochastic Search for Large p Regression. Journal of the American Statistical Association, 102(478):507 516, 2007. doi: 10.1198/016214507000000121. URL https://doi. org/10.1198/016214507000000121. Held, L. and Holmes, C. C. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145 168, 2006. doi: 10.1214/06-BA105. URL https://doi.org/10.1214/06-BA105. Ishwaran, H. and Rao, J. S. Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33(2):730 773, 2005. doi: 10.1214/ 009053604000001147. URL https://doi.org/10. 1214/009053604000001147. Johndrow, J., Orenstein, P., and Bhattacharya, A. Scalable Approximate MCMC Algorithms for the Horseshoe Prior. Journal of Machine Learning Research, 21(73):1 61, 2020. URL http://jmlr.org/papers/v21/ 19-536.html. Johnson, V. E. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441): 238 248, 1998. Johnson, V. E. and Rossell, D. Bayesian Model Selection in High-Dimensional Settings. Journal of the American Statistical Association, 107(498):649 660, 2012. doi: 10.1080/01621459.2012.682536. URL https://doi. org/10.1080/01621459.2012.682536. Johnstone, I. M. and Silverman, B. W. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Annals of Statistics, 32(4):1594 1649, 2004. doi: 10.1214/009053604000000030. URL https:// doi.org/10.1214/009053604000000030. Kelly, B. C. Some Aspects of Measurement Error in Linear Regression of Astronomical Data. The Astrophysical Journal, 665(2):1489 1506, aug 2007. doi: 10. 1086/519947. URL https://doi.org/10.1086/ 519947. Liang, F., Song, Q., and Yu, K. Bayesian Subset Modeling for High-Dimensional Generalized Linear Models. Journal of the American Statistical Association, 108(502):589 606, 2013. doi: 10.1080/01621459. 2012.761942. URL https://doi.org/10.1080/ 01621459.2012.761942. Liu, X., Huang, M., Fan, B., Buckler, E. S., and Zhang, Z. Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLOS Genetics, 12(2):1 24, 2016. doi: 10.1371/ journal.pgen.1005767. URL https://doi.org/10. 1371/journal.pgen.1005767. Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023 1032, 1988. ISSN 01621459. URL http://www.jstor.org/ stable/2290129. Narisetty, N. N. and He, X. Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics, 42 (2):789 817, 2014. doi: 10.1214/14-AOS1207. URL https://doi.org/10.1214/14-AOS1207. Narisetty, N. N., Shen, J., and He, X. Skinny Gibbs: A Consistent and Scalable Gibbs Sampler for Model Selection. Journal of the American Statistical Association, 114(527):1205 1217, 2019. doi: 10.1080/ 01621459.2018.1482754. 
URL https://doi.org/ 10.1080/01621459.2018.1482754. O Brien, S. M. and Dunson, D. B. Bayesian multivariate logistic regression. Biometrics, 60(3):739 746, 2004. doi: 10.1111/j.0006-341X.2004.00224.x. URL https:// doi:10.1111/j.0006-341X.2004.00224.x. Pal, S. and Khare, K. Geometric ergodicity for Bayesian shrinkage models. Electronic Journal of Statistics, 8(1): 604 645, 2014. doi: 10.1214/14-EJS896. URL https: //doi.org/10.1214/14-EJS896. Polson, N. G., Scott, J. G., and Windle, J. Bayesian Inference for Logistic Models Using P olya Gamma Latent Variables. Journal of the American Statistical Association, 108(504):1339 1349, 2013. doi: 10.1080/01621459. 2013.829001. URL https://doi.org/10.1080/ 01621459.2013.829001. Qin, Q. and Hobert, J. P. Convergence complexity analysis of Albert and Chib s algorithm for Bayesian probit regression. Annals of Statistics, 47(4):2320 2347, 2019. doi: 10.1214/18-AOS1749. URL https://doi.org/10. 1214/18-AOS1749. Ray, K. and Szab o, B. Variational bayes for highdimensional linear regression with sparse priors. Journal of the American Statistical Association, 0(0):1 12, 2021. doi: 10.1080/01621459.2020.1847121. URL https:// doi.org/10.1080/01621459.2020.1847121. Ray, K., Szabo, B., and Clara, G. Spike and slab variational bayes for high dimensional logistic regression. In Advances in Neural Information Processing Systems, volume 33, pp. 14423 14434. Curran Associates, Inc., 2020. URL https://proceedings. neurips.cc/paper/2020/file/ a5bad363fc47f424ddf5091c8471480a-Paper. pdf. Romay, M. C., Millard, M. J., Glaubitz, J. C., Peiffer, J. A., Swarts, K. L., Casstevens, T. M., Elshire, R. J., Acharya, C. B., Mitchell, S. E., Flint-Garcia, S. A., Mc Mullen, M. D., Holland, J. B., Buckler, E. S., and Gardner, C. A. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biology, 14(6):R55, 2013. doi: 10.1186/gb-2013-14-6-r55. URL https://doi. org/10.1186/gb-2013-14-6-r55. Roˇckov a, V. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Annals of Statistics, 46 (1):401 437, 2018. doi: 10.1214/17-AOS1554. URL https://doi.org/10.1214/17-AOS1554. Scott, J. G. and Berger, J. O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics, 38(5):2587 2619, 2010. doi: 10.1214/10-AOS792. URL https://doi.org/10. 1214/10-AOS792. Sereno, M. A Bayesian approach to linear regression in astronomy. Monthly Notices of the Royal Astronomical Society, 455(2):2149 2162, 11 2015. ISSN 00358711. doi: 10.1093/mnras/stv2374. URL https: //doi.org/10.1093/mnras/stv2374. Tadesse, M. G. and Vannucci, M. Handbook of Bayesian Variable Selection. Chapman and Hall/CRC, 2021. doi: 10.1201/9781003089018. URL https://doi.org/ 10.1201/9781003089018. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1): 267 288, 1996. doi: 10.1111/j.2517-6161.1996. tb02080.x. URL https://doi.org/10.1111/j. 2517-6161.1996.tb02080.x. Titsias, M. and L azaro-Gredilla, M. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings. neurips.cc/paper/2011/file/ b495ce63ede0f4efc9eec62cb947c162-Paper. pdf. Vats, D. and Knudson, C. Revisiting the Gelman Rubin Diagnostic. Statistical Science, 36(4):518 529, 2021. doi: 10.1214/20-STS812. URL https://doi.org/ 10.1214/20-STS812. Vats, D., Flegal, J. 
M., and Jones, G. L. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321 337, 2019. ISSN 0006-3444. doi: 10. 1093/biomet/asz002. URL https://doi.org/10. 1093/biomet/asz002. Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and B urkner, P.-C. Rank-Normalization, Folding, and Localization: An Improved b R for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis, 16 (2):667 718, 2021. doi: 10.1214/20-BA1221. URL https://doi.org/10.1214/20-BA1221. Yang, Y., Wainwright, M. J., and Jordan, M. I. On the computational complexity of high-dimensional Bayesian variable selection. Annals of Statistics, 44(6):2497 2532, 2016. doi: 10.1214/15-AOS1417. URL https://doi. org/10.1214/15-AOS1417. Zeng, P. and Zhou, X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature Communications, 8(1):456, 2017. doi: 10.1038/s41467-017-00470-2. URL https://doi. org/10.1038/s41467-017-00470-2. Zhou, X., Carbonetto, P., and Stephens, M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLOS Genetics, 9(2):1 14, 02 2013. doi: 10.1371/journal. pgen.1003264. URL https://doi.org/10.1371/ journal.pgen.1003264. Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2):301 320, 2005. doi: 10.1111/j.1467-9868.2005. 00503.x. URL https://doi.org/10.1111/j. 1467-9868.2005.00503.x. Proof of Proposition 2.1. Consider Step 1 of Algorithm 2 and Algorithm 3 for probit regression. Given pre-computed matrices Mτ0, Mτ0, Mt 1, M 1 t 1 and state zt 1, calculating Mt, M 1 t requires O(n2pt) cost by (3) and (4), where pt = min{ zt 1, p zt 1, δt} for δt = zt zt 1 1. Consider Step 1 of Algorithm 3 for logistic regression. This requires O(max{n2pt, n3}) cost, where the O(n2pt) cost arises from the calculation of Mt using (21) and the O(n3) cost arises from inverting Mt to calculate M 1 t . Given M 1 t , Step 2 of Algorithms 2 and 3 then requires O(np) cost, which arises from the matrix vector product XD 1 2 t r for r N(0, Ip) in Algorithm 1. By component-wise independence, Step 3 of Algorithms 2 and 3 costs O(p) and Step 4 of Algorithm 3 cost O(n). Step 4 of Algorithm 3 and Step 5 of Algorithm 3 for probit regression both cost O(1). Step 5 of Algorithm 3 for logistic regression both costs O(n). This gives an overall computational cost of O(max{n2pt, np}) for Algorithm 2 and Algorithm 3 for probit regression at iteration t, and a cost of O(max{n2pt, n3, np}) for 3 for logistic regression. Proof of Proposition 2.2. By linearity, E[δt] = Pp j=1 P(zj,t = zj,t 1) where for each component j the random variables zj,t and zj,t 1 are on {0, 1}. For each j, we obtain P(zj,t = zj,t 1) = P(zj,t = 1, zj,t 1 = 0) + P(zj,t = 0, zj,t 1 = 1) = P(zj,t = 1) P(zj,t = 1, zj,t 1 = 1) + P(zj,t 1 = 1) P(zj,t = 1, zj,t 1 = 1) = P(zj,t = 1) cov(zj,t, zj,t 1) P(zj,t = 1)P(zj,t 1 = 1) + P(zj,t 1 = 1) cov(zj,t, zj,t 1) P(zj,t = 1)P(zj,t 1 = 1) = P(zj,t = 1)P(zj,t 1 = 0) + P(zj,t = 0)P(zj,t 1 = 1) 2cov(zj,t, zj,t 1). When zj,t 1 follows the stationary π, zj,t zj,t 1 and var(zj,t) = P(zj,t = 1)P(zj,t 1 = 0). Consequently, P(zj,t = zj,t 1) = 2varπ(zj,t) 2cov(zj,t, zj,t 1) = 2varπ(zj,t)(1 corrπ(zj,t, zj,t 1)). Proof of Proposition 2.3. Note that a2 = a if a only takes the value 0 and 1. This gives I{zj,t = zj,t 1} = (zj,t zj,t 1)2 = zj,t + zj,t 1 2zj,tzj,t 1. (10) Let J be the uniform random variable on the integers {1, . . . , p}. 
Then, δt = p EJ(z J,t) + EJ(z J,t 1) 2EJ(z J,tz J,t 1) , (11) where the expectation is taken with respect to the random index J. Note that EJ(z J,t) = zt 1/p and Var J(z J,t) = ( zt 1/p)(1 zt 1/p) = τ 2 t /p2. We obtain EJ(z J,tz J,t 1) = Cov J(z J,t, z J,t 1) + EJ(z J,t)EJ(z J,t 1) = ρtτtτt 1 + zt 1 zt 1 1 where ρt = corr J(z J,t, z J,t 1). Combining (11)-(12) yields (7). B. Algorithm Derivations B.1. Linear regression with spike-and-slab priors For linear regression with the continuous spike-and-slab priors in (1), the posterior density of (β, z, σ2) Rp {0, 1}p (0, ) is given by π(β, z, σ2|y) N(y; Xβ, σ2)Inv Gamma σ2; a0 q N(βj; 0, σ2τ 2 1 ) zj (1 q)N(βj; 0, σ2τ 2 0 ) 1 zj. (14) From (13), we can calculate the conditional distributions. We have π(β|z, σ2, y) N(y; Xβ, σ2In) j=1 N(βj; 0, σ2τ 2 1 )zj N(βj; 0, σ2τ 2 0 )1 zj N(y; Xβ, σ2In)N(β; 0, σ2D 1) for D Diag(zτ 2 1 + (1p z)τ 2 0 ) exp 1 2σ2In β X Xβ 2β X y + β Dβ N(β; Σ 1X y, σ2Σ 1) for Σ = X X + D, π(z|β, σ2, y) q N(βj; 0, σ2τ 2 1 ) zj (1 q)N(βj; 0, σ2τ 2 0 ) 1 zj j=1 Bernoulli zi; q N(βj; 0, σ2τ 2 1 ) q N(βj; 0, σ2τ 2 1 ) + (1 q)N(βj; 0, σ2τ 2 0 ) π(σ2|β, z, y) N(y; Xβ, σ2)Inv Gamma σ2; a0 N(β; 0, σ2D 1) 2σ2 y Xβ 2 2 Inv Gamma σ2; a0 + n + p 2 , b0 + y Xβ 2 2 + β Dβ 2 as given in Algorithm 2. B.2. Probit regression with spike-and-slab priors Consider the probit regression likelihood, where for each observation i = 1, ..., n, P(yi = 1|xi, β) = 1 P(yi = 0|xi, β) = Φ(x i β) for x i the i-th row of the design matrix X. and Φ that cumulative density function of a univariate Normal distribution. We obtain yi = I{ yi > 0} for yi|β N(x i β, 1). The Bayesian probit regression model is then given by zj i.i.d. Bernoulli(q) for all j = 1, ..., p βj|zj ind (1 zj)N(0, τ 2 0 ) + zj N(0, τ 2 1 ) for all j = 1, ..., p (15) yi|β ind N(x i β, 1) for all i = 1, ..., n yi = I{ yi > 0} for all i = 1, ..., n. For the prior and likelihood in (15), the posterior density of (β, z, y) Rp {0, 1}p Rp is given by π(β, z, y|y) n Y i=1 I I{ yi > 0} = yi N( yi; x i β, 1) q N(βj; 0, τ 2 1 ) zj (1 q)N(βj; 0, τ 2 0 ) 1 zj. (16) From (16), we can calculate the conditional distributions. We obtain π(β|z, y, y) N( y; Xβ, In)N(β; 0, D 1) for D Diag(zτ 2 1 + (1p z)τ 2 0 ) N(β; Σ 1X y, Σ 1) for Σ = X X + D, π(z|β, y, y) j=1 Bernoulli zi; q N(βj; 0, τ 2 1 ) q N(βj; 0, τ 2 1 ) + (1 q)N(βj; 0, τ 2 0 ) π( y|β, z, y) i=1 N( yi; x i β, 1)I I{ yi > 0} = yi . as required for probit regression in Algorithm 3. B.3. Logistic regression with spike-and-slab priors We first describe the Bayesian logistic regression model considered. Consider the logistic regression likelihood, where for each observation i = 1, ..., n, P(yi = 1|xi, β) = 1 P(yi = 0|xi, β) = exp(x i β) 1+exp(x i β) for x i the i-th row of the design matrix X. We obtain yi = I{ yi > 0} where yi ind Logistic(x i β, 1), corresponding to the logistic distribution centered about x i β and scale parameter 1. B.3.1. STUDENT S t-DISTRIBUTION BASED APPROXIMATION OF THE LOGISTIC REGRESSION LIKELIHOOD. Following O Brien & Dunson (2004) and Narisetty et al. (2019), we can approximate Logistic(x i β, 1) with x i β + wtν, where tν denotes a t-distribution with ν degrees of freedom and w is a multiplicative factor. The constants ν 7.3 and w2 π2(ν 2) 3ν are chosen following O Brien & Dunson (2004), in order to match the variance of the logistic distribution and to minimize the integrated squared distance between the respective densities. 
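The stated constants can be checked directly: with ν ≈ 7.3 and w² = π²(ν − 2)/(3ν), the scaled variable w t_ν has variance w²ν/(ν − 2) = π²/3, which matches the variance of the standard logistic distribution. A one-line numerical check of this variance matching:

```python
import numpy as np

nu = 7.3
w_sq = np.pi**2 * (nu - 2) / (3 * nu)
# Var(w * t_nu) = w^2 * nu / (nu - 2); the Logistic(0, 1) variance is pi^2 / 3.
assert np.isclose(w_sq * nu / (nu - 2), np.pi**2 / 3)
```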
The Gaussian scale representation of this t-distribution is yi|xi, β, σi N(x i β, σ2 i ), σ2 i Inv Gamma v where each σ2 i is an augmented variable. The Bayesian logistic regression model is then given by zj i.i.d. Bernoulli(q) for all j = 1, ..., p βj|zj ind (1 zj)N(0, τ 2 0 ) + zj N(0, τ 2 1 ) for all j = 1, ..., p (18) σ2 i i.i.d. Inv Gamma(ν yi|β, σ2 i ind N(x i β, σ2 i ) for all i = 1, ..., n yi = I{ yi > 0} for all i = 1, ..., n. Let σ2 denote the vector with entries σ2 i for i = 1, ..., n. For the prior and likelihood in (18), the posterior density of (β, z, y, σ2) on Rp {0, 1}p Rn (0, )n is given by π(β, z, y, σ2|y) q N(βj; 0, τ 2 1 ) zj (1 q)N(βj; 0, τ 2 0 ) 1 zj i=1 I I{ yi > 0} = yi N( yi; x i β, σ2 i )Inv Gamma σ2 i ; ν From (19), we can calculate the conditional distributions. Let W = Diag( σ2). We obtain π(β|z, y, σ2, y) N( y; Xβ, W )N(β; 0, D 1) for D Diag(zτ 2 1 + (1p z)τ 2 0 ) N(β; Σ 1X W 1 y, Σ 1) for Σ = X W 1X + D, π(z|β, y, σ2, y) j=1 Bernoulli zi; q N(βj; 0, τ 2 1 ) q N(βj; 0, τ 2 1 ) + (1 q)N(βj; 0, τ 2 0 ) π( y|β, z, σ2, y) i=1 N( yi; x i β, σ2 i )I I{ yi > 0} = yi , and π( σ2|β, z, y, y) i=1 N( yi; x i β, σ2 i )Inv Gamma σ2 i ; ν i=1 Inv Gamma σ2 i ; ν + 1 2 , w2ν + ( yi x i β)2 as required for logistic regression in Algorithm 3. Algorithm 4 An Ω(n2p) sampler of (20) (Bhattacharya et al., 2016) Sample r N(0, Ip), ξ N(0, In). Set u = D 1 2 t r and calculate v = W 1/2 t Xu + ξ. Set v = M 1 t (W 1/2 t y v) for Mt = In + W 1/2 t XD 1 t X W 1/2 t . Return β = u + D 1 t X W 1/2 t v . A scalable Gibbs sampler for logistic regression. The computational bottleneck of existing Gibbs samplers for logistic regression is linked to sampling from the full conditional of β Rp. This is given by βt+1|zt, σ2 t N Σ 1 t X W 1 t y, Σ 1 t for Σt = X W 1 t X + Dt, (20) where t indexes the iteration of the Markov chain, Wt is the diagonal matrix with the vector σ2 t populating its diagonal elements, and Dt is the diagonal matrix with the vector ztτ 2 1 + (1p zt)τ 2 0 populating its diagonal elements. To sample from (20), we can use the Ω(n2p) sampler of Bhattacharya et al. (2016), which is given in Algorithm 4. Following the strategy in Section 2.2, S3 for logistic regression uses pre-computation to reduce the computational cost of Algorithm 4. Using the notation from Section 2.2 with Mt In + W 1/2 t XD 1 t X W 1/2 t , we note Mt = In + W 1/2 t Mτ0 In + (τ 2 1 τ 2 0 )XAc t XT Ac t W 1/2 t (21a) = In + W 1/2 t Mτ1 In + (τ 2 0 τ 2 1 )XAc t XT Ac t W 1/2 t (21b) = In + W 1/2 t W 1/2 t 1(Mt 1 In)W 1/2 t 1 + X t C t XT t W 1/2 t . (21c) In (21a) (21c), calculating the matrix products XAt X At, XAc t X Ac t, and X t C t X t requires O(n2 zt 1), O(n2(p zt 1)), and O(n2δt) cost respectively. Given Mτ0, Mτ1, Mt 1, and zt 1, we evaluate whichever matrix product in (21a) (21c) has minimal computational cost and thereby calculate Mt at the reduced cost of O(n2pt) where pt min{ zt 1, p zt 1, δt}. To calculate M 1 t , we calculate M 1 t by directly inverting the calculated matrix Mt, which requires O(n3) cost. Overall, this strategy reduces the computational cost of calculating the matrices Mt and M 1 t from Ω(n2p) to O(max{n2pt, n3}). Extensions to Scalable Spike-and-Slab for logistic regression. Suppose the matrices X X is pre-computed. This initial step requires O(np2) computational cost and O(p2) memory. 
Extensions to Scalable Spike-and-Slab for logistic regression. Suppose the matrix $X^\top X$ is pre-computed. This initial step requires $O(np^2)$ computational cost and $O(p^2)$ memory. Then the matrices $X_{A_t} X_{A_t}^\top$ and $X_{A_t^c} X_{A_t^c}^\top$ in (21a)–(21b) correspond to pre-computed sub-matrices of $X^\top X$, and calculating $M_t$ using (21a)–(21b) at each iteration $t$ involves only matrix addition and diagonal matrix multiplication, which requires $O(n^2)$ cost. To sample from (20), we obtain $M_t^{-1}$ by directly inverting the calculated matrix $M_t$, which requires $O(n^3)$ cost. Overall, the Gibbs sampler for logistic regression then requires $O(\max\{n^3, np\})$ computational cost at iteration $t$, which is an improvement upon S3.

C. Experiment Details

Figure 3 of Section 3. In Figure 3, we use the same prior hyperparameters for all the algorithms. Following Narisetty et al. (2019), we choose $\tau_0^2 = \frac{1}{n}$, $\tau_1^2 = \max\{\frac{p^{2.1}}{100n}, 1\}$, and $q = P(z_j = 1)$ such that $P\big(\sum_{j=1}^p \mathbb{1}\{z_j = 1\} > K\big) = 0.1$ for $K = \max\{10, \log n\}$. The true positive rate (TPR) is the proportion of nonzero components of $\beta^*$ that are selected, and the false discovery rate (FDR) is the proportion of zero components of $\beta^*$ that are selected. They are calculated as $\frac{1}{s}\sum_{j=1}^s \mathbb{1}\{\hat{\pi}_j > 0.5\}$ and $\frac{1}{p-s}\sum_{j=s+1}^p \mathbb{1}\{\hat{\pi}_j > 0.5\}$ respectively, where the marginal posterior probabilities $\pi_j \equiv P_\pi(z_j = 1)$ are estimated by $\hat{\pi}_j \equiv \frac{1}{4000}\sum_{t=1001}^{5000} z_{j,t}$ for sample points $(z_t)_{t \geq 0}$ generated using S3 or Skinny Gibbs. The lines in Figure 3 correspond to the average TPR and FDR across 20 independently generated datasets, and the grey bands correspond to one standard error of the averages.

D. Dataset Details

Synthetic continuous response dataset in Section 2.3. In Figure 1, synthetic linear regression datasets are considered. For a number of observations $n$ and number of covariates $p$, we generate a design matrix $X \in \mathbb{R}^{n \times p}$ with entries $[X]_{i,j} \overset{\text{i.i.d.}}{\sim} N(0,1)$ for all $1 \leq i \leq n$ and $1 \leq j \leq p$, which is then scaled so that each column has mean 0 and standard deviation 1. We choose the true signal $\beta^* \in \mathbb{R}^p$ such that $\beta^*_j = 2 \cdot \mathbb{1}\{j \leq s\}$, where $s$ is the sparsity parameter corresponding to the number of non-zero components. Given $X$ and $\beta^*$, we generate $y = X\beta^* + \sigma^* \epsilon$ for $\epsilon \sim N(0, I_n)$, where $\sigma^* = 2$ is the Gaussian noise standard deviation.

Synthetic binary response dataset in Section 3. In Figures 2 and 3, synthetic binary classification datasets are considered. For a number of observations $n$ and number of covariates $p$, we generate a design matrix $X \in \mathbb{R}^{n \times p}$ with entries $[X]_{i,j} \overset{\text{i.i.d.}}{\sim} N(0,1)$ for all $1 \leq i \leq n$ and $1 \leq j \leq p$, which is then scaled so that each column has mean 0 and standard deviation 1. We choose the true signal $\beta^* \in \mathbb{R}^p$ such that $\beta^*_j = 2^{(9-j)/4}\, \mathbb{1}\{j \leq s\}$, where $s$ is the sparsity parameter corresponding to the number of non-zero components. Given $X$ and $\beta^*$, we generate $y_i = \mathbb{1}\{\tilde{y}_i > 0\}$ for $\tilde{y}_i \sim \mathrm{Logistic}(x_i^\top \beta^*, 1)$ for $i = 1, \dots, n$, where $x_i^\top$ is the $i$-th row of $X$ and $\mathrm{Logistic}(x_i^\top \beta^*, 1)$ is the logistic distribution with mean $x_i^\top \beta^*$ and scale parameter 1.
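For concreteness, the following sketch generates the two synthetic datasets as described above. It assumes the decaying-signal reading $\beta^*_j = 2^{(9-j)/4}$ for $j \leq s$, and the function names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def synthetic_design(n, p, rng):
    """Gaussian design with columns scaled to mean 0 and unit standard deviation."""
    X = rng.standard_normal((n, p))
    return (X - X.mean(axis=0)) / X.std(axis=0)

def synthetic_linear(n, p, s, rng, sigma_star=2.0):
    """Continuous responses: beta*_j = 2 for j <= s, and y = X beta* + sigma* eps."""
    X = synthetic_design(n, p, rng)
    beta_star = np.where(np.arange(1, p + 1) <= s, 2.0, 0.0)
    y = X @ beta_star + sigma_star * rng.standard_normal(n)
    return X, y, beta_star

def synthetic_logistic(n, p, s, rng):
    """Binary responses: beta*_j = 2^{(9 - j)/4} for j <= s, y_i = 1{y~_i > 0}
    with y~_i ~ Logistic(x_i^T beta*, 1)."""
    X = synthetic_design(n, p, rng)
    j = np.arange(1, p + 1)
    beta_star = np.where(j <= s, 2.0 ** ((9 - j) / 4), 0.0)
    y_latent = rng.logistic(loc=X @ beta_star, scale=1.0)
    return X, (y_latent > 0).astype(int), beta_star

# Example: the n = 200, p = 1000, s = 10 binary setting used in Section E.
rng = np.random.default_rng(1)
X, y, beta_star = synthetic_logistic(200, 1000, 10, rng)
```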
Datasets in Section 4. The Malware detection dataset from the UCI machine learning repository (Dua & Graff, 2017) has n = 373 observations with binary responses and p = 503 covariates, and is publicly available at www.kaggle.com/piyushrumao/malware-executable-detection. The Borovecki, Chowdary, Chin, and Gordon datasets are all high-dimensional microarray datasets; they are publicly available through the datamicroarray package in R. The Borovecki dataset has n = 31 observations with binary responses and p = 22283 covariates. The Chowdary dataset has n = 104 observations with binary responses and p = 22283 covariates. The Chin dataset has n = 118 observations with binary responses and p = 22215 covariates. The Gordon dataset has n = 181 observations with binary responses and p = 12533 covariates. The PCR GWAS dataset has n = 60 observations with continuous responses and p = 22575 covariates, and is publicly available at www.ncbi.nlm.nih.gov/geo (accession number GSE3330). The Lymph Node GWAS dataset has n = 148 observations with binary responses and p = 4514 covariates, and has been previously considered (Hans et al., 2007; Liang et al., 2013; Narisetty et al., 2019). The Maize GWAS dataset has n = 2266 observations with continuous responses and p = 98385 covariates, and has been previously considered (Romay et al., 2013; Liu et al., 2016; Zeng & Zhou, 2017). The Lymph Node GWAS and Maize GWAS datasets are not publicly available.

The synthetic continuous dataset has n = 1000 observations and p = 50000 covariates. The design matrix $X$ is generated with entries $[X]_{i,j} \overset{\text{i.i.d.}}{\sim} N(0,1)$ for all $1 \leq i \leq n$ and $1 \leq j \leq p$, and is then scaled so that each column has mean 0 and standard deviation 1. The true signal $\beta^* \in \mathbb{R}^p$ is chosen such that $\beta^*_j = 2^{(9-j)/4}\, \mathbb{1}\{j \leq s\}$, where $s$ is the sparsity parameter corresponding to the number of non-zero components. Given $X$ and $\beta^*$, we generate $y = X\beta^* + \sigma^* \epsilon$ for $\epsilon \sim N(0, I_n)$, where $\sigma^* = 2$ is the Gaussian noise standard deviation. The synthetic binary classification dataset is generated as in Section 3, with n = 1000 observations and p = 50000 covariates.

E. Additional Experiments

Variable selection performance as a function of time or number of iterations. Figure 6 plots the average true positive rate (TPR) and false discovery rate (FDR) of variable selection based on samples from S3 and Skinny Gibbs as the length of the chains is varied. The TPR and FDR are averaged over 10 independent chains, and one standard error bars are shown. We consider a synthetic binary classification dataset generated using a logistic regression model as in Section 3, with n = 200 observations, p = 1000 covariates, sparsity s = 10, and an exponentially decaying sparse true signal $\beta^* \in \mathbb{R}^p$ such that $\beta^*_j = 2^{(9-j)/4}$ for $j \leq s$ and $\beta^*_j = 0$ for $j > s$. The TPR and FDR are calculated as in Section 3, with the marginal posterior probabilities $\pi_j \equiv P_\pi(z_j = 1)$ now estimated by $\hat{\pi}_j \equiv \frac{1}{T - 999}\sum_{t=1000}^{T} z_{j,t}$ for a burn-in of 1000 iterations, a varying chain length $T \geq 1000$, and sample points $(z_t)_{t \geq 0}$ generated using S3 or Skinny Gibbs. We use the same prior hyperparameters for all the algorithms, chosen according to Narisetty et al. (2019).
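As a small illustration of how these quantities can be computed from the inclusion-indicator draws (the array and function names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def tpr_fdr(z_chain, s, burn_in=1000, threshold=0.5):
    """Estimate marginal inclusion probabilities and the TPR / FDR reported in
    Figures 3 and 6 from a chain of inclusion indicators.

    z_chain : (T, p) array of z_t draws, one row per iteration.
    s       : number of truly nonzero components (the first s covariates).
    """
    pi_hat = z_chain[burn_in:].mean(axis=0)   # average indicators over retained draws
    selected = pi_hat > threshold             # median probability model selection
    tpr = selected[:s].mean()                 # proportion of true signals selected
    fdr = selected[s:].mean()                 # proportion of null covariates selected
    return pi_hat, tpr, fdr

# Example with a hypothetical chain of length T = 5000 for p = 1000 covariates.
rng = np.random.default_rng(0)
z_chain = rng.integers(0, 2, size=(5000, 1000))
pi_hat, tpr, fdr = tpr_fdr(z_chain, s=10)
```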
Figure 6 Left and Center-Left plot the TPR and FDR against the chain length $T$. They show that S3 has higher TPR and lower FDR than Skinny Gibbs for all chain lengths, for both logistic and probit regression. Furthermore, S3 for logistic regression has higher TPR and lower FDR than S3 for probit regression, which is expected as the synthetic dataset for this example is generated using a logistic regression model. The SOTA sampler is omitted from the Left and Center-Left plots as its output has the same marginal distribution and statistical performance as S3.

Figure 6 Center-Right and Right plot the TPR and FDR against the total time elapsed in seconds to generate samples using S3, SOTA, or Skinny Gibbs chains with a burn-in of 1000 iterations. The standard error bars are now omitted for better visibility. For each time budget, we observe better variable selection performance from S3 when compared with the slower SOTA implementation or with Skinny Gibbs.

Figure 6. Average true positive rate (TPR) and false discovery rate (FDR) plotted against the number of iterations and the total time elapsed in seconds. We consider S3, SOTA, and the Skinny Gibbs approximate sampler (Narisetty et al., 2019) applied to a synthetic binary classification dataset with n = 200 observations and p = 1000 covariates. The TPR and FDR are averaged over 10 independent chains; one standard error bars are shown in the Left and Center-Left plots and omitted in the Center-Right and Right plots for visibility. The SOTA sampler is omitted from the Left and Center-Left plots as its output has the same marginal distribution and statistical performance as S3. See Section E for details.

Effective Sample Size of S3 for the datasets in Section 4. Figure 7 shows the effective sample size (ESS) per iteration and per unit of time (in seconds) of S3 and the SOTA sampler for the datasets in Section 4. The ESS is calculated using the mcmcse package (Flegal et al., 2021; Vats et al., 2019) for one S3 chain of length 10000 iterations with a burn-in of 1000 iterations for each dataset. The average ESS of the $\beta$ components is then plotted. Figure 7 Right shows that S3 has significantly higher ESS per second compared to the corresponding SOTA sampler for all the datasets considered.

Figure 7. Effective sample size (ESS) per iteration and per second of S3 and the SOTA sampler for some of the datasets in Section 4. The ESS is calculated using one S3 chain of length 10000 iterations with a burn-in of 1000 iterations. The ESS per iteration of the SOTA sampler is omitted from the Left plot as it implements the same Gibbs sampler as S3.

Performance metrics for the datasets in Section 4. Figures 8–14 show various performance metrics of S3 for some of the datasets considered in Section 4. Figures 8–14 (Left) plot the marginal posterior probability estimates $\hat{\pi}_j$ against $j$ in decreasing order of the $\hat{\pi}_j$, following the setup in Figure 4. For datasets with continuous-valued responses, the $\hat{\pi}_j$ are based on samples from S3 for linear regression. For datasets with binary-valued responses, the $\hat{\pi}_j$ are based on samples from S3 for logistic and probit regression and from the Skinny Gibbs sampler for logistic regression. We use samples from 5 independent chains of length 10000 iterations with a burn-in of 1000 iterations. Estimates based on samples from the SOTA sampler are not shown, as it implements the same Gibbs sampler as S3 (other than possible numerical discrepancies, as discussed in Section 4). Figures 8–14 (Center) show the average time taken per iteration, with one standard error bars, for S3, the SOTA sampler, and the Skinny Gibbs sampler, based on 5 independent chains of length 10000 iterations. Figures 8–14 (Right) show the 10-fold cross-validation average root-mean-square error (RMSE) against the total time elapsed to run one S3 and one SOTA chain. To compute this evaluation, we partition the observed dataset into 10 folds uniformly at random and, for each fold $k$, run a chain conditioned on all data outside of fold $k$ and evaluate its performance on the held-out data in the $k$-th fold. The average RMSE is calculated as $\frac{1}{10}\sum_{k=1}^{10} r_k$, where $r_k$ is the RMSE for the $k$-th fold. For datasets with continuous-valued responses, the quantities $r_k$ for linear regression are calculated as $\big(\frac{1}{|D_k|}\sum_{i \in D_k}(y_i - \hat{y}_i)^2\big)^{1/2}$, where $D_k$ is the $k$-th fold, $\hat{y}_i \equiv \frac{1}{T - 1000}\sum_{t=1001}^{T} x_i^\top \beta_t$ are the predicted responses, and $(\beta_t)_{t \geq 0}$ are samples from S3 and SOTA targeting the posterior distribution of the $k$-th training set.
For datasets with binary-valued responses, the quantities $r_k$ are calculated as $\big(\frac{1}{|D_k|}\sum_{i \in D_k}(y_i - \hat{p}_i)^2\big)^{1/2}$, where $D_k$ is the $k$-th fold, $\hat{p}_i \equiv \frac{1}{T - 1000}\sum_{t=1001}^{T} \frac{\exp(x_i^\top \beta_t)}{1 + \exp(x_i^\top \beta_t)}$ and $\hat{p}_i \equiv \frac{1}{T - 1000}\sum_{t=1001}^{T} \Phi(x_i^\top \beta_t)$ are the predicted probabilities for logistic and probit regression respectively, and $(\beta_t)_{t \geq 0}$ are samples from S3 and SOTA targeting the posterior distribution of the $k$-th training set. Figures 8–14 (Right) plot the average RMSE against the total time elapsed in seconds to generate samples using S3 or SOTA chains with a burn-in of 1000 iterations. The RMSE of the Skinny Gibbs sampler is not available, as the skinnybasad package does not output the full chain trajectories required for RMSE calculations.
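A minimal sketch of the per-fold RMSE computation described above, written for illustration (the variable names and the use of scipy are assumptions, not taken from the paper's code):

```python
import numpy as np
from scipy.stats import norm

def fold_rmse(X_test, y_test, beta_samples, model="linear", burn_in=1000):
    """Held-out RMSE r_k for one cross-validation fold, following the formulas
    above. beta_samples is a (T, p) array of beta_t draws from a chain fitted
    on the training folds; the first burn_in rows are discarded."""
    B = beta_samples[burn_in:]                    # post-burn-in draws
    eta = X_test @ B.T                            # (n_test, retained draws) linear predictors
    if model == "linear":
        pred = eta.mean(axis=1)                   # \hat{y}_i: posterior mean of x_i^T beta
    elif model == "logistic":
        pred = (1.0 / (1.0 + np.exp(-eta))).mean(axis=1)   # \hat{p}_i for logistic regression
    else:                                         # probit
        pred = norm.cdf(eta).mean(axis=1)         # \hat{p}_i: average of Phi(x_i^T beta_t)
    return np.sqrt(np.mean((y_test - pred) ** 2))
```

The reported cross-validation RMSE is then the average of the ten per-fold values $r_k$.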
In each of Figures 8–14, the Left panel plots the marginal posterior probabilities, the Center panel the time per iteration (in milliseconds), and the Right panel the cross-validation RMSE against time elapsed (in seconds).

Figure 8. Borovecki dataset with n = 31 observations, p = 22283 covariates, and binary-valued responses.

Figure 9. Chin dataset with n = 118 observations, p = 22215 covariates, and binary-valued responses.

Figure 10. Chowdary dataset with n = 104 observations, p = 22283 covariates, and binary-valued responses.

Figure 11. Gordon dataset with n = 181 observations, p = 12533 covariates, and binary-valued responses.

Figure 12. Lymph dataset with n = 148 observations, p = 4514 covariates, and binary-valued responses.

Figure 13. Malware dataset with n = 373 observations, p = 503 covariates, and binary-valued responses.

Figure 14. PCR dataset with n = 60 observations, p = 22575 covariates, and continuous-valued responses.