# Bayesian Model Selection with Graph Structured Sparsity

Journal of Machine Learning Research 21 (2020) 1-61. Submitted 2/19; Revised 2/20; Published 6/20.

Youngseok Kim (youngseok@uchicago.edu) and Chao Gao (chaogao@galton.uchicago.edu), Department of Statistics, University of Chicago, Chicago, IL 60637, USA.

Editor: François Caron

©2020 Youngseok Kim and Chao Gao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-123.html.

Abstract

We propose a general algorithmic framework for Bayesian model selection. A spike-and-slab Laplacian prior is introduced to model the underlying structural assumption. Using the notion of effective resistance, we derive an EM-type algorithm with closed-form iterations to efficiently explore possible candidates for Bayesian model selection. The deterministic nature of the proposed algorithm makes it more scalable to large-scale and high-dimensional data sets compared with existing stochastic search algorithms. When applied to sparse linear regression, our framework recovers the EMVS algorithm (Ročková and George, 2014) as a special case. We also discuss extensions of our framework using tools from graph algebra to incorporate complex Bayesian models such as biclustering and submatrix localization. Extensive simulation studies and real data applications are conducted to demonstrate the superior performance of our methods over frequentist competitors such as ℓ0 or ℓ1 penalization.

Keywords: spike-and-slab prior, graph Laplacian, variational inference, expectation-maximization, sparse linear regression, biclustering

1. Introduction

Bayesian model selection has been an important area of research for several decades. While the general goal is to estimate the most plausible sub-model from the posterior distribution (Barry and Hartigan, 1993; Diebolt and Robert, 1994; Richardson and Green, 1997; Bottolo and Richardson, 2010) for a wide class of learning tasks, most developments in Bayesian model selection have focused on variable selection in the setting of sparse linear regression (Hans et al., 2007; Li and Zhang, 2010; Ghosh and Clyde, 2011; Ročková and George, 2014; Wang et al., 2018). One of the main challenges of Bayesian model selection is computational efficiency. Recently, Ročková and George (2014) discovered that Bayesian variable selection in sparse linear regression can be solved by an EM algorithm (Dempster et al., 1977; Neal and Hinton, 1998) with a closed-form update at each iteration. Compared with previous stochastic search algorithms such as Gibbs sampling (George and McCulloch, 1993, 1997), this deterministic alternative greatly speeds up computation for large-scale and high-dimensional data sets.

The main thrust of this paper is to develop a general framework of Bayesian models that includes sparse linear regression, change-point detection, clustering and many other models as special cases. We derive a general EM-type algorithm that efficiently explores possible candidates for Bayesian model selection. When applied to sparse linear regression, our model and algorithmic frameworks naturally recover the proposal of Ročková and George (2014). The general framework proposed in this paper can be viewed as an algorithmic counterpart of the theoretical framework for Bayesian high-dimensional structured linear models in Gao et al. (2015). While the work of Gao et al.
(2015) focuses on optimal posterior contraction rates and oracle inequalities, the current paper pursues a general, efficient and scalable computational strategy.

In order to study various Bayesian models from a unified perspective, we introduce a spike-and-slab Laplacian prior distribution on the model parameters. The new prior distribution is an extension of the classical spike-and-slab prior (Mitchell and Beauchamp, 1988; George and McCulloch, 1993, 1997) for Bayesian variable selection. Our new definition incorporates the graph Laplacian of the underlying graph representing the model structure, which gives the prior its name. Under this general framework, the problem of Bayesian model selection can be recast as selecting a subgraph of some base graph determined by the statistical task. Here, the base graph and its subgraphs represent the structures of the full model and the corresponding sub-models, respectively. Various choices of base graphs lead to specific statistical estimation problems such as sparse linear regression, clustering and change-point detection. In addition, the connection to graph algebra further allows us to build prior distributions for even more complicated models. For example, using graph products such as the Cartesian product or the Kronecker product (Imrich and Klavžar, 2000; Leskovec et al., 2010), we can construct prior distributions for biclustering models from the Laplacian of the graph products of row and column clustering structures. This leads to great flexibility in analyzing real data sets with complex structures.

Our Bayesian model selection follows the procedure of Ročková and George (2014) that evaluates the posterior probabilities of sub-models computed from the solution path of the EM algorithm. However, the derivation of the EM algorithm under our general framework is a nontrivial task. When the underlying base graph of the model structure is a tree, the derivation of the EM algorithm is straightforward by following the arguments in Ročková and George (2014). On the other hand, for a general base graph that is not a tree, the arguments in Ročková and George (2014) do not apply. To overcome this difficulty, we introduce a relaxation through the concept of effective resistance (Lovász, 1993; Ghosh et al., 2008; Spielman, 2007) that adapts to the underlying graphical structure of the model. The lower bound given by this relaxation is then used to derive a variational EM algorithm that works under the general framework.

Model selection with graph structured sparsity has also been studied in the frequentist literature. For example, the generalized Lasso (Tibshirani and Taylor, 2011; Arnold and Tibshirani, 2014) and its multivariate version, the network Lasso (Hallac et al., 2015), encode graph structured sparsity with ℓ1 regularization. Algorithms based on ℓ0 regularization have also been investigated recently (Fan and Guan, 2018; Xu and Fan, 2019). Compared with these frequentist methods, our proposed Bayesian model selection procedure tends to achieve better model selection performance in terms of false discovery proportion and power in a wide range of model scenarios, which will be shown through an extensive numerical study under various settings.

The rest of the paper is organized as follows. In Section 2, we introduce the general framework of Bayesian models and discuss the spike-and-slab Laplacian prior.
The EM Bayesian Model Selection with Graph Structured Sparsity algorithm will be derived in Section 3 for both the case of trees and general base graphs. In Section 4, we discuss how to incorporate latent variables and propose a new Bayesian clustering models under our framework. Section 5 introduces the techniques of graph products and several important extensions of our framework. We will also discuss a non Gaussian spike-and-slab Laplacian prior in Section 6 with a natural application to reduced isotonic regression (Schell and Singh, 1997). Finally, extensive simulated and real data analysis will be presented in Section 7. 2. A General Framework of Bayesian Models In this section, we describe a general framework for building Bayesian structured models on graphs. To be specific, the prior structural assumption on the parameter θ Rp will be encoded by a graph. Throughout the paper, G = (V, E) is an undirected graph with V = [p] and some E {(i, j) : 1 i < j p}. It is referred to as the base graph of the model, and our goal is to learn a sparse subgraph of G from the data. We use p = |V | and m = |E| for the node size and edge size of the base graph. 2.1. Model Description We start with the Gaussian linear model y | β, σ2 N(Xβ, σ2In) that models an ndimensional observation. The design matrix X Rn p is determined by the context of the problem. Given some nonzero vector w Rp, the Euclidean space Rp can be decomposed as a direct sum of the one-dimensional subspace spanned by w and its orthogonal complement. In other words, we can write β = 1 w 2 ww T β + Ip 1 w 2 ww T β. The structural assumption will be imposed by a prior on the second term above. To simplify the notation, we introduce the space Θw = θ Rp : w T θ = 0 . Then, any β Rp can be decomposed as β = αw + θ for some α R and θ Θw. The likelihood is thus given by y | α, θ, σ2 N(X(αw + θ), σ2In). (1) The prior distribution on the vector αw + θ will be specified by independent priors on α and θ. They are given by α | σ2 N(0, σ2/ν), (2) θ | γ, σ2 p(θ | γ, σ2) Y (i,j) E exp (θi θj)2 2σ2[v0γij + v1(1 γij)] I{θ Θw}. (3) Under the prior distribution, α is centered at 0 and has precision ν/σ2. The parameter θ is modeled by a prior distribution on Θw that encodes a pairwise relation between θi and θj. Here, v0 is a very small scalar and v1 is a very large scalar. For a pair (i, j) E in the base graph, the prior enforces the closedness between θi and θj when γij = 1. Our goal is then to learn the most probable subgraph structure encoded by {γij}, which will be estimated from the posterior distribution. Kim and Gao We finish the Bayesian modeling by putting priors on γ and σ2. They are given by γ | η p(γ | η) Y (i,j) E ηγij(1 η)1 γij I{γ Γ}, (4) η Beta(A, B), (5) σ2 Inv Gamma(a/2, b/2). (6) Besides the standard conjugate priors on η and σ2, the independent Bernoulli prior on γ is restricted on a set Γ {0, 1}m. This restriction is sometimes useful for particular models, but for now we assume that Γ = {0, 1}m until it is needed in Section 4. The Bayesian model is now fully specified. The joint distribution is p(y, α, θ, γ, η, σ2) = p(y | α, θ, σ2)p(α | σ2)p(θ | γ, σ2)p(γ | η)p(η)p(σ2). (7) Among these distributions, the most important one is p(θ|γ, σ2). To understand its properties, we introduce the incidence matrix D Rm p for the base graph G = (V, E). The matrix D has entries Dei = 1 and Dej = 1 if e = (i, j), and Dek = 0 if k = i, j. We note that the definition of D depends on the order of edges {(i, j)} even if G is an undirected graph. 
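The incidence matrix is straightforward to construct in code. The sketch below (Python with NumPy; not part of the paper, with a made-up toy chain graph and illustrative values of v0, v1 and γ) builds D and checks that, although D itself depends on the chosen edge orientation, the weighted quadratic form D^T diag(v0^{-1}γ + v1^{-1}(1-γ)) D introduced next does not.

```python
# A minimal sketch (not the authors' code) of the incidence matrix D for a
# small base graph, illustrating that the sign pattern of D depends on the
# chosen edge orientation while the weighted Laplacian D^T diag(w) D does not.
import numpy as np

def incidence_matrix(p, edges):
    """edges: list of ordered pairs (i, j); row e of D is e_i - e_j."""
    D = np.zeros((len(edges), p))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = 1.0, -1.0
    return D

# Chain base graph on p = 4 nodes (the change-point example of Section 2.2).
edges = [(0, 1), (1, 2), (2, 3)]
D = incidence_matrix(4, edges)

# Flip the orientation of one edge: D changes, but D^T diag(w) D does not.
D_flip = incidence_matrix(4, [(1, 0), (1, 2), (2, 3)])
v0, v1 = 0.01, 10.0                              # illustrative values only
gamma = np.array([1.0, 0.0, 1.0])                # hypothetical edge indicators
w = gamma / v0 + (1.0 - gamma) / v1              # weights v0^{-1}γ + v1^{-1}(1-γ)
L = D.T @ np.diag(w) @ D
L_flip = D_flip.T @ np.diag(w) @ D_flip
assert np.allclose(L, L_flip)                    # orientation does not matter
```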
However, this does not affect any application that we will need in the paper. We then define the Laplacian matrix Lγ = DT diag v 1 0 γ + v 1 1 (1 γ) D. It is easy to see that Lγ is the graph Laplacian of the weighted graph with adjacency matrix {v 1 0 γij + v 1 1 (1 γij)}. Thus, we can write (3) as p(θ | γ, σ2) exp 1 2σ2 θT Lγθ I{θ Θw}. (8) Given its form, we name (8) the spike-and-slab Laplacian prior. Proposition 1 Suppose G = (V, E) is a connected base graph. For any γ {0, 1}m and v0, v1 (0, ), the graph Laplacian Lγ is positive semi-definite and has rank p 1. The only eigenvector corresponding to its zero eigenvalue is proportional to 1p, the vector with all entries 1. As a consequence, as long as 1T p w = 0, the spike-and-slab Laplacian prior is a non-degenerate distribution on Θw. Its density function with respect to the Lebesgue measure restricted to Θw is p(θ | γ, σ2) = 1 (2πσ2)(p 1)/2 detw(Lγ) exp 1 2σ2 θT Lγθ I{θ Θw}, where detw(Lγ) is the product of all nonzero eigenvalues of the positive semi-definite matrix Ip 1 w 2 ww T Lγ Ip 1 w 2 ww T . The proposition reveals two important conditions that lead to the well-definedness of the spike-and-slab Laplacian prior: the connectedness of the base graph G = (V, E) and 1T p w = 0. Without either condition, the distribution would be degenerate on Θw. Extensions to a base graph that is not necessarily connected is possible. We leave this task to Section 4 and Section 5, where tools from graph algebra are introduced. Bayesian Model Selection with Graph Structured Sparsity 2.2. Examples The Bayesian model (7) provides a very general framework. By choosing a different base graph G = (V, E), a design matrix X, a grounding vector w Rp and a precision parameter ν, we then obtain a different model. Several important examples are given below. Example 1 (Sparse linear regression) The sparse linear regression model y | θ, σ2 N(Xθ, σ2In) is a special case of (1). To put it into the general framework, we can expand the design matrix X Rn p and the regression vector θ Rp by [0n, X] Rn (p+1) and [θ0; θ] Rp+1. With the grounding vector w = [1; 0p], the sparse linear regression model can be recovered from (1). For the prior distribution, the base graph G consists of nodes V = {0, 1, ..., p} and edges {(0, i) : i [p]}. We set ν = , so that θ0 = 0 with prior probability one. Then, (3) is reduced to θ | γ, σ2 p(θ | γ, σ2) i=1 exp θ2 i 2σ2[v0γ0i + v1(1 γ0i)] That is, θi|γ, σ2 N(0, σ2[v0γ0i+v1(1 γ0i)]) independently for all i [n]. This is recognized as the spike-and-slab Gaussian prior for Bayesian sparse linear regression considered by George and Mc Culloch (1993, 1997); Roˇckov a and George (2014). Example 2 (Change-point detection) Set n = p, X = In, and w = 1n. We then have yi | θi, σ2 N(α + θi, σ2) independently for all i [n] from (1). For the prior distribution on α and θ, we consider ν = 0 and a one-dimensional chain graph G = (V, E) with E = {(i, i + 1) : i [n 1]}. This leads to a flat prior on α, and the prior on θ is given by θ | γ, σ2 p(θ | γ, σ2) i=1 exp (θi θi+1)2 2σ2[v0γi,i+1 + v1(1 γi,i+1)] I{1T p θ = 0}. A more general change-point model on a tree can also be obtained by constructing a tree base graph G. Example 3 (Two-dimensional image denoising) Consider a rectangular set of observations y Rn1 n2. With the same construction in Example 2 applied to sec(y), we obtain yij | θij, σ2 N(α + θij, σ2) independently for all (i, j) [n1] [n2] from (1). To model images, we consider a prior distribution that imposes closedness to nearby pixels. 
Consider ν = 0 and a base graph G = (V, E) shown in the picture below. Kim and Gao We then obtain a flat prior on α, and θ | γ, σ2 p(θ | γ, σ2) Y (ik,jl) E exp (θik θjl)2 2σ2[v0γik,jl + v1(1 γik,jl)] I{1T n1θ1n2 = 0}. Note that G is not a tree in this case. 3. EM Algorithm In this section, we will develop efficient EM algorithms for the general model. It turns out that the bottleneck is the computation of detw(Lγ) given some γ {0, 1}m. Lemma 2 Let spt(G) be the set of all spanning trees of G. Then detw(Lγ) = (1T p w)2 v 1 0 γij + v 1 1 (1 γij) . In particular, if G is a tree, then detw(Lγ) = (1T p w)2 w 2 Q (i,j) E v 1 0 γij + v 1 1 (1 γij) . The lemma suggests that the hardness of computing detw(Lγ) depends on the number of spanning trees of the base graph G. When the base graph is a tree, detw(Lγ) is factorized over the edges of the tree, which greatly simplifies the derivation of the algorithm. We will derive a closed-form EM algorithm in Section 3.1 when G is a tree, and the algorithm for a general G will be given in Section 3.2. 3.1. The Case of Trees We treat γ as latent. Our goal is to maximize the marginal distribution after integrating out the latent variables. That is, max α,θ Θw,η,σ2 log X γ p(y, α, θ, γ, η, σ2), (9) where p(y, α, θ, γ, η, σ2) is given by (7). Since the summation over γ is intractable, we consider an equivalent form of (9), which is max q max α,θ Θw,η,σ2 X γ q(γ) log p(y, α, θ, γ, η, σ2) q(γ) . (10) Then, the EM algorithm is equivalent to iteratively updating q, α, θ Θw, η, σ2 (Neal and Hinton, 1998). Now we illustrate the EM algorithm that solves (10). The E-step is to update q(γ) given the previous values of θ, η, σ. In view of (7), we have qnew(γ) p(y, α, θ, γ, η, σ2) p(θ | γ, σ2)p(γ | η). (11) Bayesian Model Selection with Graph Structured Sparsity According to (2), p(θ | γ, σ2) can be factorized when the base graph G = (V, E) is a tree. Therefore, with a simpler notation qij = q(γij = 1), we can write the update for q as qnew(γ) = Q (i,j) E(qnew ij )γij(1 qnew ij )1 γij, where qnew ij = ηφ(θi θj; 0, σ2v0) ηφ(θi θj; 0, σ2v0) + (1 η)φ(θi θj; 0, σ2v1). (12) Here, φ( ; µ, σ2) stands for the density function of N(µ, σ2). To derive the M-step, we introduce the following function F(α, θ; q) = y X(αw + θ) 2 + να2 + θT Lqθ, (13) where Lq is obtained by replacing γ with q in the definition of the graph Laplacian Lγ. The M-step consists of the following three updates, (αnew, θnew) = argmin α,θ Θw F(α, θ; qnew), (14) (σ2)new = argmin σ2 F(αnew, θnew; qnew) + b 2σ2 + p + n + a + 2 2 log(σ2) , (15) ηnew = argmax η [(A 1 + qnew sum) log η + (B 1 + p 1 qnew sum) log(1 η)] ,(16) where the notation qnew sum stands for P (i,j) E qnew ij . While (14) is a simple quadratic programming, (15) and (16) have closed forms, which are given by (σ2)new = F(αnew, θnew; qnew) + b p + n + a + 2 and ηnew = A 1 + qnew sum A + B + p 3. (17) We remark that the EMVS algorithm (Roˇckov a and George, 2014) is a special case for the sparse linear regression problem discussed in Example 1. When G is a tree, the spike-and-slab graph Laplacian prior (8) is proportional to the product of individual spike-and-slab priors p(θ | γ, σ2) Y (i,j) E exp (θi θj)2 2σ2[v0γij + v1(1 γij)] supported on Θw, as we have seen in Example 1 and 2. In this case, the above EM algorithm we have developed can also be extended to models with alternative prior distributions, such as the spike-and-slab Lasso prior (Roˇckov a and George, 2018) and the finite normal mixture prior (Stephens, 2016). 3.2. 
General Graphs When the base graph G is not a tree, the E-step becomes computationally infeasible due to the lack of separability of p(θ|γ, σ2) in γ. In fact, given the form of the density function in Proposition 1, the main problem lies in the term p detw(Lγ), which cannot be factorized over (i, j) E when the base graph G = (V, E) is not a tree (Lemma 2). To overcome the difficulty, we consider optimizing a lower bound of the objective function (10). This means we need to find a good lower bound for log detw(Lγ). Similar techniques are also advocated Kim and Gao in the context of learning exponential family graphical models (Wainwright and Jordan, 2008). By Lemma 2, we can write log detw(Lγ) = log X v 1 0 γij + v 1 1 (1 γij) + log (1T p w)2 We only need to lower bound the first term on the right hand side of the equation above, because the second term is independent of γ. By Jensen s inequality, for any non-negative sequence {λ(T)}T spt(G) such that P T spt(G) λ(T) = 1, we have v 1 0 γij + v 1 1 (1 γij) T spt(G) λ(T) log Y v 1 0 γij + v 1 1 (1 γij) X T spt(G) λ(T) log λ(T) T spt(G) λ(T)I{(i, j) T} log v 1 0 γij + v 1 1 (1 γij) X T spt(G) λ(T) log λ(T). One of the most natural choices of the weights {λ(T)}T spt(G) is the uniform distribution λ(T) = 1 |spt(G)|. This leads to the following lower bound v 1 0 γij + v 1 1 (1 γij) (i,j) E rij log v 1 0 γij + v 1 1 (1 γij) + log |spt(G)|, (19) where rij = 1 |spt(G)| T spt(G) I{(i, j) T}. (20) The quantity rij defined in (20) is recognized as the effective resistance between the ith and the jth nodes (Lov asz, 1993; Ghosh et al., 2008). Given a graph, we can treat each edge as a resistor with resistance 1. Then, the effective resistance between the ith and the jth nodes is the resistance between i and j given by the whole graph. That is, if we treat the entire graph as a resistor. Let L be the (unweighted) Laplacian matrix of the base graph G = (V, E), and L+ its pseudo-inverse. Then, an equivalent definition of (20) is given by the formula rij = (ei ej)T L+(ei ej), where ej is the basis vector with the ith entry 1 and the remaining entries 0. Therefore, computation of the effective resistance can leverage fast Laplacian solvers in the literature (Spielman and Teng, 2004; Livne and Brandt, 2012). Some important examples of effective resistance are listed below: Bayesian Model Selection with Graph Structured Sparsity When G is the complete graph of size p, then rij = 2/p for all (i, j) E. When G is the complete bipartite graph of sizes p and k, then rij = p+k 1 pk for all (i, j) E. When G is a tree, then rij = 1 for all (i, j) E. When G is a two-dimensional grid graph of size n1 n2, then rij [0.5, 0.75] depending on how close the edge (i, j) is from its closest corner. When G is a lollipop graph, the conjunction of a linear chain with size p and a complete graph with size k, then rij = 1 or 2/k depending on whether the edge (i, j) belongs to the chain or the complete graph. By (19), we obtain the following lower bound for the objective function (10), max q max α,θ Θw,η,σ2 X γ q(γ) log p(y | α, θ, σ2)p(α | σ2)ep(θ | γ, σ2)p(γ | η)p(η)p(σ2) q(γ) , (21) where the formula of ep(θ | γ, σ2) is obtained by applying the lower bound (19) in the formula of p(θ | γ, σ2) in Proposition 1. Since ep(θ | γ, σ2) can be factorized over (i, j) E, the E-step is given by qnew(γ) = Q (i,j) E(qnew ij )γij(1 qnew ij )1 γij, where qnew ij = ηv rij/2 0 e (θi θj)2/2σ2v0 ηv rij/2 0 e (θi θj)2/2σ2v0 + (1 η)v rij/2 1 e (θi θj)2/2σ2v1 . 
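The effective resistances r_ij appearing as exponents in the E-step above can be computed directly from the Moore-Penrose pseudo-inverse of the unweighted base-graph Laplacian, via r_ij = (e_i - e_j)^T L^+ (e_i - e_j). The sketch below (Python with NumPy; not the authors' R/Julia implementation) does this for small graphs and checks two of the examples listed earlier; for large graphs one would call a fast Laplacian solver rather than form a dense pseudo-inverse.

```python
# A minimal sketch (illustration only) of effective resistance computation.
import numpy as np

def effective_resistances(p, edges):
    L = np.zeros((p, p))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    Lp = np.linalg.pinv(L)                       # L^+ for a small graph
    return {(i, j): Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for i, j in edges}

# Sanity checks against the examples listed above.
p = 5
complete = [(i, j) for i in range(p) for j in range(i + 1, p)]
assert np.allclose(list(effective_resistances(p, complete).values()), 2 / p)

chain = [(i, i + 1) for i in range(p - 1)]       # a tree: every r_ij equals 1
assert np.allclose(list(effective_resistances(p, chain).values()), 1.0)
```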
(22) Observe that the lower bound (19) is independent of α, θ, η, σ2, and thus the M-step remains the same as in the case of a tree base graph. The formulas are given by (14)-(16), except that (16) needs to be replaced by ηnew = A 1 + qnew sum A + B + m 2. The EM algorithm for a general base graph can be viewed as a natural extension of that of a tree base graph. When G = (V, E) is a tree, it is easy to see from the formula (20) that rij = 1 for all (i, j) E. In this case, the E-step (22) is reduced to (12), and the inequality (19) becomes an equality. 3.3. Bayesian Model Selection The output of the EM algorithm bq(γ) can be understood as an estimator of the posterior distribution p(γ|bα, bθ, bσ2, bη), where bα, bθ, bσ2, bη are obtained from the M-step. Then, we get a subgraph according to the thresholding rule bγij = I{bqij 1/2}. It can be understood as a model learned from the data. The sparsity of the model critically depends on the values of v0 and v1 in the spike-and-slab Laplacian prior. With a fixed large value of v1, we can obtain the solution path of bγ = bγ(v0) by varying v0 from 0 to v1. The question then is how to select the best model along the solution path of the EM algorithm. The strategy suggested by Roˇckov a and George (2014) is to calculate the posterior score p(γ|y) with respect to the Bayesian model of v0 = 0. While the meaning of p(γ|y) Kim and Gao corresponding to v0 = 0 is easily understood for the sparse linear regression setting in Roˇckov a and George (2014), it is less clear for a general base graph G = (V, E). In order to define a version of (7) for v0 = 0, we need to introduce the concept of edge contraction. Given a γ {0, 1}m, the graph corresponding to the adjacency matrix γ induces a partition of disconnected components {C1, ..., Cs} of [p]. In other words, {i, j} Cl for some l [s] if and only if there is some path between i and j in the graph γ. For notational convenience, we define a vector z [s]n so that zi = l if and only if i Cl. A membership matrix Zγ {0, 1}p s is defined with its (i, l)th entry being the indicator I{zi = l}. We let e G = (e V , e E) be a graph obtained from the base graph G = (V, E) after the operation of edge contraction. In other words, every node in e G is obtained by combining nodes in G according to the partition of {C1, ..., Cs}. To be specific, e V = [s], and (k, l) e E if and only if there exists some i Ck and some j Cl such that (i, j) E. Now we are ready to define a limiting version of (3) as v0 0. Let e Lγ = DT diag(v 1 1 (1 γ))D, which is the graph Laplacian of the weighted graph with adjacency matrix {v 1 1 (1 γij)}. Then, define p(eθ | γ, σ2) = 1 (2πσ2)(s 1)/2 det ZT γ w(ZTγ e LγZγ) exp eθT ZT γ e LγZγeθ I{eθ ΘZT γ w}. (23) With e G = (e V , e E) standing for the contracted base graph, the prior distribution (23) can also be written as p(eθ | γ, σ2) exp ωkl(eθk eθl)2 I{eθ ΘZT γ w}, (24) where ωkl = P (i,j) E I{z(i) = k, z(j) = l}, which means that the edges {(i, j)}z(i)=k,z(j)=l in the base graph G = (V, E) are contracted as a new edge (k, l) in e G = (e V , e E) with ωkl as the weight. Proposition 3 Suppose G = (V, E) is connected and 1T p w = 0. Let Zγ be the membership matrix defined as above. Then for any γ {0, 1}m, (23) is a well-defined density function on the (s 1)-dimensional subspace {eθ Rs : w T Zγeθ = 0}. Moreover, for an arbitrary design matrix X Rn p, the distribution of θ that follows (3) weakly converges to that of Zγeθ as v0 0. 
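The edge-contraction construction behind Proposition 3 is easy to carry out numerically: the connected components induced by γ give the membership matrix Z_γ, and the contracted Laplacian Z_γ^T L̃_γ Z_γ then follows by matrix multiplication. A minimal sketch (Python with NumPy/SciPy; not part of the paper, with a made-up toy graph):

```python
# A minimal sketch (not the authors' code) of edge contraction: from the edge
# indicators γ on the base graph, form the components C_1, ..., C_s, the
# membership matrix Z_γ, and Z_γ^T L̃_γ Z_γ with L̃_γ = D^T diag(v1^{-1}(1-γ)) D.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def contract(p, edges, gamma, v1):
    # Components of the subgraph that keeps only edges with γ_ij = 1.
    kept = [(i, j) for (i, j), g in zip(edges, gamma) if g == 1]
    rows = [i for i, _ in kept]; cols = [j for _, j in kept]
    adj = coo_matrix((np.ones(len(kept)), (rows, cols)), shape=(p, p))
    s, labels = connected_components(adj, directed=False)
    Z = np.zeros((p, s)); Z[np.arange(p), labels] = 1.0     # membership matrix
    # L̃_γ keeps only the "slab" edges (γ_ij = 0), each with weight 1/v1.
    L_tilde = np.zeros((p, p))
    for (i, j), g in zip(edges, gamma):
        w = (1 - g) / v1
        L_tilde[i, i] += w; L_tilde[j, j] += w
        L_tilde[i, j] -= w; L_tilde[j, i] -= w
    return Z, Z.T @ L_tilde @ Z, labels

# Toy chain graph on 4 nodes; contracting the first edge merges nodes 0 and 1.
edges = [(0, 1), (1, 2), (2, 3)]
Z, L_contracted, labels = contract(4, edges, gamma=[1, 0, 0], v1=10.0)
print(labels)        # e.g. [0 0 1 2]: three contracted nodes remain
```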
Motivated by Proposition 3, a limiting version of (7) for v0 = 0 is defined as follows, y | α, eθ, γ, σ2 N(X(αw + Zγeθ), σ2In). (25) Then, p(eθ | γ, σ2) is given by (23), and p(α|σ2), p(γ|η), p(η), p(σ2) are specified in (2) and (4)-(6). The posterior distribution of γ has the formula p(γ | y) Z Z Z Z p(y, α, eθ, γ, η, σ2) dα deθ dη dσ2 = Z p(σ2) Z p(α | σ2) Z p(y | α, eθ, γ, σ2)p(eθ | γ, σ2) deθ dα dσ2 Z p(γ | η)p(η)dη. Bayesian Model Selection with Graph Structured Sparsity A standard calculation using conjugacy gives det ZT γ w(ZT γ e LγZγ) det ZT γ w(ZTγ (XT X + e Lγ)Zγ) !1/2 ν ν + w T XT (In Rγ)Xw y T (In Rγ)y |w T XT (In Rγ)y|2 ν + w T XT (In Rγ)Xw + b n+a Beta P (i,j) E γij + A 1, P (i,j) E(1 γij) + B 1) Beta(A, B) , where Rγ = XZγ(ZT γ (XT X + e Lγ)Zγ) 1ZT γ XT . This defines the model selection score g(γ) = log p(γ | y) up to a universal additive constant. The Bayesian model selection procedure evaluates g(γ) on the solution path {bγ(v0)}0 0 models approximate sparsity in the sense that γij implies θi θj. Though v0 > 0 does not offer interpretation of exact sparsity, a nonzero v0 leads to efficient computation via the EM algorithm. That is, for a v0 > 0, we can maximize max q max θ γ q(γ) log p(y | θ)pv0(θ | γ)p(γ) 1. We have ignored other parameters such as α, η, σ2 in order to make the discussion below clear and concise. Kim and Gao which is the objective function of EM. Denote the output of the algorithm by qv0(γ) = Q ij qij,v0, we then obtain our model by bγij(v0) = I{qij,v0 > 0.5}. As we vary v0 on a grid from 0 to v1, we obtain a path of models {bγ(v0)}0 0 for all j [k]. In other words, none of the k clusters is empty. Proposition 4 Let the conditional distribution of θ, µ | γ, σ2 be specified by (33) with some non-degenerate γ. Then, the distribution of θ | γ, σ2 weakly converges to p(θ | γ, σ2) Y 1 i bk. This does not matter, because the model selection score (46) does not depend on the clustering labels. Finally, the bγ constructed according to the above procedure will be evaluated by g(bγ) defined by (46). In the toy example with four data points y = (4, 2, 2, 4)T , the model selection score is computed along the solution path. According to Figure 1, the model selection procedure suggests that a clustering structure with two clusters {4, 2} and { 2, 4} is the most plausible one. We also note that the curve of g(γ) has sharp phase transitions whenever the solution paths µ merge. 5. Extensions with Graph Algebra In many applications, it is useful to have a model that imposes both row and column structures on a high-dimensional matrix θ Rp1 p2. We list some important examples below. 1. Biclustering. In applications such as gene expression data analysis, one needs to cluster both samples and features. This task imposes a clustering structure for both rows and columns of the data matrix (Hartigan, 1972; Cheng and Church, 2000). 2. Block sparsity. In problems such as planted clique detection (Feige and Krauthgamer, 2000) and submatrix localization (Hajek et al., 2017), the matrix can be viewed as the sum of a noise background plus a submatrix of signals with unknown locations. Equivalently, it can be modeled by simultaneous row and column sparsity (Ma and Wu, 2015). 3. Sparse clustering. Suppose the data matrix exhibits a clustering structure for its rows and a sparsity structure for its columns, then we have a sparse clustering problem (Witten and Tibshirani, 2010). For this task, we need to select nonzero column features in order to accurately cluster the rows. 
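Before turning to these matrix-structured models, the machinery of Sections 3 and 4 can be summarized in a short worked example. The sketch below (Python with NumPy; not the authors' R/Julia implementation, and with purely illustrative hyperparameter and grid choices) runs the exact tree-case EM of Section 3.1 on the change-point model of Example 2 (chain base graph, w = 1_n, ν = 0) and traces the solution path {γ̂(v0)} on which the selection score g(γ) would then be evaluated.

```python
# A minimal sketch of the EM solution path for the change-point model.
import numpy as np

def norm_pdf(x, var):
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_chain(y, v0, v1, a=1.0, b=1.0, A=1.0, B=1.0, n_iter=100):
    n = len(y)
    alpha, theta = y.mean(), y - y.mean()
    sigma2, eta = y.var(), 0.5
    for _ in range(n_iter):
        # E-step (12): posterior inclusion probability of each chain edge.
        d = np.diff(theta)
        num = eta * norm_pdf(d, sigma2 * v0)
        q = num / (num + (1 - eta) * norm_pdf(d, sigma2 * v1))
        # M-step (14): with ν = 0 and 1ᵀθ = 0, α = ȳ and θ = (I + L_q)⁻¹(y - ȳ).
        w = q / v0 + (1 - q) / v1                 # edge weights of L_q
        L = np.diag(np.r_[w, 0] + np.r_[0, w])
        L -= np.diag(w, 1) + np.diag(w, -1)
        theta = np.linalg.solve(np.eye(n) + L, y - alpha)
        # M-step (15)-(17): closed-form updates of σ² and η (here p = n).
        F = np.sum((y - alpha - theta)**2) + theta @ L @ theta
        sigma2 = (F + b) / (2 * n + a + 2)
        eta = (A - 1 + q.sum()) / (A + B + n - 3)
    return q, theta, sigma2, eta

# Solution path: sweep v0 over a grid and threshold q at 1/2.
rng = np.random.default_rng(0)
y = np.r_[np.zeros(50), 3 * np.ones(50)] + rng.normal(size=100)
v1 = 100.0
for v0 in [1e-4, 1e-3, 1e-2, 1e-1]:
    q, theta, sigma2, eta = em_chain(y, v0, v1)
    gamma_hat = (q >= 0.5).astype(int)            # γ_ij = 0 marks a change point
    print(v0, np.where(gamma_hat == 0)[0])
```

Each model γ̂(v0) on the resulting path would then be scored by g(γ) as described in Section 3.3, and the maximizer reported as the selected model.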
For the problems listed above, the row and column structures can be modeled by graphs γ1 and γ2. Then, the structure of the matrix θ is induced by a notion of graph product of γ1 Kim and Gao and γ2. In this section, we introduce tools from graph algebra including Cartesian product and Kronecker product to build complex structure from simple components. We first introduce the likelihood of the problem. To cope with many useful models, we assume that the observation can be organized as a matrix y Rn1 n2. Then, the specific setting of a certain problem can be encoded by design matrices X1 Rn1 p1 and X2 Rn2 p2. The likelihood is defined by y | α, θ, σ2 N(X1(αw + θ)XT 2 , σ2In1 In2). (47) The matrix w Rp1 p2 is assumed to have rank one, and can be decomposed as w = w1w T 2 for some w1 Rp1 and w2 Rp2. The prior distribution of the scalar is simply given by α | σ2 N(0, σ2/ν). (48) We then need to build prior distributions of θ that is supported on Θw = {θ Rp1 p2 : Tr(wθT ) = 0} using Cartesian and Kronecker products. 5.1. Cartesian Product We start with the definition of the Cartesian product of two graphs. Definition 6 Given two graphs G1 = (V1, E1) and G2 = (V2, E2), their Cartesian product G = G1 G2 is defined with the vertex set V1 V2. Its edge set contains ((x1, x2), (y1, y2)) if and only if x1 = y1 and (x2, y2) E2 or (x1, y1suno) E1 and x2 = y2. According to the definition, it can be checked that for two graphs of sizes p1 and p2, the adjacency matrix, the Laplacian and the incidence matrix of the Cartesian product enjoy the relations A1 2 = A2 Ip1 + Ip2 A1, L1 2 = L2 Ip1 + Ip2 L1, D1 2 = [D2 Ip1; Ip2 D1]. Given graphs γ1 and γ2 that encode row and column structures of θ, we introduce the following prior distribution p(θ | γ1, γ2, σ2) Y (i,j) E1 exp θi θj 2 2σ2[v0γ1,ij + v1(1 γ1,ij)] (k,l) E2 exp θ k θ l 2 2σ2[v0γ2,kl + v1(1 γ2,kl)] Here, E1 and E2 are the base graphs of the row and column structures. According to its form, the prior distribution (49) models both pairwise relations of rows and those of columns based on γ1 and γ2, respectively. To better understand (49), we can write it in the following equivalent form, p(θ | γ1, γ2, σ2) exp 1 2σ2 vec(θ)T (Lγ2 Ip1 + Ip2 Lγ1) vec(θ) I{θ Θw}, (50) Bayesian Model Selection with Graph Structured Sparsity where Lγ1 Rp1 p1 and Lγ2 Rp2 p2 are Laplacian matrices of the weighted graphs {v0γ1,ij + v1(1 γ1,ij)} and {v0γ2,kl + v1(1 γ2,kl)}, respectively. Therefore, by Definition 6, p(θ | γ1, γ2, σ2) is a spike-and-slab Laplacian prior p(θ | γ, σ2) defined in (3) with γ = γ1 γ2, and the well-definedness is guaranteed by Proposition 1. To complete the Bayesian model, the distribution of (γ1, γ2, σ2) are specified by γ1, γ2 | η1, η2 Y (i,j) E1 ηγ1,ij 1 (1 η1)1 γ1,ij Y (i,j) E2 ηγ2,kl 2 (1 η2)1 γ2,kl, (51) η1, η2 Beta(A1, B1) O Beta(A2, B2), (52) σ2 Inv Gamma(a/2, b/2). (53) We remark that it is possible to constrain γ1 and γ2 in some subsets Γ1 and Γ2 like (4). This extra twist is useful for a biclustering model that will be discussed in Section 5.3. Note that in general the base graph G = G1 G2 is not a tree, and the derivation of the EM algorithm follows a similar argument in Section 3.2. Using the same argument in (19), we lower bound log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] by e E1 E2 re log[v 1 0 γe + v 1 1 (1 γe)] + log |spt(G)|. 
(54) E1 E2 = {((i, k), (j, k)) : (i, j) E1, k V2} {((i, k), (i, l)) : i V1, (k, l) E2} , and γ = γ1 γ2, we can write (54) as k=1 r(i,k),(j,k) log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] i=1 r(i,k),(i,l) log[v 1 0 γ2,kl + v 1 1 (1 γ2,kl)] (i,j) E1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] + X (k,l) E2 r2,kl log[v 1 0 γ2,kl + v 1 1 (1 γ2,kl)], k=1 r(i,k),(j,k) = 1 |spt(G)| T spt(G) I{((i, k), (j, k)) T}, and r2,kl is similarly defined. Kim and Gao Using the lower bound derived above, it is direct to derive the an EM algorithm, which consists of the following iterations, qnew 1,ij = η1v r1,ij/2 0 e θi θj 2/2σ2v0 η1v r1,ij/2 0 e θi θj 2/2σ2v0 + (1 η1)v r1,ij/2 1 e θi θj 2/2σ2v1 , (55) qnew 2,kl = η2v r2,kl/2 0 e θ k θ l 2/2σ2v0 η2v r2,kl/2 0 e θ k θ l 2/2σ2v0 + (1 η2)v r2,kl/2 1 e θ k θ l 2/2σ2v1 , (56) (αnew, θnew) = argmin α,θ Θw F(α, θ; qnew 1 , qnew 2 ), (57) (σ2)new = F(αnew, θnew; qnew 1 , qnew 2 ) + b n1n2 + p1p2 + a + 2 , ηnew 1 = A1 1 + P (i,j) E1 qnew 1,ij A1 + B1 2 + m1 , ηnew 2 = A2 1 + P (k,l) E2 qnew 2,ij A2 + B2 2 + m2 . (58) The definition of the function F(α, θ; q1, q2) is given by F(α, θ; q1, q2) = y X1(αw + θ)XT 2 2 F + να2 + vec(θ)T (Lq2 Ip1 + Ip2 Lq1) vec(θ) Though the E-steps (55) and (56) are straightforward, the M-step (57) is a quadratic programming of dimension p1p2, which may become the computational bottleneck of the EM algorithm when the size of the problem is large. We will introduce a Dykstra-like algorithm to solve (57) in Appendix E. The Cartesian product model is useful for simultaneous learning the row structure γ1 and the column structure and γ2 of the coefficient matrix θ. Note that when X1 = X, X2 = Id, and E2 = , the Cartesian product model becomes the multivariate regression model described in Section 4.1. In this case, the model only regularizes the row structure of θ. Another equally interesting example is obtained when X1 = X, X2 = Id, and E1 = . In this case, the model only regularizes the column structure of θ, and can be interpreted as multitask learning with task clustering. 5.2. Kronecker Product The Kronecker product of two graphs is defined below. Definition 7 Given two graphs G1 = (V1, E1) and G2 = (V2, E2), their Kronecker product G = G1 G2 is defined with the vertex set V1 V2. Its edge set contains ((x1, x2), (y1, y2)) if and only if (x1, y1) E1 and (x2, y2) E2. It is not hard to see that the adjacency matrix of two graphs has the formula A1 2 = A1 A2, which gives the name of Definition 7. The prior distribution of θ given row and column graphs γ1 and γ2 that we discuss in this subsection is p(θ | γ1, γ2, σ2) Y (k,l) E2 exp (θik θjl)2 2σ2[v0γ1,ijγ2,kl + v1(1 γ1,ijγ2,kl)] Bayesian Model Selection with Graph Structured Sparsity Again, E1 and E2 are the base graphs of the row and column structures. According to its form, the prior imposes a nearly block structure on θ based on the graphs γ1 and γ2. Moreover, p(θ | γ1, γ2, σ2) can be viewed as a spike-and-slab Laplacian prior p(θ | γ, σ2) defined in (3) with γ = γ1 γ2. The distribution of (γ1, γ2, σ2) follows the same specification in (51)-(53). To derive an EM algorithm, we follow the strategy in Section 3.2 and lower bound log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] by X (k,l) E2 r(i,k),(j,l) log[v 1 0 γ1,ijγ2,kl + v 1 1 (1 γ1,ijγ2,kl)]. (60) Unlike the Cartesian product, the Kronecker product structure has a lower bound (60) that is not separable with respect to γ1 and γ2. This makes the E-step combinatorial, and does not apply to a large-scale problem. 
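Both graph products can be assembled directly from the component adjacency and Laplacian matrices using the Kronecker-product identities stated after Definitions 6 and 7. A small numerical sketch (Python with NumPy; illustration only, using two toy chain graphs):

```python
# A minimal sketch (not from the paper) of forming the Cartesian and Kronecker
# product graphs from component adjacency matrices, using the identities
# A_{1□2} = A2 ⊗ I + I ⊗ A1 and A_{1⊗2} = A1 ⊗ A2.
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def chain_adjacency(p):
    A = np.zeros((p, p))
    idx = np.arange(p - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1
    return A

A1, A2 = chain_adjacency(3), chain_adjacency(4)          # two chain graphs
p1, p2 = 3, 4

# Cartesian product: a 3 x 4 grid graph.
A_cart = np.kron(A2, np.eye(p1)) + np.kron(np.eye(p2), A1)
L_cart = np.kron(laplacian(A2), np.eye(p1)) + np.kron(np.eye(p2), laplacian(A1))
assert np.allclose(laplacian(A_cart), L_cart)            # L_{1□2} identity

# Kronecker product: each edge pairs an E1-edge with an E2-edge.
A_kron = np.kron(A1, A2)
print(A_cart.sum() / 2, A_kron.sum() / 2)                # edge counts: 17 and 12
```

In particular, the Cartesian product of two chain graphs is exactly the two-dimensional grid base graph used for image denoising in Example 3.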
To alleviate this computational barrier, we consider a variational EM algorithm that finds the best posterior distribution of γ1, γ2 that can be factorized. In other words, instead of maximizing over all possible distribution q, we maximize over the mean-filed class q Q, with Q = {q(γ1, γ2) = q1(γ1)q2(γ2) : q1, q2}. Then, the objective becomes max q1,q2 max α,θ Θ2,δ,η,σ2 X γ1,γ2 q1(γ1)q2(γ2) log ep(y, α, θ, δ, γ1, γ2, η, σ2) q1(γ1)q2(γ2) , where ep(y, α, θ, δ, γ1, γ2, η, σ2) is obtained by replacing p(θ | γ1, γ2, σ2) with ep(θ | γ1, γ2, σ2) in the joint distribution p(y, α, θ, δ, γ1, γ2, η, σ2). Here, log ep(θ | γ1, γ2, σ2) is a lower bound for log p(θ | γ1, γ2, σ2) with (60). The E-step of the variational EM is qnew 1 (γ1) exp γ2 q2(γ2) log ep(y, α, θ, δ, γ1, γ2, η, σ2) qnew 2 (γ2) exp γ1 qnew 1 (γ1) log ep(y, α, θ, δ, γ1, γ2, η, σ2) After some simplification, we have qnew 1,ij = 1 + (1 η1) Q (k,l) E2 v r(i,k),(j,l)/2 1 e (θik θjl)2/2σ2v1 q2,kl η1 Q (k,l) E2 v r(i,k),(j,l)/2 0 e (θik θjl)2/2σ2v0 q2,kl qnew 2,kl = 1 + (1 η2) Q (i,j) E1 v r(i,k),(j,l)/2 1 e (θik θjl)2/2σ2v1 qnew 1,kl η2 Q (i,j) E1 v r(i,k),(j,l)/2 0 e (θik θjl)2/2σ2v0 qnew 1,kl The M-step can be derived in a standard way, and it has the same updates as in (57)-(58), with a new definition of F(α, θ; q1, q2) given by F(α, θ; q1, q2) = y X1(αw + θ)X2 2 + να2 v0 + 1 q1,ijq2,kl (θik θjl)2. Kim and Gao 5.3. Applications in Biclustering When both row and column graphs encode clustering structures discussed in Section 4.2, we have the biclustering model. In this section, we discuss both biclustering models induced by Kronecker and Cartesian products. We start with a special form of the likelihood (47), which is given by y | α, θ, σ2 N(α1n11T n2 + θ, σ2In1 In2), and the prior distribution on α is given by (48). The prior distribution on θ will be discussed in two cases. Cartesian product θil θil µ2,ih θi l θi l µ2,i h µ1,jl µ1,jl Kronecker product Figure 2: Structure diagrams for the two biclustering methods. The Cartesian product biclustering model (Left) and the Kronecker product biclustering model (Right) have different latent variables and base graphs. While the Cartesian product models the row and column clustering structures by separate latent variable matrices µ1 Rk1 n2 and µ2 Rn1 k2, the Kronecker product directly models the checkerboard structure by a single latent matrix µ Rk1 k2. 5.3.1. Cartesian product biclustering model Let k1 [n1] and k2 [n2] be upper bounds of the numbers of row and column clusters, respectively. We introduce two latent matrices µ1 Rk1 n2 and µ2 Rn1 k2 that serve as row and column clustering centers. The prior distribution is then specified by p(θ, µ1, µ2 | γ1, γ2, σ2) j=1 exp θi µ1,j 2 2σ2[v0γ1,ij + v1(1 γ1,ij)] h=1 exp θ l µ2, h 2 2σ2[v0γ2,lh + v1(1 γ2,lh)] I{1T n1θ1n2 = 0}, Bayesian Model Selection with Graph Structured Sparsity which can be regarded as an extension of (33) in the form of (49). The prior distributions on γ1 and γ2 are independently specified by (35) with (k, n) replaced by (k1, n1) and (k2, n2). Finally, σ2 follows the inverse Gamma prior (6). We follow the framework of Section 3.2. The derivation of the EM algorithm requires lower bounding log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)]. Using the same argument in Section 5.1, we have the following lower bound j=1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] + h=1 r2,lh log[v 1 0 γ2,lh + v 1 1 (1 γ2,lh)]. (63) By the symmetry of the complete bipartite graph, r1,ij is a constant that does not depend on (i, j). 
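The symmetry claim above can also be checked numerically: for a complete bipartite base graph with parts of sizes n and k, every edge has the same effective resistance (n + k - 1)/(nk), matching the list of examples in Section 3.2. A quick illustration (Python with NumPy; not part of the paper):

```python
# A quick numerical check (illustration only) that the effective resistance of
# the complete bipartite base graph K_{n,k} is constant across edges.
import numpy as np

n, k = 6, 3
p = n + k
edges = [(i, n + j) for i in range(n) for j in range(k)]
L = np.zeros((p, p))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1
Lp = np.linalg.pinv(L)
r = np.array([Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for i, j in edges])
assert np.allclose(r, (n + k - 1) / (n * k))     # same value for every edge
```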
Then use the same argument in (37)-(38), and we obtain the fact that Pn1 i=1 Pk1 j=1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] is independent of {γ1,ij}, and the same conclusion also applies to the second term of (63). Since the lower bound (63) does not dependent on γ1, γ2, the determinant factor in the density function p(θ, µ1, µ2 | γ1, γ2, σ2) does not play any role in the derivation of the EM algorithm. With some standard calculations, the E-step is given by qnew 1,ij = exp θi µ1,j 2 Pk1 u=1 exp θi µ1,u 2 2σ2 v , qnew 1,lh = exp θ l µ2, h 2 Pk2 v=1 exp θ l µ2, v 2 where v 1 = v 1 0 v 1 1 . The M-step is given by (αnew, θnew, µnew 1 , µnew 2 ) = argmin α,1Tn1θ1n2=0,µ1,µ2 F(α, θ, µ1, µ2; qnew 1 , qnew 2 ), (σ2)new = F(αnew, θnew, µnew 1 , µnew 2 ; qnew 1 , qnew 2 ) + b 2n1n2 + n1k2 + n2k1 + a + 2 , where F(α, θ, µ1, µ2; q1, q2) = y α1n11T n2 θ 2 F + ν α 2 v0 + 1 q1,ij v0 + 1 q2,lh θ l µ2, h 2. 5.3.2. Kronecker product biclustering model For the Kronecker product structure, we introduce a latent matrix µ Rk1 k2. Since the biclustering model implies a block-wise constant structure for θ. Each entry of µ serves as a center for a block of the matrix θ. The prior distribution is defined by p(θ, µ | γ1, γ2, σ2) h=1 exp (θil µjh)2 2σ2[v0γ1,ijγ2,lh + v1(1 γ1,ijγ2,lh)] I{1T n1θ1n2 = 0}. Kim and Gao The prior distribution is another extension of (33), and it is in a similar form of (59). To finish the Bayesian model specification, we consider the same priors for γ1, γ2, σ2 as in the Cartesian product case. Recall that the lower bound of log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] is given by (60) for a general Kronecker product structure. In the current setting, a similar argument gives the lower bound h=1 r(i,l),(j,h) log[v 1 0 γ1,ijγ2,lh + v 1 1 (1 γ1,ijγ2,lh)]. Since r(i,l),(j,h) r is independent of (i, l), (j, h) by the symmetry of the complete bipartite graph, the above lower bound can be written as h=1 log[v 1 0 γ1,ijγ2,lh + v 1 1 (1 γ1,ijγ2,lh)] = r log(v 1 0 ) h=1 γ1,ijγ2,lh + r log(v 1 1 ) h=1 (1 γ1,ijγ2,lh) = rn1n2 log(v 1 0 ) + rn1n2(k1k2 1) log(v 1 1 ), which is independent of γ1, γ2. The inequality (5.3.2) is because both γ1 and γ2 satisfy (34). Again, the determinant factor in the density function p(θ, µ | γ1, γ2, σ2) does not play any role in the derivation of the EM algorithm, because the lower bound (5.3.2) does not depend on (γ1, γ2). Since we are working with the Kronecker product, we will derive a variational EM algorithm with the E-step finding the posterior distribution in the mean filed class Q = {q(γ1, γ2) = q1(γ1)q2(γ2) : q1, q2}. By following the same argument in Section 5.2, we obtain the E-step as qnew 1,ij = exp Pn2 l=1 Pk2 h=1 q2,lh(θil µjh)2 Pk1 u=1 exp Pn2 l=1 Pk2 h=1 q2,lh(θil µuh)2 qnew 2,lh = exp Pn1 i=1 Pk1 j=1 qnew 1,ij (θil µjh)2 Pk2 v=1 exp Pn1 i=1 Pk1 v=1 qnew 1,ij (θil µlv)2 where v 1 = v 1 0 v 1 1 . The M-step is given by (αnew, θnew, µnew) = argmin α,1Tn1θ1n2=0,µ F(α, θ, µ; qnew 1 , qnew 2 ), (σ2)new = F(αnew, θnew, µnew; qnew 1 , qnew 2 ) + b 2n1n2 + n1k2 + n2k1 + a + 2 , F(α, θ, µ1, µ2; q1, q2) = y α1n11T n2 θ 2 F + ν α 2 v0 + 1 q1,ijq2,lh (θil µjh)2. Bayesian Model Selection with Graph Structured Sparsity 6. Reduced Isotonic Regression The models that we have discussed so far in our general framework all involve Gaussian likelihood functions and Gaussian priors. It is important to develop a natural extension of the framework to include non-Gaussian models. 
In this section, we discuss a reduced isotonic regression problem with a non-Gaussian prior distribution, while a full extension to non-Gaussian models will be considered as a future project. Given a vector of observation y Rn, the reduced isotonic regression seeks the best piecewise constant fit that is nondecreasing (Schell and Singh, 1997; Gao et al., 2018). It is an important model that has applications in problems with natural monotone constraint on the signal. With the likelihood y|α, θ, σ2 N(α1n + θ, σ2In), we need to specify a prior distribution on θ that induces both piecewise constant and isotonic structures. We propose the following prior distribution, θ | γ, σ2 p(θ | γ, σ2) i=1 exp (θi+1 θi)2 2σ2[v0γi + v1(1 γi)] I{θi θi+1}I{1T nθ = 0}. (64) We call (64) the spike-and-slab half-Gaussian distribution. Note that the support of the distribution is the intersection of the cone {θ : θ1 θ2 ... θn} and the subspace {θ : 1T nθ = 0}. The parameters v0 and v1 play similar roles as in (3), which model the closedness between θi and θi+1 depending on the value of γi. Proposition 8 For any γ {0, 1}n 1 and v0, v1 (0, ), the spike-and-slab half-Gaussian prior (64) is well defined on {θ : θ1 θ2 ... θn} {θ : 1T nθ = 0}, and its density function with respect to the Lebesgue measure restricted on the support is given by p(θ | γ, σ2) = 2n 1 1 (2πσ2)(n 1)/2 i=1 [v 1 0 γi + v 1 1 (1 γi)] 2σ2[v0γi + v1(1 γi)] I{θ1 θ2 ... θn}I{1T nθ = 0}. Note that the only place that Proposition 8 deviates from Proposition 1 is the extra factor 2n 1 due to the isotonic constraint {θ : θ1 θ2 ... θn} and the symmetry of the density. We complete the model specification by put priors on α, γ, η, σ2 that are given by (2), (4), (5) and (6). Now we are ready to derive the EM algorithm. Since the base graph is a tree, the EM algorithm for reduced isotonic regression is exact. The E-step is given by qnew i = ηφ(θi θi 1; 0, σ2v0) ηφ(θi θi 1; 0, σ2v0) + (1 η)φ(θi θi 1; 0, σ2v1). The M-step is given by (αnew, θnew) = argmin α,θ1 θ2 ... θn,1Tn θ=0 F(α, θ; qnew), (65) Kim and Gao F(α, θ; q) = y α1n θ 2 + να2 + (θi θi 1)2, and the updates of σ2 and η are given by (17) with p = n. The M-step (65) can be solved by a very efficient optimization technique. Since y α1n θ 2 = ( y α)1n 2 + y y1n θ 2 by 1T nθ = 0, α and θ can be updated independently. It is easy to see that αnew = n n+ν y. The update of θ can be solved by SPAVA (Burdakov and Sysoev, 2017). Similar to the Gaussian case, the parameter v0 determines the complexity of the model. For each v0 between 0 and v1, we apply the EM algorithm above to calculate bq, and then let bγi = bγi(v0) = I{bqi 1/2} form a solution path. The best model will be selected from the EM-solution path by the limiting version of the posterior distribution as v0 0. Given a γ {0, 1}n 1, we write s = 1 + Pn 1 i=1 (1 γi) to be the number of pieces, and Zγ {0, 1}n s is the membership matrix defined in Section 3.3. As v0 0, a slight variation of Proposition 3 implies that θ that follows (64) weakly converges to Zγeθ, where eθ is distributed by p(eθ | γ, σ2) exp (eθl eθl+1)2 I{eθ1 eθ2 ... eθs}I{1T n Zγeθ = 0}. (66) The following proposition determines the normalizing constant of the above distribution. Proposition 9 The density function of (66) is given by p(eθ | γ, σ2) = 2s 1(2πσ2) (s 1)/2q det ZT γ 1n(ZTγ e LγZγ) (eθl eθl+1)2 I{eθ1 eθ2 ... eθs}I{1T n Zγeθ = 0}, (67) where Zγ and e Lγ are defined in Section 3.3. 
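To make the E-step and the M-step (65) above concrete, the sketch below (Python with NumPy/SciPy; not the authors' implementation, with illustrative hyperparameter choices) performs one EM sweep for the reduced isotonic regression model. The θ-subproblem is solved here with a generic constrained optimizer purely for illustration; the paper instead uses the much faster SPAVA algorithm of Burdakov and Sysoev (2017).

```python
# A minimal sketch (illustration only) of one EM sweep for reduced isotonic
# regression: E-step q_i, closed-form α, a smoothed isotonic θ-subproblem,
# and the closed-form σ² and η updates of (17) with p = n.
import numpy as np
from scipy.optimize import minimize

def em_sweep(y, theta, sigma2, eta, v0, v1, nu=1.0, a=1.0, b=1.0, A=1.0, B=1.0):
    n = len(y)
    # E-step: posterior probability that consecutive levels are fused.
    d = np.diff(theta)
    num = eta * np.exp(-d**2 / (2 * sigma2 * v0)) / np.sqrt(v0)
    q = num / (num + (1 - eta) * np.exp(-d**2 / (2 * sigma2 * v1)) / np.sqrt(v1))
    # M-step (65): α and θ decouple because 1ᵀθ = 0; α = n/(n+ν) ȳ in closed form.
    alpha = n / (n + nu) * y.mean()
    w = q / v0 + (1 - q) / v1
    obj = lambda t: np.sum((y - alpha - t)**2) + np.sum(w * np.diff(t)**2)
    cons = [{"type": "eq", "fun": lambda t: t.sum()},          # 1ᵀθ = 0
            {"type": "ineq", "fun": lambda t: np.diff(t)}]     # θ_1 ≤ ... ≤ θ_n
    theta = minimize(obj, theta, constraints=cons, method="SLSQP").x
    # Closed-form updates of σ² and η.
    F = np.sum((y - alpha - theta)**2) + nu * alpha**2 + np.sum(w * np.diff(theta)**2)
    sigma2 = (F + b) / (2 * n + a + 2)
    eta = (A - 1 + q.sum()) / (A + B + n - 3)
    return q, alpha, theta, sigma2, eta

rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0, 3.0], 20) + 0.3 * rng.normal(size=60)
theta0 = np.sort(y - y.mean())                   # feasible starting value
out = em_sweep(y, theta0, sigma2=0.1, eta=0.5, v0=1e-3, v1=10.0)
```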
Interestingly, compared with the formula (23), (67) has an extra 2s 1 due to the isotonic constraint {eθ1 ... eθs}. Following Section 3.3, we consider a reduced version of the likelihood y | α, eθ, γ, σ2 N(α1n + Zγeθ, σ2In). Then, with the prior distributions on α, eθ, γ, σ2 specified by (2), (67), (4), (5) and (6), we obtain the joint posterior distribution p(α, eθ, γ, σ2 | y). Ideally, we would like to integrate out α, eθ, σ2 and use p(γ | y) for model selection. However, the integration with respect to eθ is intractable due to the isotonic constraint. Therefore, we propose to maximize out α, eθ, σ2, and then the model selection score for reduced isotonic regression is given by g(γ) = max α,eθ1 ... eθs,1Tn Zγ eθ=0,σ2 log p(α, eθ, γ, σ2 | y). For each γ, the optimization involved in the evaluation of g(γ) can be done efficiently, which is very similar to the M-step updates. Bayesian Model Selection with Graph Structured Sparsity 7. Numerical Results In this section, we test the performance of the methods proposed in the paper and compare the accuracy in terms of sparse signal recovery and graphical structure estimation with existing methods. We name our method Bayes MSG (Bayesian Model Selection on Graphs) throughout the section. All simulation studies and real data applications were conduced on a standard laptop (2.6 GHz Intel Core i7 processor and 16GB memory) using R and Julia programming languages. Our Bayesian method outputs a subgraph defined by bγ = argmax {g(γ) : γ {bγ(v0)}0