# Bayesian Model Selection with Graph Structured Sparsity

Journal of Machine Learning Research 21 (2020) 1-61. Submitted 2/19; Revised 2/20; Published 6/20.

Youngseok Kim (youngseok@uchicago.edu) and Chao Gao (chaogao@galton.uchicago.edu), Department of Statistics, University of Chicago, Chicago, IL 60637, USA.

Editor: François Caron

©2020 Youngseok Kim and Chao Gao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-123.html.

Abstract

We propose a general algorithmic framework for Bayesian model selection. A spike-and-slab Laplacian prior is introduced to model the underlying structural assumption. Using the notion of effective resistance, we derive an EM-type algorithm with closed-form iterations to efficiently explore possible candidates for Bayesian model selection. The deterministic nature of the proposed algorithm makes it more scalable to large-scale and high-dimensional data sets compared with existing stochastic search algorithms. When applied to sparse linear regression, our framework recovers the EMVS algorithm (Ročková and George, 2014) as a special case. We also discuss extensions of our framework using tools from graph algebra to incorporate complex Bayesian models such as biclustering and submatrix localization. Extensive simulation studies and real data applications are conducted to demonstrate the superior performance of our methods over frequentist competitors such as ℓ0 or ℓ1 penalization.

Keywords: spike-and-slab prior, graph Laplacian, variational inference, expectation-maximization, sparse linear regression, biclustering

1. Introduction

Bayesian model selection has been an important area of research for several decades. While the general goal is to estimate the most plausible sub-model from the posterior distribution (Barry and Hartigan, 1993; Diebolt and Robert, 1994; Richardson and Green, 1997; Bottolo and Richardson, 2010) for a wide class of learning tasks, most developments in Bayesian model selection have focused on variable selection in the setting of sparse linear regression (Hans et al., 2007; Li and Zhang, 2010; Ghosh and Clyde, 2011; Ročková and George, 2014; Wang et al., 2018). One of the main challenges of Bayesian model selection is computational efficiency. Recently, Ročková and George (2014) discovered that Bayesian variable selection in sparse linear regression can be solved by an EM algorithm (Dempster et al., 1977; Neal and Hinton, 1998) with a closed-form update at each iteration. Compared with previous stochastic search algorithms such as Gibbs sampling (George and McCulloch, 1993, 1997), this deterministic alternative greatly speeds up computation for large-scale and high-dimensional data sets.

The main thrust of this paper is to develop a general framework of Bayesian models that includes sparse linear regression, change-point detection, clustering and many other models as special cases. We derive a general EM-type algorithm that efficiently explores possible candidates for Bayesian model selection. When applied to sparse linear regression, our model and algorithmic frameworks naturally recover the proposal of Ročková and George (2014). The general framework proposed in this paper can be viewed as an algorithmic counterpart of the theoretical framework for Bayesian high-dimensional structured linear models in Gao et al. (2015). While the work of Gao et al.
(2015) focuses on optimal posterior contraction rates and oracle inequalities, the current paper pursues a general, efficient and scalable computational strategy.

In order to study various Bayesian models from a unified perspective, we introduce a spike-and-slab Laplacian prior distribution on the model parameters. The new prior distribution is an extension of the classical spike-and-slab prior (Mitchell and Beauchamp, 1988; George and McCulloch, 1993, 1997) for Bayesian variable selection. Our new definition incorporates the graph Laplacian of the underlying graph representing the model structure, which gives the prior its name. Under this general framework, the problem of Bayesian model selection can be recast as selecting a subgraph of some base graph determined by the statistical task. Here, the base graph and its subgraphs represent the structures of the full model and the corresponding sub-models, respectively. Various choices of base graphs lead to specific statistical estimation problems such as sparse linear regression, clustering and change-point detection. In addition, the connection to graph algebra further allows us to build prior distributions for even more complicated models. For example, using graph products such as the Cartesian product or the Kronecker product (Imrich and Klavžar, 2000; Leskovec et al., 2010), we can construct prior distributions for biclustering models from the Laplacian of the graph products of row and column clustering structures. This leads to great flexibility in analyzing real data sets with complex structures.

Our Bayesian model selection follows the procedure of Ročková and George (2014) that evaluates the posterior probabilities of sub-models computed from the solution path of the EM algorithm. However, the derivation of the EM algorithm under our general framework is a nontrivial task. When the underlying base graph of the model structure is a tree, the derivation of the EM algorithm is straightforward by following the arguments in Ročková and George (2014). On the other hand, for a general base graph that is not a tree, the arguments in Ročková and George (2014) do not apply. To overcome this difficulty, we introduce a relaxation through the concept of effective resistance (Lovász, 1993; Ghosh et al., 2008; Spielman, 2007) that adapts to the underlying graphical structure of the model. The lower bound given by this relaxation is then used to derive a variational EM algorithm that works under the general framework.

Model selection with graph structured sparsity has also been studied in the frequentist literature. For example, the generalized Lasso (Tibshirani and Taylor, 2011; Arnold and Tibshirani, 2014) and its multivariate version, the network Lasso (Hallac et al., 2015), encode graph structured sparsity with ℓ1 regularization. Algorithms based on ℓ0 regularization have also been investigated recently (Fan and Guan, 2018; Xu and Fan, 2019). Compared with these frequentist methods, our proposed Bayesian model selection procedure tends to achieve better model selection performance in terms of false discovery proportion and power in a wide range of model scenarios, which will be shown through an extensive numerical study under various settings.

The rest of the paper is organized as follows. In Section 2, we introduce the general framework of Bayesian models and discuss the spike-and-slab Laplacian prior.
The EM Bayesian Model Selection with Graph Structured Sparsity algorithm will be derived in Section 3 for both the case of trees and general base graphs. In Section 4, we discuss how to incorporate latent variables and propose a new Bayesian clustering models under our framework. Section 5 introduces the techniques of graph products and several important extensions of our framework. We will also discuss a non Gaussian spike-and-slab Laplacian prior in Section 6 with a natural application to reduced isotonic regression (Schell and Singh, 1997). Finally, extensive simulated and real data analysis will be presented in Section 7. 2. A General Framework of Bayesian Models In this section, we describe a general framework for building Bayesian structured models on graphs. To be specific, the prior structural assumption on the parameter θ Rp will be encoded by a graph. Throughout the paper, G = (V, E) is an undirected graph with V = [p] and some E {(i, j) : 1 i < j p}. It is referred to as the base graph of the model, and our goal is to learn a sparse subgraph of G from the data. We use p = |V | and m = |E| for the node size and edge size of the base graph. 2.1. Model Description We start with the Gaussian linear model y | β, σ2 N(Xβ, σ2In) that models an ndimensional observation. The design matrix X Rn p is determined by the context of the problem. Given some nonzero vector w Rp, the Euclidean space Rp can be decomposed as a direct sum of the one-dimensional subspace spanned by w and its orthogonal complement. In other words, we can write β = 1 w 2 ww T β + Ip 1 w 2 ww T β. The structural assumption will be imposed by a prior on the second term above. To simplify the notation, we introduce the space Θw = θ Rp : w T θ = 0 . Then, any β Rp can be decomposed as β = αw + θ for some α R and θ Θw. The likelihood is thus given by y | α, θ, σ2 N(X(αw + θ), σ2In). (1) The prior distribution on the vector αw + θ will be specified by independent priors on α and θ. They are given by α | σ2 N(0, σ2/ν), (2) θ | γ, σ2 p(θ | γ, σ2) Y (i,j) E exp (θi θj)2 2σ2[v0γij + v1(1 γij)] I{θ Θw}. (3) Under the prior distribution, α is centered at 0 and has precision ν/σ2. The parameter θ is modeled by a prior distribution on Θw that encodes a pairwise relation between θi and θj. Here, v0 is a very small scalar and v1 is a very large scalar. For a pair (i, j) E in the base graph, the prior enforces the closedness between θi and θj when γij = 1. Our goal is then to learn the most probable subgraph structure encoded by {γij}, which will be estimated from the posterior distribution. Kim and Gao We finish the Bayesian modeling by putting priors on γ and σ2. They are given by γ | η p(γ | η) Y (i,j) E ηγij(1 η)1 γij I{γ Γ}, (4) η Beta(A, B), (5) σ2 Inv Gamma(a/2, b/2). (6) Besides the standard conjugate priors on η and σ2, the independent Bernoulli prior on γ is restricted on a set Γ {0, 1}m. This restriction is sometimes useful for particular models, but for now we assume that Γ = {0, 1}m until it is needed in Section 4. The Bayesian model is now fully specified. The joint distribution is p(y, α, θ, γ, η, σ2) = p(y | α, θ, σ2)p(α | σ2)p(θ | γ, σ2)p(γ | η)p(η)p(σ2). (7) Among these distributions, the most important one is p(θ|γ, σ2). To understand its properties, we introduce the incidence matrix D Rm p for the base graph G = (V, E). The matrix D has entries Dei = 1 and Dej = 1 if e = (i, j), and Dek = 0 if k = i, j. We note that the definition of D depends on the order of edges {(i, j)} even if G is an undirected graph. 
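The incidence matrix is straightforward to construct in code. The sketch below (Python with NumPy; not part of the paper, with a made-up toy chain graph and illustrative values of v0, v1 and γ) builds D and checks that, although D itself depends on the chosen edge orientation, the weighted quadratic form D^T diag(v0^{-1}γ + v1^{-1}(1-γ)) D introduced next does not.

```python
# A minimal sketch (not the authors' code) of the incidence matrix D for a
# small base graph, illustrating that the sign pattern of D depends on the
# chosen edge orientation while the weighted Laplacian D^T diag(w) D does not.
import numpy as np

def incidence_matrix(p, edges):
    """edges: list of ordered pairs (i, j); row e of D is e_i - e_j."""
    D = np.zeros((len(edges), p))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = 1.0, -1.0
    return D

# Chain base graph on p = 4 nodes (the change-point example of Section 2.2).
edges = [(0, 1), (1, 2), (2, 3)]
D = incidence_matrix(4, edges)

# Flip the orientation of one edge: D changes, but D^T diag(w) D does not.
D_flip = incidence_matrix(4, [(1, 0), (1, 2), (2, 3)])
v0, v1 = 0.01, 10.0                              # illustrative values only
gamma = np.array([1.0, 0.0, 1.0])                # hypothetical edge indicators
w = gamma / v0 + (1.0 - gamma) / v1              # weights v0^{-1}γ + v1^{-1}(1-γ)
L = D.T @ np.diag(w) @ D
L_flip = D_flip.T @ np.diag(w) @ D_flip
assert np.allclose(L, L_flip)                    # orientation does not matter
```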
However, this does not affect any application that we will need in the paper. We then define the Laplacian matrix Lγ = DT diag v 1 0 γ + v 1 1 (1 γ) D. It is easy to see that Lγ is the graph Laplacian of the weighted graph with adjacency matrix {v 1 0 γij + v 1 1 (1 γij)}. Thus, we can write (3) as p(θ | γ, σ2) exp 1 2σ2 θT Lγθ I{θ Θw}. (8) Given its form, we name (8) the spike-and-slab Laplacian prior. Proposition 1 Suppose G = (V, E) is a connected base graph. For any γ {0, 1}m and v0, v1 (0, ), the graph Laplacian Lγ is positive semi-definite and has rank p 1. The only eigenvector corresponding to its zero eigenvalue is proportional to 1p, the vector with all entries 1. As a consequence, as long as 1T p w = 0, the spike-and-slab Laplacian prior is a non-degenerate distribution on Θw. Its density function with respect to the Lebesgue measure restricted to Θw is p(θ | γ, σ2) = 1 (2πσ2)(p 1)/2 detw(Lγ) exp 1 2σ2 θT Lγθ I{θ Θw}, where detw(Lγ) is the product of all nonzero eigenvalues of the positive semi-definite matrix Ip 1 w 2 ww T Lγ Ip 1 w 2 ww T . The proposition reveals two important conditions that lead to the well-definedness of the spike-and-slab Laplacian prior: the connectedness of the base graph G = (V, E) and 1T p w = 0. Without either condition, the distribution would be degenerate on Θw. Extensions to a base graph that is not necessarily connected is possible. We leave this task to Section 4 and Section 5, where tools from graph algebra are introduced. Bayesian Model Selection with Graph Structured Sparsity 2.2. Examples The Bayesian model (7) provides a very general framework. By choosing a different base graph G = (V, E), a design matrix X, a grounding vector w Rp and a precision parameter ν, we then obtain a different model. Several important examples are given below. Example 1 (Sparse linear regression) The sparse linear regression model y | θ, σ2 N(Xθ, σ2In) is a special case of (1). To put it into the general framework, we can expand the design matrix X Rn p and the regression vector θ Rp by [0n, X] Rn (p+1) and [θ0; θ] Rp+1. With the grounding vector w = [1; 0p], the sparse linear regression model can be recovered from (1). For the prior distribution, the base graph G consists of nodes V = {0, 1, ..., p} and edges {(0, i) : i [p]}. We set ν = , so that θ0 = 0 with prior probability one. Then, (3) is reduced to θ | γ, σ2 p(θ | γ, σ2) i=1 exp θ2 i 2σ2[v0γ0i + v1(1 γ0i)] That is, θi|γ, σ2 N(0, σ2[v0γ0i+v1(1 γ0i)]) independently for all i [n]. This is recognized as the spike-and-slab Gaussian prior for Bayesian sparse linear regression considered by George and Mc Culloch (1993, 1997); Roˇckov a and George (2014). Example 2 (Change-point detection) Set n = p, X = In, and w = 1n. We then have yi | θi, σ2 N(α + θi, σ2) independently for all i [n] from (1). For the prior distribution on α and θ, we consider ν = 0 and a one-dimensional chain graph G = (V, E) with E = {(i, i + 1) : i [n 1]}. This leads to a flat prior on α, and the prior on θ is given by θ | γ, σ2 p(θ | γ, σ2) i=1 exp (θi θi+1)2 2σ2[v0γi,i+1 + v1(1 γi,i+1)] I{1T p θ = 0}. A more general change-point model on a tree can also be obtained by constructing a tree base graph G. Example 3 (Two-dimensional image denoising) Consider a rectangular set of observations y Rn1 n2. With the same construction in Example 2 applied to sec(y), we obtain yij | θij, σ2 N(α + θij, σ2) independently for all (i, j) [n1] [n2] from (1). To model images, we consider a prior distribution that imposes closedness to nearby pixels. 
Consider ν = 0 and a base graph G = (V, E) shown in the picture below. Kim and Gao We then obtain a flat prior on α, and θ | γ, σ2 p(θ | γ, σ2) Y (ik,jl) E exp (θik θjl)2 2σ2[v0γik,jl + v1(1 γik,jl)] I{1T n1θ1n2 = 0}. Note that G is not a tree in this case. 3. EM Algorithm In this section, we will develop efficient EM algorithms for the general model. It turns out that the bottleneck is the computation of detw(Lγ) given some γ {0, 1}m. Lemma 2 Let spt(G) be the set of all spanning trees of G. Then detw(Lγ) = (1T p w)2 v 1 0 γij + v 1 1 (1 γij) . In particular, if G is a tree, then detw(Lγ) = (1T p w)2 w 2 Q (i,j) E v 1 0 γij + v 1 1 (1 γij) . The lemma suggests that the hardness of computing detw(Lγ) depends on the number of spanning trees of the base graph G. When the base graph is a tree, detw(Lγ) is factorized over the edges of the tree, which greatly simplifies the derivation of the algorithm. We will derive a closed-form EM algorithm in Section 3.1 when G is a tree, and the algorithm for a general G will be given in Section 3.2. 3.1. The Case of Trees We treat γ as latent. Our goal is to maximize the marginal distribution after integrating out the latent variables. That is, max α,θ Θw,η,σ2 log X γ p(y, α, θ, γ, η, σ2), (9) where p(y, α, θ, γ, η, σ2) is given by (7). Since the summation over γ is intractable, we consider an equivalent form of (9), which is max q max α,θ Θw,η,σ2 X γ q(γ) log p(y, α, θ, γ, η, σ2) q(γ) . (10) Then, the EM algorithm is equivalent to iteratively updating q, α, θ Θw, η, σ2 (Neal and Hinton, 1998). Now we illustrate the EM algorithm that solves (10). The E-step is to update q(γ) given the previous values of θ, η, σ. In view of (7), we have qnew(γ) p(y, α, θ, γ, η, σ2) p(θ | γ, σ2)p(γ | η). (11) Bayesian Model Selection with Graph Structured Sparsity According to (2), p(θ | γ, σ2) can be factorized when the base graph G = (V, E) is a tree. Therefore, with a simpler notation qij = q(γij = 1), we can write the update for q as qnew(γ) = Q (i,j) E(qnew ij )γij(1 qnew ij )1 γij, where qnew ij = ηφ(θi θj; 0, σ2v0) ηφ(θi θj; 0, σ2v0) + (1 η)φ(θi θj; 0, σ2v1). (12) Here, φ( ; µ, σ2) stands for the density function of N(µ, σ2). To derive the M-step, we introduce the following function F(α, θ; q) = y X(αw + θ) 2 + να2 + θT Lqθ, (13) where Lq is obtained by replacing γ with q in the definition of the graph Laplacian Lγ. The M-step consists of the following three updates, (αnew, θnew) = argmin α,θ Θw F(α, θ; qnew), (14) (σ2)new = argmin σ2 F(αnew, θnew; qnew) + b 2σ2 + p + n + a + 2 2 log(σ2) , (15) ηnew = argmax η [(A 1 + qnew sum) log η + (B 1 + p 1 qnew sum) log(1 η)] ,(16) where the notation qnew sum stands for P (i,j) E qnew ij . While (14) is a simple quadratic programming, (15) and (16) have closed forms, which are given by (σ2)new = F(αnew, θnew; qnew) + b p + n + a + 2 and ηnew = A 1 + qnew sum A + B + p 3. (17) We remark that the EMVS algorithm (Roˇckov a and George, 2014) is a special case for the sparse linear regression problem discussed in Example 1. When G is a tree, the spike-and-slab graph Laplacian prior (8) is proportional to the product of individual spike-and-slab priors p(θ | γ, σ2) Y (i,j) E exp (θi θj)2 2σ2[v0γij + v1(1 γij)] supported on Θw, as we have seen in Example 1 and 2. In this case, the above EM algorithm we have developed can also be extended to models with alternative prior distributions, such as the spike-and-slab Lasso prior (Roˇckov a and George, 2018) and the finite normal mixture prior (Stephens, 2016). 3.2. 
General Graphs When the base graph G is not a tree, the E-step becomes computationally infeasible due to the lack of separability of p(θ|γ, σ2) in γ. In fact, given the form of the density function in Proposition 1, the main problem lies in the term p detw(Lγ), which cannot be factorized over (i, j) E when the base graph G = (V, E) is not a tree (Lemma 2). To overcome the difficulty, we consider optimizing a lower bound of the objective function (10). This means we need to find a good lower bound for log detw(Lγ). Similar techniques are also advocated Kim and Gao in the context of learning exponential family graphical models (Wainwright and Jordan, 2008). By Lemma 2, we can write log detw(Lγ) = log X v 1 0 γij + v 1 1 (1 γij) + log (1T p w)2 We only need to lower bound the first term on the right hand side of the equation above, because the second term is independent of γ. By Jensen s inequality, for any non-negative sequence {λ(T)}T spt(G) such that P T spt(G) λ(T) = 1, we have v 1 0 γij + v 1 1 (1 γij) T spt(G) λ(T) log Y v 1 0 γij + v 1 1 (1 γij) X T spt(G) λ(T) log λ(T) T spt(G) λ(T)I{(i, j) T} log v 1 0 γij + v 1 1 (1 γij) X T spt(G) λ(T) log λ(T). One of the most natural choices of the weights {λ(T)}T spt(G) is the uniform distribution λ(T) = 1 |spt(G)|. This leads to the following lower bound v 1 0 γij + v 1 1 (1 γij) (i,j) E rij log v 1 0 γij + v 1 1 (1 γij) + log |spt(G)|, (19) where rij = 1 |spt(G)| T spt(G) I{(i, j) T}. (20) The quantity rij defined in (20) is recognized as the effective resistance between the ith and the jth nodes (Lov asz, 1993; Ghosh et al., 2008). Given a graph, we can treat each edge as a resistor with resistance 1. Then, the effective resistance between the ith and the jth nodes is the resistance between i and j given by the whole graph. That is, if we treat the entire graph as a resistor. Let L be the (unweighted) Laplacian matrix of the base graph G = (V, E), and L+ its pseudo-inverse. Then, an equivalent definition of (20) is given by the formula rij = (ei ej)T L+(ei ej), where ej is the basis vector with the ith entry 1 and the remaining entries 0. Therefore, computation of the effective resistance can leverage fast Laplacian solvers in the literature (Spielman and Teng, 2004; Livne and Brandt, 2012). Some important examples of effective resistance are listed below: Bayesian Model Selection with Graph Structured Sparsity When G is the complete graph of size p, then rij = 2/p for all (i, j) E. When G is the complete bipartite graph of sizes p and k, then rij = p+k 1 pk for all (i, j) E. When G is a tree, then rij = 1 for all (i, j) E. When G is a two-dimensional grid graph of size n1 n2, then rij [0.5, 0.75] depending on how close the edge (i, j) is from its closest corner. When G is a lollipop graph, the conjunction of a linear chain with size p and a complete graph with size k, then rij = 1 or 2/k depending on whether the edge (i, j) belongs to the chain or the complete graph. By (19), we obtain the following lower bound for the objective function (10), max q max α,θ Θw,η,σ2 X γ q(γ) log p(y | α, θ, σ2)p(α | σ2)ep(θ | γ, σ2)p(γ | η)p(η)p(σ2) q(γ) , (21) where the formula of ep(θ | γ, σ2) is obtained by applying the lower bound (19) in the formula of p(θ | γ, σ2) in Proposition 1. Since ep(θ | γ, σ2) can be factorized over (i, j) E, the E-step is given by qnew(γ) = Q (i,j) E(qnew ij )γij(1 qnew ij )1 γij, where qnew ij = ηv rij/2 0 e (θi θj)2/2σ2v0 ηv rij/2 0 e (θi θj)2/2σ2v0 + (1 η)v rij/2 1 e (θi θj)2/2σ2v1 . 
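The effective resistances r_ij appearing as exponents in the E-step above can be computed directly from the Moore-Penrose pseudo-inverse of the unweighted base-graph Laplacian, via r_ij = (e_i - e_j)^T L^+ (e_i - e_j). The sketch below (Python with NumPy; not the authors' R/Julia implementation) does this for small graphs and checks two of the examples listed earlier; for large graphs one would call a fast Laplacian solver rather than form a dense pseudo-inverse.

```python
# A minimal sketch (illustration only) of effective resistance computation.
import numpy as np

def effective_resistances(p, edges):
    L = np.zeros((p, p))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    Lp = np.linalg.pinv(L)                       # L^+ for a small graph
    return {(i, j): Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for i, j in edges}

# Sanity checks against the examples listed above.
p = 5
complete = [(i, j) for i in range(p) for j in range(i + 1, p)]
assert np.allclose(list(effective_resistances(p, complete).values()), 2 / p)

chain = [(i, i + 1) for i in range(p - 1)]       # a tree: every r_ij equals 1
assert np.allclose(list(effective_resistances(p, chain).values()), 1.0)
```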
(22) Observe that the lower bound (19) is independent of α, θ, η, σ2, and thus the M-step remains the same as in the case of a tree base graph. The formulas are given by (14)-(16), except that (16) needs to be replaced by ηnew = A 1 + qnew sum A + B + m 2. The EM algorithm for a general base graph can be viewed as a natural extension of that of a tree base graph. When G = (V, E) is a tree, it is easy to see from the formula (20) that rij = 1 for all (i, j) E. In this case, the E-step (22) is reduced to (12), and the inequality (19) becomes an equality. 3.3. Bayesian Model Selection The output of the EM algorithm bq(γ) can be understood as an estimator of the posterior distribution p(γ|bα, bθ, bσ2, bη), where bα, bθ, bσ2, bη are obtained from the M-step. Then, we get a subgraph according to the thresholding rule bγij = I{bqij 1/2}. It can be understood as a model learned from the data. The sparsity of the model critically depends on the values of v0 and v1 in the spike-and-slab Laplacian prior. With a fixed large value of v1, we can obtain the solution path of bγ = bγ(v0) by varying v0 from 0 to v1. The question then is how to select the best model along the solution path of the EM algorithm. The strategy suggested by Roˇckov a and George (2014) is to calculate the posterior score p(γ|y) with respect to the Bayesian model of v0 = 0. While the meaning of p(γ|y) Kim and Gao corresponding to v0 = 0 is easily understood for the sparse linear regression setting in Roˇckov a and George (2014), it is less clear for a general base graph G = (V, E). In order to define a version of (7) for v0 = 0, we need to introduce the concept of edge contraction. Given a γ {0, 1}m, the graph corresponding to the adjacency matrix γ induces a partition of disconnected components {C1, ..., Cs} of [p]. In other words, {i, j} Cl for some l [s] if and only if there is some path between i and j in the graph γ. For notational convenience, we define a vector z [s]n so that zi = l if and only if i Cl. A membership matrix Zγ {0, 1}p s is defined with its (i, l)th entry being the indicator I{zi = l}. We let e G = (e V , e E) be a graph obtained from the base graph G = (V, E) after the operation of edge contraction. In other words, every node in e G is obtained by combining nodes in G according to the partition of {C1, ..., Cs}. To be specific, e V = [s], and (k, l) e E if and only if there exists some i Ck and some j Cl such that (i, j) E. Now we are ready to define a limiting version of (3) as v0 0. Let e Lγ = DT diag(v 1 1 (1 γ))D, which is the graph Laplacian of the weighted graph with adjacency matrix {v 1 1 (1 γij)}. Then, define p(eθ | γ, σ2) = 1 (2πσ2)(s 1)/2 det ZT γ w(ZTγ e LγZγ) exp eθT ZT γ e LγZγeθ I{eθ ΘZT γ w}. (23) With e G = (e V , e E) standing for the contracted base graph, the prior distribution (23) can also be written as p(eθ | γ, σ2) exp ωkl(eθk eθl)2 I{eθ ΘZT γ w}, (24) where ωkl = P (i,j) E I{z(i) = k, z(j) = l}, which means that the edges {(i, j)}z(i)=k,z(j)=l in the base graph G = (V, E) are contracted as a new edge (k, l) in e G = (e V , e E) with ωkl as the weight. Proposition 3 Suppose G = (V, E) is connected and 1T p w = 0. Let Zγ be the membership matrix defined as above. Then for any γ {0, 1}m, (23) is a well-defined density function on the (s 1)-dimensional subspace {eθ Rs : w T Zγeθ = 0}. Moreover, for an arbitrary design matrix X Rn p, the distribution of θ that follows (3) weakly converges to that of Zγeθ as v0 0. 
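The edge-contraction construction behind Proposition 3 is easy to carry out numerically: the connected components induced by γ give the membership matrix Z_γ, and the contracted Laplacian Z_γ^T L̃_γ Z_γ then follows by matrix multiplication. A minimal sketch (Python with NumPy/SciPy; not part of the paper, with a made-up toy graph):

```python
# A minimal sketch (not the authors' code) of edge contraction: from the edge
# indicators γ on the base graph, form the components C_1, ..., C_s, the
# membership matrix Z_γ, and Z_γ^T L̃_γ Z_γ with L̃_γ = D^T diag(v1^{-1}(1-γ)) D.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def contract(p, edges, gamma, v1):
    # Components of the subgraph that keeps only edges with γ_ij = 1.
    kept = [(i, j) for (i, j), g in zip(edges, gamma) if g == 1]
    rows = [i for i, _ in kept]; cols = [j for _, j in kept]
    adj = coo_matrix((np.ones(len(kept)), (rows, cols)), shape=(p, p))
    s, labels = connected_components(adj, directed=False)
    Z = np.zeros((p, s)); Z[np.arange(p), labels] = 1.0     # membership matrix
    # L̃_γ keeps only the "slab" edges (γ_ij = 0), each with weight 1/v1.
    L_tilde = np.zeros((p, p))
    for (i, j), g in zip(edges, gamma):
        w = (1 - g) / v1
        L_tilde[i, i] += w; L_tilde[j, j] += w
        L_tilde[i, j] -= w; L_tilde[j, i] -= w
    return Z, Z.T @ L_tilde @ Z, labels

# Toy chain graph on 4 nodes; contracting the first edge merges nodes 0 and 1.
edges = [(0, 1), (1, 2), (2, 3)]
Z, L_contracted, labels = contract(4, edges, gamma=[1, 0, 0], v1=10.0)
print(labels)        # e.g. [0 0 1 2]: three contracted nodes remain
```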
Motivated by Proposition 3, a limiting version of (7) for v0 = 0 is defined as follows, y | α, eθ, γ, σ2 N(X(αw + Zγeθ), σ2In). (25) Then, p(eθ | γ, σ2) is given by (23), and p(α|σ2), p(γ|η), p(η), p(σ2) are specified in (2) and (4)-(6). The posterior distribution of γ has the formula p(γ | y) Z Z Z Z p(y, α, eθ, γ, η, σ2) dα deθ dη dσ2 = Z p(σ2) Z p(α | σ2) Z p(y | α, eθ, γ, σ2)p(eθ | γ, σ2) deθ dα dσ2 Z p(γ | η)p(η)dη. Bayesian Model Selection with Graph Structured Sparsity A standard calculation using conjugacy gives det ZT γ w(ZT γ e LγZγ) det ZT γ w(ZTγ (XT X + e Lγ)Zγ) !1/2 ν ν + w T XT (In Rγ)Xw y T (In Rγ)y |w T XT (In Rγ)y|2 ν + w T XT (In Rγ)Xw + b n+a Beta P (i,j) E γij + A 1, P (i,j) E(1 γij) + B 1) Beta(A, B) , where Rγ = XZγ(ZT γ (XT X + e Lγ)Zγ) 1ZT γ XT . This defines the model selection score g(γ) = log p(γ | y) up to a universal additive constant. The Bayesian model selection procedure evaluates g(γ) on the solution path {bγ(v0)}0 0 models approximate sparsity in the sense that γij implies θi θj. Though v0 > 0 does not offer interpretation of exact sparsity, a nonzero v0 leads to efficient computation via the EM algorithm. That is, for a v0 > 0, we can maximize max q max θ γ q(γ) log p(y | θ)pv0(θ | γ)p(γ) 1. We have ignored other parameters such as α, η, σ2 in order to make the discussion below clear and concise. Kim and Gao which is the objective function of EM. Denote the output of the algorithm by qv0(γ) = Q ij qij,v0, we then obtain our model by bγij(v0) = I{qij,v0 > 0.5}. As we vary v0 on a grid from 0 to v1, we obtain a path of models {bγ(v0)}0 0 for all j [k]. In other words, none of the k clusters is empty. Proposition 4 Let the conditional distribution of θ, µ | γ, σ2 be specified by (33) with some non-degenerate γ. Then, the distribution of θ | γ, σ2 weakly converges to p(θ | γ, σ2) Y 1 i bk. This does not matter, because the model selection score (46) does not depend on the clustering labels. Finally, the bγ constructed according to the above procedure will be evaluated by g(bγ) defined by (46). In the toy example with four data points y = (4, 2, 2, 4)T , the model selection score is computed along the solution path. According to Figure 1, the model selection procedure suggests that a clustering structure with two clusters {4, 2} and { 2, 4} is the most plausible one. We also note that the curve of g(γ) has sharp phase transitions whenever the solution paths µ merge. 5. Extensions with Graph Algebra In many applications, it is useful to have a model that imposes both row and column structures on a high-dimensional matrix θ Rp1 p2. We list some important examples below. 1. Biclustering. In applications such as gene expression data analysis, one needs to cluster both samples and features. This task imposes a clustering structure for both rows and columns of the data matrix (Hartigan, 1972; Cheng and Church, 2000). 2. Block sparsity. In problems such as planted clique detection (Feige and Krauthgamer, 2000) and submatrix localization (Hajek et al., 2017), the matrix can be viewed as the sum of a noise background plus a submatrix of signals with unknown locations. Equivalently, it can be modeled by simultaneous row and column sparsity (Ma and Wu, 2015). 3. Sparse clustering. Suppose the data matrix exhibits a clustering structure for its rows and a sparsity structure for its columns, then we have a sparse clustering problem (Witten and Tibshirani, 2010). For this task, we need to select nonzero column features in order to accurately cluster the rows. 
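Before turning to these matrix-structured models, the machinery of Sections 3 and 4 can be summarized in a short worked example. The sketch below (Python with NumPy; not the authors' R/Julia implementation, and with purely illustrative hyperparameter and grid choices) runs the exact tree-case EM of Section 3.1 on the change-point model of Example 2 (chain base graph, w = 1_n, ν = 0) and traces the solution path {γ̂(v0)} on which the selection score g(γ) would then be evaluated.

```python
# A minimal sketch of the EM solution path for the change-point model.
import numpy as np

def norm_pdf(x, var):
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_chain(y, v0, v1, a=1.0, b=1.0, A=1.0, B=1.0, n_iter=100):
    n = len(y)
    alpha, theta = y.mean(), y - y.mean()
    sigma2, eta = y.var(), 0.5
    for _ in range(n_iter):
        # E-step (12): posterior inclusion probability of each chain edge.
        d = np.diff(theta)
        num = eta * norm_pdf(d, sigma2 * v0)
        q = num / (num + (1 - eta) * norm_pdf(d, sigma2 * v1))
        # M-step (14): with ν = 0 and 1ᵀθ = 0, α = ȳ and θ = (I + L_q)⁻¹(y - ȳ).
        w = q / v0 + (1 - q) / v1                 # edge weights of L_q
        L = np.diag(np.r_[w, 0] + np.r_[0, w])
        L -= np.diag(w, 1) + np.diag(w, -1)
        theta = np.linalg.solve(np.eye(n) + L, y - alpha)
        # M-step (15)-(17): closed-form updates of σ² and η (here p = n).
        F = np.sum((y - alpha - theta)**2) + theta @ L @ theta
        sigma2 = (F + b) / (2 * n + a + 2)
        eta = (A - 1 + q.sum()) / (A + B + n - 3)
    return q, theta, sigma2, eta

# Solution path: sweep v0 over a grid and threshold q at 1/2.
rng = np.random.default_rng(0)
y = np.r_[np.zeros(50), 3 * np.ones(50)] + rng.normal(size=100)
v1 = 100.0
for v0 in [1e-4, 1e-3, 1e-2, 1e-1]:
    q, theta, sigma2, eta = em_chain(y, v0, v1)
    gamma_hat = (q >= 0.5).astype(int)            # γ_ij = 0 marks a change point
    print(v0, np.where(gamma_hat == 0)[0])
```

Each model γ̂(v0) on the resulting path would then be scored by g(γ) as described in Section 3.3, and the maximizer reported as the selected model.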
For the problems listed above, the row and column structures can be modeled by graphs γ1 and γ2. Then, the structure of the matrix θ is induced by a notion of graph product of γ1 Kim and Gao and γ2. In this section, we introduce tools from graph algebra including Cartesian product and Kronecker product to build complex structure from simple components. We first introduce the likelihood of the problem. To cope with many useful models, we assume that the observation can be organized as a matrix y Rn1 n2. Then, the specific setting of a certain problem can be encoded by design matrices X1 Rn1 p1 and X2 Rn2 p2. The likelihood is defined by y | α, θ, σ2 N(X1(αw + θ)XT 2 , σ2In1 In2). (47) The matrix w Rp1 p2 is assumed to have rank one, and can be decomposed as w = w1w T 2 for some w1 Rp1 and w2 Rp2. The prior distribution of the scalar is simply given by α | σ2 N(0, σ2/ν). (48) We then need to build prior distributions of θ that is supported on Θw = {θ Rp1 p2 : Tr(wθT ) = 0} using Cartesian and Kronecker products. 5.1. Cartesian Product We start with the definition of the Cartesian product of two graphs. Definition 6 Given two graphs G1 = (V1, E1) and G2 = (V2, E2), their Cartesian product G = G1 G2 is defined with the vertex set V1 V2. Its edge set contains ((x1, x2), (y1, y2)) if and only if x1 = y1 and (x2, y2) E2 or (x1, y1suno) E1 and x2 = y2. According to the definition, it can be checked that for two graphs of sizes p1 and p2, the adjacency matrix, the Laplacian and the incidence matrix of the Cartesian product enjoy the relations A1 2 = A2 Ip1 + Ip2 A1, L1 2 = L2 Ip1 + Ip2 L1, D1 2 = [D2 Ip1; Ip2 D1]. Given graphs γ1 and γ2 that encode row and column structures of θ, we introduce the following prior distribution p(θ | γ1, γ2, σ2) Y (i,j) E1 exp θi θj 2 2σ2[v0γ1,ij + v1(1 γ1,ij)] (k,l) E2 exp θ k θ l 2 2σ2[v0γ2,kl + v1(1 γ2,kl)] Here, E1 and E2 are the base graphs of the row and column structures. According to its form, the prior distribution (49) models both pairwise relations of rows and those of columns based on γ1 and γ2, respectively. To better understand (49), we can write it in the following equivalent form, p(θ | γ1, γ2, σ2) exp 1 2σ2 vec(θ)T (Lγ2 Ip1 + Ip2 Lγ1) vec(θ) I{θ Θw}, (50) Bayesian Model Selection with Graph Structured Sparsity where Lγ1 Rp1 p1 and Lγ2 Rp2 p2 are Laplacian matrices of the weighted graphs {v0γ1,ij + v1(1 γ1,ij)} and {v0γ2,kl + v1(1 γ2,kl)}, respectively. Therefore, by Definition 6, p(θ | γ1, γ2, σ2) is a spike-and-slab Laplacian prior p(θ | γ, σ2) defined in (3) with γ = γ1 γ2, and the well-definedness is guaranteed by Proposition 1. To complete the Bayesian model, the distribution of (γ1, γ2, σ2) are specified by γ1, γ2 | η1, η2 Y (i,j) E1 ηγ1,ij 1 (1 η1)1 γ1,ij Y (i,j) E2 ηγ2,kl 2 (1 η2)1 γ2,kl, (51) η1, η2 Beta(A1, B1) O Beta(A2, B2), (52) σ2 Inv Gamma(a/2, b/2). (53) We remark that it is possible to constrain γ1 and γ2 in some subsets Γ1 and Γ2 like (4). This extra twist is useful for a biclustering model that will be discussed in Section 5.3. Note that in general the base graph G = G1 G2 is not a tree, and the derivation of the EM algorithm follows a similar argument in Section 3.2. Using the same argument in (19), we lower bound log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] by e E1 E2 re log[v 1 0 γe + v 1 1 (1 γe)] + log |spt(G)|. 
(54) E1 E2 = {((i, k), (j, k)) : (i, j) E1, k V2} {((i, k), (i, l)) : i V1, (k, l) E2} , and γ = γ1 γ2, we can write (54) as k=1 r(i,k),(j,k) log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] i=1 r(i,k),(i,l) log[v 1 0 γ2,kl + v 1 1 (1 γ2,kl)] (i,j) E1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] + X (k,l) E2 r2,kl log[v 1 0 γ2,kl + v 1 1 (1 γ2,kl)], k=1 r(i,k),(j,k) = 1 |spt(G)| T spt(G) I{((i, k), (j, k)) T}, and r2,kl is similarly defined. Kim and Gao Using the lower bound derived above, it is direct to derive the an EM algorithm, which consists of the following iterations, qnew 1,ij = η1v r1,ij/2 0 e θi θj 2/2σ2v0 η1v r1,ij/2 0 e θi θj 2/2σ2v0 + (1 η1)v r1,ij/2 1 e θi θj 2/2σ2v1 , (55) qnew 2,kl = η2v r2,kl/2 0 e θ k θ l 2/2σ2v0 η2v r2,kl/2 0 e θ k θ l 2/2σ2v0 + (1 η2)v r2,kl/2 1 e θ k θ l 2/2σ2v1 , (56) (αnew, θnew) = argmin α,θ Θw F(α, θ; qnew 1 , qnew 2 ), (57) (σ2)new = F(αnew, θnew; qnew 1 , qnew 2 ) + b n1n2 + p1p2 + a + 2 , ηnew 1 = A1 1 + P (i,j) E1 qnew 1,ij A1 + B1 2 + m1 , ηnew 2 = A2 1 + P (k,l) E2 qnew 2,ij A2 + B2 2 + m2 . (58) The definition of the function F(α, θ; q1, q2) is given by F(α, θ; q1, q2) = y X1(αw + θ)XT 2 2 F + να2 + vec(θ)T (Lq2 Ip1 + Ip2 Lq1) vec(θ) Though the E-steps (55) and (56) are straightforward, the M-step (57) is a quadratic programming of dimension p1p2, which may become the computational bottleneck of the EM algorithm when the size of the problem is large. We will introduce a Dykstra-like algorithm to solve (57) in Appendix E. The Cartesian product model is useful for simultaneous learning the row structure γ1 and the column structure and γ2 of the coefficient matrix θ. Note that when X1 = X, X2 = Id, and E2 = , the Cartesian product model becomes the multivariate regression model described in Section 4.1. In this case, the model only regularizes the row structure of θ. Another equally interesting example is obtained when X1 = X, X2 = Id, and E1 = . In this case, the model only regularizes the column structure of θ, and can be interpreted as multitask learning with task clustering. 5.2. Kronecker Product The Kronecker product of two graphs is defined below. Definition 7 Given two graphs G1 = (V1, E1) and G2 = (V2, E2), their Kronecker product G = G1 G2 is defined with the vertex set V1 V2. Its edge set contains ((x1, x2), (y1, y2)) if and only if (x1, y1) E1 and (x2, y2) E2. It is not hard to see that the adjacency matrix of two graphs has the formula A1 2 = A1 A2, which gives the name of Definition 7. The prior distribution of θ given row and column graphs γ1 and γ2 that we discuss in this subsection is p(θ | γ1, γ2, σ2) Y (k,l) E2 exp (θik θjl)2 2σ2[v0γ1,ijγ2,kl + v1(1 γ1,ijγ2,kl)] Bayesian Model Selection with Graph Structured Sparsity Again, E1 and E2 are the base graphs of the row and column structures. According to its form, the prior imposes a nearly block structure on θ based on the graphs γ1 and γ2. Moreover, p(θ | γ1, γ2, σ2) can be viewed as a spike-and-slab Laplacian prior p(θ | γ, σ2) defined in (3) with γ = γ1 γ2. The distribution of (γ1, γ2, σ2) follows the same specification in (51)-(53). To derive an EM algorithm, we follow the strategy in Section 3.2 and lower bound log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] by X (k,l) E2 r(i,k),(j,l) log[v 1 0 γ1,ijγ2,kl + v 1 1 (1 γ1,ijγ2,kl)]. (60) Unlike the Cartesian product, the Kronecker product structure has a lower bound (60) that is not separable with respect to γ1 and γ2. This makes the E-step combinatorial, and does not apply to a large-scale problem. 
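Both graph products can be assembled directly from the component adjacency and Laplacian matrices using the Kronecker-product identities stated after Definitions 6 and 7. A small numerical sketch (Python with NumPy; illustration only, using two toy chain graphs):

```python
# A minimal sketch (not from the paper) of forming the Cartesian and Kronecker
# product graphs from component adjacency matrices, using the identities
# A_{1□2} = A2 ⊗ I + I ⊗ A1 and A_{1⊗2} = A1 ⊗ A2.
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def chain_adjacency(p):
    A = np.zeros((p, p))
    idx = np.arange(p - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1
    return A

A1, A2 = chain_adjacency(3), chain_adjacency(4)          # two chain graphs
p1, p2 = 3, 4

# Cartesian product: a 3 x 4 grid graph.
A_cart = np.kron(A2, np.eye(p1)) + np.kron(np.eye(p2), A1)
L_cart = np.kron(laplacian(A2), np.eye(p1)) + np.kron(np.eye(p2), laplacian(A1))
assert np.allclose(laplacian(A_cart), L_cart)            # L_{1□2} identity

# Kronecker product: each edge pairs an E1-edge with an E2-edge.
A_kron = np.kron(A1, A2)
print(A_cart.sum() / 2, A_kron.sum() / 2)                # edge counts: 17 and 12
```

In particular, the Cartesian product of two chain graphs is exactly the two-dimensional grid base graph used for image denoising in Example 3.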
To alleviate this computational barrier, we consider a variational EM algorithm that finds the best posterior distribution of γ1, γ2 that can be factorized. In other words, instead of maximizing over all possible distribution q, we maximize over the mean-filed class q Q, with Q = {q(γ1, γ2) = q1(γ1)q2(γ2) : q1, q2}. Then, the objective becomes max q1,q2 max α,θ Θ2,δ,η,σ2 X γ1,γ2 q1(γ1)q2(γ2) log ep(y, α, θ, δ, γ1, γ2, η, σ2) q1(γ1)q2(γ2) , where ep(y, α, θ, δ, γ1, γ2, η, σ2) is obtained by replacing p(θ | γ1, γ2, σ2) with ep(θ | γ1, γ2, σ2) in the joint distribution p(y, α, θ, δ, γ1, γ2, η, σ2). Here, log ep(θ | γ1, γ2, σ2) is a lower bound for log p(θ | γ1, γ2, σ2) with (60). The E-step of the variational EM is qnew 1 (γ1) exp γ2 q2(γ2) log ep(y, α, θ, δ, γ1, γ2, η, σ2) qnew 2 (γ2) exp γ1 qnew 1 (γ1) log ep(y, α, θ, δ, γ1, γ2, η, σ2) After some simplification, we have qnew 1,ij = 1 + (1 η1) Q (k,l) E2 v r(i,k),(j,l)/2 1 e (θik θjl)2/2σ2v1 q2,kl η1 Q (k,l) E2 v r(i,k),(j,l)/2 0 e (θik θjl)2/2σ2v0 q2,kl qnew 2,kl = 1 + (1 η2) Q (i,j) E1 v r(i,k),(j,l)/2 1 e (θik θjl)2/2σ2v1 qnew 1,kl η2 Q (i,j) E1 v r(i,k),(j,l)/2 0 e (θik θjl)2/2σ2v0 qnew 1,kl The M-step can be derived in a standard way, and it has the same updates as in (57)-(58), with a new definition of F(α, θ; q1, q2) given by F(α, θ; q1, q2) = y X1(αw + θ)X2 2 + να2 v0 + 1 q1,ijq2,kl (θik θjl)2. Kim and Gao 5.3. Applications in Biclustering When both row and column graphs encode clustering structures discussed in Section 4.2, we have the biclustering model. In this section, we discuss both biclustering models induced by Kronecker and Cartesian products. We start with a special form of the likelihood (47), which is given by y | α, θ, σ2 N(α1n11T n2 + θ, σ2In1 In2), and the prior distribution on α is given by (48). The prior distribution on θ will be discussed in two cases. Cartesian product θil θil µ2,ih θi l θi l µ2,i h µ1,jl µ1,jl Kronecker product Figure 2: Structure diagrams for the two biclustering methods. The Cartesian product biclustering model (Left) and the Kronecker product biclustering model (Right) have different latent variables and base graphs. While the Cartesian product models the row and column clustering structures by separate latent variable matrices µ1 Rk1 n2 and µ2 Rn1 k2, the Kronecker product directly models the checkerboard structure by a single latent matrix µ Rk1 k2. 5.3.1. Cartesian product biclustering model Let k1 [n1] and k2 [n2] be upper bounds of the numbers of row and column clusters, respectively. We introduce two latent matrices µ1 Rk1 n2 and µ2 Rn1 k2 that serve as row and column clustering centers. The prior distribution is then specified by p(θ, µ1, µ2 | γ1, γ2, σ2) j=1 exp θi µ1,j 2 2σ2[v0γ1,ij + v1(1 γ1,ij)] h=1 exp θ l µ2, h 2 2σ2[v0γ2,lh + v1(1 γ2,lh)] I{1T n1θ1n2 = 0}, Bayesian Model Selection with Graph Structured Sparsity which can be regarded as an extension of (33) in the form of (49). The prior distributions on γ1 and γ2 are independently specified by (35) with (k, n) replaced by (k1, n1) and (k2, n2). Finally, σ2 follows the inverse Gamma prior (6). We follow the framework of Section 3.2. The derivation of the EM algorithm requires lower bounding log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)]. Using the same argument in Section 5.1, we have the following lower bound j=1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] + h=1 r2,lh log[v 1 0 γ2,lh + v 1 1 (1 γ2,lh)]. (63) By the symmetry of the complete bipartite graph, r1,ij is a constant that does not depend on (i, j). 
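The symmetry claim above can also be checked numerically: for a complete bipartite base graph with parts of sizes n and k, every edge has the same effective resistance (n + k - 1)/(nk), matching the list of examples in Section 3.2. A quick illustration (Python with NumPy; not part of the paper):

```python
# A quick numerical check (illustration only) that the effective resistance of
# the complete bipartite base graph K_{n,k} is constant across edges.
import numpy as np

n, k = 6, 3
p = n + k
edges = [(i, n + j) for i in range(n) for j in range(k)]
L = np.zeros((p, p))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1
Lp = np.linalg.pinv(L)
r = np.array([Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for i, j in edges])
assert np.allclose(r, (n + k - 1) / (n * k))     # same value for every edge
```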
Then use the same argument in (37)-(38), and we obtain the fact that Pn1 i=1 Pk1 j=1 r1,ij log[v 1 0 γ1,ij + v 1 1 (1 γ1,ij)] is independent of {γ1,ij}, and the same conclusion also applies to the second term of (63). Since the lower bound (63) does not dependent on γ1, γ2, the determinant factor in the density function p(θ, µ1, µ2 | γ1, γ2, σ2) does not play any role in the derivation of the EM algorithm. With some standard calculations, the E-step is given by qnew 1,ij = exp θi µ1,j 2 Pk1 u=1 exp θi µ1,u 2 2σ2 v , qnew 1,lh = exp θ l µ2, h 2 Pk2 v=1 exp θ l µ2, v 2 where v 1 = v 1 0 v 1 1 . The M-step is given by (αnew, θnew, µnew 1 , µnew 2 ) = argmin α,1Tn1θ1n2=0,µ1,µ2 F(α, θ, µ1, µ2; qnew 1 , qnew 2 ), (σ2)new = F(αnew, θnew, µnew 1 , µnew 2 ; qnew 1 , qnew 2 ) + b 2n1n2 + n1k2 + n2k1 + a + 2 , where F(α, θ, µ1, µ2; q1, q2) = y α1n11T n2 θ 2 F + ν α 2 v0 + 1 q1,ij v0 + 1 q2,lh θ l µ2, h 2. 5.3.2. Kronecker product biclustering model For the Kronecker product structure, we introduce a latent matrix µ Rk1 k2. Since the biclustering model implies a block-wise constant structure for θ. Each entry of µ serves as a center for a block of the matrix θ. The prior distribution is defined by p(θ, µ | γ1, γ2, σ2) h=1 exp (θil µjh)2 2σ2[v0γ1,ijγ2,lh + v1(1 γ1,ijγ2,lh)] I{1T n1θ1n2 = 0}. Kim and Gao The prior distribution is another extension of (33), and it is in a similar form of (59). To finish the Bayesian model specification, we consider the same priors for γ1, γ2, σ2 as in the Cartesian product case. Recall that the lower bound of log P T spt(G) P e T [v 1 0 γe + v 1 1 (1 γe)] is given by (60) for a general Kronecker product structure. In the current setting, a similar argument gives the lower bound h=1 r(i,l),(j,h) log[v 1 0 γ1,ijγ2,lh + v 1 1 (1 γ1,ijγ2,lh)]. Since r(i,l),(j,h) r is independent of (i, l), (j, h) by the symmetry of the complete bipartite graph, the above lower bound can be written as h=1 log[v 1 0 γ1,ijγ2,lh + v 1 1 (1 γ1,ijγ2,lh)] = r log(v 1 0 ) h=1 γ1,ijγ2,lh + r log(v 1 1 ) h=1 (1 γ1,ijγ2,lh) = rn1n2 log(v 1 0 ) + rn1n2(k1k2 1) log(v 1 1 ), which is independent of γ1, γ2. The inequality (5.3.2) is because both γ1 and γ2 satisfy (34). Again, the determinant factor in the density function p(θ, µ | γ1, γ2, σ2) does not play any role in the derivation of the EM algorithm, because the lower bound (5.3.2) does not depend on (γ1, γ2). Since we are working with the Kronecker product, we will derive a variational EM algorithm with the E-step finding the posterior distribution in the mean filed class Q = {q(γ1, γ2) = q1(γ1)q2(γ2) : q1, q2}. By following the same argument in Section 5.2, we obtain the E-step as qnew 1,ij = exp Pn2 l=1 Pk2 h=1 q2,lh(θil µjh)2 Pk1 u=1 exp Pn2 l=1 Pk2 h=1 q2,lh(θil µuh)2 qnew 2,lh = exp Pn1 i=1 Pk1 j=1 qnew 1,ij (θil µjh)2 Pk2 v=1 exp Pn1 i=1 Pk1 v=1 qnew 1,ij (θil µlv)2 where v 1 = v 1 0 v 1 1 . The M-step is given by (αnew, θnew, µnew) = argmin α,1Tn1θ1n2=0,µ F(α, θ, µ; qnew 1 , qnew 2 ), (σ2)new = F(αnew, θnew, µnew; qnew 1 , qnew 2 ) + b 2n1n2 + n1k2 + n2k1 + a + 2 , F(α, θ, µ1, µ2; q1, q2) = y α1n11T n2 θ 2 F + ν α 2 v0 + 1 q1,ijq2,lh (θil µjh)2. Bayesian Model Selection with Graph Structured Sparsity 6. Reduced Isotonic Regression The models that we have discussed so far in our general framework all involve Gaussian likelihood functions and Gaussian priors. It is important to develop a natural extension of the framework to include non-Gaussian models. 
In this section, we discuss a reduced isotonic regression problem with a non-Gaussian prior distribution, while a full extension to non-Gaussian models will be considered as a future project. Given a vector of observation y Rn, the reduced isotonic regression seeks the best piecewise constant fit that is nondecreasing (Schell and Singh, 1997; Gao et al., 2018). It is an important model that has applications in problems with natural monotone constraint on the signal. With the likelihood y|α, θ, σ2 N(α1n + θ, σ2In), we need to specify a prior distribution on θ that induces both piecewise constant and isotonic structures. We propose the following prior distribution, θ | γ, σ2 p(θ | γ, σ2) i=1 exp (θi+1 θi)2 2σ2[v0γi + v1(1 γi)] I{θi θi+1}I{1T nθ = 0}. (64) We call (64) the spike-and-slab half-Gaussian distribution. Note that the support of the distribution is the intersection of the cone {θ : θ1 θ2 ... θn} and the subspace {θ : 1T nθ = 0}. The parameters v0 and v1 play similar roles as in (3), which model the closedness between θi and θi+1 depending on the value of γi. Proposition 8 For any γ {0, 1}n 1 and v0, v1 (0, ), the spike-and-slab half-Gaussian prior (64) is well defined on {θ : θ1 θ2 ... θn} {θ : 1T nθ = 0}, and its density function with respect to the Lebesgue measure restricted on the support is given by p(θ | γ, σ2) = 2n 1 1 (2πσ2)(n 1)/2 i=1 [v 1 0 γi + v 1 1 (1 γi)] 2σ2[v0γi + v1(1 γi)] I{θ1 θ2 ... θn}I{1T nθ = 0}. Note that the only place that Proposition 8 deviates from Proposition 1 is the extra factor 2n 1 due to the isotonic constraint {θ : θ1 θ2 ... θn} and the symmetry of the density. We complete the model specification by put priors on α, γ, η, σ2 that are given by (2), (4), (5) and (6). Now we are ready to derive the EM algorithm. Since the base graph is a tree, the EM algorithm for reduced isotonic regression is exact. The E-step is given by qnew i = ηφ(θi θi 1; 0, σ2v0) ηφ(θi θi 1; 0, σ2v0) + (1 η)φ(θi θi 1; 0, σ2v1). The M-step is given by (αnew, θnew) = argmin α,θ1 θ2 ... θn,1Tn θ=0 F(α, θ; qnew), (65) Kim and Gao F(α, θ; q) = y α1n θ 2 + να2 + (θi θi 1)2, and the updates of σ2 and η are given by (17) with p = n. The M-step (65) can be solved by a very efficient optimization technique. Since y α1n θ 2 = ( y α)1n 2 + y y1n θ 2 by 1T nθ = 0, α and θ can be updated independently. It is easy to see that αnew = n n+ν y. The update of θ can be solved by SPAVA (Burdakov and Sysoev, 2017). Similar to the Gaussian case, the parameter v0 determines the complexity of the model. For each v0 between 0 and v1, we apply the EM algorithm above to calculate bq, and then let bγi = bγi(v0) = I{bqi 1/2} form a solution path. The best model will be selected from the EM-solution path by the limiting version of the posterior distribution as v0 0. Given a γ {0, 1}n 1, we write s = 1 + Pn 1 i=1 (1 γi) to be the number of pieces, and Zγ {0, 1}n s is the membership matrix defined in Section 3.3. As v0 0, a slight variation of Proposition 3 implies that θ that follows (64) weakly converges to Zγeθ, where eθ is distributed by p(eθ | γ, σ2) exp (eθl eθl+1)2 I{eθ1 eθ2 ... eθs}I{1T n Zγeθ = 0}. (66) The following proposition determines the normalizing constant of the above distribution. Proposition 9 The density function of (66) is given by p(eθ | γ, σ2) = 2s 1(2πσ2) (s 1)/2q det ZT γ 1n(ZTγ e LγZγ) (eθl eθl+1)2 I{eθ1 eθ2 ... eθs}I{1T n Zγeθ = 0}, (67) where Zγ and e Lγ are defined in Section 3.3. 
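To make the E-step and the M-step (65) above concrete, the sketch below (Python with NumPy/SciPy; not the authors' implementation, with illustrative hyperparameter choices) performs one EM sweep for the reduced isotonic regression model. The θ-subproblem is solved here with a generic constrained optimizer purely for illustration; the paper instead uses the much faster SPAVA algorithm of Burdakov and Sysoev (2017).

```python
# A minimal sketch (illustration only) of one EM sweep for reduced isotonic
# regression: E-step q_i, closed-form α, a smoothed isotonic θ-subproblem,
# and the closed-form σ² and η updates of (17) with p = n.
import numpy as np
from scipy.optimize import minimize

def em_sweep(y, theta, sigma2, eta, v0, v1, nu=1.0, a=1.0, b=1.0, A=1.0, B=1.0):
    n = len(y)
    # E-step: posterior probability that consecutive levels are fused.
    d = np.diff(theta)
    num = eta * np.exp(-d**2 / (2 * sigma2 * v0)) / np.sqrt(v0)
    q = num / (num + (1 - eta) * np.exp(-d**2 / (2 * sigma2 * v1)) / np.sqrt(v1))
    # M-step (65): α and θ decouple because 1ᵀθ = 0; α = n/(n+ν) ȳ in closed form.
    alpha = n / (n + nu) * y.mean()
    w = q / v0 + (1 - q) / v1
    obj = lambda t: np.sum((y - alpha - t)**2) + np.sum(w * np.diff(t)**2)
    cons = [{"type": "eq", "fun": lambda t: t.sum()},          # 1ᵀθ = 0
            {"type": "ineq", "fun": lambda t: np.diff(t)}]     # θ_1 ≤ ... ≤ θ_n
    theta = minimize(obj, theta, constraints=cons, method="SLSQP").x
    # Closed-form updates of σ² and η.
    F = np.sum((y - alpha - theta)**2) + nu * alpha**2 + np.sum(w * np.diff(theta)**2)
    sigma2 = (F + b) / (2 * n + a + 2)
    eta = (A - 1 + q.sum()) / (A + B + n - 3)
    return q, alpha, theta, sigma2, eta

rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0, 3.0], 20) + 0.3 * rng.normal(size=60)
theta0 = np.sort(y - y.mean())                   # feasible starting value
out = em_sweep(y, theta0, sigma2=0.1, eta=0.5, v0=1e-3, v1=10.0)
```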
Interestingly, compared with the formula (23), (67) has an extra 2s 1 due to the isotonic constraint {eθ1 ... eθs}. Following Section 3.3, we consider a reduced version of the likelihood y | α, eθ, γ, σ2 N(α1n + Zγeθ, σ2In). Then, with the prior distributions on α, eθ, γ, σ2 specified by (2), (67), (4), (5) and (6), we obtain the joint posterior distribution p(α, eθ, γ, σ2 | y). Ideally, we would like to integrate out α, eθ, σ2 and use p(γ | y) for model selection. However, the integration with respect to eθ is intractable due to the isotonic constraint. Therefore, we propose to maximize out α, eθ, σ2, and then the model selection score for reduced isotonic regression is given by g(γ) = max α,eθ1 ... eθs,1Tn Zγ eθ=0,σ2 log p(α, eθ, γ, σ2 | y). For each γ, the optimization involved in the evaluation of g(γ) can be done efficiently, which is very similar to the M-step updates. Bayesian Model Selection with Graph Structured Sparsity 7. Numerical Results In this section, we test the performance of the methods proposed in the paper and compare the accuracy in terms of sparse signal recovery and graphical structure estimation with existing methods. We name our method Bayes MSG (Bayesian Model Selection on Graphs) throughout the section. All simulation studies and real data applications were conduced on a standard laptop (2.6 GHz Intel Core i7 processor and 16GB memory) using R and Julia programming languages. Our Bayesian method outputs a subgraph defined by bγ = argmax {g(γ) : γ {bγ(v0)}0