# quantile_graphical_models_a_bayesian_approach__0701fc95.pdf

Journal of Machine Learning Research 21 (2020) 1-47 Submitted 4/17; Revised 10/19; Published 1/20

Quantile Graphical Models: a Bayesian Approach

Nilabja Guha nilabja guha@uml.edu Department of Mathematical Sciences University of Massachusetts Lowell Lowell, MA 01854, USA

Veera Baladandayuthapani veerab@umich.edu Department of Biostatistics University of Michigan Ann Arbor, MI 48103, USA

Bani K. Mallick bmallick@stat.tamu.edu Department of Statistics Texas A & M University College Station, TX 77843, USA

Editor: David Blei

Graphical models are ubiquitous tools to describe the interdependence between variables measured simultaneously such as large-scale gene or protein expression data. Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices and they are generated under a multivariate normal joint distribution. However, they suﬀer from several shortcomings since they are based on Gaussian distribution assumptions. In this article, we propose a Bayesian quantile based approach for sparse estimation of graphs. We demonstrate that the resulting graph estimation is robust to outliers and applicable under general distributional assumptions. Furthermore, we develop eﬃcient variational Bayes approximations to scale the methods for large data sets. Our methods are applied to a novel cancer proteomics data dataset where-in multiple proteomic antibodies are simultaneously assessed on tumor samples using reverse-phase protein arrays (RPPA) technology. Key-words : Graphical model, Quantile regression, Variational Bayes

1. Introduction

Probabilistic graphical models are the basic tools to represent dependence structures among multiple variables. They provide a simple way to visualize the structure of a probabilistic model as well as provide insights into the properties of the model, including conditional independence structures. A graph comprises with vertices (nodes) connected by edges (links or arcs). In a probabilistic graphical model, each vertex represents a random variable (single or vector) and the edges express probabilistic relationship between these variables. The graph deﬁnes the way the joint distribution over all the random variables can be decomposed into a product of factors contacting subset of the variables. There are two types of probabilistic graphical models: (1) Undirected graphical models where the edges do not carry the directional information (Sch afer and Strimmer (2005); Dobra et al. (2004); Yuan and

c 2020 Nilabja Guha, Veera Baladandayuthapani, Bani K. Mallick .

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/17-231.html.

Guha, Baladandayuthapani, Mallick

Lin (2007)); (2) The other major class of graphical models is the directed graphical models (DAG) or Bayesian networks where the edges of the graphs have a particular directionality which expresses causal relationships between random variables (Friedman (2004); Segal et al. (2003); Mallick et al. (2009)). In this paper, we focus on the undirected graphical models. One popular tool of undirected graphical models is Gaussian Graphical Models (GGM) which assume that the stochastic variables follow a multivariate normal distribution with a particular structure of the inverse of the covariance matrix, called the precision or the concentration matrix. This precision matrix of the multivariate normal distribution has the interpretation of the conditional dependence. Compared with the marginal dependence, this conditional dependence can capture the direct link between two variables when all other variables are conditioned on. Furthermore, it is usually assumed that one of the variables can be predicted by those of a small subset of other variables. This assumption leads to sparsity (many zeros) in the precision matrix and reduces the problem to well known covariance selection problems (Dempster (1972); Wong et al. (2003)). Sparse estimation of precision matrix, thus plays a center role in Gaussian graphical model estimation problem (Friedman et al. (2008)). There has been an intense development of Bayesian graphical model literature over the past decades but mainly in a Gaussian graphical model setup. In a Bayesian setup, this joint modeling is done by hierarchically specifying priors on inverse covariance matrix (or precision matrix) using global priors on the space of positive-deﬁnite matrices. This prior speciﬁcation is done through inverse Wishart priors or hyper-inverse Wishart priors (Lauritzen (1996)). Wishart priors show conjugate formulation and exact marginal likelihoods can be computed (Scott and Carvalho (2008)) but overall inﬂexible due to its restrictive forms. In the space of decomposable graph the marginal likelihood are available upto normalizing constants (Giudici (1996); Roverato (2000)). The marginal likelihoods are used to calculate the posterior probability of each graph, resulting an exact solution for smaller dimension but for a moderately large P(number of nodes) or outside such restrictive class the computation may be prohibitively expensive. For non decomposable graph the computation is non trivial and maybe prohibitive using reversible jump MCMC (Giudici and Green (1999); Brooks et al. (2003)). A novel Monte Carlo technique can be found in Atay Kayis and Massam (2005). There have been approaches by shrinking the covariance matrix using matrix factorization. For example, factorization of covariance matrix in terms of standard deviation and correlation (Barnard et al. (2000)), decomposition of correlation matrix (Liechty et al. (2004)) explore such technique. Writing the inverse covariance matrix as the product of inverse partial variance and the matrix of partial correlations, Wong et al. (2003) used reversible-jump-based Markov chain Monte Carlo (MCMC) algorithms to identify the zeros among the oﬀ-diagonal elements. An equivalent formulation of GGM is via neighborhood selection through the conditional mean under normality assumption (Peng et al. (2009)). The method is based on the conditional distribution of each variable, conditioning on all other variables. In a GGM framework, this conditional distribution is a normal distribution with the conditional mean function linearly related to the other variables. Furthermore, the conditional independence relationship among variables can be inferred by the variable selection techniques of the regression coeﬃcients of the conditional mean function (Meinshausen and B uhlmann (2006)).

Quantile Graphical Models: a Bayesian Approach

Figure 1: Left panel shows the true graph and right panel shows GGM ﬁt in a typical case.

More speciﬁcally, if a speciﬁc regression coeﬃcient appeared to be zero, the corresponding variables are conditionally independent. Of course, the joint distribution approach and the conditional approach based on linear regressions are essentially equivalent.

Due to ease of computation and the presence of a nice interpretation, the vast majority of works on graphical model selection have been centered around the multivariate Gaussian distribution. In a multivariate Gaussian setup the conditional mean conveys necessary and suﬃcient information to infer the conditional independence structure. In contrast, for other distributions, this may not be true. For instance, for the multivariate t-distribution, the conditional independence can not be captured only using the conditional mean as it also depends on the conditional variance which is a nonlinear function of other variables (Kotz and Nadarajah (2004)). For more complex distributions, the conditional independence structure may depend nonlinearly on higher order moments of the conditional distribution. Hence, the inference of a graph can be signiﬁcantly aﬀected by deviations from the normality and can lead to a wrong graph. The following example, which we discuss in details in section 5 (Example 1 (A)), demonstrates the eﬀect of deviation from normality in a simple case. We assume the following structure for a graph with 30 variables/nodes X1, . . . , X30 with 400 observations from each variable. Here X11, . . . , X20 is generated from a heavy tailed distribution induced by a common scale parameter, and Xi, Xj for 10 < i, j 20 is connected in the network iﬀ|i j| < 2, given the scale parameter, and X1 . . . , X9 has some nonlinearity and non-normality and they form a subgraph G1 disjoint from G2 formed by X11, . . . , X20. We have X20, . . . , X29 independent of the rest and X30 is the function of the latent scale parameter. The ﬁtted and true graphs for X1, . . . , X29 given the scale parameter, are given in Figure 1 where index i denotes i th vertex corresponding to Xi, and it is clear that with deviation from Gaussianity we have a large number of falsely detected edges. This poses serious restriction in a variety of applications which contain non-Gaussian data as well as data with outliers. Liu et al. (2012) used Gaussian Copula model to allow ﬂexible marginal distributions. Alternatively, non-Gaussian distributions have been directly used for modeling the joint distribution to obtain the graph (Finegold and Drton (2011),Yang et al. (2016)).

Guha, Baladandayuthapani, Mallick

In this paper, we propose a novel Bayesian quantile based graphical model. The main intention is to model the conditional quantile functions (rather than the mean) in a regression setup. This is well known that the conditional quantile regression coeﬃcients can infer the conditional independence between variables. Under linearity of the conditional quantile regression function, conditional distribution of the kth variable is independent of the jth variable if the corresponding regression coeﬃcient of the quantile regression is zero for all quantiles. Hence by performing a neighborhood selection of these quantile regression coefﬁcients, we can explore the graphical structure. Thus, in our framework this neighborhood selection boils down to a variable selection problem in the quantile regression setup. A spike and slab prior formulation has been used for that purpose (George and Mc Culloch (1993)). Using Bayesian approach through spike and slab type prior, we can characterize the uncertainty regarding selected graph through the posterior distribution. A natural development would be to investigate the asymptotic property of the proposed estimated graph. We study the asymptotic behaviors of the graph when the dimension as well as the number of observations increases to inﬁnity. The posterior probability of a small Hellinger neighborhood around the true graph approaches to one, under conditions similar to Jiang (2007). Subsequently, we extend this proof of consistency under the assumption of model misspeciﬁcation, even under distribution with sub-exponential tail bound. The posterior distribution is not in an explicit form, hence we resort to simulation based MCMC method. However, carrying out MCMC in this complex setup could be computationally intensive. Therefore along with MCMC, we also propose a variational algorithm for the mean ﬁeld approximation of the posterior density (Beal et al. (2003); Wand et al. (2011); Neville et al. (2014)). The main contributions of our paper are: (1) development of robust graphical models based on quantiles in a Bayesian hierarchical modeling framework, (2) proving the consistency of those resultant graph estimates under truly speciﬁed as well as misspeciﬁed models and, (3) proposing the MCMC based posterior simulation technique as well as a fast computationally eﬃcient approximation of the posterior distribution. In next section, we formulate the neighborhood selection problem for a particular node and write down the corresponding likelihood and the posterior density. In section 3, we discuss the estimation consistency. Later in section 4, we discuss the posterior approximation in details and write down the network construction algorithm. In section 5, we discuss some of the examples and in section 6, we use the proposed method in establishing a protein network.

2. Methodology

An undirected graph G can be represented by the pair (V, E), where V represents the set of vertices and E = (i, j) represents the set of edges, for some i, j V . Two nodes, i and j, are called neighbors if (i, j) E. A graph is called complete, if all possible pair of nodes are neighbors, (i, j) E for every i, j V . C G, is called complete if it induces a complete subgraph. A Gaussian graphical model (GGM) uses a graphical structure to deﬁne a set of pairwise conditional independence relationships on a P-dimensional constant mean, normally distributed random vector x NP (µ, ΣG). Here ΣG denotes the dependence of the covariance matrix Σ on the graph G and this is the key diﬀerence of this class of

Quantile Graphical Models: a Bayesian Approach

models with the usual Gaussian models. Thus, if G = (V, E) is an undirected graph and if x = (xν)ν V is a random vector in R|V | that follows a multivariate normal distribution with mean vector µ and covariance matrix ΣG then the unknown covariance matrix ΣG in GGM is restricted by its Markov properties; given ΩG = ΣG 1, elements xi and xj of the vector x are conditionally independent, given their neighbors, iﬀωij = 0 where wij is the ijth element of ΩG. If G = (V, E) is an undirected graph describing the joint distribution of x, ωij = 0 for all pairs (i, j) E. Thus, the elements of the adjacency matrix of the graph G have a very speciﬁc interpretation, in the sense that they model conditional independence among the components of the multivariate normal. This way, the covariance matrix Σ (or the precision matrix Ω) depends on the graph G and this dependence is denoted as ΣG (ΩG). The equivalent results can be obtained by using the conditional regression setup where the conditional distribution of one variable Xk given all other variables is [Xk|X k] N(P j =k βkj Xj, σ2 k) where βkj = ωkj/ωkk, σ2 k = 1/ωkk and X k is the vector containing all Xs except the kth one. It is clear that the variable Xk is conditionally independent of Xl given all other variables iﬀthe corresponding conditional regression coeﬃcient βkl is 0. This result transforms the Gaussian graphical model problem to a variable selection (or neighborhood selection) problem in a conditional regression setup (Meinshausen and B uhlmann (2006)). If multivariate normality assumption on x does not hold, then the conditional mean does not characterize the dependence among the variables. Under general distribution, it can be helpful to study the full conditional distribution. The absence of an edge between kth and jth node implies that the conditional distribution of Xk given the rest Xk|X k, does not depend on j and vice versa. Any distribution is characterized by its quantiles. Therefore, we can look at the conditional quantile functions of Xk and check if it depends on Xj. Hence, the main idea is to model the quantiles of Xk and perform a variable selection over all quantiles. We use linear model for modeling the quantile functions and perform variable selection in the set up of quantile regression (Koenker and Bassett Jr (1978); Koenker (2004)). Thus, we generalize the concept of Gaussian graphical model in a quantile domain where we consider the conditional quantile regression of each of the node variable Xk given all others say X k for k = 1 P. See Belloni et al. (2016) for some very recent developments in quantile based graph estimation in frequentist set up. In a conditional linear quantile regression model if Xk(τ) is the τ th quantile of kth variable Xk then the conditional quantile of Xk given X k, that is Xk, k(τ), can be expressed as

Xk, k(τ) = βk,0(τ) + X

j =k βk,j(τ)Xj, j = 1, P. (1)

We summarize the above discussion in the following result.

Proposition 1 Under the assumption of linearity of the conditional quantile function of Xk, as in model (1), Xk is conditionally independent of Xj iﬀβk,j(τ) = 0, τ.

Therefore from Proposition 1, we obtain a similar framework as in the Gaussian graphical model problem. That way, we transform the quantile graphical modeling problem to a quantile regression problem.

Guha, Baladandayuthapani, Mallick

Furthermore, instead of looking at a single quantile such as median, considering a set of quantiles will be useful to address a more general dependence structure. To induce sparsity, it will be helpful to look at the coeﬃcients for a set of quantiles τs and assume that the condition βk,j(τs) = 0 for all s implies the conditional independence among the corresponding variables. Indeed, the sparse graphical model based on (1) addresses more general cases than just modeling the conditional mean. In practice instead of the continuum, grid points 0 < τ1 < < τm < 1 are used for the selection process. In many practical scenarios, conditional quantiles may not be linear over all quantiles and over all the variables. In that case, we consider ˆL(Xk, k(τ)) = ˆβk,0(τ)+P j =k ˆβk,j(τ)Xj, the best linear approximation that minimizes the expected quantile loss function E[ρτ(Xk L(Xk, k(τ))] where L(Xk, k) varies over all linear functions of variables/covariates other than Xk and, the quantile loss function is given by ρτ(z) = zτ, z 0; ρτ(z) = (1 τ)z, z < 0. We also assume that this minimizer is unique. Next, we assume,

C1. If Xk, k(τ) does not depend on Xj for some j, for any τ, then the coeﬃcient of Xj in ˆL(Xk, k(τ)) is zero over all quantiles, that is ˆβk,j(τ) = 0 for all τ;

C2. If Xk, k(τ) depends on Xj for some τ, then there exists ϵ, δ, δ > 0 such that for τ on the interval [ϵ, 1 ϵ], we have |ˆβk,j(τ)| > δ for τ in an open subset of [ϵ, 1 ϵ] of radius δ and ˆβk,j(τ) is a continuous function of τ for τ (0, 1).

Condition C1 enforces that conditional independence implies the same for best linear quantile function and condition C2 implies that Xj s that are connected to a particular Xk are detectable through linear quantile regression. Condition C2 can be relaxed by using polynomial/spline basis to accommodate general functions, but here we restrict ourselves to linear functions and linear quantile regression. Suppose we have n independent observations which can be presented as a n P data matrix X = {Xij, i = 1, , n, j = 1, , P}. We write X = [X1, , XP ] where Xi is the n 1 dimensional ith column vector containing the data corresponding to the ith variable. Since we consider the conditional quantile regression for each of the variable Xj given all the other variables, for the sake of simplicity we describe the general methodology only for a speciﬁc variable Xk. For notational convenience, we assume Y is the kth column of X containing the data related to Xk. Furthermore, X = X k is a n P dimensional matrix containing data corresponding to all other variables except the kth one. Hence, we redeﬁne X having Xi in the i + 1 th column if i < k and Xi in the ith column for i > k. We also allow the intercept term as a vector of ones in the ﬁrst column. In the quantile regression for Xk, we treat Y as the response and X as the covariates. For the τ th quantile regression, we obtain the estimates of the regression coeﬃcients by minimizing the loss function l such as min β Pn i=1 ρτ(yi x iβ) the regression coeﬃcient vector β = {β0, β1, . . . , βk 1, βk+1, . . . , βP }, yi is the ith element of Y and xi is the ith row of X. Mathematically minimizing this loss function l is equivalent to maximizing l where exp( l) is proportional to the likelihood function. This duality between a likelihood and loss, particularly viewing the loss as the negative of the log-likelihood, is referred to in the Bayesian literature as a logarithmic scoring rule (see, for example, Bernardo (1979), page 688). Using loss function to construct likelihood may cause model misspeciﬁcation. Later we address the issue and show even under model misspeciﬁcation, we have the posterior

Quantile Graphical Models: a Bayesian Approach

concentration around the best linear approximation of the conditional quantile functions. Accordingly, the corresponding likelihood based method can be formulated by developing the model as yi = x iβ + ui where uis are independent and identically distributed (iid) random variables with the scale parameter t as f(u|t) = tτ(1 τ)exp( tρτ(u)). Using the likelihood corresponding to the quantile regression gives the consistent estimate of the coeﬃcients of the conditional quantile regression (Sriram et al. (2013)). Misspeciﬁed likelihood (see Chernozhukov and Hong (2003); Yang et al. (2016) ) may impact the posterior inference such as conﬁdence interval for coeﬃcients. But here our main goal is to model the conditional quantile function through linear approximation and perform a model selection for the quantile function through a likelihood equation. Also, we do not enforce any ordering restriction between the quantile functions for diﬀerent quantiles. If the linear representation holds for conditional quantile then the posterior estimates from the likelihood based on the loss function should show the desired ordering, as we can estimate the coeﬃcients of the quantile regressions consistently. The quantile based conditional distributions may not correspond to a joint distribution. However, here we model the linear approximation of the conditional quantile functions over a grid of quantiles and construct posterior probability of the selecting the neighbors of a particular node/variable by constructing the pseudo likelihood function based on quantile loss. Later we show that even if we have misspeciﬁed model, we have posterior probability of selecting wrong edge/neighbor will go to zero under this loss based pseudo likelihood. Using the results from Li et al. (2010) and Kozumi and Kobayashi (2011), we can express : ui = ξ1vi + t 1

2 ξ2 vizi, where ξ1 = 1 2τ τ(1 τ), ξ2 = q

2 τ(1 τ) , v Exp(t) and z N(0, 1). Furthermore, the variables indexed by diﬀerent i s are independent. The ﬁnal model can be represented by integrating previous results as

yi = x iβ + ξ1vi + ξ2t 1

2 vizi vi Exp(t),

zi N(0, 1). (2)

For selecting the adjacent nodes (neighborhood selection) for node k, a Bayesian variable selection technique has been performed. The stochastic search variable selection (SSVS) is adapted using a spike and slab prior for the regression coeﬃcients as : p(βj|Ij) = Ij N(0, g2v2 0) + (1 Ij)N(0, v2 0), (George and Mc Culloch (1993)) for j = 1, . . . , P, j = k and Ij is the indicator variable related to the inclusion of the jth variable. Let γ be the vector of indicator function Ij s. We denote the spike variance as v2 0 and the slab variance as g2v2 0, where g is a large constant. Alternatively, writing βγ,j = βj Ij (Kuo and Mallick (1998)) can be helpful, where we use the indicator function in the likelihood and model the quantile of y by x βγ. Further, a Beta-Binomial prior is assigned for Ij. The corresponding Bayesian hierarchical model is described as

βj N(0, t 1σ2 β),

π Beta(a1, b1),

t Gamma(a0, b0). (3)

Guha, Baladandayuthapani, Mallick

The Beta Binomial prior opposed to a ﬁxed binomial distribution with a ﬁxed π induces sparse selection (Scott and Berger (2010)). For a sparse estimation problem we consider m diﬀerent quantile grid points in (0, 1) as τ1 . . . , τm. Let βl = {β0,l, .., βk 1,l, βk+1,l . . . βP,l} be the coeﬃcient vector corresponding to the τl and βγ,l = {β0,l, .., βk 1,l Ik 1,l, βk+1,l Ik+1,l, . . . βP,l IP,l}. Let β be the vector of all the βl s; β(l 1)P+j = βj 1,l if j < k and β(l 1)P+j = βj,l for j > k. In this setup, Ij,l = 0 for all l implies that Xj is not in the model, and Ij,l = 1 for some l implies that Xj is included in the model. Let tl be the scale parameter for τl. For τl, we write vi, ξ1 and ξ2 from (2) as vi,l, ξ1,l and ξ2,l, respectively. Let v be the vector of vi,l s. Using τ1, τ2, . . . , τm the corresponding loss function for τl is l(βl) = ρτl(y x βγ,l) and the corresponding likelihood function is

fτl(yi|tl, βl, γ) tl exp( tlρτl(yi x iβγ,l)). (4)

The hierarchical model can be written as follows:

βl|τl MV NP (0P , Σβ,P P ), l = 1, . . . , m,

Ij,l Ber(πl),

πl Beta(a1, b1),

tl Gamma(a0, b0),

fτl(Y|tl, βl, γ) tn l exp( tl

i=1 ρτl(yi x iβγ,l)). (5)

Here, 0P is a vector of zeros of length P and, MV NP (0P , Σβ,P P ) denotes P dimensional multivariate normal distribution with the mean vector 0P and the covariance matrix Σβ,P P . We use Π(.) to denote prior distributions. Using the setting in (5) and (2), we can express the posterior distribution of the unknowns as

Πl(βl, {Ij,l}j =k, πl, vl, tl|Y) t3n/2 l {

2 exp( tl (yi x iβγ,l ξ1,lvi,l)2

2vi,lξ2 2,l )

exp( tlvi,l)}Π(βl)} Y

j =k Π(Ij,l)Π(πl)Π(tl). (6)

Each of the posteriors Πl( ) gives probability to the parameters and hyper-parameters corresponding to τl in particular, on Θl = {βl, {Ij,l}j =k, πl, vl, tl}. Let, Θ = {Θl}l. The distribution on Θ induced by Πl( ) s given by Π(Θ) = Q Πl(Θl). The posterior distribution given in 6 is not available in an explicit form and we have to use simulation based approach like Markov Chain Monte Carlo (MCMC) to obtain realizations from it which is described in section 4. Even more, we have to repeat this procedure for each k over all quantiles, which makes it more computationally demanding. Due to these reasons, we also develop an approximate method based on the variational technique.

Quantile Graphical Models: a Bayesian Approach

3. Graph estimation consistency

In this section, we consider the consistency of the proposed graphical model. Two approaches can be adopted. One method is to look at the variable selection consistency for each of the nodes and the alternative way will be to consider the ﬁtted density induced by the graphical model. We take the latter approach ﬁrst and show the predictive consistency of the proposed network in scenarios encompassing the P > n case. The dimension is adaptively increased with increasing n, the number of observations for each variable. Let P = pn be the number of nodes. We show that with n increasing to inﬁnity under some appropriate conditions on the prior, the ﬁtted density lies in the Hellinger ball of radius ϵn, approaching to zero, around the true density with high probability, if the proposed model is correct. Next, we consider the case of model miss-speciﬁcation and neighborhood selection consistency.

3.1. Consistency under true model

Convergence in terms of Hellinger distance between the posterior graph and the true graph can be achieved under conditions similar to Jiang (2007). Here we brieﬂy deﬁne the convergence criterion, describing the conditions required and discuss their implications in terms of the graph estimation. Let, G be the true graph and fk,G be the density associated with the k the node of true graph under the proposed model and Π(.|.) be the posterior density and fk be the density under the model given in Equation 4 . Convergence in terms of Hellinger distance such as, PG [Π[d(fk, fk,G ) < ϵn|X ] > 1 δn] 1 λn

where ϵn, δn, λn going to zero as n can be achieved, where δn goes to zero in exponential rate. Here, d(f, f ) = q

f (x))2d(ν(x)) denotes the scaled standard Hellinger distance in some measure space χ with measure ν, where f be the true data generating density, and PG [ ] or P [ ] be the probability under true data generating density. For the neighborhood of Y = Xk, writing the coeﬃcients β( k) γ,l = βγ,l, β( k) l = βl,

β( k) j,l = βj,l, vi,l,k = vi,l, using (6) we have

Πl(β( k) l , {I( k) j,l }j, πl, vl, tl|.) (7)

i=1 vi,l,k 1

2 exp( tl (Xi,k x ki β( k) γ,l ξ1,lvi,l,k)2

2vi,l,kξ2 2,l )

exp( tlvi,l,k)}Π(β( k) l )} Y

j =k Π(Ij,l)Π(πl)Π(tl)

where x ki is the i th row of X k. Through this conditional modeling, we show the posterior concentration of f(xk|xi, i = k)f (xi, i = k) around f (x1, . . . , x P ). For the indicator function for the neighborhood selection of k th node, we assume Ij,l Ber(πn), j = k, with the restriction Ppn j=1 Ij,l rn. Let rn = pnπn. The restriction on the maximum possible dimension can be relaxed by assuming a small probability on the set Ppn j=1 Ij,l rn. Also, the scale parameter tl = t is assumed to be ﬁxed. The following

Guha, Baladandayuthapani, Mallick

results also hold for the Beta-Binomial prior on the indicator function and we address it later. Let ϵn be a positive sequence decreasing to zero and 1 nϵ2 n, where an bn implies bn an . We have the following prior speciﬁcations, β( k) l MV NP (0P , S 1 βl ), where S 1 βl is a diagonal matrix in our setting. Under the true data generating model given in Equation 4 , let l k(rn) = inf|γ|=rn P j / γ,j =k |β ( k) j,l |. Here the superscript denotes the true coeﬃcient values. Let ch1(M) denote the largest eigenvalue of the some positive deﬁnite matrix M. Let β( k) γl N(0, Vγl), i.e the distribution

restricted to the variables included in the model. Let, B(rn) = maxl{ch1(Vγl), ch1(V 1 γl )}. Suppose the following conditions hold. A1. rnlog(pn) nϵ2 n. A2. rnlog(1/ϵ2 n) nϵ2 n. A3. 1 rn rn pn. A4. P j =k |β ( k) j,l | < . A5. 1 rn pn < nα; α > 0. A6. B( rn) nϵ2 n. A7. pn l k(rn) ϵ2 n. Conditions similar to A1-A7 can be found in Jiang (2007). Condition A1 is needed for establishing the entropy bound on a smaller restricted model space, that is an upper bound on the number of Hellinger balls needed to cover the restricted model space. Conditions A2 and A6 ensure that we have suﬃciently large prior probability on the Kullback-Leibler(KL) neighborhood of true model. Assumption A7 is needed to ensure sparsity that is coeﬃcients from all but few variables are close to zero and the total residual eﬀect is small. Also, it has pn multiplied on the L.H.S as we may not have the boundedness of the node values. Also, the eigenvalue condition is satisﬁed trivially. The main idea is to show negligible prior probability for models with dimension larger than rn or where the coeﬃcient vector lies outside a compact set. Then next step would be to cover the smaller model space with N(ϵn) many Hellinger balls of size ϵn with log(N(ϵn)) nϵ2 n. Tests can be constructed similar to Ghosal et al. (2000). Then by showing that the prior probability of KL neighborhood around the true model has lower bound of some appropriate order, the following results can be achieved. Let, hk = q

f(xk|xi, i = k) p

f (x|xi, i = k))2f (xi =k)d(x)). Let the generic term Dn denotes the data matrix. Then we have the following theorem.

Theorem 3.1 Suppose supj E|Xj| = M < . Then from (6) under A1-A7, for some c 1 > 0 and for nδ pn nα; α > δ > 0 and for sup|γl| rn{ch1(Vγl), ch1(V 1 γl )} B rv n; v, B > 0, for large enough rn, the following convergence results hold in terms of the Hellinger distance if the true data is generated by the likelihood given by equation (4) for some τl, as the number of observations goes to inﬁnity a)

P [Πl(hk ϵn|Dn) > 1 e c 1nϵ2 n] 1.

Proof Given in the Appendix section.

Quantile Graphical Models: a Bayesian Approach

Remark 1 In particular, for rn nb with b = min{ξ, δ, ξ/v} and ϵn = n (1 ξ)/2 with ξ (0, 1), we have nϵ2 n = nξ and the decaying rate of the order e nξ.

Remark 2 If each of the node has ﬁnitely many neighbors, then some assumptions on tail conditions such as A4, A7 become redundant as only ﬁnitely many β ( k) j,l s are non zero for each k. For pn = O(nα), 0 < α < 1, we can have rn = pn and ϵn = n (1 ξ)/2, with ξ (α, 1).

Remark 3 The results in Theorem 3.1 hold for Beta-Binomial prior on the indicator function as well and given in the Appendix section.

3.2. Consistency under model misspeciﬁcation

3.2.1. Density estimation

Model (4) has been developed from a loss function and may not be the true data generating model. Therefore, we extend our consistency results under the condition of model misspeciﬁcation. Let, f0 k, k be the true density of Y = Xk given X k and Fk be the set of densities fl,k, k s given by (4). Let f0 k be the true data generating density for X k, the variables other that than Xk. Let f l,k, k Fk be the density in (4) such that f l,k, kf0 k has the smallest Kullback-Leibler (KL) distance with f0 k, kf0 k. We show that the posterior given in (6) concentrates around f l,k, k for τl . We ﬁx the scale parameter tl. It should be noted that for miss-speciﬁed case, we use f0 for the true density and f for the nearest point to the true density in KL sense, slightly changing earlier notation from Section 3.1.

Let lτ,β( k) l = E(ρτl(xk x kβ( k) l )) and ˆβ

l = arg minβllτ,β( k) l and suppose the

minimizers are unique. Let, ˆβ

( k) be the combined vector, analogous to β( k). Then under

some conditions, the posterior converges to fl,k, k( ˆβl

( k)), the density corresponding to the best linear quantile approximation for τl. Let inf KL(f0 k, kf0 k, fl,k, k(βl ( k))f0 k) = δ k,l,

which is achieved at the parameter value ˆβ

l . Posterior concentration under model misspeciﬁcation needs more involved calculations and can be shown under carefully constructed test functions, as given in Kleijn et al. (2006). However, such approach may depend on the convexity or boundedness of the model space. We take a route similar to Sriram et al. (2013)) based on the quantile loss function and show the convergence directly. To prove the consistency, we make a few assumptions. Without loss of generality, we assume that the variables are centered around zero. Let dk be the degree (the number of neighbors) of the k th node. Under the following conditions we prove the convergence theorem.

B1. maxkdk < M0 1 for some universal constant M0.

B2. E[eλ||Xk| E(|Xk|)|] e.5λ2ν2 for |λ| < b 1, k, ν > 0 (sub-exponential tail condition).

B3. There exists ϵ > 0, such that for |Xk| < ϵ, k, any m M0 dimensional joint density of any m number of covariates Xk s is uniformly bounded away from zero. We also assume that E|Xk| s are uniformly bounded.

Guha, Baladandayuthapani, Mallick

B4. logpn n.

B5. supk ˆβ

Theorem 3.2 From (4) and (6), under conditions B1 B5, for any δ > 0, Π(KL(f0 k, kf0 k, fl,k, kf0 k) > δ + δ k,l, for some k| ) goes to zero almost surely, as the number of observations n goes to inﬁnity.

Next, we derive the posterior convergence rate under the model misspeciﬁcation. For a sequence ϵn converging to zero, we assume

B6. ϵn n ξ, ξ < .25,

B7. logpn nϵ4 n.

Theorem 3.3 From (4) and (6), under conditions B1 B7, as the number of observations n goes to inﬁnity, Π(KL(f0 k, kf0 k, fl,k, kf0 k) > δn +δ k,l, for some k| ) goes to zero almost surely, where δn = 4ϵ2 n.

Proofs of Theorems 3.2 and 3.3 are given in the Appendix section. We ﬁrst show the results for bounded Xk s and later extend our results for heavy-tailed sub-exponential distributions, at the end of the proof of Theorem 3.3.

3.2.2. Neighborhood selection consistency

Next, we state the following Theorems about the neighborhood selection. For Xk or k th node, let N k = {i = k; i {1, . . . , P = pn} : Xi Xk}, where Xj Xk implies that there is an edge between j th and k th node. Let, N l,k = {i = k; i {1, . . . , P = pn} : ˆβk,i(τl) = 0}, be the neighborhood corresponding to best linear conditional quantile for τl, where ˆβk,i(τl) s are given in conditions C1, C2 in Section 2 and ˆβk,i(τl) is the coeﬃcient corresponding to

Xi, i = k, in ˆβ

l from Section 3.2.1.

Lemma 1 Under C1 and C2, for 0 = τ0 < τ1 < τ2 < τm < 1 and τi τi 1 < δi, there exists δ0, m0 > 0 such that for δi < δ0 for all i, m > m0 and N k = l N l,k.

Let M l,k be the model corresponding to the neighborhood N l,k and M k corresponds to N k . We assume the following.

B8. Ij,l Ber(πn) and logπn = O(n0.5+ϵ ); 0 < ϵ < 0.5.

B9. logpn = O(logn).

The above condition B8 puts a strong penalty on the model size which penalizes the neighborhood size of a node, and selection probability under posterior distribution of any bigger model, containing the true model for a node, goes to zero with high probability. Next, we assume the following for the conditional densities and the quantiles. This conditions are similar to the conditions in Angrist et al. (2006) in the context of estimating the conditional quantile regression coeﬃcient for miss-speciﬁed linearity. For a model M1 k

Quantile Graphical Models: a Bayesian Approach

at node k, let Z denote the |M1 k|+1 dimensional random variable consisting of 1 in the ﬁrst place and Xj s, j = k that are in the model M1 k in the remaining places and |M1 k| is the size of model M1 k. Let ˆβ(τ)M1 k be the corresponding coeﬃcients for the unique best linear conditional quantile solution.

C3. The true conditional density f0(xk|x k) is bounded and uniformly continuous in xk uniformly over support of X k.

C4 J(τ) = E[f0(Z ˆβ(τ)M1 k|Z)ZZ ] is positive deﬁnite for all τ with Eigen values uniformly bounded away from zero for τ in any open subset of [0, 1], for Z deﬁned above for any M1 k, and E[ Z 2+ϵ2] is uniformly bounded for some ϵ2 > 0, over all possible model of size |M1 K|, for any ﬁnite dimensional model M1 k.

Let ˆβ(τl)M1 k be the coeﬃcient vector that minimizes the linear conditional quantile regression loss E[ρτl(Y β XM1 K)] where Y = Xk, XM1 K stands for the variables other than Xk that are in model M1 k, including the one in ﬁrst coordinate for the intercept term, and ˆβ

n(τl)M1 k be the corresponding MLE for the likelihood based on this loss function. In

Angrist et al. (2006), convergence of the process J( ) n(ˆβ

n( )M1 k ˆβ( )M1 k) was shown, for τ in a subinterval of (0, 1). Using those results we show the following neighborhood selection related result. Let, Πn τl,k(M1, M2) = Πn l,k(M1, M2) denote the ratio of posterior probabilities of model M1 and M2, at node k for τl based on n observations. Let M l,k be model based on the neighborhood N l,k for τl, and for a model M1, corresponding to some node, let |M1| be its size or number of covariates/neighbors in the model. We assume tl is ﬁxed and equal to one, without loss of generality.

Theorem 3.4 For quantiles 0 < τ1 < < τm < 1, δi = τi τi 1, for equations (4) and (6), under B1, B2, B5, B8, B9, C1 C4 we have supl{Πn l,k(M1 k, M l,k) : M1 k = M l,k} 0, in probability, for any alternative model M1 k, as n goes to inﬁnity.

Remark 4 Let M1 k be any model corresponding to a neighborhood at node k, N1 k, which does not contain N k. Let ˆβ(τ)M1 k be the corresponding minimizer of the expected linear

conditional quantile loss for that model. Suppose, we assume ˆβ(τ)M1 k to be continuous on

τ and infτ (ϵ,1 ϵ)lτ,ˆβM1 k lτ,ˆβM k > 0 for any ϵ > 0, where lτ,ˆβM1 k = E[ρτ(xk Z ˆβ(τ)M1 k)]

and Z is the |M1 k| + 1 dimensional random variable with one in the ﬁrst coordinate and variables corresponding to M1 k in others. Then under the set up of Theorem 3.4, we have supτ (ϵ,1 ϵ){Πn τ,k(Mk, M k) : Mk = M k} 0, in probability. Therefore heuristically, for large n, choosing quantiles on [τϵ, 1 τϵ], τϵ > 0, even if we choose quantile densely, the false discovery rate should not keep on increasing with the number of quantile grids, and should stabilize. This conclusion is later veriﬁed in our simulation.

Guha, Baladandayuthapani, Mallick

4. Posterior analysis

We ﬁrst describe the MCMC steps for posterior simulation. Next, we derive the variational approximation algorithm steps for our case. For simplicity, we illustrate the posterior sampling for Y = Xk,n 1 and X = X k ; i.e the neighborhood selection for the k th node. For notational convenience, we will not use the suﬃx k in this section and formulate the method for a regression setup. Let us introduce some notations which will be used in both formulations. Let Xγ,l is n P dimensional covariate matrix containing Xi Ii,l in i + 1 th column for i < k and Xi Ii,l in ith column for i > k and the vector of ones in the ﬁrst column. To write the steps for variational approximation and MCMC, we deﬁne the following quantities.

Let Y1 = 1m 1 Y be an n m length vector formed by replicating Y, m times.

Let X1,γ be the matrix by arranging the Xγ s diagonally. Let, XE 1,γ denotes the matrix X1,γ, with indicators replaced by their expectations. Similarly, we have XE γ . Distributions corresponding to the Expectation calculation will be speciﬁed later and will arise in variational approximation set up. Similarly, the expectation calculation for the following steps will be speciﬁed later.

Let Y δ, Y δ are mn length vector such that, Y δ (l 1)n+i = Y1(l 1)n+i ξ1,l(E( 1

vi,l )) 1 for

l = 1, . . . , m and i = 1, . . . , n, and similarly, Y δ (l 1)n+i = Y1(l 1)n+i ξ1,l( 1

vi,l ) 1, and

Y δ,l, Y δ ,l be the analogous n length vectors for τl.

Let Σl be the n n diagonal matrix, where i th diagonal entry is E(tl)E( 1 vi,lξ2 2,l ) for

l = 1, . . . , m . Similarly, Σ1 l be the n n diagonal matrix, where i th diagonal entry is (tl)( 1 vi,lξ2 2,l ) for l = 1, . . . , m .

Let Sx = X1 ΣX1 and Sx,γ = X 1,γΣX1,γ. Similarly, SE x,γ is the expectation of Sx,γ. Let Sx,γ,l and SE x,γ,l be the matrices corresponding to τl.

Let β be the m P length vector such that β(l 1)P+j = βj 1,l if j < k and β(l 1)P+j = βj,l otherwise, for l = 1, . . . , m. Also, note that we denote the prior for β as β N(0, S 1 β ) as in Section 3. For τl, βl N(0, S 1 β,l ).

4.1. MCMC steps

Here, we describe the implementation of the MCMC algorithm to draw realizations from the posterior distribution. More speciﬁcally, we use Gibbs sampling by simulating from the complete conditional distributions which are described below (for the k th node).

(a) For the coeﬃcient vector βl: Given rest of the parameters the conditional distribution is:

qnew(βl|.) := MV N((Sx,γ,l + Sβ,l) 1(Xγ,l) Σ1 l Y δ ,l, (Sx,γ,l + Sβ,l) 1).

Quantile Graphical Models: a Bayesian Approach

Σ1 l be the n n diagonal matrix, where i th diagonal entry is tl( 1 vi,lξ2 2,l ) for l = 1, . . . , m and

Sx,γ,l = X γ,lΣ1 l Xγ,l. (b) For πl :

qnew(πl|.) := Beta(a1 +

j=1,j =k Ij,l, P 1

j=1,j =k Ij,l + b1).

(c) For vi,l s:

qnew(vi,j|.) vi,lf In G(vi,l, λi,l, µi,l)

where f In G(vi,j, λi,l, µi,l) is a Inverse Gaussian density with parameters λi,l = tl( (yi x iβγ,l)2

and µi,l = s

2tl+tl ξ2 1,l ξ2 2,l

. This step involves a further Metropolis-Hastings sampling with a

proposal density for vi,l s as f In G(vi,j, λi,l, µi,l). (d) For tl:

qnew(tl|.) := Gamma(a2, b2)

where a2 = a0 + n

2 + n and b2 = b0 + 1

2 P i( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l ) + P i vi,l.

(e) For the indicator functions:

log(P(Ij,l = 1|.)

P(Ij,l = 0|.)) = (log πl 1 πl ) 1

i,Ij,l=1 tl( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )

i,Ij,l=0 tl( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )}.

We simulate from this conditional distributions iteratively to obtain the realizations from the joint posterior distribution.

4.2. Variational approximation

As explained in section 2 , we approximate the posterior distribution to facilitate a faster algorithm. We use the variational Bayes methodology for this approximation. First, we brieﬂy review the variational approximation method for posterior estimation. For observed data Y with parameter Θ and prior Π(Θ) on it, if we have a joint distribution p(Y, Θ) and a posterior Π(Θ|Y ) respectively then

log p(Y ) = Z log p(Y, Θ)

Π(Θ|Y )q(Θ)d(Θ)

= Z log p(Y, Θ)

q(Θ) q(Θ)d(Θ) + KL(q(Θ), Π(Θ|Y )),

Guha, Baladandayuthapani, Mallick

for any density q(Θ). Here KL(p, q) = Ep(log p

q), the Kullback-Leibler distance between p and q. Thus,

log p(Y ) = KL(q(Θ), Π(Θ|Y )) + L(q(Θ), p(Y, Θ))

L(q(Θ), p(Y, Θ)) = KL(q(Θ), Π(Θ|Y )) log p(Y ). (8)

With given Y , we minimize L(q(Θ), p(Y, Θ)) = R log q(Θ)

p(Y,Θ)q(Θ)dΘ. Minimization of the L.H.S of (8) analytically may not be possible in general and therefore, to simplify the problem, it is assumed that the parts of Θ are conditionally independent given Y . That is

and s i=1Θi = Θ is a partition of the set of parameters Θ. Minimizing under the separability assumption, an approximation of the posterior distribution is computed. Under this assumption of minimizing L.H.S of (8) with respect to qi(Θi), and keeping the other qj(.), j = i ﬁxed, we develop the following mean ﬁeld approximation equation:

qi(Θi) exp(E i log(p(Y, Θ)), (9)

where E i denotes the expectation with respect to q i(Θ) = Qs j=1,j =i qj(Θj). We keep on updating qi(.) s sequentially until convergence. For τl, we have Θ = Θl = {βl, I = {Ij,l}, πl, tl, v = {vi,l}} with i = 1, . . . , n; j = 1, . . . P, j = k. To proceed, we assume that the posterior distributions of βl, Il = {Ij,l}, πl, vl = {vi,l} and tl s are independent given Y . Hence,

q(Θ) = q(tl)q(βl) Y

j =k q(Ij,l) Y

i q(vi,l)q(πl).

Under (3), (6), we have,

p(Y, Θ) tn l {t

n 2 l {Qn i=1 vi,l 1

2 exp( tl (yi x iβγ,l ξ1,lvi,l)2

2vi,lξ2 2,l )exp( tlvi,l)}}

2(β l(Sβl)βl))π P j =k Ij,l+a1(1 πl)P 1 P j =k Ij,l+b1ta0 1 l exp( b0tl).

Using the expression given in (9), we have the variational algorithm given in Table 1. The densities under this variational approximation algorithm converge very fast and that makes the algorithm many time faster than the standard MCMC algorithms. From (9) we have an explicit form of qnew(.) and for our case the updations inside the algorithm are given next.

4.2.1. Sequential updates

If qold(πl), qold(tl), qold(vi,l), qold(Ij) are the proposed posteriors of πl, {tl}, vi,l and Ij,l s at the current step of iteration, we update

qnew(βl) exp(Eold πl,tl,vi,l,Ij,l;i,jlog(p(Y, Θ)))

Quantile Graphical Models: a Bayesian Approach

Table 1: Variational Update Algorithm

1. Set the initial values q0(βl), q0(vl), q0(Il), q0(tl) and q0(πl). We denote the current density by qold(). For iteration in 1:N : 2. Find qnew(βl) by

qnew(βl) = arg min q (βl) L q (βl)qold(Il)qold(vl)qold(πl)qold(tl), p(Y, Θl) .

Initialize qold(βl) = qnew(βl). 3. Find qnew(Il) by

qnew(Il) = arg min q (Il) L qold(βl)q (Il)qold(vl)qold(πl)qold(tl), p(Y, Θl) .

Initialize qold(Il) = qnew(Il). 4. Find qnew(vl) by

qnew(vl) = arg min q (vl) L qold(βl)qold(Il)q (vl)qold(πl)qold(tl), p(Y, Θl) .

We initialize qold(vl) = qnew(vl). 5. Find qnew(πl) by

qnew(πl) = arg min q (πl) L qold(βl)qold(Il)qold(vl)q (πl)qold(tl), p(Y, Θl) .

Initialize qold(πl) = qnew(πl). 6. Find qnew(tl) by

qnew(tl) = arg min q (tl) L qold(βl)qold(Il)qold(vl)qold(πl)q (tl), p(Y, Θl) .

Initialize qold(tl) = qnew(tl). We continue until the stop criterion is met. end for 9. Return the approximation qold(βl)qold(Il)qold(vl)qold(πl)qold(tl).

Guha, Baladandayuthapani, Mallick

where Eold π,tl,vi,l,Ij;i,j denotes the expectation with respect to the joint density given by qold(πl)qold(tl)qold Q i qold(vi,l) Q j qold(Ij,l). We have the following closed form expression for updating the densities sequentially. At each step, the expectations are computed with respect to the current density function. Thus, for the coeﬃcient vector writing the update across l quantiles:

qnew(βl) := MV N((SE x,γ,l + Sβl) 1(XE γ,l) Σl Y δ,l, (SE x,γ,l + Sβl) 1).

For πl we have,

qnew(πl) := Beta(a1 +

j=1,j =k E(Ij,l), P 1

j=1,j =k E(Ij,l) + b1).

For vi,l s qnew(vi,j) vi,lf In G(vi,l, λi,l, µi,l)

where f In G(vi,j, λi,l, µi,l) is a Inverse Gaussian density with parameters λi,l = E(tl)E( (yi x iβγ,l)2

and µi,l = s

2E(tl)+E(tl) ξ2 1,l ξ2 2,l

For the indicator function we have

log(P(Ij,l = 1)

P(Ij,l = 0)) = E(log πl 1 πl ) 1

i,Ij,l=1 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )

i,Ij,l=0 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )}.

For the tuning parameter tl,

qnew(tl) := Gamma(a2, b2)

where a2 = a0 + n

2 + n and b2 = b0 + 1

2 P i E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2 ) + P i E(vi,l). All the moment computations in our algorithm involve standard class of densities. Hence, moments can be explicitly calculated and used in the variational approximation algorithm. Later in the examples we standardize the data and use tl = 1.

4.3. Algorithm for graph construction

Let A be the P P adjacency matrix of the target graphical model. Fixing τ1, . . . , τm, for k = 1, . . . , P, we compute the posterior neighborhood for each node as follows:

Construct Y = Xk and X k as in section 2.

Compute the posterior of Π(Θl|Y ), by using MCMC or the Variational algorithm, where Θl = {βl, Il = {Ij,l}, πl, tl, vl = {vi,l}} with i = 1, . . . , n; j = 1, . . . P, j = k for all l.

Quantile Graphical Models: a Bayesian Approach

As mentioned earlier in Section 2, Ij,l = 0 for all l, implies that βj,l is not in the model, and Ij,l = 1 for some l implies that they are included in the model for some l. If P(Ij,l = 1|Y ) > 0.5 for some l, then A(j, k) = 1, and A(j, k) = 0 otherwise.

Two nodes i and j are connected if at least one of the two is in the neighborhood of the other according to the adjacency matrix A.

5. Some illustrative examples

In this section, we consider three simulation settings to illustrate the application of the proposed methodology. We compare our methodology with the neighborhood selection method for Gaussian graphical model (GGM), using the R package huge (Zhao et al. (2012)) where the model is selected by huge.select function. We considered graphical Lasso (GLASSO) for graph estimation. We use a1 = 1, b1 = 1 in the Beta-Binomial prior. Using the setting of (3) and (10), we use an independent mean zero Normal prior on the components of β.

5.1. Example 1

Example 1(a)

To illustrate our method, we consider the following example. We consider P = 30 variables X1, . . . , X30. We construct X1, . . . , X10 in the following sequential manner:

X1, . . . , X10 Gamma(1, .1) 10,

X2 = .4X1 + ϵ1,

X6 = 1.1X1 + 4X4 + 1.3X9 + ϵ2,

X7 = Φ 1( exp(X2) 1 + exp(X2)) + ϵ3,

where ϵ1 N(0, 1), ϵ2 .5N(2, 1)+.5N( 2, 1), ϵ3 N(0, 1) and they are independent and independent of the Xi s for each step. The quantity Φ denotes the cdf of standard normal distribution. Next, we construct X11, . . . X20 from hierarchical multivariate normal random variables Y1, . . . , Y10. That is for i th observation, y 1,i, . . . , y 10,i MV N(0, Σ), xj,i = yj,iri, j = 1, . . . , 10, with 0 is a vector of zeros and Σkl = .7|k l| and 1

ri Gamma(3, 3), i = 1, . . . , n, ri s are independent, and we have X10+j = Yj, j 1. We generate independent normal random variables with mean zero and variance 1 for X21 till X29 and X30 be the vector of log(ri) s. Hence, we have total number of nodes/variables P = 30, and given the scale parameter X30, the graph has two disjoint parts namely: G1 = {X1, . . . , X9} and G2 = {X11, . . . , X20}. In addition, non-linear relationships are present between the variables. Generating n = 400 independent observations over 100 replications, we construct the network by our algorithm and compare it with the GGM based neighborhood selection method as mentioned earlier. For GGM we use huge.select from the R package huge which uses GLASSO and the implementations of the formulation from Meinshausen and B uhlmann (2006) (MB). The stability based selection criterion (argument stars in the R function) performs relatively better in this example and is therefore compared with our method.

Guha, Baladandayuthapani, Mallick

Table 2: A comparison between GGM and quantile based variational Bayes method(QVB) for Example 1.

Method FDR e1 e2 e3 QV B({τ} = {.5}) 0.31 0.33 0 0 QV B({τ} = {.3, .5, .7}) 0.69 0.19 0 0 GGM(MB) 4.96 1.62 0.14 .05 GGM(GLASSO) 14.96 3.20 0.04 .03

For quantile based variational Bayes (QVB), the data is standardized, and we use t = 1, independent N(0, v), v = 1 prior on the coeﬃcients. The QVB graph is robust to prior variance over a range 1 v 100. A typical ﬁtted subgraph for X1, . . . , X29 conditional on the scale parameter X30 is presented in Figure 2 for QVB and MCMC based ﬁts with same parameter speciﬁcations. The QVB method has successfully recovered the connected part inducing sparsity whereas GGM has estimated wrong connections. Moreover, the quantile based method performs better to separate the independent parts. Using MCMC algorithm, we obtain the similar graphs but the QVB is several hundred times faster than the MCMC. In Table 2, an account of false positivity ( detecting an edge, where there is none) has been provided along with the average number of undetected edges for the QVB. Here, FDR denotes the number of falsely detected edges on average per graph, e1, e2 and e3 denote the average number of undetected edges in G1, G2 and the average number of falsely detected connectors between them. It can be seen that the misspeciﬁcations are signiﬁcantly higher in GGM. The GGM detects a lot of extra edges along with the existing edges. Also, G1 and G2 are generally well separated by the quantile based method. Overall, the quantile based variational Bayes provides a sparser and a more accurate solution. A typical MCMC ﬁt is similar to QVB ﬁt (Figure 2) but MCMC ﬁts generally have slightly sparser graph with QVB detecting weaker connections more frequently.

Example 1(b): P > n case.

In the next example, we consider a sparse P > n scenario. We construct X1, . . . , X10 similar to Example 1 (A). Next, we construct X11, . . . X20 from a similar hierarchical multivariate normal random variables Y1, . . . , Y10. That is y 1,i, . . . , y 10,i MV N(0, Σ), xj,i = yj,irj,i, j = 1, . . . , 10, with 0 is a vector of zeros and Σkl = .7|k l| and 1 rj,i Gamma(3, 3), i = 1, . . . , n, rj,i s are independent, and we have X10+j = Yj, j 3 and

X11 = 3Y3 + 2Y5 + ϵ4,

X12 = 3Y6 + 2Y7 + ϵ5,

X10+j, = Yj, j 3.

Like the previous setup of Example 1(A) with n = 350 and with adding further noise variables X21, . . . , X370 which are generated from a standard normal distribution. Thus, we have n = 350 and P = 370. The data is standardized and we use the same setting as of Example 1(A). The proposed method performs well to detect the underlying latent structure, as well as provides a sparse solution (see Figure 3).

Quantile Graphical Models: a Bayesian Approach

Figure 2: Example 1(a). Network for X1, . . . , X29, conditional on the scale parameter. Top left panel shows the true subgraph. Top middle and right panel show the network constructed by the quantile based variational Bayes (QVB) and MCMC with {τ} = {0.3, 0.5, 0.7}, respectively.

Bottom right and left panel show constructions by GGM based method using GLASSO and MB in huge.select, respectively. Index i denotes i th vertex corresponding to Xi.

Guha, Baladandayuthapani, Mallick

Figure 3: Right panel shows a sparse ﬁtted graph by variational Bayes for Example 1 P > n case (P = 370, n = 350), with {τ} = {.5} and left panel shows the connected variables.

5.2. Example 2: Performance under Gaussianity

Here, we compare quantile based method with the GGM based methods, where the true data is Gaussian. First we construct simple structured graph such as hub-graph and band graph (with banded structure in inverse covariance and adjacency matrix), and then generate multivariate normal data matrices with those underlying structures. We use quantile based ﬁt and compare with GGM based ﬁt for such Gaussian data. For the next example, we consider sparse graphs. The parameter speciﬁcation for quantile based variational Bayes (QVB) is similar to that of last example.

5.2.1. Hub-graph and Band graph

Using n = 300, P = 50 and 3 hubs, we generate hub graph using huge.generate function. A typical generated graph, with adjacency and inverse covariance given in Figure 4 along with the GGM ﬁt. Here the nodes correspond to 1, . . . , 50 are X1, . . . , X50 and the hub centers are located at X1, X17 and X34. From the ﬁtted graphs for τ = {0.5}, {0.3, 0.5, 0.7} for QVB in Figure 4, it is evident that quantile based method s performance is similar to GGM based methods, with QVB resulting slightly sparser graphs. Next, we generate graph with underline inverse covariance matrix having a band structure with n = 300 and P = 50, with nodes/covariates X1, . . . , X50, where for |i j| 3 there is an edge between Xi and Xj. The ﬁtted and true adjacency matrices are given in Figure 5, where the QVB s performance compares favorably to that of GGM s.

5.2.2. Sparse Gaussian graph

We generate graphs for diﬀerent sparsity levels using the R function simulategraph and compare the quantile based ﬁt with the GGM ﬁt. Here, P = 40, n = 100 and sparsity

Quantile Graphical Models: a Bayesian Approach

Figure 4: Example 2, part 1. Upper row, left panel shows the true and generated quantities for hub-graph through huge.generate. Upper row, middle and right panel show the GGM(MB) ﬁt for stability and information based selection criterion, respectively. Middle row shows the ﬁt for GLASSO and the ﬁtted graphs for QVB.

Bottom row shows the ﬁt for another replication with same set up. Here the ﬁtted networks for τ = 0.5 and τ = {0.3, 0.5, 0.7} are same for QVB, and GGM selection criterions are stability based.

Guha, Baladandayuthapani, Mallick

Figure 5: Example 2, part 1. Upper panel shows the true adjacency matrix and the GGM (MB) ﬁtted adjacency for band graph, in left and right panel. Lower left,middle and right panel show the adjacency matrix for the GGM ﬁt (GLASSO), QVB ﬁt for τ = {0.5}, and QVB for τ = {0.3, 0.5, 0.7}, respectively. Index i denotes i th vertex corresponding to Xi. In the adjacency matrix (i, j) th place is given by black iﬀXi Xj or the corresponding value is 1, and otherwise given by white for zero or no edge.

Quantile Graphical Models: a Bayesian Approach

Figure 6: Example 2, sparse graph case. Here n = 100, P = 40, and sparsity level = 0.05. Left panel shows the absolute value of partial correlation between variables, when data is generated from Gaussian graphical model. Middle and right panel shows the ﬁtted adjacency matrices for QVB and GGM, respectively. We have τ = {0.5}, {0.3, 0.5, 0.7} with both resulting same adjacency matrix for QVB. Index i denotes i th vertex corresponding to Xi. In the adjacency matrix (i, j) th place is given by black iﬀXi Xj or the corresponding value is 1, and otherwise given by white for zero or no edge.

levels are .05, .1 and thus, we have nodes corresponding to X1, . . . , XP . Figure 6 shows the matrix of absolute values of the true partial correlation for the underlying true covariance matrix, for the sparsity level 0.05 and the corresponding adjacency matrices of the ﬁtted network by QVB method for τ = {0.5}, {0.3, 0.5, 0.7} and the GGM based ﬁtted graph. The partial correlation is zero if and only if there is no edge between corresponding indices. The strength of the edge is proportional to the magnitude of this partial correlation. It can be seen that QVB results in a sparse graph similar to GGM. Generally QVB generates a sparse graph where very weak connections may not be detected, similar to GGM based method. Figure 7 shows a case with sparsity level 0.1 where the partial correlation values for the most of the undetected edges are close to zero and we have a sparse graph where the relatively stronger connections are detected in both cases. We use MB speciﬁcation in GGM with default information criterion(ric) based selection, which performs relatively better in this example.

5.3. Example 3: Eﬀect of quantiles and computational gain

5.3.1. Example a. Detecting the effect on extreme values

Example 3.A.i. Next, we consider the case where the conditional distribution of one variable depends on the other in extreme values. For X1, . . . , X15 independent normal with mean zero and variance one, X 1 = 2|X4| + 1.5|X7| + .5N(0, 1) and X 2 = 1.5|X5| + 2|X8| + .5N(0, 1). Let, Z1 = X 1IX 1>6 +N( 2, 1)IX 1<6 and Z2 = X 2IX 2>5.5 +N( 2, 1)IX 2<5.5, and Zp = |Xp|, p = 3 . . . , 10 and Zp = Xp, p > 10.

Guha, Baladandayuthapani, Mallick

Figure 7: Example 2, sparse Gaussian graph. Top left panel shows the true graph for the Gaussian graphical model.Top middle and right panel shows the ﬁtted graphs for QVB and GGM, respectively. Here n = 100, P = 40, and sparsity level = 0.1. Bottom panel shows the absolute value of the partial correlations corresponding to the undetected connections, for QVB and GGM, in left and right panels, respectively. Mostly weaker connections have not been detected both in GGM and QVB.

Quantile Graphical Models: a Bayesian Approach

Figure 8: Left panel shows the true network. From left, second to fourth panel show ﬁtted graph by variational Bayes for τ = {0.5}, τ = {0.2} and τ = {0.8}, respectively for n = 350 in Example 3(A)(ii).

We observe Z1, . . . , Z15. Depending on the value of a latent variable, a connection becomes active or switched on , if it crosses some cutoﬀand remains switched oﬀ or inactive otherwise; namely, the connections: 1 4, 1 7, 2 5, 2 8. Here, i j or Xi Xj, implies that there is an edge between i th and j th node. Let en be the average number of such undetected connections for n observations. Table 3 shows the average number of en based on 100 replications for n = 200, 300, 500 for {τ} = {0.3}, {τ} = {0.5} and {τ} = {0.9} for quantile based MCMC. Higher quantile is able to detect these connections and has smaller average en. Also, en decreases with n, as with large n small signal is more likely to be detected. We use standardized version of the observations, t = 1 and N(0, 1) for the prior for the coeﬃcients for MCMC and we use 9000 samples with 5000 burn ins for this particular simulation setting. Ex 3.A.ii. We construct variables X1, . . . , X16 in the following hierarchical manner using moving average type covariance structure. For x1i, x2i, . . . , x15,i, the i th observation for X1, . . . , X15, we assume the following hierarchical model: x1,i/ri, . . . , x10,i/ri MV N(0, Σ), with 0 is a vector of zeros and Σkl = .7|k l| and 1 ri Gamma(3, 3) and x11,i . . . , x15,i are independent normal variable with mean zero and variance 1, and X16,i = log(ri). We have n = 350 and the data is standardized. Hence, the network has connections Xi Xj; |i j| < 2, i, j 10 and Xi X16, i 10. Using τ = {0.2}, {0.5}, {0.8}, the QVB ﬁtted networks are given in Figure 8. The connections Xi X16, i 10 are not detected for τ = 0.5, whereas most of them are detectable for two extreme quantiles.

5.3.2. Example 3 B. Granularity of quantile grid

If we make the quantile grids denser, then we will have neighborhood selected for each of the quantiles and the neighborhood selected would be the union of those neighborhood. But if we use more and more quantiles the FDR stabilizes, as it is implied by Theorem 3.4 and the following Remark 4, where we can have the ratio of posterior probability of any wrong alternate model with respect to true model, going to zero uniformly over all quantiles, with high probability. The following examples demonstrate this robustness of quantile-grid selection using the variational Bayes method.

Guha, Baladandayuthapani, Mallick

Figure 9: Fit for diﬀerent quantile grids. Example 3(B)

Table 3: A comparison between diﬀerent quantiles for the extreme value dependence case for Example 3.A using MCMC

τ en, n = 200 en, n = 300 en, n = 500 {τ} = {.3} 3.88 3.60 2.82 {τ} = {.5} 2.87 2.31 1.22 {τ} = {.9} 1.65 0.91 0.29

We consider the set up similar to Example 1(A) with n = 400, with quantile grids of width 0.1,.05 and .025, that are {0.2, . . . , 0.8}, {0.2, 0.25, . . . , 0.75, 0.8} and {0.2, 0.225, . . . , 0.775, 0.8}. We have 30 nodes With X1, . . . , X10 generated similar to Ex 1(A), X11 . . . , X20 follows multivariate normal MV N(0, Σ) with Σi,j = .7|i j|, and X21, . . . , X30 follows independent normal with mean zero and variance one. A typical QVB ﬁt for diﬀerent quantile set up is given in Figure 9, where τ = 0.5 captures all but one edge, and τ = {0.3, 0.5, 0.7}, τ = {0.2, 0.25, .., 0.75, 0.8}, gives the correct graph. The FDR s are given in Table 4. Let G1 be the subgraph based on X1, . . . , X9, and G2 is the subgraph based on X11, . . . , X20,which is disjoint from G1. Here e1 is the average number of undetected edges in G1, e2 in G2 and e3 be the average number of connectors detected between them. We can see that the FDR and e1, e2, e3, stabilize even when we increase the number of grids.

5.3.3. Computational gain due to QVB

In all the cases the variational approximation based algorithm performs well to detect the true graphs. Moreover, QVB is many times faster than the MCMC. We use 40 iterations

Table 4: Eﬀect of granularity of quantile grid

Quantile FDR e1 e2 e3 QV B({τ} = {0.5}) 0.37 0.32 0 0 QV B({τ} = {0.3, .0.5, 0.7}) 0.60 0.16 0 0.01 QV B({τ} = {.2, .3, . . . , .7, .8}) 0.77 0.13 0 0.03 QV B({τ} = {.2, .25, .3, . . . , .7, .75, .8}) 0.82 0.13 0 0.03 QV B({τ} = {.2, .225, . . . , .775, .8}) 0.86 0.13 0 0.03

Quantile Graphical Models: a Bayesian Approach

for QVB but in all the examples considered, the convergence happens within 20 iterations. Using 5000 samples for each node, and 5000 burn ins, the MCMC runtime is nearly 100 times or more of that of the QVB. For example for P = 20, 60 and n = 400 in the set up for example 1, QVB was found to be 170 and 134 times faster over a typical run using one quantile grid. Also, computational cost scales linearly with the number of quantile grids. Our computation is parallelizable over nodes and the grids of quantiles, though we do not implement it here.

6. Protein network

The Cancer Genome Atlas (TCGA) is a source of molecular proﬁles for many diﬀerent tumor types. Functional protein analysis by reverse-phase protein arrays (RPPA) is included in TCGA and looking at the proteomic characterization the signaling network can be established. Proteomic data generated by RPPA across > 8000 patient tumors obtained from TCGA includes many diﬀerent cancer types. We consider lung squamous cell carcinoma (LUSC) data set. The data set considered, has n = 121 observations with P = 174 high-quality antibodies. The antibodies encompass major functional and signaling pathways relevant to human cancer and a relevant network gives us their interconnection subject to LUSC. A comprehensive analysis of similar network can be found in Akbani et al. (2014) for various cancers, where the EGFR family along with MAPK and MEK lineage was found to be dominant determinant of signaling, where for LUSC it was mainly EGFR. We use our quantile based variational approach with {τl} = {.1, .2, .3, . . . , .7, .8, .9} and a normal prior on β (independent N(0, 1)). Overall, the QVB graph is robust to this prior variance(v) selection in the range v = [1, 100] with very few of the edges/weaker connections may be missing for a relatively higher variance. The data is standardized and we use t = 1. The graphical LASSO method cannot select a sparse (using huge.select) network both using criterion MB or GLASSO and using criterions for tuning parameter selection. Choosing the penalization by direct cross validation in GLASSO in Akbani et al. (2014), the network has been generated and it reports the important connections. The network and the connection tables with variable index can be found in Figure 10 and Table 5. The type of the connection (positive/negative) is also provided. We can say that one variable eﬀects other variable positively (negatively), conclusively, if the coeﬃcients in the corresponding quantile regression is greater (less) than zero for at least one quantile, and greater (less) than equal to zero for other quantiles. A network of protein was established for diﬀerent cancer types in Akbani et al. (2014), where important connections were established. We compare our network for the LUSC network from Akbani et al. where we ﬁnd some of the known established connections are detected and also some connection not mentioned in Akbani et al. have been detected. Though we refrain from making any inferential claim about the new connections, some further study may be helpful for possibly new biological insight. In our ﬁtted network, the strong EGFR/HER2 connections are detected as seen in Akbani et al. (2014). The connection between EGFRp Y 1068 and HER2p Y 1248 is detected which are known to cross react. The connection between E.Cadherin and alph/beta.Catenin is detected as expected. Unlike Akbani et al., p Akt and Pras40 are found to be connected

Guha, Baladandayuthapani, Mallick

Aktpt308 Aktps473

Beta.cantein

ccnd1.cyclin

srcp Y416 src_p Y527

GSK3alpha_betaps21

Fibrocentin

PRAS40_p T246

chk2.M Tuberin

AMPK_alpha CDK1

EIF4EBP1_PT37

EIF4EBP1_PS65

alpha_cantenin

HER2_PY1248

EGFR_PY1068

BCL2l11.Bim

MAPK_pt202_Y204

MEK1_PS217_S221

pkc_alpha_ps657

EGFR Ei F4BP1

Figure 10: Active proteins and connections for the LUSC data set.

in LUSC. This connection was reported for few other cancer types. Also, MEK is active and connected to PMAPK. EGFR is known to be active in lung cancer and mutation of MAPK , MEK are known to be present for various cancers (see Yatabe et al. (2008), Hilger et al. (2002)). Few connections, such as the new negative connection between p85 and claudin7, mentioned in Akbani et al. (2014) are not detected in this current set up. We have detected some new connections not given in Akbani et al. for LUSC data set, such as between SNAI2 and PARP1. Here, SNAI2 is a DNA-transcriptional repressor and PARP1 modulates transcription. Proteins, casp8 and ERCC1 are found to be connected with MET, which are not given in AKbani et al. casp8 performs protein metabolism and ERCC1 is related to structure speciﬁc DNA repairing and known to be important in lung cancer treatment (Ryu et al. (2014)). They both are connected to growth factor receptor MET. The detected connections between KRAS and smad4, Y WHAE and KRAS are not given in the network from Akbani et al. for LUSC data set and need further study.

Quantile Graphical Models: a Bayesian Approach

Table 5: Connections and corresponding nodes in LUSC data set

Proteins Sign MSH6 MSH2 + Akt PT308 Akt PS473 + ACC1 ACC + beta.cantenin E.Cadherin + SNAI2 PARP1 + CCND1.Cyclin CD20 + Srcp Y 416 Src + GSK3.alpha.beta.p S21 GSk3p S9 + Pai1 Fibrocentin + PRAS40 p T246 Akt p S473 + Chk2 p T68 chk2.M + Tuberin STAT5.alpha + Y AP PS127 Y AP + AMPK alpha CDK1 + AMPK PT172 AMPK alph + EIF4EBP1 p T37 EIF4EBP1 + EIF4EBP1 p T37 EIF4EBP1 PS65 + alpha.Catenin E.Cadherin + c Kit GAB2 + HER2 p Y 1248 EGFR PY 1068 + HER2 p Y 1248 Src PY 416 + MET SNAI2 + MET CASP8 + ERCC1 MET + BCL2 Bim + KRAS Y WHAE + KRAS Smad4 + MAPK p T202 Y 204 MEK1 p S217 S221 + PKC.alpha p S657 PKC.alpha + EGFR EGFRp Y 1068 + Rab25 SETD2 + N.Cadherin BCL2 + N.Cadherin MRE11 +

Guha, Baladandayuthapani, Mallick

7. Discussion

The proposed approach oﬀers a robust, non-Gaussian model as well as easily implementable algorithms for sparse graphical models. Even with large values of P and relatively smaller value of n, it is possible to detect underlying connections as shown in example 1 and in the analysis of the LUSC data set. In the protein network construction, we are able to establish the known signaling network with some newly discovered connections, which need to be validated. In this development, we prove the density estimation and neighborhood selection consistency and posterior concentration rate under both the true model and the misspeciﬁed model. Under misspeciﬁed model the posterior concentration occurs around the minimum KL distance point from the true density and the set of proposed densities. From simulation examples where we do not assume any density structure in the data generating model, the proposed method performs well. In future, we will further investigate the model selection properties for each node and related convergence rate.

Acknowledgment

Research reported in this publication was partially supported by National Cancer Institute of the National Institutes of Health under award number R01CA194391, NSF grants numbers NSF CCF-1934904, NSF IIS-1741173.

Quantile Graphical Models: a Bayesian Approach

8. Appendix

Proof of the theoretical results

Proof of Theorem 3.1:

The sketch of the proof is following. At ﬁrst we construct the KL neighborhood and show that it has suﬃcient prior probability. The sieve is constructed thereon and outside the sieve the prior probability is decreased exponentially. The construction from Jiang (2007), Jiang (2005) can be used as long as an equivalent KL ball around the true density can be constructed under the quantile model. First, we show our calculation for the neighborhood construction of k th node. Let h l = x β ( k) l , h 1,l = x β ( k) γn,l ,h 2,l = x β ( k) γcn,l . Here, coeﬃcient vector with subscript γn denotes the coeﬃcient is set to be zero if the corresponding variable is not in γn. Similarly, it is deﬁned for the subscript γc n. Also, x denotes a generic row of X = X k.

In model γn, let H be the set of β( k) l s such that β( k) j,l (β j,l ( k) η ϵ2 n rn ) for j = k γn,

where |γn| = rn, such that l k(rn) is minimized. Let hl = x β( k) γ,l then for β( k) l in H, we have

L(f) = |logf(xk, h l ) f(xk, hl) | = |log f(xk, h )

f(xk, h 1,l)

f(xk, h 1,l)

f(xk, hl) |

t l k(rn) X

j / γn |xj| + X

j γn tη ϵ2 n rn |xj|,

where f(xk, h) τ(1 τ)exp( tρτ(xk h)). This step follows from the Lemma 1 of Sriram et al. (2013). Therefore, Z L(f)f (xk| k)f (xi =k)dx t M (pn l k(rn) + ηϵ2 n) < ϵ2 n

for some appropriately chosen η (by A7). Hence, H lies in the ϵ2 n KL neighborhood. For normal prior on the coeﬃcient, Π(H) exp( cnϵ2 n) and Π(γ = γn) > exp( cnϵ2 n) for any c > 0 for large n, similar to Jiang (2007). Therefore, they provide suﬃcient prior mass on small KL neighborhood around the true density. Let Pn be the set such that regression coeﬃcients lies in [ Cn, Cn] and rn is the maximum model size. For δ = η ϵ2 n rn covering each of the coeﬃcients by δ radius l balls, in those balls we have Hellinger distance less than ϵ2 n. Hence, we have the total Hellinger covering number of Pn as N(ϵn) P rn r=0 pr n(2Cn

2δ + 1)r ( rn + 1)(pn(Cn

δ + 1)) rn (see Jiang; 2005, 2007). This step follows as d2(p, q) KL(p, q), where d is the Hellinger metric deﬁned in section 3 and KL is the Kullback-Leibler distance. Note that, log(N(ϵn)) = O( rn(log(Cn)+ log(pn) + log(1/ϵ2 n))). We have from A1, A2, using Cn as a power (greater than one) of n, from Jiang (2005), Jiang (2007), or any K1 > 0, for large n

log(N(ϵn), Pn) nϵ2 n,

Π( P c n) e K1nϵ2 n.

Guha, Baladandayuthapani, Mallick

Therefore, Theorem 3.1 follows from veriﬁcation of conditions for Theorems 5, 6 and Proof of Theorem 3, from Jiang (2005) or Proposition 1 from Jiang (2007). From Proposition 1 part (i) from Jiang (2007), P [Π(hk > ϵn|Dn) > e c 1nϵ2 n] 0 for some c 1 > 0.

Beta-Binomial Prior

We have shown the result for Ij,l Bernouli(πn) with rn = pnπn, rn satisfying A1 A7. We use the same rn for Beta-Binomial prior calculation. For γn with model size rn, constructed as in proof of Theorem 3.1, we show that the prior mass condition holds. For Beta-Binomial prior on Ij,l, we have for a1 = b1 = 1,

Π(γ = γn) (pn + 1) pn rn

From A1, Π(γ = γn) > (pn + 1) (rn+1) > exp( cnϵ2 n) for any c > 0 for large n. Therefore, the condition on prior mass holds. Hence, from the earlier proof, Theorem 3.1 follows.

Proof of Theorem 3.2

We prove this part for ﬁxed tl = t, and without loss of generality t is assumed to be 1. For Theorem 3.2 and 3.3 we ﬁrst prove under the assumption of bounded covariate with |Xk| < M for a simpliﬁed proof. Later, we relax the condition to accommodate sub exponential tail bound. For the case where covariates are not bounded, we assume M large enough such that E|XK| < M for all k. To show the concentration of fl,k, k under Πl( ), around f l,k, k the closest point in conditional quantile based likelihood for τl, we drop the

suﬃx l in fl,k, k( ), β( k) l , δ k,l and Πl( ) for convenience and show for one general quantile. We have

Π(KL(f0 k, kf0 k, fk, k(β( k))f0 k) > δ + δ k|.) =

Kc δ fn k, k(β( k) γ )Π(β( k), γ)d(β( k), γ) R fn k, k(β( k) γ )Π(β( k), γ)d(β( k), γ)

Kc δ fn k, k(β( k) γ )Π(β( k), γ)d(β( k), γ) R

vδ ,γ0 fn k, k(β( k) γ )Π(β( k), γ)dβ( k)

= Nn Dn . (11)

Here, γ0 the vector of 0 and 1 corresponding to the true active set (i.e present in the model) of covariates for k th node for the model with the KL distance δ k, and let |γ0| = M 0, the cardinality of the active set and vδ ,γ0 be the set with γ0 active and where

β( k) γ0 ˆβ

γ0 < δ . Here, Kc δ denotes the set of densities where the KL distance from

the f0 = f0 k, kf0 k is more than δ + δ k. On Kc δ, Ef0(log f k, k fk, k ) = KL(f0 k, kf0 k, fk, kf0 k)

KL(f0 k, kf0 k, f k, kf0 k) > δ. Also, n in the density fn k, k denotes the likelihood based on n observations. We divide Nn and Dn by fn k, k which is likelihood based on n observations under this minimum KL distance model at k th node for τl.

Quantile Graphical Models: a Bayesian Approach

Under Beta-Binomial prior Π(γ = γ0) > (pn + 1) pn M 0 1 > e M0log(pn+1). Note that if the diﬀerence between coeﬃcient vectors is δ in supremum norm, then the diﬀerence between corresponding log likelihood is at most MM0δ , by B1 and Lemma 1(b) from Sriram et al. (2013), if we assume M > 1 without loss of generality. Therefore, for the denominator, we have encδ Dn fn k, k > e M0log(pn+1)Π(vδ ,γ0) en(c M(M0+1))δ >

1 if c > M(M0 + 1), for large n. We split the numerator in two parts. First part contains the part where each of the

entry of β( k) lies in [ Kc mβ, Kc + mβ] a compact set with mβ = ˆβ

( k) and Kc > 0.

We denote the set by S and its compliment by Sc. Also, let ms β = supk ˆβ

( k) . For notational convenience, we will drop the index k from the coeﬃcient.

Calculation on S:

For any of the at most cn 2M0 pn 1 M0 many covariate combinations ( a conservative bound) for the kth node, we show the part in the S decreases to zero exponentially fast. Note that pn 1 M0 < p M0 n . For any covariate combination, we break the M0 dimensional model space in (MM0) 1δ width M0 dimensional grids along the axes corresponding to coeﬃcients and look at the induced grid points in S. Let, Jn(δ ) be the number of grid points and for density fk, k associated with each nodal

point of the (MM0) 1δ width grids, we have fn k, k fn k, k e .5nδ for large n with probability

one, as n 1log fn k, k fn k, k Ef0(log fk, k

f k, k ) < δ.

Also, over all possible covariate combinations at some grid point on the set where the KL distance is more than δ+δ k: P( fn k, k fn k, k > e .5nδ, inﬁnitely often (i.o) over all nodes and models)

= P(n 1log fn k, k fn k, k > .5δ, i.o over all nodes and models) P(n 1log fn k, k fn k, k Ef0(log fk, k

.5δ, i.o over all nodes and models) Jn(δ )2M0limn P n=n pnp M0 n e c nδ2

t2s 0, by Hoeﬀding inequality and Borel-Cantelli lemma using B4. Here, ts = M0(M + 1)(Kc + ms β)

and c > 0 is a generic constant and |log fk, k

f k, k | < 2ts.

For any point β = β( k) and its nearest grid point βgrid, we have fn k, k(β) fn k, k(βgrid) enδ (an

application of Lemma 1(b), Sriram et al.,2013). If the nearest grid point is in the set where

KL distance from the truth is more than δ we follow the earlier argument fn k, k fn k, k < e .5nδ

for large n uniformly over nodes and models, with probability one. Note that if the nearest grid point not in the set KL(f0 k, kf0 k, fk, k(β( k))f0 k) > δ + δ k, then at the nearest grid point KL(f0 k, kf0 k, fk, k(β( k))f0 k) > .75δ + δ k for δ < .25δ from Lemma 1(b), Sriram et al.,(2013) by generalizing for multiple covariates and taking expectation. From a similar

Hoeﬀding inequality argument fn k, k fn k, k < e .5nδ for large n uniformly over covariate choices and nodes for such grid points, with probability one. Hence, choosing δ less than .25δ, for large n we have for all the combinations of γ on S, pr.,n = P γ R

S,γ fn k, k fn k, k Π(.) e M0logpn2M0e nd1δ where d1 > 0 is a constant depending

upon δ . Also, logpn n. Therefore, pr.,n < e .5nd1δ almost surely.

Guha, Baladandayuthapani, Mallick

Calculation on Sc:

Next, we look at Sc. Let, cτ = minl{τl, 1 τl}. On Sc, at least one βi,l is outside [ Kc mβ, Kc + mβ]. Without loss of generality we assume βi,l ˆβi,l s have same sign as β0,l ˆβ0,l (otherwise we change X i = Xi and work with the reﬂected variable). Without loss of generality, we denote the covariate X i, i = k encompassing both reﬂected and non reﬂected scenarios. As there are only ﬁnitely many orderings, it is suﬃcient to consider only one such case and prove in that case. Furthermore without loss of generality, the variables are assumed to be centered. First, we consider the case when β0,l > ˆβ0,l. The case β0,l < ˆβ0,l follows identically. For notational convenience, we show our calculation for τl, y = Xk and the covariates X 1, X 2 for a model of size 2 where k = 1, 2. For general case, it follows similarly. Let bi,l = β0,l ˆβ0,l + (β1,l ˆβ1,l)x 1,i + (β2,l ˆβ2,l)x 2,i. Note that if x 1,i, x 2,i > ϵ > 0, then bi,l > 0. Let

δi,l = ρτl(yi β0,l β1,lx 1,i β2,lx 2,i) ρτl(yi ˆβ0,l ˆβ1,lx 1,i ˆβ2,lx 2,i).

Then from Lemma 1 and Lemma 5 from Sriram et al. (2013)

δi,l cτϵKc Ix 1,i>ϵ,x 2,i>ϵ 2|yi ˆβ0,l ˆβ1,lx 1,i ˆβ2,lx 2,i|.

Here I( ) is the indicator function. Let, Ai ϵ = {x 1,i > ϵ, x 2,i > ϵ} and Bi ϵ = {x 1,i < ϵ, x 2,i < ϵ}. The previous step follows from the proof of the Lemma 1 in Sriram et al. (2013) by writing down the loss function explicitly and from the fact that bi,l = µ(2) i,l µ(1) i,l > 0 on

Ai ϵ, where µ(1) i,l = ˆβ0,l + ˆβ1,lx 1,i + ˆβ2,lx 2,i and µ(2) i,l = β0,l + β1,lx 1,i + β2,lx 2,i. Considering

the ordering of yi, µ(1) i,l , µ(2) i,l , such as yi µ(1) i,l µ(2) i,l ; µ(1) i,l yi µ(2) i,l and so on, the above claim can be veriﬁed. Let min{E(IAiϵ), E(IBiϵ)} = aϵ > 0 (by B3, choosing appropriate ϵ > 0) and ri = |yi ˆβ0,l ˆβ1,lx 1,i ˆβ2,lx 2,i| and E(ri) ϵ , over all nodes and all possible model combination of size at most M0 for some constant ϵ > 0, at each node (follows from uniformly bounded

( k) and bounded E|Xk| s). Without loss of generality, the constant aϵ > 0 is the minimum of expectations of IAiϵ, IBiϵ where Ai ϵ, Bi ϵ are constructed similarly for models of size less than M0 over all nodes (i.e. sets of the type {x 1,i > ϵ, x 2,i > ϵ, x j,i > ϵ, } etc. for X1, X2, Xj, in the model where k = 1, 2, j, ..).

Establishing bound on the average of the indicators and ri

By Hoeﬀding bound P(Pn i=1 n 1(IAiϵ) < aϵ

2 ) < e 2n a2ϵ

4 and similar bound holds for Bi ϵ. Similarly, P(n 1 Pn i=1(ri) > 2ϵ ) < e c2n(ϵ )2 for some constant c2 > 0 as Xk s are bounded.

Hence by Borel-Cantelli lemma, the probability n 1 Pn i=1(IAiϵ)

n 1 Pn i=1 ri < aϵ 4ϵ inﬁnitely often is less than equal to

n=N 2M0p M0 n (e c2n(ϵ )2 + e 2n a2ϵ

Quantile Graphical Models: a Bayesian Approach

Therefore for all the possible at most M0 neighbors, we have, n 1 Pn i=1(IAiϵ)

n 1 Pn i=1 ri > d0 > 0 for some d0 for all but ﬁnitely many cases, with probability 1. The calculation holds for each quantile.

Also, lim P n=N pnp M0 n (e c2n(ϵ )2 + e 2n a2ϵ

4 ) 0 similarly. Therefore, this result holds over the union over all the vertices /nodes of the graph, over all possible model combination of maximum size M0 1.

Hence choosing Kc large enough, on Sc we have logfn k, k() logfn k, k = P i δi,l nu0, where u0 > 0, for large n, almost surely, over all nodes and model of maximum size M0 1.

Combining the Parts

Choosing δ < min{u0,.5d1δ}

2M(M0+1) , from (11) LHS goes to zero almost surely, as encδ Nn fn k, k 0, by

choosing c = 1.5M(M0 + 1).

Proof of Theorem 3.3

This proof follows similar construction of S and Sc from the previous proof of Theorem 3.2. Here we show for bounded Xk s ﬁrst.

For any of the at most cn 2M0 pn 1 M0 many covariate combinations for the kth node, we show the part in the S decreases to zero exponentially fast. We break the M0 dimensional model space in (MM0) 1δ width M0 dimensional squares.

Let, Jn(δ )be the number of grids and for each nodal point we show fn k, k fn k, k e 2nϵ2 n

almost surely. This step follows from the following application of Hoeﬀding inequality. Note that, Jn(δ ) = O(( 1

Showing fn k, k fn k, k e nϵ2 n for large n on S

Let, tm = M0(M + 1)(Kc + mβ), ts = M0(M + 1)(Kc + ms β). We have E(n 1log( fn k, k fn k, k )) <

Then, P(n 1log( fn k, k fn k, k ) > 2ϵ2 n, for some grid points in S) < Jn(δ )e 2n ϵ4n

4t2m . Here,

f k, k | < 2tm.

Choosing δ = ϵ2 n , we have P( fn k, k fn k, k > e 2nϵ2 n inﬁnitely often for some grid points in S)

lim N P n=N Jn(δ )e 2n ϵ4n

4t2m . Now from B6 and B7 we get P Jn(δ )e 2n ϵ4n

Therefore using Borel-Cantelli lemma, fn k, k fn k, k e 2nϵ2 n almost surely for large n for grid points in S.

Guha, Baladandayuthapani, Mallick

Moreover, P n pnp M0 n Jn(δ )e 2n ϵ4n

4t2s < (by B7). Therefore, this almost surely convergence happens over all possible covariate combinations and over all pn vertices/nodes of the graph.

For any point β and its nearest grid point βgrid ,we have fn k, k(β) fn k, k(βgrid) enδ . Therefore

on S, we have fn k, k(β)

fn k, k e nϵ2 n, almost surely (following the proof of Theorem 3.2).

Combining the parts

Calculation on S:

Choosing, δ = .5 ϵ2 n MM0 , we have logΠ(vδ ,γ0) = O(log 1

ϵ2n ) and encδ Dn f k, k > e M0logpn

Π(vδ ,γ0)en(c MM0)δ > 1 if c > MM0, for large n, from B6, B7. Therefore on S, choosing c > MM0 and .5ϵ2 n < cδ < .75ϵ2 n

Kc δ S fn k, k(βγ)π(β, γ)d(β, γ) R

vδ ,γ0 fn k, k(βγ)π(β, γ)dβ e .25nϵ2 n.

On Sc, the result from Theorem 3.2 holds and logfk, k(.) logf k, k nu0 almost surely for large n. Therefore, with n, pn going to inﬁnity, P(Π(KL(f0 k, kf0 k, fl,k, kf0 k) > δn+δ k for some node |.) goes to zero almost surely.

Relaxing boundedness condition

From B2, using Holder inequality, we have that for any M0 dimensional linear combination of absolute values of Xi s with bounded coeﬃcient (where coeﬃcient of Xi s are bounded by 1), denoting the random variable by generic symbol W:

E(eλ(W E(W))) e.5λ2ν 2 for |λ| < b 1

for some global b , ν < , for all possible such combinations. This is the condition for sub-exponential distribution with parameters (ν , b ) with ν 2 = M0ν2, b = M0b.

Showing for linear combinations

This result follows from the following argument using Holder s inequality for |αi| 1, i, for |λ| < b 1,

E(e|λ PM i=1 αi(|Xi| E(|Xi|))|) E(e|λ| PM i=1 |αi(|Xi| E(|Xi|))|) {e.5λ2M2ν2} 1 M e.5λ2M0ν2.

Then for w1, . . . , wn i.i.d W with mean W, we have P(|W E(W)| > t ) 2e n2t 2

2(n(ν )2+nbt ) (Bernstein-type inequality). Also, we have uniform tail bound on the variables/nodes and their linear combinations, as we have the same for the absolute values.

Quantile Graphical Models: a Bayesian Approach

Showing Theorem 3.3 for sub-exponential tail bound

From the tail bound result for linear combinations

i=1 (|xij1| + + |xijm|) > KM for some j1, . . . , jm {1, . . . , pn}; m M0 1)

p M0 n e c1n (12)

with some c1 > 0, KM = 1.5M0max E(|Xk|), as b < . Hence, n 1 Pn i=1(|xij1| + + |xijm|) KM for all but ﬁnitely many cases, with probability one by Borel-Cantelli lemma, as P p M0 n e c1n < . We can choose δ = .5 KM+1ϵ2 n and for Dn, on vδ ,γ0 we have n 1|logfn k, k logfn k, k| .5 (KM+1)ϵ2 nn 1 P i P j γ0 |xij| .5ϵ2 n as n goes to inﬁnity (using Lemma 1(b), Sriram et al.,

2013). Similarly, for Nn, on S we choose ((KM + 1)) 1δ size grids and the conclusion for bounded case holds. On S the absolute value of the coeﬃcients are bounded by Kmax = Kc + ms β. For linear combination with bounded coeﬃcient, we assumed sub-exponential distribution. Same holds for diﬀerences of such functions with bounded intercept terms, similarly (without loss of generality, we bound the absolute value of coeﬃcients and intercept terms by one, to get the global b , ν in sub exponential formulation, using Holder s inequality). We assume global constants b, ν , in the sub-exponential condition, slightly abusing the earlier notation. Finally, for each of the grid points,

P(n 1(log fn k, k fn k, k ) > 2ϵ2 n) = P(n 1 1 Kmax (logfn k, k logfn k, k) > 2 ϵ2 n Kmax )

< e c2 (Kc+ms β)2 nϵ4 n

for some ﬁxed c2 > 0 . This step follows using the sub-exponential property for the quantile loss functions at nodes and their linear combination, as we have shown it for absolute value the linear combinations of the covariates earlier, as b, ν are global constants, in the subexponential assumption in this case. On Sc the bound on n 1 Pn i=1 ri follows similarly, as the intercept ˆβ0,l and the coeﬃcients are bounded. Hence, the proof of Theorem 3.3 holds under relaxed assumptions. Theorem 3.2 holds similarly.

Proof of Proposition 2.1

The proof follows trivially from model given in Equation (1) and the linearity of conditional quantile function.

Proof of Lemma 1

Follows readily from the fact that under C1, if Xj is not connected to Xk then j is not contained in any N l,k, and if Xj Xk, then from C2, Xj is in some N l,k if we choose small enough quantile grid width.

Guha, Baladandayuthapani, Mallick

Proof of Theorem 3.4

Let, M l,k be the model for τl at k th node induced by N l,k, and M1 k = M l,k be any competing model at node k. Let fn l,k, k,M1(β) = fn l,k, k,βM1 be the likelihood under Equation 4, for n

observations for coeﬃcient β = β( k) l , for some model M1, at node k . Then,

Πn l,k(M1 k, M l,k) cπ s +s M1 n

R fn l,k, k,M1 k(β)π(β)dβ R fn l,k, k,M l,k(β)π(β)dβ = cπ s +s M1 n

R fn l,k, k,M1 k (β)

fn k, k, ˆβM l,k π(β)dβ

R fn l,k, k,M l,k (β)π(β)

fn k, k, ˆβM l,k dβ

= cπ s +s M1 n Nn BF Dn BF . (13)

Here the suﬃx l denote the likelihood used corresponds to τl, c < 2M0 is a constant as πn 0.5 without loss of generality. Let ˆβM l,k or the vector of ˆβ0, ˆβ1, . . . , ˆβm be the true

values of coeﬃcients that minimizes the expected quantile loss and the KL distance with the data generating density. Without loss of generality we can choose them to be ﬁrst m

variables. Let s = m be the size of true model for k th node and s M1 = s M1 k be the size of the competing model. For convenience we drop the l and writing M k instead of M l,k, we write ˆβM k .

For Ωβ ϵ n = {β : βi (ˆβi ϵ n 2/(2KM)); i = 0, . . . , m }, we have n 1|log fn k, k,ˆβ fn k, k,β |

ϵ n 2 with probability one, for β Ωβ ϵ n using the fact that for KM = 1.5M0max E(|Xk|), n 1 Pn i=1(|xij1|+ +|xijm|) KM for large n with probability one (the conclusion following 12; m < M0). For bounded covariate, we can use KM = M0M where |Xj| < M for all j. Note that E|Xj| is bounded by C4. As log(Π(Ωβ ϵ n)) = O(logn), for ϵ n 0, ϵ n 2 n 1+δ1, δ1 > 0 and we have e2nϵ n 2Dn BF > 1 with probability one. Next we consider two cases, M k M1 k and M k M1 k.

The case M k M1 k

Let βn M1 k be the Maximum likelihood estimate of ˆβM1 k, the minimizer of the expected loss

under misspeciﬁed model. Then βn M1 k converges to ˆβM1 k in probability. Consequently, we

show that, n 1log

fn k, k,βn M1 k fn k, k,ˆβM k δ in probability, for some δ > 0.

This step follows form the following argument writing log

fn k, k,βn M1 k fn k, k,ˆβM k = log

fn k, k,βn M1 k fn k, k, ˆβM1 k +

fn k, k, ˆβM1 k fn k, k, ˆβM k . Now, n 1log

fn k, k, ˆβM1 k fn k, k,ˆβM k < δ almost surely and hence, in probability, where

lτ,ˆβM1 k lτ,ˆβM k > 2δ.

Quantile Graphical Models: a Bayesian Approach

Note that |log

fn k, k,βn M1 k fn k, k,ˆβM1 k | nn 1 Pn i=1 Ps M1 k j=0 wj,n|xj,i|, where x0,i = 1, where n .5wj,n =

|βn j,M1 k ˆβj,M1 k|. Therefore |βn j,M1 k ˆβj,M1 k| = op(1), wj,n = Op(1) and n.5n 1|log

fn k, k,βn M1 k fn k, k,ˆβM1 k | =

Op(1) ( from Theorem 3, Angrist et al. (2006) and from the fact, n 1 Pn i=1(|xij1| + +

|xijm|) KM for large n with probability one), and as a result n 1log

fn k, k,βn M1 k fn k, k,ˆβM k δ in

probability. Hence, from equation (13), multiplying numerator and denominator by e2nϵ n 2

Πn l,k(M1 k, M l,k) cπ s +s M1 n e2nϵ n 2e

fn k, k,βn M1 k fn k, k,ˆβM k

Hence, logΠn l,k(M1 k, M l,k) < δ n for any 0 < δ < δ for large n in probability and therefore, Πn l,k(M1 k, M l,k) converges to zero in probability.

The case M k M1 k

Without loss of generality assume that M1 k has ﬁrst s M1 > s variable active. Note that,

fn k, k,βn M1 k fn k, k,ˆβM k | nn 1 Pn i=1 Ps M1 k j=0 wn j |xj,i|, where x0,i = 1, where n .5wj,n = |βn j ˆβj,M |.

Note that ˆβM k = ˆβM1 k by uniqueness of the minimizer of expected quantile loss and condition C1. As wj,n s are Op(1), therefore, n 1 Pn i=1 Ps M1 k j=1 wn j |xj,i| is Op(1). Hence, from equation (14) using B8,

logΠn l,k(M1 k, M l,k) (s M1 s )c0n.5+ϵ + n Op(1) + 2nϵ n 2 + c 0

for generic constants c0 > 0, c 0. Choosing ϵ n < n .25, we have Πn l,k(M1 k, M l,k) goes to zero in probability.

Proof of Remark 4

Let, wj,n(τ) is deﬁned similar to wj,n when we use τ as our quantile. Note that for M k M1 k supτ (ϵ,1 ϵ)|wj,n(τ)| is Op(1), for ϵ > 0 and therefore, supτ Πn τ,k(M1 k, M k) is op(1) from the earlier calculation. This step follows from the conclusion about the process over τ in Theorem 3 of Angrist et al. (2006). Suppose, we have minimizer of the quantile loss at τ, ˆβM k (τ) and ˆβM1 k(τ), under M k and

M1 k, respectively, for the case where M1 k does not contain M k. Then, for any δ > 0, there

exists ϵ1 > 0 such that, |n 1log

fn k, k, ˆβM1 k (τ )

fn k, k,ˆβM k (τ ) n 1log

fn k, k, ˆβM1 k (τ )

fn k, k,ˆβM k (τ ) | < δ/4 for |τ τ | < ϵ1,

Guha, Baladandayuthapani, Mallick

ϵ1 > 0 a small number. This step follows using n 1 P i P j P |xj,i| < KM for large n (shown in the proof of Theorem 3.3) and the continuity of ˆβM k (τ) and ˆβM1 k(τ).

Again, n .5supτ|log

fn k, k,βn M1 k fn k, k,ˆβM1 k | supτn 1 Pn i=1 Ps M1 k j=0 |wj,n||xj,i| = Op(1). Hence, using

ﬁnitely many ϵ1 equi-spaced grid at diﬀerent τ s we have the set Sτ. For τ Sτ we have Πn τ,k(M1 k, M k) goes to zero in probability from the calculation before equation 14, by showing

fn k, k,βn M1 k (τ)

fn k, k,ˆβM k (τ) < δ/2 in probability, for τ s in the set Sτ for some δ > 0. Here, we use

the fact infτ (ϵ,1 ϵ)lτ,ˆβM1 k lτ,ˆβM k > δ for some δ > 0 for a given ϵ > 0. Then, we repeat

the argument for the case M l,k M1 k in Theorem 3.4 proof. As Sτ is a ﬁnite set, this convergence to zero in probability, is uniformly over Sτ. For τ (ϵ, 1 ϵ) Sc τ, from the earlier calculation,

Πn τ,k(M1 k, M k) cπ s +s M1 n e2nϵ n 2e

fn k, k,βn M1 k (τ)

fn k, k,ˆβM k (τ) ]

cπ s +s M1 n e2nϵ n 2e

fn k, k,βn M1 k (τ)

fn k, k,ˆβM1 k (τ) +n 1log

fn k, k, ˆβM1 k (τ )

fn k, k,ˆβM k (τ ) +δ/4]

for |τ τ | < ϵ1, τ Sτ. We have, supτ|n.5n 1log

fn k, k,βn M1 k (τ)

fn k, k,ˆβM1 k (τ) | = Op(1) and n 1log

fn k, k, ˆβM1 k (τ )

fn k, k,ˆβM k (τ ) <

δ/2 in probability. Hence, Πn τ,k(M1 k, M k) e nδ/8 0 in probability uniformly over τ.

Sequential updates for variational formulation

For the formulation in Equation 9, we have

q (βl) exp 1

i (yi x iβγ,l ξ1,lvi,l)2)( t vi,lξ2 2,l ) 1

2β l Sβlβl)

i (yi x i,γβl ξ1,l E(v 1 i,l ) 1)2)E( t vi,lξ2 2,l )) 1

2β l Sβlβl + c0

2E{(Y δ,l Xγβl) Σl(Y δ,l Xγβl) + β l Sβlβl} + c0

2{(β E(Xγ Σl Xγ)β + β l Sβlβl 2β l E(Xγ) Σl Y δ,l} + c1

2{(βl (SE x,γ,l + Sβl) 1XE γ Σl Y δ,l) (SE x,γ + Sβl)

(βl (SE x,γ,l + Sβl) 1XE γ Σl Y δ,l) + c2

where c0, c1 and c2 are free of βl. Therefore, we have the multivariate normal form for βl and hence the result follows.

Quantile Graphical Models: a Bayesian Approach

log(q (πl)) = (a1 +

j=1,j =k E(Ij,l))logπl + (P 1

j=1,j =k E(Ij,l) + b1))log(1 πl) + c

for some constant c free of π. Therefore,

qnew(πl) := Beta(a1 +

j=1,j =k E(Ij,l), P 1

j=1,j =k E(Ij,l) + b1).

For vi,l: From Equation (10)

log q (vi,l) = 1

2{E( (yi x iβγ,l)2

ξ2 2,l )vi,l 1 + E(tl)( ξ2 1,l ξ2 2,l + 2)vi,l} 1

2 log vi,l + c ,

where c is free of vi,j. Note that inverse Gaussian density with parameter µ and λ has the form

f(x, µ, λ) x 3

2 exp( λ(x µ)2

. Equating the coeﬃcients of x and 1

x, i.e vi,l and 1 vi,l , we have λ = λi,l = E(tl)E( (yi x iβγ,l)2

and µ = µi,l = s

2E(tl)+E(tl) ξ2 1,l ξ2 2,l

Indicator function : We have,

log(P(Ij,l = 1)) = E(Ij,l log(πl) + (1 Ij,l) log(1 πl))

i,Ij,l=1 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l ) + c4

= E(log πl 1 πl ) 1

i,Ij,l=1 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l ) + c4

where c4 is a constant and

log(P(Ij,l = 0)) = E(log(1 πl)) 1

i,Ij,l=0 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l ) + c4

i,Ij,l=0 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l ) + c4 .

Guha, Baladandayuthapani, Mallick

log(P(Ij,l = 1)

P(Ij,l = 0)) = E(log πl 1 πl ) 1

i,Ij,l=1 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )

i,Ij,l=0 E(tl)E( (yi x iβγ,l ξ1,lvi,l)2

vi,lξ2 2,l )}.

Here, E(log πl 1 πl ) is computed numerically by a Monte-Carlo estimate.

Rehan Akbani, Patrick Kwok Shing Ng, Henrica MJ Werner, Maria Shahmoradgoli, Fan Zhang, Zhenlin Ju, Wenbin Liu, Ji-Yeon Yang, Kosuke Yoshihara, Jun Li, et al. A pancancer proteomic perspective on the cancer genome atlas. Nature communications, 5(1): 1 15, 2014.

Joshua Angrist, Victor Chernozhukov, and Iv an Fern andez-Val. Quantile regression under misspeciﬁcation, with an application to the us wage structure. Econometrica, 74(2): 539 563, 2006.

Aliye Atay-Kayis and H el ene Massam. A monte carlo method for computing the marginal likelihood in nondecomposable gaussian graphical models. Biometrika, 92(2):317 335, 2005.

John Barnard, Robert Mc Culloch, and Xiao-Li Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, pages 1281 1311, 2000.

Matthew James Beal et al. Variational algorithms for approximate Bayesian inference. university of London London, 2003.

Alexandre Belloni, Mingli Chen, and Victor Chernozhukov. Quantile graphical models: Prediction and conditional independence with applications to ﬁnancial risk management. Technical report, 2016.

Jos e M Bernardo. Expected information as expected utility. the Annals of Statistics, pages 686 690, 1979.

Stephen P Brooks, Paolo Giudici, and Gareth O Roberts. Eﬃcient construction of reversible jump markov chain monte carlo proposal distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):3 39, 2003.

Victor Chernozhukov and Han Hong. An mcmc approach to classical estimation. Journal of Econometrics, 115(2):293 346, 2003.

Arthur P Dempster. Covariance selection. Biometrics, pages 157 175, 1972.

Quantile Graphical Models: a Bayesian Approach

Adrian Dobra, Chris Hans, Beatrix Jones, Joseph R Nevins, Guang Yao, and Mike West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196 212, 2004.

Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, pages 1057 1080, 2011.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008.

Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303 (5659):799 805, 2004.

Edward I George and Robert E Mc Culloch. Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881 889, 1993.

Subhashis Ghosal, Jayanta K Ghosh, Aad W Van Der Vaart, et al. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500 531, 2000.

P Giudici. Learning in graphical gaussian models. Bayesian Statistics, 5:621 628, 1996.

Paolo Giudici and PJ Green. Decomposable graphical gaussian model determination. Biometrika, 86(4):785 801, 1999.

RA Hilger, ME Scheulen, and D Strumberg. The ras-raf-mek-erk pathway in the treatment of cancer. Oncology Research and Treatment, 25(6):511 518, 2002.

Wenxin Jiang. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the ﬁtted densities. Techical Report, Dept. Statistics, Northwestern Univ, 05-02, 2005.

Wenxin Jiang. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the ﬁtted densities. The Annals of Statistics, 35(4):1487 1511, 2007.

Bas JK Kleijn, Aad W van der Vaart, et al. Misspeciﬁcation in inﬁnite-dimensional bayesian statistics. The Annals of Statistics, 34(2):837 877, 2006.

R Koenker and G Bassett Jr. Regression quantiles , econometrica: Journal of the economic society, vol. 46, no. 1, 1978.

Roger Koenker. Quantile regression for longitudinal data. Journal of Multivariate Analysis, 91(1):74 89, 2004.

Samuel Kotz and Saralees Nadarajah. Multivariate t-distributions and their applications. Cambridge University Press, 2004.

Hideo Kozumi and Genya Kobayashi. Gibbs sampling methods for bayesian quantile regression. Journal of statistical computation and simulation, 81(11):1565 1578, 2011.

Guha, Baladandayuthapani, Mallick

Lynn Kuo and Bani Mallick. Variable selection for regression models. Sankhy a: The Indian Journal of Statistics, Series B, pages 65 81, 1998.

Steﬀen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.

Qing Li, Ruibin Xi, Nan Lin, et al. Bayesian regularized quantile regression. Bayesian Analysis, 5(3):533 556, 2010.

John C Liechty, Merrill W Liechty, and Peter M uller. Bayesian correlation estimation. Biometrika, 91(1):1 14, 2004.

Han Liu, Fang Han, Ming Yuan, John Laﬀerty, Larry Wasserman, et al. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293 2326, 2012.

Bani K Mallick, David Gold, and Veerabhadran Baladandayuthapani. Bayesian analysis of gene expression data, volume 131. Wiley Online Library, 2009.

Nicolai Meinshausen and Peter B uhlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436 1462, 2006.

Sarah E Neville, John T Ormerod, and MP Wand. Mean ﬁeld variational bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1): 1113 1151, 2014.

Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735 746, 2009.

Alberto Roverato. Cholesky decomposition of a hyper inverse wishart matrix. Biometrika, 87(1):99 112, 2000.

Jeong Seon Ryu, Azra Memon, and Seul-Ki Lee. Ercc1 and personalized medicine in lung cancer. Annals of translational medicine, 2(4), 2014.

Juliane Sch afer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology, 4(1), 2005.

James G Scott and James O Berger. Bayes and empirical-bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, pages 2587 2619, 2010.

James G Scott and Carlos M Carvalho. Feature-inclusion stochastic search for gaussian graphical models. Journal of Computational and Graphical Statistics, 17(4):790 808, 2008.

Eran Segal, Michael Shapira, Aviv Regev, Dana Pe er, David Botstein, Daphne Koller, and Nir Friedman. Module networks: identifying regulatory modules and their conditionspeciﬁc regulators from gene expression data. Nature genetics, 34(2):166 176, 2003.

Quantile Graphical Models: a Bayesian Approach

Karthik Sriram, RV Ramamoorthi, and Pulak Ghosh. Posterior consistency of bayesian quantile regression based on the misspeciﬁed asymmetric laplace density. Bayesian Analysis, 8(2):479 504, 2013.

Matthew P Wand, John T Ormerod, Simone A Padoan, Rudolf Fr uhwirth, et al. Mean ﬁeld variational bayes for elaborate distributions. Bayesian Analysis, 6(4):847 900, 2011.

Frederick Wong, Christopher K Carter, and Robert Kohn. Eﬃcient estimation of covariance selection models. Biometrika, 90(4):809 830, 2003.

Yunwen Yang, Huixia Judy Wang, and Xuming He. Posterior inference in bayesian quantile regression with asymmetric laplace likelihood. International Statistical Review, 84(3): 327 344, 2016.

Yasushi Yatabe, Takashi Takahashi, and Tetsuya Mitsudomi. Epidermal growth factor receptor gene ampliﬁcation is acquired in association with tumor progression of egfrmutated lung cancer. Cancer research, 68(7):2106 2111, 2008.

Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19 35, 2007.

Tuo Zhao, Han Liu, Kathryn Roeder, John Laﬀerty, and Larry Wasserman. The huge package for high-dimensional undirected graph estimation in r. Journal of Machine Learning Research, 13(Apr):1059 1062, 2012.