Journal of Machine Learning Research 17 (2016) 1-36    Submitted 4/14; Revised 5/15; Published 9/16

Stable Graphical Models

Navodit Misra (navoditmisra@gmail.com)
Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany

Ercan E. Kuruoglu (ercan.kuruoglu@isti.cnr.it)
ISTI-CNR, Via G. Moruzzi 1, 56124 Pisa, Italy
and Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany

Editor: Jeff Bilmes

Abstract

Stable random variables are motivated by the central limit theorem for densities with (potentially) unbounded variance and can be thought of as natural generalizations of the Gaussian distribution to skewed and heavy-tailed phenomena. In this paper, we introduce α-stable graphical (α-SG) models, a class of multivariate stable densities that can also be represented as Bayesian networks whose edges encode linear dependencies between random variables. One major hurdle to the extensive use of stable distributions is the lack of a closed-form analytical expression for their densities. This makes penalized maximum-likelihood based learning computationally demanding. We establish theoretically that the Bayesian information criterion (BIC) can asymptotically be reduced to the computationally more tractable minimum dispersion criterion (MDC) and develop StabLe, a structure learning algorithm based on MDC. We use simulated datasets for five benchmark network topologies to empirically demonstrate how StabLe improves upon ordinary least squares (OLS) regression. We also apply StabLe to microarray gene expression data for lymphoblastoid cells from 727 individuals belonging to eight global population groups. We establish that StabLe improves test set performance relative to OLS via ten-fold cross-validation. Finally, we develop SGEX, a method for quantifying differential expression of genes between different population groups.

Keywords: Bayesian networks, stable distributions, linear regression, structure learning, gene expression, differential expression

1. Introduction

Stable distributions have found applications in modeling several real-life phenomena (Berger and Mandelbrot, 1963; Mandelbrot, 1963; Nikias and Shao, 1995; Gallardo et al., 2000; Achim et al., 2001) and have robust theoretical justification in the form of the generalized central limit theorem (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Several special instances of multivariate generalization of stable distributions have also been described in the literature (Samorodnitsky and Taqqu, 1994; Nolan and Rajput, 1995). Multivariate stable densities have previously been applied to modeling wavelet coefficients with bivariate α-stable distributions (Achim and Kuruoglu, 2005), inferring parameters for linear models of network flows (Bickson and Guestrin, 2011) and stock market fluctuations (Bonato, 2012). In this paper, we describe α-stable graphical (α-SG) models, a new class of multivariate stable densities that can be represented as directed acyclic graphs (DAG) with arbitrary network topologies. We prove that these multivariate densities also correspond to linear regression-based Bayesian networks and establish a model selection criterion that is asymptotically equivalent to the Bayesian information criterion (BIC).
Using simulated data for five benchmark network topologies, we empirically show how α-SG models improve structure and parameter learning performance for linear regression networks with additive heavy-tailed noise.

One motivation for the present work comes from potential applications to computational biology, especially in genomics, where Bayesian network models of gene expression profiles are a popular tool (Friedman et al., 2000; Ben-Dor et al., 2000; Friedman, 2004). A common approach to network models of gene expression involves learning linear regression-based Gaussian graphical models. However, the distribution of experimental microarray intensities shows a clear skew and may not necessarily be best described by a Gaussian density (Section 3.2). Another aspect of microarray intensities is that they represent the average mRNA concentration in a population of cells. Assuming the number of mRNA transcripts within each cell to be independent and identically distributed, the generalized central limit theorem suggests that the observed shape should asymptotically (for large population size) approach a stable density (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Univariate stable distributions have previously been used to model gene expression data (Salas-Gonzalez et al., 2009a,b) and it is therefore natural to consider multivariate α-stable densities as models for mRNA expression for larger sets of genes. In Section 3.2 we provide empirical evidence to support this reasoning. We further develop α-stable graphical (α-SG) models for quantifying differential expression of genes from microarray data belonging to phase III of the HapMap project (International HapMap 3 Consortium and others, 2010; Montgomery et al., 2010; Stranger et al., 2012).

The rest of the paper is structured as follows: Section 2.1 describes the basic notation and background concepts for Bayesian networks and stable densities. Section 2.2 introduces α-SG models and establishes that these models are Bayesian networks that also represent multivariate stable distributions with discrete spectral measures. Section 2.3 establishes the equivalence of the popular but (in this case) computationally challenging Bayesian information criterion (BIC) for structure learning and the computationally more tractable minimum dispersion criterion (MDC), for all α-SG models that represent symmetric densities. Furthermore, we establish how data samples from any α-SG model can be combined to generate samples from a partner symmetric α-SG model with identical network topology and regression coefficients. Using these theoretical results we design StabLe, an efficient algorithm that combines ordering-based search (OBS) (Teyssier and Koller, 2005) for structure learning with the iteratively re-weighted least squares (IRLS) algorithm (Byrd and Payne, 1979) for learning the regression parameters via least l_p norm estimation. Finally, in Section 3 we implement the structure and parameter learning algorithms on simulated and expression microarray data sets.

2. Stable Graphical Models

In this section we develop the theory and algorithms for learning α-SG models from data. First, we discuss some well-established results for Bayesian networks and α-stable densities.

2.1 Background

We begin with an introduction to Bayesian network models (Pearl, 1988) for the joint probability distribution of a finite set of random variables X = {X_1, ..., X_N}.
A Bayesian network B(G, Θ) is specified by a directed acyclic graph (DAG) G, whose vertices represent random variables in X, and a set of parameters Θ = {θ_i | X_i ∈ X} that determine the conditional probability distribution p(X_i | Pa(X_i), θ_i) for each variable X_i ∈ X given the state of its parents Pa(X_i) ⊆ X \ {X_i} in G (Koller and Friedman, 2009). We will overload the symbols X_j and Pa(X_j) to represent both sets of random variables and their realizations. The directed acyclic graph G implies a factorization of the joint probability density into terms representing each variable X_i and its parents Pa(X_i) (called a family), such that:

$$P_B(X) = \prod_{i=1}^{N} p(X_i \mid \mathrm{Pa}(X_i), \theta_i) \tag{1}$$

The dependence of p(X_i | Pa(X_i), θ_i) on θ_i is usually specified by an appropriately chosen family of parametrized probability densities for the random variables, such as Gaussian or log-Normal. In this paper, we will use multivariate stable densities to model the random variables in X.

The primary motivation for modeling continuous random variables using stable distributions comes from the generalization of the central limit theorem to distributions with unbounded variance (Feller, 1968; Nikias and Shao, 1995). Stable distributions are parametrized to allow varying degrees of impulsiveness and skewness. The generalized central limit theorem requires that sums of stable random variables are stable and, more generally, that in the limit of large N all sums of N independent, identically distributed random variables approach a stable density. A formal definition for stable random variables can be provided in terms of the characteristic function (the Fourier transform of the density function).

Definition 1 A stable random variable X ∼ S_α(β, γ, µ) is defined for each α ∈ (0, 2], β ∈ [−1, 1], γ ∈ (0, ∞) and µ ∈ (−∞, ∞). The probability density f(X|α, β, γ, µ) is implicitly specified by a characteristic function φ(q|α, β, γ, µ):

$$\varphi(q \mid \alpha, \beta, \gamma, \mu) \equiv \mathbb{E}[\exp(\imath q X)] = \int f(X \mid \alpha, \beta, \gamma, \mu) \exp(\imath q X)\, dX = \exp\big( \imath \mu q - \gamma |q|^{\alpha} [1 - \imath \beta\, \mathrm{sign}(q)\, r(q, \alpha)] \big)$$

where,

$$r(q, \alpha) = \begin{cases} \tan\frac{\alpha\pi}{2} & \alpha \neq 1 \\ -\frac{2}{\pi} \log|q| & \alpha = 1 \end{cases}$$

The parameters α, β, γ and µ will be called the characteristic exponent, skew, dispersion and location, respectively. Unfortunately, the density f(X|α, β, γ, µ) does not have a closed-form analytical expression except for the three well-known stable distributions (Figure 1 and Table 1).

| Distribution | S_α(β, γ, µ) | f(X given α, β, γ, µ) | Support |
|---|---|---|---|
| Lévy(γ, µ) | S_0.5(1, γ, µ) | $\frac{\gamma}{\sqrt{2\pi}} \frac{1}{(x-\mu)^{3/2}} \exp\big(-\frac{\gamma^2}{2(x-\mu)}\big)$ | µ < x < ∞ |
| Cauchy(γ, µ) | S_1.0(0, γ, µ) | $\frac{1}{\pi} \frac{\gamma}{\gamma^2 + (x-\mu)^2}$ | −∞ < x < ∞ |
| Normal(µ, σ) | S_2.0(0, γ = σ²/2, µ) | $\frac{1}{2\sqrt{\pi\gamma}} \exp\big(-\frac{(x-\mu)^2}{4\gamma}\big)$ | −∞ < x < ∞ |

Table 1: Closed-form analytical expressions for Lévy, Cauchy and Normal densities and the corresponding α-stable parameters.

Except for the Gaussian case, the asymptotic (large x) behavior of univariate α-stable densities shows Pareto or power law tails (Lévy, 1925). The following lemma formalizes this observation (Samorodnitsky and Taqqu, 1994; Nolan, 2013).

Lemma 1 If X ∼ S_α(β, γ, 0) with 0 < α < 2, then as x → ∞,

$$\Pr(X > x) \to (1 + \beta)\, \gamma\, C_\alpha\, x^{-\alpha}, \qquad C_\alpha = \Big( 2\int_0^{\infty} x^{-\alpha} \sin x \, dx \Big)^{-1} = \frac{1}{\pi}\, \Gamma(\alpha) \sin\frac{\alpha\pi}{2}$$

[Figure 1: The three instances of analytically known univariate α-stable densities S_α(β, γ, µ): Lévy(γ, µ) ≡ S_0.5(1, γ, µ) (solid blue curves), Cauchy(γ, µ) ≡ S_1.0(0, γ, µ) (dashed green curves) and Normal(µ, σ) ≡ S_2.0(0, σ²/2, µ) (dot-dashed red curves).]
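As a numerical sanity check on Table 1 (our addition, not part of the original paper), the sketch below compares the closed-form Cauchy and Normal rows against SciPy's numerically computed stable pdf. It assumes scipy.stats.levy_stable in its default parameterization, in which the scale c relates to the dispersion used here as γ = c^α.

```python
# Sketch: check the Cauchy and Normal rows of Table 1 against SciPy's
# numerically computed stable pdf. Assumes scipy.stats.levy_stable in its
# default parameterization, with scale c such that gamma = c**alpha.
import numpy as np
from scipy.stats import levy_stable

x = np.linspace(-5.0, 5.0, 11)
gamma = 1.0  # dispersion

# Cauchy(gamma, 0) = S_1.0(0, gamma, 0): f(x) = gamma / (pi (gamma^2 + x^2))
cauchy_closed = gamma / (np.pi * (gamma**2 + x**2))
cauchy_num = levy_stable.pdf(x, 1.0, 0.0, loc=0.0, scale=gamma)

# Normal(0, sigma) = S_2.0(0, gamma = sigma^2 / 2, 0)
sigma = np.sqrt(2.0 * gamma)
normal_closed = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
normal_num = levy_stable.pdf(x, 2.0, 0.0, loc=0.0, scale=np.sqrt(gamma))

print(np.max(np.abs(cauchy_closed - cauchy_num)))  # ~0 up to quadrature error
print(np.max(np.abs(normal_closed - normal_num)))
```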
It is straightforward to use the characterization of stable random variables in Definition 1 to verify the following well-known properties (Samorodnitsky and Taqqu, 1994).

Property 1 If X_1 ∼ S_α(β_1, γ_1, µ_1) and X_2 ∼ S_α(β_2, γ_2, µ_2) are independent stable random variables, then Y = X_1 + X_2 ∼ S_α(β, γ, µ), with

$$\beta = \frac{\beta_1\gamma_1 + \beta_2\gamma_2}{\gamma_1 + \gamma_2}, \qquad \gamma = \gamma_1 + \gamma_2, \qquad \mu = \mu_1 + \mu_2$$

Property 2 If X ∼ S_α(β, γ, µ) and c, d ∈ R, then

$$cX + d \sim \begin{cases} S_\alpha\big(\mathrm{sign}(c)\beta,\ |c|^{\alpha}\gamma,\ c\mu + d\big) & \alpha \neq 1 \\ S_\alpha\big(\mathrm{sign}(c)\beta,\ |c|\gamma,\ c\big(\mu - \frac{2\gamma\beta \ln|c|}{\pi}\big) + d\big) & \alpha = 1 \end{cases}$$

A word on the notation used throughout this paper. We will use the symbol $\|Y\|_p = (\sum_\lambda |Y_\lambda|^p)^{1/p}$ to represent the l_p norm of a vector. The l_p norm of a vector representing N realizations of a random variable Z is related to the p-th moment by $E(|Z|^p) = \|Z\|_p^p / N$. For heavy-tailed α-stable densities, one convenient method for parameter estimation is via fractional lower order moments (FLOM) with p < α (Hardin Jr, 1984; Nikias and Shao, 1995). Later, we will discuss FLOM-based parameter learning in greater detail (Section 2.4.1).

2.2 α-Stable Graphical Models

We can now introduce Bayesian network models constructed from stable densities that have compact representations for the characteristic function. Univariate α-stable densities can be generalized to represent multivariate stable distributions, which are defined as follows (Samorodnitsky and Taqqu, 1994).

Definition 2 A d-dimensional multivariate stable distribution over X = {X_1, ..., X_d} is defined by an α ∈ (0, 2], µ ∈ R^d and a spectral measure Λ over the d-dimensional unit sphere S_d, such that the characteristic function is

$$\Phi(q \mid \alpha, \mu, \Lambda) \equiv \mathbb{E}[\exp(\imath q^T X)] = \exp\Big( -\int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda(ds) + \imath \mu^T q \Big)$$

where, $\psi(u \mid \alpha) = |u|^{\alpha}\big(1 - \imath\, \mathrm{sign}(u)\, r(u, \alpha)\big)$.

Definition 3 An α-stable graphical (α-SG) model B(G, Θ) is a probability distribution over X such that

1. $Z_j \equiv X_j - \sum_{X_k \in \mathrm{Pa}(X_j)} w_{jk} X_k \sim S_\alpha(\beta_j, \gamma_j, \mu_j)$, for all $X_j \in X$
2. $Z_j$ is independent of $Z_k$, if $j \neq k$

where Pa(X_j) ⊆ X \ {X_j} are the parent nodes of X_j in the directed acyclic graph G and Θ describes the distribution parameters: w_jk ∈ R, W_j = {w_jk | X_k ∈ Pa(X_j)}, θ_j = {α, β_j, γ_j, µ_j} ∪ W_j and Θ = {θ_i | X_i ∈ X}. A symmetric α-stable graphical (Sα-SG) model is an α-SG model with β_j = 0 for all j.

It is straightforward to see that B(G, Θ) is indeed a Bayesian network. Note also that the variables X_j themselves are stable: each X_j is a linear combination of the independent noise variables Z_k, so its stability follows directly from Properties 1 and 2.

Lemma 2 B(G, Θ) in Definition 3 represents a Bayesian network.

Proof Let d = |X|. First note that every directed acyclic graph can be used to infer an ordering (not necessarily unique) on the variables in X such that all parents of each variable have a lower order than the variable itself. Suppose we index each variable with its order in an ordering compatible with the DAG, such that X_i has order i. The proof rests on the fact that the transformation matrix from {X_i} to {Z_i} for such a graph is lower triangular, with each diagonal entry equal to 1. Since the determinant of a triangular matrix equals the product of its diagonal entries, the Jacobian of the transformation satisfies $\big| \frac{\partial(Z_1, \ldots, Z_d)}{\partial(X_1, \ldots, X_d)} \big| = 1$. Furthermore, since the noise variables Z_j are independent of each other,

$$P_B(Z_1, \ldots, Z_d) = \prod_{j=1}^{d} f(Z_j \mid \alpha, \beta_j, \gamma_j, \mu_j), \qquad p(X_j \mid \mathrm{Pa}(X_j), \theta_j) = f(Z_j \mid \alpha, \beta_j, \gamma_j, \mu_j)$$

$$\Rightarrow\quad P_B(X) = P_B(Z_1, \ldots, Z_d)\, \Big| \frac{\partial(Z_1, \ldots, Z_d)}{\partial(X_1, \ldots, X_d)} \Big| = \prod_{j=1}^{d} p(X_j \mid \mathrm{Pa}(X_j), \theta_j)$$

Hence, B(G, Θ) is a Bayesian network.
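To make Definition 3 concrete, the following sketch (ours) forward-samples a data set from an α-SG model whose nodes are already indexed in a topological order, as in the proof of Lemma 2. It assumes SciPy's levy_stable with the same scale-dispersion convention γ = scale^α as above.

```python
# Sketch: forward-sample from an alpha-SG model (Definition 3). W[j, k] = w_jk
# must be strictly lower triangular, i.e. nodes are topologically ordered.
# Assumes scipy.stats.levy_stable with scale c such that gamma = c**alpha.
import numpy as np
from scipy.stats import levy_stable

def sample_alpha_sg(W, alpha, beta, gamma, mu, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    d = W.shape[0]
    X = np.zeros((n_samples, d))
    for j in range(d):
        Z = levy_stable.rvs(alpha, beta[j], loc=mu[j],
                            scale=gamma[j] ** (1.0 / alpha),
                            size=n_samples, random_state=rng)
        X[:, j] = X[:, :j] @ W[j, :j] + Z   # X_j = sum_k w_jk X_k + Z_j
    return X

# Example: chain X_1 -> X_2 -> X_3 with symmetric Cauchy noise (alpha = 1).
W = np.array([[0.0, 0.0, 0.0],
              [0.7, 0.0, 0.0],
              [0.0, -0.4, 0.0]])
X = sample_alpha_sg(W, alpha=1.0, beta=np.zeros(3),
                    gamma=np.ones(3), mu=np.zeros(3), n_samples=1000)
```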
Before establishing the fact that an α-SG model is a multivariate stable density in the sense of Definition 2, we state the following result (the proof is provided in Appendix A):

Lemma 3 Every d-dimensional distribution with a characteristic function of the form

$$\Phi(q \mid \alpha, \mu, \Lambda) = \prod_{k=1}^{K} \varphi(c_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k), \qquad c_k, q \in \mathbb{R}^d$$

represents a multivariate stable distribution with a discrete spectral measure Λ.

We are now in a position to establish that α-SG models imply a multivariate stable density with a spectral measure concentrated on a finite number of points over the unit sphere.

Lemma 4 Every α-SG model represents a multivariate stable distribution with a discrete spectral measure of the form in Lemma 3.

Proof We will prove the lemma by induction. First, observe that every Bayesian network can be used to assign an ordering (not unique) such that Pa(X_j) ⊆ {X_1, ..., X_{j−1}}. As before, we will use such an ordering to index each random variable in X, such that X_{|X|} has no descendants. The base case of the lemma, where |X| = 1, is clearly true. Assume that the lemma is true for all Bayesian networks with |X| = m − 1. Then for any Bayesian network B with |X| = m random variables,

$$\Phi_B(q) \equiv \mathbb{E}[\exp(\imath q^T X)] = \int \prod_{j=1}^{m} dX_j\, f(Z_j \mid \alpha, \beta_j, \gamma_j, \mu_j) \exp(\imath q_j X_j)$$

$$= \int \Big[ \prod_{j=1}^{m-1} dX_j\, f(Z_j \mid \alpha, \beta_j, \gamma_j, \mu_j) \exp(\imath \tilde{q}_j X_j) \Big] \int dZ_m\, f(Z_m \mid \alpha, \beta_m, \gamma_m, \mu_m) \exp(\imath q_m Z_m) = \Phi_{\tilde{B}}(\tilde{q})\, \varphi(q_m \mid \alpha, \beta_m, \gamma_m, \mu_m)$$

where $\tilde{B}$ is the Bayes net on $\tilde{X} = X \setminus \{X_m\}$, and

$$\tilde{q}_j = q_j + w_{mj}\, q_m\, \big| \mathrm{Pa}(X_m) \cap \{X_j\} \big|, \qquad X_j \in \tilde{X}$$

Since, by the induction hypothesis, $\Phi_{\tilde B}(\tilde q) = \prod_{k=1}^{K} \varphi(s_k^T \tilde{q} \mid \alpha, \beta_k, \gamma_k, \mu_k)$, we get

$$\Phi_B(q) = \Phi_{\tilde B}(\tilde q)\, \varphi(q_m \mid \alpha, \beta_m, \gamma_m, \mu_m) = \prod_{k=1}^{K+1} \varphi(\tilde{s}_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k), \quad \text{where } \tilde{s}_k^T q = \begin{cases} \sum_{j=1}^{m-1} s_{k,j}\big( q_j + w_{mj} q_m |\mathrm{Pa}(X_m) \cap \{X_j\}| \big) & k \leq K \\ q_m & k = K + 1 \end{cases}$$

Therefore, Φ_B(q) represents an m-dimensional multivariate stable distribution with a discrete spectral measure (Lemma 3), and hence, by induction, every α-SG model represents a multivariate stable distribution with a discrete spectral measure of the form in Lemma 3.

2.3 Learning α-SG Models

A popular method for structure learning in Bayesian network models is based on the Bayesian information criterion (BIC), which is also equivalent to the minimum description length (MDL) principle (Schwarz, 1978; Heckerman et al., 2000).

Definition 4 Given a data set D = {D_1, ..., D_N}, the Bayesian information score S_BIC(B|D) for a Bayesian network B(G, Θ) is defined as

$$S_{\mathrm{BIC}}(B \mid D) = \sum_{D_j \in D} \log P_B(D_j) - \sum_{X_i \in X} \frac{|\mathrm{Pa}(X_i)|}{2} \log N$$

The Bayesian information criterion (BIC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs G and parameters Θ.

The major stumbling block in using stable densities is the fact that there is no known closed-form analytical expression for them (apart from the special cases representing the Gaussian, Cauchy and Lévy distributions). This makes BIC based inference computationally demanding due to the likelihood terms P_B(D_λ). One main contribution of this paper is an efficient method for learning the network structure and parameters of α-SG models. The next lemma establishes a new result that is useful in efficiently solving the learning problem.

Lemma 5 Given a data set D_Y = {Y_1, ..., Y_N} generated from a stable random variable Y ∼ S_α(β, γ, µ),

$$\sum_{j=1}^{N} \log f(Y_j \mid \alpha, \beta, \gamma, \mu) = -\frac{N}{\alpha} \log \gamma + h(Y \mid \alpha, \beta)$$

where,

$$\lim_{N \to \infty} \frac{h(Y \mid \alpha, \beta)}{N} = \int d\tilde{Y}\, f(\tilde{Y} \mid \alpha, \beta, 1, 0) \log f(\tilde{Y} \mid \alpha, \beta, 1, 0) = -H\big[ S_\alpha(\beta, 1, 0) \big]$$

and H[·] is the entropy of the corresponding random variable.
Proof Since D_Y consists of samples from a stable distribution, Y ∼ S_α(β, γ, µ) by definition. Performing the change of variable

$$\tilde{Y} = \frac{Y}{\gamma^{1/\alpha}} - \tilde{\mu}, \qquad \tilde{\mu} = \begin{cases} \frac{\mu}{\gamma^{1/\alpha}} & \alpha \neq 1 \\ \frac{\mu}{\gamma} + \frac{2\beta \ln\gamma}{\pi} & \alpha = 1 \end{cases} \tag{2}$$

we get the standard form density Ỹ ∼ S_α(β, 1, 0) using Property 2. Furthermore, samples from the transformed data set D̃_Y = {Ỹ_1, ..., Ỹ_N} are also distributed according to this standard density:

$$f(Y \mid \alpha, \beta, \gamma, \mu) = f(\tilde{Y} \mid \alpha, \beta, 1, 0)\, \frac{d\tilde{Y}}{dY} = f(\tilde{Y} \mid \alpha, \beta, 1, 0)\, \frac{1}{\gamma^{1/\alpha}}$$

This implies that if we know the parameters α, β, γ and µ for the density generating D_Y,

$$\sum_{j=1}^{N} \log f(Y_j \mid \alpha, \beta, \gamma, \mu) = \sum_{j=1}^{N} \Big\{ -\frac{1}{\alpha} \log \gamma + \log f(\tilde{Y}_j \mid \alpha, \beta, 1, 0) \Big\} = -\frac{N}{\alpha} \log \gamma + h(Y \mid \alpha, \beta)$$

where h(Y | α, β) is defined by

$$h(Y \mid \alpha, \beta) \equiv \sum_{j=1}^{N} \log f(\tilde{Y}_j \mid \alpha, \beta, 1, 0) \tag{3}$$

Here Y_j and Ỹ_j are related via Equation 2 for all 1 ≤ j ≤ N. Note that since the transformed variables Ỹ_j are samples from f(Ỹ | α, β, 1, 0), we have the following asymptotic result for large N:

$$\lim_{N \to \infty} \frac{h(Y \mid \alpha, \beta)}{N} = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} \log f(\tilde{Y}_j \mid \alpha, \beta, 1, 0) = \int f(\tilde{Y} \mid \alpha, \beta, 1, 0) \log f(\tilde{Y} \mid \alpha, \beta, 1, 0)\, d\tilde{Y} = -H\big[ S_\alpha(\beta, 1, 0) \big]$$

where H[·] is the entropy of the corresponding random variable.

As things stand, the entropy H[·] of stable random variables in standard form is just as difficult to compute as the original log-likelihood, and the previous lemma has merely transformed one intractable quantity into another. However, there is an important class of models for which we can ignore the entropy term during structure learning: multivariate distributions with the special property that every linear combination of their random variables is distributed as a stable distribution S_α(β, ·, ·) with the same α and β. One scenario where this is true is when the noise terms are symmetric, i.e., β_i = 0 for all X_i ∈ X. This special case is important since we later show (Lemma 8) that every α-SG model can be easily transformed into a partner symmetric α-SG model with identical network topology and regression coefficients. For all practical purposes, learning the structure of symmetric α-SG models is effectively the same as learning the structure of arbitrary α-SG models.

Lemma 6 Given a symmetric α-stable graphical model for the variables in X,

$$Z \equiv w^T X = \sum_{X_j \in X} w_j X_j \sim S_\alpha\big( \beta(w) = 0,\ \gamma(w),\ \mu(w) \big), \quad \forall\, w \in \mathbb{R}^{|X|}, \quad \text{if } \beta_i = 0 \ \forall\, X_i \in X$$

Proof The dispersion γ(w) and skewness β(w) for the projection w^T X of any d-dimensional stable random vector are given by (Samorodnitsky and Taqqu, 1994)

$$\gamma(w) = \int_{S_d} |w^T s|^{\alpha}\, \Lambda(ds), \qquad \beta(w) = \gamma(w)^{-1} \int_{S_d} \mathrm{sign}(w^T s)\, |w^T s|^{\alpha}\, \Lambda(ds)$$

Since X represents a symmetric α-stable graphical model, Lemma 4 and Lemma 3 imply (substituting the expansion of Lemma 3 for the spectral measure in the expression for β(w))

$$\beta(w) = \frac{1}{\gamma(w)} \sum_{k} \frac{\|c_k\|_2^{\alpha}\, \gamma_k}{2} \int_{S_d} \Big\{ \delta\Big(s - \frac{c_k}{\|c_k\|_2}\Big) + \delta\Big(s + \frac{c_k}{\|c_k\|_2}\Big) \Big\}\, |w^T s|^{\alpha}\, \mathrm{sign}(w^T s)\, ds = 0$$

since the contributions from each antipodal pair of atoms cancel. For a recent reference on multiple regression with stable errors, see also Nolan (2013b).

We are now in a position to present the main contribution of this paper: an alternative criterion for model selection that is both computationally efficient and comes with robust theoretical guarantees (Lemma 7). The criterion, called the minimum dispersion criterion (MDC), is a penalized version of a technique previously used in the signal processing literature for designing filters for heavy-tailed noise (Stuck, 1978).

Definition 5 Given a data set D = {D_1, ..., D_N}, the penalized dispersion score S_MDC(B|D) for a Bayesian network B(G, Θ) is defined as

$$S_{\mathrm{MDC}}(B \mid D) = -\sum_{X_i \in X} \Big( \frac{N}{\alpha} \log \gamma_i + \frac{|\mathrm{Pa}(X_i)|}{2} \log N \Big)$$

The minimum dispersion criterion (MDC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs G and parameters Θ.
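For concreteness, once per-node dispersion estimates are available the score of Definition 5 is a two-term sum; a minimal sketch (the helper name is ours, not StabLe's code):

```python
# Sketch: the penalized dispersion score of Definition 5, given estimated
# residual dispersions gamma_i and the parent-set sizes of a candidate DAG.
import numpy as np

def mdc_score(gammas, parent_counts, alpha, n_samples):
    gammas = np.asarray(gammas, dtype=float)
    k = np.asarray(parent_counts, dtype=float)
    return -((n_samples / alpha) * np.sum(np.log(gammas))
             + 0.5 * np.sum(k) * np.log(n_samples))  # higher is better
```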
Lemma 7 Given a data set D = {D_1, ..., D_N} generated by a symmetric α-stable graphical model B*(G*, Θ*), the minimum dispersion criterion is asymptotically equivalent to the Bayesian information criterion over the search space of all symmetric α-stable graphical models.

Proof First consider the contribution to the BIC score from each family (i.e., each random variable and its parents) separately. Let Z_j = X_j − Σ_{X_k ∈ Pa(X_j)} w_jk X_k for an arbitrary set of regression coefficients of a candidate network B(G, Θ). Note that the coefficients W_j = {w_jk | X_k ∈ Pa(X_j)} need not be the true regression coefficients W*_j, and B need not be the true network B*. We will use the notation Z_{j,λ} for the realization of Z_j in sample D_λ ∈ D. Since D consists of samples from a symmetric α-stable graphical model, Lemma 6 implies Z_j ∼ S_α(β = 0, γ_j, µ_j). Therefore, using Lemma 5, the contribution of each family to the BIC score is

$$\mathrm{Fam}(X_j, \mathrm{Pa}(X_j) \mid D) \equiv \sum_{\lambda=1}^{N} \log f(Z_{j,\lambda} \mid \alpha, \beta = 0, \gamma_j, \mu_j) - \frac{|\mathrm{Pa}(X_j)|}{2} \log N = -\frac{N}{\alpha} \log \gamma_j + h(Z_j \mid \alpha, \beta = 0) - \frac{|\mathrm{Pa}(X_j)|}{2} \log N$$

where, as in Equation 3, Z_j and its standardized version Z̃_j are related by the transformation in Equation 2. Summing over families,

$$S_{\mathrm{BIC}}(B \mid D) = \sum_{X_j \in X} \mathrm{Fam}(X_j, \mathrm{Pa}(X_j) \mid D) \quad \Rightarrow \quad \lim_{N \to \infty} \frac{S_{\mathrm{BIC}}(B \mid D)}{N} = \lim_{N \to \infty} \frac{S_{\mathrm{MDC}}(B \mid D)}{N} - |X|\, H\big[ S_\alpha(\beta = 0, 1, 0) \big]$$

Since |X| H[S_α(β = 0, 1, 0)] is independent of the candidate network structure and regression parameters {W_j | X_j ∈ X}, we get, for any pair of networks B and B′,

$$\lim_{N \to \infty} \frac{1}{N} \big[ S_{\mathrm{BIC}}(B \mid D) - S_{\mathrm{BIC}}(B' \mid D) \big] = \lim_{N \to \infty} \frac{1}{N} \big[ S_{\mathrm{MDC}}(B \mid D) - S_{\mathrm{MDC}}(B' \mid D) \big]$$

Therefore, asymptotically, BIC is equivalent to MDC when the data is generated by a symmetric α-SG model.

We now show how samples from any stable graphical model can be combined to yield samples from a partner symmetric stable graphical model with identical parameters and network topology. This transformation was earlier used by Kuruoglu (2001) in order to estimate parameters of skewed univariate stable densities. We should point out that the procedure has the drawback that the symmetrized data set has half the sample size.

Lemma 8 Every α-SG model can be associated with a symmetric α-SG model with identical skeleton (graph structure) and regression parameters.

Proof Given a data set D = {D_1, ..., D_N} representing any α-SG model B(G, Θ), consider a resampled data set D̂ = {D̂_1, ..., D̂_{N_S}} with variable realizations

$$\hat{X}_{i,\lambda} = X_{i,2\lambda} - X_{i,2\lambda-1}, \qquad \lambda \in \{1, \ldots, N_S = \lfloor N/2 \rfloor\}$$

These bootstrapped data samples D̂_λ = {X̂_{i,λ} | X_i ∈ X} represent independent realizations of the random variables X̂ ≡ {X̂_i | X_i ∈ X}. Similarly, we may use the regression parameters W to define resampled noise variables:

$$\hat{Z}_j \equiv \hat{X}_j - \sum_{\hat{X}_k \in \mathrm{Pa}(\hat{X}_j)} w_{jk} \hat{X}_k$$

We now make two observations:

1. If $Z_j = X_j - \sum_{X_k \in \mathrm{Pa}(X_j)} w_{jk} X_k \sim S_\alpha(\beta_j, \gamma_j, \mu_j)$, then using Property 1, $\hat{Z}_j \sim S_\alpha(\beta = 0, 2\gamma_j, 0)$.
2. The transformed noise variables $\hat{Z}_j$ are independent of each other.

But these conditions define an α-SG model (Definition 3). Therefore, by Lemma 2, the resampled data is distributed according to a Bayesian network B̂(G, Θ̂) such that

$$P_{\hat{B}}(\hat{X}) = \prod_{j=1}^{|X|} f(\hat{Z}_j \mid \alpha, 0, 2\gamma_j, 0), \qquad \hat{\theta}_j = \{\alpha, \beta = 0, 2\gamma_j, 0\} \cup W_j, \qquad \hat{\Theta} = \{\hat{\theta}_j \mid X_j \in X\}$$
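In code, the construction of Lemma 8 is a pairwise differencing of the rows of the data matrix; a minimal sketch (ours):

```python
# Sketch: Lemma 8. Turn N i.i.d. samples from an arbitrary alpha-SG model
# into floor(N/2) samples from its partner symmetric model (same DAG and
# weights; each noise dispersion doubles to 2 * gamma_j, with beta = mu = 0).
import numpy as np

def symmetrize(D):
    """D: (N, d) data matrix; returns the (floor(N/2), d) symmetrized matrix."""
    N = D.shape[0] - (D.shape[0] % 2)   # drop a trailing odd sample, if any
    return D[1:N:2, :] - D[0:N:2, :]    # X_hat_lambda = X_{2 lambda} - X_{2 lambda - 1}
```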
The expression for MDC in Definition 5 does not involve the stable pdf, and hence one may wonder how the parameters of the distribution can be estimated. The answer is given by the following property of stable distributions (Kuruoglu, 2001).

Lemma 9 If Z ∼ S_α(0, γ, 0), then

$$E(|Z|^p) = C(p, \alpha)\, \gamma^{p/\alpha}, \qquad -1 < p < \alpha, \qquad C(p, \alpha) = \frac{\Gamma(1 - \frac{p}{\alpha})}{\Gamma(1 - p)\, \cos(\frac{p\pi}{2})}$$

2.4 The StabLe Algorithm

In this section we describe StabLe, an algorithm for learning the structure and parameters of α-SG models (Algorithm 1). The first step of StabLe is to center and symmetrize the entire data matrix D_I in terms of the variables X̂, as described in Lemma 8. This is followed by estimating the global parameter α using the method of log statistics (Kuruoglu, 2001). Finally, structure learning is performed by a modified version of the ordering-based search (OBS) algorithm (Section 2.4.2). The details of the parameter estimation and structure learning algorithms are discussed next.

Algorithm 1 StabLe
  Input: Input data matrix D_I, number of random restarts N_reps
  Output: α-SG model B(G, Θ) over X
  D = Symmetrize(D_I)                 // Symmetrize the data as per Lemma 8
  Estimate α from D                   // Use log statistics, Equation 4
  Initialize B(G, Θ) = ∅
  for i = 1 to N_reps do
      Initialize a random ordering σ
      B_σ(G, Θ) = OBS(D, α, σ)        // Ordering-based search, Algorithm 4
      if S_MDC(B_σ|D) > S_MDC(B|D) then
          B = B_σ
      end if
  end for

2.4.1 Parameter Learning

First, we describe the algorithms StabLe uses to estimate the characteristic exponent α from the data matrix D, as well as the parameters Γ = {γ_j | X_j ∈ X} and W_j = {w_jk | X_k ∈ Pa(X_j)} for any given directed acyclic graph G.

Estimating the global parameter α: Log statistics can be used to estimate the characteristic exponent α from the centered and symmetrized variables in X̂ (Kuruoglu, 2001). Since every linear combination of variables in X̂ has the same α, if we define $\hat{X} = \sum_{i=1}^{|X|} \hat{X}_i$, then

$$L_2 \equiv E\Big[ \big( \log|\hat{X}| - E[\log|\hat{X}|] \big)^2 \Big] = \frac{d^2 \log\Gamma(y)}{dy^2}\Big|_{y=1} \Big( \frac{1}{\alpha^2} + \frac{1}{2} \Big) = \frac{\pi^2}{6} \Big( \frac{1}{\alpha^2} + \frac{1}{2} \Big) \tag{4}$$

so that α can be recovered from the empirical second-order log moment L_2.

Estimating the dispersion γ_j and regression parameters W_j = {w_jk | X_k ∈ Pa(X_j)}: If γ_j(W_j) is the dispersion parameter for the distribution of Z_j = X_j − Σ_{X_k ∈ Pa(X_j)} w_jk X_k, then the minimum dispersion criterion selects the regression parameters

$$W_j^* = \arg\min_{W_j} \frac{1}{\alpha} \log \gamma_j(W_j)$$

Minimum dispersion regression coefficients are estimated using a connection between the l_p norm of a stable random variable and the dispersion parameter γ (Zolotarev, 1957; Kuruoglu, 2001), given above in Lemma 9. The lemma tells us that, up to a constant term log C(p, α), minimizing (1/α) log γ_j is identical to minimizing the l_p norm $\|Z_j\|_p = (\sum_{\lambda=1}^{N} |Z_{j,\lambda}|^p)^{1/p}$ for 0 < p < α:

$$W_j^* = \arg\min_{W_j} \log \|Z_j\|_p = \arg\min_{W_j} \log \Big( \sum_{\lambda=1}^{N} |Z_{j,\lambda}|^p \Big)^{1/p}$$

Algorithm 2 IRLS // Find the least l_p norm regression coefficients
  Input: N-dimensional vector of realizations of the child node Y, N × M matrix X of realizations of the parent set Pa(Y), tolerance ε and p ∈ (0, 2]
  Output: M-dimensional vector of regression coefficients W = arg min_W ||Y − XW||_p
  Initialize W with the OLS coefficients W = (X^T X)^{-1}(X^T Y)
  repeat
      Initialize buffer for current regression coefficients β = W
      Initialize a diagonal N × N matrix Ω from β for weighted least squares regression:
          Ω_ij = δ_ij |Y_i − (XW)_i|^{p−2},  i, j ∈ {1, ..., N}
      Update the regression coefficient vector W = (X^T ΩX)^{-1}(X^T ΩY)
  until ||β − W||_2 < ε // Change in regression coefficients is within tolerance

Minimization of the l_p norm is performed by the iteratively re-weighted least squares (IRLS) algorithm (Byrd and Payne, 1979). Briefly, the IRLS algorithm repeatedly solves an instance of the weighted least squares problem to achieve successive estimates of the least l_p norm coefficients (Algorithm 2).
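A compact NumPy version of Algorithm 2 (a sketch of ours, not the released StabLe code; the floor eps on the residuals guards the weights |r|^(p−2) against division by zero, an implementation detail the pseudocode leaves implicit):

```python
# Sketch of Algorithm 2: IRLS for least l_p norm regression, 0 < p <= 2.
import numpy as np

def irls_lp(X, Y, p, tol=1e-8, max_iter=100, eps=1e-10):
    """Return W approximately minimizing ||Y - X @ W||_p."""
    W = np.linalg.lstsq(X, Y, rcond=None)[0]        # OLS initialization
    for _ in range(max_iter):
        W_prev = W
        r = np.abs(Y - X @ W)
        omega = np.maximum(r, eps) ** (p - 2.0)     # diagonal of Omega
        Xw = X * omega[:, None]                     # Omega @ X, kept implicit
        W = np.linalg.solve(X.T @ Xw, Xw.T @ Y)     # (X'OX)^-1 (X'OY)
        if np.linalg.norm(W - W_prev) < tol:        # change within tolerance
            break
    return W
```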
IRLS is attractive since rigorous convergence guarantees can be given (Daubechies et al., 2010), and the method is easy to implement since several software packages are available for the weighted least squares problem. Even though the IRLS objective is no longer convex for p < 1.0, Daubechies et al. (2010) show that under certain sparsity conditions the algorithm can recover the true solution. Simulations described in Section 3.1 tend to support this observation.

For the experiments described in this manuscript, StabLe used two values of p for l_p norm estimation. For learning regression coefficients during structure learning, IRLS was implemented with p = α/1.01, since lower values tended to give noisier estimates (possibly due to numerical errors). However, we also found that estimating the term log C(p, α) is prone to numerical errors for small values of |α − p|. Therefore, we ignore this constant term during structure learning, since it is common to all candidate structures. StabLe estimates the dispersion parameters γ_j after structure learning, by computing the l_p norm for p sufficiently smaller than α (e.g., α/10 ≤ p ≤ α/2) and applying Lemma 9.

2.4.2 Structure Learning

Searching the space of all network structures can be performed through any of the popular hill-climbing algorithms. In this paper we used the ordering-based search (OBS) algorithm (Teyssier and Koller, 2005) to search for a local optimum in the space of all directed acyclic graphs. The algorithm starts with an initial ordering σ and then learns a DAG consistent with σ (i.e., all parents of each node must have a lower order). This part of structure learning is performed via a subroutine K2Search (Algorithm 3), which is a modified version of the hill-climbing based K2 algorithm of Cooper and Herskovits (1992). K2Search starts with an empty parent set for each node X_i ∈ X and greedily adds edges until the MDC based family score

$$FS(X_i, \mathrm{Pa}(X_i) \mid D, \alpha) = -\frac{N}{\alpha} \log \gamma_i - \frac{|\mathrm{Pa}(X_i)|}{2} \log N$$

reaches a local maximum. The main difference from Gaussian graphical models (Heckerman et al., 2000; Schmidt et al., 2007) is that K2Search scores each family based on the least l_p norm instead of ordinary least squares (OLS). Once K2Search has learned the locally optimal DAG for a given ordering σ, OBS explores other orderings by performing elementary operations (or "twiddles") that swap the order of successive variables, and recomputes the K2Search scores. This process is continued until a local optimum is reached. StabLe also performs a fixed number of random restarts to explore more of the search space. In all experiments reported here we used 10 random restarts. Pseudocode for the methods is given in Algorithms 3 and 4.

Algorithm 3 K2Search
  Input: Symmetrized data matrix D, fixed ordering σ and shape parameter α
  Output: α-SG model B(G, Θ) given the ordering σ
  Initialize B(G, Θ) = ∅
  for i = 2 to |X| do
      // Find the optimal parent set Pa(σ_i) by greedily adding edges, starting from Pa(σ_i) = ∅
      repeat
          Initialize noChange = true
          Initialize best = FS(σ_i, Pa(σ_i)|D, α)
          AddPa = ∅
          for X_j ∈ {σ_1 ... σ_{i−1}} \ Pa(σ_i) do          // Search for a potential parent
              Estimate regression weights W_{σ_i} for parent set Pa(σ_i) ∪ X_j using IRLS
              if FS(σ_i, Pa(σ_i) ∪ X_j|D, α) > best then
                  best = FS(σ_i, Pa(σ_i) ∪ X_j|D, α)        // Update best score and
                  AddPa = X_j                                // possible new parent
                  noChange = false
              end if
          end for
          Pa(σ_i) = Pa(σ_i) ∪ AddPa                          // Add the new parent
      until noChange is true                                 // Repeat until local optimum
  end for

Algorithm 4 OBS // Find the optimal α-SG model using OBS
  Input: Symmetrized data matrix D, shape parameter α, initial ordering σ
  Output: α-SG model B(G, Θ) over X
  Initialize SG model B = K2Search(D, σ, α)
  for i = 1 to |X| − 1 do
      Initialize T_iσ = Twiddle(i, σ)            // New ordering T_iσ by swapping σ_i & σ_{i+1}
      B̃ = K2Search(D, T_iσ, α)                  // Compute the optimum B̃ given T_iσ
      DS(i) = S_MDC(B̃|D) − S_MDC(B|D)           // Set delta score for the twiddle
  end for
  repeat
      Initialize noChange = true
      Find a = arg max_i DS(i)                   // Find the best twiddle T_aσ
      B̃ = K2Search(D, T_aσ, α)                  // Compute the optimum given T_aσ
      if S_MDC(B̃|D) > S_MDC(B|D) then
          σ = T_aσ, B = B̃                       // Accept the swap and update σ, B
          Update DS(a − 1) (if a > 1)            // Update delta scores for neighbors a − 1
          Update DS(a + 1) (if a < |X| − 1)      // and a + 1, if valid
          noChange = false
      end if
  until noChange is true                         // Repeat until local optimum

3. Empirical Validation

In this section we describe two sets of numerical experiments to assess the performance of StabLe. The first set is based on synthetic data representing five benchmark network topologies (Section 3.1). These experiments test the accuracy and robustness of MDC based learning on simulated data sets where the ground truth (structure and parameters) is known. For the second set of experiments, we apply StabLe to a gene expression data set (Section 3.2) from Phase III of the HapMap project (International HapMap 3 Consortium and others, 2010). These samples represent microarray measurements of mRNA expression within lymphoblastoid cells from 727 individuals belonging to eight global population groups (Montgomery et al., 2010; Stranger et al., 2012).

For structure learning, we chose the ordinary least squares (OLS) based, BIC-penalized log-likelihood S_OLS(B|D) for comparison:

$$S_{\mathrm{OLS}}(B \mid D) = -\sum_{X_i \in X} \Big\{ N \log \| Z_i - \bar{Z}_i \|_2 + \frac{|\mathrm{Pa}(X_i)|}{2} \log N \Big\} \tag{5}$$

OLS is commonly used for learning Gaussian graphical models and should be identical to StabLe for α = 2.0 (in that case S_OLS and S_MDC are the same up to a network- and parameter-independent term). This comparison allowed us to assess the effect of heavy-tailed noise (α < 2.0) on learning performance.
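As a reference point for the comparison, Equation 5 is also easy to compute from per-family residuals; a minimal sketch (the helper is ours, under our reading of the reconstructed Equation 5):

```python
# Sketch: the OLS-based score of Equation 5, given the residual vector
# Z_i = X_i - sum_j w_ij X_j for each node under a candidate DAG.
import numpy as np

def ols_score(residuals, parent_counts):
    """residuals: list of length-N arrays, one per node."""
    N = residuals[0].shape[0]
    score = 0.0
    for Z, k in zip(residuals, parent_counts):
        score -= N * np.log(np.linalg.norm(Z - Z.mean()))   # N log ||Z - Zbar||_2
        score -= 0.5 * k * np.log(N)                        # BIC penalty
    return score
```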
3.1 Synthetic Data

We performed numerical experiments based on simulated data sets for five network topologies from the Bayesian network repository.¹ These were (number of nodes, edges within brackets): ALARM (37, 46), BARLEY (48, 84), CHILD (20, 25), INSURANCE (27, 52) and MILDEW (35, 46). The adjacency matrix for each network was downloaded from the supplement to Tsamardinos et al. (2006).²

1. A description of each network is available at http://www.cs.huji.ac.il/site/labs/compbio/Repository/.
2. The supplement can be accessed at http://www.dsl-lab.org/supplements/mmhc_paper/mmhc_index.html.

Each node X_i ∈ X was assigned an additive α-stable noise variable Z_i with the same parameters S_α(β, γ, 0), and each edge was assigned a regression coefficient sampled uniformly at random from [−ρ/2, ρ/2]. The S_α(β, γ, 0) noise variables were simulated using the method of Chambers et al. (1976). For each set of experiments, we simulated 100 data sets, each with 2000 samples from an α-SG model with randomly chosen regression weights but fixed network topology and α-stable noise parameters. The goal was to assess StabLe in terms of its performance at structure learning and at estimating the stable noise parameters for a variety of regression coefficients.

We performed five sets of experiments for each network, corresponding to the values α = 0.8, 1.1, 1.4, 1.7, 2.0. For each set of experiments, we chose ρ = 1.0, β = 0.9 and γ = 1.0. We chose such a high skew (β = 0.9) in the input data to test our algorithm on its ability to symmetrize and correctly learn (possibly) difficult problem instances. Instead of β, however, we report a related parameter θ = arctan(β tan(απ/2)), which can be inferred more robustly in practice since it avoids the singularity near α = 2 (Kuruoglu, 2001). We used the zeroth-order signed-moment method for estimating θ (Kuruoglu, 2001), based on the empirical signed moment

$$\hat{s}_i = \frac{1}{N} \sum_{\lambda=1}^{N} \mathrm{sign}(X_{i,\lambda}), \qquad X_i \in X \tag{6}$$
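Both noise-shape estimators used here fit in a few lines. The sketch below (ours) implements the log-statistics estimator of Equation 4 for α and maps the empirical signed moment of Equation 6 to θ; the mapping E[sign(X)] = 2θ/(πα), for a zero-location stable variable, is our reading of the zeroth-order signed-moment method of Kuruoglu (2001).

```python
# Sketch: moment-based estimators for the noise shape parameters.
import numpy as np

def estimate_alpha_log_stats(x_sym):
    """Equation 4: alpha from Var(log|X|) = (pi^2/6)(1/alpha^2 + 1/2),
    for centered and symmetrized samples x_sym."""
    logs = np.log(np.abs(x_sym[x_sym != 0.0]))
    inv_alpha_sq = 6.0 * np.var(logs) / np.pi**2 - 0.5
    return 1.0 / np.sqrt(max(inv_alpha_sq, 1e-12))

def estimate_theta(x_raw, alpha):
    """Equation 6 plus the assumed relation E[sign(X)] = 2 theta / (pi alpha),
    valid for zero-location stable samples x_raw."""
    s_hat = np.mean(np.sign(x_raw))
    return 0.5 * np.pi * alpha * s_hat   # theta = arctan(beta tan(alpha pi / 2))
```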
We report two sets of results for each network: structure learning and parameter estimation. For convenience, we describe the results for the ALARM network first (results for the other data sets are provided in Appendix B).

3.1.1 Inferred Structure

[Figure 2: The ALARM network - Inferred structure. Comparative performance of the MDC-based StabLe algorithm (solid blue curves) versus an identical algorithm based on the OLS score (dashed red curves). Vertical axes show true positives in A and false positives in B, for directed edges present in the input network. Horizontal axes show the respective confidence (percentage of simulated data sets with the feature).]

Figure 2 shows the comparative performance of the MDC and OLS based approaches. Each curve shows the number of inferred directed edges. Figures 2A and 2B show the number of true positives and false positives at a given confidence level (percentage of simulated data sets where the directed edge was learnt). Solid (blue) curves show the performance of MDC, and dashed (red) curves the OLS based method. The results are along expected lines, with the difference between the two growing as α is varied away from 2.0. One clear trend is that while the sensitivity of true positive detection degrades for OLS (Type II errors) as α decreases, the MDC based method remains robust to changes in α. Both methods are, however, quite reliable at not inferring incorrect edges (false positives or Type I errors). Similar behavior is observed for the other data sets as well (Appendix B).

3.1.2 Estimated Parameters

[Figure 3: The ALARM network - Estimated regression parameters. A: bias; B: standard deviation.]

Figure 3 shows the comparative performance of the MDC and OLS scores in estimating regression coefficients. Figure 3A shows the bias of the mean estimates (in absolute magnitude) and Figure 3B the standard deviation around the mean of the estimated coefficients, averaged over all true positives and all simulated data sets. Note that each of the 100 simulated data sets had regression coefficients sampled independently from [−1/2, 1/2]. OLS had a much higher standard deviation and bias at low α. As with structure learning, this pattern was consistently observed for the other network topologies as well (Appendix B).

[Figure 4: The ALARM network - Estimated noise parameters. A: estimated α; B: estimated θ; C: estimated log γ.]

We also assessed the ability of StabLe to infer the α-stable noise parameters accurately and robustly. However, we could not show a comparative performance, since OLS scores assume Gaussian noise. Figure 4 shows the box plots and basic statistics for the estimates of α, θ and log γ from the symmetrized data set, where the node-specific parameters θ and log γ are reported as averages:

$$\bar{\theta} \equiv \frac{1}{|X|} \sum_i \arctan\Big( \beta_i \tan\frac{\alpha\pi}{2} \Big), \qquad \overline{\log\gamma} \equiv \frac{1}{|X|} \sum_i \log\gamma_i$$

Both the α and θ estimates have low bias and standard deviation for each of the five data sets. But the log γ estimates show a clear tendency to overestimate the dispersion of the noise at very low α. This is, however, a difficult parameter domain for most existing parameter estimation methods, even for univariate α-stable densities (Kuruoglu, 2001). As with the other inferences, the performance of StabLe is again robust to changes in network topology (Appendix B).

3.2 Gene Expression Microarray Data

| ID  | Ethnicity        | Location         | # Samples | # Genes/Probes |
|-----|------------------|------------------|-----------|----------------|
| CEU | Caucasians       | Utah, USA        | 109       | 21800          |
| CHB | Han Chinese      | Beijing, China   | 80        | 21800          |
| GIH | Gujarati Indians | Houston, USA     | 82        | 21800          |
| JPT | Japanese         | Tokyo, Japan     | 82        | 21800          |
| LWK | Luhya            | Webuye, Kenya    | 83        | 21800          |
| MEX | Mexican          | Los Angeles, USA | 45        | 21800          |
| MKK | Maasai           | Kinyawa, Kenya   | 138       | 21800          |
| YRI | Yoruba           | Ibadan, Nigeria  | 108       | 21800          |

Table 2: The HapMap III population groups and selected microarray probes, as reported by Stranger et al. (2012).

In this section, we describe two sets of analyses for gene expression microarray data from phase III of the HapMap project.³ Our approach models the set of gene expression profiles as a multivariate stable distribution that can be represented by an α-SG model. The first set of experiments aims at comparing the prediction accuracy of MDC with OLS-based structure learning via ten-fold cross-validation (Section 3.2.2). The results of these experiments establish the utility of heavy-tailed models for gene expression profiles. Next, we apply α-SG models to the problem of quantifying differential expression (DE) of a gene between samples belonging to different conditions. This is a common task in gene expression-based analyses in contemporary genomics. However, popular methods for detecting differentially expressed genes usually assume the expression profile of each gene to be independent of the others.

3. Data sets can be downloaded from the ArrayExpress database http://www.ebi.ac.uk/arrayexpress/ using Series Accession Numbers E-MTAB-198 and E-MTAB-264.
Based on this assumption, DE quantification is performed by testing the null hypothesis that the log-expression of each gene is identical across the observed conditions, and by using the corresponding p-value as a measure of DE. In Section 3.2.3, we develop SGEX, a new technique, based on α-SG models, for quantifying the differential expression of each gene. We apply SGEX to quantify the DE of a gene in each population group within the HapMap data. Contrary to most existing methods, SGEX takes into account both the heavy-tailed behavior of gene expression densities and the linear dependencies between the mRNA expression of different genes.

3.2.1 Data Normalization

We downloaded pre-processed data for 727 individuals from eight global population groups, as reported in Stranger et al. (2012). Details about the eight population groups are provided in Table 2. For each individual, the input data represented log-intensities for 21800 microarray probes⁴ that were quantile and median normalized, as described in the original paper (Stranger et al., 2012). These microarray intensities provide a measure of the mRNA concentration within a sample of lymphoblastoid cells from each individual. Before performing structure learning, we further processed each probe intensity as follows:

1. The log-intensity l(i) for each probe i was median-centered to obtain transformed log-intensities ml(i), i.e., the number of samples with positive log-intensity was half (or 0.5 less than half) the total (= 363 = ⌊727/2⌋). This is a standard technique for learning Gaussian graphical models from gene expression data and does not affect the network structure.
2. The median-centered log-intensities were used to assign a rank R(i) to each probe i, in decreasing order of variance. Even for α-stable distributions, the variance of log-transformed data is finite (Kuruoglu, 2001). This is also a standard technique for restricting computing time by selecting the subset of genes with the most variation.
3. The median-centered log-intensities {ml(i) | R(i) ≤ 21800} were exponentiated to I = {2^{ml(i)} | R(i) ≤ 21800}.
4. The exponentiated median-centered log-intensities I_k = {2^{ml(i)} | R(i) ≤ k} for the top k ranked probes (k ≤ 21800) were provided as input to StabLe (for cross-validation) and SGEX (for DE quantification, as described in Section 3.2.3). In the experiments reported here, k = 100.

4. Each selected probe mapped to a unique, autosomal Ensembl gene. Ensembl gene IDs are available at http://www.ensembl.org.

We estimated α over 1000 resampled bootstrap replicates of the data. This was meant as a diagnostic to assess the heavy-tailed nature of the intensities. As shown in Figure 5A, the data suggest a clear departure from a Gaussian profile.

3.2.2 Cross-validation Analysis

We performed a ten-fold cross-validation for the top 100 ranked probes from the HapMap data. Since we wanted to compare MDC with OLS-based learning, we report the goodness of fit of the graphical model B on the test set T = {T_1, ..., T_N} in terms of log fractional lower order moments (LFLOM):

$$\mathrm{LFLOM}(T \mid B, p) = \sum_{X_i \in X} \frac{1}{p} \log E(|Z_i|^p) = \sum_{X_i \in X} \frac{1}{p} \log E\Big( \Big| X_i - \sum_{X_j \in \mathrm{Pa}(X_i)} w_{ij} X_j \Big|^p \Big)$$

where w_ij represents the regression coefficient for the edge (X_j, X_i). Clearly, if most of the variation in X_i can be explained by the parent set Pa(X_i), the corresponding LFLOM will be small.
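Computing LFLOM on a held-out set is direct once the learned DAG and weights are in hand; a minimal sketch (ours):

```python
# Sketch: LFLOM of a test matrix T under a learned alpha-SG model. parents[i]
# lists the parent column indices of node i and weights[i] the fitted w_ij.
import numpy as np

def lflom(T, parents, weights, p):
    """Returns sum_i (1/p) log E|Z_i|^p over the columns of T."""
    total = 0.0
    for i in range(T.shape[1]):
        Z = T[:, i] - sum(w * T[:, j] for j, w in zip(parents[i], weights[i]))
        total += np.log(np.mean(np.abs(Z) ** p)) / p
    return total
```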
For p = 2, LFLOM is identical to the negative log-likelihood for Gaussian graphical models.⁵ However, the second-order moment diverges for α < 2 (Lemma 9). Therefore, LFLOM provides a more robust estimate for evaluating the model on the test set under heavy-tailed noise (α < 2).

5. Note that the noise term Z_i has zero mean, since the data is centro-symmetrized before cross-validation.

[Figure 5: Test set performance and differential expression quantification with SGEX. A shows a box plot of estimated α over 1000 bootstrap replicates. B shows comparative test set performance for MDC and OLS based networks relative to an empty network (no edges). C shows a heat map of LD, which quantifies the differential expression of a gene. The color for each column is normalized by scaling and centering.]

Figure 5B shows the average (over the ten folds) of LFLOM for the MDC (blue) and OLS based (red) models. In each case, the curves show the difference in LFLOM between the optimal (MDC or OLS) network and an empty network (NULL). This allows us to also assess the deterioration in test set performance from treating each gene as an independent random variable (a common assumption in DE quantification). Although the data set contains only 727 samples, we see a clear improvement in the test set performance of α-SG models (MDC curve) relative to Gaussian graphical models (OLS curve).

3.2.3 Quantifying Differential Expression With SGEX

Finally, we discuss SGEX, a new technique for quantifying differential expression using α-SG models. SGEX is based on cross-validation for assessing the DE of a gene across different conditions. For the HapMap data, we chose each of the eight population groups in turn as the test set and learnt the optimal α-SG model for the rest of the samples. We then estimated LD(i, η), the change in negative log-likelihood per sample between the test set η and the training set, as a measure of DE for each probe i:

$$\mathrm{LD}(i, \eta) = \frac{1}{p} \Big[ \log E_\eta(|Z_i|^p) - \log E_{-\eta}(|Z_i|^p) \Big], \qquad p \in (-1, \alpha)$$

Here, E_η(·) is the expectation value for population η (the test set) and E_{−η}(·) for the rest (the training set). Note that Lemma 9 guarantees that the right-hand side of the previous equation is indeed independent of p. For the calculations reported here, p = α/1.01, just as it was during structure learning. Thus, LD(i, η) measures the average increase (or decrease) in log-dispersion for the noise variable Z_i corresponding to probe i within population η. This quantity is represented as a heat map in Figure 5C. We should point out that a higher (or lower) dispersion for the noise variable associated with a gene in the test set does not necessarily imply over- (or under-) expression of the gene in the test set population. The change in dispersion could also be due to a change in network topology or regression coefficients for the test set population.
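Given residuals for probe i under the trained model, the LD statistic itself is a one-line computation; a minimal sketch (ours):

```python
# Sketch: the SGEX statistic LD(i, eta) from residuals of the noise variable
# Z_i, split into the held-out population (test) and the remainder (train).
import numpy as np

def ld_statistic(z_test, z_train, p):
    return (np.log(np.mean(np.abs(z_test) ** p))
            - np.log(np.mean(np.abs(z_train) ** p))) / p
```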
4. Discussion

In this paper we have introduced and developed the theory for efficiently learning α-SG models from data. In particular, one of the main contributions of this paper is to show how the BIC can be asymptotically reduced to the MDC for α-SG models. This result makes it feasible to efficiently learn the structure of these models, since the log-likelihood term does not have a closed-form expression in general. We have also empirically validated the resulting algorithm, StabLe, on both simulated and microarray data. In both cases, the presence of heavy-tailed noise has a clear effect on the learning performance of OLS based methods. Based on these results, we recommend bootstrapped estimation of α as an effective and computationally efficient diagnostic to assess the applicability of OLS based Gaussian graphical models. We have also described SGEX, a new technique for quantifying differential expression from microarray data.

α-SG models may also have wider applicability to other aspects of computational biology, especially to data from next-generation sequencing technologies. In addition to mRNA expression measurements (RNA-seq experiments), α-SG models may prove helpful for other experiments, such as protein-DNA binding (ChIP-seq experiments) and DNA accessibility measurements (DNase-seq and FAIRE-seq experiments).

Finally, we should mention that there are several potential applications of α-SG models beyond computational biology. In particular, image processing provides several problem instances where there is a need to relate different regions of an image. For example, functional magnetic resonance imaging (fMRI) experiments generate a series of images highlighting activity sites in the brain in response to stimuli. Bayesian networks are an effective way of modeling statistical relations between different areas of the brain and the stimuli (Li et al., 2011). Stable distributions may provide a better model for such applications. Another image processing application with potential for α-SG models is remote sensing images of the earth (Mustafa et al., 2012), where image histograms demonstrate clearly skewed and heavy-tailed characteristics (Kuruoglu and Zerubia, 2004). Traffic modeling (Castillo et al., 2012) and financial data analysis (Bonato, 2012) are also promising application areas.

5. Software Availability

Source code for StabLe and the data sets used here are available at https://sourceforge.net/projects/sgmodels/. SGEX is available upon request from the first author.

Acknowledgments

E.E. Kuruoglu was funded by the Alexander von Humboldt Foundation with an Experienced Research Fellowship.

Appendix A

In this section we provide the proof of Lemma 3.

Lemma 3 Every d-dimensional distribution with a characteristic function of the form

$$\Phi(q \mid \alpha, \mu, \Lambda) = \prod_{k=1}^{K} \varphi(c_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k), \qquad c_k, q \in \mathbb{R}^d$$

represents a multivariate stable distribution with a discrete spectral measure Λ.
Proof Assume the following ansatz for the spectral measure, Λ = Σ_k Λ_k, with

$$\Lambda_k = \frac{\|c_k\|_2^{\alpha}\, \gamma_k}{2} \Big[ (1 + \beta_k)\, \delta\Big( s - \frac{c_k}{\|c_k\|_2} \Big) + (1 - \beta_k)\, \delta\Big( s + \frac{c_k}{\|c_k\|_2} \Big) \Big]$$

and for the location vector µ̄,

$$\eta_k(c_k \mid \alpha, \beta_k, \gamma_k, \mu_k) = \begin{cases} \mu_k & \alpha \neq 1 \\ \mu_k - \frac{2\beta_k\gamma_k}{\pi} \log\|c_k\|_2 & \alpha = 1 \end{cases}, \qquad \bar{\mu} = \sum_{k=1}^{K} \eta_k(c_k \mid \alpha, \beta_k, \gamma_k, \mu_k)\, c_k \in \mathbb{R}^d$$

Upon substitution into the parametrization in Definition 2, we get

$$\int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda_k(ds) = \frac{\|c_k\|_2^{\alpha}\, \gamma_k}{2} \Big[ (1 + \beta_k)\, \psi\Big( \frac{c_k^T q}{\|c_k\|_2} \,\Big|\, \alpha \Big) + (1 - \beta_k)\, \psi\Big( -\frac{c_k^T q}{\|c_k\|_2} \,\Big|\, \alpha \Big) \Big]$$

$$= \frac{\|c_k\|_2^{\alpha}\, \gamma_k}{2}\, \frac{|c_k^T q|^{\alpha}}{\|c_k\|_2^{\alpha}} \Big[ (1 + \beta_k)\Big( 1 - \imath\, \mathrm{sign}(c_k^T q)\, r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) \Big) + (1 - \beta_k)\Big( 1 + \imath\, \mathrm{sign}(c_k^T q)\, r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) \Big) \Big]$$

$$\Rightarrow\quad \int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda_k(ds) = \gamma_k |c_k^T q|^{\alpha} \Big[ 1 - \imath \beta_k\, \mathrm{sign}(c_k^T q)\, r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) \Big]$$

$$= \gamma_k |c_k^T q|^{\alpha} \Big[ 1 - \imath \beta_k\, \mathrm{sign}(c_k^T q)\, r(c_k^T q, \alpha) \Big] - \imath \beta_k \gamma_k |c_k^T q|^{\alpha}\, \mathrm{sign}(c_k^T q) \Big[ r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) - r(c_k^T q, \alpha) \Big]$$

Since

$$r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) - r(c_k^T q, \alpha) = \begin{cases} 0 & \alpha \neq 1 \\ \frac{2}{\pi} \log\|c_k\|_2 & \alpha = 1 \end{cases}$$

we have

$$-\imath \beta_k \gamma_k |c_k^T q|^{\alpha}\, \mathrm{sign}(c_k^T q) \Big[ r\Big(\frac{c_k^T q}{\|c_k\|_2}, \alpha\Big) - r(c_k^T q, \alpha) \Big] = \begin{cases} 0 & \alpha \neq 1 \\ -\imath\, c_k^T q\, \frac{2\beta_k\gamma_k}{\pi} \log\|c_k\|_2 & \alpha = 1 \end{cases} = -\imath\, c_k^T q\, \big[ \mu_k - \eta_k(c_k \mid \alpha, \beta_k, \gamma_k, \mu_k) \big]$$

Therefore,

$$-\int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda_k(ds) = \log \varphi(c_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k) - \imath\, \eta_k(c_k \mid \alpha, \beta_k, \gamma_k, \mu_k)\, c_k^T q$$

$$\Rightarrow\quad \log \Phi(q \mid \alpha, \bar{\mu}, \Lambda) = -\int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda(ds) + \imath\, \bar{\mu}^T q = \sum_{k=1}^{K} \Big[ -\int_{S_d} \psi(s^T q \mid \alpha)\, \Lambda_k(ds) \Big] + \imath \sum_{k=1}^{K} \eta_k(c_k \mid \alpha, \beta_k, \gamma_k, \mu_k)\, c_k^T q$$

$$\Rightarrow\quad \log \Phi(q \mid \alpha, \bar{\mu}, \Lambda) = \sum_{k=1}^{K} \log \varphi(c_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k) \quad \Rightarrow \quad \Phi(q \mid \alpha, \bar{\mu}, \Lambda) = \prod_{k=1}^{K} \varphi(c_k^T q \mid \alpha, \beta_k, \gamma_k, \mu_k)$$

Appendix B. Results for the Remaining Benchmark Networks

[Figure 6: The BARLEY network - Inferred structure.]
[Figure 7: The BARLEY network - Estimated regression parameters.]
[Figure 8: The BARLEY network - Estimated noise parameters.]
[Figure 9: The CHILD network - Inferred structure.]
[Figure 10: The CHILD network - Estimated regression parameters.]
[Figure 11: The CHILD network - Estimated noise parameters.]
[Figure 12: The INSURANCE network - Inferred structure.]
[Figure 13: The INSURANCE network - Estimated regression parameters.]
[Figure 14: The INSURANCE network - Estimated noise parameters.]
[Figure 15: The MILDEW network - Inferred structure.]
[Figure 16: The MILDEW network - Estimated regression parameters.]
[Figure 17: The MILDEW network - Estimated noise parameters.]

References

A. Achim and E. E. Kuruoglu. Image denoising using bivariate α-stable distributions in the complex wavelet domain. IEEE Signal Processing Letters, 12(1):17-20, 2005.

A. Achim, A. Bezerianos, and P. Tsakalides. Novel Bayesian multiscale method for speckle removal in medical ultrasound images. IEEE Transactions on Medical Imaging, 20(8):772-783, 2001.

A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. Journal of Computational Biology, 7(3-4):559-583, 2000.

J. Berger and B. Mandelbrot. A new model for error clustering in telephone circuits. IBM Journal of Research and Development, pages 224-236, 1963.

D. Bickson and C. Guestrin. Inference with multivariate heavy-tails in linear models. In Proceedings of NIPS, 2011.

M. Bonato. Modeling fat tails in stock returns: a multivariate stable-GARCH approach. Computational Statistics, 27(3):499-521, 2012.

R. H. Byrd and D. A. Payne. Convergence of the iteratively reweighted least squares algorithm for robust regression. Technical Report 313, The Johns Hopkins University, Baltimore, MD, 1979.

E. Castillo, M. Nogal, J. M. Menéndez, S. Sánchez-Cambronero, and P. Jiménez. Stochastic demand dynamic traffic models using generalized beta-Gaussian Bayesian networks. IEEE Transactions on Intelligent Transportation Systems, 13(2):565-581, 2012.

J. Chambers, C. Mallows, and B. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340-344, 1976.

G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, LXIII:1-38, 2010.

W. Feller. An Introduction to Probability Theory and Its Applications, Vol. I, II. John Wiley, New York, 1968.

N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659):799-805, 2004.

N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data.
Journal of Computational Biology, 7(3-4):601-620, 2000.

J. R. Gallardo, D. Makrakis, and L. Orozco-Barbosa. Use of α-stable self-similar stochastic processes for modeling traffic in broadband networks. Performance Evaluation, 40(1):71-98, 2000.

C. D. Hardin Jr. Skewed stable variables and processes. Technical Report 79, Univ. North Carolina, Chapel Hill, 1984.

D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49-75, 2000.

International HapMap 3 Consortium and others. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52-58, 2010.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

E. E. Kuruoglu. Density parameter estimation of skewed α-stable distributions. IEEE Transactions on Signal Processing, 49(10):2192-2201, 2001.

E. E. Kuruoglu and J. Zerubia. Modeling SAR images with a generalization of the Rayleigh distribution. IEEE Transactions on Image Processing, 13(4):527-533, 2004.

P. Lévy. Calcul des probabilités. Gauthier-Villars, Paris, 1925.

R. Li, K. Chen, A. S. Fleisher, E. M. Reiman, L. Yao, and X. Wu. Large-scale directional connections among multi resting-state neural networks in human brain: A functional MRI and Bayesian network modeling study. NeuroImage, 56(3):1035-1042, 2011.

B. Mandelbrot. The variation of certain speculative prices. Journal of Business, 36:394-419, 1963.

S. B. Montgomery et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464(7289):773-777, 2010.

Y. T. Mustafa, V. A. Tolpekin, and A. Stein. Application of the expectation maximization algorithm to estimate missing values in Gaussian Bayesian network modeling for forest growth. IEEE Transactions on Geoscience and Remote Sensing, 50(5):1821-1831, 2012.

C. L. Nikias and M. Shao. Signal Processing with Alpha-Stable Distributions. Wiley, New York, 1995.

J. P. Nolan. Stable Distributions - Models for Heavy Tailed Data. Birkhäuser, Boston, 2013. Chapter 1 online at academic2.american.edu/~jpnolan.

J. P. Nolan. Linear and nonlinear regression with stable errors. Journal of Econometrics, 172(2):186-194, 2013.

J. P. Nolan and B. Rajput. Calculation of multi-dimensional stable densities. Communications in Statistics - Simulation and Computation, 24(3):551-566, 1995.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.

D. Salas-Gonzalez, E. E. Kuruoglu, and D. P. Ruiz. Modelling and assessing differential gene expression using the alpha stable distribution. The International Journal of Biostatistics, 5(1):1-24, 2009a.

D. Salas-Gonzalez, E. E. Kuruoglu, and D. P. Ruiz. A heavy-tailed empirical Bayes method for replicated microarray data. Computational Statistics & Data Analysis, 53(5):1535-1546, 2009b.

G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes. Chapman and Hall, New York, 1994.

M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning graphical model structure using L1-regularization paths. In Proceedings of AAAI, 2007.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.

B. E. Stranger et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genetics, 8(4):e1002639, 2012.

B. W. Stuck.
Minimum error dispersion linear filtering of scalar symmetric stable processes. IEEE Transactions on Automatic Control, 23:507-509, 1978.

M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2005.

I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.

V. M. Zolotarev. Mellin-Stieltjes transforms in probability theory. Theory of Probability and Its Applications, 2:433-460, 1957.