Structure Learning of Partitioned Markov Networks

Song Liu (LIU@ISM.AC.JP), Taiji Suzuki (SUZUKI.T.CT@M.TITECH.AC.JP), Masashi Sugiyama (SUGI@K.U-TOKYO.AC.JP), Kenji Fukumizu (FUKUMIZU@ISM.AC.JP)
The Institute of Statistical Mathematics, Tokyo 190-8562, Japan. Tokyo Institute of Technology, Tokyo 152-8552, Japan; PRESTO, Japan Science and Technology Agency (JST). University of Tokyo, Chiba 277-8561, Japan.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We learn the structure of a Markov Network between two groups of random variables from joint observations. Since modelling and learning the full MN structure may be hard, learning the links between the two groups directly may be a preferable option. We introduce a novel concept called the partitioned ratio, whose factorization directly associates with the Markovian properties of random variables across the two groups. A simple one-shot convex optimization procedure is proposed for learning the sparse factorizations of the partitioned ratio, and it is theoretically guaranteed to recover the correct inter-group structure under mild conditions. The performance of the proposed method is experimentally compared with state-of-the-art MN structure learning methods using ROC curves. Real applications on analyzing bipartisanship in the US Congress and pairwise DNA/time-series alignments are also reported.

1. Introduction

An undirected graphical model, or a Markov Network (MN) (Koller & Friedman, 2009; Wainwright & Jordan, 2008), has a wide range of real-world applications, such as natural language processing, computer vision, and computational biology. The structure of an MN, which encodes the interactions among random variables, is one of the key interests of MN learning tasks. However, on a high-dimensional dataset, learning the full MN structure can be cumbersome, since we may not have enough knowledge to model the entire MN, or our application may only concern a specific portion of the MN structure.

Figure 1. An illustration of a full MN (left) and a PMN (right). The full MN models all the connections among random variables, while the PMN only models the interactions between groups (red edges) and does not model connections within groups.

Rather than considering the full MN structure over the complete set of random variables, we focus on learning the portion of the MN structure that links two groups of random variables, namely the Partitioned Markov Network (PMN). A PMN is suitable for describing inter-group relations. For example, politicians in the US Congress are naturally grouped into two parties (Democrats and Republicans), and learning a PMN on congresspersons via their voting records will reveal bipartisan collaborations among them. A full gene network may have a complicated structure; however, if genes can be clustered into a few homologous groups, a PMN can help us understand how genes in different functioning groups interact with each other. An illustration of a full MN and a PMN is shown in Figure 1.

Since a PMN can be regarded as a sub-structure of a full MN, a naive approach is to learn a full MN over the complete set of random variables and then read off its PMN. In fact, the machine learning community has seen great progress on learning the sparse structures of MNs, thanks to the pioneering work on sparsity-inducing norms (Tibshirani, 1996; Zhao & Yu, 2006; Wainwright, 2009).
A majority of the previous works fall into the category of the regularized maximum likelihood approach, which maximizes the likelihood function of a probabilistic model under sparsity constraints. Graphical lasso (Friedman et al., 2008; Banerjee et al., 2008) considers a joint Gaussian model parameterized by the inverse covariance matrix, where zero elements indicate the conditional independence among random variables, while others have developed useful variations of graphical lasso in order to loosen the Gaussianity assumed on the data (Liu et al., 2009; Loh & Wainwright, 2012). SKEPTIC (Liu et al., 2012) is a semiparametric approach that replaces the covariance matrix with a rank-based correlation matrix, such as Kendall's tau, in MN learning.

The latest advances along this line of research have been made by considering a node-wise conditional probabilistic model. Instead of learning all the structures in one shot, such a method focuses on learning the neighborhood structure of a single random variable at a time. Maximizing the conditional likelihood leads to simple logistic regression (in the case of the Ising model) (Ravikumar et al., 2010) or linear regression (in the case of the Gaussian model) (Meinshausen & Bühlmann, 2006).

Unfortunately, the maximum (conditional) likelihood method can be difficult to compute for general non-Gaussian graphical models, since computing the normalization term is in general intractable. Though one may use sampling, such as Monte Carlo methods (Robert & Casella, 2005), to approximate the normalization term, there is no universal guideline on how to choose the sampling parameters so that the approximation error is minimized.

A more severe problem is that sparsity approaches may have difficulties when learning a dense MN. Specifically, the sample size required for a successful structure recovery grows quadratically with the number of connected neighbors (Raskutti et al., 2009; Ravikumar et al., 2010). However, it is quite reasonable to assume that in some applications, one node may have many neighbors within its own group while connections to the other group are sparse: a congressperson is very well connected to other members inside his/her party but has only a few links with the opposition party. Genes in a homologous group may have a dense structure but only interact with another group of genes via a few ties.

Is there a way to directly obtain the PMN structure? Neither maximizing a joint likelihood nor a conditional likelihood takes the partition information into account, and interactions are modelled globally. However, a PMN encodes only the local conditional independence between groups, and the requirement for obtaining a good estimator should be much milder.

The above intuition leads us to a novel concept, the Partitioned Ratio (PR). Given a set of partitioned random variables $X = (X_1, X_2)$, PR is the ratio between the joint probability $P(X)$ and the product of its marginals $P(X_1)P(X_2)$, i.e., $P(X)/(P(X_1)P(X_2))$. In the same way that the joint distribution can be decomposed into clique potentials of an MN, we prove that PR also factorizes over subgraph structures called passages, which indicate the connectivity between the two groups of random variables $X_1$ and $X_2$ in a PMN. Conventionally, PR is a measure of the independence between two sets of random variables. In this paper, we show that the factorization of this quantity indicates the linkage between two groups of random variables, which is a natural extension of the regular usage of PR.
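For concreteness, the partitioned ratio and its conventional use as an independence measure can be summarized as follows (this display is our restatement of the definition above, not an equation taken from the paper):

```latex
\[
  \mathrm{PR}(X_1, X_2) \;:=\; \frac{P(X_1, X_2)}{P(X_1)\,P(X_2)},
  \qquad
  X_1 \perp X_2 \;\Longleftrightarrow\; \mathrm{PR}(X_1, X_2) \equiv 1 .
\]
```

A constant PR thus recovers the classical independence reading, while Section 3 shows that the way a non-constant PR factorizes encodes the finer inter-group structure.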
Most importantly, we show that the sparse factorization of this quantity may be learned via a one-shot convex optimization procedure, which can be solved efficiently even for general, non-Gaussian distributions. The correct recovery of the sparse passage structure is theoretically guaranteed under the assumption that the sample size increases with the number of passages, which is unrelated to the structural density of the entire MN.

This paper is organized as follows. In Section 2, we review the Hammersley-Clifford theorem (Section 2.1) and define some notations as preliminaries (Section 2.2). The factorization theorems of PMN are introduced in Section 3, together with a few simplifications. We give an estimator of the sparse factorization of PR in Section 4 and prove that its recovered structure is consistent in Section 5. Finally, experimental results on both artificial and real-world datasets are reported in Section 6.

2. Background and Preliminaries

In this section, we review the factorization theorems of MN. We limit our discussion to strictly positive distributions from now on. A graph is always assumed to be finite, simple, and undirected.

2.1. Background and Motivation

Definition 1 (MN). For a joint probability $P(X)$ of random variables $X = \{X_1, X_2, \dots, X_m\}$, if for all $i$, $P(X_i \mid \setminus X_i) = P(X_i \mid X_{N(i)})$, where $X_{N(i)}$ is the set of neighbors of node $X_i$ in a graph $G$, then $P$ is an MN with respect to $G$.

Definition 2 (Gibbs Distribution). For a joint distribution $P$ on a set of random variables $X$, if the joint density can be factorized as
$$P(X) = \frac{1}{Z} \prod_{C \in \mathbf{C}(G)} \phi_C(X_C),$$
where $Z$ is the normalization term, $\mathbf{C}(G)$ is the set of complete subgraphs of $G$, and each factor $\phi_C$ is defined only on a subset of random variables $X_C$, then $P$ is called a Gibbs distribution that factorizes over $G$.

Theorem 1 (See e.g., Hammersley & Clifford (1971)). If $P$ is an MN with respect to $G$ (Definition 1), then $P$ is a Gibbs distribution that factorizes over $G$ (Definition 2).

Theorem 2 (See e.g., Koller & Friedman (2009)). If $P$ is a Gibbs distribution that factorizes over $G$, then $P$ is an MN with respect to $G$.

Figure 2. If (I) is an MN over X, then (I), (II), and (III) are all PMNs over X. If (I) is a PMN over X, then (I), (II), and (III) are not necessarily MNs over X (but are still PMNs over X).

Theorems 1 and 2 are the keystones of many MN structure learning methods. They state that, by learning a sparse factorization of a joint distribution, we are able to spot the structure of a graphical model. However, learning a joint distribution has never been an easy task due to the normalization issue, and if the task is to learn a PMN that only concerns conditional independence across two groups, such an approach seems to solve a more general task as an intermediate step (Vapnik, 1998).

Does there exist an alternative to the joint distribution whose factorization relates to the structure of a PMN? Ideally, such a factorization should be efficiently estimated from samples with a tractable normalization term, and the estimation procedure should provide good statistical guarantees. In the rest of the paper, we show that PR has the desired properties to indicate the structure of a PMN: it factorizes over the structure of a PMN (Section 3) and is easy to estimate from joint samples (Section 4) with good statistical properties (Section 5).

2.2. Definitions

Notations. Sets are denoted by upper-case letters, e.g., $A$, $B$. An upper-case letter with a lower-case subscript, $A_i$, means the $i$-th element in $A$.
The set operator $A \setminus B$ means excluding set $B$ from set $A$, and $\setminus B$ means the whole set excluding the set $B$. $A = (A1, A2)$ is a partition of the set $A$, and an upper-case letter followed by an integer, e.g., $A1$, $A2$, denotes the groups divided by such a partition. Given a graph $L = \langle N, E \rangle$ and a subgraph $K \subseteq L$, $N_K$ or $E_K$ denotes the subset of $N$ or $E$ whose elements are indexed topologically by $K$. An upper-case letter in bold font, e.g., $\mathbf{K}$, is a set of sets.

Figure 3. (Left) ABCD and (Right) AB...Z are two passages.

PMN and Gibbs Partitioned Ratio. Now we formally define a graph $G = \langle X, E \rangle$, where $X$ is a set of random variables and $X = (X_1, X_2)$, i.e., $X_1 \cap X_2 = \emptyset$, $X_1 \cup X_2 = X$, and $X_1, X_2 \neq \emptyset$. The concept of PMN can now be defined.

Definition 3 (PMN). For a joint probability $P(X)$, $X = (X_1, X_2)$, if
$$P(X_i \mid X_1 \cup X_{N(i)} \setminus X_i) = P(X_i \mid \setminus X_i), \quad \forall X_i \in X_1, \qquad (1)$$
$$P(X_i \mid X_2 \cup X_{N(i)} \setminus X_i) = P(X_i \mid \setminus X_i), \quad \forall X_i \in X_2, \qquad (2)$$
then $P$ is a PMN with respect to $G$.

The following proposition is a consequence of Definition 3, and an example is visualized in Figure 2.

Proposition 1. If $P$ is an MN with respect to $G$, then $P$ is a PMN with respect to $G$, but not vice versa.

Proposition 2. If $P$ is a PMN with respect to $G$, $X_u \in X_1$, $X_v \in X_2$, and $v \notin N(u)$, then $X_u \perp X_v \mid \setminus \{X_u, X_v\}$.

See Appendix A for the proof. The concept of a passage is defined as follows:

Definition 4 (Passage). Let $X = (X_1, X_2)$. We define a passage $B$ of $G$ as a subgraph of $G$ such that $X_B \cap X_1 \neq \emptyset$, $X_B \cap X_2 \neq \emptyset$, and $\forall X_u \in (X_1 \cap X_B)$, $\forall X_v \in (X_2 \cap X_B)$, we have edge $(X_u, X_v) \in E_B$.

Here we highlight two of the passage structures of two graphs in Figure 3. From the definition, we can see that all cliques that go across the two groups are passages, but not all passages are cliques:

Proposition 3. Let $X = (X_1, X_2)$. Given a passage $B$ of $G$, $B$ is a complete subgraph if and only if $\forall X_u, X_v \in X_B \cap X_1$, edge $(X_u, X_v) \in E_B$, and $\forall X_u, X_v \in X_B \cap X_2$, edge $(X_u, X_v) \in E_B$.

As an analogy to the Gibbs distribution used in the Hammersley-Clifford theorem, we define the Gibbs partitioned ratio.

Definition 5 (Gibbs Partitioned Ratio). For a joint distribution $P$ over $X = (X_1, X_2)$, if the partitioned ratio has the form
$$\frac{P(X_1, X_2)}{P(X_1)P(X_2)} = \frac{1}{Z} \prod_{B \in \mathbf{B}(G)} \phi_B(X_B),$$
where $Z$ is a normalization constant and $\mathbf{B}(G)$ is the set of all passages in $G$, then $P(X_1, X_2)/(P(X_1)P(X_2))$ is called the Gibbs partitioned ratio (GPR) over $G$.

3. Factorization over Passages

In this section, we investigate the question: can we have a factorization theorem similar to Theorems 1 and 2 for PMN? If so, learning the sparse factorization of PR may reveal the Markovian properties among random variables.

3.1. Fundamental Properties

There are two steps for introducing our factorization theorems. The first step is establishing the Markovian property of random variables from the factorization of PR.

Theorem 3. Given $X = (X_1, X_2)$, if the PR $P(X_1, X_2)/(P(X_1)P(X_2))$ is a GPR over a graph $G$, then $P$ is a PMN with respect to $G$.

See Appendix B for the proof. Next, let us prove the other direction: from the Markovian property to the factorization.

Theorem 4. Given $X = (X_1, X_2)$, if $P$ is a PMN with respect to a graph $G$, then $P(X_1, X_2)/(P(X_1)P(X_2))$ is a GPR over $G$.

See Appendix C for the proof. Put simply, the factorization of a GPR is only related to the linkage (or, rigorously, the passages) between the two groups. Interestingly, if we have an MN whose groups are linked via a few bottleneck passages, then the factorization is simply over those sparse passages, no matter how densely the graph is connected within each group.
This gives the PMN a significant advantage over the traditional MN in terms of modelling: if the interactions between groups are simple (e.g., linear), we do not need to model the interactions within groups, even if they are highly complicated (e.g., non-linear). For example, in the bipartisan analysis problem, a PR over congresspersons can be represented via only a few cross-party links, and the large chunk of connections between congresspersons within their own party can be ignored, no matter how complicated they are.

Theorems 3 and 4 point out a promising direction for the structure learning of a PMN: once the sparse factorization of a GPR is learned, we are able to recover the sparse passages of a PMN partitioned into two groups.

3.2. Simplification of Passage Factorization

The Hammersley-Clifford theorem (Theorem 1) shows that $P$ factorizes over the cliques of $G$, given that $P$ is an MN with respect to $G$. However, if one does not know the maximum size of the cliques, the model of the probability function has to consider factors on all potential cliques, i.e., all subsets of $X$. It is unrealistic to construct a model with $2^{|X|}$ factors under the high-dimensional setting. Therefore, a popular assumption called the pairwise MN (Koller & Friedman, 2009; Murphy, 2012) has been widely used to lower the computational burden of MN structure learning. It assumes that in $P$, all clique factors can be recovered using only bivariate and univariate components, which gives rise to a pairwise model with only $(|X|^2 + |X|)/2$ factors. Some well-known MNs, such as the Gaussian MN and the Ising model, are examples of pairwise MNs.

Similar issues also arise when modelling the GPR. There are $(2^{|X_1|} - 1)(2^{|X_2|} - 1)$ possible passage potentials for the set of random variables $X = (X_1, X_2)$. Following the same spirit, we can consider a simplified model of PR by assuming that all passage potentials of the GPR factorize in a pairwise fashion:

Definition 6 (Pairwise PR). For a joint distribution $P$ over $X = (X_1, X_2)$, if the partitioned ratio has the form
$$\frac{P(X_1, X_2)}{P(X_1)P(X_2)} = \frac{1}{Z} \prod_{B \in \mathbf{B}(G)} \phi_B(X_B), \qquad \phi_B(X_B) = \prod_{X_u, X_v \in X_B,\; u < v} h_{u,v}(X_u, X_v),$$
then $P(X_1, X_2)/(P(X_1)P(X_2))$ is called the pairwise Gibbs partitioned ratio (pairwise PR) over $G$.

If we can assume that the GPR we hope to learn is also a pairwise PR, the model contains only $(|X|^2 + |X|)/2$ pairwise factors and is much easier to construct. In fact, the pairwise PR does not have a straightforward relationship with the pairwise MN: a PR of a pairwise MN may not be a pairwise PR, and the joint distribution corresponding to a pairwise PR may not be a pairwise MN, since the pairwise MN and the pairwise PR apply the same assumption to the parameterizations of two fundamentally different quantities, the joint probability and the PR, respectively. Whether one should impose such an assumption on the joint probability or on the PR is entirely up to the application, as neither parameterization is always superior to the other. If the application focuses on learning the connections between two groups, we believe imposing such an assumption on the PR directly is more sensible.

However, as a special case, a joint Gaussian distribution is a pairwise MN, and its PR is also a pairwise PR.

Proposition 4. If $P$ over $X = (X_1, X_2)$ is a zero-mean Gaussian distribution, then the PR $P(X_1, X_2)/(P(X_1)P(X_2))$ is a pairwise PR.

Since the Gaussian distribution factorizes over pairwise potentials, and the marginal distributions $P(X_1)$ and $P(X_2)$ are still Gaussian, this statement can be verified from the construction of the potential function (6) in the proof of Theorem 4.
Moreover, one can show that it has the pairwise factor $h_{u,v}(X_u, X_v) = \exp(\theta_{u,v} X_u X_v)$, where $\theta_{u,v}$ is the parameter.

This pairwise assumption, together with the factorization theorems, motivates us to recover the structure of a PMN by learning a sparse pairwise PR model: for any $X_u \in X_1$, $X_v \in X_2$, if $X_u$ and $X_v$ appear in the same pairwise factor of a PR model, they must be involved in at least one of the passage potentials.

4. Estimating PR from Samples

To estimate PR using such a model, we require a set of samples $\{x^{(i)}\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P$, $x \in \mathbb{R}^m$, where each sample vector $x^{(i)}$ is a joint sample, i.e., $x^{(i)} = (x^{(i)}_1, x^{(i)}_2)$, and $x_1, x_2$ are subvectors corresponding to the two groups. We define a log-linear pairwise PR model $g(x; \theta)$:
$$g(x; \theta) := \frac{1}{N(\theta)} \exp\Big[ \sum_{u < v} \theta_{u,v}^\top \psi(x_{u,v}) \Big],$$
where $\theta_{u,v} \in \mathbb{R}^b$ is a column vector, $\theta = (\theta_{1,2}^\top, \dots, \theta_{1,m}^\top, \theta_{2,3}^\top, \dots, \theta_{2,m}^\top, \dots, \theta_{m-1,m}^\top)^\top$, and $\psi$ is a vector-valued feature function $\psi: \mathbb{R}^2 \to \mathbb{R}^b$. Notice that we still have to model all pairwise features in $x$, but the vast majority of these pairs will be nullified due to Theorem 4 if the links between the two groups are sparse.

$N(\theta)$ is defined as the normalization function of $g(x; \theta)$:
$$N(\theta) := \int p(x_1) p(x_2) \exp\Big[ \sum_{u < v} \theta_{u,v}^\top \psi(x_{u,v}) \Big] \,\mathrm{d}x, \qquad (3)$$
where $p(x_1)$ and $p(x_2)$ are the marginal distributions of $p(x)$, so it is guaranteed that $\int p(x_1) p(x_2) g(x; \theta) \,\mathrm{d}x = 1$. $N(\theta)$ in (3) can be approximated via two-sample U-statistics (Hoeffding, 1963) using the dataset:
$$N(\theta) \approx \hat{N}(\theta) := \frac{1}{n(n-1)} \sum_{j \neq k} \exp\Big[ \sum_{u < v} \theta_{u,v}^\top \psi(x^{[j,k]}_{u,v}) \Big],$$
where $x^{[j,k]}$ is a permuted sample: $x^{[j,k]} = (x^{(j)}_1, x^{(k)}_2)$.

Notice that the normalization term $N(\theta)$ in (3) is an integral with respect to the probability distribution $p(x_1)p(x_2)$. Though we do not have samples directly from such a distribution, U-statistics help us simulate such an expectation using joint samples. In maximum likelihood estimation, density models are in general hard to compute, since their normalization term is not taken with respect to a distribution we have samples from. In comparison, $N(\theta)$ can always be easily approximated for any choice of $\psi$. This gives us the flexibility to consider complicated PR models beyond the conventional Gaussian or Ising models.

This model can be learned via the algorithm of maximum likelihood mutual information (MLMI) (Suzuki et al., 2009), by simply minimizing the Kullback-Leibler divergence between $p(x)$ and $p_\theta(x) = p(x_1)p(x_2)g(x; \theta)$:
$$\hat{\theta} = \operatorname*{argmin}_\theta \mathrm{KL}[p \,\|\, p_\theta].$$
Substituting the model $g(x; \theta)$ into the above objective and approximating $N(\theta)$ by $\hat{N}(\theta)$, the estimated parameter $\hat{\theta}$ is obtained as
$$\hat{\theta} = \operatorname*{argmin}_\theta \underbrace{-\frac{1}{n} \sum_{i=1}^{n} \sum_{u < v} \theta_{u,v}^\top \psi(x^{(i)}_{u,v}) + \log \hat{N}(\theta)}_{\ell_{\mathrm{MLMI}}(\theta)} + C,$$
where $C$ is some constant. From now on, we refer to $\ell_{\mathrm{MLMI}}(\theta)$ as the negative likelihood function.

Due to Theorem 4 and our parametrization, if the passages between the two groups are rare, then $\theta$ is very sparse. Therefore, we may use sparsity-inducing group-lasso penalties (Yuan & Lin, 2006) to encourage sparsity on each subvector $\theta_{u,v}$:
$$\hat{\theta} = \operatorname*{argmin}_\theta \ell_{\mathrm{MLMI}}(\theta) + \lambda \sum_{u < v} \|\theta_{u,v}\|. \qquad (4)$$
This objective is convex and unconstrained, and can be easily solved by standard sub-gradient methods. $\lambda$ is a regularization parameter that can be tuned via cross-validation.

Now let us define the true parameter $\theta^*$ such that $p(x) = p(x_1)p(x_2)g(x; \theta^*)$. The learned parameter $\hat{\theta}$ is an estimate of $\theta^*$, where $\theta^*_{u,v}$ is non-zero on pairwise features that are involved in at least one of the passage potentials. Moreover, as Theorem 3 and Proposition 2 show, if $X_u \in X_1$ and $X_v \in X_2$ are not in any of the passage structures, i.e., $\theta^*_{u,v} = 0$, then $X_u \perp X_v \mid \setminus \{X_u, X_v\}$.
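The following is a minimal sketch of the estimator in (4), assuming the scalar feature $\psi(x_u, x_v) = x_u x_v$ (so $b = 1$ and the group-lasso penalty reduces to the lasso), a fixed step size, and a subsampled approximation of the U-statistic $\hat{N}(\theta)$; the code and all names in it are ours and are not the authors' implementation.

```python
import numpy as np
from itertools import combinations

def pairwise_features(X1, X2):
    """Features psi(xu, xv) = xu * xv for every pair u < v of the concatenated vector.
    X1: (n, m1) samples of group 1, X2: (n, m2) samples of group 2.
    Returns an (n, P) feature matrix and the list of index pairs."""
    X = np.hstack([X1, X2])
    pairs = list(combinations(range(X.shape[1]), 2))
    return np.column_stack([X[:, u] * X[:, v] for (u, v) in pairs]), pairs

def mlmi_loss_grad(theta, F_joint, F_perm):
    """Negative MLMI likelihood ell(theta) and its gradient.
    F_joint: features of joint samples x^(i); F_perm: features of permuted samples x^[j,k]."""
    w = np.exp(F_perm @ theta)                 # unnormalized ratio values on permuted samples
    Nhat = w.mean()                            # (subsampled) U-statistic estimate of N(theta)
    loss = -(F_joint @ theta).mean() + np.log(Nhat)
    grad = -F_joint.mean(axis=0) + (F_perm * w[:, None]).mean(axis=0) / Nhat
    return loss, grad

def estimate_pr(X1, X2, lam=0.1, step=0.1, iters=500, n_perm=2000, seed=0):
    """Sparse pairwise PR estimation by proximal gradient on ell(theta) + lam * sum_uv |theta_uv|."""
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    F_joint, pairs = pairwise_features(X1, X2)
    # permuted pool for the two-sample U-statistic: pair x1^(j) with x2^(k), j != k
    j = rng.integers(0, n, size=n_perm)
    k = (j + rng.integers(1, n, size=n_perm)) % n
    F_perm, _ = pairwise_features(X1[j], X2[k])
    theta = np.zeros(F_joint.shape[1])
    for _ in range(iters):
        _, g = mlmi_loss_grad(theta, F_joint, F_perm)
        theta = theta - step * g
        theta = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)  # soft threshold
    return theta, pairs
```

Under this parameterization, a pair $(u, v)$ with $u$ in group 1, $v$ in group 2, and a non-zero estimated $\hat{\theta}_{u,v}$ is reported as an inter-group edge of the PMN; for vector-valued features ($b > 1$), the soft-thresholding step would be replaced by block-wise (group) soft-thresholding on each subvector $\theta_{u,v}$.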
Given the optimization problem (4), it is natural to consider structure recovery consistency: under what conditions is the sparsity pattern of $\hat{\theta}$ the same as that of $\theta^*$?

5. High-dimensional Structure Recovery Consistency

To better state the structure recovery consistency theorem, we use a new indexing system with respect to the sparsity pattern of the parameter. Denoting the pairwise index set as $H = \{(u, v) \mid u < v\}$, two sets of subvector indices can be defined as
$$S = \{t \in H \mid \theta^*_t \neq 0\}, \qquad S^c = \{t \in H \mid \theta^*_t = 0\}.$$
We rewrite the objective (4) as
$$\hat{\theta} = \operatorname*{argmin}_\theta \ell(\theta) + \lambda_n \sum_{t \in S} \|\theta_t\| + \lambda_n \sum_{t \in S^c} \|\theta_t\|. \qquad (5)$$
Similarly, we can define $\hat{S}$ and $\hat{S}^c$. From now on, we abbreviate $\ell_{\mathrm{MLMI}}(\theta)$ as $\ell(\theta)$. Now we state our assumptions.

Assumption 1 (Dependency). The minimum eigenvalue of the submatrix of the log-likelihood Hessian is lower-bounded:
$$\Lambda_{\min}\big(\nabla_{\theta_S} \nabla_{\theta_S} \ell(\theta^*)\big) \geq \lambda_{\min} > 0,$$
with probability 1, where $\Lambda_{\min}$ is the minimum-eigenvalue operator of a symmetric matrix.

Assumption 2 (Incoherence).
$$\max_{t \in S^c} \big\| \nabla_{\theta_t} \nabla_{\theta_S} \ell(\theta^*) \, [\nabla_{\theta_S} \nabla_{\theta_S} \ell(\theta^*)]^{-1} \big\|_1 \leq 1 - \alpha,$$
with probability 1, where $0 < \alpha \leq 1$ and $\|Y\|_1 = \sum_{i,j} \|Y_{i,j}\|_1$.

The first two assumptions are common in the literature on support consistency. The first assumption guarantees the identifiability of the problem. The second assumption ensures that the pairwise factors in passages are not too easily affected by those that are not in any passage. The third assumption states that the likelihood function is well-behaved.

Assumption 3 (Smoothness of the Likelihood Objective). The log-likelihood ratio $\ell(\theta)$ is smooth around its optimal value, i.e., it has bounded derivatives:
$$\max_{\delta: \|\delta\| \leq \|\theta^*\|} \big\| \nabla^2 \ell(\theta^* + \delta) \big\| \leq \lambda_{\max} < \infty, \qquad \max_{t \in S \cup S^c} \; \max_{\delta: \|\delta\| \leq \|\theta^*\|} \big|\big|\big| \nabla_{\theta_t} \nabla^2 \ell(\theta^* + \delta) \big|\big|\big| \leq \lambda_{3,\max} < \infty,$$
with probability 1, where $\|\cdot\|$ and $|||\cdot|||$ are the spectral norms of a matrix and a tensor, respectively (see, e.g., Tomioka & Suzuki (2014) for the definition of the spectral norm of a tensor).

Assumption 4 (Bounded PR Model). For any vector $\delta \in \mathbb{R}^{\dim(\theta^*)}$ such that $\|\delta\| \leq \|\theta^*\|$, the following inequalities hold:
$$0 < C_{\min} \leq g(x; \theta^* + \delta) \leq C_{\max} < \infty, \qquad \|f_t\| \leq C_{f_t,\max} \sqrt{b} \;\text{ and }\; \|f_t\|_\infty \leq C_{f_t,\max}, \quad \forall t \in (S \cup S^c),$$
where $f_t$ denotes the pairwise feature vector of pair $t$. This assumption simply indicates that our PR model is bounded from above and below around the optimal value. Though it rules out the Gaussian distribution, whose PR is not necessarily upper/lower-bounded, we think this is acceptable for a theory of generic pairwise models.

Theorem 5. Suppose that Assumptions 1, 2, 3, and 4 are satisfied, as well as
$$\min_{t \in S} \|\theta^*_t\| \geq \frac{10}{\lambda_{\min}} \sqrt{|S|}\, \lambda_n.$$
Suppose also that the regularization parameter $\lambda_n$ is chosen appropriately in terms of a positive constant $M$. Then there exist some constants $L$, $K_1$, and $K_2$ such that if $n \geq L |S|^2 \log \frac{m^2 + m}{2}$, then with probability at least $1 - K_1 \exp(-K_2 \lambda_n^2 n)$, MLMI in (5) has the following properties. Unique solution: the solution of (5) is unique. Successful passage recovery: $\hat{S} = S$ and $\hat{S}^c = S^c$. Estimation error: $\|\hat{\theta} - \theta^*\| = O(\sqrt{|S|}\, \lambda_n)$.

The proof of Theorem 5 is detailed in Appendix D. Since the PR function is a density-ratio function between $p(x)$ and $p(x_1)p(x_2)$, and (5) is also a sparsity-inducing Kullback-Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008), the previously developed support consistency theorem (Liu et al., 2015; 2016) can be applied here, as long as we can verify a few assumptions and lemmas.

The sample size required for the proposed method increases with $\log m$ (since $\log \frac{m^2 + m}{2} \leq 2 \log m$ if $m > 2$), and the estimation error on $\theta$ vanishes at the rate of $O(\sqrt{\log m / n})$. These are the same as the optimal rates obtained in previous research on Gaussian graphical model structure learning (Ravikumar et al., 2010; Raskutti et al., 2009).
This theorem also indicates that the required sample size is not influenced by the structural density of the entire MN, but by the number of pairwise factors in the passage potentials. This is encouraging, since we are allowed to explore PMNs with dense groups, which would be hard to learn using conventional methods.

6. Experiments

Unless specified otherwise, we use the pairwise feature function $\psi(x_u, x_v) = x_u x_v$. Note that this does not mean we assume Gaussianity of the joint distribution, since this is a parameterization of a PR rather than of a joint distribution.

6.1. Synthetic Datasets

We are interested in comparing the proposed method with a few possible alternatives: LL (Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010), SKEPTIC (Liu et al., 2012), and Diff (Zhao et al., 2014), a direct difference estimation method that learns the difference between two MNs without learning each individual precision matrix separately. In this paper, we employed Diff to learn the difference between the two Gaussian densities $p(x)$ and $p(x_1)p(x_2)$.

We first generate a set of joint samples $\{x^{(i)}\}_{i=1}^{50} \sim N(0, \Theta^{-1})$, where $\Theta \in \mathbb{R}^{50 \times 50}$ is constructed in two steps. First, create
$$\Theta_{ij} = \begin{cases} \rho^{|i - j|}, & i, j \leq 40 \text{ or } i, j > 40, \\ 0, & \text{otherwise}, \end{cases}$$
where $0 < \rho < 1$ is a coefficient controlling the dominance of the diagonal entries. Second, let $\Lambda$ be the 15th smallest eigenvalue of $\Theta$, and fill the submatrices $\Theta_{\{41,\dots,50\},\{31,\dots,40\}}$ and $\Theta_{\{31,\dots,40\},\{41,\dots,50\}}$ with $\Lambda I_{10}$, where $I_{10}$ is a $10 \times 10$ identity matrix. By this construction, we have created two groups over $X$, namely $X = (X_{\{1,\dots,40\}}, X_{\{41,\dots,50\}})$, and 10 passages between them (a code sketch of this construction is given below). Notably, within the two groups the precision matrix is dense, and random variables interact with each other via powerful links when $\rho$ is large. An example of $\Theta$ with $\rho = 0.8$ is plotted in Figure 4(a).

Figure 4. Synthetic experiments. (a) $\Theta$, $\rho = 0.8$, $|X_1| = 40$; (c) ROC curves on the Gaussian dataset; (d) ROC curves on the diamond dataset.

We measure the performance of the three methods using the True Positive Rate (TPR) and the True Negative Rate (TNR); their detailed definitions are deferred to Appendix E. The ROC curves in Figure 4(c) are plotted by adjusting the sensitivity of each method: tuning the regularization parameter of the proposed method and LL, or the threshold parameter of Diff. As we can see, the proposed method has the best overall performance for all choices of $\rho$, compared to both LL and Diff. Also, as the links within each group become more and more powerful (by increasing $\rho$), the performance of LL and Diff decays significantly, while that of the proposed method remains almost unchanged.

As the proposed method is capable of handling complex models, we also draw 50 samples from a 52-dimensional diamond distribution used in Liu et al. (2014), where the correlations among random variables are non-linear. To speed up the sampling procedure, the graphical model of this distribution is constructed by concatenating 13 simple 4-variable MNs whose density functions are defined as
$$p(x_a, x_b, x_c, x_d) \propto \exp\!\big( -\rho\, x_a^2 x_b^2 - 0.5\, x_b x_c - 0.5\, x_b x_d \big) \cdot N,$$
where $N$ is short for a normal density $N(0, 0.5 I_4)$ over $x_a, x_b, x_c$, and $x_d$. Notice that this distribution does not have a closed-form normalization term. The graphical model of such a distribution is illustrated in Figure 4(b). In this experiment, the coefficient $\rho$ is used to control the strength of the inter-group interactions ($x_a$ and $x_b$), and we set $\psi(x_u, x_v) = x_u^2 x_v^2$. Other than LL, we include SKEPTIC due to the non-Gaussian nature of this dataset.
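As referenced above, here is a minimal sketch of the precision-matrix construction for the first synthetic experiment, assuming NumPy and a 0-indexed mapping of the 1-indexed description in the text; all names are ours.

```python
import numpy as np

def make_precision(m=50, group1=40, rho=0.8, n_passages=10):
    """Build the synthetic 50 x 50 precision matrix: dense within-group blocks
    rho^|i-j| and 10 cross-group entries (passages) set to the 15th smallest eigenvalue."""
    idx = np.arange(m)
    same_group = ((idx[:, None] < group1) & (idx[None, :] < group1)) | \
                 ((idx[:, None] >= group1) & (idx[None, :] >= group1))
    Theta = np.where(same_group, rho ** np.abs(idx[:, None] - idx[None, :]), 0.0)
    lam = np.sort(np.linalg.eigvalsh(Theta))[14]    # 15th smallest eigenvalue
    rows = np.arange(group1, group1 + n_passages)   # variables 41..50 (0-indexed 40..49)
    cols = np.arange(group1 - n_passages, group1)   # variables 31..40 (0-indexed 30..39)
    Theta[rows, cols] = lam                          # Lambda * I_10 on the off-diagonal block
    Theta[cols, rows] = lam
    return Theta

# joint samples x ~ N(0, Theta^{-1}), split into the two groups
Theta = make_precision()
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(50), cov=np.linalg.inv(Theta), size=50)
X1, X2 = X[:, :40], X[:, 40:]
```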
The performance is compared in Figure 4(d) using ROC curves. The correlations among random variables are completely non-linear. As the power of the interactions on the passages increases, LL performs worse and worse, since it still relies on the Gaussian model assumption. Thanks to the correct PR model, the proposed method performs reasonably well and gets better as $\rho$ increases. As the density model does not fit the Gaussian copula model, SKEPTIC also performs poorly.

6.2. Bipartisanship in the 109th US Senate

We use the proposed method to study the bipartisanship between Democrats and Republicans in the 109th US Senate via the recorded votes. In total, 100 senators (45 Democrats and 55 Republicans) cast votes on 645 questions with "yea", "nay", or "not voting". The task is to discover the cross-party links between senators. We construct a dataset $\{x^{(i)}\}_{i=1}^{645} \subset X$ using all 645 questions as observations, where each observation $x \in \{1, -1, 0\}^{100}$ corresponds to the votes on a single question by the 100 senators, and the random variables $X = (X_{\{1,\dots,45\}}, X_{\{46,\dots,100\}})$ are senators partitioned according to party membership.

Figure 5. Bipartisanship in the 109th US Senate. The prefix (D) or (R) indicates the party membership of a senator. Red: positive influence; blue: negative influence. Edge widths are proportional to $|\hat{\theta}_{u,v}|$.

We run the proposed method directly on this dataset and decrease $\lambda$ from 10 until $|\hat{S}| > 15$. To avoid complication, we only plot edges that contain nodes from different groups in Figure 5. It can be seen that Ben Nelson, a conservative Democrat who frequently voted against his party (Wikipedia, 2016a), has multiple links with the other side. On the right, Democrat Tom Carper tends to agree with Republican Lincoln Chafee. Carper collaborated with Chafee on multiple bipartisan proposals (Press-Release, a;b), while Chafee, whose support for fiscal and social policies often opposed those promoted by the Republican Party (Wikipedia, 2016b), eventually switched his affiliation to Democratic in 2013. Interestingly, we have also observed a cluster of senators who tend to disagree with each other.

6.3. Pairwise Sequence Alignment

A PMN can also be used to align sequences. Given a pair of sequences whose points are collected from a domain $X$, we pick sequence 1 and construct a dataset by sliding a window of size $n$ toward the future until reaching the end of the sequence. Suppose $m_1$ windows are generated; then we can create a dataset $\{x^{(i)}_1\}_{i=1}^{n}$, $x_1 \in X^{m_1}$, whose $i$-th sample collects the $i$-th element of every window. Similarly, we construct another dataset $\{x^{(i)}_2\}_{i=1}^{n}$, $x_2 \in X^{m_2}$, on sequence 2, and make joint samples by letting $x^{(i)} = (x^{(i)}_1, x^{(i)}_2)$ (a code sketch of this construction is given below). After learning a PMN over the two groups, if $X_u$ and $X_v$ are connected, then we regard the elements in the $u$-th window and the elements in the $v$-th window as aligned. See Figure 7 in the Appendix for an illustration.

We run the proposed method to learn PMNs over two datasets: Twitter keyword count sequences (Liu et al., 2013) and amino acid sequences with GenBank IDs AAD01939 and AAQ67266. The results were obtained by decreasing $\lambda$ from 10 until $|\hat{S}| > 15$. For the Twitter dataset, we collect normalized frequencies of keywords as time-series over 8 months, during the Deepwater Horizon oil spill event in 2010.

Figure 6. Sequence alignments. (a) Twitter keyword frequency time-series alignments; $n = 50$, $m = 962$, and $X = \mathbb{R}$. (b) Amino acid sequence alignments between AAD01939 (human) and AAQ67266 (fly); $n = 10$, $m = 592$, $\psi(x_i, x_j) = \delta(x_i, x_j)$, and $X$ = {amino acid dictionary}. For two aligned windows of size $n$, we plot $n$ gray lines between the two windows linking each pair of elements. Since the lines are so close to each other, they look like gray shades on the plot. The colored box contains the region of consecutively aligned windows.
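As mentioned above, here is a minimal sketch of the sliding-window dataset construction used for sequence alignment, assuming NumPy and toy input sequences; all names are ours.

```python
import numpy as np

def window_dataset(seq, n):
    """Column j holds window j of length n, i.e. entry (i, j) = seq[i + j].
    Each of the n rows is one joint observation over the m = len(seq) - n + 1 window variables."""
    seq = np.asarray(seq)
    m = len(seq) - n + 1
    return np.stack([seq[j:j + n] for j in range(m)], axis=1)

# toy sequences standing in for the two time series to be aligned
seq1 = np.sin(np.linspace(0, 6 * np.pi, 200))
seq2 = np.sin(np.linspace(0, 6 * np.pi, 180) + 0.5)

n = 50
X1 = window_dataset(seq1, n)   # shape (n, m1): group 1 variables are the windows of sequence 1
X2 = window_dataset(seq2, n)   # shape (n, m2): group 2 variables are the windows of sequence 2
# X1 and X2 can now be fed to a sparse pairwise PR estimator (e.g., the sketch in Section 4);
# a non-zero theta_hat_{u,v} across the two groups is read as "window u of sequence 1
# is aligned with window v of sequence 2".
```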
We learn alignments between two pairs of keywords: "Obama" vs. "Spill" and "Spill" vs. "BP". The results are plotted in Figure 6(a), where we can see that the sequences of both pairs are aligned well in chronological order. The two popular keywords "BP" and "Spill" are synchronized throughout almost the entire event, while "Spill" and "Obama" are only synchronized later on, after Obama delivered his Oval Office speech on the crisis on June 15th, 2010.

The next experiment uses two amino acid string sequences, consisting of codes such as "V", "I", "L", and "F". Figure 6(b) shows that the proposed method has successfully identified the aligned segment between the eyeless gene of Drosophila melanogaster (a fruit fly) and the human aniridia gene. The same segment is also spotted, with statistical significance, by the widely used Needleman-Wunsch (NW) algorithm (Needleman & Wunsch, 1970).

Acknowledgments

SL acknowledges JSPS Grant-in-Aid for Scientific Research Activity Start-up 15H06823. TS acknowledges JST PRESTO. MS acknowledges the JST CREST program. KF acknowledges JSPS Grant-in-Aid for Scientific Research on Innovative Areas 25120012. The authors would like to thank four anonymous reviewers for their helpful comments.

References

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, March 2008.

Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

Hammersley, J. M. and Clifford, P. Markov fields on finite graphs and lattices. 1971.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, USA, 2009.

Liu, H., Lafferty, J., and Wasserman, L. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295-2328, 2009.

Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, pp. 2293-2326, 2012.

Liu, S., Yamada, M., Collier, N., and Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72-83, 2013.

Liu, S., Quinn, J. A., Gutmann, M. U., Suzuki, T., and Sugiyama, M. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Computation, 26(6):1169-1197, 2014.

Liu, S., Suzuki, T., and Sugiyama, M. Support consistency of direct sparse-change learning in Markov networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI2015), 2015.

Liu, S., Relator, R., Sese, J., Suzuki, T., and Sugiyama, M. Support consistency of direct sparse-change learning in Markov networks. Annals of Statistics, 2016. To appear.
Loh, P.-L. and Wainwright, M. J. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 2087-2095, 2012.

Meinshausen, N. and Bühlmann, P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

Press-Release. Carper urges bipartisan compromise on clean air, a. URL http://www.epw.senate.gov/pressitem.cfm?party=rep&id=230919.

Press-Release. Carper-Chafee-Feinstein will offer bipartisan budget plan similar to blue dog house proposal, b. URL http://www.carper.senate.gov/public/index.cfm/pressreleases?ID=e5603ed2-80ae-483b-a98d-4aa9932edeaf.

Raskutti, G., Yu, B., Wainwright, M. J., and Ravikumar, P. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1329-1336. Curran Associates, Inc., 2009.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer-Verlag, Secaucus, NJ, USA, 2005.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699-746, 2008.

Suzuki, T., Sugiyama, M., and Tanaka, T. Mutual information approximation via maximum likelihood estimation of density ratio. In IEEE International Symposium on Information Theory (ISIT 2009), pp. 463-467, June 2009.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.

Tomioka, R. and Suzuki, T. Spectral norm of random tensors. arXiv preprint arXiv:1407.1870 [math.ST], 2014.

Vapnik, V. N. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, May 2009.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wikipedia. Ben Nelson. Wikipedia, the free encyclopedia, 2016a. URL https://en.wikipedia.org/wiki/Ben_Nelson. [Online; accessed 30-Jan-2016].

Wikipedia. Lincoln Chafee. Wikipedia, the free encyclopedia, 2016b. URL https://en.wikipedia.org/wiki/Lincoln_Chafee. [Online; accessed 30-Jan-2016].

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

Zhao, P. and Yu, B. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.

Zhao, S., Cai, T., and Li, H. Direct estimation of differential networks. Biometrika, 101(2):253-268, 2014.