# Learning Graphs with a Few Hubs

Rashish Tandon, Pradeep Ravikumar {RASHISH, PRADEEPR}@CS.UTEXAS.EDU
Department of Computer Science, The University of Texas at Austin, USA

We consider the problem of recovering the graph structure of a hub-networked Ising model given i.i.d. samples, under high-dimensional settings where the number of nodes $p$ could potentially be larger than the number of samples $n$. By a hub-networked graph, we mean a graph with a few hub nodes with very large degrees. State-of-the-art estimators for Ising models have a sample complexity that scales polynomially with the maximum node-degree, and are thus ill-suited to recovering such graphs with a few hub nodes. Some recent proposals for specifically recovering hub graphical models do not come with theoretical guarantees, and even empirically provide limited improvements over vanilla Ising model estimators. Here, we show that under such low-sample settings, instead of estimating difficult components such as hub-neighborhoods, we can use quantitative indicators of our inability to do so, and thereby identify hub nodes. This simple procedure allows us to recover hub-networked graphs with very strong statistical guarantees even under very low-sample settings.

## 1. Introduction

Graphical models are a popular class of multivariate probability distributions that are widely used in applications across science and engineering. The key idea is to represent probability distributions compactly as a product of functions over the cliques of an underlying graph. The task of graphical model selection is to learn the underlying undirected graph given samples drawn from the distribution it represents. This task becomes particularly difficult in high-dimensional data settings, where the number of variables $p$ could be larger than the number of samples $n$.

Due in part to its importance, many practical algorithms with strong statistical guarantees have been proposed for this graphical model selection problem. In this paper, we focus on binary Ising models, where the variables are binary. For such Ising graphical models, Ravikumar et al. (2010) show that local node-wise $\ell_1$-regularized logistic regressions can recover the underlying graph exactly with high probability, when given $n = O(d^3 \log p)$ i.i.d. samples, where $p$ is the number of nodes and $d$ is the maximum node-degree of the graph. Another class of methods is based on local search and thresholding (Abbeel et al., 2006; Csiszár & Talata, 2006; Bresler et al., 2008; Anandkumar et al., 2011), but in the absence of other stringent assumptions, their computational complexity scales exponentially with the local node-degrees $d$. Among more global approaches, Ji & Seymour (1996); Csiszár & Talata (2006); Peng et al. (2009) and others have proposed penalized pseudo-likelihood (Besag, 1975) based approaches, while Yang & Ravikumar (2011) have proposed penalized estimators based on variational approximations to the graphical model log-likelihood; however, the sample complexity of these methods also scales polynomially with the maximum node-degree of the graph.

In this paper, we consider the setting where the graphs have a few hub nodes, which are highly connected nodes whose degree could scale as large as linearly in the number of nodes.
An important instance of this setting is power-law graphs, which occur ubiquitously in many real-world settings, and in which hub nodes with large degrees are few in number but not non-existent, so that the maximum node-degree could be very large. Since the sample complexity of the state-of-the-art methods listed above scales polynomially with the maximum node-degree, they are thus not very suitable for recovering such power-law graphs with hub nodes.

Motivated by this, there have been a few statistical estimators proposed that explicitly target power-law graphical model estimation. Liu & Ihler (2011) propose a novel non-convex regularization motivated by the power-law degree distribution, a convex variant of which was also considered in (Defazio & Caetano, 2012). While these methods did not provide theoretical guarantees, even their experimental results demonstrated limited improvements in sample complexity over $\ell_1$-regularization based methods. Peng et al. (2009) propose a pseudo-likelihood based procedure for learning discrete graphical models that minimizes the sum of weighted node-wise conditional log-likelihoods, where the node-wise weights could potentially be tuned to encourage power-law structure, but this was suggested as a heuristic. For the specific case of Gaussian graphical models, Hero & Rajaratnam (2012) give an approach based on thresholding sample partial correlation matrices, and provide asymptotic expressions for false discovery rates under stringent weak dependence assumptions.

Consider the following leading question: what if we do not have enough samples to solve for the node-conditional distribution of a hub node in an Ising model, i.e., what if we have fewer than $d_h^3 \log p$ samples, where $d_h$ is the degree of the hub node? The estimators above that target hub-networked graphical models all rely in part on the estimation of such difficult sub-problems, so that they have a large sample complexity for estimating such hub-networked graphical models (Tandon & Ravikumar, 2013). In this paper, we propose to turn the problem on its head, and use our inability to estimate such difficult sub-problems given limited samples, to then turn around and be able to estimate the hub-network.

To provide intuition for our strategy, consider a star-shaped graph, with one hub node, and the rest being spoke nodes connected only to the hub. The maximum degree of the hub node is thus $p - 1$, so that estimating the node-conditional distribution of the hub node would require samples scaling as $p^3 \log p$. What if we only have samples scaling as $\log p$? But suppose we are also able to realize that we are unable to estimate the node-conditional distribution of the hub node, and can only estimate those of the spoke nodes. We can then ignore the neighborhood estimate of the hub node, and use the reliable neighborhood estimates of just the spoke nodes: this suffices to estimate the star graph. In this paper, we formalize this strategy: we provide a quantitative criterion for checking whether or not the given number of samples suffices for regularized node-conditional distribution estimation as in (Ravikumar et al., 2010) at a given node. We then use this to detect hub nodes, and use only the neighborhood estimates from the remaining nodes to construct the graph estimate.
We note that our notion of hub nodes is specifically related to the difficulty of node-neighborhood estimation, which only roughly corresponds to the node-degree (while the required sample size scales as $O(d^3 \log p)$, the constants matter in finite-sample settings). Our criterion is based on the following key observations on $\ell_1$-regularized node-neighborhood estimation for any node $u \in V$ conditioned on the rest of the nodes. Consider the variance of the Bernoulli event of the incidence of any node $v \in V \setminus u$ in the node-neighborhood estimate, as a function of the regularization penalty. When the penalty is very small, the node-neighborhood estimate will include all nodes, and the variance will be zero; when the penalty is just right, the node-neighborhood estimate will be correct and will include $v$ iff it is a neighbor with very high probability, so that the variance will again be (close to) zero; and when the penalty is very large, the node-neighborhood estimate will be null, and the variance will again be zero. Contrast this behavior with the setting where there are too few samples to allow for neighborhood recovery at any value of the regularization penalty: then the variance starts off at zero, rises, and then slowly goes to zero as the node-neighborhood becomes null. The difference in the observable behaviors between these two settings thus allows us to differentiate hub nodes from non-hubs. As we show, we are able to provide concrete statistical guarantees for our procedure, demonstrating improved sample complexity over the vanilla $\ell_1$-regularized node-regression procedure.

We note that the approach of Liu & Ihler (2012) is similar in spirit to ours, utilizing a weighted combination of the node-wise estimates to obtain the overall estimate, where the weights are the inverse of an alternate notion of variance. However, their approach deals with parameter estimation in the asymptotic sense, and is not applicable to structure estimation in the high-dimensional setting.

Overall, our paper makes a key advance in the estimation of hub-networked graphical models: we provide a tractable procedure with strong statistical guarantees even under very low-sample settings where we cannot even estimate the node-conditional distributions of the hub nodes. Our methods involve binary reliability indicators for node-conditional distribution estimation, which could have broader uses in many scientific and engineering applications, even outside the context of graphical model estimation.

## 2. Notation and Preliminaries

Let $X = (X_1, \ldots, X_p)$ be a random vector, with each variable $X_i$ ($i \in [p]$) taking values from a discrete set $\mathcal{X}$. Let $G = (V, E)$ be an undirected graph over $p$ nodes, corresponding to the $p$ variables $\{X_1, \ldots, X_p\}$. A pairwise Markov random field over $X = (X_1, \ldots, X_p)$ is a probability distribution specified by non-negative pairwise functions $\phi_{rt} : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ for each edge $(r,t) \in E$:

$$P(x) \;\propto\; \prod_{(r,t) \in E} \phi_{rt}(x_r, x_t). \qquad (1)$$

Note that we use $rt$ as a shorthand for the edge $(r,t)$. In this paper, we focus on the Ising model setting, i.e., where we have binary variables with $\mathcal{X} = \{-1, 1\}$, and where $\phi_{rt} = \exp(\theta_{rt} x_r x_t)$ for a given set of parameters $\theta = \{\theta_{rt} \mid rt \in E\}$. In this case, (1) can be rewritten as

$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_{rt \in E} \theta_{rt} x_r x_t \Big),$$

where $Z(\theta) = \sum_{x \in \{-1,1\}^p} \exp\big( \sum_{rt \in E} \theta_{rt} x_r x_t \big)$.

Let $\mathcal{D} := \{x^{(1)}, \ldots, x^{(n)}\}$ be $n$ samples drawn i.i.d. from the Ising model distribution $P_{\theta^*}$ with parameters $\theta^* \in \mathbb{R}^{\binom{p}{2}}$ and Markov graph $G^* = (V, E^*)$, $|V| = p$.
Note that each sample $x^{(i)}$ is a $p$-dimensional binary vector, $x^{(i)} \in \{-1,1\}^p$. The edge set $E^*$ is related to the parameters $\theta^*$ as $E^* = \{(r,t) \in V \times V \mid \theta^*_{rt} \neq 0\}$. The task of graphical model selection is to infer this edge set $E^*$ using the $n$ samples. Any estimator $\widehat{E}_n$ for this task is said to be sparsistent if it satisfies $\mathbb{P}\big[\widehat{E}_n = E^*\big] \rightarrow 1$ as $n \rightarrow \infty$.

### 2.1. ℓ1-regularized estimator

We now briefly review the state-of-the-art estimator of (Ravikumar et al., 2010) (called the $\ell_1$-estimator henceforth). The key idea there is to estimate the true graph $E^*$ by estimating the neighbourhood of each node $r \in V$ in turn. Suppose $N^*(r)$ denotes the true neighbours of the vertex $r$, so that $N^*(r) = \{t \mid (r,t) \in E^*\}$. The $\ell_1$-estimator uses sparsistent neighborhood estimators $\widehat{N}_n(r) \subseteq V$, $\forall r \in V$, s.t. $\mathbb{P}\big[\widehat{N}_n(r) = N^*(r)\big] \rightarrow 1$ as $n \rightarrow \infty$, to then obtain a sparsistent estimate of the entire graph. Note that for any $r \in V$, the set of parameters $\theta^*$ is related to the true neighbourhood as $N^*(r) = \{t \in V \mid \theta^*_{rt} \neq 0\}$.

The $\ell_1$-estimator exploits this to pose neighbourhood selection as an $\ell_1$-regularized logistic regression problem, minimizing the negative conditional log-likelihood for each node with an additional $\ell_1$-penalty. Note that for a set of parameters $\theta$ and a node $r \in V$, the conditional distribution of $X_r$ conditioned on $X_{V \setminus r}$ is given as

$$P_\theta\big(x_r \mid x_{V\setminus r}\big) = \frac{\exp\big(2 x_r \sum_{t \in V\setminus r} \theta_{rt} x_t\big)}{1 + \exp\big(2 x_r \sum_{t \in V\setminus r} \theta_{rt} x_t\big)}. \qquad (3)$$

Defining $\theta_{\setminus r} = \{\theta_{rt} \mid t \in V, t \neq r\}$ and $x_{\setminus r} = \{x_t \mid t \in V, t \neq r\}$, the negative conditional log-likelihood of the samples $\mathcal{D}$ is given by

$$\mathcal{L}(\theta_{\setminus r}; \mathcal{D}) = \sum_{i=1}^{n} \Big\{ \log\Big(1 + \exp\big(2 x^{(i)}_r\, \theta_{\setminus r}^T x^{(i)}_{\setminus r}\big)\Big) - 2 x^{(i)}_r\, \theta_{\setminus r}^T x^{(i)}_{\setminus r} \Big\}.$$

The $\ell_1$-estimator solves the following optimization problem for each $r \in V$:

$$\arg\min_{\theta_{\setminus r} \in \mathbb{R}^{p-1}} \; \mathcal{L}(\theta_{\setminus r}; \mathcal{D}) + \lambda \|\theta_{\setminus r}\|_1. \qquad (5)$$

Let $\widehat{\theta}_{\setminus r}(\mathcal{D})$ correspond to the solution of (5). Then the neighbourhood estimate is given as the non-zero locations or support of $\widehat{\theta}_{\setminus r}(\mathcal{D})$: $\widehat{N}_\lambda(r;\mathcal{D}) = \mathrm{Support}\big(\widehat{\theta}_{\setminus r}(\mathcal{D})\big)$. Finally, the edge estimate is computed by taking the union of all neighbourhood estimates: $\widehat{E}_{n,\lambda} = \bigcup_{r \in V} \{(r,t) \mid t \in \widehat{N}_\lambda(r;\mathcal{D})\}$.

The $\ell_1$-estimator has been shown to have strong statistical guarantees under certain incoherence conditions. Below, we restate the incoherence conditions of (Ravikumar et al., 2010), for the sake of completeness. These are stated in terms of the Hessian (in expectation) of the likelihood function for the true parameter vector $\theta^*_{\setminus r}$, which is given as $Q^*_r = \mathbb{E}\big[\nabla^2 \log P_{\theta^*}\big(x_r \mid x_{V\setminus r}\big)\big]$. For brevity, we shall write $Q^*_r$ as $Q^*$, the true neighbourhood set $N^*(r)$ as $N$, and its complement $V \setminus N^*(r)$ as $N^c$. Then, their incoherence conditions (with $r \in V$ being implicit in $Q^*$ and $N$) are:

(A1) $\exists$ a const. $C_{\min} > 0$ s.t. $\Lambda_{\min}(Q^*_{NN}) \geq C_{\min}$. Also, $\exists$ a const. $C_{\max}$ s.t. $\Lambda_{\max}\big(\mathbb{E}\big[X_{V\setminus r} X_{V\setminus r}^T\big]\big) \leq C_{\max}$.

(A2) $\exists$ a constant $\alpha \in (0,1]$ s.t. $\big\| Q^*_{N^c N} (Q^*_{NN})^{-1} \big\|_\infty \leq 1 - \alpha$.

Note that $\Lambda_{\min}(\cdot)$ and $\Lambda_{\max}(\cdot)$ correspond to the minimum and maximum eigenvalues of a matrix respectively, and $\|\cdot\|_\infty$ corresponds to the standard $\ell_\infty$-matrix norm. Now, we restate the main theorem below from (Ravikumar et al., 2010) using our notation, and refer the reader to their paper for details.

**Theorem 1** (Guarantee for the $\ell_1$-estimator; see (Ravikumar et al., 2010)). Suppose an Ising graphical model with true parameter set $\theta^*$ satisfies conditions (A1) and (A2) for all nodes $r \in V$. Consider any $r \in V$, and let $d_r = \|\theta^*_{\setminus r}\|_0$ denote its degree. Then, there exist constants $c_1, c_2, c_3, c_4$ such that if we have $\lambda \geq c_1 \sqrt{\frac{\log p}{n}}$, $n \geq c_2\, d_r^3 \log p$, and $N^*_{sub}(r) = \big\{ t \in N^*(r) \,\big|\, |\theta^*_{rt}| \geq c_3 \sqrt{d_r}\,\lambda \big\}$, then

$$\mathbb{P}\Big[ \widehat{N}_\lambda(r;\mathcal{D}) = N^*_{sub}(r) \Big] \;\geq\; 1 - 2\exp\big(-c_4 \lambda^2 n\big). \qquad (6)$$
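As a concrete illustration of the node-wise regression step in (5), the sketch below fits an $\ell_1$-penalized logistic regression of one node on the rest using scikit-learn. The helper name and the rough mapping between the penalty $\lambda$ and scikit-learn's `C` parameter are our own assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_estimate(X, r, lam):
    """Estimate N_hat_lambda(r; D): the support of the l1-penalized
    logistic regression of node r on all remaining nodes (problem (5)).

    X   : (n, p) array of +/-1 samples
    r   : index of the node whose neighborhood we estimate
    lam : regularization weight lambda
    """
    n, p = X.shape
    y = X[:, r]                                  # response: node r itself
    Z = np.delete(X, r, axis=1)                  # predictors: all other nodes
    # scikit-learn's liblinear solver minimizes ||w||_1 + C * (logistic loss),
    # so C ~ 1/lambda up to constant factors (the paper's loss uses a
    # factor-of-2 parameterization); this mapping is approximate.
    clf = LogisticRegression(penalty="l1", C=1.0 / lam,
                             solver="liblinear", fit_intercept=False)
    clf.fit(Z, y)
    support = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8)
    others = np.delete(np.arange(p), r)          # map back to node indices
    return set(others[support])
```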
Based on Theorem 1, and a simple application of the union bound, we can see that the sample complexity for recovering the entire graph scales as $n = \Omega(d_{\max}^3 \log p)$ samples, where $d_{\max}$ is the maximum degree of the graph $G^* = (V, E^*)$. However, as detailed earlier, $d_{\max}$ may be huge for hub-graphs, so that the sample complexity of the $\ell_1$-estimator will be large for such graphs.

## 3. Our Algorithm

As noted in the introduction, our approach is based on using a quantitative criterion for checking whether or not the given number of samples suffices for regularized node-conditional distribution estimation as in the $\ell_1$-estimator at a given node. Given such a criterion, we can then take the union of only those neighborhood estimates which the method is guaranteed to estimate accurately, and not consider the junk estimates. Towards building such an observable sufficiency criterion, we first set up some notation.

### 3.1. Sufficiency Measure

For every $r \in V$ and $t \in V \setminus r$, we define $p_{r,n,\lambda}(t) = \mathbb{P}\big[ t \in \widehat{N}_\lambda(r;\mathcal{D}) \big]$ as the probability of variable $t$ being included in the neighbourhood estimate of variable $r$, estimated by the $\ell_1$-estimator with regularization $\lambda$, given $n$ samples drawn i.i.d. from the underlying Ising model. Note that the probability is taken over the $n$ samples. Based on Theorem 1, we have the following simple corollary.

**Corollary 1.** For any $r \in V$, suppose $\theta^*$ and $(n,\lambda)$ satisfy all conditions of Theorem 1 with constants $c_1, c_2, c_3, c_4$; then

$$p_{r,n,\lambda}(t) \geq 1 - 2\exp(-c_4\lambda^2 n) \;\text{ if } t \in N^*_{sub}(r), \quad\text{and}\quad p_{r,n,\lambda}(t) \leq 2\exp(-c_4\lambda^2 n) \;\text{ if } t \notin N^*_{sub}(r), \qquad (7)$$

where $N^*_{sub}(r) = \big\{ t \in N^*(r) \,\big|\, |\theta^*_{rt}| \geq c_3\sqrt{d_r}\,\lambda \big\}$.

Thus, when the number of samples $n$ is sufficient for neighborhood recovery, depending on whether node $t$ is in the true neighborhood of $r$, $p_{r,n,\lambda}(t)$ goes extremely close to zero or one; equivalently, $p_{r,n,\lambda}(t)\,(1 - p_{r,n,\lambda}(t))$ goes extremely close to zero. Building on this observation, let us define the sufficiency measure

$$M_{r,n,\lambda} = \max_{t \in V\setminus r}\; p_{r,n,\lambda}(t)\,\big(1 - p_{r,n,\lambda}(t)\big). \qquad (8)$$

It can thus be seen that this sufficiency measure goes to zero when the number of samples $n$ is sufficient for recovering the neighborhood of node $r$.

In the sequel, we will analyze a natural U-statistic to estimate this sufficiency measure from data. We first require some more notation. For any $b$ with $1 < b < \frac{n}{2}$, we define $S_b(\mathcal{D})$ as the set of all possible subsamples of size $b$, drawn from $\mathcal{D}$ without replacement, so that

$$S_b(\mathcal{D}) = \big\{ (x^{(i_1)}, \ldots, x^{(i_b)}) \mid 1 \leq i_1 < \ldots < i_b \leq n \big\}. \qquad (9)$$

Given any subsample $\mathcal{D}_b \in S_b(\mathcal{D})$ of size $b$, let $F^t_{\lambda,r}(\mathcal{D}_b)$ be a function such that

$$F^t_{\lambda,r}(\mathcal{D}_b) = \begin{cases} 1 & \text{if } t \in \widehat{N}_{b,\lambda}(r;\mathcal{D}_b) \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

Now, we consider the U-statistic (of order $b$),

$$\tilde{p}_{r,b,\lambda}(t;\mathcal{D}) = \frac{1}{\binom{n}{b}} \sum_{\mathcal{D}_b \in S_b(\mathcal{D})} F^t_{\lambda,r}(\mathcal{D}_b). \qquad (11)$$

Note that $\mathbb{E}\big[\tilde{p}_{r,b,\lambda}(t;\mathcal{D})\big] = p_{r,b,\lambda}(t)$. We are now ready to provide the U-statistic estimate of the sufficiency measure in (8):

$$\widetilde{M}_{r,b,\lambda}(\mathcal{D}) = \max_{t \in V\setminus r}\; \tilde{p}_{r,b,\lambda}(t;\mathcal{D})\,\big(1 - \tilde{p}_{r,b,\lambda}(t;\mathcal{D})\big).$$

Computing $\widetilde{M}_{r,b,\lambda}(\mathcal{D})$ would require computing $\tilde{p}_{r,b,\lambda}(t;\mathcal{D})$ for every $t \in V \setminus r$, which in turn would require considering all possible $\binom{n}{b}$ sub-samples of $\mathcal{D}$. However, as we show below (see also analyses in (Liu et al., 2010; Politis et al., 1999) on sub-sampling), it suffices to choose a number $N \geq n/b$ of subsamples drawn at random. Thus, our actual estimate for $p_{r,b,\lambda}(t)$ is

$$\hat{p}_{r,b,\lambda}(t;\mathcal{D}) = \frac{1}{N}\sum_{i=1}^{N} F^t_{\lambda,r}(\mathcal{D}_i), \qquad (13)$$

where $\mathcal{D}_1, \ldots, \mathcal{D}_N$ are subsamples chosen independently and uniformly at random from $S_b(\mathcal{D})$, and the estimate for the sufficiency measure is

$$\widehat{M}_{r,b,\lambda}(\mathcal{D}) = \max_{t \in V\setminus r}\; \hat{p}_{r,b,\lambda}(t;\mathcal{D})\,\big(1 - \hat{p}_{r,b,\lambda}(t;\mathcal{D})\big).$$

We describe the procedure to calculate $\widehat{M}_{r,b,\lambda}(\mathcal{D})$ in Algorithm 1.

**Algorithm 1** Estimating $\widehat{M}_{r,b,\lambda}(\mathcal{D})$

Input: Data $\mathcal{D} := \{x^{(1)}, \ldots, x^{(n)}\}$, regularization parameter $\lambda$, sub-sample size $b$, no. of sub-samples $N$.
Output: An estimate $\widehat{M}_{r,b,\lambda}(\mathcal{D})$ of $M_{r,b,\lambda}$.

1. $\forall t \in V \setminus r$: $\hat{p}_{r,b,\lambda}(t;\mathcal{D}) \leftarrow 0$
2. for $i = 1$ to $N$ do:
   - pick a sub-sample $\mathcal{D}_i$ uniformly at random from $S_b(\mathcal{D})$
   - compute $\widehat{N}_{b,\lambda}(r;\mathcal{D}_i)$ by solving (5) (the $\ell_1$-estimate)
   - for $t \in \widehat{N}_{b,\lambda}(r;\mathcal{D}_i)$ do: $\hat{p}_{r,b,\lambda}(t;\mathcal{D}) \leftarrow \hat{p}_{r,b,\lambda}(t;\mathcal{D}) + 1$
3. $\forall t \in V \setminus r$: $\hat{p}_{r,b,\lambda}(t;\mathcal{D}) \leftarrow \hat{p}_{r,b,\lambda}(t;\mathcal{D}) / N$
4. $\widehat{M}_{r,b,\lambda}(\mathcal{D}) \leftarrow \max_{t \in V\setminus r} \hat{p}_{r,b,\lambda}(t;\mathcal{D})\,\big(1 - \hat{p}_{r,b,\lambda}(t;\mathcal{D})\big)$
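A minimal sketch of Algorithm 1 in code, reusing the hypothetical `neighborhood_estimate` helper from the earlier sketch; names and defaults are ours, for illustration only.

```python
import numpy as np

def sufficiency_measure_hat(X, r, lam, b, N, rng=None):
    """Estimate M_hat_{r,b,lambda}(D) via N random size-b sub-samples.

    Returns (M_hat, p_hat) where p_hat[j] is the inclusion frequency (13)
    of the j-th node in V \\ r, and M_hat is the sufficiency-measure estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(N):
        idx = rng.choice(n, size=b, replace=False)   # a sub-sample D_i from S_b(D)
        for t in neighborhood_estimate(X[idx], r, lam):
            counts[t] += 1
    p_hat = np.delete(counts / N, r)                 # inclusion frequencies over V \ r
    return float(np.max(p_hat * (1.0 - p_hat))), p_hat
```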
Once $\widehat{M}_{r,b,\lambda}(\mathcal{D})$ has been computed, we have the following result, which shows that it is $\epsilon$-close to $M_{r,b,\lambda}$ with high probability, provided we have sufficiently many samples.

**Proposition 1** (Concentration of $\widehat{M}_{r,b,\lambda}(\mathcal{D})$ to $M_{r,b,\lambda}$). For any $\delta \in (0,1]$ and $\epsilon > 0$, if we have $n > \frac{18b}{\epsilon^2}\big[\log p + \log(4/\delta)\big]$ and $N \geq n/b$, then

$$\mathbb{P}\Big[\, \big|\widehat{M}_{r,b,\lambda}(\mathcal{D}) - M_{r,b,\lambda}\big| \leq \epsilon \,\Big] \geq 1 - \delta. \qquad (15)$$

### 3.2. Behavior of the Sufficiency Measure

The key question of interest is whether we can use the sufficiency measure $M_{r,b,\lambda}$ (via its sample estimate $\widehat{M}_{r,b,\lambda}(\mathcal{D})$) to detect hub nodes, which we define specifically as those nodes for which we do not have enough samples for the $\ell_1$-estimator to be sparsistent. Correspondingly, let us define non-hub nodes in this context as those nodes for which we do have enough samples for the $\ell_1$-estimator to be sparsistent. We formalize these notions below.

**Definition 1** (Non-Hub Node vs. Hub Node). Assume that the true parameter set $\theta^*$ satisfies the incoherence conditions (A1) and (A2) for all nodes $r \in V$. Consider any node $r \in V$. It is termed a non-hub node w.r.t. $n$ samples if $\exists$ a regularization parameter $\lambda$ s.t. $(n,\lambda)$ satisfy all conditions of Theorem 1 with constants $c_1, c_2, c_3, c_4$. Otherwise, the node is termed a hub node.

Since the sample complexity of neighborhood estimation via the $\ell_1$-estimator scales cubically with the node-degree (from Theorem 1), hub nodes as we define here correspond loosely to high-degree nodes, but in the sequel, the exact specification of hub and non-hub nodes is as detailed by the definition above. Before we describe the behaviour of $M_{r,b,\lambda}$ for hub nodes and non-hub nodes, we impose the following technical assumptions on the behaviour of $p_{r,b,\lambda}(t)$ needed for our algorithm to work.

**Assumption 1.** $\forall r \in V$, $p_{r,b,\lambda}(t)$ satisfies the following. For fixed $b$ and some constant $c > 0$, let

$$\lambda_{\min}(t) = \min\big\{\lambda \geq 0 \,\big|\, p_{r,b,\lambda}(t) \leq 1 - 2\exp(-c\log p)\big\}, \qquad \lambda_{\max}(t) = \max\big\{\lambda \geq 0 \,\big|\, p_{r,b,\lambda}(t) \geq 2\exp(-c\log p)\big\}. \qquad (16)$$

Then, $\lambda_{\min}(t)$ and $\lambda_{\max}(t)$ are attained at finite values s.t.

(a) For any $t \in V\setminus r$ and $\lambda \in (\lambda_{\min}(t), \lambda_{\max}(t))$, we have
$$p_{r,b,\lambda}(t) \in \big[\,2\exp(-c\log p),\; 1 - 2\exp(-c\log p)\,\big]. \qquad (17)$$

(b) For all $t \notin N^*(r)$,
$$\lambda_{\min}(t) \leq \lambda_{\min} < \lambda_{\max} \leq \lambda_{\max}(t), \qquad (18)$$
for some finite $\lambda_{\min}, \lambda_{\max} \geq 0$ independent of $t$.

(c) For any $t \in V\setminus r$, $\exists\, t' \notin N^*(r)$ s.t. $\lambda_{\min}(t') < \lambda_{\max}(t)$.

Additionally, $p_{r,b,\lambda}(t)$ is a continuous function of $\lambda$.

To build intuition for the assumptions, as well as our analysis in the sequel, it will be instructive to consider the behavior of the inclusion probability $p_{r,b,\lambda}(t)$ as we increase $\lambda$ from zero to infinity. When $\lambda$ is zero, the $\ell_1$-estimator reduces to the unregularized conditional MLE: any variable $t \in V\setminus r$ will always occur in the neighborhood estimate of node $r$, and $p_{r,b,\lambda}(t)$ will be equal to one. As $\lambda$ increases, the inclusion probability in turn reduces, and at a very large value of $\lambda$, the inclusion probability $p_{r,b,\lambda}(t)$ will become equal to zero: this follows from the property of the $\ell_1$-estimator that for a sufficiently large regularization weight the parameter estimate becomes identically zero.
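To make this $\lambda$-path behavior concrete (this is essentially what Figure 1 plots), one could sweep a grid of penalties and trace the estimated sufficiency measure, reusing the hypothetical helper above:

```python
import numpy as np

def sufficiency_curve(X, r, lambdas, b, N):
    """Trace M_hat_{r,b,lambda}(D) over a grid of regularization values.

    For a non-hub node the curve rises and then returns to (near) zero while
    the neighborhood estimate is still correct; for a hub node it typically
    stays up until the estimate becomes empty.
    """
    return np.array([sufficiency_measure_hat(X, r, lam, b, N)[0]
                     for lam in lambdas])

# e.g. lambdas = np.arange(0.005, 1.0, 0.005), as in the experiments below
```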
In the assumptions above, it can be seen that if $\lambda_{\min}(t)$ and $\lambda_{\max}(t)$ exist, then by definition we must have $\lambda_{\min}(t) \leq \lambda_{\max}(t)$. Part (a) of the assumption is a smoothness constraint that ensures that if the probability of inclusion or exclusion of a variable into a neighbourhood gets close to 1, then it stays close to 1, and does not vary wildly. Part (b) ensures that the ranges $[\lambda_{\min}(t), \lambda_{\max}(t)]$ intersect at least for all irrelevant variables $t \notin N^*(r)$. This is a very mild assumption that ensures that the inclusion probability of an irrelevant variable does not stay exactly at one as we increase $\lambda$, and reduces at least very slightly (below the threshold of $1 - 2\exp(-c\log p)$) before other irrelevant variables have their inclusion probability drop from one all the way to zero. Part (c) is a closely related mild assumption that ensures that the probability of inclusion of at least one irrelevant variable would have dropped by a small value from 1 before any other variable has its inclusion probability drop from one all the way to zero. We note that these mild technical assumptions on the inclusion probabilities always hold in our empirical observations.

Armed with these assumptions, we now analyze the behavior of our sufficiency measure $M_{r,b,\lambda}$. Our next proposition shows that there exists at least one bump in the graph of the sufficiency measure against the regularization penalty $\lambda$.

Figure 1. Behaviour of $M_{r,b,\lambda}$ (sufficiency measure vs. regularization parameter) for non-hub nodes and hub nodes in a star graph on $p = 100$ nodes: (a) $M_{r,b,\lambda}$ for a hub node (degree = 19); (b) $M_{r,b,\lambda}$ for a non-hub node (degree = 1).

**Proposition 2** (First Bump). Suppose Assumption 1 holds with constant $c > 0$. Let

$$\gamma = 2\exp(-c\log p)\,\big(1 - 2\exp(-c\log p)\big). \qquad (19)$$

For any node $r \in V$, let $\lambda_l = \inf\{\lambda \geq 0 : M_{r,b,\lambda} \geq \gamma\}$ be the smallest regularization penalty where the sufficiency measure is greater than a small threshold above zero, and $\lambda_u = \inf\{\lambda > \lambda_l : M_{r,b,\lambda} < \gamma\}$ be the next value of the regularization penalty where the sufficiency measure falls below that threshold. Then, (a) the infima above are attained at finite values, and (b) for any $k \in (\gamma, 1/4]$, $\exists\, \lambda \in (\lambda_l, \lambda_u)$ s.t. $M_{r,b,\lambda} \geq k$.

Our next two propositions track the behavior of the $\ell_1$-estimate $\widehat{N}_{b,\lambda}(r;\mathcal{D})$ after the first bump outlined above. The very next proposition provides the behavior for non-hub nodes.

**Proposition 3** (Behavior at $\lambda_u$ for non-hub nodes). Let $r \in V$ be a non-hub node w.r.t. $b$ samples, with the constants for all conditions of Theorem 1 being $c_1, c_2, c_3, c_4$. Let Assumption 1 hold for a constant $c > 1$ with $c < c_1 c_4$, and let $\lambda_u$ be as defined in Proposition 2. Then, $\widehat{N}_{b,\lambda_u}(r;\mathcal{D})$ recovers the neighborhood with high probability:

$$\mathbb{P}\Big[ N^*_{sub}(r) \subseteq \widehat{N}_{b,\lambda_u}(r;\mathcal{D}) \subseteq N^*(r) \Big] > 1 - 2\exp\big(-(c-1)\log p\big),$$

where $N^*_{sub}(r) = \big\{ t \in N^*(r) \,\big|\, |\theta^*_{rt}| \geq c_3\sqrt{d_r}\,\lambda_u \big\}$.

The proposition thus tells us that for non-hub nodes, after the first bump, when the value of $M_{r,b,\lambda}$ becomes very close to zero, the $\ell_1$-estimator recovers $N^*_{sub}(r)$ w.h.p. (as also indicated by Theorem 1). Note that when increasing $\lambda$ further, there would be further bump(s): the value of $M_{r,b,\lambda}$ would rise, but would again drop back to zero; when $\lambda$ is very large, the neighborhood estimate is null, so that the probability for any node to be in the neighborhood will be exactly zero, and the sufficiency measure will be equal to zero. Figure 1(b) demonstrates this behavior in a simulated dataset.
On the other hand, for hub nodes, the behavior of $\widehat{N}_{b,\lambda}(r;\mathcal{D})$ at $\lambda = \lambda_u$, defined in Proposition 2, is given by the following proposition.

**Proposition 4** (Behavior at $\lambda_u$ for hub nodes). Let $r \in V$ be a hub node w.r.t. $b$ samples. Also, let Assumption 1 hold with constant $c > 1$. Then $\widehat{N}_{b,\lambda_u}(r;\mathcal{D})$ excludes irrelevant variables with high probability:

$$\mathbb{P}\Big[ \widehat{N}_{b,\lambda_u}(r;\mathcal{D}) \subseteq N^*(r) \Big] > 1 - 2\exp\big(-(c-1)\log p\big).$$

The proposition thus tells us that for hub nodes, after the first bump, when the value of $M_{r,b,\lambda}$ becomes very close to zero, irrelevant variables are excluded, though there is no guarantee on relevant variables being included. Empirically, in fact, the end of the first bump typically occurs at a very large value of $\lambda$ when all variables are excluded; in particular, the graph of $M_{r,b,\lambda}$ against $\lambda$ typically has a single bump. Figure 1(a) demonstrates this behavior in a simulated dataset.

Propositions 3 and 4 thus motivate using the behaviors of the sufficiency measure as outlined above to distinguish hub nodes from non-hub nodes, and then compute the graph estimate using the neighborhood estimates from the non-hubs alone. This natural procedure is described in Algorithm 2.

**Algorithm 2** Algorithm to compute the neighborhood estimate $\widehat{N}(r)$ for each node $r \in V$, and the overall edge estimate $\widehat{E}$

Input: Data $\mathcal{D} := \{x^{(1)}, \ldots, x^{(n)}\}$, regularization parameters $\Lambda := \{\lambda_1, \ldots, \lambda_s\}$, sub-sample size $b$, no. of sub-samples $N$, thresholds on the sufficiency measure $t_l$ and $t_u$.
Output: An estimate $\widehat{N}(r)$ of the neighborhood for each $r \in V$, and the overall edge estimate $\widehat{E}$.

foreach $r \in V$ do:
- $\forall \lambda \in \Lambda$, compute $\widehat{M}_{r,b,\lambda}(\mathcal{D})$ using Algorithm 1
- $\lambda^* \leftarrow$ smallest $\lambda \in \Lambda$ s.t. $\widehat{M}_{r,b,\lambda}(\mathcal{D}) > t_u$
- $\Lambda \leftarrow \{\lambda \in \Lambda : \lambda > \lambda^*\}$
- $\lambda_0 \leftarrow$ smallest $\lambda \in \Lambda$ s.t. $\widehat{M}_{r,b,\lambda}(\mathcal{D}) < t_l$
- $\widehat{N}(r) \leftarrow \Big\{ t \;\Big|\; \hat{p}_{r,b,\lambda_0}(t;\mathcal{D}) \geq \frac{1 + \sqrt{1 - 4t_l}}{2} \Big\}$

$\widehat{E} \leftarrow \bigcup_{r \in V} \{(r,t) \mid t \in \widehat{N}(r)\}$

The following theorem is a natural corollary of Theorem 1 and Propositions 3 and 4. Note that in the below, we assume that the true parameter set $\theta^*$ satisfies the incoherence conditions (A1) and (A2) for all nodes $r \in V$, and that Assumption 1 holds $\forall r \in V$, with an appropriate constant $c > 2$, satisfying the conditions of Proposition 3 for non-hub nodes.

**Theorem 2** (Guarantee for the estimator of Algorithm 2). Suppose we run Algorithm 2 setting $t_l = 2\exp(-c\log p)\big(1 - 2\exp(-c\log p)\big) + \epsilon$, $t_u = 1/4 - \epsilon$, the sub-sample size $b = f(n)$ (with $\sqrt{n} \leq f(n) < n/2$), and the number of sub-samples $N \geq n/f(n)$, such that

$$n > 18 f(n)\,\big[\log p + \log(4/\delta)\big]/\epsilon^2. \qquad (20)$$

For any degree-value $d \in \{1, \ldots, p\}$ and constant $c' > 0$, denote

$$E^*_d = \Big\{ (s,t) \in E^* \;\Big|\; \min\big(d(s), d(t)\big) \leq d,\ \ |\theta^*_{st}| \geq c'\sqrt{\tfrac{d\,\log p}{f(n)}} \Big\}, \qquad (21)$$

where $d(v)$ corresponds to the degree of vertex $v$ in $E^*$. Then, there exist constants $c''$ and $c'''$ such that if the sub-sample size scales as

$$f(n) > c''\, d^3 \log p, \qquad (22)$$

then the graph structure estimate $\widehat{E}$ of Algorithm 2 satisfies

$$\mathbb{P}\Big[ E^*_d \subseteq \widehat{E} \subseteq E^* \Big] \geq 1 - 2\exp(-c'''\log p) - \delta. \qquad (23)$$

Now, let us define the critical degree $d_c$ of a graph $G^* = (V, E^*)$ as the minimum degree such that the neighborhoods of vertices with at most the said degree cover the whole graph, i.e., the minimum $d$ s.t.

$$\forall (s,t) \in E^*, \ \text{either } d(s) \leq d \ \text{ or } \ d(t) \leq d. \qquad (24)$$

The following corollary then gives the sample complexity for exact recovery of the graph, assuming that the edges have sufficient weight.

**Corollary 2.** Let the conditions of Theorem 2 be satisfied, with $b = f(n)$ as the sub-sample size. Let $d_c$ be the critical degree of the graph $G^*$. Then there exist constants $c', c'', c'''$ s.t. if the sub-sample size scales as

$$f(n) > c'\, d_c^3 \log p, \qquad (25)$$

and $|\theta^*_{st}| \geq c''\sqrt{\tfrac{d_c \log p}{f(n)}}$ for all $(s,t) \in E^*$, then

$$\mathbb{P}\big[ \widehat{E} = E^* \big] \geq 1 - 2\exp(-c'''\log p), \qquad (26)$$

where $\widehat{E}$ is the graph structure estimate from Algorithm 2.

Note that we may choose $f(n) = c\, n^{1-\rho}$, for some value of $\rho \in (0, 0.5]$, as the sub-sample size. The choice of $\rho$ would be governed by $d_c$ for the graph under consideration. For example, if $d_c$ is a constant (e.g., $d_c = 1$ in a star graph), then the optimal choice of $\rho$ would be $0.5$, yielding an overall sample complexity of $\Omega\big((\log p)^2\big)$.
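Putting the pieces of this section together, the following is a minimal sketch of Algorithm 2, built on the hypothetical helpers from the earlier sketches; the grid handling and the treatment of nodes with no detected bump are our own simplifications, not the authors' implementation.

```python
import numpy as np

def algorithm2(X, lambdas, b, N, t_l=0.1, t_u=0.2):
    """Sketch of Algorithm 2: build the graph from 'reliable' neighborhoods only.

    lambdas : increasing grid of regularization values (the set Lambda)
    Returns the estimated (undirected) edge set.
    """
    n, p = X.shape
    include_thresh = (1.0 + np.sqrt(1.0 - 4.0 * t_l)) / 2.0
    edges = set()
    for r in range(p):
        curve, freqs = [], []
        for lam in lambdas:
            m_hat, p_hat = sufficiency_measure_hat(X, r, lam, b, N)
            curve.append(m_hat)
            freqs.append(p_hat)
        # lambda*: first grid value where the measure rises above t_u
        above = [i for i, m in enumerate(curve) if m > t_u]
        if not above:
            continue                  # no bump detected at this grid resolution
        # lambda_0: first later grid value where the measure falls below t_l
        below = [i for i in range(above[0] + 1, len(lambdas)) if curve[i] < t_l]
        if not below:
            continue                  # treated as a hub node: contributes no edges
        others = np.delete(np.arange(p), r)
        for t, f in zip(others, freqs[below[0]]):
            if f >= include_thresh:   # p_hat >= (1 + sqrt(1 - 4 t_l)) / 2
                edges.add((int(min(r, t)), int(max(r, t))))
    return edges
```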
## 4. Experiments

In this section, we present experimental results demonstrating that our algorithm does indeed succeed in recovering graphs with a few hubs.

### 4.1. Synthetic Data

We first performed structure learning experiments on simulated data using 3 types of graphs: (a) a collection of stars with $p = 100$ nodes involving 5 hub nodes of degree $d = 19$, each connected to 19 nodes of degree $d = 1$; (b) a grid graph with 81 nodes ($9 \times 9$), with 2 additional high-degree hub nodes of degree $d = 12$ (so that $p = 83$) attached to random points in the grid; (c) a power-law graph on $p = 100$ nodes generated using the preferential attachment scheme (Barabasi & Albert, 1999). For each graph, we considered a pairwise Ising model with edge weights $\theta^*_{rt} = \frac{\omega}{\max(d_r, d_t)}$, for some $\omega > 0$, where $d_r$ and $d_t$ are the degrees of $r$ and $t$ respectively. For each such Ising model, we generated $n$ i.i.d. samples $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ using Gibbs sampling.

In all our experiments, for our algorithm (denoted as SL1 in our plots), the value of $N$, the number of times to subsample, was fixed to 60. We set the lower and upper thresholds on the sufficiency measure as $t_l = 0.1$ and $t_u = 0.2$. The sub-sample size was set to $b = \min(20\sqrt{n}, n)$ and the set of regularization parameters was taken as $\Lambda = \{0.005, 0.01, 0.015, \ldots, 1\}$. We performed comparisons with the $\ell_1$-estimator (Ravikumar et al., 2010) (denoted as L1 in our plots) and the reweighted $\ell_1$-estimator for scale-free graphs (Liu & Ihler, 2011) (denoted as RWL1 in our plots). For both these methods, the best regularization parameter was chosen using the Bayesian information criterion (BIC) from the grid of regularization parameters $\Lambda$.

Figure 2 shows plots of the Average Hamming Error (i.e., the average number of mismatches from the true graph) against the number of samples for our method and the baselines, averaged over 10 trials.

Figure 2. Plots of Average Hamming Error vs. Number of Samples for L1, RWL1 and SL1: (a) Stars Graph; (b) Hub+Grid Graph; (c) Preferential Attachment Graph.

Since our estimate uses subsamples to compute its sufficiency measure, when the number of samples is extremely low, the deviation of the sample sufficiency-measure estimate from the population sufficiency measure becomes large enough that the resulting mistakes made by our method in designating hubs and non-hubs increase its overall Hamming error. We note, however, that at such extremely low numbers of samples, the overall Hamming error of any estimator is quite high, so that none of the estimators provide useful graph estimates in any case. It can be seen that other than at such extremely few samples, we achieve much lower Hamming error than both L1 and RWL1, which is particularly pronounced for scale-free graphs such as those provided by the preferential attachment model.
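The synthetic samples above are generated by Gibbs sampling from the pairwise Ising model; a minimal sketch of such a sampler is given below. The burn-in and thinning values are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def gibbs_ising_samples(theta, n_samples, burn_in=1000, thin=50, rng=None):
    """Draw approximately i.i.d. +/-1 samples from the Ising model.

    theta : (p, p) symmetric matrix with theta[r, t] = theta_rt (zero diagonal).
    Uses the node-conditional distribution (3):
    P(x_r = +1 | x_rest) = 1 / (1 + exp(-2 * sum_{t != r} theta_rt x_t)).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = theta.shape[0]
    x = rng.choice([-1, 1], size=p)
    samples = []
    total_sweeps = burn_in + n_samples * thin
    for sweep in range(total_sweeps):
        for r in range(p):
            field = theta[r] @ x - theta[r, r] * x[r]   # sum_{t != r} theta_rt x_t
            prob_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[r] = 1 if rng.random() < prob_plus else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples[:n_samples])
```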
### 4.2. Real World Data

We ran our algorithm on a data set consisting of US Senate voting records from the 109th congress (2004–2006) (Bannerjee et al., 2008). It consists of 100 nodes ($p = 100$), corresponding to 100 senators. There are 542 samples, each representing a bill that was put to vote. For each (senator, bill) pair, the vote is recorded as either a $1$ (representing a yes), a $-1$ (representing a no), or a $0$ (representing a missed vote). For the purpose of the experiment, all $0$ entries were replaced by $-1$, as also done in (Bannerjee et al., 2008). Our algorithm was run with the parameters $N = 60$, $t_l = 0.1$, $t_u = 0.2$, $b = 450$ and $\Lambda = \{0.005, 0.01, 0.015, \ldots, 1\}$.

Figure 3(b) shows the graph obtained using our method, while Figure 3(a) shows the graph obtained by running the $\ell_1$-estimator (Ravikumar et al., 2010) with the regularization parameter chosen using the Bayesian Information Criterion (BIC) from the set of regularization parameters $\Lambda$.

Figure 3. Graphs obtained using US Senate voting records data from the 109th congress (Bannerjee et al., 2008): (a) L1 with BIC; (b) Our Algorithm.

We see that the graph obtained using the $\ell_1$-estimator with BIC is much denser than what we obtain. This also corroborates the observation of (Liu et al., 2010) that BIC leads to larger density in high dimensions. A few of the nodes in the graph produced by our algorithm are seen to have degree 0, and are thus disconnected from the graph. This might be because these are higher-degree hub nodes for which the number of samples is not sufficient to provide a reliable estimate of their neighbourhoods vis-à-vis their degree. Overall, the sparse graph we obtained using our reliability-indicator based method suggests the need for such reliability indicators to prevent the inclusion of spurious edge-associations.

## Acknowledgements

We acknowledge support from NSF via IIS-1149803 and DMS-1264033, and ARO via W911NF-12-1-0390.

## References

Abbeel, P., Koller, D., and Ng, A. Y. Learning factor graphs in polynomial time and sample complexity. Jour. Mach. Learning Res., 7:1743–1788, 2006.

Anandkumar, A., Tan, V. Y. F., and Willsky, A. S. High-dimensional structure learning of Ising models: Local separation criterion. Preprint, June 2011.

Bannerjee, O., Ghaoui, L. El, and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485–516, March 2008.

Barabasi, A. L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

Besag, J. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.

Bresler, G., Mossel, E., and Sly, A. Reconstruction of Markov random fields from samples: Some observations and algorithms. In Proceedings of the 11th International Workshop APPROX 2008 and 12th International Workshop RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, pp. 343–356. Springer-Verlag, 2008.

Csiszár, I. and Talata, Z. Consistent estimation of the basic neighborhood structure of Markov random fields. The Annals of Statistics, 34(1):123–145, 2006.

Defazio, A. and Caetano, T. A convex formulation for learning scale-free networks via submodular relaxation. In Advances in Neural Information Processing Systems 24, 2012.
Hero, A. and Rajaratnam, B. Hub discovery in partial correlation graphical models. IEEE Trans. on Information Theory, 58(9):6064–6078, 2012.

Ji, C. and Seymour, L. A consistent model selection procedure for Markov random fields based on penalized pseudolikelihood. Ann. Appl. Probab., 6(2):423–443, 1996.

Liu, H., Roeder, K., and Wasserman, L. Stability approach to regularization selection (StARS) for high dimensional graphical models. In NIPS, pp. 1432–1440, 2010.

Liu, Q. and Ihler, A. T. Learning scale-free networks by reweighted l1 regularization. Journal of Machine Learning Research - Proceedings Track, 15:40–48, 2011.

Liu, Q. and Ihler, A. Distributed parameter estimation via pseudo-likelihood. In International Conference on Machine Learning (ICML), pp. 1487–1494, July 2012.

Peng, J., Wang, P., Zhou, N., and Zhu, J. Partial correlation estimation by joint sparse regression models. JASA, 104(486):735–746, 2009.

Politis, D., Romano, J. P., and Wolf, M. Subsampling. Springer Series in Statistics. Springer Verlag, 1999. ISBN 9780387988542.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.

Tandon, R. and Ravikumar, P. On the difficulty of learning power law graphical models. In IEEE International Symposium on Information Theory (ISIT), 2013.

Yang, E. and Ravikumar, P. On the use of variational inference for learning discrete graphical models. In International Conference on Machine Learning (ICML), 2011.