# true_nonlinear_dynamics_from_incomplete_networks__d7347f3c.pdf

The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

True Nonlinear Dynamics from Incomplete Networks

Chunheng Jiang, Jianxi Gao, Malik Magdon-Ismail Rensselaer Polytechnic Institute 110 8th Street, Troy, NY 12180 {jiangc4, gaoj8}@rpi.edu, magdon@cs.rpi.edu

We study nonlinear dynamics on complex networks. Each vertex i has a state xi which evolves according to a networked dynamics to a steady-state x i . We develop fundamental tools to learn the true steady-state of a small part of the network, without knowing the full network. A naive approach and the current state-of-the-art is to follow the dynamics of the observed partial network to local equilibrium. This dramatically fails to extract the true steady state. We use a mean-ﬁeld approach to map the dynamics of the unseen part of the network to a single node, which allows us to recover accurate estimates of steady-state on as few as 5 observed vertices in domains ranging from ecology to social networks to gene regulation. Incomplete networks are the norm in practice, and we offer new ways to think about nonlinear dynamics when only sparse information is available.

1 Dynamical Systems on Incomplete Networks The fundamental task in learning is to infer unknown quantities of interest from incomplete data. We study learning complex nonlinear dynamics on networks from incomplete data. Such problems are fundamental because complex nonlinear dynamical systems are ubiquitous, often modeled as coupled nonlinear ordinary differential equations (ODEs), for example epidemic spreading (Pastor-Satorras and Vespignani 2001), Michaelis-Menten gene regulatory dynamics (Alon 2006; Gao, Barzel, and Barab asi 2016), Lotka-Volterra ecological dynamics (Lotka 1910). A graph G = (V, E) with n n (weighted) adjacency matrix A is the backbone on which the dynamical equations are coupled together. We consider a general dynamics in which each vertex i of G has a state xi which evolves according to a selfdriving force and a sum of interaction forces over neighbors

xi = f(xi) +

j V Aijg(xi, xj). (1)

The functions f( ) and g( , ) are general and usually nonlinear, and the positive weighted connectivity matrix A modulates the interactions between vertices. Several instances of

Copyright c 2020, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

such dynamics with appropriate choices of f( ) and g( , ) are shown in Table 1. From an initial state, one can step forward in time, simulating the dynamics in (1) until convergence to equilibrium states x i . The complete information setting in (1) is unrealistic, and we must accept that in practice, only part of a network can be measured. Hence, we assume that a subgraph with m nodes G(s) = (V (s), E(s)) is known, with corresponding m m adjacency matrix A(s), where V (s) V and E(s) E (s for sampled). This paradigm is unavoidable when the full network is unmeasurable (for example, protein-protein interactions, metabolic and terrorist networks (Stumpf and Wiuf 2005)). The paradigm is also useful when the full network is too large to handle, where one can deliberately sample a much smaller subnetwork for the analysis. Analyzing a full network from a sampled subnetwork has been studied in several contexts, e.g. to estimate average or total degree (Kurant, Markopoulou, and Thiran 2011); degree distributions and clustering coefﬁcients (Stumpf and Wiuf 2005; Gjoka, Kurant, and Markopoulou 2013; Seshadhri, Pinar, and Kolda 2014); shortest paths (Leskovec and Faloutsos 2006); motif counts (Klusowski and Wu 2018); vertex and edge counts (Katzir, Liberty, and Somekh 2011; Kurant, Butts, and Markopoulou 2012). Our task is to estimate true steady-states x for vertices in V (s), despite only seeing an incomplete network G(s). The state-of-the-art naive approach is to simulate (1) on the subgraph G(s). For example, one may collect a social network from Boston and run the epidemic model (Table 1) to obtain the probability of each person to be infected. The results are a dramatic and universal disaster, because the sub-social network of Boston is just a small part of a vast social network, and the people in Boston interact with people outside. That the naive method is bad is not surprising because the essence of the dynamics in (1) are the interactions, and the subgraph is missing many of those interactions. Hence, not observing a large part of the network appears to be an insurmountable hurdle to learning the true steady states on the observed (small) part of the network. We develop a methodology to accurately predict true steady states using only information local to G(s). We demonstrate the power of our approach in Figure 1, for an

Applications Vertex State at vertex i Dynamics Ecological (1949; 2006) Species Abundance xi = Bi + xi(1 xi

j Aij xixj Di+Eixi+Hjxj Regulatory (2006; 2008) Gene Expression level xi = Bxf i +

j Aij R xh j xh j +1 Epidemic (2001; 2004; 2005) Person Infection rate xi = Bxi +

j Aij R(1 xi)xj

Table 1: Examples of real systems with nonlinear interaction dynamics.

Full Real Network

State Value

True Steady State

Predicted Steady State

This Paper Naive High Naive Low

(Above) Trajectories of the 5 vertices in the sampled graph G(s). (Left) Scatter plot that compares steadystate values. Our method achieves near perfect prediction on the observed 5 nodes. The naive method gets all state values wrong, but even more worrying is that a single stable equilibrium bifurcates into two equilibria, and the steady-state depends on initial conditions. In the low state, all species go extinct. Naively simulating the nonlinear coupled ODE on the subgraph is a disaster.

Figure 1: Predicting steady-state abundances of 5 species interacting in a larger 97 species ecological network. Predictions use only the interactions of those ﬁve species (incomplete information).

ecological network where vertices are species and states are species-abundance. This ecological network of 97 species follows the symbiotic dynamics in Table 1, see (Gao, Barzel, and Barab asi 2016). Let us see what happens when a biologist who is interested in ﬁve species collects the relevant 5-vertex subgraph G(s) and performs a naive simulation on the subgraph to get a steady-state. The results in Figure 1 are as expected. The naive simulation is wrong. Even worse, it cannot identify the number of equilibria from the subgraph: the full network has one attractor, but the subgraph has two. It means that, with the wrong initial conditions, the biologist would conclude that the ﬁve species are going extinct, when in fact they are all doing ﬁne in the real network. Our Contributions. We give the ﬁrst method to accurately learn steady-state dynamics when only a part of the network is observed. This is remarkable given the inherent interactive nature of the dynamics. The result in Figure 1 demonstrates that our methods extract very close approximations to the true steady state dynamics in the full network when just 5% of vertices are observed. This surprising result has the potential for huge impact since up to now, the state-of-theart is the naive approach which produces completely wrong conclusions. There are three main ideas behind our method.

A mean ﬁeld approximation to account for the impact of the unobserved part of the network.

Summarizing the mean ﬁeld impact using a resilience parameter, β, which depends only on network topology. How the resilience impacts the ﬁnal outcome depends on

the coupled nonlinear dynamics through f( ) and g( , ).

Estimating the full network s resilience from the observed subgraph. A network s resilience is important in other contexts. The resilience characterizes a complex system s ability to retain its basic functionality under edge and vertex faults. Hence, our estimates of resilience from incomplete information are of independent interest.

Combining these three ideas, we obtain accurate estimates of the steady-states as in Figure 1. Our estimates are nearexact matches to the true steady-states.

The true dynamics on the full network G are governed by the coupled nonlinear dynamics in (1). We represent G by its adjacency matrix A, and assume that the total size of the network, n, is known. The observed sampled subgraph G(s) = (V (s), E(s)) has adjacency matrix A(s). There are many ways to sample vertices and edges from a graph. We focus on two natural sampling methods which are reasonable models of how the incomplete data is often obtained.

(Random Vertex Sampling) Form the induced subgraph for randomly sampled vertices. We assume the degrees of the sampled vertices are also known. For example, we know the number of friends each person has in a social network and who is friends with whom among the sampled subnetwork.

(Random Walk) A random vertex is sampled. At each step, an available edge is followed to sample a new vertex. The degrees of the vertices are implicitly available since, at each step, the available edges must be known.

Our method can be extended to other sampling schemes, such as edge sampling, degree biased sampling, Metropolis Hastings random walks. The main property we require of the sampling is that speciﬁc topological parameters of the graph can be reliably estimated from the sample. For i V (s), the steady-state value in G is denoted x i . We denote by z i the steady-state value obtained using the partially observed network. We loosely use zi to refer both to the vertex and state variable at the vertex. The naive method solves the same system in (1) for A(s) instead of A. That is, for i V (s), z i is the steady-state of the dynamical system

(Naive method) zi = f(zi) +

j V (s) A(s) ij g(zi, zj). (2)

This naive approach produces incorrect conclusions, yet it is common practice because that s all practitioners have. To get correct results, one must account for missing data.

3 Mean Field Approximation and Resilience The main idea is shown in the (sampled) subgraph in Figure 2(a). We focus on one vertex in V (s), z2. Vertex z2 interacts with its neighbors z1 and z4 in the subgraph, and its neighbors outside the subgraph. All the subgraph nodes z1, . . . , z5 are in a similar situation. Now ﬁx the value for each external neighbor of z2 to its true steady-state value in the full network, shown as x 1, x 2, x 3. Do the same for the external neighbors of all the subgraph vertices z1, . . . , z5. As far as the subgraph is concerned, all external neighbors have converged to their steady-state values and are providing the right interactive feedback to all subgraph nodes. The subgraph is effectively isolated from the rest of the network and will converge to steady-states of the full network. We state the next theorem without proof. The preceding discussion essentially amounts to the proof.

x 1 x 2 x 3

xeff xeff xeff

Figure 2: Mean ﬁeld approximation.

Theorem 1. The steady-states z i of the dynamical system

zi = f(zi)+

j V (s) A(s) ij g(zi, zj)+

j V (s) Aijg(zi, x j) (3)

recover one of the true steady-states of the full network.

We cannot implement (3), because x i are unknown. In the mean ﬁeld approximation, we replace x i by an average

inﬂuence xeff, see Figure 2(b). Of course it is an approximation, but it works well for analyzing complex interacting systems such as spin-systems (Edwards and Anderson 1975). The iteration-0 estimate is xeff for all states. Fix the states for all vertices but (say) z2 to xeff, and ﬁnd the steady-state for z2, as in Figure 2(b). Now, z2 is effectively isolated from the rest of the network as all its neighbors are ﬁxed at xeff. So, z2 evolves to a steady-state, following the dynamics z2 = f(z2) + 5g(z2, xeff). Repeat for each subgraph-vertex to arrive at z(1) i , the steady-states of zi = f(zi) + δig(zi, xeff), (4) where δi is the degree of zi in G. These equations are uncoupled since xeff is ﬁxed. This is the method used in (Gao, Barzel, and Barab asi 2016) to analyze the dynamics on the full network by reducing to n uncoupled ODEs. We now iterate further. Suppose the steady-state from iteration τ is z(τ) i . We obtain z(τ+1) i , the approximation at iteration τ + 1 as the steady-state solution to the uncoupled equations

zi = f(zi) +

j V (s) A(s) ij g(zi, z(τ) j ) +

j V (s) Aijg(zi, xeff). (5) Comparing (5) to the exact solution in (3), the external forces are replaced by an effective external force and the interaction term is approximated by an interaction to a previous steady-state. Iterating to convergence, z(τ) i converges to a steady-state z i which solves the coupled system

zi = f(zi) +

j V (s) A(s) ij g(zi, zj) + (δi δ(s) i )g(zi, xeff), (6)

where δ(s) i is the degree of zi in G(s). The naive method in (2) resembles (6) with one crucial difference, an additional term for the external force on a vertex. The entire system in (6) corresponds to adding just one more vertex to our subgraph, whose value is ﬁxed at xeff, with a weighted edge to zi of weight A(s) i,xeff = δi δ(s) i . We show this augmented graph for our example in Figure 2(c).

To account for missing vertices, add one vertex to the subgraph, ﬁxed to xeff and add degree-weighted edges from xeff to all vertices in the subgraph.

Next, we discuss how to compute xeff to estimate the meanﬁeld interaction with unseen vertices. The complication is that this estimate cannot depend on the missing information. This is possible because xeff depends on the missing information only through global topological statistics of the network, and we can estimate those topological statistics when the subgraph is sampled appropriately.

3.1 Computing the Effective External Impact There are two unknown quantities in (6). The degree δi and xeff. We now discuss xeff, following the general approach in (Gao, Barzel, and Barab asi 2016). Consider vertex i and the interaction term

j Aijg(xi, xj) in (1), where Aij is the inﬂuence j has on i. The in-degree sin i =

j Aij and the out-degree sout i =

j Aji. Assuming Aij 0, the interaction term is the in-degree times an average interaction,

j Aijg(xi, xj) = sin i

j Aijg(xi, xj)

Here, the in-degree sin i captures the idiosyncratic part, and the average g( , ) captures the network effect. Our ﬁrst mean-ﬁeld approximation is to replace local averaging with global averaging, which approximates the network-impact on a vertex as nearly homogeneous. Speciﬁcally, we have

j Aijg(xi,xj)

ij Aijg(xi,xj)

ik Aik = 1T Ag(xi,x)

1T A1 , (8)

where the vector g(xi, x) has jth component g(xi, xj). Deﬁne an averaging linear operator

LA(z) = 1T A

1T A1z = sout z

sout 1, (9)

which is a weighted average of the entries in z. Our meanﬁeld approximation results in the approximate dynamics

xi = f(xi) + sin i LA[g(xi, x)]. (10)

In the ﬁrst order linear approximation, we take LA inside g. Our second mean-ﬁeld approximation is that the average of external interactions is approximately the interaction with the average. That is LA[g(xi, x)] g(xi, LA(x)) and

xi = f(xi) + sin i g(xi, LA(x)), (11)

where LA(x) is a global state. Let xav LA(x). Applying LA to both sides of (11) gives

xav = LA[f(x)] + LA[sing(x, xav)]. (12)

Extensive tests in (Gao, Barzel, and Barab asi 2016) show that the in-degrees sin i and the interactions g(xi, xav) are roughly uncorrelated, so the LA-average of the product is the product of LA-averages. Thus, our third mean-ﬁeld approximation is LA[sing(x, xav)] LA(sin)LA[g(x, xav)]. Using the linear approximation again, we take the LA-average inside f and g

xav = f(LA(x)) + LA(sin)g(LA(x), xav). (13)

Now we have a dynamical system for xav,

xav = f(xav) + βg(xav, xav), (14)

where the resilience β = LA(sin). For undirected graphs, β =

i δi = δ2 / δ . The steady-state of (14) is the external effective impact, xeff. Plugging it into (11) gives an uncoupled ODE for xi,

xi = f(xi) + sin i g(xi, xeff). (15)

In the mean-ﬁeld approximation, g(xi, xj) in (1) is replaced by an interaction with a mean-ﬁeld external world g(xi, xeff) and the number of neighbors impacting xi is captured by sin i . To approximately obtain the steady-states of the system, one ﬁrst solves the ODE in (14) to get xeff, and then n uncoupled ODEs at each vertex to get xi, which only depends on sin i if given xeff. The method works well because the mean-ﬁeld approximations only need to hold at the steady-state. Hence, we can recover the steady-state for any vertex (for example the sampled vertices) from accurate estimates of degrees sin i (δi in the undirected case) and the resilience parameter β.

3.2 Evaluating the Mean-Field Approximation One non-trivial implication of our mean-ﬁeld approach is that the steady-states are approximated by solving the uncoupled equations in (4). The parameter xeff in (4) only depends on β = δ2 / δ . So the ODE in (4) only depends on the degree sequence of the original network, which means the steady-states can be approximated by knowing only a network s degree sequence. We verify this in Figure 3(a), which compares true steady-states in the ecological network with steady-states of a random network that preserves the degree sequence. The near-perfect matching of the steadystates is empirical evidence for our mean-ﬁeld approach.

3.3 Accuracy of Our Approach The mean-ﬁeld approximation essentially replaces individual interactions with an average, and amounts to a homogeneity assumption. The more homogeneous a network, the more accurate our approximations. Indeed, the method of solving for xeﬀas a steady state of (14) and then for x i as steady states of (15) produces an exact solution for a regular network (perfectly homogeneous). The next theorem says that our method in (6) is perfect for such regular networks.

Theorem 2. For a k-regular network, the steady-states z i obtained by solving the dynamical system in (6) with xeﬀ obtained as a steady-state of (14) with β = k recovers an exact steady-state x i .

Proof. (Sketch) xeff is a steady state of (14) with β = k, hence f(xeﬀ) + kg(xeﬀ, xeﬀ) = 0 as x = 0. We show that x i = xeﬀfor i [1, N] is a ﬁxed point of the system. Since node i has k neighbors and each of them has state x j = xeﬀ, xi = f(xi) + βg(xi, xeﬀ) = 0 when xi = xeﬀ. Lastly, (6) converges to (14), because xi = xeﬀ.

Essentially, our approach is perfect for regular networks. The degree of inhomogeneity in the network is therefore a parameter which controls the quality of approximation. We now show some experimental results with synthetic networks where we can control for the inhomogeneity. Even for extremely inhomogeneous networks, our approach suffers little, indicating the strength of the mean-ﬁeld approach.

Erd os-R enyi scale-free

Relative error

Heterogeneity

Figure 3: (a) G(rewired) vs. G; (b) Impact of heterogeneity.

To evaluate the impact of heterogeneity, we use 15 random 1000-vertex networks with different heterogeneities H measured by the relative degree dispersion, H [ δ2 δ 2]/ δ . Relative state estimation errors for our method

in (6) for an observed subgraph of just 10 vertices and estimates of β and xeff from Section 4 are shown in Figure 3(b). As expected, performance deteriorates as heterogeneity increases. The far right is the very heterogeneous scale-free network and leftmost is the nearly homogeneous Erd os R enyi network. Even for very heterogeneous networks, the relative error is only about 3%.

4 Estimating the Resilience

To get xeff, we solve for the steady-states of (14), so we need to estimate the resilience β = δ2 / δ , a topological statistic of the full network. Resilience is well studied in science and engineering, arising in many contexts because it captures a complex system s ability to retain its basic functionality under faults (Gao, Barzel, and Barab asi 2016). Understanding a network s resilience is essential for us to evade the consequences of resilience loss, such as malfunction of gene regulation networks, cascading failures in technological systems, mass extinctions in ecological networks. We estimate resilience β from the sampled subgraph G(s). In other contexts where it is important to measure resilience, the full network topology is also often not available perhaps due to privacy, or, as is usually the case in practice, the inability to measure the full network (for example, protein-protein interactions, metabolic and terrorist networks (Stumpf and Wiuf 2005)). Despite advances in graph sampling, we are not aware of accurate estimators of resilience from an incomplete view of the network. The naive resilience-estimator treats the observed subnetwork as the full network and estimates β with β(s). We derive corrections to this naive estimate for a variety of sampling schemes, and give analysis of the estimation accuracy (bias and variance) and the sample complexity. This general infrastructure enables one to manage a network s resilience from incomplete data, and may be of independent interest. We treat resilience estimation for undirected, unweighted graphs. However, our results can be easily extended to the directed and weighted cases. This part of our work falls into the general area of graph analysis from incomplete (sampled) subgraphs. Typical sampling methods are vertexbased, exploration-based (see for example (Leskovec and Faloutsos 2006; H ubler et al. 2008; Ahmed, Neville, and Kompella 2011)) and edge-based. We focus on the ﬁrst two sampling schemes since they are natural ways to sample a graph in practice, however, the results do extend to edgebased sampling. The birds-eye view of the workﬂow is

G, β G(s), β(s) ˆβ(G(s), β(s), Φ) β, sampler Φ this work

where β(s) is the naive resilience estimator that treats G(s)

as if it were the complete network. Our ﬁnal estimator ˆβ can depend on the sampling method Φ.

4.1 Estimating β for Random Subgraphs

We consider several types of subgraph sampling. The simplest is to sample m vertices uniformly without replacement and measure the degree δi of each vertex. The sample averages δ2 s and δ s are unbiased and concentrate at the

true averages δ2 and δ . This means ˆβ = δ2 s/ δ s is asymptotically unbiased and concentrates at β. We use s to denote the average over the subgraph vertices in V (s). An interesting variant is to sample vertices with a degreebias, so the probability to sample vertex i is δi/ i δi. In this case, the expected degree of a sampled node is i δ2 i /

i δi = β, hence ˆβ = δ s is an unbiased estimate of β. More generally, one can sample vertices with an arbitrary degree-biased importance distribution q(δ). The analysis of this more general case, including a bias-variance analysis is postponed to a full version of this work. Now, suppose you only measure δ(s) i , the induced degrees in the subgraph, not true degrees. To implement (6) we need not only β, but also the degree δi. Let β(s) be the resilience of the induced subgraph, β(s) = δ2 s/ δ s. A crude analysis just uses concentration twice. First, the induced degree δ(s) i is a sum of m 1 Bernoulli random variables sampled uniformly without replacement from a population of n 1 Bernoulli values in which δi of them are 1. So, δ(s) i /(m 1) concentrates at δi/(n 1). We need concentration for each of the m sampled vertices, which, using a union bound plus a Hoeffding inequality, has a failure probability 2m exp( Ω(mε2)) for relative error ε. So, the estimates ˆδi = n 1 m 1δ(s) i all concentrate with relative error at their respective δi and hence we get a resilience estimate ˆβ = ˆδ2 / ˆδ = n 1

m 1β(s) that concentrates at β. A more re-

ﬁned estimate based on E[(δ(s) i )2] is ˆβ = n 2

m 2β(s) n m

n 2 . Another popular way to sample a subgraph is using a random walk: start at a random vertex and move from one vertex to a neighbor, chosen uniformly at random (nodes could be revisited). Such a walk has a stationary distribution where a node s sampling probability is proportional to its degree (Kurant, Markopoulou, and Thiran 2011). From the discussion of degree-biased sampling, the estimator of resilience is ˆβ = δ s when the degrees in the full network are known. When only the induced subgraph is known, then, as with the induced subgraph from vertex sampling, a correction factor is needed. We approximate the sampling as independent degree-biased sampling in a random rewiring model, and get a correction factor n/m (see full version). We summarize our discussion of random vertex sampling (VS) and random walks (RW) in a table.

Sampling Measured ˆδi ˆβ VS V (s), E(s), δi δi δ2 s/ δ s Ind-VS V (s), E(s), δ(s) i n 1 m 1δ(s) i n 2 m 2β(s) n m

n 2 RW V (s), E(s), δi δi δ s Ind-RW V (s), E(s), δ(s) i n δ

m ˆβ δ(s) i n m δ s

For i V (s), δi is the degree in G, δ(s) i is the degree in G(s)

and the resilience of the subgraph is β(s) = δ2 s/ δ s. The notation x s means

i V (s) xi/m.

For Ind-RW, the network average degree δ is needed to estimate true degrees (in addition to n). There are methods to estimate δ (Dasgupta, Kumar, and Sarlos 2014; Zhang,

Applications Networks (|V |, |E|) Ecological ENet1(270,8074) (Arroyo, Armesto, and Primack 1985; Gao, Barzel, and Barab asi 2016) ENet2(97,972) (Clements and Long 1923; Gao, Barzel, and Barab asi 2016) Gene Regulation MEC(2268,5620) (Lee et al. 2002; Gao et al. 2014) TYA(662,1062) (Lee et al. 2002; Gao et al. 2014) Epidemic Dublin(410,2765) (Rossi and Ahmed 2015) Email(1133,5451) (Guimera et al. 2003; Kunegis 2013)

Table 2: List of networks in our evaluation

1 10 Ecological

5% 10 Regulatory

5% 10 Epidemic

Predicted Steady State VS

0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 True Steady State

Figure 4: Predicted steady states when vertex degrees δi are available. The diagonal line is perfect prediction. Our method (red) is nearly perfect, while the naive method (blue) is a disaster. Often, the steady states just converge to 0 for the naive method.

Kolaczyk, and Spencer 2015; Ribeiro and Towsley 2010; Leskovec and Faloutsos 2006), but they use more powerful queries than the induced subgraph from a simple random walk can provide. The complication with the simple random walk is that it is a degree-biased sampling of vertices. This bias can be corrected with a Metropolis-Hastings random walk (H ubler et al. 2008), but that requires knowledge of vertex degrees. One can analyze the bias and variance of our estimators using the importance sampling framework that samples vertices according to the proposal distribution q (Theodoridis 2015; Cochran 1977; Wu 1982) (uniform or degree-biased). We postpone the details to the full paper.

5 Results We tested our approach on the three popular dynamical systems in Table 1 and two corresponding networks for each dynamical system (see Table 2). Each dynamical system contains several parameters which are set as in (Gao, Barzel, and Barab asi 2016) for ecology and gene regulation, and as in (Barzel and Barab asi 2013) for epidemics. We compare the performance of different subgraph-sampling (RW, VS, Ind-RW, Ind-VS), with sample size m = 10 (constant size) and m = 0.05n (constant fraction). We average results over several subgraphs. When a vertex is sampled in multiple subgraphs, we report

the average predicted steady-state. Given a subgraph, we solve (6) to estimate the true steadystates of vertices in the subgraph, for which we need δi (true degree) and xeff (effective external state). To get xeff, we need β to solve (14). For δi and β, we use the estimates in Section 4.1. In solving (14), there can be one stable attractor or more than one stable attractors (there can also be unstable attractors). When there are more than one stable attractor, the system can equilibrate at multiple steady-states. First, we consider sampling methods that obtain vertex degrees δi (RW, VS), where only β needs to be estimated. The results in Figure 4 show that our approach (red) is remarkable at revealing the true steady-state, even from tiny subgraphs. We get near perfect results from just 10 vertices of a multi-thousand vertex graph. This is all the more impressive given that the current state-of-the-art, i.e., the naive approach (blue), is more or less a disaster. We emphasize that the naive method is the only method currently used by researchers in the ﬁeld, and it simply fails. In the ecological application, the naive method even identiﬁes multiple steady-states, one of which sends all species to extinction. When only induced subgraphs are available (Ind-RW, Ind VS), see Figure 5, our methods are still much better than naive, but the performance drops. Estimating individual degrees from induced degrees is hard. Parameters like the resilience β are global and can be extracted accurately from

1 10 Ecological

5% 10 Regulatory

5% 10 Epidemic

Predicted Steady State Ind-VS

0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 True Steady State

Figure 5: Predicted steady-states when only the induced subgraph is available and vertex degrees must be estimated. Our method (red) signiﬁcantly outperforms naive, but performance drops compared to Figure 4, because estimating vertex degrees from observed induced degrees is tough.

subgraphs. Local parameters like vertex degree get severely distorted. More details on how degree and resilience estimations are affected by induced subgraph sampling is deferred. For vertex sampling, our estimator is unbiased, but for random walks, our estimators are based on the approximation of independent degree-biased sampling. This approximation can break down.

Open Question. How should one estimate vertex degrees from induced subgraphs of random walks? How does the estimator depend on the network properties?

We also observe from Figure 5 that for induced subgraphs, the performance of our approach (and the naive method) depends strongly on the subgraph sampling methods, depending on the network structure and the nonlinear dynamics.

Open Question. What are the factors which inﬂuence the choice of subgraph sampling method? For example when is RW better than VS?

6 Discussion

We addressed a prevalent problem. Consider this scenario. A biologist has the favorite part of an ecosystem, their favorite 10-species, and carefully collects their relationships which are summarized in the adjacency matrix A(s). The biologist even knows how species interact, the dynamical system (1). The biologist carefully simulates the system to steady-state and ﬁnds that all species are going extinct. This is scary, but the result is just plain wrong. You cannot restrict a coupled nonlinear system to your favorite part of the network and expect even close to correct conclusions by just analyzing that part in isolation. One solution is to collect the full network and analyze the full system. There are two problems. First, we can t collect the full network. Second, simulating the full system to equilibrium is prohibitive in terms

of convergence time. So the only feasible solution for learning the true steady states of the observed incomplete network is to somehow account for the external impact on the local system. This was our approach. In a mean ﬁeld approximation, the external impact reduces to a single parameter xeff which depends only on the network s resilience β, a topological parameter. We showed how to estimate resilience, depending on how the subgraph is sampled. Our results on real networks with corresponding dynamics gave spectacular success we accurately recover steady-states from just 10-vertex subgraphs of thousand-vertex networks.

There are several interesting future questions. The natural one is to ﬁnd improved estimates of resilience that extend to other sampling methods (snowball sampling, edge sampling etc). A critical direction, which we address in forthcoming work, is the inverse problem. Suppose the steadystates are known. For example, the abundance of your 10 favorite monkey species is known. Can one infer the correct dynamical system f( ), g( , )? Currently, the dynamics are ﬁt to the observed steady-states for the partial network (Ghahramani and Roweis 1999; Schmidt and Lipson 2009; Ionides, Bret o, and King 2006; Bongard and Lipson 2007; Brunton, Proctor, and Kutz 2016). This is wrong and will produce the incorrect dynamics. It is no surprise, therefore, that the inferred dynamics keeps changing as more data is collected (Schmidt and Lipson 2009; Ionides, Bret o, and King 2006; Bongard and Lipson 2007; Brunton, Proctor, and Kutz 2016). It is absolutely critical to account for the external impact in the inference process.

7 Acknowledgements

This work was partially supported by the ONR Contract N00014-15-1-2640.

Ahmed, N.; Neville, J.; and Kompella, R. R. 2011. Network sampling via edge-based node selection with graph induction. Computer Science Technical Reports.

Allee, W. C.; Park, O.; Emerson, A. E.; Park, T.; Schmidt, K. P.; et al. 1949. Principles of animal ecology. Technical report, Saunders Company Philadelphia, Pennsylvania, USA.

Alon, U. 2006. An introduction to systems biology: design principles of biological circuits. Chapman and Hall/CRC.

Arroyo, M. T. K.; Armesto, J. J.; and Primack, R. B. 1985. Community studies in pollination ecology in the high temperate andes of central chile ii. effect of temperature on visitation rates and pollination possibilities. Plant Systematics and Evolution 149(3-4):187 203.

Barzel, B., and Barab asi, A.-L. 2013. Universality in network dynamics. Nature Physics 9(10):673.

Bongard, J., and Lipson, H. 2007. Automated reverse engineering of nonlinear dynamical systems. PNAS 104(24):9943 9948.

Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. Discovering governing equations from data by sparse identiﬁcation of nonlinear dynamical systems. PNAS 113(15):3932 3937.

Clements, F. E., and Long, F. L. 1923. Experimental pollination: an outline of the ecology of ﬂowers and insects. Number 336. Carnegie Institution of Washington.

Cochran, W. G. 1977. Sampling technique.

Dasgupta, A.; Kumar, R.; and Sarlos, T. 2014. On estimating the average degree. In Proc. Conf. WWW, 795 806.

Dodds, P. S., and Watts, D. J. 2005. A generalized model of social and biological contagion. Journal of Theoretical Biology 232(4):587 604.

Edwards, S. F., and Anderson, P. W. 1975. Theory of spin glasses. Journal of Physics F: Metal Physics 5(5):965.

Gao, J.; Barzel, B.; and Barab asi, A.-L. 2016. Universal resilience patterns in complex networks. Nature 530(7590):307.

Gao, J.; Liu, Y.-Y.; D souza, R. M.; and Barab asi, A.-L. 2014. Target control of complex networks. Nature Communications 5:5415.

Ghahramani, Z., and Roweis, S. T. 1999. Learning nonlinear dynamical systems using an em algorithm. In Advances in NIPS, 431 437.

Gjoka, M.; Kurant, M.; and Markopoulou, A. 2013. 2.5 k-graphs: from sampling to generation. In INFOCOM 13, 1968 1976. IEEE.

Guimera, R.; Danon, L.; Diaz-Guilera, A.; Giralt, F.; and Arenas, A. 2003. Self-similar community structure in a network of human interactions. Phys. Rev. E 68(6):065103.

H ubler, C.; Kriegel, H.-P.; Borgwardt, K.; and Ghahramani, Z. 2008. Metropolis algorithms for representative subgraph sampling. In ICDM 08, 283 292. IEEE.

Hufnagel, L.; Brockmann, D.; and Geisel, T. 2004. Forecast and control of epidemics in a globalized world. PNAS 101(42):15124 15129.

Hui, C. 2006. Carrying capacity, population equilibrium, and environment s maximal load. Ecological Modelling 192(1-2):317 320.

Ionides, E. L.; Bret o, C.; and King, A. A. 2006. Inference for nonlinear dynamical systems. PNAS 103(49):18438 18443.

Karlebach, G., and Shamir, R. 2008. Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology 9(10):770.

Katzir, L.; Liberty, E.; and Somekh, O. 2011. Estimating sizes of social networks via biased sampling. In Proc. Conf. WWW, 597 606. ACM. Klusowski, J. M., and Wu, Y. 2018. Counting motifs with graph sampling. In Bubeck, S.; Perchet, V.; and Rigollet, P., eds., Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, 1966 2011. PMLR. Kunegis, J. 2013. KONECT: The Koblenz Network Collection. In Proc. Conf. WWW, 1343 1350. ACM. Kurant, M.; Butts, C. T.; and Markopoulou, A. 2012. Graph size estimation. ar Xiv preprint ar Xiv:1210.0460. Kurant, M.; Markopoulou, A.; and Thiran, P. 2011. Towards unbiased BFS sampling. IEEE Journal on Selected Areas in Communications 29(9):1799 1809. Lee, T. I.; Rinaldi, N. J.; Robert, F.; Odom, D. T.; Bar-Joseph, Z.; Gerber, G. K.; Hannett, N. M.; Harbison, C. T.; Thompson, C. M.; Simon, I.; et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298(5594):799 804. Leskovec, J., and Faloutsos, C. 2006. Sampling from large graphs. In Proc. Conf. ACM SIGKDD, 631 636. ACM. Lotka, A. J. 1910. Contribution to the theory of periodic reactions. The Journal of Physical Chemistry 14(3):271 274. Pastor-Satorras, R., and Vespignani, A. 2001. Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86(14):3200. Ribeiro, B., and Towsley, D. 2010. Estimating and sampling graphs with multidimensional random walks. In Proc. Conf. Internet Measurement, 390 403. Rossi, R., and Ahmed, N. 2015. The network data repository with interactive graph analytics and visualization. In AAAI. Schmidt, M., and Lipson, H. 2009. Distilling free-form natural laws from experimental data. Science 324(5923):81 85. Seshadhri, C.; Pinar, A.; and Kolda, T. G. 2014. Wedge sampling for computing clustering coefﬁcients and triangle counts on large graphs. Statistical Analysis and Data Mining: The ASA Data Science Journal 7(4):294 307. Stumpf, M. P., and Wiuf, C. 2005. Sampling properties of random graphs: the degree distribution. Phys. Rev. E 72(3):036118. Theodoridis, S. 2015. Machine learning: a Bayesian and optimization perspective. Academic Press. Wu, C.-F. 1982. Estimation of variance of the ratio estimator. Biometrika 69(1):183 189. Zhang, Y.; Kolaczyk, E. D.; and Spencer, B. D. 2015. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Annals of Applied Statistics 9:166 199.