
Active Learning based Structural Inference

Aoran Wang 1 Jun Pang 1 2

In this paper, we propose a novel framework, Active Learning based Structural Inference (ALaSI), to infer the existence of directed connections from the observed agents' states over a time period in a dynamical system. With the help of deep active learning, ALaSI is competent in learning the representation of connections with a relatively small pool of prior knowledge. Moreover, based on information theory, the proposed inter- and out-of-scope message learning pipelines are remarkably beneficial to structural inference for large dynamical systems. We evaluate ALaSI on various large datasets, including simulated systems and real-world networks, to demonstrate that ALaSI is able to outperform previous methods in precisely inferring the existence of connections in large systems under either supervised or unsupervised learning.

1 Introduction

Dynamical systems are commonly observed in the real world, including physical systems (Kwapień & Drożdż, 2012; Ha & Jeong, 2021), biological systems (Tsubaki et al., 2019; Pratapa et al., 2020), and multi-agent systems (Brasó & Leal-Taixé, 2020; Li et al., 2022). A dynamical system can be described as a set of three core elements: (a) the state of the system over a time period, including the states of the individual agents, which can be viewed as time series; (b) the state-space of the system; and (c) the state-transition function (Irwin & Wang, 2017). Knowing these core elements, we can describe and predict how a dynamical system behaves. Yet the three elements are not independent of each other. The evolution of the state is affected by the state-transition function, which suggests that the future state may be predicted based on the current state and the entities which affect the agents (i.e., connectivity). Moreover, the state-transition function is often deterministic (Katok & Hasselblatt, 1995), which simplifies the derivation of the future state as a Markovian transition function.

1 Faculty of Science, Technology and Medicine, University of Luxembourg, Luxembourg. 2 Institute for Advanced Studies, University of Luxembourg, Luxembourg. Correspondence to: Aoran Wang <aoran.wang@uni.lu>, Jun Pang <jun.pang@uni.lu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

However, in most cases, we have no access to the connectivity, or only limited knowledge about it. Is it possible to infer the connectivity from the observed states of the agents over a time period? We formulate this as the problem of structural inference, and several machine learning frameworks have been proposed to address it (Kipf et al., 2018; Webb et al., 2019; Alet et al., 2019; Chen et al., 2021; Löwe et al., 2022; Wang & Pang, 2022). Although these frameworks can accurately infer the connectivity, they perform representation learning on a fully connected graph. As a result, they only work for small systems (up to dozens of agents) and do not scale well to large real-world dynamical systems with, for example, hundreds of agents. Besides, as we show in the experiment and appendix sections of this work, integrating prior knowledge about the partial connectivity of the system is quite problematic for these methods.

In this work, we propose a novel structural inference framework, namely Active Learning based Structural Inference (ALaSI), which is designed for the structural inference of large dynamical systems based on Deep Active Learning (Deep AL) (Ren et al., 2022), and is suitable for the integration of prior knowledge. In order to perform structural inference on large dynamical systems, unlike ordinary deep active learning methods that build feature pools on batches (Kirsch et al., 2019; Zhdanov, 2019; Ash et al., 2020; Gentile et al., 2022), the pools of ALaSI are built on agents, and the framework can consequently infer the existence of directed connections with little prior knowledge of the connections. ALaSI leverages a query strategy with dynamics for agent-wise selection to update the pool with the most informative partial system, which encourages ALaSI to infer the connections efficiently and accurately with partial prior knowledge of the connectivity (named the scope). Based on information theory, ALaSI learns both inter-scope (IS) and out-of-scope (OOS) messages from the current scope to distinguish the information which represents connections from agents within the scope from that of agents out of the scope, which reserves redundancy when new agents come into scope. Moreover, with an oracle such as Partial Information Decomposition (PID) (Williams & Beer, 2010), ALaSI can infer the connectivity even without prior knowledge and be trained in an unsupervised way. We show with extensive experiments that ALaSI can infer the directed connections of dynamical systems with up to 1.5K agents with either supervised or unsupervised learning. The main contributions of this paper are as follows:

We propose a novel structural inference algorithm, ALaSI, tailored to infer the connections of large dynamical systems based on Deep AL. To the best of our knowledge, it is the first attempt at structural inference with Deep AL.

We design a novel dynamic query strategy, which queries the most informative agents to be labeled based on the dynamics error, and enables ALaSI to learn efficiently from prior knowledge of the partial dynamical system.

Based on information theory, we propose IS and OOS representation learning pipelines, which facilitate the learning of OOS connections from the current scope of the system, and reserve redundancy for new agents to be added to the current scope.

We experimentally evaluate ALaSI on seven large dynamical systems, and show that ALaSI manages to precisely and efficiently infer the connections under both supervised and unsupervised settings.

2 Related Work

Structural inference. The aim of structural inference is to accurately reconstruct the connections between the agents in a dynamical system from the observed agents' states. Among the wide variety of methods, Neural Relational Inference (NRI) (Kipf et al., 2018) was the first to address the problem of structural inference based on observed agents' states with the help of a Variational Auto-encoder (VAE) operating on a fixed fully connected graph structure. Several works have further improved on NRI, such as extending it to multi-interaction systems (Webb et al., 2019), integrating efficient message-passing mechanisms (Chen et al., 2021), using modular meta-learning (Alet et al., 2019), and eliminating indirect connections with an iterative process (Wang & Pang, 2022). From the aspect of Granger causality, amortized causal discovery (ACD) (Löwe et al., 2022) attempted to infer a latent posterior graph from temporal conditional dependence, while Wu et al. (2020) proposed the Minimum Predictive Information Regularization (MPIR) model and used a learnable noise mask on nodes to reduce the computational cost. In addition to the work mentioned above, several frameworks inferred the connectivity under different problem settings. Some approaches fitted a dynamics model and then produced a causal graph estimate of the model by using recurrent models (Tank et al., 2021; Khanna & Tan, 2020), or inferred the connections by generating edges sequentially (Johnson, 2017; Li et al., 2018), or were specially designed to infer the connections of dynamic graphs (Ivanovic & Pavone, 2019; Graber & Schwing, 2020; Li et al., 2022). However, because of the fixed latent space in the VAE or an exponential computational cost, most of the methods mentioned above are incapable of structural inference on large dynamical systems and have difficulties in utilizing prior knowledge efficiently.

Deep active learning. ALaSI follows the strategy of Deep AL (Gal et al., 2017; Pop & Fulop, 2018; Kirsch et al., 2019; Tran et al., 2019; Ren et al., 2022), which attempts to combine the strong learning capability of deep learning on high-dimensional data with the significant potential of Active Learning (AL) for reducing labeling costs. To solve the problem of insufficient labeled sample data, Tran et al. (2019) leveraged generative networks for data augmentation, and Wang et al. (2016) expanded the labeled training set with pseudo-labels. Moreover, Hossain & Roy (2019) and Siméoni et al. (2020) used labeled and unlabeled datasets to combine supervised and semi-supervised training with AL methods. Several works have been proposed on how to improve the batch sample query strategy (Shi & Yu, 2019; Kirsch et al., 2019; Zhdanov, 2019; Ash et al., 2020). As we will show, by leveraging the advantages of Deep AL, ALaSI is competent in efficiently and accurately inferring the existence of directed connections with a small labeled pool of prior knowledge.

Partial Information Decomposition. Partial Information Decomposition (PID) explicitly quantifies the information associated with two or more information sources that is not present in any subset of those sources (Williams & Beer, 2010; Lizier et al., 2013; Pakman et al., 2021). Therefore, PID is widely utilized to uncover the underlying connections between the agents of dynamical systems in physics (Barrett, 2015; Makkeh et al., 2018) and biology (Chan et al., 2017; Cang & Nie, 2020). Moreover, Lizier et al. (2013) extended the ordinary PID to cases with two or more sources and also considered the past ego state as a source. Based on Lizier et al. (2013), we derive a novel method for learning OOS messages from the current scope. Besides that, we also extend the original symmetric formulation of PID to asymmetric cases by integrating temporal information, which enables ALaSI to infer the existence of directed connections even without any prior knowledge.

3 Preliminaries

3.1 Notations and General Problem Definition

We view a dynamical system $S$ as $S = \{V, E\}$, in which $V$ represents the set of $n$ agents in the system: $V = \{v_i, 1 \le i \le n\}$, and $E$ denotes the directed connections between the agents: $(v_i, v_j) \in E \subseteq V \times V$. We focus on the cases where we have recordings of the agents' states over a time period: $V = \{V^t, 0 \le t \le T\}$, where $T$ is the total number of time steps, and $V^t$ is the set of features of all the $n$ agents at time step $t$: $V^t = \{v_1^t, v_2^t, \ldots, v_n^t\}$. We name the recordings trajectories. Based on the trajectories, we aim to infer the existence of directed connections between any agent pair in the system. The connections are represented as $E = \{e_{ij} \in \{0, 1\}\}$, where $e_{ij} = 1$ (or $0$) denotes the existence (or absence) of a connection from agent $i$ to agent $j$. We sample a total of $K$ trajectories. With the notations above, the dynamics for agents within the system is:

$$v_i^{t+1} = v_i^t + \Delta \cdot \sum_{j \in U_i} f\left(\|v_i, v_j\|_\alpha\right), \tag{1}$$

where $\Delta$ denotes a time interval, $U_i$ is the set of agents connected with agent $i$, $f(\cdot)$ is the state-transition function describing the dynamics caused by the edge from agent $j$ to agent $i$, and $\|\cdot, \cdot\|_\alpha$ denotes the $\alpha$-distance. We state the problem of structural inference as searching for a combinatorial distribution that describes the existence of a directed connection between any agent pair in the dynamical system.
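To make the update rule of Equation 1 concrete, the following is a minimal sketch for scalar agent states. The names `step`, `incoming`, `f` and `delta` are illustrative assumptions and do not come from the paper; for scalar states the α-distance is taken to be the absolute difference, and the linear `f` is a toy choice.

```python
def step(states, incoming, f, delta):
    """Apply one update of Equation 1 to every agent.

    states:   list of scalar states v_i^t (a simplification of feature vectors)
    incoming: dict mapping agent i to the agents j in U_i that affect it
    f:        state-transition function applied to the pairwise distance
    delta:    time interval
    """
    return [
        v_i + delta * sum(f(abs(v_i - states[j])) for j in incoming[i])
        for i, v_i in enumerate(states)
    ]


# Toy system: agent 0 is affected by agent 1, agent 2 by agents 0 and 1.
states = [0.0, 1.0, 3.0]
incoming = {0: [1], 1: [], 2: [0, 1]}
new_states = step(states, incoming, f=lambda d: 0.5 * d, delta=0.1)
```

Agent 1 has no incoming connections, so its state stays unchanged, which is the property structural inference exploits: a wrong guess of `incoming` changes the predicted next state.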

3.2 Problem Definition in the Context of Deep AL

Assume we have two sets of trajectories: the set of trajectories with unknown connectivity, $D_{pool} = \{V_{pool}, E_{\emptyset}\}$, and the set of trajectories for training, $D_{train} = \{V_{train}, E_{train}\}$, where $E_{\emptyset}$ denotes the empty set of connectivity. We consider two scenarios. In the first scenario, we have access to the ground-truth connectivity $E$ of the system, and we perform supervised-learning-based Deep AL with ALaSI:

$$\min_{s_L : |s_L| < K} \; \mathbb{E}_{e \sim P_{E_{train}},\, v \sim P_{V_{train}}} \left[ L(e, v; A_{s_0 \cup s_L}) \right], \tag{2}$$

where $s_0$ is the initial pool of $m$ agents chosen from $D_{train}$, together with the connectivity between them, $s_L$ is the extra pool with budget $K$, $A$ represents the algorithm of ALaSI, $L$ denotes the learning objective, and $P_x$ denotes the sampling space of variable $x$. In the second scenario, the ground-truth connectivity is inaccessible during training, and we show that ALaSI is competent to infer the connections in an unsupervised setting with an oracle: PID (Williams & Beer, 2010; Lizier et al., 2013). Thus, instead of having $E_{train}$ available in $D_{train}$, we leverage PID to calculate the connectivity between the agents in the pool at every round of sampling:

$$\min_{s_k : |s_k| < K} \; \mathbb{E}_{e \sim P_{E_{PID}},\, v \sim P_{V_{train}}} \left[ L(e, v; A_{s_0 \cup s_k}) \right], \tag{3}$$

where $s_k = \{V_{train}, E_{PID}\}$ denotes the pool, with $E_{PID}$ denoting the connections generated by PID operating on the agents in the pool, and the number of agents in $s_k$ has a budget $K$. PID sets up the initial set $s_0$ in the same way as $s_k$, but with a different number of agents $m$. We consider ALaSI with both supervised and unsupervised learning and conduct experiments on both settings to demonstrate its performance.

3.3 Background on PID

PID of two sources X1, X2 amounts to expressing the mutual information (MI) of X1, X2 with a target Y as a sum of four non-negative terms (Pakman et al., 2021):

$$I(Y; (X_1, X_2)) = U(Y; X_1) + U(Y; X_2) + R(Y; X_1, X_2) + Sy(Y; X_1, X_2), \tag{4}$$

corresponding to unique (U1, U2), redundant (R) and synergistic (Sy) contributions, respectively. To calculate the PID terms, the redundant information is first calculated using the specific information Ispec, which quantifies the information provided by one variable about a specific state of another variable (Chan et al., 2017), such as from X1 about state y of variable Y :

$$I_{spec}(y; X_1) = \sum_{x \in X_1} p(x \mid y) \left[ \log \frac{1}{p(y)} - \log \frac{1}{p(y \mid x)} \right]. \tag{5}$$

Then the redundant contribution is calculated by comparing the amount of information provided by each source within set B = {X1, X2} about each state of the target Y :

$$R(Y; X_1, X_2) = \sum_{y \in Y} p(y) \min_{X_i \in B} I_{spec}(y; X_i). \tag{6}$$

The unique information and the synergistic information can be calculated from the redundant information based on the consistency equations (Williams & Beer, 2010):

$$U(Y; X_1) = I(X_1; Y) - R(Y; X_1, X_2), \tag{7}$$

$$Sy(Y; X_1, X_2) = II(Y; X_1; X_2) + R(Y; X_1, X_2), \tag{8}$$

where the interaction information (McGill, 1954) of three variables, $II(a; b; c)$, is calculated as $I(a; b \mid c) - I(a; b)$. In a system of $n$ agents, given a pair of agents $X_1$ and $Y$, there are $n - 2$ triplets involving the pair. The MI between $X_1$ and $Y$ is unaffected by the choice of a third agent $X_2$, because MI is a pairwise measure. But $U(Y; X_1)$ varies depending on $X_2$, and the difference between $I(Y; X_1)$ and $U(Y; X_1)$ is equal to the redundancy between all three agents (Equation 7). A popular method (Chan et al., 2017) therefore calculates the ratio score $r = U(Y; X_1) / I(Y; X_1)$, which captures the proportion of MI that is accounted for by unique information between $X_1$ and $Y$, as opposed to redundant information between all three agents. If $X_1$ and $Y$ are connected, their $r$ is higher than that of any other pair, and we can follow this method to infer the connectivity by calculating the ratio scores for all agent pairs.
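The ratio score can be illustrated on small discrete samples. The sketch below estimates probabilities by counting and follows Equations 5 to 7 for a single triplet; the function names are hypothetical, logarithms are taken base 2, and aggregation over the other third agents is left out.

```python
from collections import Counter
from math import log2


def _probs(samples):
    """Empirical probability of each distinct value in `samples`."""
    n = len(samples)
    return {k: c / n for k, c in Counter(samples).items()}


def mutual_information(xs, ys):
    pxy, px, py = _probs(list(zip(xs, ys))), _probs(xs), _probs(ys)
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())


def specific_information(y, xs, ys):
    """Equation 5: information that source X provides about the state Y = y."""
    pxy, px, py = _probs(list(zip(xs, ys))), _probs(xs), _probs(ys)
    total = 0.0
    for (x, y_val), p in pxy.items():
        if y_val == y:
            p_x_given_y = p / py[y]
            p_y_given_x = p / px[x]
            total += p_x_given_y * (log2(1 / py[y]) - log2(1 / p_y_given_x))
    return total


def redundancy(ys, x1s, x2s):
    """Equation 6: expected minimum specific information over both sources."""
    return sum(
        p * min(specific_information(y, x1s, ys), specific_information(y, x2s, ys))
        for y, p in _probs(ys).items()
    )


def ratio_score(ys, x1s, x2s):
    """r = U(Y; X1) / I(Y; X1), with U obtained from Equation 7."""
    mi = mutual_information(x1s, ys)
    unique = mi - redundancy(ys, x1s, x2s)
    return unique / mi if mi > 0 else 0.0


# Y copies X1, while X2 is independent of both.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = list(x1)
r_unique = ratio_score(y, x1, x2)      # all MI between X1 and Y is unique
r_redundant = ratio_score(y, x1, x1)   # a duplicate source makes it redundant
```

In the first case the third agent carries no information about `y`, so the score is 1; in the second the third agent duplicates `x1`, so the score drops to 0.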

4 Method

In this section, we present ALaSI, a scalable structural inference framework based on agent-wise Deep AL. We start by formulating such a learnable framework in Section 4.1. After that, we describe the IS and OOS operations in Section 4.2, which are essential for making the framework scalable. We then propose the hybrid loss and the query strategy with dynamics in Sections 4.3 and 4.4, respectively. Finally, we discuss the integration of PID into ALaSI in Section 4.5, which enables ALaSI to infer the connectivity with unsupervised learning.

Figure 1. Overview of the pipeline of ALaSI. In the unsupervised setting, PID is leveraged as the oracle.

4.1 Active Structural Inference with Dynamics

The basic idea behind ALaSI is to infer the existence of a directed connection between two agents with the help of dynamics. According to Equation 1, a correct inference of the connectivity enables the algorithm to predict the future states of the agent with a smaller error. We formulate this statement as:

$$\arg\min_{U_i \subseteq V} \mathbb{E}_{\theta \sim p(\theta \mid \{V, E\})} R\left(v_i^{t+1}, P(\hat{v}_i^{t+1} \mid v_i^t, U_i, \theta)\right), \tag{9}$$

where $U_i$ represents the agents connected to agent $i$, $R$ is the loss function that quantifies the dynamics prediction error between the actual dynamics $v_i^{t+1}$ and the predicted dynamics $\hat{v}_i^{t+1}$, and $\theta$ denotes the parameters of the model. The problem setting in Equation 9 is also widely adopted (Kipf et al., 2018; Webb et al., 2019; Löwe et al., 2022; Wang & Pang, 2022). For small dynamical systems, we can directly follow this formulation and leverage generative models such as a VAE operating on a fully connected initial graph, in order to infer the connectivity of the whole system. However, for large dynamical systems, it is impractical to infer the connectivity in the same way, which is also a common problem observed in the literature on structural inference.

In this work, we extend Equation 9 to large dynamical systems with the help of Deep AL. Unlike previous Deep AL algorithms, which train models on batch-wise selections (Gal et al., 2017; Kirsch et al., 2019; Pop & Fulop, 2018; Tran et al., 2019), we design ALaSI to train on agent-wise selections. The pool consists of the features of different agents and the directed connections between these agents. By training ALaSI on the pool, we encourage the framework to capture the statistics that describe the existence of connections between any agent pair:

$$\arg\min_{U_i \subseteq D} \mathbb{E}_{\theta \sim p(\theta \mid D)} R\left(v_i^{t+1}, Q(\hat{v}_i^{t+1} \mid v_i^t, U_i, \theta)\right). \tag{10}$$

Different from Equation 9, we have a limited scope $D$ on the available agents and their features, and we can only learn the representation of connections based on the current scope $D$. However, connections may also exist between the OOS agents and the agents inside the scope, and discarding the influences of these OOS connections would lead to inaccurate inference results. As a consequence, we need to design the model $Q$ so that it can distinguish the portion of information related to OOS connections from the portion of information coming from connections in the scope, in order to learn the representation of connections precisely and also reserve redundancy for new agents to be added into the pool. We describe the pipeline of ALaSI in Figure 1 and Algorithm 2 in the appendix.

4.2 Inter- / Out-of-Scope Operations

Previous works leveraged a fixed scope on the entire set of agents of the dynamical system, and thus struggled with the curse of scalability (Kipf et al., 2018; Webb et al., 2019; Löwe et al., 2022; Wang & Pang, 2022). To address this issue, we propose a set of inter-/out-of-scope operations in order to make ALaSI scalable. Suppose we have a partial view of $n_p$ agents in the dynamical system ($n_p < n$), and we call the partial view a scope. Any agent $i$ in the scope may simultaneously have connections within the scope and connections from agents out of the scope. We demonstrate an example in Figure 2.

Figure 2. (a) A dynamical system with 11 agents. (b) A scope of 4 agents on the same system.

In Figure 2(a), there is a dynamical system consisting of 11 agents and directed connections. Suppose we have a scope with 4 agents on the system, shown as the cloud line in Figure 2(b). Based on the segmentation by the scope, the agents in the system are divided into two groups: IS agents (in original color) and OOS agents (in nattier blue). The connections, however, are segmented into three groups: (a) IS connections (in blue), (b) OOS connections (in red), and (c) non-observable connections (in cream). Besides agents in the scope, IS agents may also be affected by OOS agents; thus, we need to take the OOS connections into consideration and separate their influence.

We denote $V_{inter}^t$ as the set of IS agents' states and $Z_{oos}^t$ as the summary of OOS agents' states for an ego agent $i$ at time step $t$. $V_{inter}^t$ and $Z_{oos}^t$ share many characteristics: (1) Since both of them represent features within the same system, the connections between either IS agents or OOS agents and agent $i$ have the same dynamic function as shown in Equation 9; (2) From the perspective of information theory (Kraskov et al., 2004; Belghazi et al., 2018), we can easily reach the statement that $I(v_i^t; V_{inter}^t) \neq 0$ and $I(v_i^t; Z_{oos}^t) \neq 0$, where $v_i^t$ represents the features of agent $i$ at time step $t$, and $I(\cdot; \cdot)$ denotes the MI between two entities. Therefore, we reformulate Equation 10 as:

$$\arg\min_{U_i \subseteq D} \mathbb{E}_{\theta \sim p(\theta \mid D)} R\left(v_i^{t+1}, Q(\hat{v}_i^{t+1} \mid v_i^t, V_{inter}^t, Z_{oos}^t, \theta)\right). \tag{11}$$

Yet because the calculation of $Z_{oos}^t$ is agnostic, another set of derivations is necessary:

Proposition 4.1. If we assume $Z_{oos}^t$ only captures the information that affects $v_i^t$ and is different from $V_{inter}^t$, we can reach the following statements:

$$I(V_{inter}^t; Z_{oos}^t) < I(v_i^{t+1}; Z_{oos}^t), \quad \text{and} \quad I(V_{inter}^t; Z_{oos}^t) < I(v_i^{t+1}; V_{inter}^t). \tag{12}$$

Proposition 4.1 implies that the MI between $V_{inter}^t$ and $Z_{oos}^t$ is the smallest among the MI between any pair from $V_{inter}^t$, $Z_{oos}^t$ and $v_i^{t+1}$. It also suggests that we can infer information about $Z_{oos}^t$ from $v_i^{t+1}$. We prove the proposition in Section B.1 in the appendix. Based on the MI of time series between two sources and its own present state (Lizier et al., 2013), as well as the Markovian assumption, we have:

$$I(v_i^{t+1}; v_i^t, V_{inter}^t, Z_{oos}^t) = I(v_i^{t+1}; v_i^t) + I(v_i^{t+1}; V_{inter}^t \mid v_i^t) + I(v_i^{t+1}; Z_{oos}^t \mid v_i^t, V_{inter}^t). \tag{13}$$

Since MI terms are non-negative by design, the last term on the right of Equation 13 suggests that, given $v_i^{t+1}$, we can derive the information about $Z_{oos}^t$ conditional on $v_i^t$ and $V_{inter}^t$. Therefore, we implement the inter-/out-of-scope message learning pipelines with neural networks, as shown in the following equations:

$$Z_i = f_{inter1}([v_i^t, V_{inter}^t]), \tag{14}$$

$$e_{inter} = f_{inter2}([Z_i \odot V_{inter}^t, v_i^t]), \tag{15}$$

$$e_{oos} = f_{oos2}([f_{oos1}(v_i^t), e_{inter}]), \tag{16}$$

$$e_{out} = f_{output}(f_{dynamics}(e_{inter}, e_{oos}), v_i^t), \tag{17}$$

where $e_{inter}$ and $e_{oos}$ are the learned inter-/out-of-scope representations ($V_{inter}^t$ / $Z_{oos}^t$), respectively, $[\cdot, \cdot]$ is the concatenation operation, $f_{inter1}$ is the neural network that learns the existence of connections between agent $i$ and the agents inside the current scope, $Z_i$ represents the connectivity inside the scope with regard to agent $i$, and $\odot$ is the operation that selects agents based on connectivity. Suppose we have $K$ agents in the scope; then $Z_i \in [0, 1]^K$. So for any agents $i$, $j$ in the scope, we have $z_{ij} \in [0, 1]$, representing the connectivity from agent $i$ to agent $j$. In practice, we reparametrize $z_{ij}$ with Gumbel-Softmax (Jang et al., 2017) to enable backpropagation (see Section C.5 in the appendix for implementation). Besides that, $f_{inter2}$, $f_{oos1}$, and $f_{oos2}$ are the neural networks that learn representations of IS messages, OOS embeddings, and OOS messages, respectively. Finally, in Equation 17, we learn the representations for dynamics with $f_{dynamics}$, and output the future state of agent $i$ ($e_{out}$) with $f_{output}$. In addition to the operations mentioned above, we leverage loss functions (in Section 4.3) to encourage ALaSI to extract OOS messages from $v_i^t$ and $V_{inter}^t$.
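The Gumbel-Softmax relaxation mentioned above can be sketched in a few lines. This is a generic illustration of the technique of Jang et al. (2017), not the implementation described in the appendix; the function names, the two-class edge encoding and the temperature value are assumptions made for the example.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_edge(logit_no_edge, logit_edge, tau=0.5, rng=random):
    """Hypothetical helper: relaxed z_ij in [0, 1] for one agent pair."""
    return gumbel_softmax([logit_no_edge, logit_edge], tau, rng)[1]


rng = random.Random(0)
sample = gumbel_softmax([0.0, 2.0], tau=0.5, rng=rng)
z_ij = relaxed_edge(0.0, 2.0, tau=0.5, rng=rng)
```

Lower temperatures push the sample towards a one-hot vector, while higher temperatures keep it smooth; in an automatic-differentiation framework the same expression is differentiable with respect to the logits.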

4.3 Train with a Hybrid Loss

The loss function has three roles: (a) encouraging the model to learn OOS representations; (b) calculating the dynamics error; and (c) estimating the connectivity prediction error. As mentioned in Section 4.2, the OOS message $Z_{oos}^t$ can be derived from $v_i^{t+1}$, $v_i^t$ and $V_{inter}^t$. Based on the triplet loss (Schultz & Joachims, 2003; Schroff et al., 2015) and Proposition 4.1, we derive the following loss function to learn the OOS message:

$$L_{oos} = -\frac{1}{(T-1) \cdot |D|} \sum_{i \in D} \sum_{t} I(Z_{oos}^t; v_i^{t+1}), \tag{18}$$

where $T$ represents the total count of time steps, $D$ represents the current scope, and $|D|$ denotes the number of agents in the scope. (We discuss the derivation in Section B.2.) We implement the calculation and maximization of mutual information with the help of Deep InfoMax (Hjelm et al., 2019). In addition, we introduce a regularization term to encourage the learned representations of $Z_{oos}^t$ and $V_{inter}^t$ to be independent of each other, for which we leverage distance correlation (Székely et al., 2007). As already proved (Székely & Rizzo, 2009; 2012; 2014), the distance correlation between two variables is zero only when the two variables are independent of each other. Therefore, we calculate and minimize the distance correlation between $Z_{oos}^t$ and $V_{inter}^t$:

$$L_{dc} = \frac{1}{(T-1) \cdot |D|} \sum_{i \in D} \sum_{t} \frac{\mathrm{dCov}^2(Z_{oos}^t, V_{inter}^t)}{\sqrt{\mathrm{dVar}(Z_{oos}^t)\, \mathrm{dVar}(V_{inter}^t)}}, \tag{19}$$

where $\mathrm{dCov}$ and $\mathrm{dVar}$ are the squared sample distance covariance and the distance variance, respectively, and we describe the procedures for calculating these terms in Section C.4.2 in the appendix.

Figure 3. Query strategy with dynamics in ALaSI.

Besides that, we also need the loss function for dynamics:

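$$L_D = \frac{1}{(T-1) \cdot |D|} \sum_{i \in D} \sum_{t} \frac{\|v_i^{t+1} - \hat{v}_i^{t+1}\|^2}{2\sigma^2} + \mathrm{const}, \tag{20}$$

where $v_i^{t+1}$ and $\hat{v}_i^{t+1}$ are the ground-truth dynamics and predicted dynamics, respectively, and $\sigma^2$ is the variance. Moreover, we have the loss function for connectivity: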

$$L_{con} = -\sum_{i,j \in D} z_{ij} \log f(\hat{z}_{ij}), \tag{21}$$

where $f(\cdot)$ denotes the softmax function, and $z_{ij}$ and $\hat{z}_{ij}$ represent the ground-truth connectivity and predicted connectivity in the scope, respectively. With the proposed terms above, we can summarize the hybrid loss function $R$ as:

$$R = \alpha \cdot L_D + \beta \cdot L_{con} + \gamma \cdot L_{oos} + \eta \cdot L_{dc}, \tag{22}$$

where α, β, γ and η are the weights for the loss terms, trying to match the scales of the last three loss terms with the dynamic loss LD. We discuss the details of loss terms in Section C.4 in the appendix.
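As an illustration of the regularization term and of how the four terms are combined, the sketch below computes the squared sample distance correlation for scalar samples and forms the weighted sum of Equation 22. Scalar samples, the function names and the weight values are assumptions for the example; the procedure actually used is the one described in Section C.4.2 of the appendix.

```python
import math


def _double_centered(xs):
    """Pairwise distance matrix with row, column and grand means removed."""
    n = len(xs)
    dist = [[abs(a - b) for b in xs] for a in xs]
    row_mean = [sum(row) / n for row in dist]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [dist[j][k] - row_mean[j] - row_mean[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def dcov2(xs, ys):
    """Squared sample distance covariance of two equally long samples."""
    a, b = _double_centered(xs), _double_centered(ys)
    n = len(xs)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


def dcor2(xs, ys):
    """dCov^2(X, Y) / sqrt(dVar(X) dVar(Y)), with dVar(X) = dCov^2(X, X)."""
    denom = math.sqrt(dcov2(xs, xs) * dcov2(ys, ys))
    return dcov2(xs, ys) / denom if denom > 0 else 0.0


def hybrid_loss(l_d, l_con, l_oos, l_dc, alpha, beta, gamma, eta):
    """Weighted sum of the four loss terms, as in Equation 22."""
    return alpha * l_d + beta * l_con + gamma * l_oos + eta * l_dc


xs = [1.0, 2.0, 3.0, 4.0]
same_direction = dcor2(xs, [2.0 * x + 1.0 for x in xs])  # linear dependence
constant = dcor2(xs, [5.0, 5.0, 5.0, 5.0])               # degenerate sample
total = hybrid_loss(1.0, 2.0, 3.0, 4.0, alpha=1.0, beta=0.5, gamma=0.1, eta=0.25)
```

A linearly dependent pair reaches the maximum value of 1, which is the situation the regularizer penalizes when it occurs between the OOS and IS representations.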

4.4 Query with Dynamics

Interestingly, AL is also called query learning in the statistics literature (Settles, 2009), indicating the importance of query strategies in the algorithms of AL. Query strategies are leveraged to decide which instances are most informative and aim to maximize different performance metrics (Settles, 2009; Konyushkova et al., 2017). Query strategies select queries from the pool and update the training set accordingly. In this work, we propose a novel pool-based strategy:

Algorithm 1 Query with Dynamics Q.

Input: $D_{train}$ = a pool of labeled trajectories $\{V_{train}, E\}$

Input: $D_{pool}$ = a pool of test trajectories $\{V_{pool}\}$

Input: $D$ = a pool of agents we have for training

Parameters: query size $K$, model weights $\theta$, dynamics loss $L_D$

Output: query of $K$ agents

Step 1: Calculate the dynamics loss $L_D$ on all of the agents in $D_{pool}$, each with only one other agent in scope.

Step 2: Select the $K$ agents with the largest dynamics prediction error.

Step 3: Return the $K$ agents and update $D = D \cup V_i$, $i \in \{K\}$, with features and connectivity from $D_{train}$.

Query with Dynamics, which selects queries of $K$ agents with the largest dynamics prediction error $L_D$ from the pool $D_{pool}$; we then update the training set $D$ with the features and connectivity of the $K$ agents from $D_{train}$. If we have no access to the connectivity, as in unsupervised learning, we run PID to assign directed connections to the agents in pool $D$ together with the additional $K$ agents (as shown in Algorithm 2). We describe the query strategy in Algorithm 1 and Figure 3. It is notable that although we have labels on the existence of connections, we do not query agents purely based on them. On the one hand, the characteristic of dynamical systems (Equation 9) provides strong support that a wrong assignment of connections leads to a large dynamics error $L_D$. On the other hand, we try to reserve redundancy for unsupervised learning cases, where ALaSI has no access to ground-truth connections. In this case, we have to use alternative algorithms as an oracle, such as PID, to estimate the existence of connections and build $D_{train}$. However, the oracle may be strongly biased on the training set $D$, so errors in this set are unavoidable. As a result, we query agents from the entire pool $D_{pool}$ according to their dynamics error $L_D$, so that wrong connections are recognized by our query strategy.
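The selection step of the query strategy can be sketched as a top-K ranking. The sketch assumes the per-agent dynamics errors have already been computed by the trained model; `query_with_dynamics`, `update_scope` and the exclusion of agents already in scope are illustrative assumptions rather than details given in the text.

```python
def query_with_dynamics(dynamics_error, scope, k):
    """Return the k agents outside the scope with the largest dynamics error.

    dynamics_error: dict mapping an agent id to its dynamics loss L_D
    scope:          set of agent ids currently used for training
    k:              query size
    """
    candidates = [agent for agent in dynamics_error if agent not in scope]
    ranked = sorted(candidates, key=lambda agent: dynamics_error[agent], reverse=True)
    return ranked[:k]


def update_scope(scope, queried):
    """Add the queried agents to the training scope (D = D united with V_i)."""
    return set(scope) | set(queried)


errors = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
scope = {"a"}
queried = query_with_dynamics(errors, scope, k=2)
new_scope = update_scope(scope, queried)
```

In the supervised setting, the features and connectivity of the queried agents would then be read from the labeled trajectories; in the unsupervised setting, an oracle such as PID would label the enlarged scope.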

4.5 Structural Inference with PID

As mentioned above, it is possible that we have no access to the ground-truth connectivity of the dynamical system. In that case, ALaSI infers the connections with the help of an oracle: PID (Williams & Beer, 2010; Lizier et al., 2013). The PID framework decomposes the information that a set of source variables provides about a destination variable (Lizier et al., 2013). To infer the existence of directed connections between a pair of agents $i$ and $j$, we extend the formulation in Pratapa et al. (2020) with temporal ordering to infer the direction of connections. We first decompose the features of agent $i$ in the scope, $V_i$, into two sets: $X_i^{0:T-1} = \{v_i^t, 0 \le t \le T-1\}$ and $X_i^{1:T} = \{v_i^t, 1 \le t \le T\}$. Then we calculate $r_{ij}$ between two agents $i$ and $j$ over all other agents in the scope:

$$r_{ij} = U(X_j^{1:T}; X_i^{0:T-1}) / I(X_j^{1:T}; X_i^{0:T-1}), \tag{23}$$

where $r_{ij}$ is the ratio score for the connection from agent $i$ to agent $j$, which is then ranked against the results obtained from all of the agent pairs in the scope. In this formulation, the temporal ordering is what yields directed connections. We summarize the details of PID in ALaSI in Algorithm 3 in the appendix. With the help of PID, ALaSI can infer the existence of directed connections even without any prior knowledge about the connectivity, which broadens the application scenarios of ALaSI. It is possible to use other methods as an oracle for ALaSI, such as pure mutual-information-based methods, SCODE (Matsumoto et al., 2017), or even classic VAE-based structural inference methods (Kipf et al., 2018; Webb et al., 2019; Alet et al., 2019; Löwe et al., 2022), which shows the high adaptability of ALaSI.
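The role of the temporal split can be illustrated with the denominator of Equation 23 alone. The sketch below builds the past and future windows of two discrete series and compares the lagged mutual information in both directions; the toy series, the function names and the use of plain MI instead of the full ratio score are assumptions made to keep the example short.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def lagged_mi(source, target):
    """I(target^{1:T}; source^{0:T-1}): past of the source vs. future of the target."""
    return mutual_information(source[:-1], target[1:])


# Agent j copies the state of agent i with a delay of one time step.
series_i = [0, 1, 1, 0, 1, 0, 0, 1, 1]
series_j = [0] + series_i[:-1]

forward = lagged_mi(series_i, series_j)   # direction i -> j
backward = lagged_mi(series_j, series_i)  # direction j -> i
```

Because the two windows are shifted against each other, the measure is no longer symmetric in the two agents: the true direction `i -> j` carries one full bit here, while the reverse direction carries less.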

5 Experiments

We test ALaSI on seven different large dynamical systems, including simulated networks and real-world gene regulatory networks (GRNs). Implementation details can be found in Section C in the appendix. Besides that, we include additional experiments on the integration of prior knowledge with unsupervised learning, as well as an ablation study, in Section D.

Datasets. We first test our framework on physical simulations of spring systems, which were also used in Kipf et al. (2018). Different from Kipf et al. (2018), we sample the trajectories of balls in the system with fixed connectivity but with different initial conditions. We sample the trajectories while varying the number of balls over {50, 100, 200, 500}, and we name the corresponding datasets Springs50, Springs100, Springs200, and Springs500. Moreover, we collect three real-world GRNs from the literature, namely a single-cell dataset of embryonic stem cells (ESC) (Biase et al., 2014), a cutoff of Escherichia coli microarray data (E. coli) (Jozefczuk et al., 2010), and a cutoff of Staphylococcus aureus microarray data (S. aureus) (Marbach et al., 2012). The three GRNs have 96, 1505 and 1084 agents, respectively.

Baselines and metrics. We compare ALa SI with the state-of-the-art baseline methods: NRI (Kipf et al., 2018): a variational-auto-encoder model for relational inference. fNRI (Webb et al., 2019): an NRI-based model with a multiplex graph, allowing each layer to encode for each connection-type. MPM (Chen et al., 2021): an NRI-based method with a relation interaction mechanism and a spatio-temporal message passing mechanism. ACD (Löwe et al., 2022): a variational model that leverages shared dynamics to infer causal relations across samples with different underlying causal graphs. MPIR (Wu et al., 2020): a model based on minimum predictive information regularization. PID (Williams & Beer, 2010): computes the ratio between unique mutual information between any agent-pair in the system and aligns connections according to the ranking. Although NRI, fNRI, MPM and ACD were originally designed for unsupervised learning, we follow the instructions in their papers and train only the encoders to obtain their results under supervised learning. We describe the implementation details of the baseline methods in Section C.6. The evaluation results are reported as the Area Under the Receiver Operating Characteristic curve (AUROC), which measures a model's ability to discriminate between cases (positive examples) and non-cases (negative examples); in this paper, it quantifies a method's ability to distinguish actual connections from non-connections.
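To make the metric concrete, the following is a minimal sketch of how AUROC could be computed over the entries of a predicted adjacency matrix. The function names `auroc` and `edge_auroc` are hypothetical, the rank-sum formulation is a standard equivalent of the area under the ROC curve, and excluding self-connections is an assumption of this sketch rather than a detail stated in the paper.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) statistic, averaging tied ranks."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the block of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def edge_auroc(pred, truth):
    """Score every ordered pair (i, j) with i != j of a directed graph.

    Assumption: self-connections are excluded from the evaluation.
    """
    n = len(truth)
    scores = [pred[i][j] for i in range(n) for j in range(n) if i != j]
    labels = [truth[i][j] for i in range(n) for j in range(n) if i != j]
    return auroc(scores, labels)
```

A value of 1.0 means every actual connection is scored above every non-connection, while 0.5 corresponds to a ranking no better than chance.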

5.1 Experimental Results of Supervised Learning

We first train ALa SI and baseline methods with supervised learning. It is worth mentioning that despite our efforts, we did not find an approach to train MPIR and PID in a supervised way without violating their inference mechanisms. For the rest of the baseline methods, we follow the instructions in their papers and train only the encoders on the partial knowledge of connections. The experimental results of ALa SI and baseline methods are shown in Figure 4. We report the results as the average AUROC values of ten runs and as a function of the proportion of labeled connections. The number of labeled connections is calculated as the square of the number of agents in the scope, where we mark both connections and non-connections as labeled. We divide the number of labeled connections by the square of the total number of agents in the system to obtain the proportion of labeled connections. Each sub-figure corresponds to the experimental results on a specific dataset.
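As a small worked example of this ratio, the sketch below computes the proportion of labeled connections from the scope size and the system size. The function name `labeled_proportion` is hypothetical and only illustrates the arithmetic described above.

```python
def labeled_proportion(n_scope: int, n_total: int) -> float:
    """Proportion of labeled connections for a scope of n_scope agents.

    Labeled connections (and non-connections) number n_scope ** 2, and the
    system has n_total ** 2 ordered agent pairs in total.
    """
    if n_total <= 0 or not 0 <= n_scope <= n_total:
        raise ValueError("expected 0 <= n_scope <= n_total and n_total > 0")
    return (n_scope ** 2) / (n_total ** 2)


# A scope of 50 agents in a 100-agent system labels a quarter of all pairs.
print(labeled_proportion(50, 100))
```

Because the ratio is quadratic in the scope size, a scope covering half of the agents corresponds to only a quarter of the connections being labeled.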

As shown in Figure 4, the results of baseline methods are positively affected by the proportion of labeled connections during training, and only MPM is marginally better than the other baseline methods on most of the datasets. The rest of the baselines perform almost equally. The results of ALa SI are also positively correlated with the proportion of labeled connections, but the results are much better than any of the baselines. Although ALa SI is only marginally better than any of the baselines on the datasets of Springs50 and Springs100 when the proportion of labeled connections is relatively small (smaller than 0.1), ALa SI outperforms baselines greatly when the proportion of labeled connections is greater than 0.2 on these datasets. ALa SI also infers connectivity with remarkably higher accuracy than baseline methods on the rest of the datasets.

Moreover, we also observe that ALa SI learns the connectivity of large dynamical systems more efficiently than baselines. For example, as shown in the experimental results on all of the datasets except Springs200 , with only 60% of the prior knowledge, ALa SI reaches higher inference accuracy than any baseline methods operating with 80% of the prior knowledge. And this phenomenon is more remarkable in Springs100 , Springs500 and E. coli , where ALa SI outperforms baselines with only 50% of the prior

[Figure 4 plots: seven panels, one per dataset; x-axis: Proportion of Labeled Connections (0.0 to 0.6); legend: ALa SI, NRI, fNRI, MPM, ACD.]

Figure 4. Averaged AUROC results of ALa SI and baseline methods as a function of the proportion of labeled connections. Baseline methods are modified to be trained in a supervised way.

Table 1. Averaged AUROC results (in %) of baseline methods and ALa SI with unsupervised learning.

| Method | Springs50 | Springs100 | Springs200 | Springs500 | ESC | E. coli | S. aureus |
|---|---|---|---|---|---|---|---|
| NRI | 61.7 ± 0.04 | 55.2 ± 0.05 | 53.7 ± 0.05 | 51.1 ± 0.06 | 39.2 ± 0.06 | 15.6 ± 0.06 | 35.1 ± 0.07 |
| fNRI | 62.1 ± 0.03 | 56.7 ± 0.04 | 54.0 ± 0.04 | 51.6 ± 0.03 | 39.8 ± 0.07 | 15.4 ± 0.06 | 35.1 ± 0.06 |
| MPM | 63.1 ± 0.05 | 59.0 ± 0.05 | 54.4 ± 0.04 | 51.8 ± 0.04 | 40.2 ± 0.07 | 17.0 ± 0.05 | 37.5 ± 0.05 |
| ACD | 62.0 ± 0.04 | 58.9 ± 0.03 | 53.9 ± 0.05 | 51.5 ± 0.03 | 38.7 ± 0.04 | 16.2 ± 0.07 | 36.9 ± 0.08 |
| MPIR | 49.7 ± 0.02 | 44.4 ± 0.03 | 42.0 ± 0.03 | 41.1 ± 0.03 | 30.6 ± 0.05 | 15.1 ± 0.04 | 33.1 ± 0.04 |
| PID | 67.8 ± 0.01 | 63.0 ± 0.02 | 59.2 ± 0.03 | 54.7 ± 0.03 | 45.1 ± 0.04 | 19.5 ± 0.03 | 37.8 ± 0.03 |
| ALa SI | 73.5 ± 0.03 | 69.8 ± 0.04 | 66.1 ± 0.05 | 63.2 ± 0.06 | 57.3 ± 0.07 | 23.4 ± 0.05 | 39.2 ± 0.05 |

Table 2. Averaged training time (in hours) of baseline methods and ALa SI with unsupervised learning.

| Method | Springs50 | Springs100 | Springs200 | Springs500 | ESC | E. coli | S. aureus |
|---|---|---|---|---|---|---|---|
| NRI | 29.2 ± 0.02 | 40.6 ± 0.03 | 57.1 ± 0.05 | 85.1 ± 0.03 | 39.4 ± 0.04 | 118.6 ± 0.06 | 101.7 ± 0.06 |
| fNRI | 31.0 ± 0.03 | 49.0 ± 0.04 | 58.0 ± 0.03 | 86.8 ± 0.05 | 42.0 ± 0.06 | 121.4 ± 0.07 | 105.3 ± 0.05 |
| MPM | 35.9 ± 0.02 | 51.6 ± 0.03 | 57.4 ± 0.03 | 85.6 ± 0.04 | 44.1 ± 0.05 | 124.0 ± 0.06 | 105.9 ± 0.06 |
| ACD | 49.0 ± 0.04 | 82.4 ± 0.02 | 63.9 ± 0.03 | 90.0 ± 0.04 | 80.4 ± 0.04 | 130.9 ± 0.06 | 113.5 ± 0.05 |
| MPIR | 12.6 ± 0.01 | 20.7 ± 0.02 | 42.0 ± 0.03 | 51.5 ± 0.03 | 19.5 ± 0.04 | 65.1 ± 0.03 | 47.6 ± 0.02 |
| PID | 51.6 ± 0.01 | 100.2 ± 0.01 | 151.0 ± 0.02 | 183.4 ± 0.02 | 89.3 ± 0.02 | 267.1 ± 0.02 | 230.8 ± 0.01 |
| ALa SI | 25.5 ± 0.04 | 33.8 ± 0.03 | 46.1 ± 0.04 | 60.3 ± 0.04 | 37.2 ± 0.05 | 87.0 ± 0.04 | 72.9 ± 0.05 |

knowledge. Thanks to Deep AL and query with dynamics, ALa SI can update the labeling pool with the most informative addition of agents. Besides that, the IS and OOS operations encourage the model to learn connections within the scope and meanwhile also reserve redundancy for possible OOS connections. Consequently, ALa SI is able to learn the connectivity of large dynamical systems with less prior knowledge under supervised learning.

5.2 Experimental Results of Unsupervised Learning

We report the final average AUROC values and standard deviations of ALa SI and baseline methods under unsupervised learning from ten runs in Table 1, the average training time and standard deviations in Table 2, and the number of required GPUs in Table 3. We can observe from Table 1 that, unsurprisingly, all of the methods perform worse than they do under supervised learning, which is also stated in (Kipf et al., 2018; Chen et al., 2021). ALa SI performs better than

Table 3. Number of utilized GPU cards of baseline methods and ALa SI with unsupervised learning.

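| Method | Springs50 | Springs100 | Springs200 | Springs500 | ESC | E. coli | S. aureus |
|---|---|---|---|---|---|---|---|
| NRI | 1 | 2 | 4 | 6 | 1 | 8 | 6 |
| fNRI | 1 | 2 | 4 | 6 | 1 | 8 | 6 |
| MPM | 1 | 2 | 4 | 6 | 1 | 8 | 6 |
| ACD | 1 | 2 | 4 | 6 | 1 | 8 | 6 |
| MPIR | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PID | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ALa SI | 1 | 1 | 1 | 1 | 1 | 1 | 1 |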

any of the baseline methods on all of the datasets by large margins (up to 17.1%), which verifies the inference accuracy of ALa SI on the unsupervised structural inference of large dynamical systems. Moreover, the average training time of ALa SI and baseline methods is shown in Table 2. It is worth mentioning that, as shown in Table 3, most of the baseline methods are trained on multiple GPU cards when the dataset has 100 or more agents, while ALa SI is trained on a single GPU card. The detailed experimental settings are described in Section C.1. Across all of the datasets, the averaged training time of ALa SI is longer only than that of MPIR, while ALa SI is much more accurate than MPIR. Although the AUROC values of PID are the highest among baseline methods, its operation time is much longer than the rest, and it is nevertheless less accurate than ALa SI. Compared with the rest of the baselines, thanks to the query strategy with dynamics and the OOS operation, ALa SI manages to infer the connections for large dynamical systems with higher efficiency even with unsupervised learning. These results demonstrate the computational efficiency and effectiveness of ALa SI for structural inference on large dynamical systems. More experimental results on noisy data and ablation studies can be found in Section D in the appendix.

6 Conclusion

This paper has introduced ALa SI, a scalable structural inference framework based on Deep AL. The query with dynamics encourages the framework to select the most informative agents to be labeled based on dynamics error, and thus leads to faster convergence. The OOS operation enables the framework to distinguish IS messages from OOS messages based on the current view of the partial system, which in turn promotes the scalability of ALa SI. The experimental results on the seven large datasets have validated the scalability and inference accuracy of ALa SI. The experiments under supervised settings suggest the possibility of leveraging ALa SI to infer the connectivity of large dynamical systems with less prior knowledge. Moreover, the experiments under unsupervised settings demonstrate the broad application scenarios of ALa SI to infer the connectivity even without prior knowledge. Future research includes structural inference based on causality and structural inference for systems with changing agents and connections.

Acknowledgment

Author Jun Pang acknowledges financial support from the Institute for Advanced Studies of the University of Luxembourg through an Audacity Grant (AUDACITY-2021).

References

Alet, F., Weng, E., Lozano-Pérez, T., and Kaelbling, L. P. Neural relational inference with fast modular meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pp. 11804-11815, 2019.

Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., and Agarwal, A. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.

Barrett, A. B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Physical Review E, 91(5):052802, 2015.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 531-540. PMLR, 2018.

Biase, F. H., Cao, X., and Zhong, S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Research, 24(11):1787-1796, 2014.

Brasó, G. and Leal-Taixé, L. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6247-6257, 2020.

Cang, Z. and Nie, Q. Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nature Communications, 11(1):1-13, 2020.

Chan, T. E., Stumpf, M. P., and Babtie, A. C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Systems, 5(3):251-267.e3, 2017.

Chen, S., Wang, J., and Li, G. Neural relational inference with efficient message passing mechanisms. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pp. 7055-7063, 2021.

Cowen-Rivers, A., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R.-R., Maravel, A., Hao, J., Wang, J., Peters, J., and Bou Ammar, H. HEBO: Pushing the limits of sample-efficient hyperparameter optimisation. Journal of Artificial Intelligence Research, 74, 07 2022.

Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Gal, Y., Islam, R., and Ghahramani, Z. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1183-1192. PMLR, 2017.

Gentile, C., Wang, Z., and Zhang, T. Achieving minimax rates in pool-based batch active learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), pp. 7339-7367. PMLR, 2022.

Graber, C. and Schwing, A. G. Dynamic neural relational inference for forecasting trajectories. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4383-4392, 2020.

Ha, S. and Jeong, H. Unraveling hidden interactions in complex systems with deep learning. Scientific Reports, 11(1):1-13, 2021.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

Hossain, H. M. S. and Roy, N. Active deep learning for activity recognition with context aware annotator selection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 1862-1870. ACM, 2019.

Irwin, M. and Wang, Z. Dynamic Systems Modeling, pp. 1-12. John Wiley & Sons, Ltd, 2017.

Ivanovic, B. and Pavone, M. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2375-2384, 2019.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

Johnson, D. D. Learning graphical state transitions. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

Jozefczuk, S., Klie, S., Catchpole, G., Szymanski, J., Cuadros-Inostroza, A., Steinhauser, D., Selbig, J., and Willmitzer, L. Metabolomic and transcriptomic stress response of Escherichia coli. Molecular Systems Biology, 6(1):364, 2010.

Katok, A. and Hasselblatt, B. Introduction to the Modern Theory of Dynamical Systems. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1995.

Khanna, S. and Tan, V. Y. F. Economy statistical recurrent units for inferring nonlinear Granger causality. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 2688-2697. PMLR, 2018.

Kirsch, A., van Amersfoort, J., and Gal, Y. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pp. 7024-7035, 2019.

Konyushkova, K., Sznitman, R., and Fua, P. Learning active learning from data. In Advances in Neural Information Processing Systems (NIPS), volume 30, pp. 4225-4235, 2017.

Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical Review E, 69:066138, 2004.

Kwapień, J. and Drożdż, S. Physical approach to complex systems. Physics Reports, 515(3-4):115-226, 2012.

Li, J., Ma, H., Zhang, Z., Li, J., and Tomizuka, M. Spatiotemporal graph dual-attention network for multi-agent prediction and tracking. IEEE Transactions on Intelligent Transportation Systems, 23(8):10556-10569, 2022.

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. W. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

Lizier, J. T., Flecker, B., and Williams, P. L. Towards a synergy-based approach to measuring information modification. In Proceedings of the 2013 IEEE Symposium on Artificial Life (ALife), pp. 43-51. IEEE, 2013.

Löwe, S., Madras, D., Shilling, R. Z., and Welling, M. Amortized causal discovery: Learning to infer causal graphs from time-series data. In Proceedings of the 1st Conference on Causal Learning and Reasoning (CLeaR), pp. 509-525. PMLR, 2022.

Makkeh, A., Theis, D. O., and Vicente, R. BROJA-2PID: A robust estimator for bivariate partial information decomposition. Entropy, 20(4):271, 2018.

Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796-804, 2012.

Matsumoto, H., Kiryu, H., Furusawa, C., Ko, M. S., Ko, S. B., Gouda, N., Hayashi, T., and Nikaido, I. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics, 33(15):2314-2321, 2017.

McGill, W. Multivariate information transmission. Transactions of the IRE Professional Group on Information Theory, 4(4):93-111, 1954.

Pakman, A., Nejatbakhsh, A., Gilboa, D., Makkeh, A., Mazzucato, L., Wibral, M., and Schneidman, E. Estimating the unique information of continuous variables. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp. 8024-8035, 2019.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Pop, R. and Fulop, P. Deep ensemble Bayesian active learning: Addressing the mode collapse issue in Monte Carlo dropout via ensembles. arXiv preprint arXiv:1811.03897, 2018.

Pratapa, A., Jalihal, A. P., Law, J. N., Bharadwaj, A., and Murali, T. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nature Methods, 17(2):147-154, 2020.

Ren, P., Xiao, Y., Chang, X., Huang, P., Li, Z., Gupta, B. B., Chen, X., and Wang, X. A survey of deep active learning. ACM Computing Surveys, 54(9):1-40, 2022.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815-823, 2015.

Schultz, M. and Joachims, T. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems (NIPS), volume 16, pp. 41-48. MIT Press, 2003.

Settles, B. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2009.

Shi, W. and Yu, Q. Integrating Bayesian and discriminative sparse kernel machines for multi-class active learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pp. 2282-2291, 2019.

Siméoni, O., Budnik, M., Avrithis, Y., and Gravier, G. Rethinking deep active learning: Using unlabeled data at model training. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), pp. 1220-1227. IEEE, 2020.

Székely, G. J. and Rizzo, M. L. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236-1265, 2009.

Székely, G. J. and Rizzo, M. L. On the uniqueness of distance covariance. Statistics & Probability Letters, 82(12):2278-2282, 2012.

Székely, G. J. and Rizzo, M. L. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382-2412, 2014.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794, 2007.

Tank, A., Covert, I., Foti, N., Shojaie, A., and Fox, E. B. Neural Granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267-4279, 2021.

Tran, T., Do, T., Reid, I. D., and Carneiro, G. Bayesian generative active deep learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6295-6304. PMLR, 2019.

Tsubaki, M., Tomii, K., and Sese, J. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35(2):309-318, 2019.

Wang, A. and Pang, J. Iterative structural inference of directed graphs. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022.

Wang, K., Zhang, D., Li, Y., Zhang, R., and Lin, L. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591-2600, 2016.

Webb, E., Day, B., Andres-Terre, H., and Liò, P. Factorised neural relational inference for multi-interaction systems. arXiv preprint arXiv:1905.08721, 2019.

Williams, P. L. and Beer, R. D. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515, 2010.

Wu, T., Breuel, T., Skuhersky, M., and Kautz, J. Discovering nonlinear relations with minimum predictive information regularization. arXiv preprint arXiv:2001.01885, 2020.

Zhdanov, F. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.

Zhen, X., Meng, Z., Chakraborty, R., and Singh, V. On the versatile uses of partial distance correlation in deep learning. arXiv preprint arXiv:2207.09684, 2022.

A What are IS and OOS Connections?

Figure 5. (a) A dynamical system with 11 agents. (b) A scope of 4 agents on the same system.

Figure 5(a) shows a dynamical system consisting of 11 agents and directed connections. Suppose we place a scope of 4 agents on the system, shown as the cloud-shaped outline in Figure 5(b). The scope divides the agents in the system into two groups: IS agents (in original color) and OOS agents (in nattier blue). The connections, however, fall into three groups: (a) IS connections (in blue), (b) OOS connections (in red), and (c) non-observable connections (in cream). During the training of ALa SI, the query strategy with dynamics (Section 4.4) selects the most informative agents and builds a scope upon these agents. If ALa SI operates without the OOS message learning pipeline, then at every update of the scope it can only learn the representation of the connections within the scope. Yet IS agents may also be affected by OOS agents, so we need to take the OOS connections into consideration and separate their influence. Therefore, the OOS message learning pipeline expands the learning field of ALa SI to OOS connections, even though one agent of every OOS connection is not observable in the current scope, which significantly improves the learning efficiency of ALa SI.
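The three groups of connections can be illustrated with a short sketch. It assumes, following the description above, that an IS connection has both endpoints inside the scope, an OOS connection has exactly one endpoint inside the scope, and a non-observable connection has none. The function name `split_connections` and the toy edge list are hypothetical and do not reproduce Figure 5.

```python
def split_connections(edges, scope):
    """Group directed edges (src, dst) by how many endpoints lie in the scope."""
    scope = set(scope)
    groups = {"is": [], "oos": [], "non_observable": []}
    for src, dst in edges:
        inside = (src in scope) + (dst in scope)  # 0, 1, or 2 endpoints in scope
        if inside == 2:
            groups["is"].append((src, dst))
        elif inside == 1:
            groups["oos"].append((src, dst))
        else:
            groups["non_observable"].append((src, dst))
    return groups


# Toy example: a scope of 4 agents in a larger system.
toy_edges = [(0, 1), (1, 2), (2, 5), (6, 3), (7, 8)]
print(split_connections(toy_edges, scope=[0, 1, 2, 3]))
```

In this toy example, the edges (2, 5) and (6, 3) are OOS connections: each touches one agent in the scope, so its influence reaches the scope even though the other endpoint is not observed.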

B.1 Proof of Proposition 4.1

We prove Proposition 4.1 in this section. Since we assume independence between $V^t_{\text{inter}}$ and $Z^t_{\text{oos}}$, the definition of mutual information for two independent variables directly gives the first statement:

$$I(V^t_{\text{inter}}; Z^t_{\text{oos}}) = 0. \tag{24}$$

Moreover, from the proposed PI-diagram of information in a target decomposed from three source variables (Lizier et al., 2013), we have the following statement:

$$I(v^{t+1}_i; v^t_i, V^t_{\text{inter}}, Z^t_{\text{oos}}) > 0. \tag{25}$$

We refer to Figure 3 in (Lizier et al., 2013) and search for the terms related to $V^t_{\text{inter}}$ and $Z^t_{\text{oos}}$:

$$I(v^{t+1}_i; V^t_{\text{inter}}) = \{V^t_{\text{inter}}\} + \{V^t_{\text{inter}}\}\{v^t_i, Z^t_{\text{oos}}\} + \{v^t_i\}\{V^t_{\text{inter}}\}\{Z^t_{\text{oos}}\} + \{V^t_{\text{inter}}\}\{Z^t_{\text{oos}}\} + \{v^t_i\}\{V^t_{\text{inter}}\} > 0, \tag{26}$$

$$I(v^{t+1}_i; Z^t_{\text{oos}}) = \{Z^t_{\text{oos}}\} + \{Z^t_{\text{oos}}\}\{v^t_i, V^t_{\text{inter}}\} + \{v^t_i\}\{V^t_{\text{inter}}\}\{Z^t_{\text{oos}}\} + \{V^t_{\text{inter}}\}\{Z^t_{\text{oos}}\} + \{v^t_i\}\{Z^t_{\text{oos}}\} > 0, \tag{27}$$

where $\{\cdot\}\{\cdot\}$ denotes the redundant information in the two sources, $\{\cdot\}\{\cdot\}\{\cdot\}$ denotes the redundant information in the three sources, $\{\cdot\}$ represents the unique information in the single source, and $\{\cdot,\cdot\}$ is the synergistic information from the sources. Combining Equations 24 to 27, we derive:

$$I(V^t_{\text{inter}}; Z^t_{\text{oos}}) < I(v^{t+1}_i; Z^t_{\text{oos}}), \quad \text{and} \quad I(V^t_{\text{inter}}; Z^t_{\text{oos}}) < I(v^{t+1}_i; V^t_{\text{inter}}), \tag{28}$$

which is Proposition 4.1.

B.2 Derivation of OOS Loss Function

We describe the derivation procedure for Equation 18 in this section. As mentioned in Section 4.2, we can derive the OOS message $Z^t_{\text{oos}}$ from $v^{t+1}_i$, $v^t_i$ and $V^t_{\text{inter}}$. Based on the triplet loss (Schroff et al., 2015; Schultz & Joachims, 2003) and Proposition 4.1, we derive the following loss function to learn the OOS message:

$$L_{\text{oos}} = \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ I(V^t_{\text{inter}}; Z^t_{\text{oos}}) - I(v^{t+1}_i; Z^t_{\text{oos}}) + \alpha_1 + I(V^t_{\text{inter}}; Z^t_{\text{oos}}) - I(v^{t+1}_i; V^t_{\text{inter}}) + \alpha_2 \Big], \tag{29}$$

where $T$ represents the total count of time-steps, $D$ represents the current scope, $|D|$ denotes the number of agents in the scope, and $\alpha_1$ and $\alpha_2$ are margins that regulate the distance between the two pairs of mutual information, respectively, in order to encourage larger values of $I(v^{t+1}_i; Z^t_{\text{oos}})$ and $I(v^{t+1}_i; V^t_{\text{inter}})$ compared to $I(V^t_{\text{inter}}; Z^t_{\text{oos}})$. Note that $Z^t_{\text{oos}}$ and $V^t_{\text{inter}}$ are calculated for every agent in the scope individually. For conciseness, we omit the agent subscript $i$ of $Z^t_{\text{oos}}$ and $V^t_{\text{inter}}$ in Equation 29. Then we can derive:

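$$
\begin{aligned}
L_{\text{oos}} &= \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ I(V^t_{\text{inter}}; Z^t_{\text{oos}}) - I(v^{t+1}_i; Z^t_{\text{oos}}) + \alpha_1 + I(V^t_{\text{inter}}; Z^t_{\text{oos}}) - I(v^{t+1}_i; V^t_{\text{inter}}) + \alpha_2 \Big] \\
&= \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ \big(H(Z^t_{\text{oos}}) - H(Z^t_{\text{oos}} \mid V^t_{\text{inter}})\big) - \big(H(Z^t_{\text{oos}}) - H(Z^t_{\text{oos}} \mid v^{t+1}_i)\big) + \alpha_1 \\
&\qquad\qquad + \big(H(V^t_{\text{inter}}) - H(V^t_{\text{inter}} \mid Z^t_{\text{oos}})\big) - \big(H(V^t_{\text{inter}}) - H(V^t_{\text{inter}} \mid v^{t+1}_i)\big) + \alpha_2 \Big] \\
&= \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ H(Z^t_{\text{oos}} \mid v^{t+1}_i) - H(Z^t_{\text{oos}} \mid V^t_{\text{inter}}) + \alpha_1 + H(V^t_{\text{inter}} \mid v^{t+1}_i) - H(V^t_{\text{inter}} \mid Z^t_{\text{oos}}) + \alpha_2 \Big].
\end{aligned}
$$

We assume $Z^t_{\text{oos}}$ and $V^t_{\text{inter}}$ are independent of each other, so that $H(Z^t_{\text{oos}} \mid V^t_{\text{inter}}) = H(Z^t_{\text{oos}})$ and $H(V^t_{\text{inter}} \mid Z^t_{\text{oos}}) = H(V^t_{\text{inter}})$, and we can reformulate the equation as:

$$
\begin{aligned}
L_{\text{oos}} &= \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ H(Z^t_{\text{oos}} \mid v^{t+1}_i) - H(Z^t_{\text{oos}}) + \alpha_1 + H(V^t_{\text{inter}} \mid v^{t+1}_i) - H(V^t_{\text{inter}}) + \alpha_2 \Big] \\
&= \frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} \Big[ -I(Z^t_{\text{oos}}; v^{t+1}_i) - I(V^t_{\text{inter}}; v^{t+1}_i) + \alpha_1 + \alpha_2 \Big].
\end{aligned}
$$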

Since the mutual information between the two fixed variables $V^t_{\text{inter}}$ and $v^{t+1}_i$ is constant, we omit the second term in the above derivation. Besides that, since the target is minimization, the constant terms $\alpha_1$ and $\alpha_2$ have no effect on the formulation. As a result, we obtain:

$$L_{\text{oos}} = -\frac{1}{(T-1)\cdot|D|} \sum_{i \in D} \sum_{t} I(Z^t_{\text{oos}}; v^{t+1}_i),$$

which is the formulation in Equation 18. Consequently, we only need to maximize $I(Z^t_{\text{oos}}; v^{t+1}_i)$, that is, to minimize its negative, and we implement this with the Deep InfoMax algorithm (Hjelm et al., 2019). Deep InfoMax maximizes the mutual information between input data and learned high-level representations with the help of global and local information.

C Implementation

C.1 General Settings

We implement ALa SI in PyTorch (Paszke et al., 2019) and use scikit-learn (Pedregosa et al., 2011) to calculate various metrics. We run the experiments of ALa SI on a single NVIDIA Tesla V100 SXM2 graphics card, which has 32 GB of graphics memory and 5120 NVIDIA CUDA cores. We attach our pseudocode and implementation as supplementary material to this paper. During training, we set the batch size to 64 for datasets with fewer than 100 agents, and to 16 for datasets with 100 or more agents. We train our ALa SI model for 500 epochs for each updated label pool on every dataset.

As for the baseline methods, since training under supervised settings only requires the encoder of the model, which demands moderate space, we ran these methods on a single NVIDIA Tesla V100 SXM2 graphics card with the same batch sizes as ALa SI. For unsupervised learning, however, the computational requirement of the variational auto-encoder-based methods increased significantly. As a result, in order to run these methods on datasets with more than 100 agents, we used DistributedDataParallel of PyTorch to enable the parallel training of these models, and we ran them on four NVIDIA Tesla V100 SXM2 graphics cards with a batch size of 128. For the experiments on datasets with fewer than 100 agents, we ran the baselines on a single NVIDIA Tesla V100 SXM2 graphics card with a batch size of 64. For MPIR, since the model is very small and its computational requirement is the smallest among all of the baselines, we ran it on a single NVIDIA Tesla V100 SXM2 graphics card with a batch size of 64. For all of the experiments, we train ALa SI with a learning rate of 0.0005.
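The batch-size rule for ALa SI stated above can be written as a small helper. This is only an illustrative sketch: the names `alasi_batch_size`, `LEARNING_RATE`, and `AGENT_COUNTS` are hypothetical, and the agent counts are the ones reported in the dataset description of Section 5.

```python
LEARNING_RATE = 5e-4  # learning rate used for ALa SI in all experiments


def alasi_batch_size(num_agents: int) -> int:
    """Batch size 64 for fewer than 100 agents, 16 for 100 or more."""
    return 64 if num_agents < 100 else 16


# Agent counts per dataset, as reported in the dataset description.
AGENT_COUNTS = {
    "Springs50": 50, "Springs100": 100, "Springs200": 200, "Springs500": 500,
    "ESC": 96, "E. coli": 1505, "S. aureus": 1084,
}

for name, n in AGENT_COUNTS.items():
    print(f"{name}: batch size {alasi_batch_size(n)}, lr {LEARNING_RATE}")
```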

C.2 Hyper-parameters

We have the following hyper-parameters: initial sample size m, query size K, number of epochs E, number of selection rounds N, variance σ of Ldc, weights α, β, γ, ξ in the hybrid loss, and proportion of rank in PID η. We used grid search to find rough values of these hyper-parameters, and show them in Table 4. We chose the reported values such that all of the loss terms are brought to the same scale. Even with this simple search, ALa SI managed to outperform the baseline methods. We think that it is feasible to tune these parameters with the help of Bayesian optimization packages, such as HEBO (Cowen-Rivers et al., 2022) and TuRBO (Eriksson et al., 2019).

Table 4. Hyper parameter choices for every dataset.

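| Dataset | m | K | E | N | σ | α | β | γ | ξ | η |
|---|---|---|---|---|---|---|---|---|---|---|
| Springs50 | 5 | 0.10 | 500 | 12 | 0.0008 | 0.05 | 0.8 | 20 | 2 | 0.3 |
| Springs100 | 5 | 0.05 | 500 | 15 | 0.0008 | 0.02 | 0.8 | 30 | 2 | 0.3 |
| Springs200 | 10 | 0.04 | 500 | 20 | 0.0008 | 0.02 | 0.5 | 20 | 3 | 0.2 |
| Springs500 | 20 | 0.02 | 600 | 30 | 0.0008 | 0.02 | 0.6 | 40 | 3 | 0.2 |
| ESC | 5 | 0.05 | 500 | 20 | 0.0008 | 0.02 | 0.5 | 50 | 2 | 0.2 |
| E. coli | 20 | 0.02 | 600 | 50 | 0.0008 | 0.01 | 0.4 | 40 | 3 | 0.3 |
| S. aureus | 20 | 0.02 | 600 | 50 | 0.0008 | 0.01 | 0.4 | 20 | 3 | 0.3 |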

C.3 Details of Pipelines

In this section, we first demonstrate the general pipeline of ALa SI in Algorithm 2. Then we show the description of PID algorithm in ALa SI in Algorithm 3, which is followed by the implementation of ALa SI in Algorithm 4.

C.4 Details of Loss Function

In this section, we describe the loss terms of the hybrid loss (Equation 22) and their implementation details.

C.4.1. OOS Loss

In this section, we describe the implementation of the OOS loss function (Equation 18). As shown in Section B.2, the loss function simplifies to the maximization of the mutual information between $Z^t_{\text{oos}}$ and $v^{t+1}_i$ for all $0 \le t \le T-1$ and for every agent $i$ in the current scope. As mentioned in Section 4.3, we leverage Deep InfoMax (Hjelm et al., 2019) to maximize $I(Z^t_{\text{oos}}; v^{t+1}_i)$. We follow the implementation of Deep InfoMax at https://github.com/DuaneNielsen/DeepInfomaxPytorch, which is a PyTorch version of the official implementation at https://github.com/rdevon/DIM. Deep InfoMax requires output variables, input variables, and negative samples of the input variables. As a result, besides $Z^t_{\text{oos}}$ and $v^{t+1}_i$, we also feed $V^t_{\text{inter}}$ to Deep InfoMax as the negative samples.

C.4.2. Distance Correlation

In this section, we first describe the procedures to calculate the distance correlation $L_{dc}$ in Equation 19, then we describe the implementation of distance correlation in our work.

Procedures. We first pair the $K$ samples of $Z^t_{\text{oos}}$ and $V^t_{\text{inter}}$ as $(z_p, x_p)$, $p = 1, \dots, K$. Then we calculate the distance matrices $A, B \in \mathbb{R}^{K \times K}$ as:

$$A_{pq} = \|z_p - z_q\|_F, \quad \text{and} \quad B_{pq} = \|x_p - x_q\|_F, \quad p, q = 1, \dots, K.$$

Algorithm 2 Pipeline of ALa SI.

Input: D_train = a pool of labeled trajectories {V_train, E}
Input: D_pool = a pool of test trajectories {V_pool}
Parameters: initial sample size m, query size K, number of epochs E, number of selection rounds N
Model weights: θ; hybrid loss: R; query with dynamics: Q
Output: trained active structural inference model M

if supervised learning then
    Set of data points D ← select m agents with features V_m and connectivity E_m from D_train
else
    Select m agents with features V_m from D_train
    Run PID on V_m and obtain connections between the m nodes: E_PID_0
    Set of data points D ← {V_m, E_PID_0}
end if
Train model M for E epochs with loss R on D and obtain parameters θ_0
Query K agents with the strategy of query with dynamics Q(θ_0, {V_pool}, E)
if supervised learning then
    Update D with K agents with features V_K and connectivity E_K from D_train
else
    Select the K queried agents with features V_K from D_train
    Run PID on features V_K and obtain connections between the K nodes: E_PID_K
    Update D with {V_K, E_PID_K}
end if
while round i < N do
    Train model M for E epochs with loss R on D and obtain parameters θ_i
    Query agent features with Q(θ_i, {V_pool}, E) and choose K agents
    if supervised learning then
        Update D with K agents with features V_K and connectivity E_K from D_train
    else
        Select the K queried agents with features V_K from D_train
        Run PID on features V_K and obtain connections between the K nodes: E_PID_K
        Update D with {V_K, E_PID_K}
    end if
end while
Return: trained model M and parameters θ
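The control flow of Algorithm 2 can be sketched in Python as a generic train-query-update loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the callables `train`, `score`, and `label` are hypothetical placeholders for model training with the hybrid loss, the query-with-dynamics criterion, and the labeling step (ground truth under supervised learning, PID under unsupervised learning), and `query_size` is treated here as an absolute number of agents.

```python
from typing import Any, Callable, Hashable, List, Sequence, Tuple


def active_structural_inference(
    agents: Sequence[Hashable],
    initial: Sequence[Hashable],
    rounds: int,
    query_size: int,
    train: Callable[[List[Hashable], Any], Any],
    score: Callable[[Any, Hashable], float],
    label: Callable[[List[Hashable]], Any],
) -> Tuple[Any, List[Hashable]]:
    """Run one initial and `rounds` further train-query-update iterations."""
    scope = list(initial)           # agents currently in the labeled pool D
    labels = label(scope)           # connectivity among the agents in scope
    model = None
    for _ in range(rounds + 1):     # initial round plus N selection rounds
        model = train(scope, labels)
        pool = [a for a in agents if a not in scope]
        if not pool:
            break                   # nothing left to query
        # Pick the agents that the current model finds most informative.
        ranked = sorted(pool, key=lambda a: score(model, a), reverse=True)
        scope.extend(ranked[:query_size])
        labels = label(scope)
    return model, scope
```

As in Algorithm 2, each iteration ends with an update of the pool, so the returned model is the one trained before the final update; a caller who wants a model fitted to the final pool could call `train` once more.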

Algorithm 3 PID Algorithm in ALa SI.

Input: $\{V_{pool}\}$ = a pool of trajectories of $p$ agents
Parameters: Rank or proportion of rank: $\xi$; total number of time steps of features: $T$
Output: $D_{train}$ = a pool of labeled trajectories $\{V_{pool}, E\}$

for agent $i$ in total $p$ agents do
    for agent $j$ in $p - 1$ agents do
        for agent $r$ in $p - 2$ agents do
            Compute the unique component $I_{Uni}$ between $X_i^{1:T-1}$ and $X_j^{2:T}$ given $X_r^{2:T}$
            Compute the mutual information $I$ between $X_i^{1:T-1}$ and $X_j^{2:T}$ given $X_r^{2:T}$
            Compute the ratio $q_r$ between $I_{Uni}$ and $I$
        end for
        Calculate the sum of $q_r$ over all agents $r$ as $q_{ij}$
    end for
end for
Rank all $q_{ij}$, and select $\xi$ (or $\xi \cdot p$) agent-pairs with the highest $q_{ij}$
Mark the connections from $i$ to $j$ in these pairs as existent, and the rest as non-existent
Return: the connectivity between $p$ agents
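The nested loops of Algorithm 3 amount to scoring every ordered pair $(i, j)$ and keeping the top-ranked pairs. The sketch below shows only that ranking logic. The information-theoretic estimators are not implemented here: `unique_info` and `mutual_info` are hypothetical callables taking agent indices $(i, j, r)$, and the guard against a zero denominator is an assumption of this sketch rather than something stated in the paper.

```python
from itertools import permutations
from typing import Callable, Dict, Set, Tuple

Pair = Tuple[int, int]


def pid_select_edges(
    p: int,
    xi: int,
    unique_info: Callable[[int, int, int], float],
    mutual_info: Callable[[int, int, int], float],
) -> Set[Pair]:
    """Return the xi ordered pairs (i, j) with the highest summed ratio q_ij."""
    scores: Dict[Pair, float] = {}
    for i, j in permutations(range(p), 2):
        q_ij = 0.0
        for r in range(p):
            if r in (i, j):
                continue  # r ranges over the remaining p - 2 agents
            mi = mutual_info(i, j, r)
            # Assumption: a triple with zero mutual information contributes nothing.
            if mi > 0.0:
                q_ij += unique_info(i, j, r) / mi
        scores[(i, j)] = q_ij
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:xi])
```

Pairs in the returned set would be marked as existing connections from $i$ to $j$, and all other pairs as non-existent.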


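After that, we double center the distance matrices to get $\hat{A}_{pq}$ and $\hat{B}_{pq}$:

$$\hat{A}_{pq} = A_{pq} - \bar{A}_{p\cdot} - \bar{A}_{\cdot q} + \bar{A}_{\cdot\cdot},$$

where $\bar{A}_{i\cdot}$ denotes the mean of row $i$, $\bar{A}_{\cdot j}$ denotes the mean of column $j$, and $\bar{A}_{\cdot\cdot}$ denotes the overall mean of $A$; $\hat{B}_{pq}$ is obtained from $B$ in the same way. This centers both the rows and the columns, so all rows and columns of $\hat{A}$ and $\hat{B}$ sum to 0. In short notation:

$$\hat{A} = (I - M) A (I - M), \quad \hat{B} = (I - M) B (I - M), \quad \text{where } M = \frac{1}{K} \mathbf{1}\mathbf{1}^T.$$

The distance covariance of $Z^t_{oos}$ and $V^t_{inter}$ is defined as the square root of:

$$\mathrm{dCov}^2(Z^t_{oos}, V^t_{inter}) = \frac{1}{K^2} \sum_{p,q=1}^{K} \hat{A}_{pq} \hat{B}_{pq}.$$

The distance variance is defined as $\mathrm{dVar}^2(x) = \mathrm{dCov}^2(x, x)$. Thus we can calculate the distance correlation with:

$$L_{dc} = \frac{1}{(T-1) \cdot |D|} \sum_{t} \sum_{i \in D} \frac{\mathrm{dCov}^2(Z^t_{oos}, V^t_{inter})}{\sqrt{\mathrm{dVar}(Z^t_{oos}) \, \mathrm{dVar}(V^t_{inter})}},$$

which is the same formulation as Equation 19.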

Implementation. We start from the official implementation of distance correlation by Zhen et al. (2022) at https://github.com/zhenxingjian/Partial_Distance_Correlation. We then extend it to support batch-wise calculation and GPU acceleration.

C.4.3. Discussion on the Loss Terms

In this section, we discuss the importance of the different terms in the hybrid loss mentioned in Section 4.3. The hybrid loss consists of four terms: $L_{oos}$, to learn OOS messages; $L_{dc}$, to ensure the independence assumption of learning OOS messages; $L_D$, the loss function for dynamics; and $L_{con}$, the loss function for connectivity. Among the four terms, $L_{oos}$ and $L_{dc}$ should appear in pairs to learn OOS messages (as stated and proved in Section B.2). $L_D$ and $L_{con}$ are the terms for AL training; ALa SI does not work without them, so they cannot be discarded. Therefore, we conducted ablation studies to check the importance of the terms for OOS message learning, and presented the results in Section D.3. In the ablation studies, we denote by ALa SI-no OOS the variant without $L_{oos}$ and $L_{dc}$, obtained by setting $\gamma$ and $\eta$ to zero. Figure 9 clearly shows the importance of $L_{oos}$ and $L_{dc}$. Without these two terms, the algorithm can only learn the representations of connections within the scope and cannot extrapolate to OOS connections, which results in an almost linear dependence between AUROC and the proportion of labeled connections.

C.5 Implementation of Pipelines

We first briefly describe the learning pipeline of ALa SI in Algorithm 4. We then describe the components of the networks mentioned in Algorithm 4. The designs of $f_{inter1}$, $f_{inter2}$, $f_{oos1}$, $f_{oos2}$ and $f_{dynamics}$ follow modular-design practice and are based on the multi-layer perceptron shown in Algorithm 5. We name the functional pipeline shown in Algorithm 5 MLP, so the networks in Algorithm 4 can be written as: $f_{inter1} = \mathrm{MLP}(\mathrm{MLP}(\cdot))$, $f_{inter2} = \mathrm{MLP}(\cdot)$, $f_{oos1} = \mathrm{MLP}(\cdot)$, $f_{oos2} = \mathrm{MLP}(\cdot)$ and $f_{dynamics} = \mathrm{MLP}(\cdot)$. We report the layer dimensions of each network in Table 5, where $f''_{inter1}$ represents the second $\mathrm{MLP}(\cdot)$ of $f_{inter1}$, $f'_{inter1}$ represents the first $\mathrm{MLP}(\cdot)$ of $f_{inter1}$, $x_{dim}$ is the number of feature dimensions of an agent at a time step, and $|T|$ is the total number of time steps of the trajectory.

C.6 Implementation of Baselines

NRI. We use the official implementation by the authors from https://github.com/ethanfetaya/NRI with a customized data loader for our chosen datasets. We add our metric evaluation to the test function, after the calculation of accuracy in the original code.

fNRI. We use the official implementation by the authors from https://github.com/ekwebb/fNRI with a customized data loader for our chosen datasets. We add our metric evaluation to the test function, after the calculation of accuracy and the selection of the correct order for the representations in latent spaces in the original code.

MPM. We use the official implementation by the authors from https://github.com/hilbert9221/NRI-MPM with a customized data loader for our chosen datasets. We add our metric evaluation for AUROC to the evaluate() function of class XNRIDECIns in the original code.

Algorithm 4 Pipeline of learning in ALa SI.

Input: $V$ = set of agent features of current scope
Input: $n$ = number of agents in the current scope
Input: $Z_{gt}$ = ground truth connectivity in the current scope
Connection Learning Network: $f_{inter1}$
IS Message Network: $f_{inter2}$
OOS Embedding Network: $f_{oos1}$
OOS Message Network: $f_{oos2}$
Dynamics Learning Network: $f_{dynamics}$
Output Function: $f_{output}$
Deep InfoMax: $f_{DIM}$

Split agent features according to time steps: $V_\tau = V^{0:T-1}$ for training, $V_\psi = V^{1:T}$ for loss calculation, where $T$ represents the total time steps
Learn representation of connections: $Z = f_{inter1}(V_\tau, n)$
Summarize connectivity inside the scope: $\hat{Z} = \mathrm{GumbelSoftmax}(Z)$
Learn inter scope messages: $e_{inter} = f_{inter2}(V_\tau, \hat{Z})$
Learn OOS messages: $e_{oos} = f_{oos2}(f_{oos1}(V_\tau, e_{inter}))$
Learn dynamics: $e_{out} = f_{output}(f_{dynamics}(e_{inter}, e_{oos}), V_\tau)$
Calculate OOS loss with Deep InfoMax: $L_{OOS} = f_{DIM}(e_{inter}, e_{oos}, V_\tau)$
Calculate distance correlations: $L_{dc}$ from $e_{oos}$ and $V_\tau$ for each agent in the scope
Calculate dynamics prediction loss: $L_D \leftarrow \{e_{out}, V_\psi\}$
Calculate connectivity loss: $L_{con} \leftarrow \{\hat{Z}, Z_{gt}\}$
Summarize as the hybrid loss: $R \leftarrow \{L_D, L_{con}, L_{OOS}, L_{dc}\}$
Update parameters with back-propagation
Return: trained model

Algorithm 5 The Multi-layer-perceptron.

Input: features input
x = elu(Linear1(input))
x = dropout(x)
x = elu(Linear2(x))
out = batch_norm(x)
Return: out
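The layer order of Algorithm 5 can be illustrated with a dependency-free forward pass. This is a teaching sketch, not the authors' network code: it uses plain Python lists, and its batch normalization works on batch statistics only, without learned scale, shift or running averages. The function names, the weight layout `(out_dim, in_dim)` and the default arguments are all assumptions of this example.

```python
import math
import random
from typing import List, Optional

Matrix = List[List[float]]


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def linear(batch: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Affine map; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * v for w, v in zip(row_w, x)) + b for row_w, b in zip(weight, bias)]
        for x in batch
    ]


def dropout(batch: Matrix, rate: float, rng: random.Random) -> Matrix:
    """Inverted dropout: zero entries with probability `rate`, rescale the rest."""
    if rate <= 0.0:
        return batch
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in x] for x in batch]


def batch_norm(batch: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each feature over the batch (no learned scale or shift)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(x[d] for x in batch) / n for d in range(dims)]
    variances = [sum((x[d] - means[d]) ** 2 for x in batch) / n for d in range(dims)]
    return [
        [(x[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dims)]
        for x in batch
    ]


def mlp_forward(batch: Matrix, w1: Matrix, b1: List[float], w2: Matrix,
                b2: List[float], rate: float = 0.0,
                rng: Optional[random.Random] = None) -> Matrix:
    """elu(Linear1) -> dropout -> elu(Linear2) -> batch norm, as in Algorithm 5."""
    rng = rng or random.Random(0)
    x = [[elu(v) for v in row] for row in linear(batch, w1, b1)]
    x = dropout(x, rate, rng)
    x = [[elu(v) for v in row] for row in linear(x, w2, b2)]
    return batch_norm(x)
```

In a deep learning framework the same block would normally be built from the framework's own linear, dropout and batch normalization layers, with the dimensions and dropout rates listed in Table 5.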

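Table 5. Dimension of the layers and dropout rates.

| Parameters | $f''_{inter1}$ | $f'_{inter1}$ | $f_{inter2}$ | $f_{oos1}$ | $f_{oos2}$ | $f_{dynamics}$ |
|---|---|---|---|---|---|---|
| Linear1 | $2 \times x_{dim} \times \lvert T \rvert$ | 256 | $2 \times x_{dim}$ | $(x_{dim} + 256) \times (\lvert T \rvert - 1)$ | $2 \times x_{dim}$ | 256 |
| Dropout | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| Linear2 | 256 | 2 | 256 | $x_{dim} \times (\lvert T \rvert - 1)$ | 256 | 256 |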

ACD. We follow the official implementation by the authors as the framework for ACD (https://github.com/loeweX/AmortizedCausalDiscovery). We run the code with a customized data loader for our datasets. We implement the metric-calculation pipeline in the forward pass and the eval() function.

MPIR. We follow the official implementation from https://github.com/tailintalent/causal as the model for MPIR. We run the model with a customized data loader for the chosen datasets. After obtaining the results, we run another script to calculate the metrics.

PID. Based on the Julia implementation of PID in https://github.com/Tchanders/InformationMeasures.jl, we implement PID in Python. We then implement the mutual information calculation of PID with a KDTree (see https://github.com/paulbrodersen/entropy_estimators), in order to enable PID to operate on continuous high-dimensional data. Different from the other methods, we run PID on the entire dataset used in each experiment. For instance, when running experiments on Springs50, PID infers the connections of the entire dynamical system based on the union of the trajectories for training, validation and testing.
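All of the baselines above are scored with AUROC over the inferred connections. For reference, AUROC can be computed directly from its pairwise definition, as in the sketch below. This is a quadratic-time illustration with a hypothetical function name, not the metric code added to the baseline repositories. It assumes labels of 1 for an existing connection and 0 otherwise.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a true edge outscores a non-edge; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For large systems, a rank-based implementation from an established metrics library is the more practical choice, since the number of candidate connections grows quadratically with the number of agents.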


C.7 Further Details about Datasets

Spring Datasets. To generate the springs datasets (Springs50, Springs100, Springs200, and Springs500), we follow the description of the data in (Kipf et al., 2018) but with fixed connections. Specifically, at the beginning of the data generation for each springs dataset, we randomly generate a ground truth graph and then simulate 12000 trajectories on the same ground truth graph with different initial conditions. The remaining settings are the same as those in (Kipf et al., 2018). We collect the trajectories and randomly group them into three sets for training, validation and testing with the ratio of 8:2:2, respectively.
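The random 8:2:2 grouping can be sketched as follows. The function name, the fixed seed and the use of integer trajectory identifiers are assumptions made for this illustration. With 12000 trajectories the split yields 8000, 2000 and 2000 items.

```python
import random
from typing import List, Sequence, Tuple


def split_trajectories(
    items: Sequence,
    ratios: Tuple[int, int, int] = (8, 2, 2),
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle the items and cut them into train/validation/test parts."""
    order = list(items)
    random.Random(seed).shuffle(order)
    total = sum(ratios)
    n_train = len(order) * ratios[0] // total
    n_val = len(order) * ratios[1] // total
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Shuffling a copy with a dedicated `random.Random` instance keeps the caller's data and the global random state untouched.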

GRN Datasets. Different from the springs datasets, the GRN datasets (ESC, E. coli, and S. aureus) are sampled from publicly available data sources. We download the datasets from the links mentioned in the corresponding literature, sample trajectories with the same number of time steps as in the springs datasets, and randomly group the trajectories of gene expressions into three sets for training, validation and testing with the ratio of 8:2:2, respectively.

D Further Experimental Results

In this section, we present additional experimental results that supplement Section 5.

D.1 Integration of Prior Knowledge with Unsupervised Learning

We integrate prior knowledge into unsupervised learning with ALa SI. At the beginning of every experiment, we randomly assign true connectivity to a portion of the agents and keep the remaining settings the same as those in Section 5.2. During a query, if agents with true connectivity are selected and the connections assigned to them by PID contradict the true labels, we set their connectivity to the true labels and keep the connections of the remaining agents unchanged. We summarize the results in Figure 6, where we plot the AUROC results of fully supervised ALa SI (ALa SI-sup), fully unsupervised ALa SI (ALa SI-unsup), and unsupervised ALa SI with 20%, 50% and 80% of prior knowledge on agents (ALa SI-p20, ALa SI-p50 and ALa SI-p80). As the plots show, ALa SI can be integrated with prior knowledge, and the AUROC value is positively correlated with the proportion of integrated prior knowledge. Interestingly, ALa SI-p80 generally moves closer to the fully supervised ALa SI, which in turn supports the data efficiency of ALa SI: it can infer the connectivity of dynamical systems accurately with less prior knowledge. For comparison, we also tested the integration of prior knowledge with the baseline methods that use a VAE under unsupervised settings, but surprisingly we observed performance drops in terms of AUROC. We suspect that the reason is that the integration of prior knowledge happens in the latent space, which violates the generation process of these methods. We leave the study of these performance drops to future work.

Figure 6. Averaged AUROC results of ALa SI-sup, ALa SI-unsup, ALa SI-p20, ALa SI-p50, and ALa SI-p80 as a function of the proportion of labeled connections on Springs50, Springs100, Springs200, Springs500, ESC, E. coli and S. aureus datasets.
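The override rule used during a query can be made concrete with a small sketch. The representation is an assumption of this example: connectivity is stored as a dictionary from ordered agent pairs to 0 or 1, and an edge is treated as belonging to an agent if the agent is either endpoint. The function and argument names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def integrate_prior(
    pid_edges: Dict[Edge, int],
    true_edges: Dict[Edge, int],
    known_agents: Set[int],
    queried: Iterable[int],
) -> Dict[Edge, int]:
    """Replace PID labels by true labels for queried agents with prior knowledge."""
    trusted = known_agents & set(queried)
    merged = dict(pid_edges)
    for edge in merged:
        i, j = edge
        # Assumption: an edge is covered if either endpoint is a trusted agent.
        if (i in trusted or j in trusted) and edge in true_edges:
            merged[edge] = true_edges[edge]
    return merged
```

Edges between agents without prior knowledge keep the labels assigned by PID, which matches the statement that the connections of the remaining agents are maintained.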

D.2 Robustness Tests of ALa SI

Although ALa SI is tested on several real-world datasets and the results are reported in Sections 5.1 and 5.2, it is interesting to carry out more experiments to further test the robustness of ALa SI. We generate a series of Springs50 datasets with different levels of Gaussian noise. The Gaussian noise is added to the features of the agents and the levels amplify the noise as follows:

$$\tilde{v}^t_i = v^t_i + \zeta \cdot 0.02 \cdot \Delta, \quad \text{where } \zeta \sim \mathcal{N}(0, 1), \tag{30}$$

where $v^t_i$ represents the raw feature vector of agent $i$ at time $t$ and $\Delta$ is the noise level. We plot the experimental results of ALa SI on these datasets in Figure 7. As shown in Figure 7, noise in the agents' features affects the performance of ALa SI. The effect is minor when ALa SI is trained in the supervised setting. Under the unsupervised setting, however, and especially when the proportion of labeled connections in the pool is smaller than 0.4, inferring the connections is harder for ALa SI than under the supervised setting. As the proportion of labeled connections increases, the effect of noise becomes smaller. In summary, although noise has a negative impact on the performance, ALa SI can still infer the connections with moderate to high accuracy. We also test the baseline methods on the Springs50 dataset with different levels of Gaussian noise, and plot the results in Figure 8. Each subplot in Figure 8 reports the performance of ALa SI and the baseline methods on the Springs50 dataset at one noise level. As the figure shows, although the baseline methods are trained under supervised settings, they are more sensitive to noise than ALa SI. The margins between the AUROC results of ALa SI and the best baseline methods become larger as the noise level increases. We think the reason may be that the baseline methods use a full-sized computational graph, so that all of the connections within the system are learned simultaneously during training. A high level of noise therefore leads to a large uncertainty in the loss functions of these methods, which are summations of the errors of all the connections in the system. Different from the baseline methods, ALa SI learns the connections agent-wise, which eases the uncertainty in the loss function. In addition, the query with dynamics can correctly select the most informative agent to be added to the scope, regardless of the noise level. We think the combination of these two mechanisms helps ALa SI reduce the uncertainty created by noisy data.

Figure 7. Averaged AUROC results of ALa SI as a function of the proportion of labeled connections on Springs50 dataset with different levels of noise under supervised or unsupervised setting. Legend: Raw, Δ = 1, Δ = 2, Δ = 3, Δ = 4, Δ = 5.
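The noisy Springs50 variants can be produced with a few lines of code. The sketch below assumes that the noise level Δ shown in the legend of Figure 7 enters Equation 30 as a multiplicative factor on the 0.02 scale, and that features are stored as nested lists of floats. The function name, the `scale` argument and the seeding are choices of this example only.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    features: List[List[float]],
    level: int,
    seed: Optional[int] = None,
    scale: float = 0.02,
) -> List[List[float]]:
    """Add zeta * scale * level to every feature, with zeta drawn from N(0, 1)."""
    rng = random.Random(seed)
    return [
        [v + rng.gauss(0.0, 1.0) * scale * level for v in row]
        for row in features
    ]
```

Calling the function with `level=0` returns the raw features unchanged, which corresponds to the curve labeled Raw in Figure 7.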

D.3 Ablation Study

We conduct ablation studies on the effectiveness of query with dynamics, as well as OOS operation. We modify ALa SI into (a) ALa SI-ran: where we replace the query with dynamics strategy with a random sampling strategy on agents; and (b) ALa SI-no OOS: where we remove the pipeline for OOS representation learning and the corresponding terms in the loss function. We report the results of unsupervised learning, which we believe is closer to real-world scenarios, and report the averaged AUROC results of these variants as a function of the proportion of labeled connections by PID.

As shown in Figure 9, ALa SI with the query-with-dynamics strategy and the OOS operation outperforms its variants, ALa SI-ran and ALa SI-no OOS. Although the inference accuracy of all these methods increases when a large portion of agents is labeled, we observe that ALa SI converges much faster than the other two methods. Besides that, the OOS operation is of great importance to the design of a scalable structural inference method. Across the subplots, ALa SI-no OOS can only learn the representations of connections within the scope and cannot extrapolate to OOS connections, which results in an almost linear dependence between AUROC and the proportion of labeled connections. Therefore, the query strategy with dynamics and the OOS operation of ALa SI effectively encourage faster convergence under unsupervised settings.

Figure 8. Averaged AUROC results of ALa SI and baseline methods as a function of the proportion of labeled connections on Springs50 dataset with different levels of noise under supervised setting. Legend: ALa SI, NRI, fNRI, MPM, ACD.

Figure 9. Averaged AUROC results of ALa SI, ALa SI-ran and ALa SI-no OOS as a function of the proportion of labeled connections under unsupervised learning.

E Limitation of ALa SI

Besides the datasets mentioned in this work, we also test ALa SI on the physics simulation datasets mentioned in NRI (Kipf et al., 2018). Most of these datasets have no more than 10 agents in the system, which is much smaller than the systems used in this work. Based on the experiments on these datasets, ALa SI cannot outperform the baseline methods when the dynamical system is small. Since ALa SI relies on agent-wise selection to build the pool for training, it cannot benefit from the mechanism of active learning when the total number of agents is small. Moreover, if there are multiple types of connections in the dynamical system, it is unclear whether ALa SI is a suitable structural inference method for this kind of system. We think it is possible to extend the application scenario of ALa SI to these systems with a built-in multiplex graph, and we leave this for future work.


F Broader Impact

ALa SI allows researchers in network science, biology and physics to study the underlying interacting structure of large dynamical systems, and it is the first algorithm targeting the structural inference of large systems. We have shown that ALa SI performs strongly on large dynamical systems even with additive Gaussian noise, which suggests broad application scenarios. While structural inference technology for large systems may be helpful for many, it could also be misused. For example, it could be used to reveal private or anonymous connections, which could erode privacy and anonymity.

G Ethics Statement

ALa SI is a framework for structural inference of dynamical systems. No matter how effective it is at this task, there may still be failure modes that ALa SI will not catch. So far in this work, we have not seen any ethical issues.

H Reproducibility

We will make the implementation public on GitHub. We will include the code of ALa SI and the procedures for accessing the datasets used in this work. Please refer to it as the implementation of ALa SI.