# Multi-Granularity Causal Structure Learning

Jiaxuan Liang1,2, Jun Wang2, Guoxian Yu1,2,*, Shuyin Xia3,4, Guoyin Wang3,4

1School of Software, Shandong University, Jinan, China
2SDU-NTU Joint Centre for AI Research, Shandong University, Jinan, China
3Chongqing Key Laboratory of Computational Intelligence, Chongqing Uni. of Posts and Telecom., Chongqing, China
4MOE Key Laboratory of Big Data Intelligent Computing, Chongqing Uni. of Posts and Telecom., Chongqing, China
jxliang@mail.sdu.edu.cn, {kingjun, gxyu}@sdu.edu.cn, {xiasy, wanggy}@cqupt.edu.cn

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Unveiling, modeling, and comprehending the causal mechanisms underpinning natural phenomena are fundamental endeavors across myriad scientific disciplines, and new knowledge emerges when causal relationships are discovered from data. Existing causal learning algorithms predominantly focus on the isolated effects of variables, overlooking the intricate interplay of multiple variables and their collective behavioral patterns. Furthermore, the ubiquity of high-dimensional data exacts a substantial temporal cost from causal algorithms. In this paper, we develop a novel method called MgCSL (Multi-granularity Causal Structure Learning), which first leverages a sparse auto-encoder to explore coarse-graining strategies and causal abstractions from micro-variables to macro-ones. MgCSL then takes multi-granularity variables as inputs to train multilayer perceptrons and to delve into the causality between variables. To enhance efficiency on high-dimensional data, MgCSL introduces a simplified acyclicity constraint to adeptly search the directed acyclic graph among variables. Experimental results show that MgCSL outperforms competitive baselines, and finds explainable causal connections on fMRI datasets.

## Introduction

Data science is moving from the data-centric paradigm toward the science-centric paradigm, and a causal revolution is sweeping across various research fields. Causality learning endeavors to unearth causal relationships among variables from observational data and to generate a causal graph, that is, a directed acyclic graph (DAG). Unlike correlation-based study, causality analysis reveals the causal mechanism of data generation. Identifying causality holds paramount significance for stable inference and rational decisions in many applications, such as recommendation systems (Wang et al. 2020), medical diagnostics (Richens, Lee, and Johri 2020), epidemiology (Vandenbroucke, Broadbent, and Pearce 2016) and many others (von Kügelgen et al. 2022).

Given its significance, numerous studies have been conducted on causal structure learning. Constraint-based algorithms (Spirtes et al. 2000; Colombo, Maathuis et al. 2014; Marella and Vicard 2022) acquire a set of causal graphs that satisfy the conditional independences (CI) inherent in the data. Nevertheless, the faithfulness assumption can be refuted, and a substantial number of CI tests are needed. Score-based algorithms (Chickering 2002; Hauser and Bühlmann 2012; Ramsey et al. 2017) define a structure scoring function in conjunction with search strategies to explore the DAG that best fits the limited data. However, these algorithms cannot differentiate DAGs that belong to the same Markov equivalence class (MEC).
By virtue of additional assumptions on data distribution and functional classes, functional causal model-based methods (Hoyer et al. 2008; Zhang and Hyvärinen 2009) can distinguish DAGs within the same MEC, but the exhaustive and heuristic search of DAGs encounters the challenge of combinatorial explosion as the number of variables increases. NOTEARS (Zheng et al. 2018) formulates the acyclicity constraint as a differentiable equation and applies efficient numerical solvers to search for the DAG. Subsequent efforts focus on nonlinear extensions (Lachapelle et al. 2020; Zheng et al. 2020), optimization techniques (Wang et al. 2021; Yang et al. 2023), robustness (He et al. 2021) and semi-structured data (Liang et al. 2023b). However, these algorithms simply deem that causal relationships stand exclusively at the level of individual variables (micro-variables), ignoring the collective interactions of multiple variables (macro-variables). For instance, the brain can be characterized at a micro granularity of neurons and their synapses, but high-order synergistic subsystems are widespread, which typically sit between canonical functional networks and may serve an integrative role (Varley et al. 2023).

Actually, observational data can be regarded as knowledge at the lowest granularity level, while knowledge can be regarded as the abstraction of data at different granularity levels (Wang 2017; Wang et al. 2022). Similar viewpoints appear in the research of complex systems, which suggests that causal relationships are more pronounced at the macro-scale of a system than at its micro-scale. This phenomenon, widely known as causal emergence, manifests extensively across scientific domains such as physics (Loewer 2012), physiology (Noble 2012), sociology (Elder-Vass 2012), and beyond. Due to the intricacy of macro-level causality, available algorithms often overlook or misinterpret the underlying causal structure. Furthermore, the high complexity of causal discovery algorithms causes a significant performance drop when dealing with high-dimensional data, hindering their practical implementation.

Figure 1: Framework overview of MgCSL, which takes observational data X as inputs and constructs an SAE to explore coarse-graining strategies and causal abstractions. Within the encoder, each input variable is trained individually, and their encoded representations are summed to obtain the latent macro-variable representation Z, which is then used by the decoder to reconstruct the observational data X. The contribution matrix A from micro-variables to macro-variables is extracted from the path product of the encoder parameters $W_{encoder}$. Next, MgCSL feeds the concatenation of micro- and macro-variables $X \oplus Z$ into each MLP to explore potential causal relationships, and collects the multi-granularity weighted adjacency matrix S from the first parameter matrix $W^{(1)}_{MLP}$ of the MLPs. MgCSL formulates the DAG search as an optimization problem with a simplified acyclicity constraint, and induces the multi-granularity causal structure from S and A.

We aim to learn multi-granularity causal structure and propose an approach called MgCSL, as depicted in Figure 1. MgCSL first establishes a sparse autoencoder (SAE) (Ng et al. 2011) to automatically coarse-grain the micro-variables into latent macro-ones.
Next, it constructs a multilayer perceptron (MLP) for each micro-variable, taking both the micro- and macro-variables as inputs to explore the underlying causal mechanisms. It further introduces a simplified acyclicity constraint to efficiently orient edges. The major contributions of our work are:

(i) Diverging from existing causal learning methods that exclusively focus on causality at the micro-level of individual variables, we pioneer the learning of multi-granularity causal structure across micro- and macro-variables, which is more complex and challenging but holds immense practical value in various scenarios.

(ii) Our proposed MgCSL utilizes an SAE with a sparsity term on the encoder to explore potential macro-variables and extract effective coarse-graining strategies in a sensible way. MgCSL leverages MLPs to model the underlying causal mechanisms and introduces a simplified acyclicity penalty to orient edges, searching the causal structure in an efficient way.

(iii) Experimental results confirm the advantages of MgCSL over competitive baselines (Spirtes et al. 2000; Chickering 2002; Yu et al. 2019; Ng et al. 2019; Lachapelle et al. 2020; Zheng et al. 2020), and it identifies multi-granularity causal connections from fMRI data.

## Related Work

Causal structure learning is an indispensable and intricate task pervading various scientific fields. Existing causal learning algorithms can be grouped into two types: constraint-based and score-based. Constraint-based methods (Spirtes et al. 2000; Bird and Burgess 2008; Marella and Vicard 2022) perform CI tests to obtain a causal skeleton, then orient edges via elaborate rules to meet the requirements of a DAG. Score-based solutions (Chickering 2002; Hauser and Bühlmann 2012; Ramsey et al. 2017) leverage a score function and search strategies to find the graph yielding the highest score. NOTEARS (Zheng et al. 2018) recasts the combinatoric graph search problem as a continuous optimization problem, stimulating a proliferation of literature. DAG-GNN (Yu et al. 2019) and GAE (Ng et al. 2019) extend NOTEARS to nonlinear cases using autoencoders. GraN-DAG (Lachapelle et al. 2020) and NOTEARS-MLP (Zheng et al. 2020) employ MLPs to approximate the underlying data generation functions. DARING (He et al. 2021) imposes a residual independence constraint in an adversarial way to facilitate DAG learning. RCL-OG (Yang et al. 2023) uses an order graph instead of Markov chain Monte Carlo to approximate the posterior distribution of DAGs. However, these causal solutions only focus on the causality among micro-variables and cannot cope with multiplex causality at different granularity levels. Our MgCSL bridges this gap by leveraging an SAE to explore multi-granularity causality, and a gradient-based search with a simplified acyclicity constraint to seek the DAG in an efficient way.

Actually, the behavior and properties of a system are regulated by causal relations at different granularity levels. Micro-level causality cannot reveal the complexity of a system, especially the cross-regulations among different levels. Moreover, the extracted macro-variables represent a form of knowledge that unveils higher-level characteristics and collective behavioral patterns within a system. Inspired by the human cognitive principle of moving from coarser to finer, granular cognitive computing aims to process information at various granularity levels (Wang 2017; Wang et al. 2022).
Causal learning from the perspective of granular computing has received much less attention than that from individual micro-variables. MaCa (Yang et al. 2022) models text in a multi-granularity way and uses a matrix capsule network to cluster emotion-cause pairs. MAGN (Qiang et al. 2023) performs backdoor adjustment based on a structural causal model to achieve cross-granularity few-shot learning. EMGCE (Wu et al. 2023) represents sentences at different granulation layers to extract explicit and implicit causal triplets of words or phrases. In contrast, MgCSL focuses on exploring the multi-granularity causal graph among micro- and macro-variables.

## The Proposed MgCSL Algorithm

### Problem Definition

In this section, we introduce the main concepts used in this paper, followed by a formal definition of our problem.

**Definition 1** (Structural Causal Model (SCM)). An SCM is defined on a set of variables $V = (v_1, v_2, \dots, v_d)$, and consists of a causal graph $G = (V, E)$ along with structural equations. Each edge $(i, j) \in E$ represents a direct causal relation from $v_i$ to $v_j$. $v_i$ represents a micro-variable, with its observational data denoted as $x_i$. Multiple micro-variables can be abstracted into a macro-variable $u_i$, with its representative data denoted as $z_i$. The distribution P is said to be Markov with respect to G, allowing the joint probability P to be decomposed into the product of conditional probabilities as $P(X) = \prod_{i=1}^{d} P(x_i \mid x_{pa(i)})$, where $x_{pa(i)}$ is the set of parents of $x_i$. We assume that the distribution P is entailed by Additive Noise Models (ANMs) (Peters et al. 2014) of the form:

$$x_i = f_i(x_{pa(i)}) + N_i \quad (1)$$

where $f_i$ is a nonlinear function that denotes the generative process of $x_i$, and $N_i$ is the external noise of $x_i$.

**Definition 2** (Directed Acyclic Graph (DAG)). A DAG consists of variables and edges, with each edge directed from one variable to another. For instance, $v_i \rightarrow v_j$ means there is a directed edge from $v_i$ to $v_j$, and hence $v_i$ is a direct cause of $v_j$. A path between $v_i$ and $v_j$ in a DAG is a sequence of edges, where the ending variable of each edge acts as the starting variable of the next edge in the sequence. If there is no path from any variable to itself, the graph is acyclic. In this paper, we further extend the concept of a DAG to encompass multi-granularity causal structures, wherein if there exists an edge $u_i \rightarrow v_j$, there should be no path from $v_j$ to any micro-variable composing $u_i$.

**Problem 1.** Given observational data $X = (x_1, x_2, \dots, x_d)$, our task is to uncover the multi-granularity causal DAG. This DAG can reflect the collective behavior of variable clusters and reveal complex causal interactions.
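To make the data-generating assumption concrete, below is a minimal sketch of how samples entailed by Eq. (1) might be simulated over a known DAG. A tanh of a random linear map stands in for the nonlinear $f_i$ (the experiments later use Gaussian processes), and all function names are illustrative, not part of MgCSL itself.

```python
import numpy as np

def topological_order(adjacency):
    """Kahn's algorithm; assumes the graph is a DAG.
    adjacency[i, j] = 1 iff v_i -> v_j."""
    adj = adjacency.copy()
    order = []
    frontier = [i for i in range(adj.shape[0]) if adj[:, i].sum() == 0]
    while frontier:
        i = frontier.pop()
        order.append(i)
        for j in np.flatnonzero(adj[i]):
            adj[i, j] = 0                     # delete edge i -> j
            if adj[:, j].sum() == 0:          # j lost its last parent
                frontier.append(j)
    return order

def simulate_anm(n, adjacency, rng=None):
    """Draw n samples from x_i = f_i(x_pa(i)) + N_i (Eq. 1)."""
    rng = np.random.default_rng(rng)
    d = adjacency.shape[0]
    X = np.zeros((n, d))
    for j in topological_order(adjacency):    # parents before children
        parents = np.flatnonzero(adjacency[:, j])
        noise = rng.normal(scale=1.0, size=n) # external noise N_j
        if parents.size == 0:
            X[:, j] = noise
        else:
            w = rng.uniform(0.5, 2.0, size=parents.size)
            X[:, j] = np.tanh(X[:, parents] @ w) + noise  # f_j + N_j
    return X
```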
### Variable Abstraction

Abstract descriptions provide the foundation for system interventions and explanations of observed phenomena at a coarser granularity level than the most fundamental account of the system. While the emerging field of causal representation learning aims to reconstruct disentangled causal variable representations from unstructured data (e.g., images) (Schölkopf et al. 2021), our intention is to learn from observational data and generate low-dimensional representations of macro-variables, which typically encapsulate high-order interactions and collective behaviors among multiple micro-ones.

An SAE is an unsupervised neural network that excels at extracting compact and meaningful representations of data. By promoting sparsity, it encourages the activation of only a small subset of neurons in the hidden layer, which not only reduces the data dimensionality but also enhances interpretability. Given that, MgCSL includes an SAE-based module to obtain causal abstractions from micro-variables to macro-ones. However, a conventional SAE uses a shared encoder for all inputs, entangling the contributions of micro-variables to macro-ones. To ensure the independence of micro inputs during the coarse-graining process, we construct d encoders to separately encode each input, and define the encoder and decoder as:

$$Z = \sum_{i=1}^{d} \sigma\Big(\sigma\big(X_{[:,i]}(W^{(1)}_{encoder})_{[i,:,:]}\big)(W^{(2)}_{encoder})_{[i,:,:]}\Big), \qquad Y = \sigma\big(\sigma(Z W^{(1)}_{decoder}) W^{(2)}_{decoder}\big) \quad (2)$$

where $Z = (z_1, z_2, \dots, z_q)$ represents the encoded data of the latent macro-variables, $\sigma$ is an activation function, and biases are omitted for clarity. $W^{(1)}_{encoder} \in \mathbb{R}^{d \times 1 \times m_1}$, $W^{(2)}_{encoder} \in \mathbb{R}^{d \times m_1 \times q}$, $W^{(1)}_{decoder} \in \mathbb{R}^{q \times m_1}$ and $W^{(2)}_{decoder} \in \mathbb{R}^{m_1 \times d}$ are the parameters of the encoder and decoder, and $m_1$ is the number of neurons in the first hidden layer.

Although we consider all micro-variables as inputs, in our coarse-graining strategies not every micro-variable contributes to the generation of macro-variable representations. To determine which micro-variables make a contribution, we define the contribution as the path product over the encoder:

$$A = |W^{(1)}_{encoder}|\,|W^{(2)}_{encoder}| \quad (3)$$

We say that a path is inactive if at least one weight along the path is zero. Therefore, when the path product from input $i = 1, 2, \dots, d$ to output $j = 1, 2, \dots, q$ is non-zero, i.e., $A_{i,j} \neq 0$, $v_i$ is one of the constituents of macro-variable $u_j$. To clarify the composition of macro-variables from micro-ones, we introduce an $\ell_{1,1}$-norm regularization on the parameter matrices of the encoder, which encourages sparsity of A. This allows some $z_i \in Z$ to approach the zero vector, achieving dynamic variability in the number of macro-variables. In addition, we incorporate a data reconstruction loss, which helps retain causal interpretation; in other words, the compressed macro-variables preserve a significant portion of the essential causal information, allowing faithful reconstruction of the original data. The loss function of the variable abstraction module is then defined as:

$$\mathcal{L}_1 = \mathcal{J}(X, Y) + \alpha_1 \|W_{encoder}\|_{1,1} \quad (4)$$

where $\mathcal{J}(\cdot)$ denotes the cost function; here we use the mean squared error, $\mathcal{J}(X, Y) = \frac{1}{2n}\|X - Y\|_F^2$, where n is the number of samples. The use of the regularization term may reduce the number of variables contributing to abstractions. In such cases, to ensure approximate data reconstruction, the SAE prefers preserving the inputs of parent variables, as they encode more information than child ones. As above, MgCSL extracts the macro-representation vectors Z from X, which are used for the subsequent multi-granularity causality orientation.
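The following is a minimal PyTorch sketch of Eqs. (2)-(4), assuming sigmoid for $\sigma$ and omitting biases as Eq. (2) does; the parameter shapes follow the text, while the class and function names are ours and the initialization scale is an assumption.

```python
import torch

class VariableAbstraction(torch.nn.Module):
    """Sketch of the SAE coarse-graining module (Eqs. 2-4)."""
    def __init__(self, d, m1, q):
        super().__init__()
        # one tiny encoder per micro-variable keeps contributions disentangled
        self.W1 = torch.nn.Parameter(0.1 * torch.randn(d, 1, m1))
        self.W2 = torch.nn.Parameter(0.1 * torch.randn(d, m1, q))
        self.Wd1 = torch.nn.Parameter(0.1 * torch.randn(q, m1))
        self.Wd2 = torch.nn.Parameter(0.1 * torch.randn(m1, d))

    def forward(self, X):                                   # X: (n, d)
        act = torch.sigmoid
        # encode each input column separately, then sum into Z: (n, q)
        H = act(torch.einsum('ni,iom->nim', X, self.W1))    # (n, d, m1)
        Z = act(torch.einsum('nim,imq->niq', H, self.W2)).sum(dim=1)
        Y = act(act(Z @ self.Wd1) @ self.Wd2)               # reconstruction
        return Z, Y

    def contribution(self):
        # path product over the encoder (Eq. 3):
        # A[i, j] != 0  =>  v_i feeds macro-variable u_j
        return torch.einsum('iom,imq->iq', self.W1.abs(), self.W2.abs())

def abstraction_loss(X, Y, model, alpha1=0.1):
    # L1 = J(X, Y) + alpha1 * ||W_encoder||_{1,1}  (Eq. 4)
    mse = 0.5 * ((X - Y) ** 2).sum() / X.shape[0]
    l11 = model.W1.abs().sum() + model.W2.abs().sum()
    return mse + alpha1 * l11
```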
### Causality Orientation

How to orient causal edges between variable nodes is a critical task in causal discovery. Early approaches determine the orientation based on specific structures or asymmetry. The differentiable acyclicity constraint of NOTEARS (Zheng et al. 2018) enables the use of standard numerical techniques to search for DAGs. However, NOTEARS and its variants typically operate at the micro-level. In this work, we need to identify nonlinear dependencies among variables and enforce acyclicity across multiple granularities. For this purpose, we propose a simplified acyclicity constraint to improve search efficiency.

We first construct an MLP for each micro-variable, taking the other variables and the latent macro-variables as inputs, to model the underlying causal mechanisms:

$$\hat{x}_i = \mathrm{MLP}_i(X_{-i} \oplus Z;\, W_{MLP_i}) \quad (5)$$

where $\oplus$ is the concatenation operator, $W_{MLP_i}$ denotes the parameters of the i-th MLP, and $X_{-i}$ represents the observational data of all variables except $x_i$. The variables involved in the fitting process of $x_i$ are potential parent variables of $v_i$. Therefore, we extract the weighted adjacency matrix (WAM) C from the first layer of each MLP, taking the macro-variables into account:

$$C_{i,j} = \Big\|(W^{(1)}_{MLP_j})_{[i,:]} + A_{[i,:]}(W^{(1)}_{MLP_j})_{[d+1:,:]}\Big\|_2 \quad (6)$$

where $W^{(1)}_{MLP_j} \in \mathbb{R}^{(d+q) \times m_2}$ represents the first parameter matrix of the j-th MLP and $m_2$ is the number of neurons in its first hidden layer. Consequently, we can collectively enforce acyclicity on the multi-granularity variables. Similarly, we obtain the multi-granularity WAM $S \in \mathbb{R}^{(d+q) \times d}$ as:

$$S_{i,j} = \big\|(W^{(1)}_{MLP_j})_{[i,:]}\big\|_2 \quad (7)$$

When a variable participates in joint interactions on other variables, we reduce its individual weight on other variables to exclude redundant causal relations via a redundancy penalty:

$$\mathcal{L}_{red} = \sum_{j=1}^{d}\Big\|\sum_{k=1}^{m_2}\big(|A|\,|W^{(1)}_{MLP_j}|_{[d+1:,:]}\big)_{[:,k]} \odot \sum_{k=1}^{m_2}\big(|W^{(1)}_{MLP_j}|_{[:d,:]}\big)_{[:,k]}\Big\|_1 \quad (8)$$

where $\odot$ denotes the Hadamard product. We then define the loss function of the causality orientation module as:

$$\min \mathcal{L}_2 = \mathcal{J}(X, \hat{X}) + \mathcal{L}_{red} + \alpha_2\Big(\sum_{j=1}^{d}\|W^{(1)}_{MLP_j}\|_{1,1} + \sum_{j=1}^{d}\|W_{MLP_j}\|_F^2\Big) \quad \text{s.t.} \quad H(C) = 0 \quad (9)$$

$H(C) = 0$ is our meticulously simplified directed acyclicity constraint, which is discussed in more detail later. As sparse parameters lead to more specific directions, we apply an $\ell_{1,1}$-norm regularization on each $W^{(1)}_{MLP_j}$ to help orientation. We further employ a squared Frobenius norm on each $W_{MLP_j}$ to facilitate model convergence and enhance generalization. The objective function of MgCSL consists of the two afore-defined loss functions:

$$\min \mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 \quad \text{s.t.} \quad H(C) = 0 \quad (10)$$

By iteratively optimizing the objective function, we can search for the optimal multi-granularity DAG. Due to the continuous nature of C and S obtained after training, they cannot be directly interpreted as a causal graph, so a post-processing procedure on C and S is required. We first cut all weights below a certain threshold $\epsilon$, then iteratively eliminate the smallest weight until no directed cycles exist in C and S. Finally, we set all remaining non-zero elements to 1.
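The post-processing step can be made concrete with a short sketch, shown here for the square WAM C only. One faithful reading of "eliminate the smallest weight" is to drop the weakest edge on a detected cycle; the threshold value and function name are assumptions.

```python
import numpy as np
import networkx as nx

def postprocess(C, eps=0.3):
    """Turn the continuous WAM C into a binary DAG:
    cut weights below eps, then repeatedly drop the weakest edge
    on a remaining directed cycle, and binarize what is left."""
    W = np.where(np.abs(C) >= eps, np.abs(C), 0.0)
    np.fill_diagonal(W, 0.0)
    G = nx.from_numpy_array(W, create_using=nx.DiGraph)
    while True:
        try:
            cycle = nx.find_cycle(G)          # list of (u, v) edges
        except nx.NetworkXNoCycle:
            break
        u, v = min(cycle, key=lambda e: G[e[0]][e[1]]['weight'])[:2]
        G.remove_edge(u, v)                   # weakest edge on the cycle
    A = nx.to_numpy_array(G)
    return (A > 0).astype(int)                # remaining entries become 1
```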
### Optimization

To ensure the acyclicity of C, an explicit way is to enforce the condition $\sum_{k=1}^{d} \operatorname{tr}(D^k) = 0$, where $D = C \odot C$ and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. NOTEARS (Zheng et al. 2018) observes that $\sum_{k=1}^{d} \operatorname{tr}(D^k) = 0 \Leftrightarrow \operatorname{tr}\big(\sum_{k=1}^{\infty} \frac{D^k}{k!}\big) = 0 \Leftrightarrow \operatorname{tr}\big(\sum_{k=0}^{\infty} \frac{D^k}{k!}\big) - d = 0$, and proposes Theorem 1.

**Theorem 1.** A WAM $C \in \mathbb{R}^{d \times d}$ is a DAG if and only if

$$h(C) = \operatorname{tr}(e^{D}) - d = 0 \quad (11)$$

This provides a continuous optimization approach for searching the graph structure. However, the computational complexity of the matrix exponential makes it challenging to obtain results within an acceptable time on high-dimensional data (Liang et al. 2023a). Here, we introduce a meticulously designed acyclicity constraint to orient edges in an efficient way.

**Proposition 1.** A WAM $C \in \mathbb{R}^{d \times d}$ is a DAG if and only if

$$\forall i \in [1, d], \quad \lambda_i = 0 \quad (12)$$

where the $\lambda_i$ are the eigenvalues of D.

The proof of Proposition 1 is deferred to the Supplementary file (Liang et al. 2023c). A previous effort (Lee et al. 2019) attempts to enforce acyclicity based on the spectral radius, which is unstable and tends toward a zero matrix, since it neglects the global information of the eigenvalues and the zero matrix is an easily attainable local optimum.

Optimizing the eigenvalues jointly expands the search space but introduces computational burdens. Since a square matrix is necessarily unitarily similar to an upper triangular matrix, we perform a Schur decomposition of D:

$$D = U R U^{*} \quad (13)$$

where U is a unitary matrix and R is an upper triangular matrix whose diagonal entries are the eigenvalues of D, since D is similar to R.

**Proposition 2.** A WAM $C \in \mathbb{R}^{d \times d}$ is a DAG if and only if

$$H(C) = \|\operatorname{diag}(R)\|_2^2 = 0 \quad (14)$$

where $\operatorname{diag}(\cdot)$ denotes the diagonal vector of a matrix.

Optimizing Eq. (14) avoids the intricate computation of the matrix exponential and its gradient in Eq. (11), which facilitates an efficient search of causal graphs in the latent graph space. It can also minimize the weights of non-existing edges with the help of the reconstruction loss on X. Therefore, the error caused by transforming D into R can be effectively removed through our post-processing procedure.

To this end, Eq. (10) is reformulated into a continuous optimization problem, albeit a non-convex one due to the non-convex feasible set. Nevertheless, we can employ an augmented Lagrangian approach to replace the original constrained problem in Eq. (10) with a sequence of unconstrained subproblems, and then seek stationary points as the solution:

$$\min \mathcal{L} = \mathcal{J}(X, Y) + \alpha_1\|W_{encoder}\|_{1,1} + \mathcal{J}(X, \hat{X}) + \mathcal{L}_{red} + \alpha_2\Big(\sum_{j=1}^{d}\|W^{(1)}_{MLP_j}\|_{1,1} + \sum_{j=1}^{d}\|W_{MLP_j}\|_F^2\Big) + \frac{\mu}{2}|H(C)|^2 + \gamma H(C) \quad (15)$$

where $\mu$ is the penalty coefficient and $\gamma$ is the Lagrange multiplier. The parameters are updated as follows:

$$\theta^{(\kappa+1)} \leftarrow \arg\min_{\theta}\Big\{\mathcal{J}(X, Y) + \alpha_1\|W_{encoder}\|_{1,1} + \mathcal{J}(X, \hat{X}) + \mathcal{L}_{red} + \alpha_2\Big(\sum_{j=1}^{d}\|W^{(1)(\kappa)}_{MLP_j}\|_{1,1} + \sum_{j=1}^{d}\|W^{(\kappa)}_{MLP_j}\|_F^2\Big) + \frac{\mu}{2}|H(C)|^2 + \gamma H(C)\Big\} \quad (16)$$

$$\mu^{(\kappa+1)} \leftarrow \begin{cases} \eta\mu^{(\kappa)}, & H(C^{(\kappa+1)}) > \rho H(C^{(\kappa)}) \\ \mu^{(\kappa)}, & \text{otherwise} \end{cases} \quad (17)$$

$$\gamma^{(\kappa+1)} \leftarrow \gamma^{(\kappa)} + \mu H(C) \quad (18)$$

where $\theta \in \{W_{encoder}, W_{decoder}, W_{MLP}\}$ and $\theta^{(\kappa)}$ is the optimized $\theta$ at the $\kappa$-th iteration. A variety of optimization methods can be applied to Eq. (16); in this work, we apply the L-BFGS-B algorithm (Byrd et al. 1995). The comprehensive procedure of MgCSL is delineated in Algorithm 1 in the Supplementary file.

The time complexity of MgCSL comes from two sources, variable abstraction and causality orientation. For the former, the computational complexity of the SAE for one iteration step is $O(nm_1q + ndm_1q + nd^2m_1)$. For the latter, the computational complexity of the MLPs for one iteration step is $O(d^2m_2 + nd^2m_2 + d^3)$. Although the Schur decomposition of MgCSL and the trace of the matrix exponential in NOTEARS share the same $O(d^3)$ time complexity, the eigenvalues are more sensitive to the presence of directed cycles in C than the trace of the matrix exponential, enabling C to be optimized faster toward a cycle-free direction. This is because, under the constraint of the squared Frobenius norm, most entries of C are smaller than 1, so the matrix-exponential penalty approaches zero even when cycles remain. Therefore, our approach is approximately one order of magnitude faster, as shown in the experiment section.
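A minimal numerical sketch contrasting the NOTEARS penalty of Eq. (11) with the Schur-based penalty of Eq. (14) is given below. In MgCSL the penalty is differentiated during training, whereas here we only evaluate it on a fixed matrix; the example matrix is ours.

```python
import numpy as np
from scipy.linalg import schur, expm

def h_notears(C):
    """NOTEARS constraint h(C) = tr(e^{C o C}) - d  (Eq. 11)."""
    D = C * C
    return np.trace(expm(D)) - D.shape[0]

def h_schur(C):
    """Simplified constraint H(C) = ||diag(R)||_2^2  (Eq. 14), with
    D = C o C = U R U* from a complex Schur decomposition, so the
    diagonal of R holds the eigenvalues of D."""
    D = C * C
    R, _ = schur(D, output='complex')
    return float(np.sum(np.abs(np.diag(R)) ** 2))

C = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.5],
              [0.3, 0.0, 0.0]])    # 3-cycle: both penalties are positive,
print(h_notears(C), h_schur(C))    # ~0.0072 vs ~0.178 (eigenvalues react
                                   # more strongly to the cycle)
C[2, 0] = 0.0                      # break the cycle: both drop to ~0
print(h_notears(C), h_schur(C))
```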
## Experimental Evaluation

### Experimental Setup

We evaluate the proposed MgCSL on random DAGs produced by the Erdős-Rényi (ER) or Scale-Free (SF) scheme. We vary the number of variables ($d \in \{20, 50, 100\}$) with edge density degree = 2. For each graph, we generate 10 datasets of n = 1000 samples following: (i) additive models with Gaussian processes, and (ii) ANMs with Gaussian processes. In addition, on the SF graph with d = 20, we randomly select {2, 4} variables as macro-variables and employ an MLP to decompose each of them into 8 micro-variables, forming multi-granularity graphs. We also consider the well-known Sachs dataset (Sachs et al. 2005), which measures the expression levels of different proteins and phospholipids in human cells. The ground-truth causal graph of this dataset consists of 11 variables and 20 edges. In this work, we test on observational data with 7466 samples.

We compare MgCSL with representative causal structure learning methods, including PC (Spirtes et al. 2000), GES (Chickering 2002), DAG-GNN (Yu et al. 2019), GAE (Ng et al. 2019), GraN-DAG (Lachapelle et al. 2020) and NOTEARS-MLP (NO-MLP) (Zheng et al. 2020). The first two methods focus on combinatorial optimization, and the last four target general nonlinear dependencies. We refer readers to the Supplementary file (Liang et al. 2023c) for the detailed experimental setup. The code of MgCSL is shared at http://www.sdu-idea.cn/codes.php?name=MgCSL.

We evaluate the estimated DAG using five metrics (a sketch of how the first four can be computed follows Figure 2):
- Precision: proportion of correctly estimated edges to the total estimated edges;
- Recall: proportion of correctly estimated edges to the total edges in the true graph;
- F1 score: the harmonic mean of precision and recall;
- Structural Hamming distance (SHD): the number of missing, falsely estimated or reversed edges;
- Runtime: the running time required to obtain the results, measured in seconds.

When a method fails to return any result within a reasonable response time (100 hours), we mark the entry as "-". In the case of multi-granularity graphs, we convert the estimated graph and the ground-truth graph to the micro-level, i.e., an edge from a macro-variable to a target implies multiple edges from the micro-variables encompassed within that macro-one to the same target. For the reported results, ↑ (↓) means the higher (lower) the value, the better the performance. The best result is highlighted in bold font.

### Result Analysis

**Results on multi-granularity graphs.** We first compare the performance of our MgCSL with the baselines on multi-granularity synthetic datasets.

Figure 2: Results on multi-granularity graphs: (a) d=20 with 2 macro-variables; (b) d=20 with 4 macro-variables.
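As a reference for the first four metrics, here is a small sketch over binary adjacency matrices. Counting conventions for SHD vary across papers; this sketch counts a reversed edge as a single error, which matches the definition above, and the function name is ours.

```python
import numpy as np

def evaluate(est, true):
    """Precision, recall, F1 and SHD for binary adjacency matrices."""
    est_e = {(i, j) for i, j in zip(*np.nonzero(est))}
    true_e = {(i, j) for i, j in zip(*np.nonzero(true))}
    tp = len(est_e & true_e)
    prec = tp / max(len(est_e), 1)
    rec = tp / max(len(true_e), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    # SHD: extra + missing + reversed edges, each reversal counted once
    extra = {e for e in est_e - true_e if e[::-1] not in true_e}
    missing = {e for e in true_e - est_e if e[::-1] not in est_e}
    reversed_ = {e for e in est_e - true_e if e[::-1] in true_e}
    shd = len(extra) + len(missing) + len(reversed_)
    return prec, rec, f1, shd
```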
Table 1: Results on nonlinear models with additive Gaussian processes (mean ± std over 10 datasets; upper block d = 20, lower block d = 100; best per column in bold).

| d | Method | ER Precision(%) | ER Recall(%) | ER SHD | ER Runtime(s) | SF Precision(%) | SF Recall(%) | SF SHD | SF Runtime(s) |
|---|---|---|---|---|---|---|---|---|---|
| 20 | PC | 60.22±8.38 | 54.50±7.43 | 25.90±4.15 | **0.54** | 41.20±9.26 | 42.70±9.60 | 33.70±5.23 | **0.69** |
| 20 | GES | 58.14±5.80 | 59.00±6.15 | 26.00±3.06 | 12.28 | 43.72±5.88 | 55.68±6.39 | 33.60±3.37 | 14.75 |
| 20 | DAG-GNN | 89.57±8.44 | 42.50±6.01 | 24.70±3.09 | 2512.59 | 85.02±11.23 | 38.92±5.29 | 23.90±2.64 | 2880.47 |
| 20 | GAE | 83.24±12.07 | 30.50±17.59 | 29.30±5.42 | 1469.75 | 83.93±22.12 | 37.03±19.49 | 24.60±7.97 | 1999.70 |
| 20 | GraN-DAG | 97.13±5.20 | 51.25±6.80 | 20.10±2.85 | 311.23 | 95.12±3.69 | 59.19±12.32 | 16.10±4.46 | 355.68 |
| 20 | NO-MLP | 89.86±4.59 | **69.75±7.12** | 14.80±3.05 | 45.35 | 80.62±5.62 | **73.24±5.32** | 15.30±2.87 | 47.10 |
| 20 | MgCSL | **98.61±2.30** | 65.75±8.42 | **14.10±3.21** | 9.57 | **96.19±7.36** | 70.54±7.48 | **12.00±3.59** | 7.58 |
| 100 | PC | 31.06±2.79 | 40.60±3.16 | 235.30±13.67 | **12.74** | 27.39±2.23 | 38.12±2.61 | 263.00±13.43 | **27.15** |
| 100 | GES | 42.10±3.77 | 60.78±5.01 | 213.44±18.53 | 79857.07 | - | - | - | - |
| 100 | DAG-GNN | 88.61±4.75 | 34.95±6.12 | 135.50±10.05 | 1637.52 | 88.80±4.10 | 31.22±5.38 | 140.10±8.17 | 3654.05 |
| 100 | GAE | 50.15±42.26 | 13.80±10.42 | 211.20±55.11 | 5527.23 | 64.56±35.86 | 15.94±8.78 | 203.00±62.33 | 1991.81 |
| 100 | GraN-DAG | 11.70±3.62 | 33.00±9.84 | 653.50±208.96 | 2089.45 | 6.31±1.10 | 23.60±5.11 | 828.40±122.29 | 1575.67 |
| 100 | NO-MLP | 80.77±4.88 | **64.50±5.13** | **95.80±12.69** | 1923.61 | 69.28±4.38 | **62.44±3.40** | 122.60±12.27 | 1688.14 |
| 100 | MgCSL | **94.61±3.33** | 49.65±7.18 | 106.10±11.53 | 430.81 | **89.20±4.97** | 55.28±7.36 | **98.20±8.09** | 394.56 |

As shown in Figure 2, the presence of macro-variables has an impact on the precision of the baselines. Even on small graphs, they exhibit a precision as low as 0.5 or even lower. PC and GES achieve the highest SHD, because they struggle to capture complex nonlinear relationships through either CI tests or score functions. GES even fails to produce any results when the number of macro-variables is 4. DAG-GNN, GAE and GraN-DAG possess the capability to learn nonlinear dependencies, but they fail to achieve satisfactory results. In particular, GraN-DAG cannot obtain valid estimated graphs across all datasets. This limitation might be attributed to the introduction of noise through the micro-level representations of macro-variables, which affects the DAG search process. NO-MLP achieves promising results in terms of SHD. However, its low precision and high execution time still make it challenging to apply directly to multi-granularity graphs. MgCSL is capable of extracting valuable information from data mixed with multi-granularity variables for DAG learning, thereby achieving the best performance in terms of precision and SHD with a modest expenditure of time. We apply the Wilcoxon signed-rank test (Demšar 2006) to check the statistical difference between MgCSL and the other compared methods across these metrics and datasets; all p-values are smaller than 0.02. This demonstrates its ability to discover multi-granularity causal structures.

**Results on typical causal graphs.** We also conduct experiments on typical causal discovery synthetic datasets. As shown in Table 1 and Table S3 in the Supplementary file, our MgCSL outperforms its opponents across most metrics within a short period of time. Due to the page limit, we have moved the results for 50 variables and the F1 metric to the Supplementary file; this does not affect the conclusions.

**MgCSL vs. combinatorial optimization methods.** PC can swiftly produce comparable results through CI tests when there are few variables.
However, as the number of variables increases, the limited sample size renders the CI tests unreliable, leading to a sharp rise in SHD. The performance of GES is relatively superior to that of PC because it is less affected by data issues (e.g., noisy or small samples). However, its greedy search strategy makes it challenging to deliver results within a reasonable runtime, particularly when confronted with the exponential growth of the search space as the number of variables increases. In comparison, our MgCSL delivers markedly better performance across these metrics on datasets of different sizes.

**MgCSL vs. gradient-based methods.** DAG-GNN achieves consistent results in most cases by leveraging the power of variational inference. However, it incurs a substantial time cost even on low-dimensional data, which hinders its applicability to small graphs. GAE presents poor performance, with its precision fluctuating significantly. While GraN-DAG exhibits excellent precision on small graphs, its performance falters as the number of variables increases; its estimated graphs contain numerous erroneous edges, indicating its limitations in learning causal structures on large graphs. NO-MLP shows commendable performance, but it demands a lot of time due to the high complexity of the matrix exponential calculations. MgCSL surpasses the above rivals across most metrics, maintaining high precision even on high-dimensional graphs. Moreover, the simplified acyclicity penalty empowers MgCSL to provide effective results within a shorter time, ensuring its feasibility for real-world applications.

**The impact of different graphs.** We further compare the performance of the algorithms on ER and SF graphs. On ER graphs, the degrees of individual variables are relatively balanced, whereas SF graphs contain a few hub variables with degrees significantly higher than the average, making it more likely that multiple variables jointly influence a single target. As the experimental results show, most algorithms, such as PC, GES and NO-MLP, perform worse on SF graphs than on ER graphs in terms of SHD, especially as the number of variables increases. This could be attributed to the causal discovery algorithms' demand for sparsity in the estimated graph, resulting in poor performance on graphs with highly uneven degrees. MgCSL has the capability of abstracting multiple micro-variables into a macro-one and discovering causal edges among multi-granularity variables. This empowers it to achieve competitive results on SF graphs.
Table 2: Results on Sachs's protein signaling dataset.

| Method | Precision(%) | Recall(%) | F1(%) | SHD |
|---|---|---|---|---|
| PC | 37.04 | **50.00** | 42.55 | 24 |
| GES | 23.08 | 45.00 | 30.51 | 29 |
| DAG-GNN | 26.92 | 35.00 | 30.43 | 29 |
| GAE | 28.57 | 10.00 | 14.81 | 20 |
| GraN-DAG | 50.00 | 10.00 | 16.67 | 18 |
| NO-MLP | 16.67 | 10.00 | 12.50 | 22 |
| MgCSL | **100.00** | 30.00 | **46.15** | **14** |

**Results on real-world data.** To further verify the effectiveness of MgCSL, we conduct experiments on Sachs's protein signaling dataset (Sachs et al. 2005). The results in Table 2 show that MgCSL exhibits competitive performance compared with its rivals. PC, GES and DAG-GNN identify quite a lot of correct edges but at the cost of precision, resulting in the highest SHD. GAE and NO-MLP miss too many edges, leading to low recall. The sparse inferred graph of GraN-DAG contains fewer errors. This can be attributed to its preprocessing procedure, which selects candidate neighbors for each variable, reducing the possibility of making mistakes. MgCSL outperforms the baselines in terms of precision, F1 and SHD while identifying 6 correct causal edges. This signifies its tangible potential for real-world applications.

### Ablation Study

We conduct an ablation study to investigate the effectiveness of the proposed acyclicity constraint. We run NO-MLP and NO-MLP combined with our proposed acyclicity constraint (named NO-MLP+) on the dataset of nonlinear models with additive Gaussian processes. As shown in Figure 3, NO-MLP+ yields sustained high precision even as the number of variables increases, while NO-MLP discovers excessive erroneous edges. Due to the stronger influence of directed cycles on the eigenvalues, NO-MLP+ prunes a significant number of edges, including correct ones, leading to a decrease in F1 score. Nevertheless, in terms of SHD, NO-MLP+ performs comparably to NO-MLP. Most importantly, NO-MLP+ reduces time consumption by about an order of magnitude. This verifies that the proposed acyclicity penalty is beneficial for efficiently and accurately discovering causal DAGs.

Figure 3: Precision, F1, SHD and runtime of NO-MLP and NO-MLP+ (NO-MLP combined with our proposed acyclicity constraint) on d ∈ {50, 100, 200}.

In addition, we carry out a parameter sensitivity study w.r.t. $\alpha_1$ and $\alpha_2$. A small $\alpha_1$ may introduce considerable noise, while an excessively large $\alpha_1$ could weaken the representational capacity of the macro-variables. Besides, a larger $\alpha_2$ can improve performance in terms of precision and running time, but MgCSL tends to produce an overly sparse estimated graph with few edges when $\alpha_2$ is too large. Based on the aforementioned analysis, we set $\alpha_1 = 0.1$ and $\alpha_2 = 0.01$.

### Case Study

To further appraise the effectiveness of MgCSL in real-world applications, we conduct a case study on the functional magnetic resonance imaging (fMRI) Hippocampus dataset, which consists of resting-state signals from six separate brain regions (Poldrack et al. 2015). It was collected from the same person over 84 days, with a sample size of 518 for each day. We apply our proposed MgCSL to 10 successive days. We consider the top 5 edges discovered by MgCSL as the estimated causal edges: {PHC, Sub} → ERC, ERC → DG, CA1 → Sub, and CA1 → DG. We compare the results with anatomical directed connections (Bird and Burgess 2008; Zhang et al. 2017), since, theoretically, when two regions have an anatomical connection, there is a high likelihood of a causal relationship between them. The ground truths provided by anatomical structures contain cycles, while we assume that there are no directed cycles in the causal graph. The results indicate that the first four edges discovered by MgCSL match the fact that most of the hippocampus's neocortical inputs come from the PHC via the ERC, with most of its neocortical output directed towards the Sub, which also projects back to the ERC. The collaborative interaction of PHC and Sub on ERC unveils potential cross-area coordination within the brain. The last edge is reversed, possibly due to the presence of directed cycles, leading MgCSL to misjudge the direction.
In summary, the results demonstrate the effectiveness of MgCSL in identifying causal connections between different brain regions on real fMRI datasets.

## Conclusion

In this paper, we study how to acquire multi-granularity causal structure from observational data, which is a practical and significant, but currently under-researched, problem. We propose an effective approach, MgCSL, that leverages an SAE to extract latent macro-variables, and utilizes MLPs with a simplified acyclicity penalty to speed up the discovery of multi-granularity causal structure. The efficacy of MgCSL is verified by experiments on synthetic and real-world datasets.

## Acknowledgements

This work is supported by NSFC (No. 62072380, 62031003 and 62272276), the National Key Research and Development Program of China (No. 2022YFC3502101), Taishan Scholars Project Special Funding, the Xiaomi Young Talents Program, and the CAAI-Huawei MindSpore Open Fund.

## References

Bird, C. M.; and Burgess, N. 2008. The hippocampus and memory: insights from spatial processing. Nat. Rev. Neurosci., 9(3): 182-194.
Byrd, R. H.; Lu, P.; Nocedal, J.; and Zhu, C. 1995. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput., 16(5): 1190-1208.
Chickering, D. M. 2002. Optimal structure identification with greedy search. JMLR, 3(11): 507-554.
Colombo, D.; Maathuis, M. H.; et al. 2014. Order-independent constraint-based causal structure learning. JMLR, 15(116): 3921-3962.
Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. JMLR, 7: 1-30.
Elder-Vass, D. 2012. Top-down causation and social structures. Interface Focus, 2(1): 82-90.
Hauser, A.; and Bühlmann, P. 2012. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. JMLR, 13(79): 2409-2464.
He, Y.; Cui, P.; Shen, Z.; Xu, R.; Liu, F.; and Jiang, Y. 2021. DARING: Differentiable causal discovery with residual independence. In KDD, 596-605.
Hoyer, P.; Janzing, D.; Mooij, J.; Peters, J.; and Schölkopf, B. 2008. Nonlinear causal discovery with additive noise models. In NeurIPS, 689-696.
Lachapelle, S.; Brouillard, P.; Deleu, T.; and Lacoste-Julien, S. 2020. Gradient-based neural DAG learning. In ICLR.
Lee, H.-C.; Danieletto, M.; Miotto, R.; Cherng, S. T.; and Dudley, J. T. 2019. Scaling structural learning with NO-BEARS to infer causal transcriptome networks. In PSB, 391-402. World Scientific.
Liang, J.; Wang, J.; Yu, G.; Domeniconi, C.; Zhang, X.; and Guo, M. 2023a. Gradient-based local causal structure learning. IEEE TCYB, 99(1): 1-10.
Liang, J.; Wang, J.; Yu, G.; Guo, W.; Domeniconi, C.; and Guo, M. 2023b. Directed acyclic graph learning on attributed heterogeneous network. TKDE, 99(1): 1-12.
Liang, J.; Wang, J.; Yu, G.; Xia, S.; and Wang, G. 2023c. Multi-granularity Causal Structure Learning. arXiv:2312.05549.
Loewer, B. 2012. The emergence of time's arrows and special science laws from physics. Interface Focus, 2(1): 13-19.
Marella, D.; and Vicard, P. 2022. Bayesian network structural learning from complex survey data: a resampling based approach. SMA, 31(4): 981-1013.
Ng, A.; et al. 2011. Sparse autoencoder. CS294A Lecture Notes, 72(2011): 1-19.
Ng, I.; Zhu, S.; Chen, Z.; and Fang, Z. 2019. A graph autoencoder approach to causal structure learning. In NeurIPS Workshop.
Noble, D. 2012. A theory of biological relativity: no privileged level of causation. Interface Focus, 2(1): 55-64.
Peters, J.; Mooij, J. M.; Janzing, D.; and Schölkopf, B. 2014. Causal discovery with continuous additive noise models. JMLR, 15: 2009-2053.
Poldrack, R. A.; Laumann, T. O.; Koyejo, O.; Gregory, B.; Hover, A.; Chen, M.-Y.; Gorgolewski, K. J.; Luci, J.; Joo, S. J.; Boyd, R. L.; et al. 2015. Long-term neural and physiological phenotyping of a single human. Nat. Commun., 6(1): 8885.
Qiang, W.; Li, J.; Su, B.; Fu, J.; Xiong, H.; and Wen, J.-R. 2023. Meta attention-generation network for cross-granularity few-shot learning. IJCV, 131(5): 1211-1233.
Ramsey, J.; Glymour, M.; Sanchez-Romero, R.; and Glymour, C. 2017. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. JDSA, 3(2): 121-129.
Richens, J. G.; Lee, C. M.; and Johri, S. 2020. Improving the accuracy of medical diagnosis with causal machine learning. Nat. Commun., 11(1): 3923.
Sachs, K.; Perez, O.; Pe'er, D.; Lauffenburger, D. A.; and Nolan, G. P. 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523-529.
Schölkopf, B.; Locatello, F.; Bauer, S.; Ke, N. R.; Kalchbrenner, N.; Goyal, A.; and Bengio, Y. 2021. Toward causal representation learning. Proceedings of the IEEE, 109(5): 612-634.
Spirtes, P.; Glymour, C. N.; Scheines, R.; and Heckerman, D. 2000. Causation, Prediction, and Search. Cambridge, MA: MIT Press, 2nd edition.
Vandenbroucke, J. P.; Broadbent, A.; and Pearce, N. 2016. Causality and causal inference in epidemiology: the need for a pluralistic approach. Int. J. Epidemiol., 45(6): 1776-1786.
Varley, T. F.; Pope, M.; Faskowitz, J.; and Sporns, O. 2023. Multivariate information theory uncovers synergistic subsystems of the human cerebral cortex. Commun. Biol., 6(1): 451.
von Kügelgen, J.; Karimi, A.-H.; Bhatt, U.; Valera, I.; Weller, A.; and Schölkopf, B. 2022. On the fairness of causal algorithmic recourse. In AAAI, 9584-9594.
Wang, G. 2017. DGCC: data-driven granular cognitive computing. Granul. Comput., 2(4): 343-355.
Wang, G.; Fu, S.; Yang, J.; and Guo, Y. 2022. A review of research on multi-granularity cognition based intelligent computing. Chin. J. Comput., 45(6): 1161-1175.
Wang, X.; Du, Y.; Zhu, S.; Ke, L.; Chen, Z.; Hao, J.; and Wang, J. 2021. Ordering-based causal discovery with reinforcement learning. In IJCAI, 3566-3573.
Wang, Y.; Liang, D.; Charlin, L.; and Blei, D. M. 2020. Causal inference for recommender systems. In RecSys, 426-431.
Wu, M.; Zhang, Q.; Wu, C.; and Wang, G. 2023. End-to-end multi-granulation causality extraction model. DCN.
Yang, C.; Zhang, Z.; Ding, J.; Zheng, W.; Jing, Z.; and Li, Y. 2022. A multi-granularity network for emotion-cause pair extraction via matrix capsule. In CIKM, 4625-4629.
Yang, D.; Yu, G.; Wang, J.; Wu, Z.; and Guo, M. 2023. Reinforcement causal structure learning on order graph. In AAAI, 10737-10744.
Yu, Y.; Chen, J.; Gao, T.; and Yu, M. 2019. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 7154-7163.
Zhang, K.; Huang, B.; Zhang, J.; Glymour, C.; and Schölkopf, B. 2017. Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In IJCAI, 1347-1353.
Zhang, K.; and Hyvärinen, A. 2009. On the identifiability of the post-nonlinear causal model. In UAI, 647-655.
Zheng, X.; Aragam, B.; Ravikumar, P. K.; and Xing, E. P. 2018. DAGs with NO TEARS: continuous optimization for structure learning. In NeurIPS, 9492-9503.
Zheng, X.; Dan, C.; Aragam, B.; Ravikumar, P.; and Xing, E. 2020. Learning sparse nonparametric DAGs. In AISTATS, 3414-3425.