Published in Transactions on Machine Learning Research (10/2024)

# Federated Graph Learning with Graphless Clients

Xingbo Fu (xf3av@virginia.edu), Department of Electrical and Computer Engineering, University of Virginia
Song Wang (sw3wv@virginia.edu), Department of Electrical and Computer Engineering, University of Virginia
Yushun Dong (yd6eb@virginia.edu), Department of Electrical and Computer Engineering, University of Virginia
Binchi Zhang (epb6gw@virginia.edu), Department of Electrical and Computer Engineering, University of Virginia
Chen Chen (zrh6du@virginia.edu), Department of Computer Science, University of Virginia
Jundong Li (jundong@virginia.edu), Department of Electrical and Computer Engineering, University of Virginia

Reviewed on OpenReview: https://openreview.net/forum?id=mVAp0eDfyR

**Abstract.** Federated Graph Learning (FGL) is tasked with training machine learning models, such as Graph Neural Networks (GNNs), for multiple clients, each with its own graph data. Existing methods usually assume that each client has both the node features and the graph structure of its graph data. In real-world scenarios, however, there exist federated systems where only a part of the clients have such data, while other clients (i.e., graphless clients) may only have node features. This naturally leads to a novel problem in FGL: how to jointly train a model over distributed graph data with graphless clients? In this paper, we propose a novel framework, FedGLS, to tackle this problem in FGL. In FedGLS, we devise a local graph learner on each graphless client which learns the local graph structure with the structure knowledge transferred from other clients. To enable structure knowledge transfer, we design a GNN model and a feature encoder on each client.
During local training, the feature encoder retains the local graph structure knowledge together with the GNN model via knowledge distillation, and the structure knowledge is transferred among clients during the global update. Our extensive experiments demonstrate the superiority of the proposed FedGLS over five baselines.

## 1 Introduction

Recent years have witnessed a growing development of graph-based applications in a wide range of high-impact domains. As a powerful deep learning tool for graph-based applications, Graph Neural Networks (GNNs) exploit the abundant information inherent in graphs (Wu et al., 2020) and show superior performance in different domains, such as node classification (Fu et al., 2023; He et al., 2022) and link prediction (Tan et al., 2023a; Zhang & Chen, 2018).

Figure 1: An example of a healthcare system including four hospitals. In this example, Hospital A and Hospital D have their local datasets of patients (node features) and co-staying information (links) among them. In the meantime, Hospital B and Hospital C only have their local datasets of patients (node features). The four hospitals aim to jointly train a model for predicting whether a patient is at high risk of contracting a contagious disease, orchestrated by a third-party company over their local datasets, while the company cannot directly access their private datasets.

Traditionally, GNNs are trained over graph data stored on a single machine. In the real world, however, multiple data owners (i.e., clients) may have their own graph data and hope to jointly train GNNs over it. The challenge in this scenario is that graph data is often not allowed to be collected from different places onto a single server for training, due to the emphasis on data security and user privacy (Voigt & Von dem Bussche, 2017; Wang et al., 2024b).
Consider a group of hospitals, for instance, each with its own local dataset of patients. These hospitals hope to collaboratively train GNNs for patient classification tasks (e.g., predicting whether a patient is at high risk of contracting a contagious disease) while keeping their private patient data local due to strict privacy policies and commercial competition. To tackle the above challenge, Federated Learning (FL) (McMahan et al., 2017) enables collaborative training among clients over their private graph data, orchestrated by a central server. Typically, FL can be categorized into horizontal and vertical FL based on how data is distributed among clients (Yang et al., 2019). This study focuses on horizontal FL, where the distributed datasets share the same feature space. During each training round, the selected clients receive the global model from the central server and perform local updates over their local graph data. The central server then aggregates the updated local models from the clients and computes the new global model for the next training round. Numerous studies have been proposed to improve FL performance over graph data by reconstructing cross-client information (Zhang et al., 2021b;a), aligning overlapping instances (Peng et al., 2021; Zhou et al., 2022), and mitigating data heterogeneity (Xie et al., 2021; Tan et al., 2023b; Fu et al., 2024).

The above methods rely on a fundamental assumption that each client has the graph structure information of its local graph data. In the real world, however, this assumption may not hold for all clients. Instead, some clients may have only local node features, whereas other clients have both node features and edge information. To illustrate this scenario in practice, we provide a real-world example as follows. More practical examples can be found in Appendix A.

**Motivating example.**
Consider the aforementioned example of a healthcare system, as shown in Figure 1. We may construct local graphs in each hospital by taking patient demographics as node features and co-staying in a ward as edges. In the real world, however, some hospitals may not record the co-staying information and thus cannot construct patient graphs. As a result, these hospitals are unable to directly train GNNs in a federated manner to predict whether a patient is at high risk of contracting a contagious disease. If we instead train a machine learning model based only on patient features, the classification performance will be unsatisfactory, because we disregard the important co-staying information between patients, which can significantly determine a patient's risk of contracting a contagious disease.

The above scenario brings us a novel problem in the federated setting: how can we jointly train a GNN model for the classification task from isolated graphs distributed across multiple clients when some clients only have node features? In this paper, we refer to such clients as graphless clients. Since directly training GNNs is obviously infeasible in this setting, collaboratively training non-GNN-based models, such as multi-layer perceptrons (MLPs) and Support Vector Machines (SVMs), is a plausible solution to the above problem. However, a number of experiments in prior works have demonstrated that non-GNN-based models are typically less accurate than GNNs for the classification task (Franceschi et al., 2019; Zhang et al., 2022). Another intuitive method is to let each graphless client construct a graph structure based on feature similarity (e.g., using kNN (Gidaris & Komodakis, 2019)) and jointly train GNNs together with the other clients.
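To make this intuitive baseline concrete, the following is a minimal sketch of kNN graph construction from node features; the function name and the choice of cosine similarity are illustrative assumptions, not details from the paper.

```python
import numpy as np

def knn_adjacency(X, k=2):
    """Build a symmetric kNN adjacency matrix from node features.

    Illustrative sketch of the feature-similarity baseline: each node
    is linked to its k most cosine-similar neighbors (our own choice).
    """
    norm = np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    S = (X / norm) @ (X / norm).T          # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)           # exclude self-loops
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, :k]    # top-k neighbors per node
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)              # symmetrize
```

As the surrounding text notes, such a graph depends only on the node features, which is exactly the limitation discussed next.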
A disadvantage of this method is that the generated graphs depend only on node features and are not well suited for node classification (Franceschi et al., 2019; Liu et al., 2022). To overcome this disadvantage, it is natural to let the graphless clients produce graph structures with the structure knowledge of other clients. However, structure knowledge transfer and utilization in a federated manner remain challenging and unexplored.

In this study, we propose a novel framework, FedGLS, to handle FGL with graphless clients. FedGLS aims to solve two key challenges of utilizing structure knowledge in this scenario: 1) how to transfer structure knowledge among clients; and 2) how to utilize the transferred knowledge on graphless clients. In FedGLS, we first design two modules - a GNN model and a feature encoder - on each client. In addition, we deploy a third module - a local graph learner - on each graphless client. The GNN model learns node embeddings over the local graph, and the feature encoder approximates the output of the GNN model via knowledge distillation (Hinton et al., 2015). Therefore, the GNN model and the feature encoder together retain structure knowledge on each client. The central server collects the local parameters of the two modules from the clients and computes the global parameters. In this way, FedGLS transfers structure knowledge among clients. The local graph learner utilizes the structure knowledge by maximizing the consistency between the outputs of the global GNN model and the feature encoder on each graphless client with a contrastive loss. We conduct extensive experiments over five datasets, and the results show that FedGLS outperforms the other baselines. Our contributions can be summarized as follows. **Problem Formulation.** We propose a novel research problem of FGL with graphless clients and provide a formal definition of the proposed problem. **Algorithm Design.** We propose FedGLS to tackle the proposed problem.
We devise a scheme for transferring structure knowledge among clients in FedGLS by letting a feature encoder imitate the node embeddings from a GNN model. The scheme enables a graph learner on each graphless client to learn its local graph structure with the structure knowledge transferred from other clients. **Experimental Evaluations.** We conduct extensive experiments on real-world datasets, and the results validate the superiority of our proposed FedGLS against five baselines. Our implementation of FedGLS is available in the supplementary materials.

## 2 Related Work

### 2.1 Federated Learning

FL (McMahan et al., 2017) enables participants (i.e., clients) to jointly train a model under the coordination of a central server without sharing their private data. One key problem that FL concerns is statistical heterogeneity: the data across clients are likely to be non-IID (Li et al., 2020a; Wu et al., 2024a). When each client updates its local model based on its local dataset, its local objective may be far from the global objective. Thus, the averaged global model drifts away from the global optimum (Wang et al., 2024a; Wu et al., 2024b; Wang et al., 2023) and hampers the convergence of FL. To overcome the performance degradation of FedAvg when the data at each client are statistically heterogeneous (non-IID), a number of recent studies have been proposed from different aspects. Typically, these studies can be categorized into single global model-based methods and personalized model-based methods. Single global model-based methods aim to train one global model for all clients. For instance, FedProx (Li et al., 2020b) adds a proximal term to the local training loss to keep the updated parameters close to the global model. SCAFFOLD (Karimireddy et al., 2020) corrects the gradient updates of local models to mitigate client drift between the local models and the global model.
MOON (Li et al., 2021) uses a contrastive loss to increase the distance between the current and previous local models. Personalized model-based methods instead enable each client to train a personalized model to mitigate the impact of data heterogeneity. For example, pFedHN (Shamsian et al., 2021) trains a central hypernetwork to output a unique personal model for each client. FedProto (Tan et al., 2022) and FedProc (Mu et al., 2023) utilize prototypes to regularize local model training.

### 2.2 Federated Graph Learning

Following many well-designed FL methods for Euclidean data (e.g., images), a number of recent studies have begun tackling challenges in FL on graph data (Fu et al., 2022) to achieve better performance on downstream tasks. One specific problem in FL on graph data is missing neighbor information when each client owns only a part of the original graph. Recently proposed methods recover the missing neighbor information by transmitting intermediate results (Zhang et al., 2021a) and generating missing neighbors (Zhang et al., 2021b). Another interesting issue in FL on graphs is aligning overlapping instances across clients; this issue arises in FL with heterogeneous graphs (Peng et al., 2021) and vertical FL on graphs (Zhou et al., 2022). In the meantime, some recent studies focus on the unique challenges caused by data heterogeneity in FGL. For example, GCFL (Xie et al., 2021) enables clients with similar graph structure properties to share model parameters. FedStar (Tan et al., 2023b) designs a structure encoder to share structure knowledge among clients for graph classification. FedLit (Xie et al., 2023) mitigates the impact of link-type heterogeneity underlying homogeneous graphs in FGL via an EM-based clustering algorithm. Different from the aforementioned problems, where the clients own structured data (i.e., graphs), our work aims to deal with the setting where a part of the clients do not have structure information.
## 3 Problem Statement

In this section, we first present basic concepts in GNNs and FL. Then we propose a novel problem setting of FGL with graphless clients.

### 3.1 Preliminaries

Before formally presenting the formulation of the novel problem, we first introduce the concepts of GNNs and FGL.

**Notations.** We use bold uppercase letters (e.g., $\mathbf{A}$) to represent matrices. For any matrix, e.g., $\mathbf{Z}$, we use the corresponding bold lowercase letter $\mathbf{z}_i$ to denote its $i$-th row vector. We use letters in calligraphy font (e.g., $\mathcal{V}$) to denote sets, and $|\mathcal{V}|$ denotes the cardinality of set $\mathcal{V}$.

**Graph Neural Networks.** We use $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{X})$ to denote an attributed graph, where $\mathcal{V} = \{v_1, v_2, \dots, v_n\}$ is the set of $n = |\mathcal{V}|$ nodes, $\mathcal{E}$ is the edge set, and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the node feature matrix, with $d$ the number of node features. The edges describe the relations between nodes and can also be represented by an adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$. A GNN model $f$ parameterized by $\theta = \{\theta_z, \theta_c\}$ learns the node embeddings $\mathbf{Z} \in \mathbb{R}^{n \times d'}$ based on $\mathbf{X}$ and $\mathbf{A}$:

$$\mathbf{Z} = f(\mathbf{X}, \mathbf{A}; \theta_z). \tag{1}$$

Here $d'$ is the embedding dimension, and $\theta_z$ represents the parameters of the encoder part in $f$ that produces the node embeddings $\mathbf{Z}$. For the node classification task, we use $\mathbf{Z}$ to obtain the prediction $\hat{\mathbf{Y}} = f(\mathbf{Z}; \theta_c) \in \mathbb{R}^{n \times p}$, where $p$ is the number of classes and $\theta_c$ represents the parameters of the predictor in $f$. Given the label set $\mathcal{Y}_L = \{y_1, y_2, \dots, y_L\}$, where $y_i$ denotes the label of node $v_i \in \mathcal{V}_L = \{v_1, v_2, \dots, v_L\}$, the objective is to minimize the difference between $\hat{\mathbf{Y}}$ and $\mathcal{Y}_L$:

$$\min_{\theta} \ \mathcal{L}(\theta) = \frac{1}{|\mathcal{V}_L|} \sum_{v_i \in \mathcal{V}_L} \ell(\hat{y}_i, y_i), \tag{2}$$

where $\ell(\cdot, \cdot)$ is the cross-entropy loss.

**Federated Graph Learning.** In a federated system with $K$ clients $\mathcal{C} = \{c^{(1)}, c^{(2)}, \dots, c^{(K)}\}$, each client $c^{(k)} \in \mathcal{C}$ owns a private graph $\mathcal{G}^{(k)} = (\mathcal{V}^{(k)}, \mathcal{E}^{(k)}, \mathbf{X}^{(k)})$ with $n^{(k)} = |\mathcal{V}^{(k)}|$. The goal of the clients is to collaboratively train a GNN model $f$ parameterized by $\theta$, orchestrated by a central server, while keeping the private datasets local.
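As a concrete illustration of the GNN pipeline in Eqs. (1)-(2), the following is a minimal sketch of a two-layer GCN forward pass with a symmetrically normalized adjacency matrix; the specific architecture (ReLU activation, dense matrices, weight names `W1`/`W2`) is an assumption for illustration, not the paper's exact model.

```python
import numpy as np

def gcn_forward(X, A, W1, W2):
    """Minimal two-layer GCN forward pass (illustrative sketch).

    Z = ReLU(A_norm X W1) gives node embeddings; A_norm Z W2 gives
    class logits, with A_norm the symmetrically normalized adjacency
    with self-loops. Weight shapes are illustrative assumptions.
    """
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # D^{-1/2} A_hat D^{-1/2}
    Z = np.maximum(A_norm @ X @ W1, 0.0)         # node embeddings Z
    logits = A_norm @ Z @ W2                     # per-node class scores
    return Z, logits
```

In the notation above, `W1` plays the role of $\theta_z$ and `W2` the role of $\theta_c$; a softmax over the logits followed by cross-entropy against the labeled nodes realizes Eq. (2).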
Specifically, the objective is to solve

$$\arg\min_{\theta} \ F(\theta) := \sum_{k=1}^{K} \frac{n^{(k)}}{N} \mathcal{L}^{(k)}(\theta), \tag{3}$$

where $\mathcal{L}^{(k)}(\theta) = \frac{1}{|\mathcal{V}_L^{(k)}|} \sum_{v_i \in \mathcal{V}_L^{(k)}} \ell(\hat{y}_i, y_i)$, and $N = \sum_{k=1}^{K} n^{(k)}$ is the total number of nodes across all the clients. FedAvg (McMahan et al., 2017) is one of the generic federated optimization methods and can be directly applied to FL on graph data. Typically, during each training round, a client $c^{(k)}$ updates its local GNN parameters $\theta^{(k)}$ over its private graph $\mathcal{G}^{(k)}$ via SGD for a number of epochs, with the parameters initialized to the global GNN parameters $\theta$. At the end of the round, the server collects $\{\theta^{(k)}\}_{k=1}^{K}$ from the clients and computes the new global GNN parameters by

$$\theta \leftarrow \sum_{k=1}^{K} \frac{n^{(k)}}{N} \theta^{(k)}. \tag{4}$$

The new global GNN parameters $\theta$ are used for local training in the next round.

### 3.2 Problem Definition

Jointly training a GNN model requires each client $c^{(k)}$ to know the structure information (e.g., $\mathbf{A}^{(k)}$) of its private graph $\mathcal{G}^{(k)}$. However, this requirement may not be feasible for all the clients, and some clients only have their local node feature matrices $\mathbf{X}^{(k)}$. As a result, we cannot obtain a GNN model trained directly over the distributed graph data on all the clients. Based on the aforementioned challenges, we propose a novel problem setting in FGL. We provide a formal definition of the problem as follows.

**Problem 1 (Federated graph learning with graphless clients).** Given a set of $K$ clients $\mathcal{C} = \{c^{(k)}\}_{k=1}^{K}$ and $1 < M < K$, each client $c^{(k)} \in \mathcal{C}_1 = \{c^{(k)}\}_{k=1}^{M}$ owns the node features $\mathbf{X}^{(k)}$ of its private graph $\mathcal{G}^{(k)}$ with the complete structure information (e.g., $\mathbf{A}^{(k)}$), while each graphless client $c^{(k)} \in \mathcal{C}_2 = \{c^{(k)}\}_{k=M+1}^{K}$ only owns its local node features $\mathbf{X}^{(k)}$. The goal of all the clients $\mathcal{C}$ is to collaboratively train a model for the node classification task without sharing their graph data.

## 4 Methodology

In this section, we present the proposed framework FedGLS.
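FedGLS reuses the FedAvg aggregation of Eq. (4) for its global update, so we sketch it once here; the helper name and the list-of-arrays parameter layout are illustrative assumptions.

```python
import numpy as np

def fedavg_aggregate(local_params, num_nodes):
    """Node-count-weighted FedAvg aggregation, as in Eq. (4):
    theta = sum_k (n_k / N) * theta_k.

    local_params: one entry per client, each a list of ndarrays
    (one per layer). Layout is an illustrative assumption.
    """
    N = float(sum(num_nodes))
    weights = [n / N for n in num_nodes]
    # Average layer by layer across clients.
    return [sum(w * p for w, p in zip(weights, layer))
            for layer in zip(*local_params)]
```

A client holding three quarters of the nodes contributes three quarters of the weight to every averaged layer.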
The goal of FedGLS is to let graphless clients learn local graph structures with the structure knowledge transferred from other clients. To achieve this goal, FedGLS solves the following two challenges: 1) how to transfer structure knowledge among clients; and 2) how to utilize the transferred structure knowledge on graphless clients. To handle the first challenge, we design a GNN model and a feature encoder on each client. The feature encoder aims to retain structure knowledge together with the GNN model via knowledge distillation. To utilize the transferred structure knowledge, we design a graph learner on each graphless client. This module generates the local graph structure and learns the structure knowledge in the GNN model and the feature encoder via a contrastive loss.

### 4.1 Framework Overview

Figure 2 illustrates an overview of FedGLS. It consists of two training stages: local training on the clients and global update on the central server.

Figure 2: An overview of the proposed FedGLS.

**Local Training.** On each client, a GNN model generates node embeddings with respect to the node features and the local graph structure. The goal of the feature encoder is to retain structure knowledge together with the GNN model. It approximates the output (i.e., node embeddings) of the GNN model using the knowledge learned by the GNN model. On each graphless client, a graph learner generates the local graph structure and learns the structure knowledge via a contrastive loss.
Finally, the well-trained graph learner produces the local graph structure, and the GNN model learns more expressive node embeddings.

**Global Update.** After local training, the central server gathers the local parameters of the GNN model and the feature encoder from the clients. Then it computes the new global parameters following FedAvg and broadcasts them to the clients for local training in the next round.

### 4.2 Local Training

As introduced above, we deploy a GNN model and a feature encoder on each client. For each graphless client, an extra graph learner is deployed to generate the graph structure. In this subsection, we introduce the local updates of the three modules during local training on the clients.

For each graphless client $c^{(k)} \in \mathcal{C}_2$, a graph learner $g$ produces an adjacency matrix $\mathbf{S}^{(k)} \in \mathbb{R}^{n^{(k)} \times n^{(k)}}$ with respect to the node feature matrix $\mathbf{X}^{(k)}$. Formally, we formulate the graph learner as

$$\mathbf{S}^{(k)} = g(\mathbf{X}^{(k)}; \omega^{(k)}) = \Omega(\mathrm{Gen}(\mathbf{X}^{(k)}; \omega^{(k)})), \tag{5}$$

where $\omega^{(k)}$ denotes the parameters in $g$. $\mathrm{Gen}(\cdot)$ denotes a graph generator which produces a matrix $\mathbf{S}'^{(k)} \in \mathbb{R}^{n^{(k)} \times n^{(k)}}$ based on the node features. Typically, we instantiate the graph generator as an MLP encoder or an attentive encoder followed by a cosine similarity function. $\Omega(\cdot)$ is a non-parametric adjacency processor which conducts post-processing on $\mathbf{S}'^{(k)}$ and outputs $\mathbf{S}^{(k)}$. Generally, it includes four main operations: sparsification, activation, symmetrization, and normalization (Liu et al., 2022). The adjacency processor $\Omega(\cdot)$ ensures that $\mathbf{S}^{(k)}$ is a normalized symmetric sparse adjacency matrix with non-negative values. For detailed descriptions of how the graph generators and adjacency processors in the graph learner are designed, readers can refer to Appendix B. Then the GNN model $f$ on each graphless client $c^{(k)} \in \mathcal{C}_2$ produces node embeddings with the generated adjacency matrix $\mathbf{S}^{(k)}$ and the node feature matrix $\mathbf{X}^{(k)}$ as the input.
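As an illustration, a hypothetical instantiation of the adjacency processor with the four operations above might look as follows; the top-k sparsification and symmetric normalization choices are our own assumptions, since the paper defers the exact design to Appendix B.

```python
import numpy as np

def adjacency_processor(S_raw, k=2):
    """Hypothetical adjacency processor: sparsification (top-k per
    row), activation (ReLU), symmetrization, and symmetric
    normalization D^{-1/2} S D^{-1/2}. Choices are illustrative."""
    n = S_raw.shape[0]
    S = np.zeros_like(S_raw)
    idx = np.argsort(-S_raw, axis=1)[:, :k]       # keep top-k scores per row
    rows = np.repeat(np.arange(n), k)
    S[rows, idx.ravel()] = S_raw[rows, idx.ravel()]
    S = np.maximum(S, 0.0)                        # non-negative values
    S = (S + S.T) / 2.0                           # symmetrize
    d = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    return S * inv_sqrt[:, None] * inv_sqrt[None, :]
```

Symmetric (rather than row-wise) normalization is used here so that the output stays symmetric, matching the stated requirement on the processed adjacency matrix.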
For each client $c^{(k)} \in \mathcal{C}_1$, the client instead uses the original adjacency matrix $\mathbf{A}^{(k)}$. Specifically, the GNN model produces the node embeddings $\mathbf{Z}^{(k)}$ by

$$\mathbf{Z}^{(k)} = f(\mathbf{X}^{(k)}, \mathbf{S}^{(k)}; \theta^{(k)}) \tag{6}$$

for each client $c^{(k)} \in \mathcal{C}_2$; otherwise, $\mathbf{S}^{(k)}$ is replaced by the original adjacency matrix $\mathbf{A}^{(k)}$. $\theta^{(k)}$ denotes the parameters in $f$. In this paper, we instantiate the GNN model $f$ as a GCN. In the meantime, the feature encoder $h$ similarly generates the node representations $\mathbf{H}^{(k)}$, but based only on the node feature matrix $\mathbf{X}^{(k)}$:

$$\mathbf{H}^{(k)} = h(\mathbf{X}^{(k)}; \phi^{(k)}), \tag{7}$$

where $\phi^{(k)}$ denotes the parameters in $h$. In this paper, we instantiate the feature encoder $h$ as an MLP.

**Optimizing $\omega^{(k)}$.** For a graphless client $c^{(k)} \in \mathcal{C}_2$, its well-trained graph learner $g$ with parameters $\omega^{(k)}$ is supposed to learn the structure knowledge transferred from other clients. Typically, it produces the adjacency matrix $\mathbf{S}^{(k)}$ so that the node embeddings $\mathbf{Z}^{(k)}$ produced by the GNN model based on $\mathbf{S}^{(k)}$ and $\mathbf{X}^{(k)}$ are consistent with $\mathbf{H}^{(k)}$ produced by the feature encoder. To achieve this, we optimize the local graph learner by maximizing the agreement with a contrastive loss (e.g., NT-Xent (Chen et al., 2020a)). Specifically, we consider the embedding pair $\mathbf{z}_i^{(k)}$ and $\mathbf{h}_i^{(k)}$ of node $v_i \in \mathcal{V}^{(k)}$ as a positive pair. In contrast, the embedding $\mathbf{z}_i^{(k)}$ and the embedding of any other node $v_j \in \mathcal{V}^{(k)}$ (either in $\mathbf{Z}^{(k)}$ or $\mathbf{H}^{(k)}$) compose a negative pair. Our goal is to decrease the distance between positive pairs and increase the distance between negative pairs. Concretely, we formalize this as the contrastive loss

$$\ell_i^{(k)}(\mathbf{Z}^{(k)}, \mathbf{H}^{(k)}) = -\log \frac{e^{\mathrm{sim}(\mathbf{z}_i^{(k)}, \mathbf{h}_i^{(k)})/\tau}}{\sum_{j=1}^{n^{(k)}} \mathbb{1}_{[i \neq j]} \left[ e^{\mathrm{sim}(\mathbf{z}_i^{(k)}, \mathbf{h}_j^{(k)})/\tau} + e^{\mathrm{sim}(\mathbf{z}_i^{(k)}, \mathbf{z}_j^{(k)})/\tau} \right]}, \tag{8}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function and $\tau$ is a temperature parameter. The contrastive loss encourages $\mathbf{z}_i^{(k)}$ not to be too close to other node embeddings and therefore alleviates the over-smoothing issue in GNNs.
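A minimal sketch of this NT-Xent-style objective is given below; the dense numpy implementation and function name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def ntxent_loss(Z, H, tau=0.2):
    """Sketch of the contrastive objective aligning GNN embeddings Z
    with feature-encoder embeddings H. Positive pairs are matching
    rows (z_i, h_i); negatives are all other rows of Z and H."""
    def normalize(M):
        return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    Zn, Hn = normalize(Z), normalize(H)
    n = Z.shape[0]
    cross = np.exp(Zn @ Hn.T / tau)     # exp(sim(z_i, h_j) / tau)
    intra = np.exp(Zn @ Zn.T / tau)     # exp(sim(z_i, z_j) / tau)
    pos = np.diag(cross)                # positive pairs (z_i, h_i)
    mask = 1.0 - np.eye(n)              # exclude j == i from negatives
    denom = (mask * (cross + intra)).sum(axis=1)
    return float(np.mean(-np.log(pos / denom)))
```

As expected, the loss is smaller when the two views agree row-by-row than when they are misaligned.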
In the end, the local graph learner is optimized by minimizing the total contrastive loss:

$$\min_{\omega^{(k)}} \ \mathcal{L}_{CL}^{(k)} = \frac{1}{n^{(k)}} \sum_{i=1}^{n^{(k)}} \ell_i^{(k)}(\mathbf{Z}^{(k)}, \mathbf{H}^{(k)}). \tag{9}$$

**Optimizing $\theta^{(k)}$.** The GNN model $f$ aims to learn expressive node embeddings and thereby predict the labels of unlabeled nodes. To optimize the parameters $\theta^{(k)}$ in $f$ on each client $c^{(k)} \in \mathcal{C}$, we minimize the classification loss over all labeled nodes in $\mathcal{V}_L^{(k)}$:

$$\min_{\theta^{(k)}} \ \mathcal{L}_{CE}^{(k)} = \frac{1}{|\mathcal{V}_L^{(k)}|} \sum_{v_i \in \mathcal{V}_L^{(k)}} CE(\hat{y}_i, y_i), \tag{10}$$

where $CE(\cdot, \cdot)$ denotes the cross-entropy loss.

**Optimizing $\phi^{(k)}$.** The goal of the feature encoder is to produce $\mathbf{H}^{(k)}$, without structure information, that is consistent with $\mathbf{Z}^{(k)}$, which in turn enables the training of the graph learner. The key challenge for training graph learners is to enforce closeness between $\mathbf{H}^{(k)}$ and $\mathbf{Z}^{(k)}$. To tackle this issue, we resort to knowledge distillation (Hinton et al., 2015). The intuition of knowledge distillation is to let a student model learn with the knowledge (e.g., predictions) from a teacher model; typically, the student model is then able to produce outputs comparable to the teacher model's. In FedGLS, we choose to distill knowledge from the GNN model into the feature encoder. The feature encoder (e.g., an MLP) then achieves performance comparable to a GNN by learning the knowledge transferred from the GNN based only on the features (Zhang et al., 2022). In FedGLS, the parameters $\phi^{(k)}$ of the feature encoder on each client $c^{(k)} \in \mathcal{C}$ are updated by approximating the knowledge (i.e., the node embeddings $\mathbf{Z}^{(k)}$) from its GNN model. Specifically, the feature encoder on client $c^{(k)} \in \mathcal{C}$ is optimized by minimizing the discrepancy between $\mathbf{h}_i^{(k)}$ and $\mathbf{z}_i^{(k)}$ for $v_i \in \mathcal{V}^{(k)}$ via knowledge distillation (Hinton et al., 2015) as follows:

$$\min_{\phi^{(k)}} \ \mathcal{L}_{KD}^{(k)} = \sum_{v_i \in \mathcal{V}^{(k)}} KL\left(f(\mathbf{z}_i^{(k)}; \theta_c^{(k)}) \,\|\, f(\mathbf{h}_i^{(k)}; \theta_c^{(k)})\right), \tag{11}$$

where $KL(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence (KL-divergence).
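A minimal sketch of this distillation term follows; treating the shared predictor as a single linear map `Wc` followed by softmax, and averaging rather than summing over nodes, are simplifying assumptions of ours.

```python
import numpy as np

def kd_loss(Z, H, Wc):
    """Sketch of the distillation objective: KL divergence between the
    predictor's outputs on GNN embeddings Z (teacher) and on
    feature-encoder embeddings H (student). Wc stands in for the
    shared predictor parameters theta_c (an assumption)."""
    def softmax(L):
        e = np.exp(L - L.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    P = softmax(Z @ Wc)   # teacher predictions, f(z_i; theta_c)
    Q = softmax(H @ Wc)   # student predictions, f(h_i; theta_c)
    kl = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)), axis=1)
    return float(kl.mean())
```

The loss vanishes when the student reproduces the teacher's predictions exactly and grows as the two prediction distributions diverge.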
### 4.3 Global Update

During the global update, the central server gathers the local $\theta^{(k)}$ and $\phi^{(k)}$ from the clients. Then it computes the new global $\theta$ and $\phi$ with FedAvg. More specifically, the new global $\theta$ and $\phi$ for the next round are calculated by

$$\theta \leftarrow \sum_{k=1}^{K} \frac{n^{(k)}}{N} \theta^{(k)}, \quad \phi \leftarrow \sum_{k=1}^{K} \frac{n^{(k)}}{N} \phi^{(k)}. \tag{12}$$

### 4.4 Overall Algorithm

The overall federated training algorithm of FedGLS is shown in Algorithm 1. During each round, the central server sends the global $\theta$ and $\phi$ to the selected clients. For each client $c^{(k)} \in \mathcal{C}_2$, the graph learner $g$ first produces the adjacency matrix $\mathbf{S}^{(k)}$. The GNN model $f$ takes the node features $\mathbf{X}^{(k)}$ and the generated adjacency matrix $\mathbf{S}^{(k)}$ (clients in $\mathcal{C}_1$ use their original adjacency matrices $\mathbf{A}^{(k)}$ instead) to get the node representations $\mathbf{Z}^{(k)}$. Then the feature encoder $h$ computes the corresponding node embeddings $\mathbf{H}^{(k)}$ with only the node features $\mathbf{X}^{(k)}$ as input. The parameters $\omega^{(k)}$ in $g$ are updated by minimizing the discrepancy between $\mathbf{Z}^{(k)}$ and $\mathbf{H}^{(k)}$ using Eq. (9). Afterward, each client $c^{(k)} \in \mathcal{C}$ updates the parameters $\theta^{(k)}$ of $f$ via supervised learning using Eq. (10) and updates the parameters $\phi^{(k)}$ of $h$ with the knowledge distilled from the GNN model as supervision using Eq. (11). Finally, the central server collects the updated $\theta^{(k)}$ and $\phi^{(k)}$ from the clients to get the new global $\theta$ and $\phi$ using Eq. (12) for local training in the next round. We provide a complexity analysis in Appendix C.

## 5 Experiments

We conduct extensive experiments over five real-world datasets to verify the superiority of the proposed FedGLS. In particular, we aim to answer the following questions. RQ1: How does FedGLS perform compared with other state-of-the-art baselines? RQ2: How stable is FedGLS under different numbers of local epochs and various graphless client ratios?
### 5.1 Experiment Setup

#### 5.1.1 Datasets

We synthesize the distributed graph data based on five common real-world datasets, i.e., Cora (Sen et al., 2008), CiteSeer (Sen et al., 2008), PubMed (Sen et al., 2008), Flickr (Zeng et al., 2020), and ogbn-arxiv (Hu et al., 2020). Following the data partition strategy in previous studies (Huang et al., 2023; Zhang et al., 2021b), we synthesize the distributed graph data by splitting each dataset into multiple communities via the Louvain algorithm (Blondel et al., 2008); each community is regarded as an entire graph on a client. We summarize the statistics and basic information about the datasets in Appendix D. In our experiments, we randomly select half of the clients as graphless clients. Following the setting in (Zhang et al., 2021b), we randomly select nodes on each client and use 60% for training, 20% for validation, and the remainder for testing. We report the average accuracy for node classification over the clients for five random repetitions.

Algorithm 1: The detailed algorithm of FedGLS.
Input: global parameters $\theta$, $\phi$; learning rates $\alpha$, $\beta$, $\gamma$; number of local epochs $E$
Output: $\theta$

1. repeat
2. &nbsp;&nbsp;Server selects a subset of clients $\mathcal{C}_s$ from $\mathcal{C}$, then broadcasts $\theta$ and $\phi$ to $\mathcal{C}_s$
3. &nbsp;&nbsp;for each client $c^{(k)} \in \mathcal{C}_s$ do
4. &nbsp;&nbsp;&nbsp;&nbsp;$\theta^{(k)} \leftarrow \theta$, $\phi^{(k)} \leftarrow \phi$
5. &nbsp;&nbsp;&nbsp;&nbsp;if $c^{(k)} \in \mathcal{C}_2$ then
6. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Update graph learner
7. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute $\mathcal{L}_{CL}^{(k)}(\omega^{(k)}) = \frac{1}{n^{(k)}} \sum_{i=1}^{n^{(k)}} \ell_i^{(k)}(\mathbf{Z}^{(k)}, \mathbf{H}^{(k)})$
8. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\omega^{(k)} \leftarrow \omega^{(k)} - \gamma \nabla_{\omega^{(k)}} \mathcal{L}_{CL}^{(k)}(\omega^{(k)})$
9. &nbsp;&nbsp;&nbsp;&nbsp;end if
10. &nbsp;&nbsp;&nbsp;&nbsp;for $i = 1, 2, \dots, E$ do
11. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Update GNN classifier
12. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute $\mathcal{L}_{CE}^{(k)}(\theta^{(k)}) = \frac{1}{|\mathcal{V}_L^{(k)}|} \sum_{v_i \in \mathcal{V}_L^{(k)}} CE(\hat{y}_i, y_i)$
13. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\theta^{(k)} \leftarrow \theta^{(k)} - \alpha \nabla_{\theta^{(k)}} \mathcal{L}_{CE}^{(k)}(\theta^{(k)})$
14. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Update feature encoder
15. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute $\mathcal{L}_{KD}^{(k)}(\phi^{(k)}) = \sum_{v_i \in \mathcal{V}^{(k)}} KL(f(\mathbf{z}_i^{(k)}; \theta_c^{(k)}) \| f(\mathbf{h}_i^{(k)}; \theta_c^{(k)}))$
16. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\phi^{(k)} \leftarrow \phi^{(k)} - \beta \nabla_{\phi^{(k)}} \mathcal{L}_{KD}^{(k)}(\phi^{(k)})$
17. &nbsp;&nbsp;&nbsp;&nbsp;end for
18. &nbsp;&nbsp;&nbsp;&nbsp;Client $c^{(k)}$ sends $\theta^{(k)}$, $\phi^{(k)}$ back to the server
19. &nbsp;&nbsp;end for
20. &nbsp;&nbsp;Server updates $\theta$ and $\phi$ via $\theta \leftarrow \sum_k \frac{n^{(k)}}{N} \theta^{(k)}$, $\phi \leftarrow \sum_k \frac{n^{(k)}}{N} \phi^{(k)}$
21. until training stops

#### 5.1.2 Baselines

Since FedGLS is proposed to deal with a novel problem setting in FGL with graphless clients, most existing frameworks cannot be directly adopted without any preprocessing. Considering this, we first design the following two baselines.

- **Fed-MLP**: the clients in $\mathcal{C}$ jointly train an MLP model.
- **Fed-GNNMLP**: the clients in $\mathcal{C}_1$ collaboratively train a GNN model, while the clients in $\mathcal{C}_2$ collaboratively train an MLP model.

In the meantime, we include the following baseline, which can be adopted to handle the heterogeneous model architecture setting.

- **FedProto** (Tan et al., 2022): the local models on the clients in $\mathcal{C}_1$ are GNNs and those on the clients in $\mathcal{C}_2$ are MLPs; FedProto then performs collaborative training by aggregating prototypes. A prototype is the mean value of the embeddings of the instances in a class.

Furthermore, we also consider the following two baselines with preprocessing.
- **Local-GNNk**: it first performs the kNN operation on the graphless clients in $\mathcal{C}_2$; then each client in $\mathcal{C}$ individually trains a GNN model over its local data without any communication among the clients.
- **Fed-GNNk**: it first performs the kNN operation on the graphless clients in $\mathcal{C}_2$, then jointly trains a GNN model over the clients in $\mathcal{C}$.

Table 1: Node classification performance (Accuracy ± Std) over different datasets. Bold values indicate best mean performances among the compared methods.

| Datasets | Fed-MLP | Fed-GNNMLP | FedProto | Local-GNNk | Fed-GNNk | FedGLS | Fed-GNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | 0.6141 (±0.0599) | 0.7448 (±0.0104) | 0.8089 (±0.0078) | 0.8002 (±0.0267) | 0.7891 (±0.0269) | **0.8180** (±0.0281) | 0.8238 (±0.0109) |
| CiteSeer | 0.7253 (±0.0169) | 0.7521 (±0.0279) | 0.7876 (±0.0184) | 0.7834 (±0.0262) | 0.7884 (±0.0131) | **0.8058** (±0.0171) | 0.7951 (±0.0273) |
| PubMed | 0.8398 (±0.0044) | 0.8413 (±0.0088) | 0.8341 (±0.0097) | 0.8328 (±0.0098) | 0.8426 (±0.0128) | **0.8491** (±0.0070) | 0.8629 (±0.0017) |
| Flickr | 0.4649 (±0.0031) | 0.4868 (±0.0029) | 0.4720 (±0.0044) | 0.4844 (±0.0058) | 0.4796 (±0.0032) | **0.4937** (±0.0048) | 0.5014 (±0.0024) |
| ogbn-arxiv | 0.4495 (±0.0065) | 0.4652 (±0.0049) | 0.4685 (±0.0037) | 0.4573 (±0.0051) | 0.4761 (±0.0044) | **0.4872** (±0.0052) | 0.5386 (±0.0040) |

In the meantime, we include Fed-GNN, which uses the real graph structures on the graphless clients in $\mathcal{C}_2$. In principle, Fed-GNN should perform better than FedGLS and the aforementioned baselines, since Fed-GNN utilizes more information (i.e., the real graph structures on the clients in $\mathcal{C}_2$).

#### 5.1.3 Parameter Settings

We implement a two-layer GCN and a two-layer MLP as the GNN model and the feature encoder in FedGLS, respectively. We apply the same model architectures to the models used in the baselines. In FedGLS, we choose a two-layer attentive encoder as the graph generator in the graph learner. The hidden size of each layer in the two models is 16.
For all the aforementioned models, we use the Adam (Kingma & Ba, 2015) optimizer. The learning rates $\alpha$ and $\beta$ for the GNN model and the feature encoder are set to 0.01, and $\gamma$ for the graph learner is set to 0.001. The temperature $\tau$ in the contrastive loss is set to 0.2. The number of local epochs $E$ is set to 5. The number of rounds is set to 100 for Cora and CiteSeer, 200 for PubMed, 300 for Flickr, and 2,000 for ogbn-arxiv. All the clients are sampled during each round.

### 5.2 Effectiveness Evaluation

In this subsection, we evaluate the performance of FedGLS over the five datasets against the baselines to answer RQ1. Specifically, we compare FedGLS with the five baselines on node classification accuracy and convergence speed.

#### 5.2.1 Classification Performance

We first evaluate the node classification accuracy of the different approaches over the five datasets. We conduct all the experiments five times and report the average accuracy with standard deviation in Table 1. As for the baselines without preprocessing, Fed-GNNMLP outperforms Fed-MLP over all the datasets because Fed-GNNMLP utilizes the structure information on the clients in $\mathcal{C}_1$. However, there is still a large performance gap between Fed-GNNMLP and FedGLS, since the MLPs that Fed-GNNMLP trains solely on node features on the graphless clients cannot match the performance FedGLS achieves by using structure information. Although FedProto is a personalized approach that enables each client to train a local model with global prototypes as a constraint, it does not always perform better than Fed-MLP and Fed-GNNMLP (e.g., on PubMed and Flickr). Since the graphless clients in FedProto use MLPs as the backbone model while the other clients use GNNs, they may learn distinct prototypes with their different backbone models. As for the baselines with preprocessing, Fed-GNNk fails to perform better than Local-GNNk on Cora and Flickr.
In the meantime, we observe significant performance degradation of Fed-GNNk compared with Fed-GNN, which indicates that constructing graph structures via kNN is not a suitable substitute for real-world graph structures. Finally, we observe that FedGLS consistently achieves the highest classification accuracy among the baselines.

Figure 3: Convergence speeds of FedGLS and Fed-GNNk: (a) training loss curve and (b) test accuracy curve on Cora; (c) training loss curve and (d) test accuracy curve on Flickr.

Compared with Fed-GNNk, FedGLS delivers better prediction performance, much closer to Fed-GNN (even slightly higher on CiteSeer). This is because the graph learner produces well-learned graph structures on graphless clients, which benefit the training of the GNN model.

5.2.2 Convergence Speed

We then compare the convergence speed of FedGLS with the best baseline, Fed-GNNk. We present the curves of training loss and test accuracy during training on Cora and Flickr in Figure 3. From the results, we observe that the training loss of FedGLS decreases significantly faster than that of Fed-GNNk, and the test accuracy of FedGLS reaches relatively high values in fewer rounds than Fed-GNNk on both datasets. We can therefore conclude that FedGLS converges faster than Fed-GNNk. This is because the GNN model in FedGLS is trained over adaptive graphs whose adjacency matrices are generated by graph learners rather than simply produced by kNN as in Fed-GNNk.
The graph learners on graphless clients learn graph structures that are more suitable for the node classification task by exploiting the structure knowledge transferred from other clients.

Table 2: Node classification performance (Accuracy) under different numbers of local epochs on CiteSeer and PubMed. Bold and underlined values indicate best and second-best mean performances, respectively.

Dataset   | Epochs | Fed-MLP | Fed-GNNk | FedGLS
CiteSeer  | 3      | 0.7203  | 0.7876   | 0.8017
CiteSeer  | 5      | 0.7253  | 0.7884   | 0.8058
CiteSeer  | 10     | 0.7311  | 0.7859   | 0.8015
PubMed    | 3      | 0.8406  | 0.8467   | 0.8477
PubMed    | 5      | 0.8398  | 0.8426   | 0.8491
PubMed    | 10     | 0.8354  | 0.8455   | 0.8493

5.3 Sensitivity Study

In this subsection, we conduct sensitivity studies to answer RQ2. Specifically, we evaluate the performance of FedGLS under different numbers of local epochs and different graphless client ratios.

5.3.1 Local Epochs

The main experiments are conducted with E = 5 local epochs. In this part, we set the number of local epochs to 3 and 10 and report the average accuracy of Fed-MLP, Fed-GNNk, and FedGLS. Table 2 shows the results on CiteSeer and PubMed. From the results, we observe that FedGLS performs stably under different numbers of local epochs and achieves the best utility under all settings on CiteSeer and PubMed.

Table 3: Node classification performance (Accuracy) under different graphless client ratios on PubMed and Flickr. Bold and underlined values indicate best and second-best mean performances, respectively.

Dataset | |C1| : |C2| | Fed-MLP | Fed-GNNk | FedGLS
PubMed  | 12 : 4     | 0.8398  | 0.8465   | 0.8527
PubMed  | 8 : 8      | 0.8398  | 0.8426   | 0.8491
PubMed  | 4 : 12     | 0.8398  | 0.8351   | 0.8422
Flickr  | 12 : 8     | 0.4649  | 0.4831   | 0.4982
Flickr  | 10 : 10    | 0.4649  | 0.4796   | 0.4937
Flickr  | 8 : 12     | 0.4649  | 0.4687   | 0.4790

5.3.2 Graphless Client Ratios

We also consider varying ratios of graphless clients in C. In this part, we conduct experiments on PubMed and Flickr with different numbers of graphless clients. Table 3 shows the results of Fed-MLP, Fed-GNNk, and FedGLS on PubMed and Flickr.
Note that the performance of Fed-MLP remains constant across ratios since it does not require structure information for training. From the table, we observe that FedGLS consistently achieves better utility than Fed-GNNk and Fed-MLP. In the meantime, FedGLS and Fed-GNNk show performance degradation when there are more graphless clients, due to the reduced structure information in the system. For instance, Fed-GNNk performs worse than Fed-MLP on PubMed when |C1| : |C2| = 4 : 12.

6 Conclusion

In this paper, we study a novel problem of FGL with graphless clients. To tackle this problem, we propose a principled framework FedGLS, which deploys a local graph learner on each graphless client to learn graph structures with the structure knowledge transferred from other clients. To enable structure knowledge transfer, we design a GNN model and a feature encoder in FedGLS; they jointly retain structure knowledge via knowledge distillation, and the structure knowledge is transferred among clients during the global update. Extensive experiments on five real-world datasets show the effectiveness of the proposed FedGLS. Although this study proposes a novel research problem in FGL, the proposed FedGLS has some limitations. For example, FedGLS may not recover the underlying graph structures on graphless clients since it produces a fixed number of k neighbors for each node; the generated graph structures may therefore not match the unknown real-world structure information. We will work on designing frameworks that obtain graph structures consistent with real-world structure information. In addition, graphless clients with heterogeneous graphs may introduce more challenges and are worth exploring in the future.
Acknowledgments

This work is supported in part by the National Science Foundation under grants (IIS-2006844, IIS-2144209, IIS-2223769, IIS-2331315, CNS-2154962, BCS-2228534, and CMMI-2411248) and the Commonwealth Cyber Initiative Awards under grants (VV-1Q24-011, VV-1Q25-004).

References

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020a.

Yu Chen, Lingfei Wu, and Mohammed Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In Advances in Neural Information Processing Systems, 2020b.

Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In International Conference on Machine Learning, 2019.

Xingbo Fu, Binchi Zhang, Yushun Dong, Chen Chen, and Jundong Li. Federated graph machine learning: A survey of concepts, techniques, and applications. ACM SIGKDD Explorations Newsletter, 2022.

Xingbo Fu, Chen Chen, Yushun Dong, Anil Vullikanti, Eili Klein, Gregory Madden, and Jundong Li. Spatial-temporal networks for antibiogram pattern prediction. In 2023 IEEE 11th International Conference on Healthcare Informatics, 2023.

Xingbo Fu, Zihan Chen, Binchi Zhang, Chen Chen, and Jundong Li. Federated graph learning with structure proxy alignment. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024.

Spyros Gidaris and Nikos Komodakis. Generating classification weights with GNN denoising autoencoders for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

Yilin He, Chaojie Wang, Hao Zhang, Bo Chen, and Mingyuan Zhou.
A variational edge partition model for supervised graph representation learning. In Advances in Neural Information Processing Systems, 2022.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020.

Wenke Huang, Guancheng Wan, Mang Ye, and Bo Du. Federated graph semantic and structural learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, 2023.

Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2020.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 2020a.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020b.

Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In Proceedings of the ACM Web Conference, 2022.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 2017.

Xutong Mu, Yulong Shen, Ke Cheng, Xueli Geng, Jiaxuan Fu, Tao Zhang, and Zhiwei Zhang. FedProc: Prototypical contrastive federated learning on non-IID data. Future Generation Computer Systems, 2023.

Hao Peng, Haoran Li, Yangqiu Song, Vincent Zheng, and Jianxin Li. Differentially private federated knowledge graphs embedding. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 2008.

Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, 2021.

Qiaoyu Tan, Xin Zhang, Ninghao Liu, Daochen Zha, Li Li, Rui Chen, Soo-Hyun Choi, and Xia Hu. Bring your own view: Graph neural networks for link prediction with personalized subgraph selection. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023a.

Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. FedProto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.

Yue Tan, Yixin Liu, Guodong Long, Jing Jiang, Qinghua Lu, and Chengqi Zhang. Federated learning on non-IID graphs via structural knowledge sharing. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023b.

Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation (GDPR): A Practical Guide, 1st Ed. Cham: Springer International Publishing, 2017.
Jiaqi Wang, Yuzhong Chen, Yuhang Wu, Mahashweta Das, Hao Yang, and Fenglong Ma. Rethinking personalized federated learning with clustering-based dynamic graph propagation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2024a.

Song Wang, Xingbo Fu, Kaize Ding, Chen Chen, Huiyuan Chen, and Jundong Li. Federated few-shot learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023.

Song Wang, Yushun Dong, Binchi Zhang, Zihan Chen, Xingbo Fu, Yinhan He, Cong Shen, Chuxu Zhang, Nitesh V Chawla, and Jundong Li. Safety in graph machine learning: Threats and safeguards. arXiv preprint arXiv:2405.11034, 2024b.

Yebo Wu, Li Li, Chunlin Tian, Tao Chang, Chi Lin, Cong Wang, and Cheng-Zhong Xu. Heterogeneity-aware memory efficient federated learning via progressive layer freezing. In 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 2024a.

Yebo Wu, Li Li, Chunlin Tian, and Chengzhong Xu. Breaking the memory wall for heterogeneous federated learning with progressive training. arXiv preprint arXiv:2404.13349, 2024b.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-IID graphs. In Advances in Neural Information Processing Systems, 2021.

Han Xie, Li Xiong, and Carl Yang. Federated node classification over graphs with latent link-type heterogeneity. In Proceedings of the ACM Web Conference, 2023.

Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 2019.

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna.
GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representations, 2020.

Binchi Zhang, Minnan Luo, Shangbin Feng, Ziqi Liu, Jun Zhou, and Qinghua Zheng. PPSGCN: A privacy-preserving subgraph sampling based distributed GCN training method. arXiv preprint arXiv:2110.12906, 2021a.

Ke Zhang, Carl Yang, Xiaoxiao Li, Lichao Sun, and Siu Ming Yiu. Subgraph federated learning with missing neighbor generation. In Advances in Neural Information Processing Systems, 2021b.

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018.

Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. Graph-less neural networks: Teaching old MLPs new tricks via distillation. In International Conference on Learning Representations, 2022.

Jun Zhou, Chaochao Chen, Longfei Zheng, Huiwen Wu, Jia Wu, Xiaolin Zheng, Bingzhe Wu, Ziqi Liu, and Li Wang. Vertically federated graph neural network for privacy-preserving node classification. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, 2022.

Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 2021.

A Practical Scenarios with Graphless Clients

In this section, we provide more real-world scenarios where the proposed problem arises.

Graphless clients do not record edge information. In practice, some clients in an FGL system may downplay the importance of edge information and therefore choose not to record it. Consider a healthcare system with multiple hospitals, where we may construct local graphs in each hospital by taking patient demographics as node features and co-staying in a ward as edges. In the early stage of a pandemic, however, some hospitals (i.e., graphless clients) may not take co-staying information into account and fail to record it.
Since the co-staying information is crucial for predicting a patient's risk of contracting a contagious disease, these hospitals cannot make accurate predictions based solely on patient demographics. Instead, our proposed FedGLS can help them generate proper graph structures with structure knowledge transferred from other hospitals, enhancing contagion risk prediction at these hospitals.

Graphless clients have not yet generated edge information. In the real world, some clients in an FGL system may initially have only node features, with edge information generated later. This scenario is common in dynamic graphs. Consider a financial system with multiple banks, where each bank has a local dataset of its customers in a city, such as their demographics. A bank that has operated in a city for years also stores transaction records among its customers, and the customers naturally form a customer graph where edges represent the transaction records. Nevertheless, a new bank (i.e., a graphless client) in another city may have few or even no transaction records and thus fail to form a customer graph. Given the abundant information in transaction records, the new bank will benefit from training a model for financial lending with proper graph structures.

Graphless clients have contaminated edge information. The edge information on some clients in an FGL system may be contaminated by malicious attackers. Consider multiple e-commerce companies that aim to jointly train a model for product rating prediction, where nodes are products and edges connect products that are frequently bought together. For some companies (i.e., graphless clients), the edge information may be destroyed or manipulated by hackers and cannot be used for model training. Without the informative edge information, it is difficult to predict the rating of a product from its node features (e.g., word embeddings).
B Module Design of the Graph Learner

Graph learners in FedGLS aim to learn graph structures (i.e., adjacency matrices) on the clients in C2 to facilitate GCN training for node classification. Typically, a graph learner on each graphless client consists of a graph generator Gen(·) and a non-parametric adjacency processor Ω(·). For simplicity, we omit the client index of the notations in this section.

B.1 Graph Generator

Most existing graph generators produce a matrix S either by direct approaches (treating the elements in S as independent parameters) (Franceschi et al., 2019; Jin et al., 2020) or neural approaches (computing S through an encoder) (Chen et al., 2020b; Liu et al., 2022). Since direct approaches are difficult to train (Zhu et al., 2021), we consider neural approaches in this paper. Neural approaches take node features as input and produce the matrix S. Specifically, we formulate a neural network-based graph generator Gen(·) on each graphless client in C2 as

S = Gen(X; ω) = Φ(Enc(X; ω)),    (13)

where ω denotes the parameters in the encoder Enc(·) and Φ(·) is a non-parametric metric function (e.g., cosine similarity, Euclidean distance, or inner product). Here we mainly consider two specific instances of neural network-based graph generators, i.e., MLP encoders and attentive encoders. An L-layer MLP encoder employs an MLP to produce node representations by

R^(l) = Enc^(l)(R^(l-1)) = σ(R^(l-1) W^(l))    (14)

for l = 1, 2, ..., L, where R^(l) denotes the node representations after the l-th layer of Enc(·) and R^(0) is the node feature matrix X. W^(l) is the weight matrix of the l-th layer in ω, and σ(·) is an activation function. In an attentive encoder, each layer computes the Hadamard product of the input vector and its parameters. Specifically, an L-layer attentive encoder on each client in C2 can be written as

R^(l) = Enc^(l)(R^(l-1)) = σ([r_1^(l-1) ⊙ w^(l), ..., r_n^(l-1) ⊙ w^(l)]^⊤)    (15)

for l = 1, 2, ..., L, where r_i^(l-1) is the transpose of the i-th row vector in R^(l-1).
⊙ denotes the Hadamard product, and w^(l) is the weight vector of the l-th layer in ω.

B.2 Adjacency Processor

The generated matrix S measures the similarities between node features. Under S, the nodes V form a fully connected graph, which is not only computationally expensive but may also introduce noise. Furthermore, S may contain both positive and negative values, while an adjacency matrix should typically be non-negative. Therefore, we deploy an adjacency processor Ω(·) to refine the generated matrix S before taking it as the input of GNNs. The goal of the adjacency processor Ω(·) is to obtain a sparse, symmetric, normalized adjacency matrix with non-negative elements. Typically, the adjacency processor includes three main operations: sparsification, symmetrization, and normalization.

B.2.1 Sparsification

To obtain a sparse adjacency matrix, we apply kNN-based sparsification to S. Specifically, the sparsification operation Sp(s_i; k) produces a sparse adjacency matrix S^(sp) by masking off (i.e., setting to zero) those elements in the row s_i which are not in the set of its top k values:

S^(sp)_ij = Sp(s_i; k) = { S_ij, if S_ij ∈ TopK(s_i, k); 0, otherwise },    (16)

where TopK(s_i, k) selects the highest k values in s_i.

Table 4: The statistics and basic information about the five datasets adopted for our experiments.

Dataset     | Clients | Node Features | Average Nodes | Average Edges | Classes
Cora        | 8       | 1,433         | 192           | 695           | 7
CiteSeer    | 8       | 3,703         | 149           | 535           | 6
PubMed      | 16      | 500           | 1,079         | 4,367         | 3
Flickr      | 20      | 500           | 4,441         | 28,663        | 7
ogbn-arxiv  | 16      | 128           | 8,948         | 50,397        | 40

B.2.2 Symmetrization

In practice, undirected graphs typically have a symmetric adjacency matrix with non-negative values. Considering this, we conduct a symmetrization operation by

S^(sym) = Sym(S^(sp)) = σ(S^(sp)) + σ(S^(sp))^⊤,    (17)

where σ is a nonlinear activation function.
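The attentive encoder of Eq. (15), the top-k sparsification of Eq. (16), and the symmetrization step can be sketched end-to-end as follows; the tanh/ReLU activations and the use of cosine similarity as the metric Φ are illustrative assumptions, not choices fixed by the paper:

```python
import numpy as np

def attentive_encoder(X, weights, act=np.tanh):
    """Attentive encoder: each layer takes the Hadamard product of every row
    with a learnable weight vector, followed by an activation (Eq. 15)."""
    R = X
    for w in weights:          # w matches the feature dimension of R
        R = act(R * w)         # elementwise (Hadamard) product, broadcast over rows
    return R

def sparsify_topk(S, k):
    """kNN-style sparsification: keep only the top-k entries per row (Eq. 16)."""
    S_sp = np.zeros_like(S)
    top_k = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(S.shape[0]), k)
    S_sp[rows, top_k.ravel()] = S[rows, top_k.ravel()]
    return S_sp

def symmetrize(S_sp, act=lambda M: np.maximum(M, 0.0)):
    """Symmetrization with a non-negative activation (ReLU assumed as sigma)."""
    return act(S_sp) + act(S_sp).T

# Illustrative pass of the graph learner on toy node features.
rng = np.random.default_rng(0)
X = rng.random((6, 4))
R = attentive_encoder(X, weights=[rng.random(4), rng.random(4)])
Rn = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
S = Rn @ Rn.T                  # Phi: cosine similarity as the metric function
S_sym = symmetrize(sparsify_topk(S, k=2))
```

The result is sparse, symmetric, and non-negative, ready for the normalization step described next.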
B.2.3 Normalization

Once we obtain S^(sym), we normalize it by computing its degree matrix D^(sym) and multiplying (D^(sym))^(-1/2) on the left and right:

Ŝ = Norm(S^(sym)) = (D^(sym))^(-1/2) S^(sym) (D^(sym))^(-1/2).    (18)

C Complexity Analysis

In the federated setting, clients may not be equipped with powerful machines for model training, so the training cost becomes a major concern during collaborative training. Since FedGLS is agnostic to the choice of graph learner and graph learners are updated only once per round, we mainly focus on analyzing the computational complexity of training the GNN model and the feature encoder (e.g., an MLP) in FedGLS. As we analyze the computational complexity during local training within a client, we omit the client index of the notations for simplicity in this subsection. We take the training of a 2-layer GCN as the GNN model and a 2-layer MLP as the feature encoder as an example. Their parameters are trained over a graph G = (V, E, X) with n = |V| nodes on a client. Here we assume that both the GCN and the MLP have the same hidden size m. Typically, the GCN produces Z ∈ R^(n×m) by

Z = Â σ(Â X W_1^θ) W_2^θ,    (19)

where θ = {W_1^θ, W_2^θ} and σ(·) is a nonlinear activation function. The feature encoder produces H ∈ R^(n×m) by

H = σ(X W_1^φ) W_2^φ,    (20)

where φ = {W_1^φ, W_2^φ}. The time complexity of the 2-layer GCN is O(2nm² + 2n²m) for the forward and backward passes combined; hence, the overall time complexity of the GCN is O(nm² + n²m). Similarly, the time complexity of the MLP is O(nm²). Typically, n is significantly larger than m (e.g., 1,079 vs. 16 on PubMed), so the time complexity of the MLP is much smaller than that of the GCN. As a result, the feature encoder in FedGLS does not introduce significant extra computational cost compared with baselines such as Fed-GNNk. To reduce the training cost of the feature encoder, we may choose to use smaller MLPs or model compression.
It can not only reduce the computational complexity of the feature encoder but also consume fewer communication resources during aggregation. We leave this exploration for future work.

In this study, we synthesize the distributed graph data from five real-world datasets, i.e., Cora (Sen et al., 2008), CiteSeer (Sen et al., 2008), PubMed (Sen et al., 2008), Flickr (Zeng et al., 2020), and ogbn-arxiv (Hu et al., 2020), by splitting each of them into multiple communities. Each community is regarded as an entire graph on a client. Table 4 summarizes the statistics and basic information about the five datasets.
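The community-based splitting described above can be sketched as follows. The helper names are hypothetical; we assume the node-to-community assignment comes from a community detection method such as the Louvain method (Blondel et al., 2008), and that cross-community edges are dropped so that each community is a self-contained client graph:

```python
from collections import defaultdict

def build_client_graphs(edges, features, community_of):
    """Turn one global graph into per-client local graphs, one per community.

    `community_of` maps each node to a community id (e.g., from Louvain).
    Cross-community edges are dropped (an assumption of this sketch), and
    each community becomes the entire graph held by one client.
    """
    client_nodes = defaultdict(list)
    for v, c in community_of.items():
        client_nodes[c].append(v)
    client_edges = defaultdict(list)
    for u, v in edges:
        if community_of[u] == community_of[v]:   # keep intra-community edges only
            client_edges[community_of[u]].append((u, v))
    return {
        c: {"nodes": nodes,
            "edges": client_edges[c],
            "X": {v: features[v] for v in nodes}}
        for c, nodes in client_nodes.items()
    }

# Toy example: a 5-node path split into two communities.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
features = {v: [float(v)] for v in range(5)}
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
clients = build_client_graphs(edges, features, community_of)
```

In this toy example, the cross-community edge (2, 3) disappears, mirroring how community-based splitting isolates each client's local graph.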