Published as a conference paper at ICLR 2022

SPACE-TIME GRAPH NEURAL NETWORKS

Samar Hadou, Charilaos I. Kanatsoulis & Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania
{selaraby, kanac, aribeiro}@seas.upenn.edu

ABSTRACT

We introduce the space-time graph neural network (ST-GNN), a novel GNN architecture tailored to jointly process the underlying space-time topology of time-varying network data. The cornerstone of our proposed architecture is the composition of time and graph convolutional filters followed by pointwise nonlinear activation functions. We introduce a generic definition of convolution operators that mimic the diffusion process of signals over their underlying support. On top of this definition, we propose space-time graph convolutions that are built upon a composition of time and graph shift operators. We prove that ST-GNNs with multivariate integral Lipschitz filters are stable to small perturbations in the underlying graphs as well as to small perturbations in the time domain caused by time warping. Our analysis shows that small variations in the network topology and in the time evolution of a system do not significantly affect the performance of ST-GNNs. Numerical experiments with decentralized control systems showcase the effectiveness and stability of the proposed ST-GNNs.

1 INTRODUCTION

Graph Neural Networks (GNNs) are powerful convolutional architectures designed for network data. GNNs inherit the favorable properties of convolutional neural networks (CNNs) while also exploiting the graph structure. An important feature of GNNs, germane to their success, is that the number of learnable parameters is independent of the size of the underlying networks. GNNs have manifested remarkable performance in a plethora of applications, e.g., recommendation systems (Ying et al., 2018; Wu et al., 2020), drug discovery and biology (Gainza et al., 2020; Strokach et al., 2020; Wu et al., 2021; Jiang et al., 2021), and resource allocation in autonomous systems (Lima et al., 2020; Cranmer et al., 2021), to name a few.

Recently, there has been an increased interest in time-varying network data, as they appear in various systems and carry valuable dynamical information. This interest is most prominent in applications such as decentralized controllers (Tolstaya et al., 2020; Gama et al., 2020b; Yang and Matni, 2021; Gama and Sojoudi, 2021), traffic-flow forecasting (Yu et al., 2018; Li et al., 2018; Fang et al., 2021), and skeleton-based action detection (Yan et al., 2018; Cheng et al., 2020; Pan et al., 2021). The state-of-the-art (SOTA) usually deploys an additional architecture alongside a GNN, so that the latter learns patterns in the graph domain while the former operates on the time sequences. One choice is a CNN, as in (Li et al., 2020; Isufi and Mazzola, 2021; Wang et al., 2021), or a recurrent neural network (RNN), as in (Seo et al., 2018; Nicolicioiu et al., 2019; Ruiz et al., 2020). However, these joint architectures operate in a centralized manner, in the sense that up-to-date data from all nodes are available at any given time. While this is well suited for, but not limited to, social networks and recommendation systems, many physical-network applications are decentralized in nature and suffer from time delays in delivering messages. In this paper, we close this gap by developing a causal space-time convolutional architecture that jointly processes the underlying graph-time structure.
That is, the convolutional layers preserve time delays in message passing. Our work is motivated by the following question. Is it possible to transfer learning between signals and datasets defined over different underlying space-time structures? This is a well-motivated question since, in practice, we execute these architectures on graphs that are different from the graphs used in training, and signals are sampled at different sampling rates between training and execution. The answer to the above question was provided for the case of static graph signals in (Gama et al., 2020a), where the stability of traditional GNNs to graph perturbations was proved. In this work we give an affirmative answer to the above question in the case where time-varying graph signals are considered and space-time convolutional architectures are employed.

The contribution of this paper is twofold. First, we introduce a novel convolutional architecture for time-varying graph signals, and second, we prove its stability. Specifically, we provide a general definition of convolutions for any arbitrary shift operator and define a space-time shift operator (STSO) as the linear composition of the graph shift operator (GSO) and the time-shift operator (TSO). We then introduce space-time graph neural networks (ST-GNNs), a cascade of layers that consist of space-time graph filters followed by pointwise nonlinear activation functions. The proposed ST-GNN allows processing continuous-time graph signals, which is pivotal in the stability analysis. Furthermore, we study the effect of relative perturbations on ST-GNNs and prove that small variations in the graph and/or irregularities in the sampling process of time-varying graph signals do not substantially affect the performance of the proposed ST-GNN architecture. Our theoretical findings are also supported by thorough experimental analysis based on decentralized control applications.

The rest of this paper is structured as follows. Related work is summarized in Section 2. Sections 3 and 4 present our contributions listed above. Numerical experiments and conclusions are presented in Sections 5 and 6, respectively. The proofs and extended experiments are provided in the appendices.

Notation: Bold lowercase and uppercase symbols, i.e., x and X, denote vectors and matrices, respectively. Calligraphic letters mainly represent time-varying graph signals unless otherwise stated.

2 RELATED WORK

GNNs for Time-varying Graph Signals. One of the early architectures was (Yan et al., 2018), which introduced a convolutional filter that aggregates information only from 1-hop neighbors. Relying on product graphs, (Isufi and Mazzola, 2021) introduced a convolutional architecture where each node has access to the present and past data of the other nodes in the graph. A similar convolutional layer was studied in (Pan et al., 2021; Loukas and Foucard, 2016), and while restricted to a domain that preserves time delays, (Pan et al., 2021) shows that it is not necessarily less expressive. Meanwhile, (Wang et al., 2021) applies a GNN to the graph signal at each time instant separately, and then performs a temporal convolution on the GNN outputs to capture the evolution of the graph embeddings. Graph RNNs are another family of architectures developed to deal with time-varying graph signals. For example, (Pareja et al., 2020) uses an RNN to evolve the GNN parameters over time.
Moreover, (Hajiramezanali et al., 2019) combined the GRNN with a variational graph autoencoder (VGAE) to improve the former's expressive power. However, none of these architectures takes into account the physical restrictions, in the form of time delays, associated with decentralized applications. Similar to our work, other architectures considered the diffusion equation to form message-passing layers, e.g., (Xhonneux et al., 2020; Poli et al., 2021; Fang et al., 2021). These architectures parameterize the dynamics of graph signals with GNNs. In simple words, they learn a parameterization that helps find the current state variables from the previous ones. The architecture in (Chamberlain et al., 2021) is another example, but it learns the graph weights instead of a limited number of filter coefficients (i.e., it resembles the parameterization of graph attention networks in (Velickovic et al., 2018)).

Stability. Deformation stability of CNNs was studied in (Bruna and Mallat, 2013; Bietti and Mairal, 2019). The notion of stability was then introduced to graph scattering transforms in (Gama et al., 2019; Zou and Lerman, 2020). In a following work, Gama et al. (2020a) presented a study of GNN stability to absolute and relative graph perturbations. Graphon neural networks were also analyzed in terms of their stability in (Ruiz et al., 2021). Moreover, (Pan et al., 2021) proves the stability of space-time graph scattering transforms to absolute perturbations.

3 SPACE-TIME GRAPH NEURAL NETWORKS

In this section, we present the proposed ST-GNN architecture for time-varying graph signals. First, we provide a general definition of convolutions, and then we develop the space-time graph filters, which are the cornerstone of the ST-GNN architecture.

Our analysis starts with the homogeneous diffusion equation, which is defined with respect to the Laplacian differential operator $\mathcal{L}$,

$$\frac{\partial \mathcal{X}(t)}{\partial t} = -\mathcal{L}\,\mathcal{X}(t). \quad (1)$$

Equation (1) describes processes of signals that evolve across the diffusion dimension, $t$. The diffused signal, $\mathcal{X}(t)$, is modeled as an abstract vector in a vector space $\mathcal{V}$. The vector space is associated with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{V}}$ and is closed under addition and scalar multiplication, e.g., $n$-dimensional vector spaces and function spaces. The solution of (1) is $\mathcal{X}(t) = e^{-t\mathcal{L}}\mathcal{X}(0) = e^{-t\mathcal{L}}\mathcal{X}_0 = e^{-t\mathcal{L}}\mathcal{X}$ (the subscript is omitted in what follows), and it describes the signal after time $t$ from the initial state.

3.1 CONVOLUTION

With the diffusion equation in our hands, we can now define the convolution operation as the linear combination of the diffusion sequence with a filter $h(t)$.

Definition 1 (Convolution Operator). For a linear shift-invariant filter with coefficients $h(t)$, $t \geq 0$, the convolution between the filter and an input signal $\mathcal{X} \in \mathcal{V}$, with respect to a linear differential operator $\mathcal{L} : \mathcal{V} \to \mathcal{V}$, is defined as

$$\mathcal{X} *_{\mathcal{D}} h = \int_0^{\infty} h(t)\, e^{-t\mathcal{L}} \mathcal{X}\, dt, \quad (2)$$

where $*_{\mathcal{D}}$ denotes the convolution operator applied to signals with underlying structure $\mathcal{D}$.

Definition 1 is a valid convolution operator. With abuse of notation, we refer to $\mathcal{L}$ as the shift operator, but it should be understood that the shift operation is executed using $e^{-\mathcal{L}}$. Definition 1 establishes a generalized form of convolutions for a wide variety of shift operators and signals. Next, we discuss two convolution types of practical interest that follow directly from (2), i.e., time convolutions and graph convolutions.
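Before specializing (2), the following minimal sketch approximates its integral by a Riemann sum for a generic shift operator. The cycle-graph Laplacian, the decaying filter h(t) = exp(-2t), and the name `generalized_conv` are illustrative assumptions, not choices made in the paper.

```python
import numpy as np
from scipy.linalg import expm

def generalized_conv(h, L, x, dt=0.05, t_max=5.0):
    """Riemann-sum approximation of x *_D h = int_0^inf h(t) exp(-t L) x dt, cf. (2)."""
    y = np.zeros_like(x, dtype=float)
    for t in np.arange(0.0, t_max, dt):
        y += h(t) * (expm(-t * L) @ x) * dt   # diffuse x for time t, weight by h(t)
    return y

# Example: exponentially decaying filter applied on a 6-node cycle graph.
N = 6
A = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)   # cycle adjacency
L = np.diag(A.sum(axis=1)) - A                                        # graph Laplacian
x = np.random.randn(N)                                                # graph signal
y = generalized_conv(lambda t: np.exp(-2.0 * t), L, x)
```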
Time Convolutions: They involve continuous-time signals, denoted by $x(\tau)$, that are elements of square-integrable function spaces, i.e., $x(\tau) \in L^2(\mathbb{R})$. The underlying structure of $x(\tau)$ is the real line, since the time variable $\tau \in \mathbb{R}$. To recover the traditional definition of the time convolution, the time shift operator (TSO) is chosen as the differential operator $\mathcal{L}_\tau = \partial/\partial\tau$. One can show that $e^{-t\,\partial/\partial\tau} x(\tau) = x(\tau - t)$ using the Taylor series, $e^{-t\,\partial/\partial\tau} x(\tau) = \sum_{n=0}^{\infty} \frac{(-t)^n}{n!} \frac{\partial^n x(\tau)}{\partial \tau^n} = x(u)$, with $u = \tau - t$. In other words, the TSO performs a translation. Then (2) reduces to

$$x(\tau) *_{\mathbb{R}} h(\tau) = \int_0^{\infty} h(t)\, e^{-t\,\partial/\partial\tau} x(\tau)\, dt = \int_0^{\infty} h(t)\, x(\tau - t)\, dt, \quad (3)$$

which is indeed the traditional definition of the convolution between time signals. It is also worth noting that $e^{j\omega\tau}$, $\omega \in [0, \infty)$, are eigenfunctions of the TSO $\mathcal{L}_\tau = \partial/\partial\tau$ with associated eigenvalues $j\omega$. This observation is pivotal in the stability analysis of Section 4.

Graph Convolutions: They involve graph signals, $\mathbf{x} \in L^2(\mathbb{R}^N)$, that are $N$-dimensional square-summable vectors containing information associated with $N$ different nodes. The underlying structure of $\mathbf{x}$ is represented by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$, where $\mathcal{V}$ is the set of nodes with cardinality $|\mathcal{V}| = N$, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, and $\mathcal{W} : \mathcal{E} \to \mathbb{R}$ is a map assigning weights to the edges (Shuman et al., 2012). The graph shift operator (GSO), $S \in \mathbb{R}^{N \times N}$, is a matrix representation of the graph sparsity, e.g., the graph adjacency or Laplacian matrix (Ortega et al., 2018). Applying $S$ to a signal $\mathbf{x}$ results in a signal $S\mathbf{x}$, where $[S\mathbf{x}]_i$ aggregates data from all nodes $j$ that satisfy $(i, j) \in \mathcal{E}$ or $(j, i) \in \mathcal{E}$. We focus on undirected graphs, and therefore $S$ is a real symmetric and diagonalizable matrix with a set of $N$ eigenvectors $\{\mathbf{v}_i\}_{i=1}^N$, each associated with a real eigenvalue $\lambda_i$. The GSO is a linear operator $S : L^2(\mathbb{R}^N) \to L^2(\mathbb{R}^N)$ that deals with discrete signals. Therefore, it is more common to discretize the diffusion process over graphs at times $kT_s$, $k \in \mathbb{Z}_+$, where $T_s$ is the sampling period. The convolution is then defined as

$$\mathbf{x} *_{\mathcal{G}} h = \sum_{k=0}^{K-1} h_k\, e^{-k T_s S}\, \mathbf{x}, \quad (4)$$

where $\{h_k\}_k$ are the filter coefficients and $K$ is the number of filter taps. The finite number of taps is a direct consequence of the Cayley-Hamilton theorem, and $h$ is a finite-impulse response (FIR) filter. Equation (4) generalizes graph convolutions in the literature, where convolutions are defined as polynomials in the shift operator, i.e., $\mathbf{x} *_{\mathcal{G}} h = \sum_{k=0}^{K-1} h_k S^k \mathbf{x}$ (Sandryhaila and Moura, 2013). This definition executes the shift operation using the matrix $S$ directly, compared to $e^{-S}$ in (4).
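A minimal sketch of the two graph convolutions just contrasted, assuming a symmetric GSO S stored as a NumPy array and filter taps h = [h_0, ..., h_{K-1}]; the helper names and the sampling period Ts = 0.1 are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

def graph_conv_exp(h, S, x, Ts=0.1):
    """Exponential-shift graph convolution x *_G h = sum_k h_k exp(-k Ts S) x, cf. (4)."""
    return sum(hk * (expm(-k * Ts * S) @ x) for k, hk in enumerate(h))

def graph_conv_poly(h, S, x):
    """Classical polynomial graph convolution sum_k h_k S^k x (Sandryhaila and Moura, 2013)."""
    y, Skx = np.zeros_like(x, dtype=float), x.astype(float)
    for hk in h:
        y += hk * Skx
        Skx = S @ Skx          # apply the shift once more for the next tap
    return y
```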
3.2 SPACE-TIME CONVOLUTION AND GRAPH FILTERS

The generalized definition of convolution in Definition 1, as well as the definitions of time and graph convolutions in (3) and (4), respectively, set the ground for the space-time convolution. Space-time convolutions involve signals that are time-varying and are also supported on a graph. They are represented by $\mathcal{X} \in L^2(\mathbb{R}^N) \otimes L^2(\mathbb{R})$, where $\otimes$ is the tensor product [see (Kadison and Ringrose, 1983; Grillet, 2007)] between the vector spaces of graph signals and continuous-time signals. In the following, calligraphic symbols (except $\mathcal{L}$) refer to time-varying graph signals unless otherwise stated. The space-time shift operator (STSO) is the linear operator that jointly shifts the signals over the space-time underlying support. Therefore, we choose the STSO to be the linear composition of the GSO and the TSO, i.e., $\mathcal{L} = S \oplus \mathcal{L}_\tau$.

The convolution operator in Definition 1 can be specified with respect to the STSO for time-varying graph signals as

$$\mathcal{X} *_{\mathcal{G}\times\mathbb{R}} h = \int_0^{\infty} h(t)\, e^{-t (S \oplus \mathcal{L}_\tau)} \mathcal{X}\, dt =: H(S, \mathcal{L}_\tau)\,\mathcal{X}, \quad (5)$$

where $H(S, \mathcal{L}_\tau) = \int_0^{\infty} h(t)\, e^{-t (S \oplus \mathcal{L}_\tau)}\, dt$ represents the space-time graph filter. The frequency response of the filter is then derived as [see Appendix A]

$$h(\lambda, j\omega) = \int_0^{\infty} h(t)\, e^{-t(\lambda + j\omega)}\, dt. \quad (6)$$

Remark 1. The space-time graph filters can be implemented distributively. For each node, the operator $e^{-t(S \oplus \mathcal{L}_\tau)}$ aggregates data from the $t$-hop neighbors after they are shifted in time. This process can be done locally after each node acquires information from its neighbors. The time shift $t$ represents the time required for the data to arrive from the $t$-hop neighbors.

3.3 SPACE-TIME GRAPH NEURAL NETWORKS

The natural extension of our previous analysis is to define the convolutional layer of ST-GNNs based on the space-time graph filters in (5). However, learning infinite-impulse response (IIR) filters is impractical, and thus we employ FIR space-time graph filters, defined as $H_d(S, \mathcal{L}_\tau) = \sum_{k=0}^{K-1} h_k\, e^{-k T_s (S \oplus \mathcal{L}_\tau)}$. The ST-GNN architecture employs a cascade of $L$ layers, each of which consists of a bank of space-time graph filters followed by a pointwise nonlinear activation function, denoted by $\sigma$. The input to layer $\ell$ is the output of the previous layer, $\mathcal{X}_{\ell-1}$, and the $f$th feature at the output of the $\ell$th layer is written as

$$\mathcal{X}_\ell^f = \sigma\!\left( \sum_{g=1}^{F_{\ell-1}} \sum_{k=0}^{K-1} h^{fg}_{k\ell}\, e^{-k T_s (S \oplus \mathcal{L}_\tau)}\, \mathcal{X}^g_{\ell-1} \right), \qquad f = 1, \dots, F_\ell, \quad (7)$$

where the number of features at the output of each layer is denoted by $F_\ell$, $\ell = 1, \dots, L$. A concise representation of the ST-GNN can be written as $\Phi(\mathcal{X}; \mathcal{H}, S \oplus \mathcal{L}_\tau)$, where $\mathcal{H}$ is the set of all learnable parameters $\{h^{fg}_{k\ell}\}_{f,g,k,\ell}$. An interesting remark is that the time-varying graph signals processed in (7) are continuous in time, while the architecture learns only a finite number of parameters.

Remark 2. ST-GNNs are causal architectures; that is, the architecture does not allow the nodes to process data that are still unavailable to them. This is achieved by delaying the data coming from the $k$-hop neighbors by $k$ time steps [cf. (7)]. Therefore, each node has access to its own up-to-date data and to outdated data from its neighbors.
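The causal layer in (7) can be implemented with standard linear-algebra routines once the signal is sampled. The following is a minimal sketch for a single input and output feature, not the authors' implementation; the function name, the choice of sigma = tanh, and the use of `scipy.linalg.expm` are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

def st_gnn_layer(h, S, X, Ts=0.1, sigma=np.tanh):
    """One single-feature ST-GNN layer: Y[:, n] = sigma(sum_k h_k exp(-k Ts S) X[:, n-k]), cf. (7)."""
    N, T = X.shape
    shifts = [expm(-k * Ts * S) for k in range(len(h))]   # precomputed graph diffusions
    Y = np.zeros((N, T))
    for n in range(T):
        acc = np.zeros(N)
        for k, hk in enumerate(h):
            if n - k >= 0:                # only past samples enter the sum: causality (Remark 2)
                acc += hk * (shifts[k] @ X[:, n - k])
        Y[:, n] = sigma(acc)
    return Y
```

The k-th tap aggregates data diffused over the graph and delayed by k time steps, which is what keeps the layer implementable in a decentralized setting.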
4 STABILITY TO PERTURBATIONS

In this section, we perform a stability analysis of our proposed ST-GNN architecture. In particular, we characterize the difference between the filters $H(S, \mathcal{L}_\tau)$ and $H(\hat{S}, \hat{\mathcal{L}}_\tau)$, where $\hat{S}$ and $\hat{\mathcal{L}}_\tau$ are perturbed versions of the graph and time shift operators, respectively.

4.1 PERTURBATION MODELS

Before we study the effect of graph and time perturbations on the stability of our proposed ST-GNN, we first need to introduce a perturbation model in space and time. For graph perturbations, we follow the relative perturbation model proposed in (Gama et al., 2020a):

$$P_0^T \hat{S} P_0 = S + SE + ES, \quad (8)$$

where $\hat{S}$ is the perturbed GSO, $E$ is the error matrix, and $P_0$ is a permutation matrix. Taking a closer look at equation (8), we observe that the entries of $S$ and $\hat{S}$ become more dissimilar as the norm $\|E\|$ increases. The dissimilarity is measured by the difference $[P_0^T \hat{S} P_0]_{ij} - [S]_{ij} = [SE]_{ij} + [ES]_{ij}$, where each summand is a weighted sum of the edges connected to nodes $i$ and $j$, respectively. For nodes $i$ and $j$ with high node degrees, the sum includes a large number of nonzero entries of $S$, which leads to perturbations of high volume. Consequently, the considered graph perturbations are relative to the node degrees. It is also worth noticing that equation (8) is invariant to node permutations of the GSO. This is due to the use of $P_0$, which accounts for the node relabelling of perturbed graphs (see Section 4.2).

While there has been plenty of work regarding graph perturbation models, time perturbation in the context of graph neural networks is not well studied. In this paper we fill this gap and propose a practical model for time perturbations. In particular, we study irregularities that appear in the process of sampling continuous-time signals, $x(\tau)$. It is often the case that sampling is imperfect and not exactly equispaced, i.e., the discrete signals are captured at times $kT_s + z(kT_s)$, where $k \in \mathbb{Z}$, $T_s \in \mathbb{R}$ is the sampling period, and $z : \mathbb{R} \to \mathbb{R}$ is a differentiable function. To model this phenomenon, we use time-warping operations. In particular, we consider a perturbed timeline $\hat{\tau} = \tau + z(\tau)$, and we observe the signal $x(\hat{\tau}) = x(\tau + z(\tau))$ instead of $x(\tau)$. The diffusion process in (1) for a continuous-time signal over the perturbed timeline can then be cast as

$$\frac{\partial}{\partial t} x(t, \hat{\tau}) = -\frac{\partial}{\partial \hat{\tau}}\, x(t, \hat{\tau}) \;\;\Longrightarrow\;\; \frac{\partial}{\partial t} x(t, \tau + z(\tau)) = -(1 + z'(\tau))\, \frac{\partial}{\partial \tau}\, x(t, \tau + z(\tau)). \quad (9)$$

Now, let $\xi(\tau) = z'(\tau)$, $\mathcal{L}_\tau = \partial/\partial\tau$, and $\hat{\mathcal{L}}_\tau = \partial/\partial\hat{\tau}$. Then the perturbed TSO takes the form

$$\hat{\mathcal{L}}_\tau = (1 + \xi(\tau))\, \mathcal{L}_\tau. \quad (10)$$

The effect of the time perturbation is controlled by $\xi(\tau)$. As the norm $\|\xi(\tau)\|_2$ grows, the dissimilarity between the original and perturbed time shift operators increases. This shows that the considered time perturbation model is relative to the rate of change of $z(\tau)$: faster changes in $z(\tau)$ produce more significant perturbations (i.e., longer time shifts), whereas smoother choices of $z(\tau)$ yield lower values of $\|\xi(\tau)\|_2$ and lower perturbation levels. The time convolution in (3) with a perturbed TSO becomes

$$x(\tau + z(\tau)) *_{\mathbb{R}} h(\tau) = \int_0^{\infty} h(t)\, e^{-t(1+\xi(\tau))\mathcal{L}_\tau}\, x(\tau + z(\tau))\, dt = \int_0^{\infty} h(t)\, e^{-t(1+\xi(\tau))\mathcal{L}_\tau}\, e^{z(\tau)\mathcal{L}_\tau}\, x(\tau)\, dt, \quad (11)$$

since $x(\tau + z(\tau)) = e^{z(\tau)\mathcal{L}_\tau} x(\tau)$, which models a signal translation $e^{z(\tau)\mathcal{L}_\tau}$ caused by the time perturbation. Moreover, the perturbed TSO $e^{-t(1+\xi(\tau))\mathcal{L}_\tau}$ models a time-varying shift of $t(1 + \xi(\tau))$ when $\xi(\tau)$ is not a constant (which is the case in our analysis).
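To make the two perturbation models concrete, the sketch below builds a relatively perturbed GSO as in (8) and samples a signal on the warped timeline of (10). The particular choices of E and of z (the latter matching the warping used later in the experiments) are illustrative, and the helper names are not from the paper.

```python
import numpy as np

def perturb_gso(S, eps, seed=0):
    """Relative graph perturbation S_hat = S + SE + ES with a symmetric E and ||E||_2 = eps, cf. (8)."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal(S.shape)
    E = (E + E.T) / 2
    E *= eps / np.linalg.norm(E, 2)           # scale the spectral norm of E to eps
    return S + S @ E + E @ S

def warped_samples(x, Ts, T, eps):
    """Samples of x on the warped timeline tau + z(tau), with z(tau) = eps*cos(eps*tau)*exp(-eps*tau)."""
    tau = np.arange(T) * Ts
    z = eps * np.cos(eps * tau) * np.exp(-eps * tau)
    return x(tau + z)
```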
4.2 JOINT OPERATOR-DISTANCE MODULO

In order to proceed with the stability analysis, we need to define the operators used to measure the effect of perturbations. We start with the measure of graph perturbations, for which we utilize the operator distance modulo permutation, $\|\cdot\|_{\mathcal{P}}$, introduced in (Gama et al., 2020a). For any two graph operators $A, \hat{A} : L^2(\mathbb{R}^N) \to L^2(\mathbb{R}^N)$, the distance between them is calculated as

$$\| A - \hat{A} \|_{\mathcal{P}} = \min_{P \in \mathcal{P}} \max_{\mathbf{x} : \|\mathbf{x}\|_2 = 1} \| A\mathbf{x} - P^T \hat{A} P \mathbf{x} \|_2, \quad (12)$$

where $\mathcal{P}$ is the set of all permutation matrices. This distance measures how close the operator $\hat{A}$ is to being a permuted version of $A$. The distance in (12) is minimal (i.e., 0) when the two graphs are permuted versions of each other. Next, we define a new operator distance modulo translation for time operators:

Definition 2 (Operator Distance Modulo Translation). Given linear time operators $B, \hat{B} : L^2(\mathbb{R}) \to L^2(\mathbb{R})$, define the operator distance modulo translation as

$$\| B - \hat{B} \|_{\mathcal{T}} = \min_{s \in \mathbb{R}} \max_{x : \|x\|_2 = 1} \big\| B x(\tau) - (e^{-s\mathcal{L}_\tau} \hat{B})\, x(\tau) \big\|_2. \quad (13)$$

The expression in (13) measures how far or close two time operators are to being translated versions of each other. In the case where $\hat{\mathcal{L}}_\tau = (1 + \xi(\tau))\mathcal{L}_\tau$ as in (10), the distance between $B = e^{-t\mathcal{L}_\tau}$ and $\hat{B} = e^{-t\hat{\mathcal{L}}_\tau}$ reduces to

$$\big\| e^{-t\mathcal{L}_\tau} - e^{-t\hat{\mathcal{L}}_\tau} \big\|_{\mathcal{T}} = \min_{s \in \mathbb{R}} \max_{x : \|x\|_2 = 1} \big\| x(\tau - t) - x(\tau - s - t - t\xi(\tau - s)) \big\|_2. \quad (14)$$

Since we are dealing with time-varying graph signals and joint time and graph perturbations, we combine the two previously defined distances and introduce the operator distance modulo permutation-translation.

Definition 3 (Operator Distance Modulo Permutation-Translation). Given the graph operators $A, \hat{A} : L^2(\mathbb{R}^N) \to L^2(\mathbb{R}^N)$ and the time operators $B, \hat{B} : L^2(\mathbb{R}) \to L^2(\mathbb{R})$, define the operator distance modulo permutation-translation as

$$\| A \otimes B - \hat{A} \otimes \hat{B} \|_{\mathcal{P},\mathcal{T}} = \min_{P \in \mathcal{P}} \min_{s \in \mathbb{R}} \max_{\mathcal{X} : \|\mathcal{X}\|_F = 1} \big\| (A \otimes B)\mathcal{X} - (P^T \hat{A} P) \otimes (e^{-s\mathcal{L}_\tau}\hat{B})\, \mathcal{X} \big\|_F. \quad (15)$$

4.3 STABILITY PROPERTIES OF SPACE-TIME GRAPH FILTERS

The last step before we present our stability results is to discuss integral Lipschitz filters, which are defined as follows.

Definition 4 (Multivariate Integral Lipschitz Filters). Let the frequency response of a multivariate filter be $h(\mathbf{w}) : \mathbb{R}^n \to \mathbb{R}$. The filter is said to be integral Lipschitz if there exists a constant $C > 0$ such that for all $\mathbf{w}_1$ and $\mathbf{w}_2$,

$$| h(\mathbf{w}_2) - h(\mathbf{w}_1) | \leq 2C\, \frac{\| \mathbf{w}_2 - \mathbf{w}_1 \|_2}{\| \mathbf{w}_2 + \mathbf{w}_1 \|_2}. \quad (16)$$

The condition in (16) requires the frequency response of the filter to be Lipschitz over line segments defined by arbitrary $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^n$, with a Lipschitz constant that is inversely proportional to their midpoint. For differentiable filters, the above condition reduces to $|\partial h(\mathbf{w}_1)/\partial \zeta| \leq C\, \frac{1}{\|\mathbf{w}_1\|_2}$ for every $\zeta$ that is a component of $\mathbf{w}_1$. The proposed space-time graph filters admit a frequency response that is bivariate, with variables $\lambda$ and $j\omega$ [cf. (6)]. Following Definition 4, a space-time graph filter $h(\lambda, j\omega)$ is integral Lipschitz if

$$|\lambda + j\omega|\, \left| \frac{\partial h(\lambda, j\omega)}{\partial \zeta} \right| \leq C, \qquad \zeta \in \{\lambda, j\omega\}. \quad (17)$$

Note that the vector $\mathbf{w}_1$ in this case is given by $[\lambda, \omega]^T$ with $\|\mathbf{w}_1\|_2 = |\lambda + j\omega|$. The conditions in (17) indicate that the frequency response of an integral Lipschitz space-time graph filter can vary rapidly at low frequencies close to 0. Therefore, the filter can discriminate between close low-frequency spectral components, but it is flatter at high frequencies, prohibiting discriminability between the spectral features in these bands. Note that when we mention low frequencies in the context of space-time graph filters (or ST-GNNs later), we refer to low values of $\lambda$ and $\omega$, whereas high frequencies correspond to high values of $\lambda$ or $\omega$.

Proposition 1 (Stability to Graph Perturbations). Consider a space-time graph filter $H$ along with graph shift operators $S$ and $\hat{S}$, and a time shift operator $\mathcal{L}_\tau$. If it holds that

(A1) the GSOs are related by $P_0^T \hat{S} P_0 = S + SE + ES$, with $P_0$ being a permutation matrix,
(A2) the error matrix $E$ has norm $\|E\| \leq \epsilon_s$ and an eigenvector misalignment $\delta$ relative to $S$, and
(A3) the filter $H(S, \mathcal{L}_\tau)$ is an integral Lipschitz filter with Lipschitz constant $C > 0$,

then the difference between the space-time graph filters $H(S, \mathcal{L}_\tau)$ and $H(\hat{S}, \mathcal{L}_\tau)$ is characterized by

$$\| H(S, \mathcal{L}_\tau) - H(\hat{S}, \mathcal{L}_\tau) \|_{\mathcal{P}} \leq 2C\epsilon_s \big( 1 + \delta\sqrt{N} \big) + O(\epsilon_s^2), \quad (18)$$

where $\delta = (\|U - V\| + 1)^2 - 1$, and $N$ is the number of nodes in $S$.

Proposition 1 states that space-time graph filters are stable to graph perturbations of order $O(\epsilon_s)$, with a stability constant $2C(1 + \delta\sqrt{N})$. The constant is uniform for all graphs of the same size and depends on the Lipschitz constant of the filter. The bound is proportional to the eigenvector misalignment between the GSO and the error matrix. The misalignment is bounded and does not grow with the graph size, since $\delta \leq (\|U\| + \|V\| + 1)^2 - 1 = 8$, with $U$ and $V$ being unitary matrices. Now, consider a type of perturbation known as graph dilation, where the edge weights are stretched by a value $\epsilon_s$, i.e., the GSO becomes $\hat{S} = (1 + \epsilon_s)S$. In this case, the eigenvalues of the GSO also stretch by $\epsilon_s$, but the eigenvectors are not misaligned, i.e., $\delta = 0$. The bound for graph dilation then reduces to $2C\epsilon_s$ and does not depend on the structure of the underlying graph.
Therefore, it is inferred that the bound in (18) splits into (i) a term that reflects the difference in eigenvalues between $S$ and $\hat{S}$, and (ii) a term that arises from the eigenvector misalignment. Note that the stability constant is not affected by any property of the TSO, and consequently it has the same form as the one derived by Gama et al. (2020a) for traditional graph filters.

Proposition 2 (Stability to Time Perturbations). Consider a space-time graph filter $H$ along with time shift operators $\mathcal{L}_\tau$ and $\hat{\mathcal{L}}_\tau$, and a graph shift operator $S$. If it holds that

(A1) the TSOs are related by $\hat{\mathcal{L}}_\tau = (1 + \xi(\tau))\mathcal{L}_\tau$,
(A2) the error function $\xi(\tau)$ is infinitely differentiable with norm $\|\xi(\tau)\|_2 \leq \kappa\epsilon_\tau$, and the norm of its $m$th-order derivative $\|\xi^{(m)}(\tau)\|_2$ is of order $O(\epsilon_\tau^{m+1})$, where $\kappa$ is an absolute constant, and
(A3) the filter $H(S, \mathcal{L}_\tau)$ is an integral Lipschitz filter with Lipschitz constant $C > 0$,

then the difference between the space-time graph filters $H(S, \mathcal{L}_\tau)$ and $H(S, \hat{\mathcal{L}}_\tau)$ satisfies

$$\| H(S, \mathcal{L}_\tau) - H(S, \hat{\mathcal{L}}_\tau) \|_{\mathcal{T}} \leq C\kappa\epsilon_\tau + O(\epsilon_\tau^2). \quad (19)$$

Similarly to Proposition 1, the filter difference is bounded by a constant proportional to the size of the perturbation, i.e., $\|\xi(\tau)\|_2$, and the bound is also affected by the Lipschitz constant of the filter. The perturbed TSO shares the same eigenfunctions with the original TSO, since $\hat{\mathcal{L}}_\tau e^{j\omega\tau} = j\omega(1 + \xi(\tau)) e^{j\omega\tau}$. Therefore, the difference between the filters arises only from the difference in eigenvalues. It is worth mentioning that assumption (A2) of Proposition 2 is rather reasonable. One example of a time-warping function $z(\tau)$ that satisfies this assumption is $z(\tau) = \sqrt{\epsilon_\tau}\, \cos(\epsilon_\tau \tau)\, e^{-\epsilon_\tau \tau}$. The error function is then $\xi(\tau) = z'(\tau) = -\epsilon_\tau^{3/2} \big(\sin(\epsilon_\tau \tau) + \cos(\epsilon_\tau \tau)\big) e^{-\epsilon_\tau \tau}$, with norm $\sqrt{3/4}\,\epsilon_\tau$. The first derivative of the error function is $\xi'(\tau) = 2\epsilon_\tau^{5/2} \sin(\epsilon_\tau \tau)\, e^{-\epsilon_\tau \tau}$, and its norm is of order $O(\epsilon_\tau^2)$; the order grows in the same fashion with the order of the derivative.

In Theorem 1, we consider joint time and graph perturbations and characterize the difference between the filters under the STSO, $S \oplus \mathcal{L}_\tau$, and its perturbed version $\hat{S} \oplus \hat{\mathcal{L}}_\tau$.

Theorem 1 (Space-Time Graph Filter Stability). Under the assumptions of Propositions 1 and 2, the distance between the space-time graph filters $H(S, \mathcal{L}_\tau)$ and $H(\hat{S}, \hat{\mathcal{L}}_\tau)$ satisfies

$$\| H(S, \mathcal{L}_\tau) - H(\hat{S}, \hat{\mathcal{L}}_\tau) \|_{\mathcal{P},\mathcal{T}} \leq 2C\epsilon_s \big( 1 + \delta\sqrt{N} \big) + C\kappa\epsilon_\tau + O(\epsilon^2), \quad (20)$$

where $\epsilon = \max\{\epsilon_s, \epsilon_\tau\}$.

In order to keep this stability bound small, space-time graph filters should be designed with a small Lipschitz constant $C$. This comes at the cost of filter discriminability at high frequencies [cf. (17)], whereas at low frequencies the integral Lipschitz filters are allowed to vary rapidly. Thus Theorem 1 shows that space-time graph filters are stable and discriminative at low frequencies, but they cannot be both stable and discriminative at higher frequencies.

4.4 STABILITY PROPERTIES OF SPACE-TIME GRAPH NEURAL NETWORKS

Theorem 2 establishes the stability of the proposed ST-GNN architecture.

Theorem 2 (ST-GNN Stability). Consider an $L$-layer ST-GNN $\Phi(\cdot; \mathcal{H}, S \oplus \mathcal{L}_\tau)$ with a single feature per layer. Consider also the GSOs $S$ and $\hat{S}$, and the TSOs $\mathcal{L}_\tau$ and $\hat{\mathcal{L}}_\tau$. If

(A1) the filters at each layer are integral Lipschitz with Lipschitz constant $C > 0$ and have unit operator norm, i.e., $\|H_\ell(S, \mathcal{L}_\tau)\| = 1$ for $\ell = 1, \dots, L$,
(A2) the nonlinearities $\sigma$ are Lipschitz-continuous with Lipschitz constant 1, i.e., $\|\sigma(\mathbf{x}_2) - \sigma(\mathbf{x}_1)\|_2 \leq \|\mathbf{x}_2 - \mathbf{x}_1\|_2$, and
(A3) the GSOs satisfy (A1) and (A2) of Proposition 1, and the TSOs satisfy (A1) and (A2) of Proposition 2,

then

$$\| \Phi(\cdot; \mathcal{H}, S \oplus \mathcal{L}_\tau) - \Phi(\cdot; \mathcal{H}, \hat{S} \oplus \hat{\mathcal{L}}_\tau) \|_{\mathcal{P},\mathcal{T}} \leq 2CL\epsilon_s \big( 1 + \delta\sqrt{N} \big) + CL\kappa\epsilon_\tau + O(\epsilon^2), \quad (21)$$

where $\epsilon = \max\{\epsilon_s, \epsilon_\tau\}$.

Theorem 2 shows that the stability bound of ST-GNNs depends on the number of layers $L$, in addition to the factors affecting the stability of space-time graph filters. Compared to space-time graph filters, ST-GNNs can be both stable and discriminative. This is due to the nonlinearities applied at the end of each layer. The effect of the pointwise nonlinearity is that it demodulates (i.e., spills) the energy of the higher-frequency components into lower frequencies [see (Gama et al., 2020a)]. Once the frequency content gets spread over the whole spectrum by the consecutive application of the nonlinearities, the integral Lipschitz filters in the deeper layers become discriminative. The stability/discriminability property of ST-GNNs is a pivotal reason why they outperform space-time graph filters.

5 NUMERICAL EXPERIMENTS

We examine two decentralized-control applications: flocking and motion planning. Specifically, we are given a network of $N$ agents, where agent $i$ has position $\mathbf{p}_{i,n} \in \mathbb{R}^2$, velocity $\mathbf{v}_{i,n} \in \mathbb{R}^2$ and acceleration $\mathbf{u}_{i,n} \in \mathbb{R}^2$ at time steps $n < T$. The agents collaborate to learn controller actions and complete a specific task. For each task, the goal is to learn a controller $U$ that imitates an optimal centralized controller $U^*$. Therefore, we parameterize $U$ with ST-GNNs and aim to find $\mathcal{H}^*$ that satisfies

$$\mathcal{H}^* = \arg\min_{\mathcal{H} \in \mathbb{H}} \frac{1}{M} \sum_{m=0}^{M-1} \ell\big( \Phi(\mathcal{X}_m; \mathcal{H}, \mathcal{L}_m),\, \mathbf{U}^*_m \big), \quad (22)$$

where $\mathbb{H}$ is the set of all possible parameterizations, $\ell(\cdot)$ is the mean-squared loss function, $\mathcal{L}_m = (S \oplus \mathcal{L}_\tau)_m$ is the STSO of the $m$th example, and $M$ is the size of the training dataset. The input signal $\mathcal{X}_m \in \mathbb{R}^{q \times N \times T}$ contains $q$ state variables of the $N$ agents, e.g., the position and velocity, over $T$ time steps. A detailed description of the ST-GNN implementation can be found in Appendix G.

A. Network Consensus and Flocking. A swarm of $N$ agents collaborates to learn a reference velocity $\mathbf{r}_n \in \mathbb{R}^2$, for all $n$, according to which the agents move while avoiding collisions. Ideally, the reference velocity and the agent velocity $\mathbf{v}_{i,n}$ should be the same, which is not the case in practice since each agent observes a noisy version $\tilde{\mathbf{r}}_{i,n}$ of the reference velocity. Therefore, the state variables (inputs) are the estimated velocities $\{\mathbf{v}_{i,n}\}_{i,n}$, the observed velocities $\{\tilde{\mathbf{r}}_{i,n}\}_{i,n}$ and the relative positions $\{\mathbf{q}_{i,n} \,|\, \mathbf{q}_{i,n} = \sum_{j \in \mathcal{N}_i}(\mathbf{p}_{i,n} - \mathbf{p}_{j,n})\}_{i,n}$, where $\mathcal{N}_i$ is the set of direct neighbors of the $i$th agent. We train an ST-GNN to predict the agent accelerations $\{\mathbf{u}_{i,n}\}_{i,n}$, according to which the agents will move to a new position with a certain velocity. Further details can be found in Appendix G.

Experiment 1. First, we organize 100 agents in a mesh grid, but the agents are not allowed to move. The reason is that we want to test the ST-GNN in a setting where the graph does not change over time. Note that although the agents do not move, they still estimate an acceleration, and therefore a velocity, according to which they would move. In simple words, the goal is to estimate the optimal acceleration of each agent in case they were required to move. Consequently, training and testing are executed over the same graph but with different state variables and outputs.
Fig. 1 (Left) illustrates the mean of the agent velocities, defined as $\bar{\mathbf{v}}_n = \frac{1}{N}\sum_{i=1}^N \mathbf{v}_{i,n}$, along with the reference velocity $\mathbf{r}_n$. We observe that the agents succeed in reaching consensus and follow the reference velocity. We also execute the previously trained ST-GNN in a setting with perturbed graph and time shift operators according to (8) and (10), respectively. Specifically, the error matrix $E$ is diagonal with $\|E\| \leq \epsilon$, and $z(\tau) = \epsilon \cos(\epsilon\tau) e^{-\epsilon\tau}$. Fig. 1 (Middle) shows the relative RMSE of the outputs $\{\mathbf{u}_{i,n}\}_{i,n}$, defined as $\| \Phi(\mathcal{X}; \mathcal{H}, S \oplus \mathcal{L}_\tau) - \Phi(\mathcal{X}; \mathcal{H}, \hat{S} \oplus \hat{\mathcal{L}}_\tau) \|_F / \| \Phi(\mathcal{X}; \mathcal{H}, S \oplus \mathcal{L}_\tau) \|_F$. We observe that the relative RMSE is linearly proportional to the perturbation size $\epsilon$, following the results of Theorem 2.

Experiment 2. We now allow 50 agents to move and form a communication network that changes with their movement, i.e., the graph of the agents varies over time. Consequently, the ST-GNN is trained on a dynamic graph [see Appendix G]. Fig. 1 (Right) shows the difference between the agent and reference velocities, $\mathbf{v}_{i,n}$ and $\mathbf{r}_n$, averaged over the $N$ agents. The small standard deviation of the velocity difference (shown in red) indicates that the agents reached a consensus on their velocities. The figure also shows that the ST-GNN outperforms the decentralized policy described in Appendix G. This experiment demonstrates the effectiveness of the proposed ST-GNN architecture on dynamic graphs, even though our theoretical analysis is derived for fixed graphs. Further results are presented in Appendix G.

Figure 1: (Left) The mean of the velocity estimates of the agents compared to the reference velocity in a test example. (Middle) The relative RMSE of the ST-GNN outputs under perturbations to the underlying space-time topology. (Right) The mean difference between agent and reference velocities.

Figure 2: (Left) Three-second agent trajectories of a test example. (Middle) Relative error caused by using different network densities. (Right) Relative error caused by using different sampling times $T_s$.

B. Unlabeled Motion Planning. The goal of this task is to assign $N$ unlabeled agents to $N$ target locations. The agents collaborate to find collision-free trajectories to the assigned targets. We train an ST-GNN to learn the control accelerations $\{\mathbf{u}_{i,n}\}_{i,n}$ according to (22). The state variables for agent $i$ are its position $\{\mathbf{p}_{i,n}\}_n$, its velocity $\{\mathbf{v}_{i,n}\}_n$, the positions and velocities of the agent's $M$ nearest neighbors, and the positions of the $M$ nearest targets at each time step $n$. The underlying graph is constructed such that $(i, j) \in \mathcal{E}_n$ if and only if $j \in \mathcal{N}_{i,n}$ or $i \in \mathcal{N}_{j,n}$, where $\mathcal{N}_{i,n}$ is the set of the $M$ nearest neighbors of agent $i$ at time step $n$. The rest of the training parameters are given in Appendix G. We execute the learned ST-GNN on a test dataset that is designed according to the parameter $M$ and the sampling time $T_s$. A snapshot of the predicted trajectories is shown in Fig. 2 (Left). We observe that the agents approach their target goals without collisions. The average distance between an agent's final position and the desired target is $\hat{d}_{pg} = 0.524$, with variance equal to 0.367. We now test the sensitivity of ST-GNNs to the choice of $M$ and $T_s$ in the testing phase. Fig. 2 (Middle) shows the relative error, which is calculated as $(\hat{d}_{pg,\mathrm{pert}} - \hat{d}_{pg,\mathrm{org}}) / \hat{d}_{pg,\mathrm{org}}$, when we change $M$.
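Both relative metrics used in this section can be computed directly from the recorded outputs. The sketch below is illustrative only; the function names are not from the paper, and the arrays are assumed to come from the nominal and perturbed runs of the model.

```python
import numpy as np

def relative_rmse(out_nominal, out_perturbed):
    """||Phi(X; H, S, L_tau) - Phi(X; H, S_hat, L_hat_tau)||_F / ||Phi(X; H, S, L_tau)||_F, cf. Fig. 1 (Middle)."""
    return np.linalg.norm(out_nominal - out_perturbed) / np.linalg.norm(out_nominal)

def relative_error(d_pert, d_org):
    """(d_pg,pert - d_pg,org) / d_pg,org, the sensitivity metric reported in Fig. 2."""
    return (d_pert - d_org) / d_org
```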
The error increases with $M$ because changing the neighborhood size induces a change in the underlying graph. According to our theoretical analysis, the error increases with the size of the graph perturbation, which matches the results in Fig. 2 (Middle). The same behavior is observed in Fig. 2 (Right) when we change the sampling time, which induces a change in the underlying time structure.

6 CONCLUSIONS

In this paper we developed a novel ST-GNN architecture, tailored for time-varying signals that are also supported on a graph. We showed that the proposed architecture is stable under both graph and time perturbations under certain conditions. Our conditions are practical and provide guidelines on how to design space-time graph filters to achieve desirable stability. Our theoretical analysis is supported by strong experimental results. Simulations on the tasks of decentralized flocking and unlabeled motion planning corroborated our theoretical results and demonstrated the effectiveness of the proposed ST-GNN architecture.

REFERENCES

A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research, 20(1):876-924, Jan. 2019.

J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872-1886, Aug. 2013.

B. P. Chamberlain, J. Rowbottom, M. Gorinova, S. Webb, E. Rossi, and M. M. Bronstein. GRAND: Graph neural diffusion. June 2021. URL http://arxiv.org/abs/2106.10934.

K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 180-189, 2020.

M. Cranmer, P. Melchior, and B. Nord. Unsupervised resource allocation with graph neural networks. Proceedings of Machine Learning Research, 1:1-13, June 2021. URL http://arxiv.org/abs/2106.09761.

Z. Fang, Q. Long, G. Song, and K. Xie. Spatial-temporal graph ODE networks for traffic flow forecasting. June 2021. URL http://arxiv.org/abs/2106.12931.

P. Gainza, F. Sverrisson, F. Monti, E. Rodolà, D. Boscaini, M. M. Bronstein, and B. E. Correia. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods, 17(2):184-192, Feb. 2020.

F. Gama and S. Sojoudi. Distributed linear-quadratic control with graph neural networks. 2021. URL https://arxiv.org/abs/2103.08417v2.

F. Gama, A. Ribeiro, and J. Bruna. Stability of graph scattering transforms. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

F. Gama, J. Bruna, and A. Ribeiro. Stability properties of graph neural networks. IEEE Transactions on Signal Processing, 68:5680-5695, 2020a.

F. Gama, Q. Li, E. Tolstaya, A. Prorok, and A. Ribeiro. Synthesizing decentralized controllers with graph neural networks and imitation learning. Dec. 2020b. URL http://arxiv.org/abs/2012.14906.

P. A. Grillet. Abstract Algebra. Springer-Verlag, New York, 2nd edition, 2007.

E. Hajiramezanali, A. Hasanzadeh, N. Duffield, K. R. Narayanan, M. Zhou, and X. Qian. Variational graph recurrent neural networks. In Neural Information Processing Systems (NeurIPS), 2019.

E. Isufi and G. Mazzola. Graph-time convolutional neural networks. In IEEE Data Science and Learning Workshop, June 2021.

D. Jiang, Z. Wu, C. Y. Hsieh, G. Chen, B. Liao, Z. Wang, C. Shen, D. Cao, J. Wu, and T.
Hou. Could graph neural networks learn better molecular representation for drug discovery? a comparison study of descriptor-based and graph-based models. Journal of Cheminformatics, 13(1):12, dec 2021. R. V. Kadison and J. R. Ringrose. Fundamentals of the Theory of Operator Algebras, volume I. Academic Press, 1 edition, 1983. Q. Li, F. Gama, A. Ribeiro, and A. Prorok. Graph neural networks for decentralized multi-robot path planning. In IEEE International Conference on Intelligent Robots and Systems, pages 11785 11792, 2020. Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR 18), 2018. V. Lima, M. Eisen, K. Gatsis, and A. Ribeiro. Resource allocation in large-scale wireless control systems with graph neural networks. In 21th IFAC World Congress, volume 53, pages 2634 2641. Elsevier B.V., Jan. 2020. Published as a conference paper at ICLR 2022 A. Loukas and D. Foucard. Frequency analysis of time-varying graph signals. In 2016 IEEE Global Conference on Signal and Information Processing, pages 346 350, Apr. 2016. A. Nicolicioiu, I. Duta, and M. Leordeanu. Recurrent space-time graph neural networks. Advances in Neural Information Processing Systems, 32, apr 2019. A. Ortega, P. Frossard, J. Kovacevic, J. M. Moura, and P. Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808 828, May 2018. C. Pan, S. Chen, and A. Ortega. Spatio-temporal graph scattering transform. In International Conference on Learning Representations, 2021. A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. B. Schardl, and C. E. Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, pages 5363 5370, 2020. M. Poli, S. Massaroli, C. M. Rabideau, J. Park, A. Yamashita, H. Asama, and J. Park. Continuousdepth neural models for dynamic graph prediction. June 2021. URL http://arxiv.org/ abs/2106.11581. L. Ruiz, F. Gama, and A. Ribeiro. Gated graph recurrent neural networks. IEEE Transactions on Signal Processing, 68:6303 6318, 2020. L. Ruiz, Z. Wang, and A. Ribeiro. Graphon and graph neural network stability. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5255 5259. IEEE, June 2021. A. Sandryhaila and J. M. Moura. Discrete signal processing on graphs. IEEE Transactions on Signal Processing, 61(7):1644 1656, 2013. Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling with graph convolutional recurrent networks. In Advances in Neural Information Processing Systems, pages 362 373, 2018. D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83 98, Oct. 2012. A. Strokach, D. Becerra, C. Corbi-Verge, A. Perez-Riba, and P. M. Kim. Fast and flexible protein design using deep graph neural networks. Cell Systems, 11(4):402 411.e4, Oct. 2020. H. G. Tanner, A. Jadbabaie, and G. J. Pappas. Stable flocking of mobile agents part II: dynamic topology. In Proceedings of the IEEE Conference on Decision and Control, volume 2, pages 2016 2021, 2003. E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro. 
Learning decentralized controllers for robot swarms with graph neural networks. In Proceedings of the Conference on Robot Learning, volume 100, pages 671 682, Oct 2020. M. Turpin, N. Michael, and V. Kumar. Capt: Concurrent assignment and planning of trajectories for multiple robots. The International Journal of Robotics Research, 33(1):98 112, 2014. P. Veliˇckovi c, A. Casanova, P. Liò, G. Cucurull, A. Romero, and Y. Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. International Conference on Learning Representations, ICLR, 2018. Y. Wang, P. Li, C. Bai, and J. Leskovec. Tedic: Neural modeling of behavioral patterns in dynamic social interaction networks. In Proceedings of the Web Conference 2021, WWW 21, page 693 705, New York, NY, USA, 2021. S. Wu, F. Sun, W. Zhang, and B. Cui. Graph neural networks in recommender systems: A survey. Nov. 2020. Y. Wu, M. Gao, M. Zeng, F. Chen, M. Li, and J. Zhang. Bridge DPI: A novel graph neural network for predicting drug-protein interactions. Jan. 2021. Published as a conference paper at ICLR 2022 L.-P. P. A. Xhonneux, M. Qu, and J. Tang. Continuous graph neural networks. In 37th International Conference on Machine Learning, pages 10363 10372, Dec. 2020. S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 7444 7452. AAAI press, Jan. 2018. F. Yang and N. Matni. Communication topology co-design in graph recurrent neural network based distributed control. 2021. URL https://arxiv.org/abs/2104.13868v1. R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 10:974 983, June 2018. B. Yu, H. Yin, and Z. Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018. D. Zou and G. Lerman. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 49(3):1046 1074, 2020. A (PROOF OF) FREQUENCY RESPONSE OF SPACE-TIME GRAPH FILTERS We start from the abstract representation of signal X V, on which we apply a shift operator L : V V. Then we can write the signal X as i X, U(i) V U(i)di, (23) where {U(i)}i is the set of the orthonormal eigenfunctions of L that spans the space V. The form in (23) is a reminiscence of the inverse Fourier transform and the scalar values X, U(i) V are the Fourier coefficients. In classical signal processing, the Fourier coefficients represent the spectral components of the signal X. Now, we consider the example of time-varying graph signals, X L2(RN) L2(R), which are constructed as the tensor product of a continuous-time signal and a graph signal. Shifting the signal X is executed using the STSO, S Lτ, which is the linear composition of the GSO and TSO. The eigenfunctions of the STSO can then be defined as the tensor product vi ejωτ, where vi is an eigenvector of the GSO and ejωτ is an eigenfunction of the TSO. Recall that the eigenfunction is itself a vector in the space L2(RN) L2(R), and thus applying the STSO on vi ejωτ means to jointly shift the the graph vector and the continuous-time function. 
In other words, we have e S Lτ (vi ejωτ) = e Svi e Lτ ejωτ = e (λi+jω)(vi ejωτ). (24) Thus we conclude that every eigenfunction vi ejωτ of the STSO is associated with an eigenvalue λi + jω. Defining the eigenfunctions of the STSO allows us to re-write (23) for time-varying graph signals as 0 X(λi, jω)(vi ejωτ)dω, (25) where X(λi, jω) is the spectral component of the signal X at the frequency pair (λi, ω), and the constant 1 2π is to normalize the eigenfunctions. It is inferred that the frequency domain of timevarying graph signals is bivariate. Considering the space-time graph filter H(S, Lτ), the filter output can be expressed as Y = H(S, Lτ)X = Z 0 h(t)e t S Lτ Xdt 0 X(λi, jω) Z 0 h(t)e t S Lτ (vi ejωτ)dtdω 0 X(λi, jω) Z 0 h(t)e t(λi+jω)dt(vi ejωτ)dω, Published as a conference paper at ICLR 2022 where (a) results from expressing X by its spectral representation [cf. (25)] and (b) from recalling (24). As in (25), the filter output can be also written as Y = 1 2π PN i=1 R 0 Y(λi, jω)(vi ejωτ)dω, and thus we have Y(λi, jω) = X(λi, jω) R 0 h(t)e t(λi+jω)dt, following from (26). From the convolution theorem, the convolution operator implies a multiplication in the spectral domain. Therefore, the frequency response of the space-time graph filter is h(λ, jω) = Z 0 h(t)e t(λ+jω)dt, (27) which is the Laplace transform of the impulse response h(t). It is worth noting that the filter response depends on the filter coefficients, which is irrespective of the graph. The spectral coefficients of the filter applied to one specific graph are, however, obtained by instantiating the frequency response h(λ, jω) on its eigenvalues {λi}N i=1. B PROOF OF PROPOSITION 1 First, we define the eigenvector misalignment between S and E in the following lemma (Gama et al., 2020a). Lemma 1. Let S = VΛVH and E = UMUH with E ϵs. For any eigenvector vi of S, it holds that Evi = mivi + Eivi, (28) with Ei ϵδ, where δ = ( U V + 1)2 1. Now we proceed with the proof of Proposition 1. Proof of Proposition 1. Proposition 1 bounds the distance between the space-time graph filters before and after adding graph perturbations. From (12), this distance can be evaluated as H(S, Lτ) H(ˆS, Lτ) P = min P P max X: X F =1 H(S, Lτ)X PT H(ˆS, Lτ)PX F = min P P max X: X F =1 H(S, Lτ)X H(PT ˆSP, Lτ)X F , (29) where the last step is due to H being permutation equivariant, i.e., PT H(S, Lτ)P = Z 0 h(t)PT e t S Lτ Pdt (a) = Z 0 h(t)e t PT SP Lτ = H(PT SP, Lτ). (30) Equality (a) results from the fact that matrix P commutes with the exponential operator and als with the TSO. From the minimum operator in (29), there exists a matrix P0 P that satisfies H(S, Lτ) H(ˆS, Lτ) P max X: X F =1 H(S, Lτ)X H(PT 0 ˆSP0, Lτ)X F = H(S, Lτ) H(PT 0 ˆSP0, Lτ) (b) = H(S, Lτ) H(S + ES + SE, Lτ) , where . is the operator norm, and (b) follows from the assumption (A1). Define the the filter difference as s := H(S + ES + SE, Lτ) H(S, Lτ). The next step is to find the norm of s. s can be expressed [cf. (5)] as 0 h(t)e t(S+ES+SE) Lτ dt Z 0 h(t)e t S Lτ dt. (32) From Taylor series, e t S Lτ = P n=0 1 n!( t)n(S Lτ)n and the difference is then written as ((S + ES + SE) Lτ)n (S Lτ)n dt. (33) Published as a conference paper at ICLR 2022 By induction, we expand the term ((S + ES + SE) Lτ)n to the first order on E, so we have ((S + ES + SE) Lτ)n (a) = (S Lτ + (ES + SE) Lτ)n = (S Lτ)n + r=0 (S Lτ)r (ES + SE) Lτ (S Lτ)n r 1 + O2(E), (34) where O2(E) combines the other higher order terms, and (a) follows from being a linear composition. 
Hence, (33) reduces to r=0 (S Lτ)r (ES + SE) Lτ (S Lτ)n r 1dt + O(E), (35) with O(E) = R 0 h(t) P n=0 ( t)n n! O2(E)dt. Note that the frequency response of the filter is an analytic function. Thus the norm of O(E) is of order O( E 2) since 0 < lim E 0 O(E) E 2 < . (36) Denote the first term in the right hand side of (35) by (S). After splitting the summands (ES + SE) Lτ in (35) into ES Lτ + SE Lτ, we get r=0 (S Lτ)r E(S Lτ)n rdt r=0 (S Lτ)r+1E(S Lτ)n r 1dt. The first line is straightforward and the second follows from commuting the matrix E with the TSO, Lτ. The required norm becomes s (S) + O(E) from the triangle inequality, with O(E) being of order O(ϵ2 s) by the assumption (A2) and (36). Thus the aim of the proof reduces to bound the norm (S) by 2Cϵs(1 + δ The norm (S) is defined as (S) = max X: X F =1 (S)X F . We express (S)X as 0 X(λi, jω) Z r=0 (S Lτ)r E(S Lτ)n r(vi ejωτ)dtdω 0 X(λi, jω) Z r=0 (S Lτ)r+1E(S Lτ)n r 1(vi ejωτ)dtdω, which follows from expressing (S) as in (37) and X as in (25). Then 0 X(λi, jω) Z r=0 (λi + jω)n r(S Lτ)r E(vi ejωτ)dtdω 0 X(λi, jω) Z r=0 (λi + jω)n r 1(S Lτ)r+1E(vi ejωτ)dtdω, since (S Lτ)q(vi ejωτ) = (λi + jω)q(vi ejωτ). The assumption (A2) indicates that matrix E has eigenvectors that are not aligned with vi i. From Lemma 1, it follows that E(vi ejωτ) = Evi ejωτ = mivi ejωτ + Eivi ejωτ. The terms (S Lτ)p E(vi ejωτ) of (39) can then be simplified as (S Lτ)p E(vi ejωτ) = mi(S Lτ)p(vi ejωτ) + (S Lτ)p Ei(vi ejωτ) = mi(λi + jω)p(vi ejωτ) + (S Lτ)p Ei(vi ejωτ). (40) Published as a conference paper at ICLR 2022 Substituting the first term of (40) in (39) results in 0 X(λi, jω) Z r=0 2mi(λi + jω)n(vi ejωτ)dtdω. Doing the same for the second term of (40) leads to 0 X(λi, jω) Z r=0 (λi + jω)n r(S Lτ)r Ei(vi ejωτ)dtdω 0 X(λi, jω) Z r=0 (λi + jω)n r 1(S Lτ)r+1Ei(vi ejωτ)dtdω, where (S)X = 1(S)X + 2(S)X. The next step is to find the norm of 1(S) and 2(S). For the term 1(S)X in (41), we notice that the inner summation no longer depends on r and can be written as 2nmi(λi + jω)n(vi ejωτ). Therefore, we get n! (λi + jω)ndt = 2mi(λi + jω) λ h(λi, jω), (43) where λ h(λi, jω) = R 0 h(t) P n=0 n n!( t)n(λi + jω)n 1dt is the partial derivative of the frequency response of the filter computed at λ = λi. Thus (41) is written as 0 2mi X(λi, jω) (λi + jω) λ h(λi, jω) (vi ejωτ)dω. (44) From Pythagoras theorem we know that the squared norm of a sum of orthogonal terms is the sum of the squared norm of the individual summands. Then 1(S)X 2 F = 0 4m2 i | X(λi, jω)|2 |λi + jω|2 λ h(λi, jω) 2 dω 4ϵ2 s C2 X 2 F , (45) since |mi| E ϵs, the filter is integral Lipschitz [cf. (17)], and the eigenfunction basis vectors are orthonormal, i.e., vk 1 2πejω τ 2 F = 1. The norm X 2 F is defined as PN i=1 R 0 | X(λi, jω)|2dω. We now rewritte 2(S)X in (42) as 0 X(λi, jω)Ki Ei(vi ejωτ)dω, (46) (λi + jω)n r(S Lτ)r + (λi + jω)n r 1(S Lτ)r+1 dt. (47) We next find the norm of Ki in order to find 2(S) . Note that Ki : L2(RN) L2(R) L2(RN) L2(R) is a linear operator and so is the matrix S : L2(RN) L2(RN). A matrix S can be expressed using the outer product of its eigenvectors as S = PN i=1 λiviv H i . Since the tensor product generalizes the outer product of n-dimensional vectors, we can draw a parallel and have (S Lτ)q = 1 (2π)2 0 (λk + jω )q(vk ejω τ) (vk ejω τ)Hdω . 
(48) Published as a conference paper at ICLR 2022 Calculating (47) requires evaluating the term (λi + jω)n r(S Lτ)r + (λi + jω)n r 1(S Lτ)r+1 first, which can be simplified as (λi + jω)n r(S Lτ)r + (λi + jω)n r 1(S Lτ)r+1 = (λi + jω)n (S Lτ)r (λi + jω)r + (S Lτ)r+1 (λi + jω)r+1 When we substitute (48) in (49), we get (λi + jω)n r(S Lτ)r + (λi + jω)n r 1(S Lτ)r+1 = (λi + jω)n (λk + jω )r (λi + jω)r + (λk + jω )r+1 (λi + jω)r+1 (vk ejω τ) (vk ejω τ)Hdω . The inner summation in (47) reduces to the sum of two geometric series, each of which has the form r = 1 (λi + jω)n 1 (λi + jω)n (λk + jω )n λi + jω (λk + jω ) , (51) and the reader can confirm with a simple algebraic manipulation that the right hand side of (51) follows from the geometric sum Pn 1 r=0 ar = (1 an)/(1 a). It is also straightforward to show that (λi+jω)n n 1 X r + λk + jω r+1 = λi + λk + j(ω + ω ) λi λk + j(ω ω ) (λi+jω)n (λk+jω )n . (52) with some algebraic manipulations. We eventually can write Ki in (47) as λi + λk + j(ω + ω ) λi λk + j(ω ω ) (λi + jω)n (λk + jω )n (vk ejω τ) (vk ejω τ)Hdω dt. Since we have h(λi, jω) = R 0 h(t)e t(λi+jω)dt = R 0 h(t) P n=0 1 n!( t)n(λi + jω)ndt from Taylor series, Ki reduces to Ki = 1 (2π)2 (λi + jω) + (λk + jω ) (λi + jω) (λk jω ) h(λi, jω) h(λk, jω ) (vk ejω τ) (vk ejω τ)Hdω . Note that (54) has the form (w2 + w1)( h(w2) h(w1))/(w2 w1) of (16) with w being replaced by its complex form, λ + jω. This remark will help bound the norm of Ki. The norm Ki is defined as max X: X F =1 Ki X F for all i, and we have Ki X (a) = Ki 0 X(λm, j ω)(vm ej ωτ)d ω 0 X(λk, jω )(λi + jω) + (λk + jω ) (λi + jω) (λk jω ) h(λi, jω) h(λk, jω ) (vk ejω τ)dω . The RHS in (a) involves the inner products 1 (2π)2 (vk ejω τ)H(vm ejωτ) = 1 (2π)2 vk ejω τ, vm ejωτ , m, k. These inner products are between the orthogonal eigenvectors and are nonzero if and only if m = k and ω = ω . Therefore, only the terms that have m = k and ω = ω appeared in (b), and the inner products are 1. Published as a conference paper at ICLR 2022 From Pythagoras theorem, the squared norm of Ki X is then 0 | X(λk, jω )|2 (λi + jω) + (λk + jω ) (λi + jω) (λk jω ) h(λi, jω) h(λk, jω ) 4C2 X 2 F , i, (56) followed from the assumption (A3) that the filter is Lipschitz filter [cf. Definition 4]. Therefore, we get Ki 2C for a unit-norm signal X. Now, it is left to take the norm of (46), which yields 0 X(λi, jω)Ki Ei(vi ejωτ)dω 0 | X(λi, jω)| Ki Ei dω 2Cϵδ 0 | X(λi, jω)|dω, where we have Ei ϵδ, i from Lemma 1. To bound the summation in (57), we can write 0 | X(λi, jω)| 1 dω (a) 0 | X(λi, jω)|2dω 0 | X(λi, jω)|2dω where (a) follows from Cauchy-Schwartz inequality, and (b) from Parseval s theorem. Note that ˆδ(τ) is the Dirac delta function. Eventually, we obtain 2(S)X F 2Cϵsδ N X F . (59) Recall that (S)X = 1(S)X + 2(S)X. From the triangle inequality, we then obtain (S)X F 1(S)X F + 2(S)X F 2Cϵs(1 + δ N) X F . (60) following from and using (45) and (59). Hence, (S) = max X: X F =1 | (S)X F 2Cϵs(1 + δ N), which completes the proof. C PROOF OF PROPOSITION 2 The aim of the proof is to bound the distance between the space-time graph filters before and after adding time perturbation. This distance is calculated using the operator distance modulo translation. 
From Definition 3, it holds that there exists a real value s such that H(S, Lτ) H(S, ˆLτ) T = min s R max x: x F =1 H(S, Lτ)X e s Lτ H(S, ˆLτ)X F (a) max X: X F =1 H(S, Lτ)X H(S, ˆLτ)X F (b) = H(S, Lτ) H(S, (1 + ξ(τ))Lτ) , where in (a) we let the minimum norm value be lower than or equal to the norm when s = 0, and in (b) we use assumption (A1). The difference between the filters is evaluated as H(S, ˆLτ) H(S, Lτ) = Z 0 h(t)e t S (Lτ +ξ(τ)Lτ )dt Z 0 h(t)e t S Lτ dt. (62) Let τ denote H(S, ˆLτ) H(S, Lτ), which can be written as (S (Lτ + ξ(τ)Lτ))n (S Lτ)n dt, (63) Published as a conference paper at ICLR 2022 following from expanding the exponentials to e t S Lτ = P n=0 1 n!( t)n(S Lτ)n using Taylor series. We also expand (S Lτ + S ξ(τ)Lτ)n to the first order of ξ(τ): (S Lτ + S ξ(τ)Lτ)n = (S Lτ)n + r=0 (S Lτ)rξ(τ)(S Lτ)n r + O2(ξ(τ)), (64) where O2(ξ(τ)) is a polynomial of the higher powers of ξ(τ). By substituting (64) in (63), we can re-write the latter as r=0 (S Lτ)rξ(τ)(S Lτ)n rdt + O(ξ(τ)), (65) where O(ξ(τ)) = R 0 h(t) P n=0 ( t)n n! O2(ξ(τ))dt. The quantity O(ξ(τ)) has all the higher power terms and satisfies 0 < lim ξ(τ) 0 O(ξ(τ)) 2 ξ(τ) 2 2 < . (66) Therefore, the norm O(ξ(τ)) 2 is of order O(ϵ2 τ) following from assumption (A2). Remember that if the norm is of order ϵτ, it means that ξ(τ) 2 κϵτ with κ being an absolute constant. Our goal reduces to bound the norm of the first term of the right hand side of (65). Since the TSO and GSO can commute, it holds that (S Lτ)rξ(τ)(S Lτ)n r = (Sr Lr τ)ξ(τ)(Sn r Ln r τ ) = Sn Lr τξ(τ)Ln r τ . (67) However, Lτ and ξ(τ) do not commute due to their dependence on τ. Recalling that Lτ is a differential operator and applying the chain rule, we obtain Lr τ ξ(τ)Ln r τ (a) = ξ(m)(τ)Ln m τ = ξ(τ)Ln τ + G2(ξ (τ)), (68) where (a) is valid by induction, ξ(m)(τ) = mξ(τ)/ τ m and the term G2(ξ (τ)) contains the higherorder derivatives starting from the first derivative. Substituting (67) and (68) in (65), we get r=0 Sn ξ(τ)Ln τ dt + G(ξ (τ)) + O(ξ(τ)) n! nξ(τ)(S Lτ)ndt + G(ξ (τ)) + O(ξ(τ)). The summands of the inner summation in (69) no longer depend on r leading to the form in (a). We also have G(ξ (τ)) = R 0 h(t) P n=0 ( t)n n! Pn 1 r=0 G2(ξ (τ))dt. It can now be shown that the norm G(ξ (τ)) 2 is of order O(ϵ2 τ) since 0 < lim ξ (τ) 2 0 G(ξ (τ)) 2 ξ (τ) 2 < , (70) and ξ (τ) 2 is of order ϵ2 τ according to assumption (A2). Denote the first term in the right hand side of (69) by (Lτ). Since both O(ξ(τ)) and G(ξ (τ)) are of order ϵ2 τ, the rest of the proof aims to show that the norm of (Lτ) is bounded by Cκϵτ. We have n! nξ(τ)(S Lτ)ndt 0 X(λi, jω)(vi ejωτ)dω 0 X(λi, jω) Z n! nξ(τ)(S Lτ)n(vi ejωτ)dtdω. Published as a conference paper at ICLR 2022 Recalling that (vi ejωτ) is an eigenfunction of (S Lτ), we can re-write (71) as 0 X(λi, jω)ξ(τ) Z n n!( t)n(λi + jω)ndt(vi ejωτ)dω 0 X(λi, jω)ξ(τ)(λi + jω) jω h(λi, jω)(vi ejωτ)dω, where jω h(λi, jω) = R 0 h(t) P n=0 n n!( t)n(λi + jω)n 1dt. The the squared norm of (72) can then be evaluated as X(λi, jω) 2 |ξ(τ)|2|λi + jω|2 jω h(λi, jω) = |λi + jω|2 jω h(λi, jω) 0 |ξ(τ)|2dτ Z X(λi, jω) 2 dω = |λi + jω|2 jω h(λi, jω) 2 ξ(τ) 2 2 X 2 F (b) C2κ2ϵ2 τ X 2 F . In (a), we use the inequality | R 0 xdx|2 R 0 |x|2dx. In (b), we use assumption (A3), which states that the filter is integral Lipschitz with a constant C, and assumption (A2), which has the norm ξ(τ) 2 bounded by κϵτ. Finally, the required norm can be calculated as (Lτ) = max X: X F =1 (Lτ)X F Cκϵτ, which completes the proof. 
D PROOF OF THEOREM 1

The aim of the proof is to bound the distance between the space-time graph filters before and after adding the joint perturbations. From (15), this distance can be evaluated as

$\|H(S, \mathcal{L}_\tau) - H(\hat S, \hat{\mathcal{L}}_\tau)\|_{P,T} = \min_{P\in\mathcal{P}} \min_{s\in\mathbb{R}} \max_{\mathcal{X}:\|\mathcal{X}\|_F=1} \big\|H(S, \mathcal{L}_\tau)\mathcal{X} - H\big(P^T\hat S P,\; e^{-s\mathcal{L}_\tau}\hat{\mathcal{L}}_\tau\big)\mathcal{X}\big\|_F \overset{(a)}{\le} \max_{\mathcal{X}:\|\mathcal{X}\|_F=1} \|H(S, \mathcal{L}_\tau)\mathcal{X} - H(S + SE + ES, \hat{\mathcal{L}}_\tau)\mathcal{X}\|_F = \|H(S, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau)\|,$

where (a) is true for any specified values of $P$ and $s$, and we choose $P_0$ and $s_0$ as in (31) and (61), respectively. Then, we add and subtract $H(S + SE + ES, \mathcal{L}_\tau)$ from the filter difference to get

$H(S, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau) = H(S, \mathcal{L}_\tau) - H(S + SE + ES, \mathcal{L}_\tau) + H(S + SE + ES, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau).$ (75)

Using the triangle inequality, the norm of the filter difference is bounded by

$\|H(S, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau)\| \le \|H(S, \mathcal{L}_\tau) - H(S + SE + ES, \mathcal{L}_\tau)\| + \|H(S + SE + ES, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau)\| = \|\Delta_s\| + \|\Delta_\tau\|.$ (76)

Note that in Proposition 2, $\Delta_\tau$ is defined for the GSO $S$, but the proposition is valid for any GSO. The bounds on $\|\Delta_s\|$ and $\|\Delta_\tau\|$ are obtained in Propositions 1 and 2, respectively. Therefore, we get

$\|H(S, \mathcal{L}_\tau) - H(S + SE + ES, \hat{\mathcal{L}}_\tau)\| \le 2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2),$ (77)

where $O(\epsilon^2)$ is the larger of $O(\epsilon_s^2)$ and $O(\epsilon_\tau^2)$. This completes the proof.

D.1 FINITE-IMPULSE RESPONSE FILTERS

While Theorem 1 handles space-time graph filters with a continuous-time impulse response $h(t)$, its results can be extended to finite-impulse response (FIR) filters following the same proofs as in Sections B, C, and D. In this context, FIR filters can be thought of as a special case of the filter $h(t)$ with a finite number of filter coefficients. We summarize this remark in the following lemma.

Lemma 2. Define a space-time graph filter with a finite impulse response as

$H_d(S, \mathcal{L}_\tau) = \sum_{k=0}^{K} h_k\, e^{-k T_s\, S \diamond \mathcal{L}_\tau}.$ (78)

Under the GSOs defined in Proposition 1 and the TSOs in Proposition 2, the distance between the FIR space-time graph filters $H_d(S, \mathcal{L}_\tau)$ and $H_d(\hat S, \hat{\mathcal{L}}_\tau)$ satisfies

$\|H_d(S, \mathcal{L}_\tau) - H_d(\hat S, \hat{\mathcal{L}}_\tau)\|_{P,T} \le 2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2).$ (79)

E PROOF OF THEOREM 2

The goal of the proof is to bound the difference between the GNN outputs before and after adding the perturbations. The GNN output is the output of the $L$-th layer, i.e., $\mathcal{X}_L$. From (7), we can write the difference between the two GNNs as

$\|\Phi(\mathcal{X}_0; \mathcal{H}, S \diamond \mathcal{L}_\tau) - \Phi(\mathcal{X}_0; \mathcal{H}, \hat S \diamond \hat{\mathcal{L}}_\tau)\|_F = \|\mathcal{X}_L - \hat{\mathcal{X}}_L\|_F = \big\|\sigma\big(H_L(S, \mathcal{L}_\tau)\mathcal{X}_{L-1}\big) - \sigma\big(H_L(\hat S, \hat{\mathcal{L}}_\tau)\hat{\mathcal{X}}_{L-1}\big)\big\|_F,$ (80)

where $H_L(S, \mathcal{L}_\tau)$ is the filter in (78) at layer $L$, $\mathcal{X}_\ell$ is the output of layer $\ell$, and $\hat{\mathcal{X}}_\ell$ is the corresponding output after adding the perturbations. From assumption (A2), we have $\|\sigma(\mathcal{X}_2) - \sigma(\mathcal{X}_1)\|_2 \le \|\mathcal{X}_2 - \mathcal{X}_1\|_2$. Accordingly, the norm of the output difference at any layer $\ell \le L$ is bounded by

$\|\mathcal{X}_\ell - \hat{\mathcal{X}}_\ell\|_F \le \|H_\ell(S, \mathcal{L}_\tau)\mathcal{X}_{\ell-1} - H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\hat{\mathcal{X}}_{\ell-1}\|_F.$ (81)

Adding and subtracting $H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\mathcal{X}_{\ell-1}$ inside the norm on the right-hand side, the norm in (81) can then be written as

$\|\mathcal{X}_\ell - \hat{\mathcal{X}}_\ell\|_F \overset{(a)}{\le} \|H_\ell(S, \mathcal{L}_\tau)\mathcal{X}_{\ell-1} - H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\mathcal{X}_{\ell-1}\|_F + \|H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\mathcal{X}_{\ell-1} - H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\hat{\mathcal{X}}_{\ell-1}\|_F \overset{(b)}{\le} \|H_\ell(S, \mathcal{L}_\tau) - H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\|\, \|\mathcal{X}_{\ell-1}\|_F + \|H_\ell(\hat S, \hat{\mathcal{L}}_\tau)\|\, \|\mathcal{X}_{\ell-1} - \hat{\mathcal{X}}_{\ell-1}\|_F,$ (82)

using the triangle inequality in (a) and the Cauchy-Schwarz inequality in (b). From assumption (A1), the filters have unit operator norm, i.e., $\|H_\ell(S, \mathcal{L}_\tau)\| = 1$ for $\ell = 1, \dots, L$, and hence $\|\mathcal{X}_\ell\|_F \le \|\mathcal{X}_{\ell-1}\|_F \le \dots \le \|\mathcal{X}_0\|_F$. From the stability of the graph filters in Theorem 1 (with Lemma 2), the difference becomes

$\|\mathcal{X}_L - \hat{\mathcal{X}}_L\|_F \le \big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big)\, \|\mathcal{X}_0\|_F + \|\hat{\mathcal{X}}_{L-1} - \mathcal{X}_{L-1}\|_F.$ (83)

Substituting (82) in (83) recursively, we obtain

$\|\mathcal{X}_L - \hat{\mathcal{X}}_L\|_F \le 2CL\epsilon_s(1 + \delta\sqrt{N})\, \|\mathcal{X}_0\|_F + CL\kappa\epsilon_\tau\, \|\mathcal{X}_0\|_F + O(\epsilon^2).$ (84)
Note that the base case of the recursion is calculated as

$\|\mathcal{X}_1 - \hat{\mathcal{X}}_1\|_F \le \|H_1(S, \mathcal{L}_\tau)\mathcal{X}_0 - H_1(\hat S, \hat{\mathcal{L}}_\tau)\mathcal{X}_0\|_F \le \|H_1(S, \mathcal{L}_\tau) - H_1(\hat S, \hat{\mathcal{L}}_\tau)\|\, \|\mathcal{X}_0\|_F \le \big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big)\, \|\mathcal{X}_0\|_F.$ (85)

Eventually, as in (74), it follows that the joint operator distance modulo permutation and translation between the outputs of the ST-GNNs satisfies

$\|\Phi(\cdot; \mathcal{H}, S \diamond \mathcal{L}_\tau) - \Phi(\cdot; \mathcal{H}, \hat S \diamond \hat{\mathcal{L}}_\tau)\|_{P,T} \le 2CL\epsilon_s(1 + \delta\sqrt{N}) + CL\kappa\epsilon_\tau + O(\epsilon^2),$ (86)

completing the proof.

F STABILITY OF MULTIPLE-INPUT MULTIPLE-OUTPUT ST-GNNS

Theorem 2 only considers the case of single-feature layers. In this section, we extend the stability analysis to the case of multiple features. We derive a bound for the difference in the output of ST-GNNs in Corollary 1, where each layer, as well as the input signal, has multiple features. For simplicity, we let the number of features be the same at all hidden layers.

Corollary 1 (ST-GNN Stability). Consider the assumptions of Theorem 2, but let $F_\ell = F$ be the number of features per layer for $1 \le \ell \le L-1$. Let $F_0$ and $F_L$ be the numbers of features of the input and output signals, respectively. Then,

$\|\Phi(\cdot; \mathcal{H}, S \diamond \mathcal{L}_\tau) - \Phi(\cdot; \mathcal{H}, \hat S \diamond \hat{\mathcal{L}}_\tau)\|_{P,T} \le \Big(\prod_{l=1}^{L} F_l\Big)\big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big).$ (87)

Proof. We aim to bound the difference between the outputs of the $L$-th layer before and after adding the joint perturbations. For a layer $\ell$ and feature $f$, we define its output as

$\mathcal{X}_\ell^f = \sigma\Big(\sum_{g=1}^{F} H_\ell^{fg}(S, \mathcal{L}_\tau)\, \mathcal{X}_{\ell-1}^g\Big),$ (88)

where $H_\ell^{fg}$ is a single-input single-output filter, as the one used in Theorem 2. From assumption (A2) of Theorem 2, we have the identity $\|\sigma(x_2) - \sigma(x_1)\|_2 \le \|x_2 - x_1\|_2$. We can then bound the output difference for all $1 < \ell < L$:

$\|\mathcal{X}_\ell^f - \hat{\mathcal{X}}_\ell^f\|_F \overset{(a)}{\le} \sum_{g=1}^{F} \|H_\ell^{fg}(S, \mathcal{L}_\tau)\mathcal{X}_{\ell-1}^g - H_\ell^{fg}(\hat S, \hat{\mathcal{L}}_\tau)\hat{\mathcal{X}}_{\ell-1}^g\|_F \overset{(b)}{\le} \sum_{g=1}^{F} \Big( \|H_\ell^{fg}(S, \mathcal{L}_\tau) - H_\ell^{fg}(\hat S, \hat{\mathcal{L}}_\tau)\|\, \|\mathcal{X}_{\ell-1}^g\|_F + \|H_\ell^{fg}(\hat S, \hat{\mathcal{L}}_\tau)\|\, \|\hat{\mathcal{X}}_{\ell-1}^g - \mathcal{X}_{\ell-1}^g\|_F \Big) \overset{(c)}{\le} F\big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big)\, \|\mathcal{X}_{\ell-1}^1\|_F + F\, \|\hat{\mathcal{X}}_{\ell-1}^1 - \mathcal{X}_{\ell-1}^1\|_F,$ (89)

where (a) follows from the triangle inequality, (b) from (82), and (c) from Theorem 1. The summation in (b) has $F$ equivalent summands since we assume that all the filters satisfy assumption (A1) of Theorem 2. Applying the bound to $\|\hat{\mathcal{X}}_{\ell-1}^1 - \mathcal{X}_{\ell-1}^1\|_F$ recursively, we can re-write (89) as

$\|\mathcal{X}_\ell^f - \hat{\mathcal{X}}_\ell^f\|_F \le \Big(\prod_{l=1}^{\ell} F_l\Big)\big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big).$ (90)

Note that the base case of the recursion is calculated as

$\|\mathcal{X}_1^f - \hat{\mathcal{X}}_1^f\|_F \le \sum_{g=1}^{F_0} \|H_1^{fg}(S, \mathcal{L}_\tau)\mathcal{X}_0^g - H_1^{fg}(\hat S, \hat{\mathcal{L}}_\tau)\mathcal{X}_0^g\|_F \le F_0\big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big)\, \|\mathcal{X}_0^1\|_F.$ (91)

Finally, we can express the difference between the outputs at layer $L$. The output has multiple features, and therefore the norm of the difference is expressed as

$\|\mathcal{X}_L - \hat{\mathcal{X}}_L\|^2 = \sum_f \|\mathcal{X}_L^f - \hat{\mathcal{X}}_L^f\|^2 \overset{(a)}{\le} \Big(\prod_{l=1}^{L} F_l\Big)^2\big(2C\epsilon_s(1 + \delta\sqrt{N}) + C\kappa\epsilon_\tau + O(\epsilon^2)\big)^2,$ (92)

where (a) is calculated from (90) for $\ell = L$. Taking the square root yields the inequality in (87), which completes the proof.

G EXTENDED NUMERICAL RESULTS

We consider the problem of decentralized controllers, where we are given a team of $N$ agents, each of which has a position $p_{i,n} \in \mathbb{R}^2$, a velocity $v_{i,n} \in \mathbb{R}^2$ and an acceleration $u_{i,n} \in \mathbb{R}^2$, captured at times $nT_s$, $n \in \mathbb{Z}_+$. The goal is to learn controller actions that allow the agents to move together and complete a specific task. Two tasks are considered in this paper: flocking and unlabeled motion planning. Optimal centralized controllers for both tasks have been derived in the literature. However, centralized controllers require access to the information of all the agents, and therefore their computational complexity scales quickly with the number of agents. With the help of ST-GNNs, we can find decentralized controllers that imitate the centralized solutions according to (22).
In this section, we provide a detailed description of the experiments in Section 5 along with further experiments.

Communication networks. The underlying graphs represent the communication networks between the agents. Two criteria can be used to assemble the graph. The first is to connect each agent to its $M$ nearest neighbors. The graph at a time step $n$ is then represented by a binary graph $\mathcal{G}_n$ such that $(i,j) \in \mathcal{E}_n$ if and only if $j \in \mathcal{N}_{i,n}$ or $i \in \mathcal{N}_{j,n}$, where $\mathcal{N}_{i,n}$ is the set of the $M$ nearest neighbors of agent $i$. The second is to connect each agent to the neighbors within a communication range $R$. The graph in this case is also a binary graph such that $(i,j) \in \mathcal{E}_n$ if and only if $\|p_{i,n} - p_{j,n}\| < R$. In both cases, when the agents move, the graph $\mathcal{G}_n$ changes with $n$.

ST-GNN Implementation. As indicated above, the communication networks (i.e., graphs) change with the movement of the agents (i.e., with time). However, ST-GNNs in (7) are designed for fixed graphs. Moreover, ST-GNNs are designed for continuous-time signals, whereas the signals in our experiments are discrete. To work around these challenges, we implement the FIR space-time graph filter recursively, where at every time step we use the corresponding underlying graph (a minimal sketch of this recursive implementation is given at the end of Section G.1.1). The output of the filter at a time step $n$ is expressed as

$y_n = h_0 x_n + \sum_{k=1}^{K} h_k\, S_n S_{n-1} \cdots S_{n-k+1}\, x_{n-k},$ (93)

where $x_n$ and $S_n$ are the graph signal and the GSO at time step $n$. The graph diffusion is implemented in (93) with the GSO directly instead of the operator exponential $e^{-S_n}$. The latter can be expressed as

$e^{-S_n} = \sum_{l=0}^{\infty} \frac{(-1)^l}{l!} S_n^l = I - S_n + O(S_n),$ (94)

where $O(S_n)$ contains the higher-order terms. Equation (94) shows that the GSO is a first-order approximation of $e^{-S_n}$, and the sign difference is absorbed by the learnable parameters $\{h_k\}_{k=0}^{K}$.

G.1 APPLICATION I: FLOCKING AND NETWORK CONSENSUS

In this application, the agents collaborate to avoid collisions and learn to move according to a reference velocity $r_n \in \mathbb{R}^2$. The reference velocity is generated randomly as $r_{n+1} = r_n + T_s\, \dot r_n$, $n \in \mathbb{Z}_+$, $n < T$. The initial value $r_0$ is sampled from a Gaussian distribution and so is $\dot r_n$, with zero mean and expected norms $\mathbb{E}[\|r_0\|]$ and $\mathbb{E}[\|\dot r_n\|]$, respectively. At each time step $n$, each agent $i$ observes a biased reference velocity $r_{i,n} = r_n + \bar r_i$, where the bias $\bar r_i$ is white Gaussian with independent and identically-distributed (i.i.d.) components. The goal is to learn acceleration actions $\{u_{i,n}\}_{i,n}$ that allow the agents to form a swarm moving with the same velocity.

Mobility model. Given acceleration actions $\{u_{i,n}\}_{i,n}$, each agent moves according to the equations of motion:

$v_{i,n+1} = v_{i,n} + T_s u_{i,n}, \qquad p_{i,n+1} = p_{i,n} + T_s v_{i,n} + \frac{T_s^2}{2} u_{i,n}.$ (95)

The initial velocity is assumed to be $v_{i,0} = r_0 + \Delta v$ for all $i$, where $\Delta v$ is white Gaussian with i.i.d. components. In our experiments, we aim to learn accelerations $\{u_{i,n}\}_{i,n}$ that make $v_{i,n}$, for all $i$, as close as possible to $r_n$, while preventing the position differences $p_{ij,n} = p_{i,n} - p_{j,n}$ from vanishing for any $i \ne j$ at any time step $n$. We solve this problem under the constraints $\|u_{i,n}\|_2 \le \mu$ for all $i, n$.

Optimal centralized controllers. The problem described above has an optimal solution, which, for all $i = 1, \dots, N$ and $n \le T$, is given as

$u_{i,n}^* = -\sum_{j=1}^{N} \nabla_{p_{i,n}} C(p_{i,n}, p_{j,n}),$ (96)

if $\|u_{i,n}^*\|_2 \le \mu$, and $\|u_{i,n}^*\|_2 = \mu$ otherwise. The collision-avoidance potential $C(\cdot)$ is defined as

$C(p_i, p_j) = \begin{cases} 1/\|p_{ij}\|_2^2 + \log(\|p_{ij}\|_2^2), & \|p_{ij}\|_2 \le \gamma, \\ 1/\gamma^2 + \log(\gamma^2), & \text{otherwise}, \end{cases}$ (97)

where $p_{ij} = p_i - p_j$ and $\gamma = 1$ (Tanner et al., 2003; Gama et al., 2020b).
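To make the controller above concrete, the following is a minimal NumPy sketch of the collision-avoidance part of (96)-(97): the gradient of the potential in (97) is accumulated over all other agents, and the resulting acceleration is saturated to the maximum norm $\mu$. The exact sign conventions and any additional velocity-alignment terms of (96) are not fully legible in this appendix, so they are assumptions here, and the function names are illustrative only.

```python
import numpy as np

def collision_potential_grad(p_ij, gamma=1.0):
    # Gradient of the collision-avoidance potential C in (97) with respect to p_i,
    # assuming C = 1/||p_ij||^2 + log(||p_ij||^2) inside the cutoff gamma and a
    # constant value (hence zero gradient) beyond it.
    d2 = float(np.dot(p_ij, p_ij))
    if np.sqrt(d2) > gamma:
        return np.zeros_like(p_ij)
    return -2.0 * p_ij / d2**2 + 2.0 * p_ij / d2

def centralized_accelerations(p, mu=3.0, gamma=1.0):
    # Collision-avoidance part of the centralized controller (96): descend the summed
    # potential and saturate each acceleration so that ||u_i||_2 <= mu.
    # p: (N, 2) array of agent positions at the current time step.
    u = np.zeros_like(p)
    for i in range(p.shape[0]):
        for j in range(p.shape[0]):
            if j != i:
                u[i] -= collision_potential_grad(p[i] - p[j], gamma)
        norm = np.linalg.norm(u[i])
        if norm > mu:
            u[i] *= mu / norm  # enforce the acceleration constraint of (96)
    return u

# Example: four agents packed closer than gamma repel each other.
positions = np.array([[0.0, 0.0], [0.6, 0.0], [0.0, 0.6], [0.6, 0.6]])
print(centralized_accelerations(positions, mu=3.0))
```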
Note that (96) requires access to data from all the agents, and this solution is therefore known as the optimal centralized solution. In a decentralized setting, however, the agents exchange information only within their $K$-hop neighborhood. Therefore, our goal is to find acceleration controls that imitate the centralized solution in (96).

Decentralized Controllers. We compare our results to a decentralized controller that only has access to the information shared within the $K$-hop neighborhood of each agent. Unlike the centralized policy, where all the information is available to the central unit, the agents only have access to the data that they receive from their neighbors. More importantly, the data get delayed through the communication network, and the agents have to make their predictions based on outdated data. Therefore, the accelerations in (96) are approximated by

$u_{i,n}^{\mathrm{dec}} = -\sum_{k=1}^{K} \frac{1}{K\, |\mathcal{N}_{i,n}^k|} \sum_{j \in \mathcal{N}_{i,n}^k} \nabla_{p_{i,n}} C\big(p_{i,n},\, p_{j,(n-k)}\big),$ (98)

where $|\cdot|$ is the set cardinality, and $\mathcal{N}_{i,n}^k = \{j \in \mathcal{N}_{j,(n-1)}^{k-1} \,|\, j \in \mathcal{N}_{i,n}\}$ is the set of $k$-hop neighbors.

G.1.1 EXPERIMENT 1: NETWORK CONSENSUS

In this section, we present a detailed description of the training process. Recall that the agents in this experiment are not moving and the graphs are fixed mesh grids. The state variables are the estimated and observed velocities, $v_{i,n}$ and $r_{i,n}$, for all $i, n$.

Training. The dataset is generated according to the mobility model in (95) and the centralized policy in (96). It consists of 500 time-varying graph signals $\{\mathcal{X}_m\}_{m=1}^{500}$ that are computed under the optimal centralized policies $\{\mathcal{U}_m\}_{m=1}^{500}$. We split the data into 460 examples for training, 20 for validation, and 20 for testing. We train a 2-layer ST-GNN on the training data and optimize the mean squared loss using the ADAM algorithm with learning rate 0.01 and decay factors $\beta_1 = 0.9$ and $\beta_2 = 0.999$.¹ We then keep the model with the lowest cost (over the validation data) among 30 epochs, where the cost is averaged over the $T$ time steps. For a single time step, the cost is calculated as

$c(u_n) = \frac{1}{2N} \sum_{i=1}^{N} \Big\| v_{i,n} + T_s u_{i,n} - \frac{1}{N}\sum_{j=1}^{N} r_{j,n} \Big\|_2^2, \quad n < T.$ (99)

All the training and simulation parameters are shown in Table 1.

Execution. In addition to the 20 test examples generated with the training set, we generate another 20 examples under joint graph and time perturbations for each value of $\epsilon$. The perturbation sizes of the graph and time topologies are chosen to be the same. The results shown in Fig. 1 (Middle) suggest that the closer the space-time structures are, the closer the outputs of the trained ST-GNN are under the two cases. This shows that an ST-GNN, which is trained with signals defined on one particular graph and sampled at a certain sampling rate, can generalize to signals with a different underlying topology as long as the two topologies are close.

¹We used the GNN library at https://github.com/alelab-upenn/graph-neural-networks

Table 1: Simulation parameters in Experiments #1 and #2.
$\mathbb{E}[\|r_0\|] = \mathbb{E}[\|\dot r_n\|]$: 1 m/s
$\mathbb{E}[\|\bar r_i\|] = \mathbb{E}[\|\Delta v\|]$: 1 m/s
Initial agent density, $\rho_0$: 0.5 agents/m²
Communication range, $R$: 2 m
Maximum acceleration value, $\mu$: 3 m/s²
Time steps, $T$: 100
Sampling time, $T_s$: 0.1 s
ST-GNN features/layer, $F_{0:2}$: 4, 16, 2 (#1) and 6, 64, 2 (#2)
Filter taps/layer, $K_{1:2}$: 4, 1
Activation function, $\sigma$: tanh

Figure 3: Flocking experiment. (a), (b), and (c): the estimated velocities of some agents in three different examples from the test dataset; (d): the positions of the agents in (c) at the start of the simulation and after 9 s (the red arrows represent the agent velocities).
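Before moving to the flocking experiment, we illustrate the recursive implementation of the FIR space-time graph filter discussed around (93). The sketch below assumes the delayed-aggregation form $y_n = h_0 x_n + \sum_{k=1}^{K} h_k\, S_n S_{n-1}\cdots S_{n-k+1}\, x_{n-k}$, realized with the recursion $z^{(k)}_n = S_n z^{(k-1)}_{n-1}$; the exact ordering of the graph shifts in (93) is an assumption, and the function name is illustrative.

```python
import numpy as np

def st_fir_filter(x_seq, S_seq, h):
    # Recursive FIR space-time graph filter, cf. (93).  Each tap k diffuses the signal
    # from k time steps ago through the k most recent GSOs, so information reaches a
    # node only after the corresponding communication delay.
    # x_seq: (T, N, F) signal sequence, S_seq: (T, N, N) GSO sequence, h: (K+1,) taps.
    T, N, F = x_seq.shape
    K = len(h) - 1
    z_prev = [np.zeros((N, F)) for _ in range(K + 1)]  # z^{(k)}_{n-1}, zero before n = 0
    y_seq = np.zeros_like(x_seq)
    for n in range(T):
        z = [x_seq[n]]                                 # z^{(0)}_n = x_n
        for k in range(1, K + 1):
            z.append(S_seq[n] @ z_prev[k - 1])         # z^{(k)}_n = S_n z^{(k-1)}_{n-1}
        y_seq[n] = sum(h[k] * z[k] for k in range(K + 1))
        z_prev = z
    return y_seq

# Example: 4 filter taps (K = 3) on a sequence of random binary graphs, as in Table 1.
rng = np.random.default_rng(0)
T, N, F = 100, 50, 4
S_seq = (rng.random((T, N, N)) < 0.1).astype(float)
x_seq = rng.normal(size=(T, N, F))
print(st_fir_filter(x_seq, S_seq, rng.normal(size=4)).shape)  # (100, 50, 4)
```

A 2-layer ST-GNN then cascades such filters with the pointwise tanh nonlinearity listed in Table 1.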
G.1.2 EXPERIMENT 2: FLOCKING

In this experiment, we have 50 agents that are spread uniformly with initial density $\rho_0$. The agents are allowed to move with their velocities $\{v_{i,n}\}_{i,n}$. At each time step $n$, the graph $\mathcal{G}_n$ represents their communication network based on a communication range $R$. We train a 2-layer ST-GNN implemented as in (93), and the training procedure is exactly as in the previous experiment. The dataset consists of 800 training, 100 validation, and 100 test examples. We execute the trained ST-GNNs on signals defined over graphs constructed with the same initial agent density $\rho_0$ and sampled at the same sampling rate $1/T_s$.

Fig. 3 depicts three examples of the learned velocities compared to the reference velocity. We notice that the agents follow the trend of the reference velocity. The mean and variance of the difference between the agent and reference velocities were shown before in Fig. 1 (Right). Fig. 3d illustrates the positions of the swarm after 9 s and shows that the swarm moved together with the same velocity in the same direction. Fig. 1 (Right) also shows that the proposed architecture outperforms the decentralized policy in (98): the difference between the estimated and reference velocities is lower under ST-GNNs than under the decentralized policy.

Figure 4: The estimated velocities of some agents in two different examples from the test dataset using ST-GNNs (left) and a decentralized controller (right).

Examples from the test dataset are shown in Fig. 4. It is clear that ST-GNNs help the agents mitigate the delays in the received information and learn accelerations that follow the reference velocity. Graph and time perturbations appear in real applications as changes in the underlying graph-time topology. In the following, we provide extended experiments to show the effect of perturbations. In particular, we repeat the same experiment either with graphs constructed with different agent densities or under different values of the sampling time $T_s$.

Experiment 2.I. In this experiment, we execute the trained ST-GNN on graphs constructed under different initial agent densities (i.e., a source of graph perturbations). We execute the trained ST-GNNs on 50 signals generated with a different initial agent density. We repeat the experiments under densities 1/2, 1/32, 1/128, and 1/512 agents/m², respectively, and plot the relative costs after 10 s in Figure 5 (Left). The cost represents the mean difference between the agent velocity and the mean of the observed velocities, which is calculated as

$\mathrm{cost} = \frac{1}{2N} \sum_{i=1}^{N} \Big\| v_{i,n} - \frac{1}{N}\sum_{j=1}^{N} r_{j,n} \Big\|_2^2.$ (100)

The relative cost is then calculated as the relative deviation from the cost under a density of 2 agents/m². Figure 5 (Left) shows that for small changes in the density (e.g., from the original density of 2 to 0.5 agents/m²), the relative cost remains small. However, higher values of $\Delta\rho$ result in higher relative costs, i.e., a degradation in the performance. Note that using smaller agent densities (i.e., higher $\Delta\rho$) at the same communication range $R$ results in sparser graphs. Therefore, their distances to the graphs generated at the original density (2 agents/m²) increase, leading to the aforementioned performance degradation.

Experiment 2.II. In this experiment, we execute the trained ST-GNN on signals sampled at different sampling rates (i.e., a source of time perturbations). For several values of $\Delta T_s$, we replace the sampling time $T_s$ with $T_s + \Delta T_s$ and calculate the relative costs as in Experiment 2.I.
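Since (100) is only partially legible above, the following minimal sketch spells out our reading of the relative-cost metric used in Experiments 2.I and 2.II: the per-step cost compares each agent velocity with the mean of the observed reference velocities, and the relative cost is the deviation from the cost obtained at the unperturbed topology. The helper names and the exact normalization are assumptions.

```python
import numpy as np

def consensus_cost(v, r_obs):
    # Cost at one time step, cf. (100): mean squared difference between each agent
    # velocity and the mean of the observed (biased) reference velocities.
    # v: (N, 2) agent velocities, r_obs: (N, 2) observed reference velocities.
    r_bar = r_obs.mean(axis=0)
    return 0.5 * np.mean(np.sum((v - r_bar) ** 2, axis=1))

def relative_cost(cost_perturbed, cost_reference):
    # Relative deviation from the cost at the unperturbed topology (Figure 5).
    return (cost_perturbed - cost_reference) / cost_reference

# Example: evaluate a perturbed rollout against the reference rollout after 10 s.
rng = np.random.default_rng(1)
r_obs = rng.normal(size=(50, 2))
v_reference = r_obs.mean(axis=0) + 0.05 * rng.normal(size=(50, 2))
v_perturbed = r_obs.mean(axis=0) + 0.20 * rng.normal(size=(50, 2))
print(relative_cost(consensus_cost(v_perturbed, r_obs),
                    consensus_cost(v_reference, r_obs)))
```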
Figure 5 (Right) shows that the smaller the value of $\Delta T_s$, the smaller the relative cost is. This matches our theoretical results.

Figure 5: Relative cost after 10 s calculated over test datasets that have encountered either graph perturbations resulting from changing the agent density $\rho_0$ (Left), or time perturbations resulting from changing the sampling time $T_s$ (Right).

Table 2: Simulation parameters in Section G.2.
No. of agents, $N$: 12
Neighborhood size, $M$: 5
Initial minimum distance, $d$: 1.5 m
Initial velocity, $v_{i,0}$: 4 m/s
Maximum acceleration value, $\mu$: 5 m/s²
Time steps, $T$: 30
Sampling time, $T_s$: 0.1 s
ST-GNN features/layer, $F_{0:2}$: 34, 64, 2
Filter taps/layer, $K_{1:2}$: 3, 1
Activation function, $\sigma$: tanh

G.2 APPLICATION II: UNLABELED MOTION PLANNING

In this problem, we aim to assign $N$ unlabeled agents to $N$ target goals $\{g_j \in \mathbb{R}^2\}_{j=1}^{N}$ by planning their collision-free trajectories. The term unlabeled implies that the assignment is not pre-determined; it is computed online. At the start of the experiment, the agents are spread with a minimum inter-agent distance $d$, and so are the goal targets. A centralized solution to this problem was introduced in (Turpin et al., 2014), which we refer to as the CAPT solution. This solution gives the agent trajectories $\{Q_i \in \mathbb{R}^{2\times T}\}_{i=1}^{N}$, while the agent velocities and accelerations can be calculated as

$v_{i,n} = (p_{i,n+1} - p_{i,n})/T_s, \qquad u_{i,n}^* = (p_{i,n+1} - p_{i,n})/T_s^2.$ (101)

Note that $p_{i,n}$ is the $n$-th column of the matrix $Q_i$. We use the optimal accelerations in (101) to learn an ST-GNN parameterization that predicts the agent accelerations $\{u_{i,n}\}_{i,n}$ according to (22). The input $\mathcal{X}_m$ consists of $6M + 4$ state variables for each agent: the agent position $\{p_{i,n}\}_n$ and velocity $\{v_{i,n}\}_n$, the positions of its $M$ nearest neighbors $\{P_{i,n} \in \mathbb{R}^{M\times 2} \,|\, [P_{i,n}]_j = p_{j,n},\ j \in \mathcal{N}_{i,n}\}_n$ along with their velocities, and the positions of the $M$ nearest target goals. The CAPT accelerations and the corresponding state variables together constitute one pair in the dataset.

The dataset consists of 55000 training examples and 125 validation examples. We train an ST-GNN on the training data and optimize the mean squared loss using the ADAM algorithm with learning rate 0.0005 and decay factors $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We then keep the model with the lowest mean distance between the agents' final positions and their desired targets among 60 epochs. The other training parameters are shown in Table 2. The trained ST-GNN is then executed on a test dataset that consists of 1000 examples. The ST-GNN output for each example is the estimated accelerations, and the planned trajectories are then calculated using (95). One example was shown in Fig. 2 (Left), which depicts collision-free trajectories.

Similar to the flocking experiment, we test the trained ST-GNNs on different graph-time topologies, generated with either different neighborhood sizes or different sampling times. Figures 2 (Middle) and (Right) support the conclusion that the ST-GNN output difference increases with the distance between the underlying topologies.
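For completeness, the following sketch shows how the CAPT supervision in (101) and the $6M + 4$ input features per agent can be assembled. The feature ordering and the helper names are assumptions for illustration; (101) is reproduced exactly as stated above.

```python
import numpy as np

def capt_supervision(Q, Ts=0.1):
    # Finite-difference velocities and accelerations from a CAPT trajectory, cf. (101).
    # Q: (2, T) trajectory of one agent; p_{i,n} is the n-th column of Q.
    v = (Q[:, 1:] - Q[:, :-1]) / Ts      # v_{i,n}  = (p_{i,n+1} - p_{i,n}) / Ts
    u = (Q[:, 1:] - Q[:, :-1]) / Ts**2   # u*_{i,n} = (p_{i,n+1} - p_{i,n}) / Ts^2
    return v, u

def agent_features(p_i, v_i, P_nbr, V_nbr, G_nbr):
    # Assemble the 6M+4 state variables of one agent at one time step: own position and
    # velocity (4), positions and velocities of the M nearest neighbors (4M), and the
    # positions of the M nearest target goals (2M).  With M = 5 this gives F0 = 34.
    return np.concatenate([p_i, v_i, P_nbr.ravel(), V_nbr.ravel(), G_nbr.ravel()])

# Example: one agent with M = 5 neighbors and goals, matching Table 2 (F0 = 34).
M = 5
x = agent_features(np.zeros(2), np.zeros(2),
                   np.zeros((M, 2)), np.zeros((M, 2)), np.zeros((M, 2)))
v, u = capt_supervision(np.zeros((2, 30)), Ts=0.1)
print(x.size, v.shape, u.shape)  # 34 (2, 29) (2, 29)
```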