Scalable Spatiotemporal Graph Neural Networks

Andrea Cini*1, Ivan Marisca*1, Filippo Maria Bianchi2,3, Cesare Alippi1,4
1 The Swiss AI Lab IDSIA, Università della Svizzera italiana
2 UiT the Arctic University of Norway
3 NORCE Norwegian Research Centre
4 Politecnico di Milano
{andrea.cini, ivan.marisca}@usi.ch, filippo.m.bianchi@uit.no, cesare.alippi@usi.ch

Abstract

Neural forecasting of spatiotemporal time series drives both research and industrial innovation in several relevant application domains. Graph neural networks (GNNs) are often the core component of the forecasting architecture. However, in most spatiotemporal GNNs, the computational complexity scales up to a quadratic factor with the length of the sequence times the number of links in the graph, hence hindering the application of these models to large graphs and long temporal sequences. While methods to improve scalability have been proposed in the context of static graphs, few research efforts have been devoted to the spatiotemporal case. To fill this gap, we propose a scalable architecture that exploits an efficient encoding of both temporal and spatial dynamics. In particular, we use a randomized recurrent neural network to embed the history of the input time series into high-dimensional state representations encompassing multi-scale temporal dynamics. Such representations are then propagated along the spatial dimension using different powers of the graph adjacency matrix to generate node embeddings characterized by a rich pool of spatiotemporal features. The resulting node embeddings can be efficiently pre-computed in an unsupervised manner, before being fed to a feed-forward decoder that learns to map the multi-scale spatiotemporal representations to predictions. The training procedure can then be parallelized node-wise by sampling the node embeddings without breaking any dependency, thus enabling scalability to large networks. Empirical results on relevant datasets show that our approach achieves results competitive with the state of the art, while dramatically reducing the computational burden.

*These authors contributed equally. Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Overview of the forecasting framework. Light-grey boxes denote training-free components. At first, an Echo-State Network (ESN) with shared parameters across nodes encodes multi-scale temporal dynamics. Then, $K$ graph shift operators are used to propagate spatial information. The resulting $K+1$ representations are concatenated and fed to an MLP to predict the next $H$ node observations.

1 Introduction

As graph neural networks (GNNs; Scarselli et al. 2008; Bacciu et al. 2020) are gaining more traction in many application fields, the need for architectures scalable to large graphs, such as those associated with large sensor networks, is becoming a pressing issue. While research to improve the scalability of models for static graph signals has been very prolific (Hamilton, Ying, and Leskovec 2017; Chiang et al. 2019; Zeng et al. 2019; Frasca et al. 2020), little attention has been paid to the additional challenges encountered when dealing with discrete-time dynamical graphs, i.e., spatiotemporal time series. Several of the existing scalable training
techniques rely on subsampling graphs to reduce the computational requirements of the training procedure, e.g., (Hamilton, Ying, and Leskovec 2017; Zeng et al. 2019). However, sampling the node-level observations as if they were i.i.d. can break relational (spatial) dependencies in static graphs, and it is even more problematic in the dynamic case, as dependencies occur also across the temporal dimension. Indeed, complex temporal and spatial dynamics that emerge from the interactions across the whole graph over a long time horizon can be easily disrupted by perturbing such spatiotemporal structure with subsampling. As an alternative, precomputing aggregated features over the graph allows for factoring out spatial propagation from the training phase in certain architectures (Frasca et al. 2020). However, similarly to the subsampling approach, extending this method to the spatiotemporal case is not trivial, as the preprocessing step must account also for the temporal dependencies besides the graph topology.

In this paper, we propose a novel scalable encoder-decoder architecture for processing spatiotemporal data; Fig. 1 shows a high-level schematic of the proposed approach. The spatiotemporal encoding scheme is training-free: first, it exploits a deep randomized recurrent neural network (Jaeger 2001; Gallicchio, Micheli, and Pedrelli 2017) to encode the history of each sequence in a high-dimensional vector embedding; then, it uses powers of the graph adjacency matrix to build informative node representations of the spatiotemporal dynamics at different scales. According to the downstream task at hand, the decoder maps the node representations into the desired output, e.g., the future values of the time series associated with each node. To improve efficiency, we exploit the structure of the extracted embedding to design the decoder to act as a collection of filters localized at different spatiotemporal scales.

Since the spatiotemporal encoder requires neither training nor supervision, the representation of each node and time step can be computed in a preprocessing stage, without the constraints that come from online training on GPUs with limited memory. The decoder is the only component of the architecture with trainable parameters. However, since spatiotemporal relationships are already embedded in the representations, the embeddings can be processed independently from their spatiotemporal context, with two consequent advantages. First, training can be done node-wise, allowing for sampling node representations in mini-batches of a size proportional to the hardware capacity. Second, the decoder can be implemented similarly to a standard multilayer perceptron (MLP) readout, which is fast and easy to train. Let $T$ and $E$ be the number of time steps and the number of edges in the input graph, respectively. Training a standard spatiotemporal GNN on a mini-batch of data has a computational and memory cost that scales as $O(TE)$, or $O(T^2 E)$ in attention-based architectures (Wu et al. 2022). Conversely, in our approach mini-batches can be sampled disregarding the length of the sequence and the size of the graph, thus making the training cost constant, i.e., $O(1)$, w.r.t. the spatiotemporal dimensions of the problem. Our contributions can be summarized as follows.
- We propose a general scalable deep learning framework for spatiotemporal time series, which exploits a novel encoding method based on randomized recurrent components and scalable GNN architectures.
- We apply the proposed model to forecast multivariate time series whose channels are subject to spatial relationships described by a graph.
- We carry out a rigorous and extensive empirical evaluation of the proposed architecture and variations thereof. Notably, we introduce two benchmarks for scalable spatiotemporal forecasting architectures.

Empirical results show that our approach performs on par with the state of the art while being easy to implement, computationally efficient, and extremely scalable. Given these considerations, we refer to our architecture as the Scalable Graph Predictor (SGP).

2 Preliminaries and Problem Definition

We consider discrete-time spatiotemporal graphs. In particular, given $N$ interlinked sensors, we indicate with $x_t^i \in \mathbb{R}^{d_x}$ the $d_x$-dimensional multivariate observation associated with the $i$-th sensor at time step $t$, with $X_t \in \mathbb{R}^{N \times d_x}$ the node attribute matrix encompassing measurements graph-wise, and with $X_{t:t+T}$ the sequence of $T$ measurements collected in the time interval $[t, t+T)$ at each sensor. Similarly, we indicate with $U_t \in \mathbb{R}^{N \times d_u}$ the matrix containing exogenous variables (e.g., weather information related to a monitored area) associated with each sensor at the $t$-th time step. Then, we indicate additional, optional, static node attributes as $V \in \mathbb{R}^{N \times d_v}$. The relational information is encoded in a, potentially dynamic, weighted adjacency matrix $A_t \in \mathbb{R}^{N \times N}$. We indicate with the tuple $\mathcal{G}_t = \langle X_t, U_t, V, A_t \rangle$ the graph signal at the $t$-th time step. Note that the number of sensors in a network is here considered fixed only to ease the presentation; we only require nodes to be identifiable across time steps.

The objective of spatiotemporal forecasting is to predict the next $H$ observations given a window of $W$ past measurements. In particular, we consider the family of forecasting models $F_\theta(\cdot)$ s.t.

$\hat{X}_{t:t+H} = F_\theta\left(\mathcal{G}_{t-W:t}\right),$  (1)

where $\theta$ indicates the learnable parameters of the model and $\hat{X}_{t:t+H}$ the $H$-step-ahead point forecast.

Echo-State Networks. Echo state networks (Jaeger 2001; Lukoševičius and Jaeger 2009) are a class of randomized architectures consisting of recurrent neural networks with random connections that encode the history of input signals into a high-dimensional state representation to be used as input to a (trainable) readout layer. The main idea is to feed an input signal into a high-dimensional, randomized, and non-linear reservoir, whose internal state can be used as an embedding of the input dynamics. An echo state network is governed by the following state update equation:

$h_t = \sigma\left(W_x x_t + W_h h_{t-1} + b\right),$  (2)

where $x_t$ indicates a generic input to the system, $W_x \in \mathbb{R}^{d_h \times d_x}$ and $W_h \in \mathbb{R}^{d_h \times d_h}$ are the random matrices defining the connectivity pattern of the reservoir, $b \in \mathbb{R}^{d_h}$ is a randomly initialized bias, $h_t$ indicates the reservoir state, and $\sigma$ is a nonlinear activation function (usually tanh). If the random matrices are defined properly, the reservoir will extract a rich pool of dynamics characterizing the system underlying the input time series $x_t$ and, thus, the reservoir states become informative embeddings of $x_{t-T:t}$ (Lukoševičius and Jaeger 2009).
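To make Eq. 2 concrete, the following minimal NumPy sketch implements a vanilla reservoir; the uniform initialization, the spectral-radius rescaling, and all names are illustrative assumptions rather than the exact setup used in the paper.

```python
import numpy as np

def init_reservoir(d_x, d_h, spectral_radius=0.9, input_scaling=1.0, seed=0):
    """Randomly initialize the reservoir of Eq. 2; these weights are never trained."""
    rng = np.random.default_rng(seed)
    W_x = input_scaling * rng.uniform(-1, 1, size=(d_h, d_x))
    W_h = rng.uniform(-1, 1, size=(d_h, d_h))
    # Rescale the recurrent matrix so that its spectral radius is below 1,
    # a common recipe for obtaining the echo-state property in practice.
    W_h *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_h)))
    b = rng.uniform(-1, 1, size=d_h)
    return W_x, W_h, b

def esn_states(x, W_x, W_h, b):
    """Run Eq. 2 over a sequence x of shape [T, d_x]; returns the states [T, d_h]."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # h_t = sigma(W_x x_t + W_h h_{t-1} + b)
        states.append(h)
    return np.stack(states)

# Hypothetical usage: embed a random univariate sequence of 100 steps.
W_x, W_h, b = init_reservoir(d_x=1, d_h=64)
H = esn_states(np.random.randn(100, 1), W_x, W_h, b)
```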
Thanks to the nonlinearity of the reservoir, the embeddings are commonly processed with a linear readout, optimized with a least squares procedure, to perform classification, clustering, or time series forecasting (Bianchi et al. 2020).

3 Scalable Spatiotemporal GNNs

Figure 2: Overview of the SGP encoder. Input time series are fed into a randomized network with recurrent connections and embedded into a hierarchical vector representation. A graph shift operator is used to propagate and aggregate spatial information of different orders, which is then concatenated to obtain a final embedding.

This section presents our approach to building scalable GNN architectures for time series forecasting. Our method is based on a hybrid encoder-decoder architecture. The encoder first constructs representations of the time series observed at each node by using a reservoir that accounts for dynamics at different time scales. Representations are further processed to account for spatial dynamics described by the graph structure. In particular, as shown on the right-hand side of Fig. 2, we use incremental powers of the graph adjacency matrix to propagate and aggregate information along the spatial dimension. Each power of the propagation matrix accounts for a different scale of spatial dynamics. The final embedding is then built by concatenating the representations obtained w.r.t. each propagation step, thus resulting in a rich encoding of both spatial and temporal features. The encoder does not need any training and, once computed, the embeddings can be uniformly sampled over time and space when training a nonlinear readout to perform $H$-step-ahead predictions. The straightforward choice for the decoder (i.e., readout) is to map the encodings to the outputs (i.e., predictions) by using a linear transformation or a standard MLP. However, to further enhance scalability, our decoder exploits the structure of the embedding to reduce the number of parameters and learn filters that are localized in time and space. As we will discuss in Sec. 3.2, this is done by learning separate weight matrices for each spatiotemporal scale. The following subsections describe in detail each component of the architecture.

3.1 Spatiotemporal Encoder

We consider as temporal encoders deep echo state networks (Deep ESN; Gallicchio, Micheli, and Pedrelli 2017) with leaky integrator neurons (Jaeger et al. 2007). In particular, we consider networks where the signal associated with each node is encoded by a stack of $L$ randomized recurrent layers s.t.

$h_t^{i,(0)} = x_t^i \,\|\, u_t^i,$
$\hat{h}_t^{i,(l)} = \tanh\left(W_u^{(l)} h_t^{i,(l-1)} + W_h^{(l)} h_{t-1}^{i,(l)} + b^{(l)}\right),$
$h_t^{i,(l)} = (1-\gamma_l)\, h_{t-1}^{i,(l)} + \gamma_l\, \hat{h}_t^{i,(l)}, \quad l = 1, \dots, L,$  (3)

where $\gamma_l \in (0, 1]$ is a discount factor associated with the $l$-th layer, $W_u^{(l)} \in \mathbb{R}^{d_{h_l} \times d_{h_{l-1}}}$, $W_h^{(l)} \in \mathbb{R}^{d_{h_l} \times d_{h_l}}$, and $b^{(l)} \in \mathbb{R}^{d_{h_l}}$ are randomly initialized parameters, $h_t^{i,(l)}$ indicates the hidden state of the system w.r.t. the $i$-th node at the $l$-th layer, and $\|$ indicates node-wise concatenation. As Eq. 3 shows, Deep ESNs are a hierarchical stack of reservoir layers that, e.g., by changing the discount factor at each layer, extract a rich pool of multi-scale temporal dynamics (Gallicchio, Micheli, and Pedrelli 2017); we refer to (Gallicchio, Micheli, and Pedrelli 2018) for more details on the properties and stability of Deep ESNs. Given a Deep ESN encoder, the input is represented by the concatenation of the states from each layer, i.e., we obtain node-level temporal encodings $h_t^i$ for each node $i$ and time step $t$ as

$h_t^i = h_t^{i,(0)} \,\|\, h_t^{i,(1)} \,\|\, \dots \,\|\, h_t^{i,(L)}.$  (4)

We indicate as $H_t$ the encoding for the whole graph at time $t$.
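A simplified NumPy sketch of the multi-layer leaky-integrator encoder of Eqs. 3-4 is given below; the shapes, the per-layer leak rates, and the plain uniform initialization (e.g., matrices drawn as in the previous sketch) are assumptions of this illustration, not the exact reservoir configuration used in the experiments.

```python
import numpy as np

def deep_esn_encode(x, layers, leak_rates):
    """Encode one node's sequence with a stack of leaky reservoirs (Eqs. 3-4).

    x: [T, d_x + d_u] inputs (observations concatenated with exogenous variables),
    layers: list of (W_u, W_h, b) random matrices, one triple per layer; W_u of the
            first layer maps the input, deeper layers map the previous layer's state,
    leak_rates: list of gamma_l in (0, 1], one per layer.
    Returns the temporal embedding of Eq. 4, shape [T, d_in + sum of layer sizes].
    """
    T = x.shape[0]
    prev_states = x            # layer-0 block: the input itself
    outputs = [x]
    for (W_u, W_h, b), gamma in zip(layers, leak_rates):
        h = np.zeros(W_h.shape[0])
        states = np.empty((T, W_h.shape[0]))
        for t in range(T):
            h_hat = np.tanh(W_u @ prev_states[t] + W_h @ h + b)
            h = (1.0 - gamma) * h + gamma * h_hat      # leaky integration (Eq. 3)
            states[t] = h
        outputs.append(states)
        prev_states = states                            # feed this layer to the next one
    return np.concatenate(outputs, axis=-1)             # Eq. 4: concatenate all layers
```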
The extraction of the node-level temporal embeddings is depicted on the left-hand side of Fig. 2, where, to simplify the drawing, we depict an ESN with a single layer. The next step is to propagate information along the spatial dimension. As discussed at the beginning of the section, we use powers of a graph shift operator $\tilde{A}$ to propagate and aggregate node representations at different scales. By using a notation similar to Eq. 4, we obtain spatiotemporal encodings as

$S_t^{(0)} = H_t = H_t^{(0)} \,\|\, H_t^{(1)} \,\|\, \dots \,\|\, H_t^{(L)},$
$S_t^{(k)} = \tilde{A} S_t^{(k-1)} = \tilde{A}^k H_t^{(0)} \,\|\, \tilde{A}^k H_t^{(1)} \,\|\, \dots \,\|\, \tilde{A}^k H_t^{(L)},$
$S_t = S_t^{(0)} \,\|\, S_t^{(1)} \,\|\, \dots \,\|\, S_t^{(K)},$  (5)

where $\tilde{A}$ indicates a generic graph shift operator matching the sparsity pattern of the graph adjacency matrix. In practice, by indicating with $D$ the graph degree matrix, we use $\tilde{A} = D^{-1}A$ in the case of a directed graph and the symmetrically normalized adjacency $\tilde{A} = D^{-1/2} A D^{-1/2}$ in the undirected case. Furthermore, for directed graphs we optionally increase the number of representations to $2K+1$ to account for bidirectional dynamics, i.e., we repeat the encoding process w.r.t. the transpose adjacency matrix, similarly to (Li et al. 2018). Intuitively, each propagation step $\tilde{A} S_t^{(k-1)}$ propagates and aggregates properly weighted features between nodes connected by paths of length $k$ in the graph. As shown in Eq. 5, the features corresponding to each order $k$ can be computed recursively with $K$ sparse matrix-matrix multiplications (Fig. 2). Alternatively, each matrix $\tilde{A}^k$ can be precomputed and the computation of the different blocks of matrix $S_t$ can be distributed in a parallel fashion, as suggested in Fig. 1. Even in the case of extremely large graphs, the features $S_t$ can be computed offline by exploiting distributed computing, as they do not need to be loaded into GPU memory.

3.2 Multi-Scale Decoder

The role of the decoder is to select and weight features from the pool of (possibly redundant) features extracted by the spatiotemporal encoder and to map them to the desired output. Representations $S_t$ can be fed into an MLP that performs node-wise predictions. Since the representations are large vectors, a naïve implementation of the MLP results in many parameters that hinder scalability. Therefore, we replace the first MLP layer with a more efficient implementation that exploits the structure of the embeddings. As described in Sec. 3.1, $S_t$ is the concatenation of the representations corresponding to different spatial propagation steps which, in turn, are obtained from the concatenation of multi-scale temporal features. To exploit this structure, we design the first layer of the decoder with a sparse connectivity pattern to learn representations $Z_t$ s.t.

$Z_t^{(k)} = \sigma\left(\tilde{A}^k H_t^{(0)} \Theta_k^{(0)} \,\|\, \dots \,\|\, \tilde{A}^k H_t^{(L)} \Theta_k^{(L)}\right)$  (6)
$\phantom{Z_t^{(k)}} = \sigma\left(S_t^{(k)} \, \mathrm{blockdiag}\!\left(\Theta_k^{(0)}, \dots, \Theta_k^{(L)}\right)\right),$  (7)
$Z_t = Z_t^{(0)} \,\|\, Z_t^{(1)} \,\|\, \dots \,\|\, Z_t^{(K)},$  (8)

where $\Theta_k^{(l)} \in \mathbb{R}^{d_{h_l} \times d_z}$ are the learnable parameters and $\sigma$ is an activation function. In practice, the representations $Z_t$ can be efficiently computed by exploiting grouped 1-d convolutions (e.g., see Krizhevsky, Sutskever, and Hinton 2012) to parallelize computation on GPUs. In particular, if we indicate the 1-d grouped convolution operator with $g$ groups and kernel size $r$ as $\star_{r,g}$, and the collection of the decoder parameters $\Theta_k^{(l)}$ as $\Theta$, we can compute $Z_t$ as

$Z_t = \sigma\left(\Theta \star_{1,g} S_t\right),$  (9)

with $g = L(K+1)$ in the case of undirected graphs and $g = L(2K+1)$ for the directed case.
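Putting Eqs. 5 and 9 together, the sketch below precomputes the spatial blocks with sparse matrix products and then applies the first decoder layer as a grouped 1-d convolution in PyTorch; the shapes, the random sparse graph, and the omission of the layer-0 input block (so that the group count matches $g = L(K+1)$ above) are simplifying assumptions of this illustration.

```python
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn

def spatial_blocks(H, A, K):
    """Eq. 5: propagate node embeddings with powers of a row-normalized shift operator.
    H: [N, L*d_h] temporal embeddings, A: sparse [N, N] adjacency.
    Returns S: [N, (K+1)*L*d_h], i.e., the blocks S^(0), ..., S^(K) concatenated."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    A_norm = sp.diags(1.0 / np.maximum(deg, 1e-8)) @ A       # D^{-1} A
    blocks, S_k = [H], H
    for _ in range(K):
        S_k = np.asarray(A_norm @ S_k)                       # sparse matrix-matrix product
        blocks.append(S_k)
    return np.concatenate(blocks, axis=-1)

# First decoder layer (Eq. 9): one weight matrix per (temporal scale, spatial order),
# implemented as a grouped 1-d convolution with kernel size 1 and g = L*(K+1) groups.
N, L, K, d_h, d_z = 200, 3, 2, 32, 16
groups = L * (K + 1)
decoder_layer = nn.Conv1d(in_channels=groups * d_h,
                          out_channels=groups * d_z,
                          kernel_size=1, groups=groups)

H = np.random.randn(N, L * d_h).astype(np.float32)           # stand-in ESN states
A = sp.random(N, N, density=0.05, format="csr")              # stand-in sparse graph
S = torch.from_numpy(spatial_blocks(H, A, K)).float()        # [N, groups * d_h]
Z = torch.relu(decoder_layer(S.unsqueeze(-1))).squeeze(-1)   # [N, groups * d_z], Eq. 9
```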
Besides reducing the number of parameters by a factor of $L(K+1)$, this architecture localizes the filters $\Theta_k^{(l)}$ w.r.t. the dynamics of spatial order $k$ and temporal scale $l$. In fact, as highlighted in Eqs. 6-8, the representation $Z_t$ can be seen as a concatenation of the results of $L(K+1)$ graph convolutions of different orders. Finally, the obtained representations are fed into an MLP that predicts the $H$-step-ahead observations as

$\hat{x}_{t:t+H}^i = \mathrm{MLP}\left(z_t^i, v^i\right),$  (10)

where the static node-level attributes $v^i$ can also be augmented by concatenating a set of learnable parameters (i.e., a learnable positional encoding).

Training and sampling. The main improvement introduced by the proposed approach in terms of scalability concerns the training procedure. Representations $S_t$ embed both the temporal and spatial relationships among observations over the sensor network. Consequently, each sample $s_t^i$ can be processed independently, since no further spatiotemporal information needs to be collected. This allows for training the decoder with SGD by uniformly and independently sampling mini-batches of data points $s_t^i$. This is the key property that makes the training procedure extremely scalable and drastically reduces the lower bound on the computational complexity required for training w.r.t. standard spatiotemporal GNN architectures.
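As a complement to this training procedure, the sketch below trains an MLP readout by sampling precomputed node-level embeddings i.i.d.; the tensor shapes, the plain two-layer readout (used here in place of the grouped first layer described above), and the optimizer settings are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the precomputed embeddings s_t^i and targets x_{t:t+H}^i, flattened
# over nodes and time steps: every row can be treated as an independent sample.
num_samples, d_s, H = 50_000, 256, 12
S = torch.randn(num_samples, d_s)
Y = torch.randn(num_samples, H)

readout = nn.Sequential(                      # simple MLP decoder (cf. Eq. 10)
    nn.Linear(d_s, 128), nn.ReLU(), nn.Linear(128, H)
)
opt = torch.optim.Adam(readout.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(S, Y), batch_size=4096, shuffle=True)

for epoch in range(5):
    for s_batch, y_batch in loader:           # i.i.d. node/time samples: no graph on the GPU
        loss = nn.functional.l1_loss(readout(s_batch), y_batch)   # MAE objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```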
4 Related Works

Spatiotemporal GNNs are essentially based on the idea of integrating message-passing modules into architectures for processing sequential data. Notably, Seo et al. (2018) and Li et al. (2018) use message passing to implement the gates of recurrent neural networks. Yu, Yin, and Zhu (2018) and Wu et al. (2019, 2020) proposed architectures alternating temporal and spatial convolutions. Wu et al. (2022) and Marisca, Cini, and Alippi (2022), instead, exploit the attention mechanism to propagate information along both time and space. Modern architectures often combine some type of relational inductive bias with full Transformer-like attention (Vaswani et al. 2017) along the spatial dimension (Zheng et al. 2020; Oreshkin et al. 2021; Satorras, Rangapuram, and Januschowski 2022), which, however, makes the computation scale quadratically with the number of nodes. SGP falls within the category of time-then-graph models, i.e., models where the temporal information is encoded before being propagated along the spatial dimension. Gao and Ribeiro (2022) showed that such models can be more expressive than architectures that alternate temporal and spatial processing steps.

Research on scalable models for discrete-time dynamic graphs has been relatively limited. Practitioners have mostly relied on methods developed in the context of static graphs, which include node-centric, GraphSAGE-like approaches (Hamilton, Ying, and Leskovec 2017) or subgraph sampling methods, such as Cluster-GCN (Chiang et al. 2019) or GraphSAINT (Zeng et al. 2019). Wu et al. (2020); Gandhi et al. (2021); Wu et al. (2021) are examples of such approaches. Among scalable GNNs for static graphs, SIGN (Frasca et al. 2020) is the approach most related to our method. Like in our approach, SIGN performs spatial propagation as a preprocessing step by using different shift operators to aggregate across different graph neighborhoods, which are then fed to an MLP. However, SIGN is limited to static graphs and propagates raw node-level attributes. Finally, similar to our work, DynGESN (Micheli and Tortorella 2022) processes dynamical graphs with a recurrent randomized architecture. However, the architecture in DynGESN is completely randomized, while ours is hybrid, as it combines randomized components in the encoder with trainable parameters in the decoder.

5 Empirical Evaluation

We empirically evaluate our approach in two different scenarios. In the first one, we compare the performance of our forecasting architecture against state-of-the-art methods on popular, medium-scale traffic forecasting benchmarks. In the second one, we evaluate the scalability of the proposed method on large-scale spatiotemporal time series datasets by considering two novel benchmarks for load forecasting and PV production prediction. Further details on datasets, baselines, and experimental settings are reported in the supplemental material (see https://arxiv.org/abs/2209.06520). We provide an efficient open-source implementation of SGP together with the code to reproduce all the experiments (https://github.com/Graph-Machine-Learning-Group/sgp).

Table 1: Results on benchmark traffic datasets (averaged over 3 independent runs). We report MAE, MSE, and MAPE averaged over a one-hour (12 steps) forecasting horizon, together with MAE at the 15, 30, and 60 minutes horizons. Bold numbers are within a standard deviation from the best average.

METR-LA

| Model | MAE (15 m) | MAE (30 m) | MAE (60 m) | MAE (avg) | MSE (avg) | MAPE (avg) |
|---|---|---|---|---|---|---|
| LSTM | 2.99±.00 | 3.58±.00 | 4.43±.01 | 3.58±.00 | 53.01±.13 | 1.19±.05 |
| FC-LSTM | 3.33±.01 | 3.43±.01 | 3.67±.01 | 3.46±.01 | 44.85±.12 | 1.15±.09 |
| DynGESN | 3.27±.00 | 3.99±.00 | 5.00±.00 | 3.98±.00 | 5.30±.07 | 11.11±.01 |
| DCRNN | 2.82±.00 | 3.23±.01 | 3.74±.01 | 3.20±.00 | 41.57±.22 | 8.88±.05 |
| GWNet | 2.72±.01 | 3.10±.02 | 3.54±.03 | 3.06±.02 | 38.22±.32 | 8.40±.03 |
| FC-GGN | 2.72±.01 | 3.05±.01 | 3.44±.01 | 3.01±.00 | 37.48±.32 | 8.27±.00 |
| UG-GGN | 2.72±.00 | 3.10±.00 | 3.54±.01 | 3.06±.00 | 38.82±.08 | 8.40±.04 |
| SGP | 2.69±.00 | 3.05±.00 | 3.45±.00 | 3.00±.00 | 36.70±.10 | 8.27±.02 |
| Ablations | | | | | | |
| No-Graph | 2.84±.00 | 3.26±.00 | 3.74±.00 | 3.22±.00 | 44.55±.05 | 9.20±.01 |
| FC-Dec. | 2.76±.01 | 3.13±.01 | 3.52±.02 | 3.08±.01 | 37.92±.35 | 8.63±.11 |
| GC-Dec. | 2.77±.00 | 3.17±.00 | 3.63±.00 | 3.12±.00 | 40.67±.06 | 8.74±.01 |

PEMS-BAY

| Model | MAE (15 m) | MAE (30 m) | MAE (60 m) | MAE (avg) | MSE (avg) | MAPE (avg) |
|---|---|---|---|---|---|---|
| LSTM | 1.39±.00 | 1.83±.01 | 2.35±.01 | 1.79±.00 | 17.72±.08 | 4.16±.05 |
| FC-LSTM | 2.22±.01 | 2.25±.01 | 2.34±.02 | 2.26±.01 | 22.31±.27 | 5.33±.04 |
| DynGESN | 1.57±.00 | 2.13±.01 | 2.81±.02 | 2.09±.01 | 18.43±.07 | 4.74±.01 |
| DCRNN | 1.36±.00 | 1.71±.00 | 2.08±.01 | 1.66±.00 | 14.40±.15 | 3.76±.01 |
| GWNet | 1.31±.00 | 1.64±.01 | 1.94±.01 | 1.58±.00 | 13.12±.14 | 3.58±.02 |
| FC-GGN | 1.32±.00 | 1.63±.01 | 1.89±.01 | 1.56±.01 | 12.96±.11 | 3.51±.03 |
| UG-GGN | 1.33±.00 | 1.67±.01 | 1.99±.01 | 1.61±.01 | 13.76±.14 | 3.59±.03 |
| SGP | 1.30±.00 | 1.60±.00 | 1.88±.00 | 1.54±.00 | 12.43±.03 | 3.44±.01 |
| Ablations | | | | | | |
| No-Graph | 1.34±.00 | 1.68±.00 | 2.02±.00 | 1.62±.00 | 14.14±.06 | 3.67±.01 |
| FC-Dec. | 1.35±.01 | 1.67±.01 | 1.96±.01 | 1.61±.01 | 13.04±.23 | 3.61±.04 |
| GC-Dec. | 1.32±.00 | 1.65±.00 | 1.97±.00 | 1.59±.00 | 13.47±.08 | 3.60±.01 |

Datasets. In the first experiment we consider the METR-LA and PEMS-BAY datasets (Li et al. 2018), which are popular medium-sized benchmarks used in the spatiotemporal forecasting literature. In particular, METR-LA consists of traffic speed measurements taken every 5 minutes by 207 detectors on Los Angeles County highways, while PEMS-BAY includes analogous observations recorded by 325 sensors in the San Francisco Bay Area. We use the same preprocessing steps of previous works to extract a graph and obtain train, validation, and test data splits (Wu et al. 2019). For the second experiment, we introduce two larger-scale datasets derived from energy analytics data. The first dataset contains
data coming from the Irish Commission for Energy Regulation Smart Metering Project (CER-En; Commission for Energy Regulation 2016), which has been previously used for benchmarking spatiotemporal imputation methods (Cini, Marisca, and Alippi 2022); however, differently from previous works, we consider the full sensor network, consisting of 6435 smart meters measuring energy consumption every 30 minutes at both residential and commercial/industrial premises. The second large-scale dataset is obtained from the synthetic PV-US dataset (Hummon et al. 2012; https://www.nrel.gov/grid/solar-power-data.html), consisting of the energy production simulated for 5016 PV farms scattered over the United States given historic weather data for the year 2006, aggregated into half-hour intervals. Since the model does not have access to weather information, PV production at neighboring farms is instrumental for obtaining good predictions. Notably, the CER-En and PV-US datasets are at least an order of magnitude larger than the datasets typically used for benchmarking spatiotemporal time series forecasting models. Note that for both PV-US and CER-En the (weighted) adjacency matrix is obtained by applying a thresholded Gaussian kernel to a similarity matrix computed, respectively, from the geographic distance among the sensors and from the correntropy (Liu, Pokharel, and Principe 2007) among the time series. We provide further details on the datasets in the supplemental material.

Table 3: Additional information on the considered datasets.

| Dataset | # steps | # nodes | # edges | Sparsity |
|---|---|---|---|---|
| METR-LA | 34272 | 207 | 1515 | 3.54% |
| PEMS-BAY | 52116 | 325 | 2369 | 2.24% |
| PV-US (100nn) | 8868 | 5016 | 417,199 | 1.66% |
| CER-En (100nn) | 8868 | 6435 | 639,369 | 1.54% |
| PV-US | 8868 | 5016 | 3,710,008 | 14.75% |
| CER-En | 8868 | 6435 | 3,186,369 | 7.69% |
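As a rough illustration of the graph construction described above, the sketch below builds a weighted adjacency matrix by applying a thresholded Gaussian kernel to a matrix of pairwise distances (or dissimilarities); the bandwidth heuristic, the threshold value, and the function names are assumptions of this example, not the exact procedure used for PV-US and CER-En.

```python
import numpy as np

def gaussian_kernel_graph(dist, theta=None, threshold=0.1):
    """Weighted adjacency from a pairwise distance/dissimilarity matrix.

    dist: [N, N] symmetric matrix (e.g., geographic distances for PV-US, or a
          dissimilarity derived from correntropy for CER-En).
    theta: kernel bandwidth; a common heuristic is the standard deviation of dist.
    threshold: weights below this value are zeroed out to sparsify the graph.
    """
    theta = np.std(dist) if theta is None else theta
    A = np.exp(-np.square(dist) / (2.0 * theta ** 2))   # Gaussian kernel weights
    A[A < threshold] = 0.0                               # thresholding -> sparse graph
    np.fill_diagonal(A, 0.0)                             # no self-loops
    return A

# Hypothetical usage with random 2-d sensor coordinates.
coords = np.random.rand(50, 2)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
A = gaussian_kernel_graph(dist)
```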
Baselines. We consider the following baselines:
1. LSTM: a single standard gated recurrent neural network (Hochreiter and Schmidhuber 1997) trained by sampling windows of observations from each node-level time series, disregarding the spatial information;
2. FC-LSTM: an LSTM processing the input sequences as if they were a single high-dimensional multivariate time series;
3. DCRNN: a recurrent graph network presented in (Li et al. 2018); differently from the original model, we use a recurrent encoder followed by a linear readout (more details in the appendix);
4. Graph WaveNet (GWNet): a residual network that alternates temporal and graph convolutions over both the graph given as input and an adjacency matrix learned by the model (Wu et al. 2019);
5. Gated GN (GGN): a state-of-the-art time-then-graph (Gao and Ribeiro 2022) model introduced in (Satorras, Rangapuram, and Januschowski 2022), for which we consider two different configurations: the first one (FC) uses attention over the full node set to perform spatial propagation, while the second one (UG) constrains the attention to the edges of the underlying graph;
6. DynGESN: the echo state network for dynamical graphs proposed in (Micheli and Tortorella 2022).

For all the baselines, we use, whenever possible, the configuration found in the original papers or in their open-source implementations; in all the other cases, we tune the hyperparameters on the holdout validation set.

Experimental setup. For the traffic datasets, we replicate the setup used in previous works. In particular, each model is trained to predict the 12-step-ahead observations. In SGP, the input time series are first encoded by the spatiotemporal encoder, and then the decoder is trained by sampling mini-batches along the temporal dimension, i.e., by sampling $B$ sequences $\mathcal{G}_{t-W:t}$ of observations. For the large-scale datasets, we focus on assessing the scalability of the different architectures rather than maximizing forecasting accuracy. In particular, for both datasets, we consider the first 6 months of data (4 months for training, 1 month for validation, and 1 month for testing). The models are trained to predict the next {00:30, 07:30, 11:00} hours. We repeat the experiment in two different settings to test the scalability of the different architectures w.r.t. the number of edges. In the first setting, we extract the graph by sparsifying the graph adjacency matrix, imposing a maximum of 100 neighbors for each node, while, in the second case, we do not constrain the density of the adjacency matrix. Tab. 3 reports some details of the considered benchmarks. To assess the performance in terms of scalability, we fix a maximum GPU memory budget of 12 GB and select the batch size accordingly; if a batch size of 1 does not fit in 12 GB, we uniformly subsample the edges of the graph to reduce the memory consumption. Differently from the other baselines, in SGP we first preprocess the data to obtain the spatiotemporal embeddings and then train the decoder by uniformly sampling the node representations. We train each model for 1 hour, then restore the weights corresponding to the minimum training error and evaluate the forecasts on the test set. The choice of not running validation at each epoch was dictated by the fact that, for some of the baselines, running a validation epoch would take a large portion of the 1-hour budget. The time required to encode the datasets with SGP's encoder ranges from tens of seconds to 4 minutes on an AMD EPYC 7513 processor with 32 parallel processes. To ensure reproducibility, the time constraint is not imposed as a hard timeout; instead, we measure the time required for the update step of each model on an NVIDIA RTX A5000 GPU and fix the maximum number of updates accordingly.
For SGP, the time required to compute the node embeddings was considered as part of the training time, and the number of updates was appropriately reduced to make the comparison fair. For all the baselines, we keep the same architecture used in the traffic experiment. For SGP, we use the same hyperparameters for the decoder, but we reduce the dimension of the embedding (the value of $K$) so that a preprocessed dataset can fit in a maximum of 80 GB of storage. To account for the different temporal scales, we increase the window size for all baselines and increase the number of layers in the ESN (while keeping the final size of $H_t$ similar). Additional details are provided in the supplementary material.

5.1 Results

Results for the traffic benchmarks are reported in Tab. 1, while the outcomes of the scalability experiments are shown in Tab. 2. We consider mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE) as evaluation metrics.

Traffic experiment. The purpose of the first experiment is to demonstrate that the proposed method achieves performance comparable to that of the state of the art. In this regard, the results in Tab. 1 show that in all the considered scenarios SGP is always among the best-performing forecasting architectures. The full-attention baseline is the strongest competitor; however, it has time and memory complexities that scale quadratically with the number of nodes. Regarding the other baselines, DCRNN underperforms compared to the other spatiotemporal GNN architectures. DynGESN, the fully randomized architecture, despite being very fast to train, obtains reasonable performance in short-range predictions but falls short over longer forecasting horizons in the considered scenarios. In light of these results, it is worth commenting on the efficiency of SGP compared to the baselines. Approaches like DCRNN and Graph WaveNet perform graph convolutions whose time and space complexity is $O(LTE)$, with $E$ the number of edges, $L$ the number of layers (8 in Graph WaveNet), and $T$ the number of time steps. Such complexity is completely amortized by the preprocessing step in our architecture. Similarly, Gated GN, while being architecturally much simpler, propagates spatial information by relying on the attention mechanism, which is known to scale poorly with the dimensionality of the problem. The next experimental setting shines a light on these shortcomings. The bottom of Tab. 1 reports results for the ablation of key elements of the proposed architecture: No-Graph indicates that the embeddings are built without the spatial propagation step; FC-Dec. considers the case where the structure of the embedding is ignored in the readout and the sparse weight matrix in Eq. 7 is replaced by a fully-connected one; GC-Dec. indicates that the spatial propagation is limited to the neighbors of order $K=1$ and, thus, the decoder behaves similarly to a single-layer graph convolutional network. Results clearly show the optimality of the proposed architectural design.

Table 2: Results on large-scale datasets (averaged over at least 3 independent runs). We report MAE over the $H$-step-ahead predictions, $H = \{30\text{m}, 7\text{h}30\text{m}, 11\text{h}\}$, together with timings and memory consumption. The first two blocks refer to the sparsified (100-neighbor) graphs, the last two to the full graphs. For some configurations, edge subsampling was needed to comply with the memory constraints. Bold numbers are within a standard deviation from the best average.

PV-US (100nn)

| Model | MAE (30 m) | MAE (7 h 30 m) | MAE (11 h) | Batch/s | Memory | Batch size |
|---|---|---|---|---|---|---|
| DCRNN | 1.39±.09 | 3.34±.22 | 3.54±.48 | 2.04±.01 | 9.63 GB | 2 |
| GWNet | 1.45±.13 | 5.09±.63 | 5.26±1.34 | 2.01±.02 | 11.64 GB | 2 |
| UG-GGN | 1.33±.08 | 2.94±.05 | 3.12±.14 | 8.41±.09 | 11.46 GB | 5 |
| SGP | 1.09±.01 | 3.14±.21 | 3.16±.19 | 116.58±8.74 | 2.21 GB | 4096 |

CER-En (100nn)

| Model | MAE (30 m) | MAE (7 h 30 m) | MAE (11 h) | Batch/s | Memory | Batch size |
|---|---|---|---|---|---|---|
| DCRNN | 0.22±.00 | 0.28±.00 | 0.29±.00 | 1.43±.02 | 11.10 GB | 2 |
| GWNet | 0.23±.00 | 0.36±.01 | 0.36±.01 | 2.41±.03 | 8.39 GB | 1 |
| UG-GGN | 0.22±.00 | 0.28±.00 | 0.28±.00 | 8.21±.08 | 11.70 GB | 4 |
| SGP | 0.21±.00 | 0.30±.00 | 0.31±.01 | 117.32±8.36 | 2.21 GB | 4096 |

PV-US (full graph)

| Model | MAE (30 m) | MAE (7 h 30 m) | MAE (11 h) | Batch/s | Memory | Batch size |
|---|---|---|---|---|---|---|
| DCRNN | 1.59±.17 | 4.10±.27 | 4.93±.60 | 1.37±.00 | 11.59 GB | 1 |
| GWNet | 1.65±.23 | 6.93±.58 | 7.93±.17 | .77±.00 | 11.35 GB | 2 |
| UG-GGN | 1.61±.06 | 3.25±.04 | 3.04±.05 | 8.83±.10 | 11.14 GB | 1 |
| SGP | 1.09±.00 | 3.06±.11 | 3.13±.13 | 118.64±8.35 | 2.21 GB | 4096 |

CER-En (full graph)

| Model | MAE (30 m) | MAE (7 h 30 m) | MAE (11 h) | Batch/s | Memory | Batch size |
|---|---|---|---|---|---|---|
| DCRNN | 0.23±.00 | 0.29±.00 | 0.29±.00 | 1.13±.01 | 11.10 GB | 1 |
| GWNet | 0.25±.01 | 0.38±.03 | 0.37±.01 | 1.26±.01 | 8.58 GB | 1 |
| UG-GGN | 0.22±.00 | 0.28±.00 | 0.29±.00 | 8.77±.10 | 11.14 GB | 1 |
| SGP | 0.21±.00 | 0.30±.00 | 0.31±.01 | 115.85±1.60 | 2.21 GB | 4096 |

Figure 3: Training curves on PV-US. The plot shows the average ± the standard deviation of 3 independent runs. The plotted curves are smoothed with a running average of 8 steps.

Large-scale experiment. Tab. 2 reports the results of the scalability experiment, where we considered only the spatiotemporal GNNs trained by gradient descent.
We excluded the full-attention baseline (FC-Gated GN), as its $O(N^2)$ complexity prevented scaling to the larger datasets; however, we considered the UG version, where attention is restricted to each node's neighborhood. There are several comments that need to be made here. First of all, batch size has a different meaning for our model and for the other baselines. In our case, each sample corresponds to a single spatiotemporal (preprocessed) observation; for the other methods, a sample corresponds to a window of observations $\mathcal{G}_{t-W:t}$, where the edges of the graph are subsampled whenever the memory constraints could not be met otherwise. In both cases, the loss is computed w.r.t. all the observations in the batch. The results clearly show that SGP can be trained efficiently also in resource-constrained settings, with contained GPU memory usage. In particular, the update frequency (batches per second) is up to two orders of magnitude higher. Notably, resource utilization at training time remains constant (by construction) in the two considered scenarios, while almost all the baselines require edge subsampling in order to meet the resource constraints. Fig. 3 shows the learning curves for the PV-US dataset, further highlighting the vastly superior efficiency, scalability, and learning stability of SGP. Finally, the results concerning forecasting accuracy show that performance is competitive with the state of the art in all the considered scenarios.

6 Remarks and Conclusion

We proposed SGP, a scalable architecture for graph-based spatiotemporal time series forecasting. Our approach competes with the state of the art on popular medium-sized benchmark datasets, while greatly improving scalability to large sensor networks. While in SGP sampling largely reduces GPU memory usage compared to the other methods, the entire processed sequence can take up a large portion of system memory, depending on the size of the reservoir. Nevertheless, the preprocessing can be distributed, and the preprocessed data can be stored on disk and loaded in batches during training, as is customary for large datasets. We believe that SGP constitutes an important stepping stone for future research on scalable spatiotemporal forecasting and has the potential to be widely adopted by practitioners. Future work can explore a tighter integration of the spatial and temporal encoding components and assess performance on even larger benchmarks.

Acknowledgements

This work was supported by the Swiss National Science Foundation project FNS 204061: Higher Order Relations and Dynamics in Graph Neural Networks. The authors wish to thank the Institute of Computational Science at USI for granting access to computational resources and Nvidia Corporation for the donation of two GPUs.

References

Bacciu, D.; Errica, F.; Micheli, A.; and Podda, M. 2020. A gentle introduction to deep learning for graphs. Neural Networks, 129: 203-221.
Bianchi, F. M.; Scardapane, S.; Løkse, S.; and Jenssen, R. 2020. Reservoir Computing Approaches for Representation and Classification of Multivariate Time Series. IEEE Transactions on Neural Networks and Learning Systems, 32(5): 2169-2179.
Chiang, W.-L.; Liu, X.; Si, S.; Li, Y.; Bengio, S.; and Hsieh, C.-J. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 257-266.
Cini, A.; Marisca, I.; and Alippi, C. 2022. Filling the Gaps: Multivariate Time Series Imputation by Graph Neural Networks.
In International Conference on Learning Representations.
Commission for Energy Regulation. 2016. CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009-2010 [dataset]. Irish Social Science Data Archive. SN: 001200.
Frasca, F.; Rossi, E.; Eynard, D.; Chamberlain, B.; Bronstein, M.; and Monti, F. 2020. SIGN: Scalable Inception Graph Neural Networks. In ICML 2020 Workshop on Graph Representation Learning and Beyond.
Gallicchio, C.; Micheli, A.; and Pedrelli, L. 2017. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268: 87-99.
Gallicchio, C.; Micheli, A.; and Pedrelli, L. 2018. Design of deep echo state networks. Neural Networks, 108: 33-47.
Gandhi, A.; Kaveri, S.; Chaoji, V.; et al. 2021. Spatio-Temporal Multi-graph Networks for Demand Forecasting in Online Marketplaces. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 187-203. Springer.
Gao, J.; and Ribeiro, B. 2022. On the Equivalence Between Temporal and Static Equivariant Graph Representations. In International Conference on Machine Learning, 7052-7076. PMLR.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780.
Hummon, M.; Ibanez, E.; Brinkman, G.; and Lew, D. 2012. Sub-hour solar data for power system modeling from static spatial variability analysis. Technical report, National Renewable Energy Lab. (NREL), Golden, CO (United States).
Jaeger, H. 2001. The echo state approach to analysing and training recurrent neural networks, with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34): 13.
Jaeger, H.; Lukoševičius, M.; Popovici, D.; and Siewert, U. 2007. Optimization and Applications of Echo State Networks with Leaky-Integrator Neurons. Neural Networks, 20(3): 335-352.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations.
Liu, W.; Pokharel, P. P.; and Principe, J. C. 2007. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Transactions on Signal Processing, 55(11): 5286-5298.
Lukoševičius, M.; and Jaeger, H. 2009. Reservoir Computing Approaches to Recurrent Neural Network Training. Computer Science Review, 3(3): 127-149.
Marisca, I.; Cini, A.; and Alippi, C. 2022. Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations. arXiv preprint arXiv:2205.13479.
Micheli, A.; and Tortorella, D. 2022. Discrete-time dynamic graph echo state networks. Neurocomputing, 496: 85-95.
Oreshkin, B. N.; Amini, A.; Coyle, L.; and Coates, M. 2021. FC-GAGA: Fully connected gated graph architecture for spatio-temporal traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 9233-9241.
Satorras, V. G.; Rangapuram, S. S.; and Januschowski, T. 2022. Multivariate Time Series Forecasting with Latent Graph Inference. arXiv preprint arXiv:2203.03423.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model.
IEEE Transactions on Neural Networks, 20(1): 61-80.
Seo, Y.; Defferrard, M.; Vandergheynst, P.; and Bresson, X. 2018. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, 362-373. Springer.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.
Wu, Y.; Zhuang, D.; Labbe, A.; and Sun, L. 2021. Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 4478-4485.
Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; and Zhang, C. 2020. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 753-763. New York, NY, USA: Association for Computing Machinery.
Wu, Z.; Pan, S.; Long, G.; Jiang, J.; and Zhang, C. 2019. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 1907-1913.
Wu, Z.; Zheng, D.; Pan, S.; Gan, Q.; Long, G.; and Karypis, G. 2022. TraverseNet: Unifying Space and Time in Message Passing for Traffic Forecasting. IEEE Transactions on Neural Networks and Learning Systems.
Yu, B.; Yin, H.; and Zhu, Z. 2018. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 3634-3640.
Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019. GraphSAINT: Graph Sampling Based Inductive Learning Method. In International Conference on Learning Representations.
Zheng, C.; Fan, X.; Wang, C.; and Qi, J. 2020. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 1234-1241.