
Event-Aware Multimodal Mobility Nowcasting

Zhaonan Wang1,3, Renhe Jiang1,2, Hao Xue3, Flora D. Salim3, Xuan Song1, Ryosuke Shibasaki1

1 Center for Spatial Information Science, University of Tokyo; 2 Information Technology Center, University of Tokyo; 3 School of Computing Technologies, RMIT University
{znwang, jiangrh, songxuan, shiba}@csis.u-tokyo.ac.jp, {zhaonan.wang, hao.xue, flora.salim}@rmit.edu.au

As a decisive part in the success of Mobility-as-a-Service (MaaS), spatio-temporal predictive modeling for crowd movements is a challenging task, particularly in scenarios where societal events drive mobility behavior to deviate from normality. While tremendous progress has been made in modeling high-level spatio-temporal regularities with deep learning, most, if not all, of the existing methods are neither aware of the dynamic interactions among multiple transport modes nor adaptive to the unprecedented volatility brought by potential societal events. In this paper, we are therefore motivated to improve the canonical spatio-temporal network (ST-Net) from two perspectives: (1) we design a heterogeneous mobility information network (HMIN) to explicitly represent intermodality in multimodal mobility; (2) we propose a memory-augmented dynamic filter generator (MDFG) to generate sequence-specific parameters in an on-the-fly fashion for various scenarios. The enhanced event-aware spatio-temporal network, namely EAST-Net, is evaluated on several real-world datasets with a wide variety and coverage of societal events. Both quantitative and qualitative experimental results verify the superiority of our approach compared with the state-of-the-art baselines. Code and data are published at https://github.com/underdoc-wang/EAST-Net.

Introduction

Mobility-as-a-Service (MaaS), as an emerging paradigm of transport service, seamlessly integrates multimodal mobility services (e.g. public transport, ride-hailing, bike-sharing), which streamlines trip planning and ticketing (for users), operating optimization and emergency response (for providers), and traffic management (for city managers). For a smooth operation of MaaS, spatio-temporal predictive modeling for multimodal transport of crowds is indispensable. However, the existing methods either implicitly handle the interaction between supply and demand of different modes or assume it to be time-invariant (Ye et al. 2019). This task is even more challenging in scenarios where societal events (e.g. holiday, severe weather, epidemic) take place and make collective mobility deviate significantly from normality (e.g. daily, weekday routines). Moreover, as illustrated in Figure 1, the impacts of different events differ, e.g. taxi demand rockets on New Year's Eve but vanishes at Christmas and during the blizzard, and the volatility brought to each transport mode varies, e.g. the recovery of share bike demand takes longer than that of taxi demand after the blizzard.

Figure 1: Time Series Histograms of Citywide Taxi and Share Bike Demands in Washington DC from 24 Dec. 2015 to 31 Jan. 2016, during which Christmas, New Year, and a Historic Blizzard Jonas Took Place

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Work was done while the first author was virtually visiting the CRUISE research group at RMIT University. Corresponding author.

While tremendous progress has been made in spatio-temporal modeling thanks to deep learning (Shi et al. 2015; Zhang, Zheng, and Qi 2017; Li et al. 2018; Wu et al. 2019; Zheng et al. 2020), most, if not all, of these methods advance by exploiting high-level spatio-temporal regularities. The volatility brought by societal events, on the other hand, is largely downplayed and usually handled by simple rectifications, such as incorporating temporal covariates (e.g. time-of-day, day-of-week, whether-holiday) as auxiliary input (Yao et al. 2018; Zonoozi et al. 2018) or adding a memory bank to reuse similar patterns in history (Yao et al. 2019; Tang et al. 2020). These manipulations bring time and holiday awareness to a certain degree, but they mainly help with the periodic and precedented parts and would still fail under more extreme scenarios like unprecedented events (e.g. a historic blizzard, the COVID-19 pandemic). There is another line of research (Fan et al. 2015; Jiang et al. 2018, 2019) attempting to capture anomalous mobility tendencies under events in an online fashion, based on a low-order Markov assumption and a fine-grained time slot setting. These practices arguably circumvent the inherent difficulty of the task instead of truly tackling it.

In this paper, we tackle the identified twofold unawareness of the existing spatio-temporal networks, namely intermodality-unawareness and event-unawareness, correspondingly via: (1) explicitly representing the dynamic interactions among multiple mobility modes; (2) intrinsically enhancing the event-awareness and adaptivity of predictive models for various scenarios, including unprecedented events. Specifically, we design a heterogeneous information network to build the intermodal interactions into the widely adopted spatio-temporal modeling strategy, and then leverage techniques of memory and dynamic filter networks that encourage the model to learn to distinguish and generalize to diverse scenarios. Based on the above two motivations, we propose an event-aware spatio-temporal network (EAST-Net). Our contributions are summarized as follows:

• We design a new heterogeneous mobility information network (HMIN) to explicitly represent intermodal interactions (or intermodality) for spatio-temporal multimodal mobility modeling.
• We propose a novel memory-augmented dynamic filter generator (MDFG) to produce sequence-specific parameters on-the-fly, intrinsically improving event-awareness and adaptivity of spatio-temporal networks.
• We conduct a series of experiments on four real-world event-mobility datasets, and the results validate the superiority of EAST-Net quantitatively and qualitatively.

Related Work

Here we briefly review two lines of research: the first line is on modeling of event-related human mobility, while the other involves techniques for spatio-temporal forecasting and dynamic filter generation. (1) The former can be broken down into two branches, namely event-oriented and event-driven modeling. On the one hand, human mobility data (e.g. GPS (Konishi et al. 2016), origin-destination records (Zhang, Zheng, and Yu 2018; Zhang et al. 2019), trip survey (Wang et al. 2021)) are commonly used as the underlying clues to infer both local and citywide anomalous events. On the other hand, mobility behavior is conversely affected by these societal events (Song et al. 2014; Fan et al. 2015; Jiang et al. 2019; Xie et al. 2020) and thereby deviates from normal patterns.

(2) By effectively capturing complex non-linear dependencies in space and time, deep learning-based predictive models as a group outperform classical statistical and matrix/tensor-based methods on both individual (e.g. call detail records (Feng et al. 2018), GPS trajectory (Fan et al. 2019), Point-of-Interest visits (Xue et al. 2021)) and collective mobility (e.g. crowd volumes (Zhang, Zheng, and Qi 2017), demand of multimodal transport modes (Ye et al. 2019), origin-destination trips (Wang et al. 2019; Jiang et al. 2021)) modeling. Besides, mainly studied for tasks like video prediction (Jia et al. 2016) and image classification (Yang et al. 2019; Zhou et al. 2021), dynamic filter generation broadly shares a similar idea with model-based meta-learning (e.g. memory-augmented neural networks (Santoro et al. 2016), meta networks (Munkhdalai and Yu 2017)) or hypernetworks (Ha, Dai, and Le 2016), by conditioning the parameters of a target module on another network. Moreover, some recent studies have extended the idea to continual learning (von Oswald et al. 2020), and to meta knowledge-based parameterization (Pan et al. 2019) and spatially distinct filter generation (Cirstea et al. 2021) for spatio-temporal forecasting.

Preliminaries

In this section, we first formulate the multimodal mobility nowcasting problem, then briefly revisit a standard solution for this task, namely the spatio-temporal network (ST-Net).

Problem Definition

Given a specified spatio-temporal granularity, time and space can be discretized into a set of equal-length time slots and a set of regions (not necessarily equal-area), respectively, denoted by $\mathcal{T} = \{\tau_t \mid t \in (1, \dots, T)\}$ and $\mathcal{R} = \{\eta_n \mid n \in (1, \dots, N)\}$. Considering there are in total $M$ modes of mobility, we can build a multimodal mobility tensor $\mathcal{M} \in \mathbb{R}^{T \times N \times C}$, where $C = 2M$ if modeling the supply and demand of multiple transport modes, and $C = M$ if modeling the visit volume of multiple travel purposes. Accordingly, the multimodal mobility nowcasting problem can be formulated as follows: given $\alpha$-step consecutive observations in $\mathcal{M}$, denoted by $(X_{t-\alpha+1}, \dots, X_t)$, where $X \in \mathbb{R}^{N \times C}$, return the immediate expectations for the next $\beta$ steps, i.e. $(\hat{X}_{t+1}, \dots, \hat{X}_{t+\beta})$. Note that auxiliary temporal covariates can be available from time slot $\tau_{t-\alpha+1}$ to $\tau_{t+\beta}$, denoted by $T_{cov} \in \mathbb{R}^{(\alpha+\beta) \times v}$, where $v$ is the total number of covariates. Formally:

$$
\hat{X}_{t+1}, \dots, \hat{X}_{t+\beta} = \underset{X_{t+1}, \dots, X_{t+\beta}}{\arg\max} \; \log P(X_{t+1}, \dots, X_{t+\beta} \mid X_{t-\alpha+1}, \dots, X_t; T_{cov}) \tag{1}
$$
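To make the input/output shapes of the problem definition concrete, the following is a minimal sketch of how a mobility tensor could be sliced into observation and target sequences. The helper name `make_windows` and the toy tensor are illustrative assumptions, not part of the published code.

```python
import numpy as np

def make_windows(mobility, alpha, beta):
    """Slice a (T, N, C) tensor into (observation, target) pairs.

    Each pair holds alpha consecutive observed steps and the beta steps
    that immediately follow them.
    """
    T = mobility.shape[0]
    xs, ys = [], []
    for t in range(alpha, T - beta + 1):
        xs.append(mobility[t - alpha:t])   # the alpha most recent observations
        ys.append(mobility[t:t + beta])    # the next beta steps to be nowcast
    return np.stack(xs), np.stack(ys)

# Toy tensor (hypothetical sizes): T=24 slots, N=6 regions,
# M=2 transport modes with demand and supply, hence C = 2M = 4.
T, N, M = 24, 6, 2
mobility = np.arange(T * N * 2 * M, dtype=float).reshape(T, N, 2 * M)
X, Y = make_windows(mobility, alpha=8, beta=8)
print(X.shape, Y.shape)
```

Each sample in `X` has shape (alpha, N, C) and each sample in `Y` has shape (beta, N, C), matching the sequences on the two sides of Equation (1).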

Figure 2: Comparison of Abstract Structure between the Existing ST-Net with Rectifications (a) (b) and Proposed EAST-Net (c). Panels: (a) ST-Net+Tcov; (b) ST-Net+Mem.; (c) EAST-Net

Spatio-Temporal Network

To solve the above problem, recent studies commonly exploit high-level spatio-temporal dependency in observations (Zhang, Zheng, and Qi 2017; Li et al. 2018; Ye et al. 2019; Wu et al. 2020). Particularly, convolutional and recurrent neural networks (e.g. CNN, GCN, TCN, RNN) are two typical submodules utilized to handle the underlying dependencies over space $\mathcal{R}$ and time $\mathcal{T}$, respectively. This class of models arguably shares a similar spatio-temporal view, which prioritizes the first and second dimensions in $(X_{t-\alpha+1}, \dots, X_t) \in \mathbb{R}^{\alpha \times N \times C}$. We term this modeling strategy Spatio-Temporal Graph (STG), as demonstrated in Figure 3, in which the third dimension of the observations is treated as features evolving on STG. We further term models built on top of STG Spatio-Temporal Network (ST-Net). Without loss of generality, we combine GCN and RNN to denote an ST-Net, which handles spatial dependency by a graph and temporal dependency in a recurrent form:

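$$
H = \sigma(X \star_{\mathcal{G}} \Theta) = \sigma\Big(\sum_{k=0}^{K} \tilde{P}^{k} X W_k\Big) \tag{2}
$$

$$
\begin{aligned}
u_t &= \mathrm{sigmoid}\big([X^{(l)}_t, H^{(l)}_{t-1}] \star_{\mathcal{G}} \Theta_u + b_u\big) \\
r_t &= \mathrm{sigmoid}\big([X^{(l)}_t, H^{(l)}_{t-1}] \star_{\mathcal{G}} \Theta_r + b_r\big) \\
C_t &= \tanh\big([X^{(l)}_t, (r_t \odot H^{(l)}_{t-1})] \star_{\mathcal{G}} \Theta_C + b_C\big) \\
H^{(l)}_t &= u_t \odot H^{(l)}_{t-1} + (1 - u_t) \odot C_t
\end{aligned}
\tag{3}
$$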

Equation (2) defines the basic graph convolution operation $\star_{\mathcal{G}}$, which takes input $X \in \mathbb{R}^{N \times p}$ and returns $H \in \mathbb{R}^{N \times q}$ given a graph topology matrix $P \in \mathbb{R}^{N \times N}$ ($\tilde{P}$ is its normalized form), approximation order $K$, and trainable parameters $\Theta \in \mathbb{R}^{(K+1) \times p \times q}$, whose $k$-th slice is $W_k$. Equation (3) defines an extended version of GRU (a form of RNN), namely GCRU, with matrix multiplications replaced by graph convolutions (Equation (2)). It is noteworthy that DCGRU (Li et al. 2018) can be seen as a special form of GCRU obtained by restricting $\tilde{P}$ to be a random walk normalized transition matrix and performing bidirectional graph diffusion. Then, stacking multiple layers (denoted by $l$) of GCRU forms the encoder and decoder of an ST-Net, abbreviated as ST-Enc/ST-Dec in Figure 2.

Besides, as illustrated in Figure 2a, temporal covariates can be used as auxiliary input (Yao et al. 2018; Zonoozi et al. 2018) to equip ST-Net with time and holiday awareness. In this case, $X^{(0)}_t = [X_t, \tilde{T}_t]$, where $[\cdot, \cdot]$ denotes a concatenation operation and $\tilde{T}_t$ is the linearly projected representation of $T_{cov}$ at $\tau_t$.

Another rectification for ST-Net (demonstrated in Figure 2b) attaches an external memory bank (Yao et al. 2019; Tang et al. 2020) to the decoder such that some typical spatio-temporal patterns can be stored for reuse. This memory is implemented by a parameter matrix $\mathbf{M} \in \mathbb{R}^{m \times D}$, where $m$ and $D$ denote the total number of memory records and the dimension of each one. Before making the final output, the decoder makes a query to $\mathbf{M}$ to find similar representations, which is implemented by an attention mechanism (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017). Formally,

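$$
\begin{aligned}
Q_t &= H^{(l)}_t W_Q + b_Q \\
\phi_j &= \frac{e^{Q_t \cdot \mathbf{M}[j]}}{\sum_{j'=1}^{m} e^{Q_t \cdot \mathbf{M}[j']}} \\
V &= \Big(\sum_{j=1}^{m} \phi_j \mathbf{M}[j]\Big) W_V + b_V
\end{aligned}
\tag{4}
$$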

where $Q_t \in \mathbb{R}^{D}$ denotes the query vector projected from the flattened $H^{(l)}_t$; $\phi_j$ is the attention score corresponding to the $j$-th memory record. The obtained vector $V$ can be reshaped back and concatenated with $H^{(l)}_t$ for output, $H^{(out)}_t = [H^{(l)}_t, V]$.
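The graph convolution of Equation (2) and one recurrent step of the GCRU in Equation (3) can be sketched in NumPy as follows. This is an illustrative reading of the equations, not the authors' PyTorch implementation: the row-normalized random topology, the parameter dictionary layout, and all sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_conv(X, P_norm, Theta):
    """Equation (2) without the outer activation: sum_k P_norm^k X W_k.

    X: (N, p) node features, P_norm: (N, N) normalized topology,
    Theta: (K+1, p, q) stacked weights W_0, ..., W_K.
    """
    out = np.zeros((X.shape[0], Theta.shape[2]))
    Pk_X = X                          # P_norm^0 X
    for W_k in Theta:
        out += Pk_X @ W_k
        Pk_X = P_norm @ Pk_X          # raise the power of P_norm by one
    return out

def gcru_step(X_t, H_prev, P_norm, params):
    """One GCRU step following Equation (3)."""
    XH = np.concatenate([X_t, H_prev], axis=1)
    u = sigmoid(graph_conv(XH, P_norm, params["Theta_u"]) + params["b_u"])
    r = sigmoid(graph_conv(XH, P_norm, params["Theta_r"]) + params["b_r"])
    XrH = np.concatenate([X_t, r * H_prev], axis=1)
    C = np.tanh(graph_conv(XrH, P_norm, params["Theta_C"]) + params["b_C"])
    return u * H_prev + (1.0 - u) * C

rng = np.random.default_rng(0)
N, C_in, q, K = 5, 4, 8, 3
A = rng.random((N, N))
P_norm = A / A.sum(axis=1, keepdims=True)    # row-normalized toy topology (assumption)
params = {f"Theta_{g}": rng.normal(scale=0.1, size=(K + 1, C_in + q, q)) for g in "urC"}
params.update({f"b_{g}": np.zeros(q) for g in "urC"})

H = np.zeros((N, q))
for X_t in rng.random((8, N, C_in)):         # alpha = 8 observed steps
    H = gcru_step(X_t, H, P_norm, params)
print(H.shape)
```

Because the new state is a gated mix of the previous state and a tanh candidate, the hidden state stays within [-1, 1] when initialized at zero.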

Methodology

In this section, we elaborate on the motivations and techniques for improving ST-Net, and present the Event-Aware Spatio-Temporal Network (EAST-Net) as a more adaptive framework for multimodal mobility nowcasting.

Heterogeneous Mobility Information Network

As presented in Figure 3, STG, the foundation of ST-Net, prioritizes spatio-temporal modeling while restricting all features (i.e. mobility modes) to evolve together on the fixed STG. We argue that this spatio-temporal view restricts the modeling of dynamic interactions among different modes of mobility, which is in fact the operating mechanism of MaaS. As demonstrated in Figure 1, a societal event may impact different transport modes differently, which confirms the necessity of intermodality modeling. Thus, we are motivated to design a new underlying structure, i.e. the Heterogeneous Mobility Information Network (HMIN), to jointly represent intermodal interaction and spatio-temporal dependency.

Figure 3: Comparison between Spatio-Temporal Graph (STG) and Heterogeneous Mobility Information Network (HMIN) for Multimodal Mobility Modeling

As illustrated in Figure 3, HMIN is defined as $\mathcal{G} = (\mathcal{V}_{sp} \cup \mathcal{V}_{mo}, \mathcal{E}_{sp} \cup \mathcal{E}_{mo} \cup \mathcal{E}_{sp\text{-}mo} \cup \mathcal{E}_{t\text{-}sp} \cup \mathcal{E}_{t\text{-}mo})$, where $\mathcal{V}_{sp} = \{\eta_1, \dots, \eta_N\}$ and $\mathcal{V}_{mo} = \{\mu_1, \dots, \mu_M\}$ denote the node sets of regions and mobility modes, respectively; $\mathcal{E}_{sp}$, $\mathcal{E}_{mo}$, $\mathcal{E}_{sp\text{-}mo}$, $\mathcal{E}_{t\text{-}sp}$, $\mathcal{E}_{t\text{-}mo}$ denote five edge sets for the region-to-region, mode-to-mode, region-to-mode, time-to-region, and time-to-mode relations. By this definition, the intermodal relationship and its dynamicity can be represented by $\mathcal{E}_{mo}$ and $\mathcal{E}_{t\text{-}mo}$, and the task of multimodal mobility nowcasting is reformulated as a link prediction task for the edge set $\mathcal{E}_{sp\text{-}mo}$ ($|\mathcal{E}_{sp\text{-}mo}| = N \times C$) from $\tau_{t+1}$ to $\tau_{t+\beta}$.

Figure 4: Framework of EAST-Net: (1) Input Multimodal Mobility Tensor and Temporal Covariates; (2) Encode with Heterogeneous Mobility Information Network in Two Pyramidal GCRU Branches; (3) Query Memory-augmented Dynamic Filter Generator to Produce Sequence-specific Graph Convolution Kernels; (4) Decode with GCRU and Generate Links

Here we propose a simple yet generic framework to encode-decode HMIN by applying the handy GCRU in a similar fashion to ST-Net. Denoting the framework of ST-Enc/ST-Dec (multi-layer GCRU) as $H_{t+1}, \dots, H_{t+\beta} = \mathrm{GCRU}_{Enc\text{-}Dec}(X_{t-\alpha+1}, \dots, X_t)$, the processing of HMIN can be decomposed into two views (i.e. spatial and intermodal) followed by a stepwise fusion layer, formally denoted by:

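$$
\begin{aligned}
H^{(sp)}_{t+1}, \dots, H^{(sp)}_{t+\beta} &= \mathrm{GCRU}^{(sp)}_{Enc\text{-}Dec}(X_{t-\alpha+1}, \dots, X_t) \\
H^{(mo)}_{t+1}, \dots, H^{(mo)}_{t+\beta} &= \mathrm{GCRU}^{(mo)}_{Enc\text{-}Dec}(X^{\top}_{t-\alpha+1}, \dots, X^{\top}_t) \\
\hat{X}_{t+\varepsilon} &= \sigma\big(H^{(sp)}_{t+\varepsilon} W_{out} H^{(mo)\top}_{t+\varepsilon}\big)
\end{aligned}
\tag{5}
$$

where $\varepsilon \in (1, \dots, \beta)$ denotes the step index within horizon $\beta$; $H^{(sp)}_{t+\varepsilon} \in \mathbb{R}^{N \times q}$ and $H^{(mo)}_{t+\varepsilon} \in \mathbb{R}^{C \times q}$ denote the spatial and modal embeddings on $\mathcal{V}_{sp}$ and $\mathcal{V}_{mo}$ at time slot $\tau_{t+\varepsilon}$, respectively; $W_{out} \in \mathbb{R}^{q \times q}$ denotes a parameter matrix that fuses the node embeddings for link generation. For simplicity, we denote this framework by HMINet (Equation (5)) and consider it to be a generalization of ST-Net, which only takes the spatial view and lets $W_{out} \in \mathbb{R}^{q \times C}$, $H^{(mo)}_{t+\varepsilon} = I_C$. Essentially, the edge sets $\mathcal{E}_{sp}$ and $\mathcal{E}_{mo}$, representing spatial and intermodal dependencies, are handled by graph convolution in each domain; the unidirectional temporal edges $\mathcal{E}_{t\text{-}sp}$ and $\mathcal{E}_{t\text{-}mo}$ are encoded by the recurrent structure; HMINet learns the mapping from $\alpha$ steps to $\beta$ steps in the edge set $\mathcal{E}_{sp\text{-}mo}$.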
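The fusion step of Equation (5) amounts to a bilinear product between region and mode embeddings. The sketch below illustrates the shapes involved, reading the product as $H^{(sp)} W_{out} H^{(mo)\top}$ so that the result is $N \times C$; the choice of ReLU for $\sigma$ and the random embeddings are assumptions for the example, since the text does not fix the activation.

```python
import numpy as np

def fuse_links(H_sp, H_mo, W_out):
    """Equation (5): sigma(H_sp W_out H_mo^T), one value per (region, channel) link."""
    return np.maximum(H_sp @ W_out @ H_mo.T, 0.0)   # ReLU used as sigma (assumption)

rng = np.random.default_rng(0)
N, C, q = 6, 4, 8
H_sp = rng.normal(size=(N, q))      # spatial embeddings on the region nodes
H_mo = rng.normal(size=(C, q))      # modal embeddings on the mode nodes
X_hat = fuse_links(H_sp, H_mo, rng.normal(size=(q, q)))
print(X_hat.shape)

# ST-Net as the special case: W_out in R^{q x C} and H_mo = I_C.
W_st = rng.normal(size=(q, C))
X_hat_st = fuse_links(H_sp, np.eye(C), W_st)
```

With the identity in place of the modal embeddings, the fusion collapses to a plain linear output layer on the spatial embeddings, which is the sense in which HMINet generalizes ST-Net.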

Memory-augmented Dynamic Filter Generator

Although it enhances ST-Net in an intermodality-aware way, HMINet introduces roughly as many extra parameters as ST-Net itself has. To control the model size and, more importantly, to empower the model to be aware of and adaptive to various scenarios, we propose a novel Memory-augmented Dynamic Filter Generator (MDFG). MDFG is motivated by a line of research on dynamic filter networks (DFN) (Jia et al. 2016; Yang et al. 2019; Zhou et al. 2021), which have been mainly studied on convolutional kernels for image and video-related tasks. The core idea behind DFN is that, instead of sharing the same trainable filter for all samples in a dataset, filters are dynamically generated conditioned on an input sample, which by nature increases the flexibility and adaptivity of the model.

In light of DFN, we argue that the indistinguishability between normal and event scenarios is rooted in the way that the same set of parameters (e.g. $\Theta = [\Theta_u, \Theta_r, \Theta_C]$ in Equation (3)) is shared by vanilla ST-Net across all observational sequences $(X_{t-\alpha+1}, \dots, X_t)$. In other words, parameters in ST-Net are sequence-agnostic. We thereby utilize the idea of DFN and further condition the parameters on a plugin memory bank $\mathbf{M}_{mob} \in \mathbb{R}^{m \times D}$ to encourage the discovery of high-level mobility prototypes, which are representations incorporating spatial, temporal, and multimodal knowledge. To be specific, $\mathbf{M}_{mob}$ takes the concatenated node embeddings $[H^{(sp)}_t, H^{(mo)}_t] \in \mathbb{R}^{(N+C) \times q}$ as a query and returns a reconstructed prototype vector $V_{mob} \in \mathbb{R}^{D}$, which further passes through a DFN to produce momentary filters $[\Theta^{(sp)}_t, \Theta^{(mo)}_t]$ for $\mathrm{GCRU}^{(sp)}_{Enc\text{-}Dec}$ and $\mathrm{GCRU}^{(mo)}_{Enc\text{-}Dec}$. This interaction between HMINet and MDFG occurs in an on-the-fly manner, which generates sequence-specific parameters (denoted by $\Theta_t$). Formally,

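$$
\begin{aligned}
Q_t &= [H^{(sp)}_t, H^{(mo)}_t] W_Q + b_Q \\
\phi_j &= \frac{e^{Q_t \cdot \mathbf{M}_{mob}[j]}}{\sum_{j'=1}^{m} e^{Q_t \cdot \mathbf{M}_{mob}[j']}} \\
V_{mob} &= \sum_{j=1}^{m} \phi_j \mathbf{M}_{mob}[j] \\
\Theta_t &= [\Theta^{(sp)}_t, \Theta^{(mo)}_t] = \mathcal{R}(\phi(\mathcal{R}(V_{mob})))
\end{aligned}
\tag{6}
$$

where $\mathcal{R}$ denotes a dynamic filter generation (DFG) layer, which can be implemented in various ways. Without loss of generality, we utilize a linear projection in this case. $\phi(\cdot)$ denotes a filter normalization (FN) layer (Zhou et al. 2021), used to normalize the generated parameters and avoid gradient vanishing and exploding.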
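The interaction described above (query the prototype memory, reconstruct a prototype vector, then generate and normalize a filter) can be sketched as follows. This is a simplified illustration under stated assumptions: a single linear projection stands in for the DFG layer, filter normalization is taken to be a plain standardization of the generated filter, and the function name `mdfg` and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mdfg(H_sp, H_mo, memory, W_Q, b_Q, W_gen, filter_shape):
    """Query the prototype memory and generate a sequence-specific filter."""
    query_in = np.concatenate([H_sp, H_mo], axis=0).ravel()  # flattened [H_sp, H_mo]
    Q = query_in @ W_Q + b_Q                                  # query vector, shape (D,)
    phi = softmax(memory @ Q)                                 # attention over m records
    V_mob = phi @ memory                                      # prototype vector, shape (D,)
    theta = (V_mob @ W_gen).reshape(filter_shape)             # linear DFG layer
    # Filter normalization (assumed form): zero mean, unit variance.
    theta = (theta - theta.mean()) / (theta.std() + 1e-6)
    return theta, phi

rng = np.random.default_rng(0)
N, C, q, m, D = 5, 4, 8, 8, 16
filter_shape = (4, 12, 8)                 # (K+1, p, q) of one graph convolution kernel
H_sp, H_mo = rng.normal(size=(N, q)), rng.normal(size=(C, q))
memory = rng.normal(size=(m, D))          # stands in for the learned memory bank
W_Q = rng.normal(scale=0.1, size=((N + C) * q, D))
b_Q = np.zeros(D)
W_gen = rng.normal(size=(D, int(np.prod(filter_shape))))

theta, phi = mdfg(H_sp, H_mo, memory, W_Q, b_Q, W_gen, filter_shape)
print(theta.shape, phi.round(2))
```

The attention scores `phi` are the quantities that the paper later visualizes per time slot; different input sequences yield different mixtures of memory records and hence different filters.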

Event-Aware Spatio-Temporal Network

Based on HMINet and MDFG, we further make three refinements to the framework of Event-Aware Spatio-Temporal Network (EAST-Net), as illustrated in Figure 4, which can be trained in an end-to-end fashion by minimizing a specified loss function using standard backpropagation.

Temporal covariates are fused stepwise for basic time and holiday awareness in both the spatial and intermodal views, following the common practice (Yao et al. 2019):

$$
X^{(0)}_{t+\varepsilon} =
\begin{cases}
[X_{t+\varepsilon}, \tilde{T}_{t+\varepsilon}], & \text{if } \varepsilon \in (1-\alpha, \dots, 0) \\
[\hat{X}_{t+\varepsilon}, \tilde{T}_{t+\varepsilon}], & \text{if } \varepsilon \in (1, \dots, \beta)
\end{cases}
\tag{7}
$$

Then, $X^{(0)}_{t+\varepsilon}$ is fed into HMINet (Equation (5)) as input. A pyramidal structure (Zonoozi et al. 2018) is leveraged in the GCRU encoders to help accelerate the training of HMINet and discover multi-level temporal patterns for mobility prototype extraction. With a reduction factor of 2:

$$
H^{(l+1)}_t = [H^{(l)}_{2t}, H^{(l)}_{2t+1}] \tag{8}
$$
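The pyramidal merge of Equation (8) halves the sequence length between encoder layers by concatenating neighboring hidden states. A minimal sketch, assuming hidden states are stored as a (T, N, q) array and that a trailing odd step is simply dropped (an assumption, as the text does not discuss odd lengths):

```python
import numpy as np

def pyramid_merge(H):
    """Equation (8) with factor 2: concatenate hidden states of steps 2t and 2t+1.

    H: (T, N, q) hidden states of layer l; returns (T // 2, N, 2q) inputs for layer l+1.
    """
    T = H.shape[0] - H.shape[0] % 2          # drop a trailing odd step, if any
    return np.concatenate([H[0:T:2], H[1:T:2]], axis=-1)

H = np.arange(8 * 3 * 2, dtype=float).reshape(8, 3, 2)   # alpha=8 steps, N=3, q=2
H_next = pyramid_merge(H)
print(H_next.shape)
```

The upper layer therefore runs over half as many steps, which is where the training speed-up mentioned above comes from.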

Adaptive edge sets $\mathcal{E}_{sp}$, $\mathcal{E}_{mo}$ are learnt in HMINet without making any prior assumptions on either the intermodal or the spatial relationship (Li et al. 2018; Wu et al. 2019). Essentially, a pair of parameterized node embeddings is initialized for each of $\mathrm{GCRU}^{(sp)}_{Enc\text{-}Dec}$ and $\mathrm{GCRU}^{(mo)}_{Enc\text{-}Dec}$ to derive the corresponding topology for graph convolutions:

$$
\begin{cases}
\mathcal{E}^{(sp)} = \mathrm{softmax}\big(\mathrm{relu}(E^{(sp)} F^{(sp)\top})\big) \\
\mathcal{E}^{(mo)} = \mathrm{softmax}\big(\mathrm{relu}(E^{(mo)} F^{(mo)\top})\big)
\end{cases}
\tag{9}
$$

where the embeddings $E^{(sp)}, F^{(sp)} \in \mathbb{R}^{N \times \mu_{sp}}$ and $E^{(mo)}, F^{(mo)} \in \mathbb{R}^{C \times \mu_{mo}}$ are trained to learn the underlying region-to-region and mode-to-mode dependencies within the node sets $\mathcal{V}_{sp}$ and $\mathcal{V}_{mo}$; the derived topology is normalized to $[0, 1]$ by softmax to simulate signal diffusion in each domain (replacing $\tilde{P}$ in Equation (2)).
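The adaptive topology of Equation (9) can be sketched with two small embedding tables per domain. The product is read here as $E F^{\top}$ so that the result is square, and the softmax is applied row-wise; both are assumptions of this illustration, and the random embeddings stand in for trained parameters.

```python
import numpy as np

def adaptive_adjacency(E, F):
    """Equation (9): row-wise softmax(relu(E F^T)) from two node-embedding tables."""
    scores = np.maximum(E @ F.T, 0.0)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C, mu_sp, mu_mo = 6, 4, 20, 3
A_sp = adaptive_adjacency(rng.normal(size=(N, mu_sp)), rng.normal(size=(N, mu_sp)))
A_mo = adaptive_adjacency(rng.normal(size=(C, mu_mo)), rng.normal(size=(C, mu_mo)))
print(A_sp.shape, A_mo.shape)
```

Each row of the resulting matrix sums to one, so it can play the role of the normalized topology in the graph convolution of Equation (2).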

Experiments

Datasets

To evaluate the proposed EAST-Net, we collect four real-world datasets with different spatio-temporal scales and coverage (presented in Table 1), and represent multimodal mobility with transport modes on three city-level datasets (for New York City, Washington DC, Chicago), and with travel purpose on the other country-level dataset (for the United States). Similarly to the previous studies (Zhang, Zheng, and Qi 2017; Ye et al. 2019; Jiang et al. 2019), trip records (e.g. taxi, share bike) or POI visits are processed as in/outflow (supply/demand) or visit volume, and are further aggregated onto a given spatio-temporal granularity. Particularly, each dataset is designed to cover a set of holidays and a historic event with big social impact, i.e. the winter storm Jonas or the COVID-19 pandemic. Following the common practice (Zonoozi et al. 2018; Yao et al. 2018), we encode the temporal covariates of each time slot (i.e. time-of-day, day-of-week, month-of-year, whether-holiday) in a one-hot manner as auxiliary sequence input.
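As an illustration of the one-hot covariate encoding mentioned above, the sketch below encodes a single timestamp. The function name `encode_covariates`, the holiday set, and the use of a single binary flag for whether-holiday are assumptions for this example rather than the authors' exact encoding.

```python
from datetime import datetime, date

def encode_covariates(ts, holidays, slots_per_day=24):
    """One-hot encode time-of-day, day-of-week, month-of-year, plus a holiday flag."""
    minutes_per_slot = 24 * 60 // slots_per_day
    slot = (ts.hour * 60 + ts.minute) // minutes_per_slot
    vec = [0] * (slots_per_day + 7 + 12 + 1)
    vec[slot] = 1                                    # time-of-day block
    vec[slots_per_day + ts.weekday()] = 1            # day-of-week block (Monday = 0)
    vec[slots_per_day + 7 + ts.month - 1] = 1        # month-of-year block
    vec[-1] = int(ts.date() in holidays)             # whether-holiday flag
    return vec

holidays = {date(2015, 12, 25), date(2016, 1, 1)}    # hypothetical holiday list
v = encode_covariates(datetime(2016, 1, 1, 13, 30), holidays, slots_per_day=48)
print(len(v), sum(v))
```

With 30-minute slots, as in JONAS-NYC, the vector has 48 + 7 + 12 + 1 = 68 entries, of which at most four are non-zero.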

Settings

We chronologically split each dataset for training, validation, and testing with a ratio of 7 : 1 : 2, such that the test sets cover roughly the last 20 days for JONAS-{NYC, DC}, 110 days for COVID-CHI, and 40 days for COVID-US. The lengths of observational and nowcasting sequences are set to $\alpha = 8$ and $\beta = 8$, respectively; the number of GCRU layers is $L = 2$, with approximation order $K = 3$ and hidden dimension $q = 32$; the embedding dimensions are $v = 2$ for $T_{cov}$, $\mu_{sp} = 20$ and $\mu_{mo} = 3$; the mobility prototype memory has $m = 8$ and $D = 16$. For model training, the batch size is 32, the learning rate is $5 \times 10^{-4}$, and the maximum number of epochs is 100 with early stopping at a patience of 10; MAE is chosen as the loss and optimized using Adam. We implement EAST-Net with PyTorch and carry out experiments on a GPU server with NVIDIA GeForce GTX 1080 Ti graphics cards. For evaluation, we adopt three commonly used metrics, namely Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).
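The chronological 7 : 1 : 2 split and the three evaluation metrics can be sketched in plain Python as below. The helper names are illustrative, and skipping zero ground-truth entries in MAPE is an assumption, since the text does not state how zeros are handled.

```python
import math

def chronological_split(num_slots, ratios=(7, 1, 2)):
    """Split slot indices in time order into train / validation / test ranges."""
    total = sum(ratios)
    train_end = num_slots * ratios[0] // total
    val_end = num_slots * (ratios[0] + ratios[1]) // total
    return range(train_end), range(train_end, val_end), range(val_end, num_slots)

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    """MAPE in percent; zero ground-truth entries are skipped (assumption)."""
    terms = [abs(a - b) / abs(a) for a, b in zip(y, y_hat) if a != 0]
    return 100.0 * sum(terms) / len(terms)

# JONAS-NYC: 100 days of 30-minute slots, i.e. 4800 slots in total.
train, val, test = chronological_split(100 * 48)
print(len(train), len(val), len(test))     # 3360 480 960

y, y_hat = [2.0, 4.0, 0.0], [1.0, 5.0, 3.0]
print(rmse(y, y_hat), mae(y, y_hat), mape(y, y_hat))
```

The 960 test slots correspond to 20 days of 30-minute slots, in line with the test length stated for the JONAS datasets.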

Evaluations

In this section, to understand the performance of our approach, we develop a group of research questions and design a series of experiments correspondingly: (1) How does EAST-Net perform compared with the existing methods? (2) How does EAST-Net perform compared with its model variants? (3) How does EAST-Net behave in different scenarios of societal events?

| Dataset | JONAS-NYC | JONAS-DC | COVID-CHI | COVID-US |
|---|---|---|---|---|
| Time Span | 2015/10/24 - 2016/1/31 | 2015/10/24 - 2016/1/31 | 2019/7/1 - 2020/12/31 | 2019/11/14 - 2020/5/31 |
| Temporal | 100 days by 30-minute | 100 days by 1-hour | 550 days by 2-hour | 200 days by 1-hour |
| Spatial | 16 × 8 grid in 0.5 × 0.5 km | 9 × 12 grid in 0.5 × 0.5 km | 14 × 8 grid in 1.5 × 1.2 km | 50 states + DC |
| Mobility | {Demand, Supply} of Transport Mode in {Taxi, Share Bike, Scooter*} | {Demand, Supply} of Transport Mode in {Taxi, Share Bike, Scooter*} | {Demand, Supply} of Transport Mode in {Taxi, Share Bike, Scooter*} | Travel Purpose |
| Event | Holidays + Blizzard Jonas (2016/1/22 - 24) | Holidays + Blizzard Jonas (2016/1/22 - 24) | Holidays + COVID Pandemic | Holidays + COVID Pandemic |

* Scooter trip data is only available in the COVID-CHI set.
Travel purpose is measured by POI visitations of 10 categories: {grocery store, retailer, transportation, office, school, healthcare, entertainment, hotel, restaurant, service}, according to the NAICS industry codes (https://www.naics.com/search-naics-codes-by-industry/).
The COVID-19 pandemic broke out in late March 2020; the COVID-US set depicts the early stage (first wave in April), and the COVID-CHI set depicts the progression (first to third waves till the end of 2020) of the pandemic.

Table 1: Summary of Four Experimental Datasets

| Model | JONAS-NYC RMSE | JONAS-NYC MAE | JONAS-NYC MAPE | JONAS-DC RMSE | JONAS-DC MAE | JONAS-DC MAPE | COVID-CHI RMSE | COVID-CHI MAE | COVID-CHI MAPE | COVID-US RMSE | COVID-US MAE | COVID-US MAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HA | 48.953 | 39.221 | 75.33% | 6.316 | 3.112 | 38.86% | 37.156 | 9.938 | 190.48% | 2822.12 | 1218.61 | 159.72% |
| NF | 29.928 | 28.370 | 59.25% | 7.754 | 3.594 | 68.95% | 12.909 | 5.662 | 79.19% | 2385.28 | 1258.12 | 185.08% |
| Transformer | 35.050 | 23.428 | 47.08% | 6.544 | 2.963 | 65.73% | 12.671 | 5.106 | 80.50% | 1767.71 | 862.82 | 180.13% |
| CoST-Net | 33.721 | 22.485 | 41.03% | 6.274 | 2.971 | 52.05% | 15.259 | 6.881 | 83.74% | - | - | - |
| DCRNN | 28.722 | 18.718 | 38.99% | 5.469 | 3.066 | 50.35% | 10.566 | 6.483 | 51.23% | 1194.38 | 722.34 | 155.92% |
| GW-Net | 28.584 | 19.367 | 36.96% | 5.091 | 2.334 | 51.03% | 8.365 | 3.723 | 45.41% | 1022.82 | 490.97 | 77.62% |
| MTGNN | 28.874 | 19.118 | 36.39% | 5.161 | 2.691 | 47.95% | 8.822 | 4.350 | 51.58% | 1083.00 | 535.61 | 75.86% |
| StemGNN | 30.711 | 21.489 | 40.81% | 5.316 | 3.074 | 50.44% | 8.400 | 4.496 | 50.27% | 1279.04 | 709.16 | 146.73% |
| EAST-Net | 23.632 | 15.790 | 33.33% | 4.103 | 2.004 | 35.03% | 9.381 | 3.380 | 61.50% | 799.51 | 371.78 | 51.84% |
| Δ % | -17.3% | -15.6% | -8.4% | -19.4% | -14.1% | -9.9% | - | -9.2% | - | -21.8% | -24.3% | -31.7% |

Table 2: Performance of EAST-Net and Baselines in RMSE, MAE, MAPE at JONAS-{NYC, DC}, COVID-{CHI, US}
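The last row of Table 2 appears to report the relative change of EAST-Net's error with respect to the best baseline on each metric. The short check below reproduces two of its entries from the table values; the function name is illustrative, and this reading of the row is an inference from the numbers rather than a statement in the text.

```python
def relative_change(ours, best_baseline):
    """Percentage change of our error relative to the best baseline (negative = lower error)."""
    return 100.0 * (ours - best_baseline) / best_baseline

# JONAS-NYC RMSE: EAST-Net (23.632) vs. GW-Net (28.584), the best baseline on this metric.
delta_nyc_rmse = relative_change(23.632, 28.584)
# COVID-US MAPE: EAST-Net (51.84) vs. MTGNN (75.86), the best baseline on this metric.
delta_us_mape = relative_change(51.84, 75.86)
print(f"{delta_nyc_rmse:.1f}% {delta_us_mape:.1f}%")   # -17.3% -31.7%
```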

Quantitative Evaluation 1

To quantitatively evaluate the overall prediction accuracy of EAST-Net on the multimodal mobility nowcasting problem, we implement eight baselines on mobility/traffic-related spatio-temporal prediction for comparison, including:

• Historical Average (HA): Average values of the same time slot in the training set are used for prediction.
• Naive Forecast (NF): Naively repeat the latest observation for the next β time slots. This practice is proven to be rather effective under events (Jiang et al. 2019).
• Transformer (Vaswani et al. 2017): The enhanced version (Li et al. 2019) with convolutional self-attention is implemented to capture local temporal patterns for time series forecasting.
• CoST-Net (Ye et al. 2019): A two-stage co-predictive model for multimodal transport demands. It models each mode individually with a convolutional auto-encoder and uses a heterogeneous LSTM for collaborative modeling.
• DCRNN (Li et al. 2018): A special form of GCRU requiring a pre-defined transition matrix as auxiliary input to perform bidirectional graph diffusion.
• Graph WaveNet (GW-Net) (Wu et al. 2019): A benchmark traffic forecasting model, in which parameterized graph input was first proposed. It utilizes a WaveNet-like structure for temporal modeling.
• MTGNN (Wu et al. 2020): A state-of-the-art model for multivariate time series modeling. It features an efficient unidirectional graph constructor and multi-kernel TCN.
• StemGNN (Cao et al. 2020): Another state-of-the-art method that models spatial and temporal dependencies jointly in the spectral domain for multivariate time series (MTS) forecasting.
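The two non-learning baselines at the top of the list can be sketched in a few lines. The sketch assumes that "same time slot" means the same slot of the day and that the training tensor starts at slot 0 of a day and spans whole days; these are assumptions of the illustration.

```python
import numpy as np

def naive_forecast(history, beta):
    """NF: repeat the latest observation for the next beta slots."""
    return np.repeat(history[-1:], beta, axis=0)

def historical_average(train, slots_per_day, start_slot, beta):
    """HA: mean of the training data per slot of the day.

    train: (T, N, C) with T a multiple of slots_per_day and train[0] at slot 0.
    start_slot: slot-of-day index of the first step to predict.
    """
    daily = train.reshape(-1, slots_per_day, *train.shape[1:]).mean(axis=0)
    slots = [(start_slot + i) % slots_per_day for i in range(beta)]
    return daily[slots]

# Toy data: 2 days of 4 slots, one region, one channel.
train = np.arange(8, dtype=float).reshape(8, 1, 1)
print(naive_forecast(train, beta=2).ravel())                  # [7. 7.]
print(historical_average(train, 4, start_slot=3, beta=2).ravel())   # [5. 2.]
```

NF only looks at the last observed step, which explains why it copes with abrupt level shifts, whereas HA relies entirely on daily periodicity.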

We present the performance comparison of EAST-Net and the baselines in Table 2. It is noticeable that the error range on the four datasets varies in magnitude: among the three city-level sets, DC and CHI have relatively smaller transport volumes than NYC; COVID-US is apparently the trickiest set, as it is state-level, has ten modes for travel purpose, and is tested at the very early stage (first wave) of the pandemic. Besides, the acceptable results obtained by HA on JONAS-DC, and by NF on JONAS-NYC and COVID-CHI, indicate a rather strong short-term temporal dependency in JONAS-NYC and COVID-CHI, and a daily periodicity in JONAS-DC. By treating the problem simply as time series, Transformer does not achieve satisfactory accuracy. Taking spatial locality into consideration, CoST-Net performs better than Transformer on JONAS-{NYC, DC}, but its pre-trained convolutional structure not only fails on COVID-CHI but also prevents it from handling graph-based data like COVID-US. Then, among the four graph-based models, GW-Net prevails in terms of most metrics on all datasets. Lastly, regarding EAST-Net, we can observe a consistent and substantial improvement throughout JONAS-{NYC, DC} and COVID-US, which confirms the efficacy of EAST-Net. The exception on COVID-CHI, we think, can be explained by: (1) a coarse time slot (2-hour) setting smooths sudden changes, making the task easier for other models; (2) along with the progression (first to third waves) of the COVID pandemic, other models gradually learn the pandemic pattern as a new normality.

Quantitative Evaluation 2

To understand how EAST-Net improves on the canonical ST-Net, we implement ST-Net (in Equation (3)) and its two rectified forms (in Figure 2a and 2b), as well as HMINet (in Equation (5)) for comparison. As presented in Table 3, within the ST-Net family, a regular memory bank improves ST-Net in most cases, but not as significantly as temporal covariates do. However, adding Tcov deteriorates the performance on COVID-US, which is actually reasonable because the reinforced awareness of periodicity backfires, especially at the early stage of a historic epidemic when human mobility began to deviate (because of quarantine measures). In comparison, adopting HMINet reduces all error metrics compared with regular ST-Net, especially on COVID-US, which validates our motivation for explicit intermodality modeling. Besides, adopting HMIN on COVID-CHI seems not as helpful as on other datasets. This issue, we think, may be caused by including the scooter data, which comes from a pilot program in Chicago and thus has some months without any data. Lastly, comparing HMINet and EAST-Net side by side, we can observe a consistent improvement in RMSE and MAE, which verifies the effectiveness of MDFG in various scenarios.

| Variant | JONAS-NYC RMSE | JONAS-NYC MAE | JONAS-NYC MAPE | JONAS-DC RMSE | JONAS-DC MAE | JONAS-DC MAPE | COVID-CHI RMSE | COVID-CHI MAE | COVID-CHI MAPE | COVID-US RMSE | COVID-US MAE | COVID-US MAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ST-Net | 31.382 | 20.215 | 43.99% | 5.437 | 2.366 | 55.96% | 12.166 | 5.061 | 80.00% | 1123.91 | 519.07 | 62.17% |
| ST-Net+Tcov | 25.353 | 16.964 | 33.97% | 4.453 | 2.042 | 43.05% | 8.674 | 2.823 | 59.00% | 1434.33 | 720.08 | 82.42% |
| ST-Net+Mem. | 30.725 | 20.158 | 40.41% | 5.079 | 2.599 | 44.00% | 9.921 | 3.018 | 57.00% | 1058.52 | 528.29 | 63.36% |
| HMINet | 28.713 | 18.205 | 37.96% | 4.567 | 2.072 | 48.26% | 11.437 | 4.475 | 78.53% | 906.85 | 399.35 | 43.47% |
| EAST-Net | 23.632 | 15.790 | 33.33% | 4.103 | 2.004 | 35.03% | 9.381 | 3.380 | 61.50% | 799.51 | 371.78 | 51.84% |

Table 3: Performance of EAST-Net and Its Variants in RMSE, MAE, MAPE at JONAS-{NYC, DC}, COVID-{CHI, US}

Figure 5: Time Series Histograms of Ground Truth and (2-hour ahead) Predictions of Citywide Taxi Demand in New York City from 22 Jan. 2016 to 27 Jan. 2016 (6 days). Panels: (a) Ground Truth; (b) Prediction by GW-Net; (c) Prediction by EAST-Net

Figure 6: On-the-fly Attention Score (φj) in MDFG at JONAS-NYC Test Set (from 12 Jan. to 31 Jan. 2016)

Qualitative Evaluation

To understand how EAST-Net behaves in diverse scenarios of societal events, we conduct two case studies on JONAS-NYC and COVID-US. In Figure 5, a clear no-mobility period is expected under the impact of the historic blizzard Jonas. GW-Net, a state-of-the-art model according to Table 2, simply makes naive forecasts (repeating the latest observation) during this anomalous period. In contrast, EAST-Net can quickly adapt to a declining-to-zero tendency (although causing underestimations afterwards). In addition, as illustrated in Figure 6, the composition of mobility prototypes in memory records for generating momentary filters is clearly differentiated between (1) normal workdays and a weekend with a holiday; (2) a long weekend and the Jonas period. These observations demonstrate the event-awareness and adaptivity of EAST-Net under a short-term event causing sudden volatility.

Figure 7 presents stream graphs (Byron and Wattenberg 2008) for state-averaged POI visits in 10 categories during the first wave of the COVID pandemic. A stream graph is a variation of the stacked area graph that positions layers to minimize weighted wiggle (sum of the squared slopes). In our case, an overall negative tendency is expected according to the ground truth. While an opposite, positive tendency is produced by GW-Net, EAST-Net captures the overall shape correctly. In detail, EAST-Net also better catches the smaller volume of POI visits on Memorial Day (25 May 2020) than GW-Net does. Besides, based on Figure 8, EAST-Net becomes aware of a new mobility pattern as early as March, the very beginning of the epidemic in the US. These observations reconfirm the event-awareness and adaptivity of EAST-Net, particularly under a long-term event imposing lasting impact.

Figure 7: Stream Graphs of Ground Truth and (4-hour ahead) Predictions of State-averaged POI Visits in 10 Categories from 22 Apr. 2020 to 29 May 2020 (38 days). Panels: (a) Ground Truth; (b) Prediction by GW-Net; (c) Prediction by EAST-Net

Figure 8: On-the-fly Attention Score (φj) in MDFG at COVID-US Set (from 15 Nov. 2019 to 25 May 2020)

Conclusion

In this paper, we tackle the multimodal mobility nowcasting problem in response to various event scenarios. By designing a heterogeneous mobility information network for explicitly representing intermodality and a memory-augmented dynamic filter generator for producing sequence-specific parameters on-the-fly, we propose an event-aware spatio-temporal network. Experiments on four real-world datasets verify the event-awareness and adaptivity of EAST-Net, which is even applicable to unprecedented events.

Acknowledgements

This work was partially supported by JSPS KAKENHI (JP20K19859), JST Strategic International Collaborative Research Program (SICORP) (JPMJSC2002, JPMJSC2104), and Australian Research Council (ARC) Discovery Project (DP190101485). We also appreciate the open POI data (i.e. Weekly Patterns) that SafeGraph has made available.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Byron, L.; and Wattenberg, M. 2008. Stacked Graphs – Geometry & Aesthetics. IEEE Transactions on Visualization and Computer Graphics, 14(6): 1245–1252.
Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; and Zhang, Q. 2020. Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting. In Advances in Neural Information Processing Systems, 17766–17778.
Cirstea, R.-G.; Kieu, T.; Guo, C.; Yang, B.; and Pan, S. J. 2021. EnhanceNet: Plugin Neural Networks for Enhancing Correlated Time Series Forecasting. In IEEE 37th International Conference on Data Engineering (ICDE), 1739–1750. IEEE.
Fan, Z.; Song, X.; Jiang, R.; Chen, Q.; and Shibasaki, R. 2019. Decentralized Attention-based Personalized Human Mobility Prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(4): 1–26.
Fan, Z.; Song, X.; Shibasaki, R.; and Adachi, R. 2015. CityMomentum: An Online Approach for Crowd Behavior Prediction at a Citywide Level. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 559–569.
Feng, J.; Li, Y.; Zhang, C.; Sun, F.; Meng, F.; Guo, A.; and Jin, D. 2018. DeepMove: Predicting Human Mobility with Attentional Recurrent Networks. In Proceedings of the 2018 World Wide Web Conference, 1459–1468.
Ha, D.; Dai, A.; and Le, Q. V. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106.
Jia, X.; De Brabandere, B.; Tuytelaars, T.; and Gool, L. V. 2016. Dynamic Filter Networks. Advances in Neural Information Processing Systems, 667–675.
Jiang, R.; Song, X.; Fan, Z.; Xia, T.; Chen, Q.; Miyazawa, S.; and Shibasaki, R. 2018. DeepUrbanMomentum: An Online Deep-Learning System for Short-Term Urban Mobility Prediction. In Thirty-Second AAAI Conference on Artificial Intelligence, 784–791.
Jiang, R.; Song, X.; Huang, D.; Song, X.; Xia, T.; Cai, Z.; Wang, Z.; Kim, K.-S.; and Shibasaki, R. 2019. DeepUrbanEvent: A System for Predicting Citywide Crowd Dynamics at Big Events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2114–2122.
Jiang, R.; Wang, Z.; Cai, Z.; Yang, C.; Fan, Z.; Xia, T.; Matsubara, G.; Mizuseki, H.; Song, X.; and Shibasaki, R. 2021. Countrywide Origin-Destination Matrix Prediction and Its Application for COVID-19. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 319–334. Springer.
Konishi, T.; Maruyama, M.; Tsubouchi, K.; and Shimosaka, M. 2016. CityProphet: City-Scale Irregularity Prediction Using Transit App Logs. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 752–757.
Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.-X.; and Yan, X. 2019. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. Advances in Neural Information Processing Systems, 32: 5243–5253.
Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations (ICLR '18).
Munkhdalai, T.; and Yu, H. 2017. Meta Networks. In International Conference on Machine Learning, 2554–2563. PMLR.
Pan, Z.; Liang, Y.; Wang, W.; Yu, Y.; Zheng, Y.; and Zhang, J. 2019. Urban Traffic Prediction from Spatio-Temporal Data Using Deep Meta Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1720–1730.
Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-Learning with Memory Augmented Neural Networks. In International Conference on Machine Learning, 1842–1850. PMLR.
Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems, 802–810.
Song, X.; Zhang, Q.; Sekimoto, Y.; and Shibasaki, R. 2014. Prediction of Human Emergency Behavior and Their Mobility Following Large-Scale Disaster. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 5–14.
Tang, X.; Yao, H.; Sun, Y.; Aggarwal, C.; Mitra, P.; and Wang, S. 2020. Joint Modeling of Local and Global Temporal Dynamics for Multivariate Time Series Forecasting with Missing Values. In Proceedings of the AAAI Conference on Artificial Intelligence, 5956–5963.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.
von Oswald, J.; Henning, C.; Sacramento, J.; and Grewe, B. F. 2020. Continual Learning with Hypernetworks. In International Conference on Learning Representations (ICLR '20).
Wang, Y.; Yin, H.; Chen, H.; Wo, T.; Xu, J.; and Zheng, K. 2019. Origin-Destination Matrix Prediction via Graph Convolution: A New Perspective of Passenger Demand Modeling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1227–1235.
Wang, Z.; Xia, T.; Jiang, R.; Liu, X.; Kim, K.-S.; Song, X.; and Shibasaki, R. 2021. Forecasting Ambulance Demand with Profiled Human Mobility via Heterogeneous Multi-Graph Neural Networks. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), 1751–1762. IEEE.
Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; and Zhang, C. 2020. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 753–763.
Wu, Z.; Pan, S.; Long, G.; Jiang, J.; and Zhang, C. 2019. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 1907–1913.
Xie, Q.; Guo, T.; Chen, Y.; Xiao, Y.; Wang, X.; and Zhao, B. Y. 2020. Deep Graph Convolutional Networks for Incident-Driven Traffic Speed Prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 1665–1674.
Xue, H.; Salim, F.; Ren, Y.; and Oliver, N. M. 2021. MobTCast: Leveraging Auxiliary Trajectory Forecasting for Human Mobility Prediction. In Thirty-Fifth Conference on Neural Information Processing Systems.
Yang, B.; Bender, G.; Le, Q. V.; and Ngiam, J. 2019. CondConv: Conditionally Parameterized Convolutions for Efficient Inference. Advances in Neural Information Processing Systems, 1307–1318.
Yao, H.; Liu, Y.; Wei, Y.; Tang, X.; and Li, Z. 2019. Learning from Multiple Cities: A Meta-Learning Approach for Spatial-Temporal Prediction. In The World Wide Web Conference, 2181–2191.
Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; and Li, Z. 2018. Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2588–2595.
Ye, J.; Sun, L.; Du, B.; Fu, Y.; Tong, X.; and Xiong, H. 2019. Co-Prediction of Multiple Transportation Demands based on Deep Spatio-Temporal Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 305–313.
Zhang, H.; Zheng, Y.; and Yu, Y. 2018. Detecting Urban Anomalies Using Multiple Spatio-Temporal Data Sources. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(1): 1–18.
Zhang, J.; Zheng, Y.; and Qi, D. 2017. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In Thirty-First AAAI Conference on Artificial Intelligence, 1655–1661.
Zhang, M.; Li, T.; Shi, H.; Li, Y.; Hui, P.; et al. 2019. A Decomposition Approach for Urban Anomaly Detection across Spatiotemporal Data. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 6043–6049.
Zheng, C.; Fan, X.; Wang, C.; and Qi, J. 2020. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 1234–1241.
Zhou, J.; Jampani, V.; Pi, Z.; Liu, Q.; and Yang, M.-H. 2021. Decoupled Dynamic Filter Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6647–6656.
Zonoozi, A.; Kim, J.-j.; Li, X.-L.; and Cong, G. 2018. Periodic-CRN: A Convolutional Recurrent Model for Crowd Density Prediction with Recurring Periodic Patterns. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-18, 3732–3738.