Enhancing Urban Flow Maps via Neural ODEs

Fan Zhou¹, Liang Li¹, Ting Zhong¹, Goce Trajcevski², Kunpeng Zhang³ and Jiahao Wang¹
¹School of Information and Software Engineering, University of Electronic Science and Technology of China
²Iowa State University, Ames IA
³University of Maryland, College Park MD
{fan.zhou, zhongting, wangjh}@uestc.edu.cn, liliang2333@gmail.com, gocet25@iastate.edu, kpzhang@umd.edu

Flow super-resolution (FSR) enables inferring fine-grained urban flows from coarse-grained observations and plays an important role in traffic monitoring and prediction. Existing FSR solutions rely on deep CNN models (e.g., ResNet) for learning spatial correlation, incurring excessive memory cost and numerous parameter updates. We propose to tackle urban flow inference from the dynamic-systems perspective and present a new method, FODE (FSR with Ordinary Differential Equations). FODE extends neural ODEs by introducing an affine coupling layer to overcome the problem of numerically unstable gradient computation, which allows more accurate and efficient spatial correlation estimation without extra memory cost. In addition, FODE provides a flexible balance between flow inference accuracy and computational efficiency. A FODE-based augmented normalization mechanism is further introduced to constrain the flow distribution under the influence of external factors. Experimental evaluations on two real-world datasets demonstrate that FODE significantly outperforms several baseline approaches.

1 Introduction

Urban flow super-resolution (FSR) aims at inferring fine-grained crowd flows in a city from coarse-grained observations. As a variant of image SR in the traffic domain [Cai et al., 2019; Wang et al., 2019], it has practical significance in urban planning and traffic monitoring. Despite its close relationship to image SR, FSR has certain constraints, the most important one being the structural constraint: the sum of the flow volumes of the subregions in the inferred fine-grained map must strictly equal the flow volume of their corresponding superregion in the original map. In practice, moreover, the distribution of flows in a given region is affected by many external factors, e.g., weather and time of day. A recent work [Liang et al., 2019] addresses the FSR problem with residual networks [He et al., 2016] and uses a simple normalization scheme to constrain the structural distribution of flows. However, the proposed architecture relies heavily on empirically stacking deep neural networks and consequently lacks principles to guide the design of effective and interpretable FSR networks. Furthermore, it pays no attention to the computational overhead as the network goes deeper, which demands significantly more memory and network parameters, making the model hard to optimize and prone to overfitting.

Recently, there has been growing interest in bridging the gap between neural networks and dynamic systems [E, 2017; Lu et al., 2018]. In particular, a recent study [Chen et al., 2018] has shown that ResNet [He et al., 2016] can be interpreted as discretized Neural Ordinary Differential Equations (NODE), which provides a new perspective for improving the stability and trainability of neural networks. For example, an additional state (called the adjoint) is used in [Chen et al., 2018] to solve the ODEs.
In this way, it does not need to store the intermediate states of the forward pass, resulting in O(1) memory cost in each ODE block. A number of studies have since been proposed to improve the performance of NODE [Liu et al., 2019b] and/or to apply NODE in different domains such as graph neural networks [Poli et al., 2019], time series learning [Rubanova et al., 2019; De Brouwer et al., 2019] and generative models [Heinonen and Lähdesmäki, 2019; Grathwohl et al., 2019]. However, the dynamics of either the hidden state or the adjoint are usually unstable, due to the numerical instability of solving the backward ODEs. ANODE [Gholami et al., 2019] and its variant [Zhang et al., 2019] improve the robustness and generalization of the adjoint method by storing a few intermediate states from the forward pass; however, this cannot fundamentally solve the incorrect-gradient problem.

In this work, we introduce a new SR method, FODE (FSR with ODEs), for fine-grained urban flow inference. FODE is a more general neural ODE architecture that addresses the numerical instability of previous methods while incurring no extra memory cost. The main idea of FODE is to incorporate an affine coupling layer in each ODE block to avoid the inaccurate-gradient issue. The input to a FODE block can then be accurately reconstructed from its outputs without the need to store intermediate states. We show theoretically and empirically that the proposed FODE model can be used to learn spatial correlations among urban regions and the influence of external features. The main contributions of this work can be summarized as:

- We present a new neural ODE framework that can accurately compute the gradients in each block.
- Our novel FSR method requires significantly less memory for fine-grained flow inference compared with ResNet-based deep neural networks.
- The augmented normalization scheme disentangles the influence of external factors on different regions.
- Experiments conducted on two real-world datasets demonstrate that FODE not only outperforms strong baselines on the FSR task, but also provides a flexible balance between computational cost and inference accuracy.

2 Architecture and Methodology

We now present the problem settings and discuss in detail the main aspects of the proposed FODE approach. We assume a grid-based segmentation, similar to [Zhang et al., 2017], which divides a city map $M$ into $H \times W$ spatial grid cells based on their geographical locations, i.e., $M = \{r_{ij}\}_{H \times W}$. Each cell $r_{ij}$ corresponds to the spatial region at the $i$th row and the $j$th column of $M$. Let $X \in \mathbb{R}_+^{H \times W}$ be the flow map at a given time, where each entry $x_{ij} \in \mathbb{R}_+$ denotes the volume of the flow in region $r_{ij}$.

Definition 1 (Urban Flow Super-Resolution (FSR)). Given a coarse-grained flow map $X^c$, an upscaling factor $N$, as well as external factors $E$ (e.g., weather, events, etc.), the FSR task is to learn a model $\mathcal{F}$ mapping $X^c$ into a fine-grained flow map $\hat{X}^f \in \mathbb{R}_+^{NH \times NW}$:

$$\hat{X}^f = \mathcal{F}(X^c \mid E, N; \theta), \quad (1)$$

where $\theta$ represents all learnable parameters.

Figure 1 illustrates an example of converting a coarse-grained flow map to a fine-grained flow map in the city of Beijing. The coarse-grained flow map (20km × 20km) consists of a total of 32 × 32 cells, each of which denotes a superregion. In the fine-grained flow map, there are 128 × 128 subregions in total. At the top of Figure 1, a superregion is composed of $N^2$ subregions ($N = 4$ here). Note that the sum of the flow volume in the $N^2$ subregions (top right) is equal to the flow volume of the corresponding superregion (top left).

Figure 1: Coarse-grained and fine-grained flow maps in Beijing (a superregion and its $N^2$ subregions, scaling factor $N = 4$).
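To make the structural constraint concrete, here is a minimal PyTorch sketch; the tensor shapes are chosen to match the Beijing example above, and the sum-pooling check is illustrative rather than part of the model:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes matching the TaxiBJ setup: a 128x128 fine-grained
# map and an upscaling factor N = 4.
N = 4
fine = torch.rand(1, 1, 128, 128)   # X^f: fine-grained flow volumes

# Summing each N x N block of subregions yields the superregion volume:
# avg-pooling times N^2 is exactly a block sum.
coarse = F.avg_pool2d(fine, kernel_size=N) * (N * N)   # X^c, shape (1, 1, 32, 32)

# Sanity check of the structural constraint for the top-left superregion.
assert torch.allclose(coarse[0, 0, 0, 0], fine[0, 0, :N, :N].sum())
```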
2.1 Architecture

Figure 2 illustrates the overall architecture of FODE, which consists of three main components:

- FODE block, which extracts spatial correlations among flow regions with the proposed new ODEs.
- Feature fusion network (FFN), which combines FODE blocks and fully connected networks to fuse the influence of external factors.
- Fine-grained flow inference (FFI), which leverages an augmented $N^2$-normalization scheme (AN²) to estimate the distribution of both flows and external factors for generating the fine-grained flow maps.

Figure 2: Overview of the proposed FSR architecture (coarse-grained flows, nearest-neighbor upsampling, external factors, fine-grained flows).

2.2 FODE Block

ResNet [He et al., 2016] and its variants have been widely employed for image super-resolution [Ledig et al., 2017; Zhang et al., 2018]. Recently, [Liang et al., 2019] exploited residual blocks for regional correlation learning and fine-grained urban flow inference, following the same idea as image SR. Suppose we obtain a high-dimensional hidden state $z_0$ from the coarse-grained flow $X^c$ through some convolutional layers. The residual block then transforms the hidden state $z_n$ according to:

$$z_{n+1} = z_n + f(z_n; \theta). \quad (2)$$

While achieving promising results on SR tasks, residual networks still confront the problem of intensive computation and require tuning a huge number of parameters. There is growing interest in bridging the gap between discrete neural networks and continuous dynamic systems. For example, the iterative updates in the above residual block can be viewed as a discretization of a continuous ODE operator [Lu et al., 2018; Chen et al., 2018], if we take time $t \in T$ as a continuous variable:

$$\frac{dz(t)}{dt} = f(z(t), t; \theta), \quad \text{where } z(t_n) = z_n, \quad (3)$$

$$z(T) = z(0) + \int_0^T f(z(t), t; \theta)\, dt. \quad (4)$$

Solving the above ODEs requires computing gradients through backpropagation, which would be memory-prohibitive if time is reduced to infinitesimal steps: in principle, it requires O(N) cost to store all N intermediate activations. NODE [Chen et al., 2018] proposes to address this problem with the adjoint method. Considering a loss function $L$, an additional state, referred to as the adjoint $a(t) = \partial L / \partial z(t)$, can be used to compute the gradient w.r.t. the parameters $\theta$:

$$\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f(z(t), t; \theta)}{\partial z(t)}, \quad (5)$$

$$\frac{dL}{d\theta} = -\int_T^0 a(t)^{\top} \frac{\partial f(z(t), t; \theta)}{\partial \theta}\, dt, \quad (6)$$

where we only need to store the final state $z(T)$. Hence, this strategy successfully reduces the memory cost. However, it also introduces other problems: inaccurate gradients and numerical instability. Taking Eq. (2) as an example, we compare the two calculated gradients:

(1) The gradient w.r.t. $z_n$ computed via the adjoint dynamics of Eq. (5):

$$a(t_n) = a(t_{n+1}) + \int_{t_{n+1}}^{t_n} a(t)^{\top} \frac{\partial f(z(t), t; \theta)}{\partial z(t)}\, dt. \quad (7)$$

(2) The exact gradient w.r.t. $z_n$ by the chain rule:

$$a_n = a_{n+1} + a_{n+1} \frac{\partial f(z_n; \theta)}{\partial z_n}. \quad (8)$$

As can be seen, the second terms of Eq. (7) and Eq. (8) differ, which makes $a(t_n) \neq a_n$. In other words, the adjoint method yields an inaccurate and unstable gradient $\partial L / \partial \theta$ compared to direct backpropagation, which computes a correct and stable gradient but suffers a prohibitive memory cost.
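For intuition, the residual-update-as-Euler-step view of Eqs. (2)-(4) can be sketched in a few lines of PyTorch. `ODEFunc` and `euler_integrate` below are hypothetical stand-ins, not the paper's implementation; the point is that direct backpropagation through the unrolled solver keeps every intermediate state alive:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """f(z, t; theta): a stand-in for the convolutional dynamics."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.conv(z))

def euler_integrate(func, z0, T=1.0, steps=4):
    """Fixed-step Euler solver: z_{n+1} = z_n + dt * f(z_n, t_n), i.e., the
    residual update of Eq. (2) with dt absorbed into f. Autograd keeps every
    intermediate z_n alive, so memory grows as O(steps)."""
    dt = T / steps
    z, t = z0, torch.tensor(0.0)
    for _ in range(steps):
        z = z + dt * func(t, z)
        t = t + dt
    return z

func = ODEFunc(channels=8)
z0 = torch.rand(1, 8, 32, 32, requires_grad=True)
zT = euler_integrate(func, z0)
zT.sum().backward()   # direct backprop: exact gradients, O(steps) memory
```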
ANODE and its variant [Gholami et al., 2019; Zhang et al., 2019] mitigate this problem by splitting a block into time batches and storing a few intermediate states from the forward pass in memory, which is, in a sense, a compromise between time and storage. In this work, we bridge this gap by introducing an affine coupling layer into the computation of the ODEs, avoiding the inaccurate-gradient issue while requiring only O(1) memory per block.

An affine coupling layer refers to a family of neural networks whose forward function is a bijective mapping; it has been widely used in invertible generative models [Dinh et al., 2015; Dinh et al., 2017; Kingma and Dhariwal, 2018], where the input to a bijective block can be accurately reconstructed from its outputs. As illustrated in Figure 3, we divide the input $z_t$ (let $z_t = z(t)$) into two parts of the same size, $z^a_t, z^b_t \in \mathbb{R}^{C/2 \times H \times W}$, where $C$ is the number of channels. The forward pass functions in each FODE block with time step size $\Delta t$ are:

$$\mathcal{F}: \quad h^a_t = z^a_t, \qquad h^b_t = I(z^a_t) \odot z^b_t + J(z^a_t),$$
$$\mathcal{G}: \quad z^b_{t+\Delta t} = h^b_t, \qquad z^a_{t+\Delta t} = S(h^b_t) \odot h^a_t + K(h^b_t), \quad (9)$$

where $h^a_t$ and $h^b_t$ are the intermediate states, $z^a_{t+\Delta t}$ and $z^b_{t+\Delta t}$ denote the outputs, and $I$, $J$, $S$ and $K$ are differentiable neural networks. The reverse computations are therefore:

$$h^b_t = z^b_{t+\Delta t}, \qquad h^a_t = \left(z^a_{t+\Delta t} - K(h^b_t)\right) / S(h^b_t),$$
$$z^a_t = h^a_t, \qquad z^b_t = \left(h^b_t - J(z^a_t)\right) / I(z^a_t), \quad (10)$$
$$\text{s.t.} \quad 0 < s_i \in S(h^b_t), \quad 0 < i_i \in I(z^a_t).$$

Figure 3: Computational graph for a FODE block.

To ensure the reversibility of Eq. (10), all elements of $S(h^b_t)$ and $I(z^a_t)$ need to be greater than zero, which can be met by applying a simple transformation $\exp(\log e)$ to each element $e$, $e \in I$ or $e \in S$. We then have the following property, which ensures that the affine transformations of the forward pass in Eq. (9) are reversible:

Proposition 1. The forward pass functions of Eq. (9) in the FODE block are reversible as long as each element of $I(z^a_t)$ and $S(h^b_t)$ is non-zero.

Proof. Let $J_{\mathcal{F}}$ and $J_{\mathcal{G}}$ be the Jacobians of the transformations in Eq. (9), which can be computed as:

$$J_{\mathcal{F}}(z^a_t, z^b_t) = \begin{bmatrix} \mathbf{I} & \mathbf{O} \\ \frac{\partial h^b_t}{\partial z^a_t} & \mathrm{diag}(I(z^a_t)) \end{bmatrix}, \quad (11)$$

$$J_{\mathcal{G}}(h^b_t, h^a_t) = \begin{bmatrix} \mathbf{I} & \mathbf{O} \\ \frac{\partial z^a_{t+\Delta t}}{\partial h^b_t} & \mathrm{diag}(S(h^b_t)) \end{bmatrix}, \quad (12)$$

where $\mathbf{I}$ and $\mathbf{O}$ are the identity and zero matrices, respectively, and $\mathrm{diag}(\cdot)$ denotes the diagonal matrix whose diagonal elements correspond to the elements of $I(z^a_t)$ or $S(h^b_t)$. Since $J_{\mathcal{F}}$ and $J_{\mathcal{G}}$ are lower-triangular matrices, their determinants are computed as $|J_{\mathcal{F}}| = \prod \mathrm{diag}(I(z^a_t))$ and $|J_{\mathcal{G}}| = \prod \mathrm{diag}(S(h^b_t))$. Since each element $e$ of $S(h^b_t)$ and $I(z^a_t)$ is greater than 0 after the transformation $\exp(\log e)$, the Jacobians $J_{\mathcal{F}}$ and $J_{\mathcal{G}}$ are therefore invertible.

Algorithm 1: Gradient calculation in FODE.
Input: initial values $z^a_0, z^b_0$; parameters $\theta$; integration time $t \in T$; time step size $\Delta t$.
Output: gradient $\frac{dL}{d\theta}$.
1: Forward pass:
2: for $t := 0$ to $T$ do
3:   $z^a_{t+\Delta t}, z^b_{t+\Delta t} \leftarrow \mathrm{ODESolve}(\text{Eq. (9)}, [z^a_t, z^b_t], \theta)$;
4:   delete $z^a_t, z^b_t$ and all intermediate activations;
5:   $t \leftarrow t + \Delta t$;
6: end for
7: return $z^a_T$ and $z^b_T$.
8: Backward pass:
9: for $t := T$ to $0$ do
10:  restore $\hat{z}^a_{t-\Delta t}$ and $\hat{z}^b_{t-\Delta t}$ from $z^a_t$ and $z^b_t$ by Eq. (10);
11:  let $z^a_{t-\Delta t} = \hat{z}^a_{t-\Delta t}$, $z^b_{t-\Delta t} = \hat{z}^b_{t-\Delta t}$;
12:  compute $\hat{z}^a_t$ and $\hat{z}^b_t$ by Eq. (9);
13:  compute the gradients at step $t$ by Eq. (13) and Eq. (14);
14:  update the model gradients $\frac{dL}{d\theta} \mathrel{+}= \nabla_t$;
15:  delete $z^a_t, z^b_t$;
16:  $t \leftarrow t - \Delta t$;
17: end for
18: return $\frac{dL}{d\theta}$.
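The following is a minimal PyTorch sketch of one coupling step and its inverse, following Eqs. (9)-(10). The four single-convolution networks standing in for $I$, $J$, $S$ and $K$ are illustrative assumptions, and `exp` keeps the scales positive as Proposition 1 requires:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One FODE-style coupling step (Eq. 9) and its exact inverse (Eq. 10).
    I, J, S, K are small illustrative networks, not the paper's exact ones."""
    def __init__(self, half_channels: int):
        super().__init__()
        conv = lambda: nn.Conv2d(half_channels, half_channels, 3, padding=1)
        self.I, self.J, self.S, self.K = conv(), conv(), conv(), conv()

    def forward(self, za, zb):
        ha = za
        hb = torch.exp(self.I(za)) * zb + self.J(za)        # F in Eq. (9)
        zb_next = hb
        za_next = torch.exp(self.S(hb)) * ha + self.K(hb)   # G in Eq. (9)
        return za_next, zb_next

    def inverse(self, za_next, zb_next):
        hb = zb_next                                        # Eq. (10)
        ha = (za_next - self.K(hb)) / torch.exp(self.S(hb))
        za = ha
        zb = (hb - self.J(za)) / torch.exp(self.I(za))
        return za, zb

block = AffineCoupling(half_channels=4)
za, zb = torch.rand(1, 4, 16, 16), torch.rand(1, 4, 16, 16)
za2, zb2 = block(za, zb)
ra, rb = block.inverse(za2, zb2)
# The inputs are reconstructed exactly (up to float round-off), so no
# intermediate activations need to be stored for the backward pass.
assert torch.allclose(ra, za, atol=1e-5) and torch.allclose(rb, zb, atol=1e-5)
```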
Due to the reversibility of FODE, in the forward pass with the ODE solver we only save the output $(z^a_{t+\Delta t}, z^b_{t+\Delta t})$, without the need to store other variables and intermediate activations. In the backward stage, we first restore the input $(\hat{z}^a_t, \hat{z}^b_t)$ from the output $(z^a_{t+\Delta t}, z^b_{t+\Delta t})$ by Eq. (10). Then, we perform a one-step forward pass to obtain the output $(\hat{z}^a_{t+\Delta t}, \hat{z}^b_{t+\Delta t})$ and calculate the corresponding gradients $\partial [\hat{z}^a_{t+\Delta t}, \hat{z}^b_{t+\Delta t}] / \partial [\hat{z}^a_t, \hat{z}^b_t]$ and $\partial [\hat{z}^a_{t+\Delta t}, \hat{z}^b_{t+\Delta t}] / \partial [\theta_I, \theta_J, \theta_S, \theta_K]$. Subsequently, the gradients w.r.t. $z_t$ and $\theta$ for one time step are computed as:

$$\frac{\partial L}{\partial [z^a_t, z^b_t]} = \frac{\partial L}{\partial [z^a_{t+\Delta t}, z^b_{t+\Delta t}]} \cdot \frac{\partial [\hat{z}^a_{t+\Delta t}, \hat{z}^b_{t+\Delta t}]}{\partial [\hat{z}^a_t, \hat{z}^b_t]}, \quad (13)$$

$$\frac{\partial L}{\partial [\theta_I, \theta_J, \theta_S, \theta_K]} = \frac{\partial L}{\partial [z^a_{t+\Delta t}, z^b_{t+\Delta t}]} \cdot \frac{\partial [\hat{z}^a_{t+\Delta t}, \hat{z}^b_{t+\Delta t}]}{\partial [\theta_I, \theta_J, \theta_S, \theta_K]}. \quad (14)$$

The process of calculating accurate gradients in FODE is summarized in Algorithm 1, where ODESolve can be any numerical solver, e.g., Euler, Runge-Kutta [Butcher and Wanner, 1996] or Dopri [Ascher et al., 1997]. After the FODE block, layer normalization [Ba et al., 2016] is employed to normalize the feature maps along the channel dimension. We then leverage SubPixel blocks [Liang et al., 2019] to upscale the hidden state from coarse-grained to fine-grained with upscaling factor $N$, which yields the fine-grained hidden state $H^f$ after a convolutional layer.

Figure 4: Feature fusion network (external factors passed through FODE blocks and a SubPixel block).

2.3 Feature Fusion Network

It has been demonstrated that external features (e.g., wind speed, temperature, weather and holidays) affect the distribution of traffic flows [Liang et al., 2019]. Here we also take these factors into consideration to improve performance. In addition to a simple fully connected network (FCN) for feature fusion, we use the proposed FODE blocks to improve the estimation of the feature influence. As shown in Figure 4, we obtain the coarse-grained feature map $E^c \in \mathbb{R}_+^{H \times W}$ and the fine-grained feature map $E^f \in \mathbb{R}_+^{NH \times NW}$ by appending two FODE blocks:

$$E^c = \mathrm{FODE}(E) \oplus E, \quad (15)$$

$$E^f = \mathrm{FODE}(\mathrm{SP}(E^c)) \oplus \mathrm{SP}(E^c), \quad (16)$$

where SP indicates a SubPixel block used for upsampling.

2.4 Augmented Flow Normalization

One of the main differences between FSR and image SR is the structural constraint in FSR, i.e., the amount of flow in the subregions should equal the flow in the respective superregion:

$$x^c_{i,j} = \sum_{i',j'} x^f_{i',j'}, \quad (17)$$

where $x^c_{i,j}$ (resp. $x^f_{i',j'}$) denotes the flow volume in a superregion (resp. subregion) of the coarse-grained (resp. fine-grained) grid map, and the sum runs over the $N^2$ subregions of superregion $r_{i,j}$. A straightforward method normalizes the flow over the $N^2$ subregions to meet this constraint, a.k.a. $N^2$-normalization [Liang et al., 2019]. However, it ignores the influence of external factors on the subregions. To address this problem, we propose an augmented $N^2$-normalization method that takes the distribution of external factors into account when constraining the flow, as illustrated in Figure 5. More specifically, we replace $x^f_{i',j'}$ in a subregion with a probability value $\alpha_{i',j'}$ and modify Eq. (17) as follows:

$$x^c_{i,j} = \sum_{i',j'} \alpha_{i',j'}\, x^c_{i,j}, \qquad e^c_{i,j} = \sum_{i',j'} \beta_{i',j'}\, e^c_{i,j}, \qquad \alpha, \beta \in \mathbb{R}_+, \;\; \text{s.t.} \; \sum_{i',j'} \alpha_{i',j'} = 1, \; \sum_{i',j'} \beta_{i',j'} = 1, \quad (18)$$

where $e^c_{i,j}$ indicates the degree of influence of complex factors on region $r_{i,j}$, and $\alpha_{i',j'}$ and $\beta_{i',j'}$ denote the proportions of flow and factor influence assigned to the subregions from the corresponding superregion, respectively. Now, we learn the joint distribution of the factor influence $D^f_e$ and the flow $D^f_h$ to obtain the final flow distribution $D^f_\pi$; a sketch of the proportion computation follows.
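The sketch below assumes the proportions $\alpha$ of Eq. (18) are realized by a softmax over each $N \times N$ block, one natural way to satisfy the sum-to-one constraint; the exact parameterization is not spelled out above, so `n2_normalize` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def n2_normalize(h, N):
    """N^2-normalization (Liang et al., 2019): turn an NH x NW feature map
    into per-superregion proportions summing to one over each N x N block,
    i.e., the alphas of Eq. (18). The per-block softmax is an assumption."""
    B, C, H, W = h.shape
    blocks = h.reshape(B, C, H // N, N, W // N, N).permute(0, 1, 2, 4, 3, 5)
    alpha = F.softmax(blocks.reshape(B, C, H // N, W // N, N * N), dim=-1)
    alpha = alpha.reshape(B, C, H // N, W // N, N, N).permute(0, 1, 2, 4, 3, 5)
    return alpha.reshape(B, C, H, W)

# Redistribute each coarse cell's flow to its subregions.
N = 4
h_fine = torch.rand(1, 1, 128, 128)        # fine-grained hidden state H^f
alpha = n2_normalize(h_fine, N)            # proportions, each block sums to 1
x_coarse = torch.rand(1, 1, 32, 32)
x_coarse_up = F.interpolate(x_coarse, scale_factor=N)  # nearest-neighbor
x_fine = x_coarse_up * alpha               # block sums reproduce Eq. (17)
assert torch.allclose(F.avg_pool2d(x_fine, N) * (N * N), x_coarse, atol=1e-5)
```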
Here we present a Distribution Gating Mechanism (DGM) to explicitly capture the dynamic spatial dependence between the two distributions:

$$D^f_\pi = \mathcal{N}^2\!\left(D^f_h + D^f_h \odot \sigma(D^f_e)\right), \quad (19)$$

where $\sigma$ denotes the distribution gate, and $D^f_e$ and $D^f_h$ are generated by $N^2$-normalization with inputs $E^f$ and $H^f$. AN²-normalization not only inherits all the advantages of $N^2$-normalization (e.g., parallelizable computation and no extra parameters), but also bridges the gap between external factors and urban flows when normalizing the values of subregions.

Figure 5: Illustration of AN²-normalization.

Fine-grained flow inference. At this point, we are ready to generate the fine-grained flow $\hat{X}^f$ from the learned joint distribution as:

$$\hat{X}^f = X^c_{\mathrm{up}} \odot D^f_\pi, \quad (20)$$

where $X^c_{\mathrm{up}} \in \mathbb{R}_+^{NH \times NW}$ is produced by nearest-neighbor upsampling with scaling factor $N$.

Optimization. Finally, we minimize the mean squared error between the ground-truth fine-grained flow map $X^f$ and the model's inferred output $\hat{X}^f$:

$$L(\theta) = \left\| X^f - \hat{X}^f \right\|_2^2 = \left\| X^f - \mathcal{F}(X^c \mid E, N; \theta) \right\|_2^2, \quad (21)$$

where $\theta$ represents all learnable parameters in our model.

3 Experiments

We now present the details of our experimental evaluations.

3.1 Experimental Settings

Datasets. We evaluate all methods on two real-world urban flow datasets: (1) TaxiBJ [Liang et al., 2019], taxi GPS data including taxi flows from July 1, 2014 to October 31, 2014; and (2) BikeNYC, collected from an open website¹, which contains data from January 1, 2019 to June 30, 2019. Each dataset contains two sub-datasets: coarse-grained and fine-grained flows. A detailed description of the datasets is given in Table 1. Note that the scaling factors differ between the two datasets: N = 4 for TaxiBJ and N = 2 for BikeNYC.

¹https://www.citibikenyc.com/system-data

Table 1: Statistics of datasets.

| | TaxiBJ | BikeNYC |
|---|---|---|
| Time range | 7/1/2013 - 10/31/2013 | 1/1/2019 - 3/31/2019 |
| Time interval | 30 minutes | 1 hour |
| Coarse-grained size | 32 × 32 | 40 × 16 |
| Fine-grained size | 128 × 128 | 80 × 32 |
| Upscaling factor (N) | 4 | 2 |
| Latitude range | 39.82°N - 39.99°N | 40.65°N - 40.81°N |
| Longitude range | 116.26°E - 116.49°E | 74.00°W - 74.07°W |
| External factors | meteorology, time (e.g., hour-of-day, day-of-week) | |
| Temperature (°C) | [-24.6, 41.0] | \ |
| Wind speed (mph) | [0, 48.6] | \ |
| Weather conditions | 16 types (e.g., rainy, sunny) | \ |
| Holidays | 18 | 10 |

Baselines. We compare FODE with the following 10 baselines:

- Mean Partition (Mean): evenly distributes the flow volume of a superregion across its subregions.
- Historical Average (HA): uses historical averages to predict the flow in each subregion.
- SRCNN [Dong et al., 2015]: a classic CNN-based model for image SR.
- ESPCN [Shi et al., 2016]: introduces a sub-pixel convolutional layer for image SR.
- VDSR [Kim et al., 2016]: employs residual networks to address the slow convergence and limited representation problems in SR.
- SRResNet [Ledig et al., 2017]: a ResNet-based variant of VDSR that allows stacking more network layers.
- OISR [He et al., 2019]: an ODE-inspired SR model using the Runge-Kutta (RK3) method [Butcher and Wanner, 1996] as the ODE solver.
- NODE [Chen et al., 2018]: a neural ODE method that uses the adjoint method for the discretization and optimal control of ODEs.
- ANODE [Gholami et al., 2019]: an improved version of NODE that introduces checkpoints to alleviate the incorrect-gradient issue.
- UrbanFM [Liang et al., 2019]: infers fine-grained urban flows with external factors by stacking ResNet-based neural networks.
Metrics. We evaluate the different methods with three widely used metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).

Implementation details. Adam [Kingma and Ba, 2014] is adopted to train FODE with batch size 16 and learning rate 1e-4. We use the Dopri5 numerical method, which adaptively chooses the step size, as the ODESolve in FODE. FODE consists of 128 channels and 1 ODE block. We also present a simplified version, S-FODE, which contains 64 channels while keeping all other components the same as FODE. During training, we halve the learning rate and evaluate on the validation set every 20 epochs. For all image SR models and numerical methods (NODE and ANODE), we use $N^2$-normalization to constrain the inferred flow distribution. The details of other network settings are described in the source implementation².

²https://github.com/Anewnoob/FODE

Table 2: Performance comparisons on TaxiBJ and BikeNYC.

| Method | TaxiBJ RMSE | TaxiBJ MAE | TaxiBJ MAPE | BikeNYC RMSE | BikeNYC MAE | BikeNYC MAPE |
|---|---|---|---|---|---|---|
| Mean | 20.918 | 12.019 | 4.469 | 4.554 | 1.379 | 0.678 |
| HA | 4.741 | 2.214 | 0.332 | 2.414 | 0.676 | 0.216 |
| SRCNN | 4.297 | 2.491 | 0.714 | 2.385 | 0.821 | 0.433 |
| ESPCN | 4.206 | 2.497 | 0.732 | 2.356 | 0.825 | 0.441 |
| VDSR | 4.159 | 2.213 | 0.467 | 2.344 | 0.734 | 0.285 |
| SRResNet | 4.164 | 2.457 | 0.713 | 2.355 | 0.741 | 0.430 |
| OISR | 4.126 | 2.134 | 0.421 | 2.318 | 0.683 | 0.246 |
| NODE | 4.058 | 2.125 | 0.408 | 2.309 | 0.671 | 0.243 |
| ANODE | 3.967 | 2.043 | 0.351 | 2.246 | 0.635 | 0.217 |
| UrbanFM | 3.951 | 2.011 | 0.327 | 2.234 | 0.627 | 0.209 |
| S-FODE | 3.941 | 2.001 | 0.322 | 1.951 | 0.517 | 0.129 |
| FODE | 3.860 | 1.963 | 0.313 | 1.916 | 0.512 | 0.115 |

3.2 Results

Overall performance. Table 2 reports the FSR results of all methods, showing that FODE and its variant outperform all baselines on all metrics across both datasets. Taking TaxiBJ as an example, FODE yields 14.9%, 19.1% and 41.6% average improvement in RMSE, MAE and MAPE, respectively. In the relatively smaller area of NYC, FODE achieves an even larger improvement when inferring bicycle flows. The performance gain of FODE over the baselines demonstrates the effectiveness of continuous-time ODEs, which provide an alternative path to improving fine-grained flow inference.

Comparison analysis. Image SR methods are usually not competitive even with $N^2$-normalization, due to the structural constraints and the influence of external factors in the FSR application. This implies that FSR requires a model design that seamlessly takes the constraints and factors into account. As a specifically tailored FSR model, UrbanFM achieves the best performance among the baselines. However, as a ResNet-based model, it models urban flow in a discrete manner by stacking deep neural networks, which can be problematic since urban flow can inherently be viewed as a continuous dynamic system. OISR, NODE and ANODE are numerical methods tailored for FSR. OISR uses RK blocks as its network structure, which suffer from a huge number of parameters and numerical errors that significantly affect performance. Additionally, it requires a significant amount of memory for storing intermediate quantities during backpropagation (cf. Table 3). NODE, in contrast, uses the adjoint method for solving the ODEs, which needs only $O(\tilde{L})$ memory, as FODE does (note that $\tilde{L} = 1$ in FODE), while ANODE requires $O(\tilde{L}) + O(N)$ memory. Nevertheless, the dynamics of either the hidden state or the adjoint might be unstable, which incurs inaccurate gradient computation, as analyzed in the previous section.
The performance gain of FODE over these numerical methods indicates that our method estimates the gradients more accurately, thanks to the introduced affine coupling layer, while incurring no extra memory cost.

Memory efficiency. In addition to FSR performance, FODE offers significant memory and parameter efficiency compared to ResNet-based models. Table 3 lists the memory cost and parameters required by the different methods (we omit the other image SR methods due to their similar architectures to SRResNet). In particular, FODE requires only 1/3 of the parameters of UrbanFM and reduces the memory cost to O(1). It is worth noting that the simplified S-FODE contains only 0.7M parameters while still outperforming all baseline methods on the FSR task.

Table 3: Comparisons of parameters and memory cost. C: number of channels; L ($\tilde{L}$): number of ResNet (ODE) blocks; H: number of layers in each ResNet; N: number of function evaluations.

| Method | C | #Params (M) | Memory | Time |
|---|---|---|---|---|
| SRResNet | 128 | 5.5 | O(LH) | O(LH) |
| OISR | 128 | 15.7 | O(LH) | O(LH) |
| NODE | 128 | 2.1 | $O(\tilde{L})$ | O(N) |
| ANODE | 128 | 2.4 | $O(\tilde{L}) + O(N)$ | O(N) |
| UrbanFM | 128 | 6.2 | O(LH) | O(LH) |
| S-FODE | 64 | 0.7 | $O(\tilde{L})$ | O(N) |
| FODE | 128 | 2.1 | $O(\tilde{L})$ | O(N) |

Ability of factor fusion. External factors play an important role and should be carefully considered in FSR models. Figures 6(a) and 6(b) compare the influence of the factors learned by UrbanFM and FODE, respectively. UrbanFM uses an FCN to fuse external factors, which is too weak to correlate complex factors with the flow distributions. For example, the impact of the factors concentrates in a smaller area for UrbanFM, i.e., two main roads are more affected by external factors than other regions. In contrast, FODE estimates the influence of the factors by distributing the external influences more evenly and is therefore more robust. This is further verified by the results in Figures 6(c) and 6(d), where we observe that FODE consistently converges, while UrbanFM, surprisingly, achieves its best performance using temperature only rather than all factors.

Error analysis. Figure 7(a) shows the inference errors of UrbanFM and FODE on a data sample, where a brighter pixel indicates a larger error. To better visualize the quality of inference, we select four subregions (A, B, C and D), from which we clearly see that the flow inferred by FODE is better than UrbanFM's in crowded areas. Similarly, Figure 7(b) depicts the overall inference error under the two normalization schemes. We observe that in most areas, especially in the E, F and G subregions, the distribution generated by AN²-normalization is closer to the ground truth, which demonstrates that the proposed AN² method is a more effective way of constraining the flow distributions. This improvement can be attributed to jointly modeling the distributions of external factors and urban flows in AN², as opposed to simply constraining the flows in the subregions as in $N^2$-normalization.

Accuracy vs. efficiency. Another merit of FODE is that it allows balancing the trade-off between flow inference accuracy and computational overhead by varying the number of function evaluations N. As shown in Figure 8, the more function evaluations in the forward pass, the lower the MSE loss; accordingly, the training time increases linearly with the number of function evaluations. This result is important since downstream applications may choose flexible configurations that reconcile inference accuracy with computational cost, as sketched below.
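The same trade-off can be reproduced in miniature with any fixed-step solver; the toy dynamics `f` below is purely illustrative, and each Euler step costs one function evaluation, so the step count plays the role of N:

```python
import torch

def f(t, z):
    # Toy dynamics standing in for a trained FODE block.
    return -z + torch.sin(t)

def euler(z0, T=1.0, steps=4):
    """Fixed-step Euler: one function evaluation per step, so `steps`
    directly controls the NFE / accuracy trade-off."""
    z, dt, t = z0, T / steps, torch.tensor(0.0)
    for _ in range(steps):
        z = z + dt * f(t, z)
        t = t + dt
    return z

z0 = torch.tensor(1.0)
reference = euler(z0, steps=4096)          # near-exact solution
for steps in (2, 8, 32, 128):
    err = (euler(z0, steps=steps) - reference).abs().item()
    print(f"NFE={steps:4d}  |error|={err:.6f}")   # error falls as NFE grows
```

Adaptive solvers such as Dopri5 expose the same knob through their error tolerances instead of an explicit step count.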
Figure 6: Analysis of external factor fusion. (a) UrbanFM; (b) FODE; (c) UrbanFM fusion capability (validation RMSE without factors, with wind speed, temperature, holiday, or all factors); (d) FODE fusion capability (validation RMSE with/without factors for UrbanFM and FODE).

Figure 7: Inference error visualization. (a) Comparison of inference errors (UrbanFM vs. FODE); (b) $N^2$- vs. AN²-normalization.

Figure 8: Trade-off between training time and inference accuracy (high efficiency/high error vs. low efficiency/low error).

4 Conclusion

We proposed FODE, a novel method for inferring fine-grained urban flows. FODE learns the urban flow distribution through a new ODE parameterized by affine coupling neural networks, alleviating the numerically unstable gradient computation issue and saving both memory and model parameters. In addition, it can explicitly provide more flexible prediction performance by adaptively balancing prediction accuracy and computational overhead. Furthermore, we believe that FODE is a general ODE-based architecture that can be further exploited for time series prediction and other fine-grained inference tasks such as single image SR [Cai et al., 2019; Wang et al., 2019] and air quality inference [Liu et al., 2019a].

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61602097) and NSF grant CNS 1646107.

References

[Ascher et al., 1997] Uri M. Ascher, Steven J. Ruuth, and Raymond J. Spiteri. Implicit-explicit Runge-Kutta methods for time-dependent partial differential equations. Applied Numerical Mathematics, 25(2-3):151-167, 1997.
[Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[Butcher and Wanner, 1996] John Charles Butcher and Gerhard Wanner. Runge-Kutta methods: some historical notes. Applied Numerical Mathematics, 22(1-3):113-151, 1996.
[Cai et al., 2019] Jianrui Cai, Shuhang Gu, Radu Timofte, Lei Zhang, et al. NTIRE 2019 challenge on real image super-resolution: methods and results. In CVPR Challenge, pages 2211-2223, 2019.
[Chen et al., 2018] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In NeurIPS, pages 6572-6583, 2018.
[De Brouwer et al., 2019] Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRU-ODE-Bayes: continuous modeling of sporadically-observed time series. In NeurIPS, pages 7377-7388, 2019.
[Dinh et al., 2015] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. In ICLR, 2015.
[Dinh et al., 2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
[Dong et al., 2015] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 38(2):295-307, 2015.
[E, 2017] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1-11, 2017.
[Gholami et al., 2019] Amir Gholami, Kurt Keutzer, and George Biros. ANODE: unconditionally accurate memory-efficient gradients for neural ODEs. In IJCAI, pages 730-736, 2019.
[Grathwohl et al., 2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. In ICLR, 2019.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[He et al., 2019] Xiangyu He, Zitao Mo, Peisong Wang, Yang Liu, Mingyuan Yang, and Jian Cheng. ODE-inspired network design for single image super-resolution. In CVPR, pages 1732-1741, 2019.
[Heinonen and Lähdesmäki, 2019] Markus Heinonen and Harri Lähdesmäki. ODE2VAE: deep generative second order ODEs with Bayesian neural networks. In NeurIPS, pages 13412-13421, 2019.
[Kim et al., 2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646-1654, 2016.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv:1412.6980, 2014.
[Kingma and Dhariwal, 2018] Durk P. Kingma and Prafulla Dhariwal. Glow: generative flow with invertible 1x1 convolutions. In NIPS, pages 10236-10245, 2018.
[Ledig et al., 2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 105-114, 2017.
[Liang et al., 2019] Yuxuan Liang, Kun Ouyang, Lin Jing, Sijie Ruan, Ye Liu, Junbo Zhang, David S. Rosenblum, and Yu Zheng. UrbanFM: inferring fine-grained urban flows. In KDD, pages 3132-3142, 2019.
[Liu et al., 2019a] Ning Liu, Rui Ma, Yue Wang, and Lin Zhang. Inferring fine-grained air pollution map via a spatiotemporal super-resolution scheme. In UbiComp, pages 498-504, 2019.
[Liu et al., 2019b] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: stabilizing neural ODE networks with stochastic noise. arXiv:1906.02355, 2019.
[Lu et al., 2018] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In ICML, pages 3282-3291, 2018.
[Poli et al., 2019] Michael Poli, Stefano Massaroli, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Graph neural ordinary differential equations. arXiv:1911.07532, 2019.
[Rubanova et al., 2019] Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In NeurIPS, pages 5321-5331, 2019.
[Shi et al., 2016] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874-1883, 2016.
[Wang et al., 2019] Zhihao Wang, Jian Chen, and Steven C. H. Hoi. Deep learning for image super-resolution: a survey. arXiv:1902.06068, 2019.
[Zhang et al., 2017] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655-1661, 2017.
[Zhang et al., 2018] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472-2481, 2018.
[Zhang et al., 2019] Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: a coupled neural ODE evolution framework. In NeurIPS, 2019.