Urban Region Embedding via Multi-View Contrastive Prediction

Zechen Li1, Weiming Huang2, Kai Zhao3, Min Yang1, Yongshun Gong1, Meng Chen1,4*
1 School of Software, Shandong University
2 School of Computer Science and Engineering, Nanyang Technological University
3 Robinson College of Business, Georgia State University
4 Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources
lizechenn@gmail.com, weiming.huang@ntu.edu.sg, kzhao4@gsu.edu, myang3@sdu.edu.cn, yongshun2512@hotmail.com, mchen@sdu.edu.cn
*Corresponding authors.

Abstract

Recently, learning urban region representations utilizing multi-modal data (information views) has become increasingly popular, for a deep understanding of the distributions of various socioeconomic features in cities. However, previous methods usually blend multi-view information in a posterior stage, falling short of learning coherent and consistent representations across different views. In this paper, we form a new pipeline to learn consistent representations across varying views and propose the multi-view Contrastive Prediction model for urban Region embedding (ReCP), which leverages the multiple information views from point-of-interest (POI) and human mobility data. Specifically, ReCP comprises two major modules, namely an intra-view learning module utilizing contrastive learning and feature reconstruction to capture the unique information from each single view, and an inter-view learning module that perceives the consistency between the two views using a contrastive prediction learning scheme. We conduct thorough experiments on two downstream tasks to assess the proposed model, i.e., land use clustering and region popularity prediction. The experimental results demonstrate that our model significantly outperforms state-of-the-art baseline methods in urban region representation learning.

Introduction

A deep understanding of the spatial distribution of various socioeconomic factors in cities, such as land use or population distribution, is important for urban planning and management. In recent years, an increasingly popular trend in the community of urban computing has been to partition a city into numerous regions and utilize various urban sensory data to learn latent representations of the regions, which can subsequently be used in varying urban sensing tasks, e.g., land usage clustering, house price prediction, and population density inference (Liu et al. 2021; Li et al. 2022; Liu et al. 2023; Huang et al. 2023; Xu et al. 2023b; Li et al. 2023). This trend can also be attributed to the prosperity of mobile sensing technologies, which has led to the rapid accumulation of urban sensing data, such as human trajectories and points-of-interest (POIs) (Zheng et al. 2020, 2021; Chen, Yu, and Liu 2018; Zhang, Zhao, and Chen 2022; Xu et al. 2023a; Zhang et al. 2023). Such varied urban data provide more opportunities for tackling the problem of region representation learning.
Figure 1: Illustration of (a) the multi-view fusion paradigm and (b) our proposed consistency learning paradigm for region embedding. In the right figure, the solid and dotted rectangles denote the region representations Za and Zm from the attribute and mobility views, respectively. The mutual information I(Za, Zm) (chartreuse area) quantifies the amount of information shared by Za and Zm; the conditional entropy H(Za|Zm) (grey area) quantifies the amount of information in Za conditioned on Zm. To learn consistent region representations across different views, it is encouraged to maximize I(Za, Zm) and minimize H(Za|Zm) and H(Zm|Za).

Many previous studies have attempted to learn region representations by utilizing human mobility data. For instance, Wang and Li (2017) construct flow graphs and spatial graphs using taxi flow data and propose a graph embedding method to learn region representations. Yao et al. (2018) extract human mobility patterns from taxi trajectories and model the co-occurrence of origin-destination regions to learn region representations. These methods rely solely on single-view data, which offers a limited perspective of regions and fails to provide a comprehensive representation. Further, recent studies (Zhang et al. 2021; Luo, Chung, and Chen 2022; Zhang, Long, and Cong 2022; Zhou et al. 2023) propose learning region representations by integrating data in multiple modalities, thus forming multiple information views. In this context, the technical focus of recent region embedding studies has shifted towards the fusion of multiple information views, where they usually follow the same pipeline: separate single-view representation followed by multi-view fusion. Such a pipeline is demonstrated in Figure 1(a): it (1) separately models each information view (usually with a graph structure) and learns multiple single-view representations for each region, and (2) leverages certain fusion techniques (e.g., based on attention mechanisms) to blend multiple representations and yield the final multi-view region representation.

Previous multi-view region embedding methods have been effective in certain analyses, but they come with a notable limitation: they neglect the information consistency across different views when generating the final region representation. Intuitively, the information carried by multiple views of a region is highly correlated, and thus their representations should be consistent. For example, an entertainment region could contain multiple bars and restaurants (region attribute view based on POIs), as well as a large number of nighttime mobility flows (human mobility view). Both views can reflect the intrinsic characteristics of this region (i.e., its entertainment function).
If we manage to leverage such correlation, it could serve as a constraint during the process of learning representations for each view and enable knowledge to transfer from one view to the other. Ultimately, the multi-view representations would become highly consistent and naturally fused. Following the ideas above, we present a new pipeline, the consistency learning paradigm, for multi-view region embedding from an information theory perspective (Tsai et al. 2021; Lin et al. 2021), where the multi-view representations are naturally fused through exchanging information between views along with learning view-specific region representations, rather than treating fusion as a posterior process. This new pipeline is shown in Figure 1(b). Given two view-specific region representations Za and Zm (from the region attribute view and the human mobility view, respectively), we maximize the mutual information I(Za, Zm) to increase the amount of shared information (consistency) in the region representations of the two views. We also minimize the conditional entropies H(Za|Zm) and H(Zm|Za) to diminish the inconsistent information across the two views and further improve the consistency.

Based on the consistency learning paradigm, we propose a multi-view Contrastive Prediction model for urban Region embedding (ReCP), which can effectively enhance the consistency of region representations across different views. ReCP consists of two major components: intra-view learning and inter-view learning. In the intra-view learning component, to learn view-specific region representations, we compare each region with other dissimilar ones to embed the region into a latent space via contrastive learning; in the meantime, we also utilize autoencoders to capture view-specific region features for different views, which helps prevent the model from falling into a trivial solution. In the inter-view learning component, to learn the cross-view consistency of region representations, we design inter-view contrastive learning by maximizing I(Za, Zm) and dual prediction between views by minimizing H(Za|Zm) and H(Zm|Za).

To summarize, our contributions are as follows:
- We form a new pipeline following a consistency learning paradigm to study the urban region embedding problem by exploring the consistency across different views, using both human mobility and POI data. Different from existing multi-view region embedding methods, which adopt attention mechanisms to fuse representations of different views, we propose to learn consistent multi-view representations of regions by increasing the amount of shared information across multiple views from the information entropy perspective.
- We design inter-view contrastive learning and dual prediction processes to diminish the inconsistent information across views and learn an informative and consistent region representation between different views, achieved by maximizing the mutual information among different views and minimizing the conditional entropy among them.
- We conduct extensive experiments to evaluate our model with real-world datasets. The results demonstrate that the proposed ReCP outperforms existing methods on two downstream tasks by a significant margin. Data and source code are available at https://github.com/lizc-sdu/ReCP.

Problem Formulation

Definition 1 (Urban Region) A city can be partitioned into n disjoint urban regions, denoted as R = {r_1, r_2, ..., r_n}.
Definition 2 (Region Attributes) In this study, region attributes are defined as the inherent geographic features of regions. Specifically, we consider point-of-interest (POI) categories as region attributes, following Zhang, Long, and Cong (2022) and Fu et al. (2019). These region attributes are represented as a set A = {A_1, A_2, ..., A_n}, where A_i ∈ R^F and F is the total number of POI categories. Each dimension of A_i corresponds to the number of POIs of a specific category in region r_i.

Definition 3 (Human Mobility) For a region r_i, we define its outflow feature S_i^{j,t} as the number of trips made by all individuals originating from region r_i and destined for region r_j during a specific time interval t. Consequently, we generate a collection of outflow features based on the mobility data encompassing all regions within the set R. This collection is represented as S = {S_1, S_2, ..., S_n}, where S_i ∈ R^M. Here, M is the product of the number of regions n and the number of time intervals N_t within a day (for instance, N_t = 24). Similarly, by considering r_i as the destination region and the other regions r_j as the source regions, we can obtain an inflow feature vector, denoted as D_i, and finally obtain a collection D = {D_1, D_2, ..., D_n} of inflow features for all regions.

Problem 1 (Region Representation Learning) Given the attribute features A, outflow features S, and inflow features D of n regions, our objective is to acquire a collection of low-dimensional embeddings E = {E_1, E_2, ..., E_n} to serve as the latent representation of each region.
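To make these inputs concrete, the following is a minimal sketch (not the authors' preprocessing code) of how the features in Definitions 2 and 3 could be assembled; the record formats, POIs as (region, category) pairs and trips as (origin, destination, hour) triples, are illustrative assumptions.

```python
import numpy as np

def build_features(pois, trips, n_regions, n_categories, n_intervals=24):
    """Assemble region attribute and mobility features (Definitions 2-3).

    pois:  iterable of (region_id, category_id) pairs            (assumed format)
    trips: iterable of (origin_id, destination_id, hour) triples (assumed format)
    """
    # Attribute view: A[i, f] counts POIs of category f inside region r_i.
    A = np.zeros((n_regions, n_categories))
    for region, category in pois:
        A[region, category] += 1

    # Mobility views: each S_i flattens an (n_regions x n_intervals) matrix,
    # so M = n_regions * n_intervals, matching Definition 3.
    S = np.zeros((n_regions, n_regions, n_intervals))  # outflow counts
    D = np.zeros((n_regions, n_regions, n_intervals))  # inflow counts
    for origin, dest, hour in trips:
        S[origin, dest, hour] += 1   # trips leaving r_i for r_j at interval t
        D[dest, origin, hour] += 1   # mirrored inflow feature for r_i
    return A, S.reshape(n_regions, -1), D.reshape(n_regions, -1)
```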
Figure 2: The framework of ReCP.

Methodology

The framework of ReCP is illustrated in Figure 2, which includes two major components: 1) intra-view learning: for both the region attribute and human mobility views, it captures the representative features of each region by intra-view contrastive learning to learn view-specific representations; additionally, feature reconstruction is designed within each view to recover the original features of the region, which helps avoid a trivial solution; 2) inter-view learning: within the same region, it integrates representations from different views through two learning objectives: inter-view contrastive learning is used to enhance the consistency across different views, and dual prediction is introduced to further diminish the inconsistent information between views.

Intra-view Learning

Initially, we learn view-specific region representations based on the region attribute features A and the mobility features S and D, respectively. Within each view, we learn the latent representation of each region by employing intra-view contrastive learning, i.e., we compare each region with others to highlight the distinctive features of each region. Additionally, we design a within-view reconstruction loss to avoid the trivial solution.

Intra-view Contrastive Learning. To learn region representations within each view, we design an intra-view contrastive learning module, which compares each region with others. For a given region r_i, we have three types of region features: the attribute feature A_i, the outflow feature S_i, and the inflow feature D_i. For simplicity, let X_i^v denote the raw feature for the v-th view. For a target region r_i, we define its positive set as P_i^v = {X_1^v, X_2^v, ..., X_K^v}, where X_1^v, X_2^v, ..., X_K^v are positive samples obtained through the data augmentation function following Zhang, Long, and Cong (2022), and K is the number of positive samples. The negative set N_i^v is defined as N_i^v = {X_t^v | t ≠ i}, which contains the features of all regions except r_i. We then map the raw features of regions into latent representations,

$$Z_i^v = E^{(v)}(X_i^v), \qquad (1)$$

where E^{(v)} denotes the encoder for the v-th view. In practice, we simply implement it as a fully connected neural network. As a result, we obtain three types of region representations, Z_i^a, Z_i^s, and Z_i^d. Further, we compute the region representation Z_i^m of the human mobility view as the average of Z_i^s and Z_i^d, i.e., Z_i^m = (Z_i^s + Z_i^d)/2. To maximize the similarity of positive pairs while minimizing the similarity of negative pairs, the contrastive learning loss for the v-th view is defined as

$$\mathcal{L}_{cl}^{v} = -\sum_{r_i \in R} \left[ \log \frac{\sum_{k=1}^{K} \exp(Z_i^v \cdot Z_k^v / \tau)}{\sum_{k=1}^{K} \exp(Z_i^v \cdot Z_k^v / \tau) + \sum_{t=1}^{|N_i^v|} \exp(Z_i^v \cdot Z_t^v / \tau)} \right], \qquad (2)$$

where τ is the temperature parameter and R is the set of regions. Further, the intra-view contrastive learning loss across all views is formulated as

$$\mathcal{L}_{cl}^{intra} = \mu \mathcal{L}_{cl}^{a} + \mathcal{L}_{cl}^{m}, \qquad (3)$$

where μ is a parameter controlling the balance between the attribute view and the mobility view.

Intra-view Reconstruction. Given the feature X_i^v for the v-th view of region r_i, we further optimize the latent region representations via an autoencoder and define the reconstruction loss as

$$\mathcal{L}_{rec}^{v} = \sum_{r_i \in R} \left\| X_i^v - D^{(v)}(E^{(v)}(X_i^v)) \right\|_2^2, \qquad (4)$$

where E^{(v)} is the same as in Equation (1) and D^{(v)} is the decoder for the v-th view that reconstructs the region features. Specifically, we employ a fully connected network to implement D^{(v)}, which shares the same number of layers and hidden sizes as E^{(v)}. Note that the autoencoder structure helps avoid the trivial solution. The total reconstruction loss across all views is

$$\mathcal{L}_{rec}^{intra} = \mu \mathcal{L}_{rec}^{a} + \mathcal{L}_{rec}^{m}, \qquad (5)$$

where μ is the same weight parameter as in Equation (3). So far, we obtain two types of view-specific region representations, Z_i^a and Z_i^m, from the region attribute and human mobility views.
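The intra-view objectives in Equations (1)-(5) can be sketched compactly in PyTorch. The module below is our illustrative reading, not the released implementation: the two-layer encoder/decoder, the batch-wise negative set, and the temperature default are simplifying assumptions, and the positive samples are assumed to be pre-generated by the augmentation function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraView(nn.Module):
    """One view's encoder/decoder with the losses of Eqs. (1), (2), and (4)."""

    def __init__(self, in_dim, hidden=128, out_dim=96):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))   # E^(v), Eq. (1)
        self.dec = nn.Sequential(nn.Linear(out_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))    # D^(v), Eq. (4)

    def forward(self, x, pos, tau=0.5):
        z = self.enc(x)        # (n, d) region representations, Eq. (1)
        z_pos = self.enc(pos)  # (n, K, d) encoded positive samples

        # Eq. (2): positives in the numerator; positives plus all other
        # regions in the batch (the negative set) in the denominator.
        pos_sim = torch.exp(torch.einsum('nd,nkd->nk', z, z_pos) / tau).sum(1)
        all_sim = torch.exp(z @ z.T / tau)
        neg_sim = all_sim.sum(1) - all_sim.diagonal()  # exclude the region itself
        loss_cl = -torch.log(pos_sim / (pos_sim + neg_sim)).mean()

        # Eq. (4): reconstruction keeps the encoder from collapsing.
        loss_rec = F.mse_loss(self.dec(z), x)
        return z, loss_cl, loss_rec
```

In the full model, the mobility representation Z^m is the average of the outflow and inflow encoders' outputs, and μ from Equations (3) and (5) weights the two views' losses.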
Inter-view Learning

Different views of a region provide valuable information for describing the region, often offering complementary insights. To learn consistent and informative representations across different views, we employ inter-view contrastive learning to improve the collaboration and information exchange between the views, achieved by maximizing the mutual information among different views. Additionally, dual prediction between the two views is leveraged to reduce the impact of inconsistent information between the views by minimizing the conditional entropy across them.

Inter-view Contrastive Learning. In the latent embedding space, we conduct contrastive learning to learn consistent representations shared across different views, as recent contrastive learning studies (He et al. 2020; Lin et al. 2021) have shown that consistency can be learned by maximizing the mutual information of different views. Formally, given the two representations Z_i^a and Z_i^m of region r_i, we maximize the mutual information between Z_i^a and Z_i^m from different views:

$$\mathcal{L}_{cl}^{inter} = -\sum_{r_i \in R} \left[ I(Z_i^a, Z_i^m) + \alpha \left( H(Z_i^a) + H(Z_i^m) \right) \right], \qquad (6)$$

where I(·) represents mutual information, H(·) denotes information entropy, and the parameter α controls the balance between mutual information and information entropy. Note that maximizing H(Z_i^a) and H(Z_i^m) also helps prevent trivial solutions in which all regions are represented by the same representation. Based on the definition of mutual information, I(·) is defined as

$$I(Z_i^a, Z_i^m) = P(Z_i^a, Z_i^m) \log \frac{P(Z_i^a, Z_i^m)}{P(Z_i^a) P(Z_i^m)}, \qquad (7)$$

where P(Z_i^a, Z_i^m) represents the joint probability distribution of Z_i^a and Z_i^m. To represent the joint probability distribution, we employ a softmax function to transform the region representations Z_i^a ∈ R^d and Z_i^m ∈ R^d (where d is the dimension of region representations) with

$$B_i^a = \mathrm{softmax}(Z_i^a), \qquad B_i^m = \mathrm{softmax}(Z_i^m), \qquad (8)$$

where B_i^a ∈ R^d and B_i^m ∈ R^d can be interpreted as probability distributions. Considering the entire set R containing n regions, we define the matrix M ∈ R^{d×d} as the joint probability distribution of Z^a and Z^m,

$$M = \frac{1}{n} \sum_{i=1}^{n} B_i^a (B_i^m)^{\mathsf{T}}. \qquad (9)$$

We denote the element located at the r-th row and the r'-th column of the matrix as M_{rr'}, and the sums of the elements of M along the r-th row and the r'-th column as M_r and M_{r'}, respectively. M_{rr'} represents the joint probability, while M_r and M_{r'} represent the marginal probabilities. Then we can compute the mutual information I(Z^a, Z^m) as

$$I(Z^a, Z^m) = \sum_{r=1}^{d} \sum_{r'=1}^{d} M_{rr'} \log \frac{M_{rr'}}{M_r M_{r'}}. \qquad (10)$$

Information entropy H(Z_i^v) is defined as

$$H(Z_i^v) = -P(Z_i^v) \log P(Z_i^v), \qquad (11)$$

where v ∈ {a, m}. Following the above definition of M, H(Z^a) and H(Z^m) can be computed as

$$H(Z^a) = -\sum_{r=1}^{d} M_r \log M_r, \qquad H(Z^m) = -\sum_{r'=1}^{d} M_{r'} \log M_{r'}. \qquad (12)$$

Combining Equations (6), (10), and (12), and noting that the marginals satisfy M_r = Σ_{r'} M_{rr'} so the entropy terms can be folded into the exponents of the marginals, the inter-view contrastive learning loss is formulated as

$$\mathcal{L}_{cl}^{inter} = -\sum_{r=1}^{d} \sum_{r'=1}^{d} M_{rr'} \ln \frac{M_{rr'}}{M_r^{\alpha+1} M_{r'}^{\alpha+1}}, \qquad (13)$$

where α is the weight parameter defined in Equation (6).
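Equations (8)-(13) operate on a single d × d matrix, so this loss is only a few lines of tensor code. The sketch below is our reading of those equations; the symmetrization of M and the clamp floor are our own numerical conveniences, not steps stated in the paper.

```python
import torch

def inter_view_cl_loss(za, zm, alpha=9.0, eps=1e-8):
    """Eq. (13): inter-view contrastive loss from the joint matrix M.

    za, zm: (n, d) view-specific representations Z^a and Z^m.
    """
    ba = torch.softmax(za, dim=1)        # Eq. (8): rows become distributions
    bm = torch.softmax(zm, dim=1)
    m = ba.T @ bm / za.shape[0]          # Eq. (9): (d, d) joint distribution
    m = ((m + m.T) / 2).clamp_min(eps)   # symmetrize; avoid log(0)

    m_r = m.sum(dim=1, keepdim=True)     # row marginals M_r
    m_c = m.sum(dim=0, keepdim=True)     # column marginals M_r'
    # Eq. (13): -sum_rr' M_rr' * ln( M_rr' / (M_r^{a+1} * M_r'^{a+1}) )
    return -(m * (m.log() - (alpha + 1) * (m_r.log() + m_c.log()))).sum()
```

Since each outer product B_i^a (B_i^m)^T sums to 1, M sums to 1 by construction, so its row and column sums are valid marginal distributions.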
Inter-view Dual Prediction. To further diminish the inconsistency across different views, we predict the view-specific region representation of one view from the other by minimizing the conditional entropy. Formally, given the region representations Z^a and Z^m, we minimize the conditional entropy H(Z^p|Z^q), where p = a, q = m or p = m, q = a. On one hand, Z^q contains nearly all the information required to represent the p-th view if Z^q can perfectly predict Z^p for any (Z^p, Z^q) ~ P(Z^p, Z^q). On the other hand, Z^q diminishes the inconsistent information within the q-th view if Z^p can perfectly predict Z^q under the constraint that I(Z^p, Z^q) is maximized. Mathematically, H(Z^p|Z^q) is defined as

$$H(Z^p | Z^q) = -\mathbb{E}_{P(Z^p, Z^q)} \left[ \log P(Z^p | Z^q) \right]. \qquad (14)$$

To minimize H(Z^p|Z^q), a common approach is to assume a variational distribution Q(Z^p|Z^q) for Z^p and Z^q. Specifically, we propose to maximize E_{P(Z^p,Z^q)}[log Q(Z^p|Z^q)], which serves as a lower bound of E_{P(Z^p,Z^q)}[log P(Z^p|Z^q)]. Q(·|·) can be any distribution such as a Gaussian or Laplacian. In this work, we simply adopt the Gaussian distribution N(Z^p | F^{(q)}(Z^q), σI), where F^{(q)}(·) represents a parameterized function mapping Z^q to Z^p, and σI denotes the variance matrix. By ignoring the constants derived from the Gaussian distribution, maximizing E_{P(Z^p,Z^q)}[log Q(Z^p|Z^q)] is equivalent to minimizing E_{P(Z^p,Z^q)} ||Z^p − F^{(q)}(Z^q)||_2^2. Then the dual prediction loss can be formulated as

$$\mathcal{L}_{dp}^{inter} = \sum_{r_i \in R} \left[ \left\| Z_i^m - F^{(a)}(Z_i^a) \right\|_2^2 + \left\| Z_i^a - F^{(m)}(Z_i^m) \right\|_2^2 \right]. \qquad (15)$$

Here, F^{(a)} and F^{(m)} are respectively implemented as fully connected networks, with each layer followed by a batch normalization layer and a ReLU layer. Note that the above loss may lead to model collapse without the intra-view reconstruction loss (Equation (4)), i.e., Z_i^a and Z_i^m from different views would become equal to the same constant. Finally, the inter-view learning loss is defined as

$$\mathcal{L}^{inter} = \mathcal{L}_{dp}^{inter} + \mathcal{L}_{cl}^{inter}. \qquad (16)$$

Model Training

The final objective function is defined as

$$\mathcal{L} = \mathcal{L}^{inter} + \lambda_1 \mathcal{L}_{cl}^{intra} + \lambda_2 \mathcal{L}_{rec}^{intra}, \qquad (17)$$

where λ1 and λ2 are parameters controlling the weights of the different losses. After learning the latent representations Z^a and Z^m, we simply concatenate them as the final multi-view region representation, i.e., E_i = Z_i^a || Z_i^m.
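Under the Gaussian assumption, Equations (15)-(17) reduce to two MSE terms plus a weighted sum of losses. A minimal sketch follows; the helper names (make_predictor, dual_prediction_loss, total_loss) are ours, and the loss terms are assumed to be computed by the earlier sketches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_predictor(d=96, hidden=96):
    """F^(a) / F^(m): fully connected nets with BatchNorm + ReLU layers."""
    return nn.Sequential(nn.Linear(d, hidden), nn.BatchNorm1d(hidden),
                         nn.ReLU(), nn.Linear(hidden, d))

def dual_prediction_loss(za, zm, f_a, f_m):
    # Eq. (15): each view's representation predicts the other's.
    return (F.mse_loss(f_a(za), zm, reduction='sum') +
            F.mse_loss(f_m(zm), za, reduction='sum'))

def total_loss(loss_dp, loss_cl_inter, loss_cl_intra, loss_rec_intra,
               lam1=1.0, lam2=1.0):
    loss_inter = loss_dp + loss_cl_inter                              # Eq. (16)
    return loss_inter + lam1 * loss_cl_intra + lam2 * loss_rec_intra  # Eq. (17)

# Final region embedding, E_i = Z_i^a || Z_i^m:
# embeddings = torch.cat([za, zm], dim=1)
```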
Experiments

Experimental Settings

Datasets. We collect a diverse set of real-world data from NYC Open Data (https://opendata.cityofnewyork.us) and use the Manhattan borough as the study area. We partition Manhattan into 270 regions based on the city boundaries defined by the US Census Bureau (https://www.census.gov/data.html). As for the human mobility data, we employ the complete taxi trip records from February 2014 as our training data. We utilize the NYC check-in and POI data provided by Yang et al. (2014) for our model training and the popularity prediction task. A detailed description of the datasets is shown in Table 1. Based on these data, we construct the region features A, S, and D for model training.

| Dataset | Description |
|---|---|
| Regions | 270 regions divided by streets in Manhattan |
| Taxi trips | 10M taxi trips during February 2014 |
| POI data | 10K POIs with 244 categories |
| Check-in data | 100K check-in records |

Table 1: Data description (K = 10^3, M = 10^6).

Model Parameters. In our experiments, the dimension of region representations is set to 96. In the intra-view reconstruction module, we set the number of layers to 3 and the hidden size to 128 for the encoder E^{(v)} and decoder D^{(v)}; in the intra-view contrastive learning module, following the settings in Zhang, Long, and Cong (2022), we set the numbers of positive samples for the region attribute and human mobility data to 3 and 4, respectively, and the parameter μ controlling the balance between different views to 0.0001. In the inter-view dual prediction module, we set the number of layers to 3 and the hidden size to 96 for F^{(a)} and F^{(m)}; in the inter-view contrastive learning module, we set the parameter α to 9. We set the hyper-parameters λ1 and λ2 in the final objective loss to 1. Note that the optimal model parameters are selected using grid search with a small but adaptive step size. To optimize our model, we adopt Adam and initialize the learning rate at 0.01 with linear decay.

Baselines. We compare the performance of ReCP with several state-of-the-art region embedding methods.
- HDGE (Wang and Li 2017) constructs flow graphs and spatial graphs using taxi data and learns region representations with graph embedding methods.
- ZE-Mob (Yao et al. 2018) models co-occurrence patterns between regions from mobility data to learn region representations.
- MV-PN (Fu et al. 2019) models both inter-region and intra-region information to construct multi-view POI-POI networks within each region.
- CGAL (Zhang et al. 2019) extends MV-PN and incorporates the spatial structure and spatial autocorrelation among regions to learn region representations.
- MVURE (Zhang et al. 2021) learns region representations by cross-view information sharing and multi-view fusion with human mobility and region attributes.
- MGFN (Wu et al. 2022) designs multi-level cross-attention mechanisms to extract region representations from multiple mobility patterns.
- ReMVC (Zhang, Long, and Cong 2022) learns region representations through both intra-view and inter-view contrastive learning modules.
- HREP (Zhou et al. 2023) constructs heterogeneous graphs and uses relation-aware graph embedding to learn region representations.

Land Usage Clustering

We use the district division by the community boards (Berg 2007) as ground truth and divide the Manhattan borough into 29 districts, following the settings in Zhang, Long, and Cong (2022). We cluster regions into groups by k-means clustering (k = 29), using the region representations as inputs. Regions with the same land usage type are expected to be assigned to the same cluster. The experimental results are evaluated using three metrics: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and F-measure (Yao et al. 2018; Zhang et al. 2021). We assess all the methods on the same dataset and conduct 10 runs to report the mean value with the standard deviation in Table 2.

| Method | NMI | ARI | F-measure | MAE | RMSE | R2 |
|---|---|---|---|---|---|---|
| HDGE | 0.469±0.01 | 0.095±0.01 | 0.117±0.01 | 334.43±10.17 | 474.94±9.49 | 0.079±0.04 |
| ZE-Mob | 0.437±0.02 | 0.071±0.01 | 0.097±0.01 | 282.42±13.71 | 418.02±12.69 | 0.286±0.04 |
| MV-PN | 0.407±0.01 | 0.036±0.01 | 0.070±0.01 | 291.17±16.54 | 435.23±16.52 | 0.226±0.06 |
| CGAL | 0.414±0.08 | 0.059±0.06 | 0.091±0.06 | 351.10±51.20 | 486.96±52.58 | 0.021±0.20 |
| MVURE | 0.735±0.01 | 0.400±0.02 | 0.415±0.02 | 236.25±7.86 | 347.01±11.70 | 0.508±0.03 |
| MGFN | 0.748±0.01 | 0.424±0.03 | 0.437±0.03 | 240.37±11.99 | 354.24±17.14 | 0.487±0.05 |
| ReMVC | 0.761*±0.02 | 0.455*±0.04 | 0.462*±0.04 | 283.02±18.03 | 406.25±18.00 | 0.325±0.06 |
| HREP | 0.757±0.01 | 0.448±0.03 | 0.457±0.03 | 217.52*±10.98 | 318.41*±14.54 | 0.585*±0.04 |
| ReCP | 0.780±0.01 | 0.483±0.01 | 0.499±0.02 | 195.16±18.70 | 291.19±20.04 | 0.652±0.05 |
| Improvements | 2.50% | 6.15% | 8.01% | 10.28% | 8.55% | 11.45% |

Table 2: Performance comparison on the two downstream tasks (NMI, ARI, and F-measure for land usage clustering; MAE, RMSE, and R2 for region popularity prediction), where the performance improvements of ReCP are computed against the best baseline, marked with an asterisk.

From the results, we observe that: HDGE and ZE-Mob exhibit relatively inferior performance, as they merely model co-occurrence patterns using human mobility data. MGFN performs better than HDGE and ZE-Mob, as it designs a deep model based on cross-attention mechanisms to capture complex mobility patterns from spatial-temporal human mobility data. The methods that model multi-view information generally achieve satisfactory results, validating the importance of effectively integrating multi-view information for region embedding. Specifically, MV-PN and CGAL exhibit poor performance as they simply combine region representations from two views, lacking deep interaction between the views; MVURE and HREP design attention-based mechanisms to fuse the multi-view information, consequently yielding superior performance; ReMVC adopts contrastive learning to model intra-view and inter-view information and also obtains good results.
The proposed ReCP outperforms all these baselines, as it explores the consistency across different views in region embedding. Compared with ReMVC, ReCP achieves average improvements of 2.50%, 6.15%, and 8.01% in terms of NMI, ARI, and F-measure, respectively. Moreover, the results of paired t-tests indicate that the improvements of ReCP over the baselines are statistically significant, with p-values less than 0.01.

Region Popularity Prediction

Another commonly compared downstream task for evaluating region representations is popularity prediction, where we aggregate the check-in counts within each region as the ground truth of popularity, following Yang et al. (2014) and Zhang, Long, and Cong (2022). We take the region representations as input and train a Lasso regression model. The evaluation results, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Coefficient of Determination (R2), are obtained by 5-fold cross-validation, as reported in Table 2. From the results, we observe that the multi-view fusion methods, including MVURE and HREP, achieve decent performance, which further validates the necessity of integrating multi-view information in region embedding. ReCP performs the best among all methods; e.g., compared to HREP, ReCP achieves average improvements of 10.28%, 8.55%, and 11.45% in terms of MAE, RMSE, and R2, respectively. These results indicate that the new pipeline following the consistency learning paradigm is an effective way to learn better region representations.
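Both downstream evaluations are standard scikit-learn pipelines under the settings above (k-means with k = 29; Lasso with 5-fold cross-validation). A minimal sketch with our own variable names, reporting NMI, ARI, R2, and MAE:

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.model_selection import cross_val_score

def evaluate(embeddings, district_labels, checkin_counts):
    """embeddings: (n, 2d) concatenated [Z^a || Z^m] region representations."""
    # Land usage clustering: k-means with k = 29 community districts.
    pred = KMeans(n_clusters=29, n_init=10).fit_predict(embeddings)
    nmi = normalized_mutual_info_score(district_labels, pred)
    ari = adjusted_rand_score(district_labels, pred)

    # Region popularity prediction: Lasso regression, 5-fold cross-validation.
    r2 = cross_val_score(Lasso(), embeddings, checkin_counts,
                         cv=5, scoring='r2').mean()
    mae = -cross_val_score(Lasso(), embeddings, checkin_counts,
                           cv=5, scoring='neg_mean_absolute_error').mean()
    return nmi, ari, r2, mae
```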
Ablation Study and Parameter Analysis

Ablation study. We design four variants to explore how each module of ReCP affects the model performance: ReCP w/o CL removes the intra-view contrastive learning loss; ReCP w/o Rec removes the intra-view reconstruction loss and only uses the encoder to extract features; ReCP w/o IV removes the inter-view learning module and directly concatenates the region representations from the two views without the constraint of consistency learning; and ReCP w/o DP removes the inter-view dual prediction loss.

Figure 3: Performance comparison of different modules on (a) land usage clustering and (b) region popularity prediction.

From the results in Figure 3, we observe that:
1) ReCP w/o CL achieves the lowest performance on both tasks, indicating that the intra-view contrastive learning loss is a crucial component of our model for learning view-specific feature representations of regions.
2) ReCP w/o Rec performs worse than ReCP, supporting the aforementioned claim that the intra-view reconstruction loss helps prevent the model from converging to a trivial solution.
3) ReCP demonstrates improvements of 29.84% (in terms of ARI) and 4.00% (in terms of R2) over ReCP w/o IV. This finding suggests that the proposed inter-view learning module effectively leverages the multi-view information and highlights the importance of consistency learning across different views.
4) ReCP w/o DP outperforms ReCP w/o IV but performs worse than ReCP, indicating that both the inter-view contrastive learning loss (which maximizes the mutual information between views) and the inter-view dual prediction loss (which minimizes the conditional entropy across them) are important for learning multi-view region representations.

Parameter sensitivity. The parameters λ1 and λ2 govern the weighting of the various losses. We vary their values within the range {0.01, 0.1, 1, 10, 100} to assess their impact on the model performance. As depicted in Figure 4, ReCP achieves satisfactory performance when we set both λ1 and λ2 to 1.

Figure 4: Parameter analysis on both downstream tasks: (a) land usage clustering and (b) region popularity prediction.

Related Work

Traditional methods for region embedding typically utilize human mobility data to analyze the transition patterns between urban regions. These methods are often based on the word2vec framework and learn the latent representations of regions (Wang and Li 2017; Yao et al. 2018). In a similar vein, Wu et al. (2022) incorporate mobility graphs with spatio-temporal similarity as mobility patterns and propose multi-level cross-attention mechanisms to extract comprehensive region representations from these patterns. Additionally, some studies focus on leveraging the inherent attributes of regions to learn latent representations. For instance, Zhang et al. (2019) construct multiple spatial graphs to represent the geographic structure of regions. By transforming the region embedding problem into a graph embedding problem, they primarily capture the spatial structure within regions and the spatial autocorrelation between regions. Another approach, proposed by Wang, Li, and Rajagopal (2020), involves mining street views and the textual information of POIs within regions to learn representations.

Moreover, there have been studies that learn region representations by incorporating both attribute features within regions and mobility patterns between regions. For instance, Fu et al. (2019) propose an autoencoder framework that effectively captures inter-region correlations and intra-region structural information during the process of region embedding. Zhang et al. (2021) model multi-view region correlations by leveraging human mobility data and inherent region attributes, and employ a graph attention mechanism to acquire region representations from each view of the established correlations. Zhou et al. (2023) learn relation-specific region representations from various types of relations in a heterogeneous graph constructed using human mobility, POI data, and the geographic neighbors of regions; they devise an attention-based fusion technique to integrate the shared information among different types of correlations. Additionally, Zhang, Long, and Cong (2022) introduce a multi-view region embedding model based on contrastive learning, which incorporates an intra-view contrastive learning module to discern distinct representations and an inter-view contrastive learning module to facilitate the transfer of knowledge across multiple views.

In this paper, we form a new pipeline based on the consistency learning paradigm for multi-view region embedding. Under the hood, we propose a multi-view Contrastive Prediction model for urban Region embedding (ReCP) by exploring the consistency across two views, leveraging both POI and human mobility data.
The ReCP model consists of two modules: an intra-view learning module that utilizes contrastive learning and feature reconstruction to learn region representations specific to each view, and an inter-view learning module utilizing a contrastive prediction learning scheme that enhances the consistency between the two views. To evaluate the effectiveness of our proposed model, we conduct comprehensive experiments on two downstream tasks: land use clustering and region popularity prediction. The experimental results demonstrate that the proposed ReCP model outperforms state-of-the-art embedding methods, showing that retaining consistency across views is pivotal for effective region embedding.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants No. 61906107 and 62202270, the Young Scholars Program of Shandong University, the Taishan Scholar Project of Shandong Province (tsqn202306066), and the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources. W.H. was supported by the Knut and Alice Wallenberg Foundation.

References

Berg, B. F. 2007. New York City Politics: Governing Gotham. Rutgers University Press.
Chen, M.; Yu, X.; and Liu, Y. 2018. PCNN: Deep convolutional networks for short-term traffic congestion prediction. IEEE Transactions on Intelligent Transportation Systems, 19(11): 3550-3559.
Fu, Y.; Wang, P.; Du, J.; Wu, L.; and Li, X. 2019. Efficient region embedding with multi-view spatial networks: A perspective of locality-constrained spatial autocorrelations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 906-913.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729-9738.
Huang, W.; Zhang, D.; Mai, G.; Guo, X.; and Cui, L. 2023. Learning urban region representations with POIs and hierarchical graph infomax. ISPRS Journal of Photogrammetry and Remote Sensing, 196: 134-145.
Li, T.; Xin, S.; Xi, Y.; Tarkoma, S.; Hui, P.; and Li, Y. 2022. Predicting multi-level socioeconomic indicators from structural urban imagery. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3282-3291.
Li, Y.; Huang, W.; Cong, G.; Wang, H.; and Wang, Z. 2023. Urban region representation learning with OpenStreetMap building footprints. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1363-1373.
Lin, Y.; Gou, Y.; Liu, Z.; Li, B.; Lv, J.; and Peng, X. 2021. COMPLETER: Incomplete multi-view clustering via contrastive prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11174-11183.
Liu, C.; Yang, Y.; Yao, Z.; Xu, Y.; Chen, W.; Yue, L.; and Wu, H. 2021. Discovering urban functions of high-definition zoning with continuous human traces. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1048-1057.
Liu, Y.; Zhang, X.; Ding, J.; Xi, Y.; and Li, Y. 2023. Knowledge-infused contrastive learning for urban imagery-based socioeconomic prediction. In Proceedings of the ACM Web Conference 2023, 4150-4160.
Luo, Y.; Chung, F.-l.; and Chen, K. 2022. Urban region profiling via multi-graph representation learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4294-4298.
Tsai, Y.-H.; Wu, Y.; Salakhutdinov, R.; and Morency, L.-P. 2021. Self-supervised learning from a multi-view perspective. In Proceedings of the International Conference on Learning Representations.
Wang, H.; and Li, Z. 2017. Region representation learning via mobility flow. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, 237-246.
Wang, Z.; Li, H.; and Rajagopal, R. 2020. Urban2Vec: Incorporating street view imagery and POIs for multi-modal urban neighborhood embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 1013-1020.
Wu, S.; Yan, X.; Fan, X.; Pan, S.; Zhu, S.; Zheng, C.; Cheng, M.; and Wang, C. 2022. Multi-graph fusion networks for urban region embedding. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2312-2318.
Xu, R.; Chen, M.; Gong, Y.; Liu, Y.; Yu, X.; and Nie, L. 2023a. TME: Tree-guided multi-task embedding learning towards semantic venue annotation. ACM Transactions on Information Systems, 41(4).
Xu, R.; Huang, W.; Zhao, J.; Chen, M.; and Nie, L. 2023b. A spatial and adversarial representation learning approach for land use classification with POIs. ACM Transactions on Intelligent Systems and Technology, 14(6): 1-25.
Yang, D.; Zhang, D.; Zheng, V. W.; and Yu, Z. 2014. Modeling user activity preference by leveraging user spatial-temporal characteristics in LBSNs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1): 129-142.
Yao, Z.; Fu, Y.; Liu, B.; Hu, W.; and Xiong, H. 2018. Representing urban functions through zone embedding with human mobility patterns. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence.
Zhang, C.; Zhao, K.; and Chen, M. 2022. Beyond the limits of predictability in human mobility prediction: Context-transition predictability. IEEE Transactions on Knowledge and Data Engineering, 35(5): 4514-4526.
Zhang, D.; Xu, R.; Huang, W.; Zhao, K.; and Chen, M. 2023. Towards an integrated view of semantic annotation for POIs with spatial and textual information. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2441-2449.
Zhang, L.; Long, C.; and Cong, G. 2022. Region embedding with intra and inter-view contrastive learning. IEEE Transactions on Knowledge and Data Engineering.
Zhang, M.; Li, T.; Li, Y.; and Hui, P. 2021. Multi-view joint graph representation learning for urban region embedding. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 4431-4437.
Zhang, Y.; Fu, Y.; Wang, P.; Li, X.; and Zheng, Y. 2019. Unifying inter-region autocorrelation and intra-region structures for spatial embedding via collective adversarial learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1700-1708.
Zheng, B.; Bi, L.; Cao, J.; Chai, H.; Fang, J.; Chen, L.; Gao, Y.; Zhou, X.; and Jensen, C. S. 2021. SpeakNav: Voice-based route description language understanding for template-driven path search. Proceedings of the VLDB Endowment, 14(12): 3056-3068.
Zheng, B.; Huang, C.; Jensen, C. S.; Chen, L.; Hung, N. Q. V.; Liu, G.; Li, G.; and Zheng, K. 2020. Online trichromatic pickup and delivery scheduling in spatial crowdsourcing. In 2020 IEEE 36th International Conference on Data Engineering, 973-984. IEEE.
Zhou, S.; He, D.; Chen, L.; Shang, S.; and Han, P. 2023. Heterogeneous region embedding with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 4981-4989.