# realtime_route_search_by_locations__7bc10764.pdf

The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

Real-Time Route Search by Locations

Lisi Chen,1,2 Shuo Shang,1 Tao Guo3

1UESTC, China 2Inception Institute of Artiﬁcial Intelligence, UAE 3Google, Singapore {chenlisi.cs, jedi.shang}@gmail.com, darkgt@google.com

With the proliferation of GPS-based data (e.g., routes and trajectories), it is of great importance to enable the functionality of real-time route search and recommendations. We deﬁne and study a novel Continuous Route-Search-by-Location (CRSL) problem to enable real-time route search by locations for a large number of users over route data streams. Given a set of C-RSL queries where each query q contains a set of places q.O to visit and a threshold q.θ, we continuously feed each query q with routes that has similarity to q.O no less than q.θ. We also extend our proposal to support top-k C-RSL problem where each query continuously maintains k most similar routes. The C-RSL problem targets a variety of applications, including real-time route planning, ridesharing, and other location-based services that have real-time demand. To enable efﬁcient route matching on a large number of CRSL queries, we develop novel parallel route matching algorithms with good time complexity. Extensive experiments with real data offer insight into the performance of our algorithms, indicating that our proposal is capable of achieving high efﬁciency and scalability.

Introduction

With the continued proliferations of GPS-enabled devices (e.g., vehicle navigation systems and smart phones), online map-based services (e.g., Google Maps), and ridesharing services (e.g., Di Di, Uber, and Grab), travel route data is being generated rapidly. For example, the average number of new taxi trips per day from New York City in 2017 is well over 300K1. The availability of massive-scale route data fosters a line of research on RSL-query: Given a collection of routes and a set of query locations, ﬁnds a subset of routes that are spatially close to the query locations (Chen et al. 2010; Shang et al. 2012; 2014). In most existing studies, the RSL query is deﬁned as a one-time query that search from a static collection of routes. However, such one-time query is not capable of delivering users instant results or keeping them updated with most recent results over a stream of route data (Li et al. 2013; Chen, Cong, and Cao 2013). This motivates us to study a novel

Corresponding author. Copyright c 2020, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved. 1https://data.cityofnewyork.us/browse?q=trip

Figure 1: An example of the C-RSL problem

Continuous Route-Search-by-Location (C-RSL) problem. Consider the example in Figure 1, where q1 = {o1, o2, o3}, q2 = {o4, o5, o6, o7}, and q3 = {o8, o9, o10, o11} are three C-RSL queries. Here, o1 o11 are query locations (userspeciﬁed places), and τ1, τ2, and τ3 are routes that arrive in a streaming fashion. When the trip of τ1 completes (8:45), we compute the similarity between τ1 and each C-RSL query (i.e., Sim( , )). If the similarity between a new route and a query is no less than a user-speciﬁed threshold (i.e., qi.θ in Figure 1), we say that the query matches the new route and we deliver the new route to the query. From Figure 1, we can see that q1 and q2 match τ1, q2 matches τ2, and q3 matches τ3. Our C-RSL problem is applied in dynamic spatial networks. The reason is that in many real-life scenarios, edge weights of networks (e.g., travel times, taxi fares) are changing over time (Pfoser, Tryfona, and Voisard 2006; Ding, Yu, and Qin 2008; Hua and Pei 2010). We adopt aggregatecost measurement (i.e., the sum of (travel) costs between query locations and a route) (Shang et al. 2014; 2017b; Chen et al. 2010) to measure the similarity between a route and query locations. The C-RSL problem aims to feed a large number of CRSL queries with similar routes in a real-time fashion. It is challenging due to its high computation complexity. When

a new route is committed, we need to calculate the similarity between the new route and each query, which is very time consuming. To alleviate the computational cost, we develop Query Location Expansion Matching (QLOE) algorithm that explores the spatial network from each vertex containing query locations through network expansion (Dijkstra 1959) and retrieve the C-RSL queries that match the new route. Expansion from each vertex occurs in parallel. Our QLOE algorithm substantially reduces the time cost of processing a new route. In addition, we propose an optimization for the QLOE algorithm (i.e., QLOE+) by developing vertex ﬁltering and group expansion techniques that are able to ﬁlter out unqualiﬁed queries in early stage and combine network expansion results from different vertices, respectively. Experimental study shows that QLOE+ outperforms QLOE by at least an order of magnitude regarding the efﬁciency of route processing. Our contributions can be summarized as follows. First, we study a novel problem, the C-RSL problem, that continuously feeds a large number of users with routes similar to their query locations, thus targeting applications such as real-time route planning and recommendation, ridesharing, and other location-based services that have real-time demand. Second, we develop two efﬁcient algorithms, QLOE and QLOE+, to match routes with a large number of CRSL queries in a real-time fashion. Further, we extend our QLQE series to support top-k C-RSL problem where each query continuously maintains k most similar routes. Third, we conduct experiments on large route datasets to comprehensively study the performance of the algorithms. Our experiment results conﬁrm the capability of QLOE+ to handle C-RSL queries over real-world route streams.

Preliminaries and Problem Formulation

This section introduces road networks, routes, the RSL query, and the C-RSL problem.

Road Networks and Routes

Following existing studies (Chen et al. 2019a; Shang et al. 2017b), we formulate road network by a connected, undirected graph G = (V, E, F, W), where V is a vertex set and E {{vi, vj}|vi, vj V vi = vj} is an edge set. A vertex vi V denotes an intersection or an end point, and an edge e = {vi, vj} E denotes a road segment connecting between vi and vj. Function F maps a vertex to a spatial point location with longitude and latitude and maps an edge to a polyline that represents the road segment. Function W : (E, t) R assigns a real-valued time-dependent weight W(e, t) to an edge e that denotes the travel cost of the road segment at time t. We use lc(vi, vj) to denote the lowest cost between vi and vj and use LP(vi, vj) to denote the path with the lowest cost between vi and vj. This modeling of road networks aligns with a number of previous studies (Chen et al. 2010; Shang et al. 2014; Chen et al. 2019a; Shang et al. 2017b). Here, a route is deﬁned by Deﬁnition 1. Deﬁnition 1: (Route) Route τ is a sequence p1, p2, ..., pn that consists of at least 2 vertices (points), where pi and pi+1 (i [1, n 1]) are adjacent vertices in V . 2

RSL Query and Cost Measures Given a query location o and a route τ, the network travel cost c(o, τ) between them is deﬁned by Equation 1.

c(o, τ) = min pi τ{lc(o, pi)}, (1)

where lc(o, pi) denotes the lowest network travel cost between o and pi. Given a set O of query locations and a route τ, the similarity Sim(O, τ) between them is deﬁned according to aggregate travel cost (Shang et al. 2014; Chen et al. 2010; 2019a):

Sim(O, τ) =

o O e c(o,τ) (2)

Based on the travel cost measures, we present the deﬁnition of RSL problem. Deﬁnition 2: (RSL Problem) An RSL query q = {O, θ} consists of set O of query locations {o1, o2, ..., on} and a similarity threshold θ. Given a set R of routes, the RSL query q ﬁnds a subset R of R such that (τ R ) (Sim(q.O, τ) θ) and (τ R \ R ) (Sim(q.O, τ) < θ). 2

C-RSL Problem Based on Deﬁnition 2, we formally deﬁne the C-RSL problem in Deﬁnition 3. Deﬁnition 3: (C-RSL Problem) Given a set Q = {q1, q2, ..., q|Q|} of continuous RSL (C-RSL) queries and a stream of routes, the C-RSL problem aims to maintain an up-to-date result for each RSL query in Q. 2 In our applications, the typical arrival rate of travel routes in a metropolitan area (e.g., New York City) is in the scale of hundreds of thousands a day and we may serve tens of thousands of C-RSL queries at one time. We aim to develop an efﬁcient and scalable solution to maintain the up-to-date results for a large number of C-RSL queries over a stream of routes. Routes and C-RSL queries can be easily maintained by the available memory of modern servers (Chen et al. 2015; Chen and Cong 2015). As a result, our solution is developed under in-memory setting.

Direct Route Matching We propose a baseline algorithm, Direct Route Matching (DRM), to answer the C-RSL problem. Its high-level idea works as follows. When a user completes and commits a trip, a new travel route τ is published. We compute the similarity between each C-RSL query qi Q and τ (i.e., Sim(qi.O, τ)). If Sim(qi.O, τ) qi.θ, we deliver route τ to qi and regard τ as a result of qi. To compute Sim(qi.O, τ), we calculate the network travel cost between τ and each location o in qi.O. Algorithm 1 presents the pseudo code of DRM. The inputs are C-RSL query set Q and a new route τ. The output is a subset of matched C-RSL queries D that route τ will be delivered to. After initializing D (Line 1), we evaluate each RSL query qi Q and check whether the new route τ can be a result of Oi. Speciﬁcally, we visit each location o qi.O and add e c(o,τ) to s based on Equation 1 (Line 5). If the

current accumulated similarity s is no less than qi.θ, we add qi to D (Lines 6 7). After evaluating all qi Q, we return D as the result and deliver τ to each query in D.

Complexity analysis To compute c(o, τ), we need to calculate the lowest travel cost between o and each points in τ. Because the travel cost on each edge is changing over time, it is impossible to pre-compute the lowest costs between all pairs of vertices in network. Consequently, we need to calculate the lowest cost between two vertices online. In this case, the time complexity of DRM is O(|Q| no |τ| (|E|+ |V | log |V |)), where |Q| is the cardinality of C-RSL queries, no denotes the number of locations in each C-RSL query, and |τ| is the number of points in route τ.

Algorithm 1: DRM

Data: C-RSL query set Q, new route τ Result: matched C-RSL queries D

2 for each qi in Q do

4 for each location o in qi.O do

5 s s + e c(o,τ);

6 if s qi.θ then

7 D.add(qi);

8 return D;

Query Location Expansion Matching

To reduce the high complexity of DRM, we propose a Query Location Expansion Matching (QLOE) algorithm. Initially, we generate a subset of V , Vs, where for each vertex v Vs there exists at least one query q such that v q.O. We regard each vertex in Vs as an expansion center. For each center v, we perform network expansion (Dijkstra 1959) to explore the network and to compute the travel cost between v and the new route τ. Once we reach a vertex p such that p τ, the expansion terminates and the travel cost between v and τ is derived. Next, we aggregate the travel cost to each query that contains v as one of its query location. The QLOE terminates when the similarity between each C-RSL query and τ are computed. Note that we deﬁne an upper bound on network cost to prune unqualiﬁed query candidates. The expansion of each vertex can be performed in parallel.

QLOE Algorithm

Pre-processing Given a C-RSL query set Q and the vertex set V of network G, we ﬁrst need to generate a subset Vs V where each v Vs contains at least one query location. This step requires O(|q.O|) when we insert each new C-RSL query q to Q. Before presenting the algorithm, we ﬁrst deﬁne C-RSL tuple, which will be used to record the visiting status and current upper bound of similarity between the C-RSL query and

the new route. C-RSL tuples will be retrieved and updated during network expansion from different query locations. Deﬁnition 4: (C-RSL Tuple) Let τ be a new route. A C-RSL tuple of query q is denoted by T(q) = e, n, ub , which consists of three elements: an entry e (identiﬁer) of query q, the number of visited query locations n in q.O, and the similarity upper bound ub between q and τ. Let q.O+ q.O be a subset of q.O where for each oi q.O+, oi is scanned by network expansion. We compute ub via Equation 3

T(q).ub = |q.O| T(q).n +

vi q.O+ e c(vi,τ) (3)

2 At the beginning, we initialize a C-RSL tuple set T by adding an initial tuple T(q) of each query q. Here, we set T(q).n and T(q).ub to be 0 and |q.O|, respectively. Algorithm 2 presents the pseudo code of QLOE. The inputs are expansion center v, new route τ, and C-RSL tuple set T . The output is an updated tuple set T that ﬁlters out unqualiﬁed tuples.2 We ﬁrst initialize the current vertex we scanned as null (Line 1). Next, we perform a network expansion from vertex v (Lines 2 17). Speciﬁcally, if vc is null, it indicates that the expansion has not been started, so we assign v to vc (Lines 3 4); otherwise, we assign the next vertex based on Dijkstra expansion to vc (Lines 5 6). If vc is a point of τ, we terminate the expansion from v and evaluate each query locations associated with v (Lines 7 16). For each query q of which locations contain vc, we update its tuple T(q) in T . In particular, we add T(q).n by one and update the similarity upper bound of q (i.e., T(q).ub) based on Deﬁnition 4 (Lines 10 11). If T(q).ub is less than the similarity threshold θ deﬁned by q, we remove tuple T(q) from T because τ cannot be a result of q (Lines 12 13); otherwise, we update T(q) to T (Line 15). Note that the expansion of each vertex can be processed in parallel. After running QLOE on each vertex, the C-RSL tuples in T are queries that can include τ as their results. Complexity analysis The worst-case time complexity of running QLOE on all vertices is O((|E|+|V | log |V |) |V |+ no |Q|) where no is the number of query locations of each query. Since the expansion from each vertex can be terminated as soon as one of the points in route τ is reached, it is unnecessary to scan all vertices in real scenario.

QLOE with Optimized Expansion

In QLOE, we need to run network expansion from each non-empty vertex, which is very time consuming. To address this problem, we develop an optimized network expansion approach, QLOE+, to ﬁltering unnecessary expansions. Speciﬁcally, during the expansion from vertex v to new route τ, we record the lowest cost between v and every vertex we scanned. To avoid a vertex being scanned repeatedly by expansions from different vertices, we maintain global vertex map M to record scanned vertices. Further, we deﬁne the concept of vertex ﬁltering condition (Theorem 1) to ﬁlter unqualiﬁed queries associated with a particular vertex.

2We say that T(q) is an unqualiﬁed tuple if q cannot match τ

Algorithm 2: QLOE

Data: Vertex v, new route τ, C-RSL tuples T Result: Updated C-RSL tuples T

3 if vc is null then

6 vc Dijkstra Expansion(v).next();

7 if vc τ then

8 for each query q that contains v do

9 if T(q) exists in T then

10 T(q).n T(q).n + 1;

11 T(q).ub T(q).ub 1 + e lc(v,vc);

12 if T(q).ub < q.θ then

13 T .remove(T(q));

15 Update T(q) in T ;

17 while Dijkstra Expansion(v).has Next();

Theorem 1: Let Qv be a set of queries associated with vertex v and vc be the current vertex being visited by network expansion. All queries in Qv can be safely ﬁltered if the following inequality is satisﬁed. max qi Qv{|qi.O|} 1 + exp( lc(v, vc)) < min qi Qv{qi.θ}. (4)

Proof: Because the expansion terminates when we reach the new route τ, we have lc(v, vc) min vi τ{lc(v, vi)}. Based

on Equations 1 and 2, we have lc(v, vc) c(v, τ). Thus, if Inequality 4 is satisﬁed, we can deduce that (qi Qv) (Sim(qi.O, τ) qi.θ). We complete the proof. Algorithm 3 presents the pseudo code of QLOE+. First, we initialize H, vc, and Qv (Lines 1 3). Here, H is used to record scanned vertices vc and the lowest cost between v and vc (i.e., vc, lc(v, vc) ) in network expansion from v. Then we perform the network expansion. When we reach a new vertex vc, we check if query set Qv satisﬁes vertex ﬁltering condition (Theorem 1). If so, we add v to M, remove all tuples of queries in Qv from T , and terminate the expansion (Lines 9 12). If there exists at least one query associated with vc and it has never been scanned, we add vc, lc(v, vc) to H (Line 14). If vc τ, we evaluate queries associated with each vi in H. Speciﬁcally, we ﬁrst add vi to M denoting that vi has been scanned (Line 17). Then we update the C-RSL tuples of each query q that contains vi (Lines 18 25). The steps are the same as those in QLOE (Algorithm 2). The expansion of each vertex can be processed in parallel. Recall that QLOE+ maintains global vertex map M to record scanned vertices. We do not need to perform expansion from a scanned vertex. Thus, when an expansion thread is completed, we randomly pick a vertex v V \M and perform expansion from v. When M = V , which means that all vertices have been scanned, we return T as the result.

Algorithm 3: QLOE+

Data: Vertex v, new route τ, C-RSL tuples T , vertex map M Result: Updated C-RSL tuples T , updated vertex map M

3 Qv set of queries associated with v;

5 if vc is null then

8 vc Dijkstra Expansion(v).next();

9 if Qv satisﬁes vertex ﬁltering condition then

10 M.add(v);

11 T .remove All(T(Qv));

13 if there exists a query that contains vc then

14 H.add( vc, lc(v, vc) );

15 if vc τ then

16 for each vi, H do

17 M.add(vi);

18 for each query q that contains vi do

19 if T(q) exists in T then

20 T(q).n T(q).n + 1;

21 T(q).ub T(q).ub 1 + e lc(vi,vc);

22 if T(q).ub < q.θ then

23 T .remove(T(q));

25 Update T(q) in T ;

27 while Dijkstra Expansion(v).has Next();

Extension of Processing Top-k C-RSL Queries This section discuss how to extend our QLOE series to support top-k continuous RSL query, which is deﬁned by Deﬁnition 5. Deﬁnition 5: (Top-k C-RSL Problem) A top-k C-RSL query q = O, k consists of set O of query locations {o1, o2, ..., on} and the number of results k. Given a set R of routes, the top-k RSL query q ﬁnds k routes that has the highest similarities to O. Given a set Q = {q1, q2, ..., q|Q|} of top-k C-RSL queries and a stream of routes, the C-RSL problem aims to maintain a top-k result for each query in Q. 2 Recall that for threshold-based C-RSL problem, the route matching is solely based on a static similarity threshold q.θ speciﬁed by each query q. While for top-k C-RSL problem, we need to maintain a dynamic set of k most similar routes. To solve the problem, for each top-k C-RSL query q we regard the similarity between q and the q.τk (i.e., the k-th result route maintained by q) as the similar-

ity threshold . Speciﬁcally, we a new route τn arrives, for each query q we compute the similarity between q.O and τ. If Sim(q.O, τn) > Sim(q.O, q.τk), we replace τk by τn and update q s result set; Otherwise, we discard τn. Note that when the result of q is updated, we also need to update the similarity threshold because the original q.τk is changed. As a result, both QLOE and QLOE+ can support the processing of top-k C-RSL queries by regarding q.τk in top-k C-RSL as q.θ in threshold-based C-RSL.

Experimental Study This section reports our experiments conducted on real road networks and route data sets. The results offer insight into the efﬁciency and scalability of the proposed algorithms.

Experiment Settings Data sets and query generation We use two datasets in our experiments: Beijing Road Network (BJR) and the New York Road Network (NYR)3. BJR consists of a road network of Beijing and a real taxi trajectory data set collected by the T-drive project (Yuan et al. 2013). NYR consists of a road network of New York City and a taxi trip data set from New York3, which only contains pick-up and drop-off locations of a taxis. To generate the route of a taxi trip by deriving the shortest path from the pick-up location to the drop-off location. To simulate dynamic edge weights on road networks, we let d% of edges change by a random ratio ranging from 0.8x to 1.2x each time we complete the processing of a new route (Shang et al. 2016). Note that d is a parameter set as 1. The efﬁciency of our methods is not inﬂuenced by d. A C-RSL query location set q.O is generated as follows: First, we sample 100K routes and calculate the number of routes passing each vertex. Next, we pick vertices based on the following probability formulation:

where P(v) is the probability that v is picked, S is a set of sampled routes, and nv denotes the number of routes passing v.

Implementations The road network graphs, routes, and C-RSL queries are memory resident. All algorithms are implemented in Java and run on a server with two Intel Xeon Processors Gold 5120 (2.20GHz) and 64GB RAM. The performance metric is runtime of processing a new route (i.e., ﬁnding a subset of queries that match a new route). We evaluate the following three methods: DRM, QLOE, and QLOE+. To enhance the efﬁciency of DRM, we let the computation of similarity between a new route and each query be processed in parallel (Algorithm 1, Lines 3 7). The parameter settings are listed in Table 1.

Experimental Results Effect of the number of queries Figure 2 presents the performance of the algorithms when varying the number

3https://publish.illinois.edu/dbwork/open-data/

Table 1: Parameter Settings

BJR NYR Number of queries 20,000 100,000 / default 20,000

20,000 100,000 / default 20,000 Cardinality of query location set |q.O|

3 7 / default 5 3 7 / default 5

Similarity threshold θ ( |q.O|)

0.80 0.95 / default random

0.80 0.95 / default random Number of results k

5 30 / default 10

5 30 / default 10 Thread count m 16 48 / default 48

16 48 / default 48

20K 40K 60K 80K 100K

Runtime (s)

Number of Queries

20K 40K 60K 80K 100K

Runtime (s)

Number of Queries

Figure 2: Effect of the number of queries (runtime)

1 10 100 1000 10000

20K 40K 60K 80K 100K

# Processed Routes

Number of Queries

1 10 100 1000 10000

20K 40K 60K 80K 100K

# Processed Routes

Number of Queries

Figure 3: Effect of the number of queries (arrival rates)

Runtime (s)

Number of Locations in a Query

Runtime (s)

Number of Locations in a Query

Figure 4: Effect of the number of locations in a query

of C-RSL queries from 20K to 100K. We can see that the time cost of processing a new route increases for all three methods when we increase the number of queries. In particular, QLOE outperforms DRM by approximately an or-

0.80 0.85 0.90 0.95

Runtime (s)

Similarity Threshold

0.80 0.85 0.90 0.95

Runtime (s)

Similarity Threshold

Figure 5: Effect of similarity threshold, θ

20K 40K 60K 80K 100K

Runtime (s)

Number of Queries

20K 40K 60K 80K 100K

Runtime (s)

Number of Queries

Figure 6: Effect of the number of locations in a query (top-k)

Runtime (s)

Number of Results, k

Runtime (s)

Number of Results, k

Figure 7: Effect of the number of results

16 24 32 40 48

Runtime (s)

Thread Count

16 24 32 40 48

Runtime (s)

Thread Count

Figure 8: Effect of thread counts

der of magnitude for both BJR and NYR. Such signiﬁcant performance discrepancy can be explained as follows: (1) The time complexity of QLOE is lower than that of DRM; (2) For QLOE, each expansion thread completes as soon as one of the point (vertex) of new route τ is scanned, which substantially improves the efﬁciency of route processing. In addition, compared with QLOE, QLOE+ further improves the route processing efﬁciency by at least an order of magni-

tude. Such conspicuous performance improvement conﬁrms the effectiveness of our vertex ﬁltering technique and optimized network expansion. Figure 3 reports the maximum arrival rate (number of routes per hour) of the route stream that can be supported by each method given the number of subscriptions varying from 20K to 100K. We can see that when we have 20K queries, QLOE+ can support the data stream with the rates of 15K routes/h and 6K routes/h on BJR and NYR, respectively. In contrast, DRM can only support the stream with the rates of 83 routes/h and 16 routes/h on BJR and NYR, respectively. Even if we increase the number of queries to 100K, QLOE+ can still support the data stream with the rates of 6.1K routes/h and 2.2K routes/h on BJR and NYR, respectively, which are 12x and 21x better than the performances of QLOE on BJR and NYR, respectively. In the real-world scenario, the average route arrival rate in NYC is 15K/h4. Theoretically, QLOE+ is able to handle 100K C-RSL queries over the real route data stream if we deploy 7 modern servers. In contrast, QLOE requires more than 140 servers to handle the real route data stream in NYC.

Effect of the number of locations in a query |q.O| Figure 4 shows the effect of varying |q.O| on the efﬁciency of the algorithms. A larger |q.O| implies: (1) A larger number of vertices to be expanded and evaluated for QLOE and QLOE+; (2) A larger number shortest path computations occur for DRM. Thus, when we increase the number of locations in C-RSL query, the runtime of route processing increases for all methods.

Effect of similarity threshold θ This set of experiments investigates the effect of similarity threshold θ. Figure 5 shows the results when we vary the similarity threshold θ. We can see that the performance of DRM and QLOE is consistent as we vary θ from 0.8 to 0.95. However, the runtime of QLOE+ exhibits a decreasing trend as we increase the value of θ. The reason is that the value of θ has little inﬂuence on the number of shortest path computations and search space. However, the vertex ﬁltering power in QLOE+ becomes stronger and more queries will be ﬁltered out when we increase θ. As a result, the efﬁciency of QLOE+ improves substantially when we increase θ.

Extension of processing top-k C-RSL queries Figure 6 shows the performance of the three methods for processing top-k C-RSL queries when we vary the number of queries from 20K to 100K. Compared with the performance of processing threshold-based C-RSL queries, we observe a increment of time cost for processing top-k C-RSL queries. In particular, the runtime of DRM, QLOE, and QLOE+ are increased by factors of 1.1, 1.2 1.6, and 2.5 4.0, respectively. The signiﬁcant performance contrast regarding QLOE+ can be explained by the fact that QLOE+ requires extra time cost for maintaining the dynamic value of the similarity between q.O and q.τk for each query q, which may lower the effectiveness of the vertex ﬁltering condition. Figure 7 presents the effect of the number of result routes maintained by each query. We can see that the performance

4https://data.cityofnewyork.us/browse?q=trips

of DRM remains consistent as we vary the value of k. As for QLOE series, we observe a slight increasing trend regarding the time cost as we increase k. The reason is that higher value of k is likely to lower the value of the similarity between q.O and q.τk for each query q. Therefore, each time a new route arrives, the average number of queries that have their result set updated will increase, which may increase the time cost of route matching. Nevertheless, even though we increase k from 5 to 30, the average runtime of QLOE and QLOE+ only increases by a factor of 1.2 and 1.4, respectively.

Effect of thread counts We study the effect of thread count on the efﬁciency of the algorithms. The results in Figure 8 show that QLOE+ consistently outperforms QLOE and DRM by at least an order of magnitude and two orders of magnitude, respectively, when we vary the number of threads from 16 to 48.

Related Work Route search based on locations Route search based on locations aims at ﬁnding routes that have the highest relevances to query arguments (Frentzos, Gratsias, and Theodoridis 2007; Zheng et al. 2013; Shang et al. 2012; Zheng et al. 2016; Shang et al. 2017a; 2019). Existing studies deﬁne different ranking functions to measure the relevancy between query locations and routes. Speciﬁcally, ranking functions may consider spatial (Chen et al. 2010), temporal (Shang et al. 2014), textual (Shang et al. 2012; Zheng et al. 2013), and density elements. However, existing studies regard routes or trajectories as a collection of static data and their query is performed in one-time fashion. In contrast, the C-RSL query is continuous and the routes are modeled as data streams instead of a collection of data.

Route similarity joins This line of research focuses on the problem of ﬁnding route pairs or trajectory pairs with similarities higher than a threshold (Ding, Trajcevski, and Scheuermann 2008; Ray et al. 2015; Chen and Patel 2009; Bakalov et al. 2005; Bakalov and Tsotras 2006; Ta et al. 2017; Shang et al. 2017b; 2018). Existing studies solve the problem of trajectory similarity join by the following steps: (1) similarity deﬁnition step and (2) join processing step. In similarity deﬁnition step, a ranking function that evaluates the similarity between two trajectories is deﬁned. Basically, the function takes spatial proximity and temporal relevancy into account. For join processing step, efﬁcient searching algorithm is proposed to ﬁnd all trajectory pairs whose similarities are above a user-deﬁned threshold. However, existing studies on route similarity joins take a collection of trajectories or routes as input while our problem takes a stream of routes as input. As a result, the aforementioned proposals cannot be applied to solve our C-RSL problem.

Location-based continuous query Different from onetime queries that ﬁnd items based on a snapshot of a database, continuous queries receive items that satisfy their query predicates over data streams in a real-time fashion (Chen et al. 2019b). A host of studies exist that aim at developing query indexing structures and streaming item

processing algorithms to handle a large number of locationbased continuous queries over data streams (Li et al. 2013; Guo et al. 2015; Wang et al. 2015; Chen et al. 2018; Xu et al. 2017; Yang et al. 2019). The data handled by these studies is a stream of individual geo-tagged items (e.g., geotagged tweets). In contrast, the input data of the C-RSL problem is a stream of location sequences, which calls for different purposeful query indices and matching algorithms.

Conclusions We propose and study C-RSL problem to enable continuous route-search-by-locations for a large number of users over route data streams. We develop two parallel route matching algorithms: Query Location Expansion Matching (QLOE) and QLOE with optimized expansion techniques (QLOE+), which substantially reduce the time complexity of answering the C-RSL problem in comparison to the baseline. Extensive experiment with real data demonstrates that QLOE and QLOE+ are capable of achieving high efﬁciency and scalability in answering C-RSL problem.

Acknowledgments This work is supported by the National Natural Science Foundation of China (No. 61932004).

References Bakalov, P., and Tsotras, V. J. 2006. Continuous spatiotemporal trajectory joins. In GSN, 109 128. Bakalov, P.; Hadjieleftheriou, M.; Keogh, E. J.; and Tsotras, V. J. 2005. Efﬁcient trajectory joins using symbolic representations. In MDM, 86 93. Chen, L., and Cong, G. 2015. Diversity-aware top-k publish/subscribe for text stream. In SIGMOD, 347 362. Chen, Y., and Patel, J. M. 2009. Design and evaluation of trajectory join algorithms. In ACM-GIS, 266 275. Chen, Z.; Shen, H. T.; Zhou, X.; Zheng, Y.; and Xie, X. 2010. Searching trajectories by locations: an efﬁciency study. In SIGMOD, 255 266. Chen, L.; Cong, G.; Cao, X.; and Tan, K. 2015. Temporal spatial-keyword top-k publish/subscribe. In ICDE, 255 266. Chen, L.; Shang, S.; Zhang, Z.; Cao, X.; Jensen, C. S.; and Kalnis, P. 2018. Location-aware top-k term publish/subscribe. In ICDE, 749 760. Chen, L.; Shang, S.; Jensen, C. S.; Yao, B.; Zhang, Z.; and Shao, L. 2019a. Effective and efﬁcient reuse of past travel behavior for route recommendation. In KDD, 488 498. Chen, L.; Shang, S.; Yang, C.; and Li, J. 2019b. Spatial keyword search: a survey. Geo Informatica. Chen, L.; Cong, G.; and Cao, X. 2013. An efﬁcient query indexing mechanism for ﬁltering geo-textual data. In SIGMOD, 749 760. Dijkstra, E. W. 1959. A note on two problems in connection with graphs. Numerische Math 1:269 271. Ding, H.; Trajcevski, G.; and Scheuermann, P. 2008. Efﬁcient similarity join of large sets of moving object trajectories. In TIME, 79 87.

Ding, B.; Yu, J. X.; and Qin, L. 2008. Finding timedependent shortest paths over large graphs. In EDBT, 205 216. Frentzos, E.; Gratsias, K.; and Theodoridis, Y. 2007. Indexbased most similar trajectory search. In ICDE, 816 825. Guo, L.; Zhang, D.; Li, G.; Tan, K.; and Bao, Z. 2015. Location-aware pub/sub system: When continuous moving queries meet dynamic event streams. In SIGMOD, 843 857. Hua, M., and Pei, J. 2010. Probabilistic path queries in road networks: trafﬁc uncertainty aware path selection. In EDBT, 347 358. Li, G.; Wang, Y.; Wang, T.; and Feng, J. 2013. Locationaware publish/subscribe. In KDD, 802 810. Pfoser, D.; Tryfona, N.; and Voisard, A. 2006. Dynamic travel time maps - enabling efﬁcient navigation. In SSDBM, 369 378. Ray, S.; Brown, A. D.; Koudas, N.; Blanco, R.; and Goel, A. K. 2015. Parallel in-memory trajectory-based spatiotemporal topological join. In Big Data, 361 370. IEEE. Shang, S.; Ding, R.; Yuan, B.; Xie, K.; Zheng, K.; and Kalnis, P. 2012. User oriented trajectory search for trip recommendation. In EDBT, 156 167. Shang, S.; Ding, R.; Zheng, K.; Jensen, C. S.; Kalnis, P.; and Zhou, X. 2014. Personalized trajectory matching in spatial networks. VLDB J. 23(3):449 468. Shang, S.; Chen, L.; Wei, Z.; Guo, D.; and Wen, J. 2016. Dynamic shortest path monitoring in spatial networks. J. Comput. Sci. Technol. 31(4):637 648. Shang, S.; Chen, L.; Jensen, C. S.; Wen, J.; and Kalnis, P. 2017a. Searching trajectories by regions of interest. IEEE Trans. Knowl. Data Eng. 29(7):1549 1562. Shang, S.; Chen, L.; Wei, Z.; Jensen, C. S.; Zheng, K.; and Kalnis, P. 2017b. Trajectory similarity join in spatial networks. PVLDB 10(11):1178 1189. Shang, S.; Chen, L.; Wei, Z.; Jensen, C. S.; Zheng, K.; and Kalnis, P. 2018. Parallel trajectory similarity joins in spatial networks. VLDB J. 27(3):395 420. Shang, S.; Chen, L.; Zheng, K.; Jensen, C. S.; Wei, Z.; and Kalnis, P. 2019. Parallel trajectory-to-location join. IEEE Trans. Knowl. Data Eng. 31(6):1194 1207. Ta, N.; Li, G.; Xie, Y.; Li, C.; Hao, S.; and Feng, J. 2017. Signature-based trajectory similarity join. IEEE Trans. Knowl. Data Eng. 29(4):870 883. Wang, X.; Zhang, Y.; Zhang, W.; Lin, X.; and Wang, W. 2015. Ap-tree: Efﬁciently support continuous spatialkeyword queries over stream. In ICDE, 1107 1118. Xu, Y.; Chen, L.; Yao, B.; Shang, S.; Zhu, S.; Zheng, K.; and Li, F. 2017. Location-based top-k term querying over sliding window. In WISE, 299 314. Yang, C.; Chen, L.; Shang, S.; Zhu, F.; Liu, L.; and Shao, L. 2019. Toward efﬁcient navigation of massive-scale geotextual streams. In IJCAI, 4838 4845. Yuan, J.; Zheng, Y.; Xie, X.; and Sun, G. 2013. T-drive: Enhancing driving directions with taxi drivers intelligence. IEEE Trans. Knowl. Data Eng. 25(1):220 232.

Zheng, K.; Shang, S.; Yuan, N. J.; and Yang, Y. 2013. Towards efﬁcient search for activity trajectories. In ICDE, 230 241. Zheng, K.; Zheng, B.; Xu, J.; Liu, G.; Liu, A.; and Li, Z. 2016. Popularity-aware spatial keyword search on activity trajectories. World Wide Web 19(6):1 25, online ﬁrst.