Optimal Region Search with Submodular Maximization
Xuefeng Chen1, Xin Cao1, Yifeng Zeng2, Yixiang Fang1 and Bin Yao3
1 School of Computer Science and Engineering, University of New South Wales, Australia
2 Department of Computer & Information Sciences, Northumbria University, UK
3 Department of Computer Science and Engineering, Shanghai Jiaotong University, China
{xuefeng.chen, xin.cao, yixiang.fang}@unsw.edu.au, yifeng.zeng.uk@gmail.com, yaobin@cs.sjtu.edu.cn

Abstract
Region search is an important problem in location-based services due to its wide applications. In this paper, we study the problem of optimal region search with submodular maximization (ORS-SM). This problem considers a region as a connected subgraph. We compute an objective value over the locations in the region using a submodular function and a budget value by summing up the costs of the edges in the region, and we aim to find the region with the largest objective score under a budget constraint. ORS-SM supports many applications such as the most diversified region search. We prove that the problem is NP-hard and develop two approximation algorithms with guaranteed error bounds. We conduct experiments on two applications using three real-world datasets. The results demonstrate that our algorithms can achieve high-quality solutions and are faster than a state-of-the-art method by orders of magnitude.

1 Introduction
With the proliferation of mobile devices and Location-Based Services (LBS) (e.g., Google Maps, Foursquare), a huge amount of geo-tagged data is being generated every day. Geo-tagged data carries rich information including timestamps, geo-locations, and comments, which facilitates a large number of location-based applications.
As an important problem in LBS, region search has wide applications such as user exploration in a city [Feng et al., 2016; Cao et al., 2014], similar region queries [Feng et al., 2019a], region detection [Feng et al., 2019b], etc. Most existing studies consider a region as a rectangle with a fixed length and width [Nandy and Bhattacharya, 1995; Choi et al., 2012; Tao et al., 2013; Feng et al., 2016]. This is not general in practice, since regions might be of arbitrary shapes. To address this limitation, Cao et al. [Cao et al., 2014] define a region as a connected subgraph and propose the Length-Constrained Maximum-Sum Region (LCMSR) query. It considers two attributes for region search: an objective value of the region, computed by summing up the weights of the nodes in the region, and a budget value, computed by summing up the costs of the edges in the region.

Figure 1: A toy road network with POIs (nodes are annotated with keyword sets such as {coffee}, {mall}, and {mall, bar}).

The target of the LCMSR query is to find the region that has the largest objective value under a given budget constraint. A simple accumulative function (e.g., SUM) cannot measure the objective value in some applications (e.g., the influence or the diversity of a region [Feng et al., 2016], the likelihood of information [Leskovec et al., 2007], the entropy of locations [Zhang and Vorobeychik, 2016]). In this paper, we use a submodular function to compute the objective score of a region defined as a connected subgraph, which makes the problem more challenging. We denote this problem as optimal region search with submodular maximization (ORS-SM). We use the example network shown in Figure 1 to illustrate the problem; it consists of 6 locations connected by 8 edges. Each location is associated with some keywords (e.g., mall, coffee) that represent its POI categories, and each edge has a travel time cost. Consider a user traveling in a city.
She would like to explore the most diversified region to experience the most types of POIs, and she does not want to spend too much time on the road. We can use ORS-SM to find solutions for the user by computing the objective score of a region as the number of different types of POIs contained in the region and setting a travel time limit Δ. Given Δ = 5, LCMSR returns the region R1 = {e1 = (v2, v4), e2 = (v4, v5), e3 = (v4, v6)}, as R1 has the largest total number of keywords (counting duplicates), but R1 is not the most diversified region since it contains only two types of locations. By contrast, ORS-SM returns R2 = {e1 = (v1, v3), e2 = (v3, v5), e3 = (v4, v5)}, which covers the most distinct keywords.

As the objective function we use is submodular, the ORS-SM problem is a type of submodular optimization [Krause and Golovin, 2014], which is extremely hard [Zeng et al., 2015; Zhang and Vorobeychik, 2016; Stan et al., 2017; Crawford, 2019]. We prove that solving the ORS-SM problem is NP-hard. To the best of our knowledge, only the Generalized Cost-Benefit algorithm (GCB) [Zhang and Vorobeychik, 2016], designed for submodular optimization with routing constraints, can be adapted to solve our ORS-SM problem. The algorithm requires a root node, which does not exist in our problem. Hence, we need to find a tree with the largest objective value by treating each node as the root, and the best tree among them is returned as the result for ORS-SM. We denote this method as GCBAll. GCBAll has a high complexity, so it has poor efficiency and is not scalable. To better solve ORS-SM, we propose a 1/O(√(Δ/b_min))-approximation algorithm App_ORS, where b_min is the minimal cost over all edges. We observe that the returned region is always a tree.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)
The basic idea of this algorithm is to first find a set of nodes with large objective scores, while guaranteeing that there exists a tree connecting them whose total cost is smaller than the budget limit Δ. We then find the best region using these nodes. Next, we propose another algorithm with the same approximation ratio, called IApp_ORS, to further improve the effectiveness of App_ORS. This algorithm can obtain regions of better quality. In summary, the contribution of this paper is threefold. Firstly, we propose ORS-SM through submodular optimization and prove its NP-hardness. Secondly, we develop two approximation algorithms, App_ORS and IApp_ORS, for this problem. Finally, we conduct experiments on two applications using three real-world datasets and demonstrate the efficiency and accuracy of the proposed algorithms.

2 Related Work
Region search has been well studied and has a wide range of applications such as user exploration in a city [Feng et al., 2016; Cao et al., 2014], similar region search [Feng et al., 2019a], region detection [Feng et al., 2019b], etc. Most existing region search problems consider the region as a size-fixed rectangle or a radius-fixed circle. The Max-Enclosing Rectangle (MER) problem [Nandy and Bhattacharya, 1995] aims to find a rectangle of a given size a × b such that the rectangle encloses the maximum number of spatial objects. This problem has been systematically studied as the maximizing range sum problem [Choi et al., 2012; Tao et al., 2013]. Similar region search [Feng et al., 2019a], bursty region detection [Feng et al., 2019b], and best region search [Feng et al., 2016] also consider regions as rectangles. However, as shown in the study of the LCMSR query proposed by Cao et al. [Cao et al., 2014], in practice regions are usually arbitrarily shaped. LCMSR defines a region as a connected subgraph with relevant objects.
It uses a sum function to accumulate the weights of the objects in the region, and it aims to find the optimal region that has the largest total weight and does not exceed a size limit. We use a submodular function to compute the objective value of a region and aim to maximize this value under a budget constraint, which makes the problem more challenging.

Our work is also related to submodular optimization, which has broad applicability, with applications including viral marketing [Kempe et al., 2003], recommendation systems [Chen et al., 2015], information gathering [Leskovec et al., 2007], deep neural network training [Joseph et al., 2019], etc. A large amount of research considers submodular optimization under constraints, such as cardinality constraints [Kempe et al., 2003], cost constraints [Qian et al., 2017], routing constraints [Zeng et al., 2015; Zhang and Vorobeychik, 2016], connectivity constraints [Kuo et al., 2014], and so on. Submodular Optimization under Connectivity Constraint (SOCC) [Kuo et al., 2014] is similar to our problem, but it considers costs on nodes to compute the budget values. In contrast, the budget values are associated with edges in our problem, and thus the algorithms for SOCC are not applicable to ORS-SM.

3 Problem Formulation
In this section, we formally define the problem of optimal region search with submodular maximization (ORS-SM) and prove its hardness. The problem is defined over a spatial network G = (V, E), where V is a set of nodes and E ⊆ V × V is a set of edges. Each node v ∈ V represents a location, and each edge (vi, vj) ∈ E represents an undirected path between two locations vi and vj in V, associated with a cost b(vi, vj) (representing the distance, the travel time, etc.).

Definition 1. Region. Given a spatial network G = (V, E), a region R is a connected subgraph of G. We denote the nodes and edges in R as R.V and R.E respectively.
In order to measure the quality of a region, we consider two values, namely the budget value and the objective value. The budget value is defined as the sum of the costs on all the edges in the region:

BS(R) = Σ_{(vi,vj)∈R.E} b(vi, vj)    (1)

The objective value is computed by a submodular function f(·) over the set of locations in a region (e.g., the number of distinct keywords, the social influence, etc.):

OS(R) = f({vi | vi ∈ R.V})    (2)

OS(·) being submodular means that OS(R1 ∪ {vi}) − OS(R1) ≥ OS(R2 ∪ {vi}) − OS(R2) for all R1 ⊆ R2 and vi ∉ R2, where R1 (R2) together with vi composes a new region. For example, consider the problem of finding the most diversified region in the network G shown in Figure 1. In this example, the objective value is computed as OS(R) = |∪_{vi∈R.V} K(vi)|, where K(vi) is the set of distinct keywords on vi. For R2 in the example, we get OS(R2) = |{bar, coffee, park, mall}| = 4 and BS(R2) = 5. Formally, we define the ORS-SM problem as follows:

Definition 2. Optimal Region Search with Submodular Maximization (ORS-SM). Given G = (V, E) and a budget constraint Δ, we aim to find the region R* such that

R* = argmax_R OS(R) subject to BS(R) ≤ Δ    (3)

Theorem 1. Solving the ORS-SM problem is NP-hard.

Proof. The LCMSR query [Cao et al., 2014] mentioned in the Introduction is a special case of ORS-SM, because the sum function also satisfies the submodularity properties. As the LCMSR problem is proved to be NP-hard, the ORS-SM problem is NP-hard as well.

4 Approximation Algorithm App_ORS
In this section, we propose a novel approximation algorithm called App_ORS for solving the ORS-SM problem, and we also prove its error bound. For any subgraph region Ri, we can always find its minimum spanning tree Ti such that BS(Ti) ≤ BS(Ri) and OS(Ti) = OS(Ri). Hence, the optimal region of the ORS-SM problem must be a tree, and we consider tree search directly in the subsequent discussions.
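The observation above can be checked with a minimal sketch (our own toy data and helper names, not from the paper): replacing a region by a spanning tree of its nodes keeps OS unchanged, since OS depends only on the node set, and never increases BS. The objective here is the distinct-keyword count used in the running example.

```python
# Keywords per node (illustrative assignment, assumed for the sketch).
K = {"a": {"coffee"}, "b": {"mall"}, "c": {"mall", "bar"}}

def OS(edges):
    """Objective value: distinct keywords over the nodes touched by `edges`."""
    nodes = {v for e in edges for v in e[:2]}
    return len(set().union(*(K[v] for v in nodes)))

def BS(edges):
    """Budget value: total cost of the edges, as in Eq. (1)."""
    return sum(w for _, _, w in edges)

# A cyclic region on {a, b, c} and a spanning tree of the same nodes
# obtained by dropping the heaviest cycle edge.
region = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 4.0)]
tree   = [("a", "b", 1.0), ("b", "c", 2.0)]

assert OS(tree) == OS(region) == 3   # {coffee, mall, bar}
assert BS(tree) <= BS(region)        # 3.0 <= 7.0
```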
Before presenting the algorithm, we first introduce the following lemma, which lays the foundation of our algorithm.

Lemma 1. For any tree T = (V_T, E_T) with budget value BS(T), there always exist n ≤ max(10·BS(T)/m, 2) subtrees Ti = (V_Ti, E_Ti), 1 ≤ i ≤ n, such that BS(Ti) ≤ m if |E_Ti| > 1 and ∪_{i=1}^{n} V_Ti = V_T.

Proof. We utilize Claim 3 of the Maximum Connected Submodular set function with Budget constraint (MCSB) problem [Kuo et al., 2014], and we conduct the proof through the following steps. 1) Given a graph G = (V, E), we can get its line graph G^L = (V^L, E^L). For a tree T = (V_T, E_T) with budget value BS(T) in G, we can get its corresponding subgraph G^L_T in G^L. If G^L_T contains cycles, we break them by removing some edges. After that, we get a tree T^L = (V_{T^L}, E_{T^L}) spanning all nodes of G^L_T. Now each edge in T corresponds to a node in T^L. 2) According to Claim 3 for MCSB, for T^L with budget value BS(T^L), there always exist n = max(5·BS(T^L)/m, 1) subtrees T^L_i = (V_{T^L_i}, E_{T^L_i}) such that BS(T^L_i) ≤ m + BS(r^L_i) and ∪_{i=1}^{n} V_{T^L_i} = V_{T^L}, where r^L_i is the root of T^L_i for all 1 ≤ i ≤ n. For each subtree T^L_i, if |V_{T^L_i}| > 1, we can always select a node with degree 1 as r^L_i, and we can then divide T^L_i into two subtrees: T^{1,L}_i consisting of r^L_i, and T^{2,L}_i consisting of the remainder of V_{T^L_i}. Hence, we know that BS(T^{2,L}_i) ≤ m. After that, we obtain n ≤ max(10·BS(T^L)/m, 2) subtrees T^L_i = (V_{T^L_i}, E_{T^L_i}) such that BS(T^L_i) ≤ m if |V_{T^L_i}| > 1 and ∪_{i=1}^{n} V_{T^L_i} = V_{T^L}. 3) For each subtree T^L_i, we take the subgraph Gi of G whose line graph is T^L_i. As T^L_i is obtained from the line graph of G, Gi must be a subgraph of the original tree T in the first step. Based on the transfer properties of line graphs, i.e., the line graph of a connected graph is connected, Gi must be connected. Thus, each Gi must be a tree as well, and we denote it as Ti; we have ∪_{i=1}^{n} V_{T^L_i} = V_{T^L} = E_T. We thus complete the proof.
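Step 1 of the proof can be illustrated with a short sketch (our own code; graph data and names are illustrative). The line graph has one node per original edge, with two such nodes adjacent iff the corresponding edges share an endpoint. Note that the line graph of a tree may itself contain cycles — in the example below, three edges meeting at one node produce a triangle — which is exactly why the proof breaks cycles before applying Claim 3.

```python
from itertools import combinations

def line_graph(edges):
    """Line graph of an undirected graph given as a list of edges.
    Each edge becomes a node; two such nodes are adjacent iff the
    original edges share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    adj = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if a & b:                      # edges share an endpoint
            adj[a].add(b)
            adj[b].add(a)
    return adj

# A star-shaped tree: edges v1-v2, v2-v3, v2-v4 all meet at v2,
# so its line graph is a triangle (a cycle), even though T is a tree.
T = [("v1", "v2"), ("v2", "v3"), ("v2", "v4")]
L = line_graph(T)
assert len(L) == 3
assert all(len(nbrs) == 2 for nbrs in L.values())   # triangle
```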
Now we are ready to present our approximation algorithm App_ORS. The basic idea of App_ORS is to find a tree T′ such that OS(T′) is larger than the objective score of any subtree Ti whose budget score is no larger than m, and BS(T′) ≤ Δ. We can then prove, based on Lemma 1, that the region R′ obtained from T′ is an approximate feasible solution of the problem. In App_ORS, Nemhauser's greedy algorithm [Nemhauser et al., 1978] is used to solve the Maximum Rooted Submodular set function (MRS) problem [Kuo et al., 2014], which aims to find a node set containing a given root node and K − 1 other nodes such that the score computed by a submodular function over the K nodes is maximum.

Algorithm 1: App_ORS
Input: G = (V, E), Δ
Output: A region R′
1: Initialize a node set VK_opt ← ∅, vopt ← ∅;
2: K ← √(Δ/b_min) + 1, Dis_Lim ← √(Δ·b_min);
3: for each node vi ∈ V do
4:   Vvi ← {vj | vi ≠ vj, dis(vi, vj) ≤ Dis_Lim};
5:   Use Nemhauser's greedy algorithm to solve MRS(OS(·), Vvi, K, vi), and get the top-K node set VK_vi;
6:   if OS(VK_opt) < OS(VK_vi) then
7:     VK_opt ← VK_vi, vopt ← vi;
8: T′ ← find_Opt_Tree(VK_opt, G, Δ, vopt);
9: T′′ ← a single edge in E with the maximum OS value and BS(T′′) ≤ Δ;
10: R′ ← T′ if OS(T′) ≥ OS(T′′) else T′′;
11: return the region R′;

As shown in Alg. 1, App_ORS starts with an initial node set VK_opt, a root node vopt, and two parameters K and Dis_Lim (lines 1-2), where b_min = min_{(vi,vj)∈E} b(vi, vj). After that, for each node vi ∈ V, App_ORS first finds the candidate node set Vvi, in which each node has a distance to vi no larger than Dis_Lim (lines 3-4). Then App_ORS constructs an MRS problem instance MRS(OS(·), Vvi, K, vi) and uses Nemhauser's greedy algorithm to solve the instance, obtaining a top-K node set VK_vi (line 5). If OS(VK_opt) < OS(VK_vi), App_ORS updates VK_opt and vopt (lines 6-7). Next, App_ORS uses the function find_Opt_Tree(VK_opt, G, Δ, vopt) to find a tree T′ spanning all nodes in VK_opt and satisfying BS(T′) ≤ Δ (line 8).
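Line 5 relies on Nemhauser's greedy algorithm for the rooted subproblem. A minimal sketch (our own function and variable names; the coverage objective is only a stand-in for OS(·)): start from the root, then repeatedly add the candidate with the largest marginal gain until the set holds K nodes. For a monotone submodular objective this yields a (1 − 1/e)-approximation.

```python
def greedy_mrs(f, candidates, K, root):
    """Nemhauser-style greedy for MRS(f, candidates, K, root):
    grow {root} by the candidate with the largest marginal gain
    until the set contains K nodes (or candidates run out)."""
    S = {root}
    pool = set(candidates) - S
    while len(S) < K and pool:
        best = max(pool, key=lambda v: f(S | {v}) - f(S))
        S.add(best)
        pool.remove(best)
    return S

# Stand-in submodular objective: distinct keywords covered by a node set.
KW = {"r": {"a"}, "x": {"a", "b"}, "y": {"b"}, "z": {"c"}}
f = lambda S: len(set().union(*(KW[v] for v in S))) if S else 0

S = greedy_mrs(f, ["x", "y", "z"], K=3, root="r")
assert "r" in S and f(S) == 3   # {a, b, c} covered
```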
After that, App_ORS finds T′′, consisting of a single feasible edge with the maximum objective score, by scanning all edges in E (line 9). Finally, it returns the tree with the larger objective value as the region R′ (lines 10-11). One simple method of implementing find_Opt_Tree(VK_opt, G, Δ, vopt) is to find the shortest paths from vopt to all nodes in VK_opt, and then obtain a tree T′ by removing duplicate edges in these shortest paths. We will present a better method in Section 5.3. We analyze the error bound of App_ORS as below.

Lemma 2. App_ORS returns a feasible solution of the ORS-SM problem.

Proof. In App_ORS, BS(T′′) ≤ Δ. Moreover, VK_opt contains K nodes, and the largest distance between vopt and any vj ∈ VK_opt is Dis_Lim. Thus, for the tree T′ obtained by combining the shortest paths from vopt to all nodes in VK_opt, BS(T′) ≤ (K − 1) · Dis_Lim = Δ.

Lemma 3. OS(Ropt) ≤ O(√(Δ/b_min)) · OS(R′), where Ropt denotes the optimal region of the ORS-SM problem.

Proof. We denote the optimal solution of MRS(OS(·), Vvi, K, vi) as V_MRS_opt, a tree rooted at vi whose budget value is no larger than Dis_Lim as T^{vi,DL}, and the T^{vi,DL} with the maximal objective value as T^{vi,DL}_opt. We can see that T^{vi,DL}_opt has at most Dis_Lim/b_min nodes besides vi, and these nodes are in {vj | dis(vi, vj) ≤ Dis_Lim}. Moreover, √(Δ/b_min) ≥ K − 1. For any node vj ∈ T^{vi,DL}_opt, vj ∈ Vvi. Thus, OS(V_MRS_opt) ≥ OS(T^{vi,DL}_opt). Meanwhile, in App_ORS, OS(T′) ≥ OS(VK_opt). Based on Lemma 2 in [Kuo et al., 2014], Nemhauser's greedy algorithm yields an (e−1)/e-approximate solution for the MRS problem, so we have OS(VK_opt) ≥ (e−1)/e · OS(V_MRS_opt). Hence we get OS(T′) ≥ (e−1)/e · OS(T^{vi,DL}_opt). As Ropt must be a tree, we use Topt to represent it. According to Lemma 1, for Topt = (V_Topt, E_Topt) with budget value BS(Topt), there always exist n ≤ max(10·Δ/Dis_Lim, 2) subtrees Ti = (V_Ti, E_Ti), where BS(Ti) ≤ Dis_Lim if |E_Ti| > 1, such that ∪_{i=1}^{n} V_Ti = V_Topt.
Hence, we have:

OS(Ropt) = OS(Topt) = OS(V_Topt) = OS(∪_{i=1}^{n} V_Ti)
  ≤ max(10·Δ/Dis_Lim, 2) · max(OS(T^{vi,DL}_opt), OS(T′′))
  ≤ O(√(Δ/b_min)) · max(e/(e−1) · OS(T′), OS(T′′))
  ≤ O(√(Δ/b_min)) · OS(R′)

where the first inequality is derived from the submodularity of OS(·). Based on Lemmas 2 and 3, we obtain Theorem 2:

Theorem 2. App_ORS is a 1/O(√(Δ/b_min))-approximation algorithm for the ORS-SM problem.

Proof. This is immediate, since App_ORS returns a feasible solution whose objective score is no smaller than 1/O(√(Δ/b_min)) times the optimal value.

5 Improved App_ORS
As App_ORS considers the worst case when computing the K nodes under the budget constraint, and only searches within Vvi = {vj | dis(vi, vj) ≤ Dis_Lim}, it may not achieve high-quality solutions in practice (as shown in the experimental study). To improve the solution quality of App_ORS, we present an improved version of App_ORS (IApp_ORS). Meanwhile, we develop two heuristic methods to implement a key function of IApp_ORS.

5.1 An Improved Version of App_ORS
We abbreviate lines 3-7 of Alg. 1 as: for each node vi ∈ V do update(vopt, VK_opt). In the improved version of App_ORS, the details of the new update(vopt, VK_opt) function are shown in Alg. 2. Different from the old function, the new one improves the solution quality by using a method find_Fea_Node_Set(vi, G, Δ) to find another feasible node set V^{2,K}_vi (lines 4-6). Note that V^{2,K}_vi is a feasible node set iff there exists a tree T^{2,K}_vi spanning all its nodes with BS(T^{2,K}_vi) ≤ Δ.

Theorem 3. IApp_ORS returns a 1/O(√(Δ/b_min))-approximation solution of the ORS-SM problem.

Proof. As both V^{1,K}_vi and V^{2,K}_vi are feasible node sets, IApp_ORS finds a feasible solution for the ORS-SM problem. Meanwhile, the solution quality of IApp_ORS is not worse than that of App_ORS. This leads to the final conclusion.
Algorithm 2: New update(vopt, VK_opt)
Input: G = (V, E), Δ, K, Dis_Lim, vi, vopt, VK_opt
Output: Updated vopt, VK_opt
1: V^1_vi ← {vj | vi ≠ vj, dis(vi, vj) ≤ Dis_Lim};
2: Use Nemhauser's greedy algorithm to solve MRS(OS(·), V^1_vi, K, vi), and get the top-K node set V^{1,K}_vi;
3: VK_vi ← V^{1,K}_vi;
4: V^{2,K}_vi ← find_Fea_Node_Set(vi, G, Δ);
5: if OS(VK_vi) < OS(V^{2,K}_vi) then
6:   VK_vi ← V^{2,K}_vi;
7: if OS(VK_opt) < OS(VK_vi) then
8:   VK_opt ← VK_vi, vopt ← vi;
9: return vopt, VK_opt;

5.2 Two Heuristic Methods for IApp_ORS
To implement find_Fea_Node_Set(vi, G, Δ) in IApp_ORS, we present two heuristic methods. First, we introduce the function compute(BS(Vi)), which computes the minimal budget value of a tree spanning all nodes in the node set Vi. Computing BS(Vi) exactly is a Steiner tree problem, which is proved to be NP-hard [Vazirani, 2013]. Rather than computing the accurate BS(Vi), we use an approximation algorithm to get an estimated value ˆBS(Vi). Here, we use Kruskal's algorithm to compute a minimum-cost spanning tree (MST) spanning all nodes in Vi, as the MST yields a 2-approximation for the Steiner tree problem [Vazirani, 2013], and Kruskal's algorithm is efficient. We denote this method as compute(ˆBS(Vi)).

Next, we present the first implementation, find_Fea_Node_Set1(vi, G, Δ). As shown in Alg. 3, it begins by initializing the node sets V^{2,K}_vi and V^2_vi (line 1). In each while loop, the method selects from V^2_vi the node vmax with the maximum marginal objective value for the current V^{2,K}_vi. After that, the method removes vmax from V^2_vi and computes ˆBS(V^{2,K}_vi ∪ {vmax}) using Kruskal's algorithm. If the estimated budget value is not larger than Δ, vmax is inserted into V^{2,K}_vi. The process terminates when V^2_vi = ∅ (lines 2-7). At the last step, the method returns the node set V^{2,K}_vi.
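The estimate compute(ˆBS(Vi)) can be sketched as follows (our own minimal implementation; the distance oracle `dis(u, v)` is assumed precomputed): run Kruskal's algorithm on the metric closure of Vi, i.e., the complete graph on Vi whose edge (u, v) is weighted by the shortest-path distance. The resulting MST weight is within a factor 2 of the optimal Steiner tree cost.

```python
def mst_weight(nodes, dis):
    """Kruskal's algorithm on the metric closure of `nodes`.
    Returns the MST weight, used as the estimate BS^(Vi) in place
    of the NP-hard exact Steiner tree cost (a 2-approximation)."""
    nodes = list(nodes)
    parent = {v: v for v in nodes}

    def find(v):                       # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((dis(u, v), u, v)
                   for i, u in enumerate(nodes) for v in nodes[i + 1:])
    total, used = 0.0, 0
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                   # edge joins two components
            parent[ru] = rv
            total += w
            used += 1
            if used == len(nodes) - 1:
                break
    return total

# Toy metric: points on a line, dis = absolute difference of positions.
pos = {"a": 0.0, "b": 2.0, "c": 5.0}
d = lambda u, v: abs(pos[u] - pos[v])
assert mst_weight(["a", "b", "c"], d) == 5.0   # edges a-b (2) + b-c (3)
```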
Algorithm 3: Function find_Fea_Node_Set1(vi, G, Δ)
Input: G = (V, E), Δ, vi
Output: V^{2,K}_vi
1: V^{2,K}_vi ← {vi}, V^2_vi ← {vj | vi ≠ vj, dis(vi, vj) ≤ Δ};
2: while V^2_vi ≠ ∅ do
3:   vmax ← argmax_{vj∈V^2_vi} OS(V^{2,K}_vi ∪ {vj});
4:   V^2_vi ← V^2_vi \ {vmax};
5:   ˆBS(V^{2,K}_vi ∪ {vmax}) ← compute(ˆBS(V^{2,K}_vi ∪ {vmax}));
6:   if ˆBS(V^{2,K}_vi ∪ {vmax}) ≤ Δ then
7:     V^{2,K}_vi ← V^{2,K}_vi ∪ {vmax};
8: return V^{2,K}_vi;

Note that BS(V^{2,K}_vi ∪ {vmax}) ≤ ˆBS(V^{2,K}_vi ∪ {vmax}), and thus BS(V^{2,K}_vi ∪ {vmax}) ≤ Δ, so V^{2,K}_vi is a feasible node set for the ORS-SM problem. In addition, we can speed up find_Fea_Node_Set1(vi, G, Δ) using an optimization technique called lazy evaluations [Leskovec et al., 2007]. We denote IApp_ORS with find_Fea_Node_Set1(vi, G, Δ) as IApp_ORSHeu1.

Next, we propose another heuristic method to implement find_Fea_Node_Set(vi, G, Δ), which is faster than the first one. Intuitively, for two nodes v1 and v2, when dis(v1, v2) is smaller, the difference between V^{2,K}_v1 and V^{2,K}_v2 is smaller. Thus, we can use V^{2,K}_v1 to approximate V^{2,K}_v2. Based on this observation, we present the second method find_Fea_Node_Set2(vi, G, Δ). Most steps are the same as in Alg. 3, but line 4 is replaced by Vnei_vi ← {vj | dis(vi, vj) ≤ γΔ}, V^2_vi ← V^2_vi \ {Vnei_vi ∪ vmax}, where γ ∈ (0, 1] is a parameter and Vnei_vi is the neighbor node set of vi. In this way, we reuse the second feasible node set of vi for vi's neighbor nodes, which reduces the related computations. We denote IApp_ORS with find_Fea_Node_Set2(vi, G, Δ) as IApp_ORSHeu2; V^{2,K}_vi in IApp_ORSHeu2 is also a feasible node set.

5.3 A Better find_Opt_Tree(VK_opt, G, Δ, vopt)
In this section, we present a better method to implement the function find_Opt_Tree(VK_opt, G, Δ, vopt) in line 8 of Alg. 1. We first compute the shortest distances between each pair of nodes using existing methods.
Then, we use Kruskal's algorithm to find an MST (denoted as MST1) spanning all nodes in VK_opt; MST1 connects some pairs of nodes by shortest-path distances whose actual paths are not yet known. Next, we recover the real shortest paths by inserting the node set Vmid = {vi | vi ∈ Vvopt, ˆBS(VK_opt ∪ {vi}) = ˆBS(VK_opt)} into VK_opt and finding a new MST (denoted as MST2) spanning all nodes in VK_opt ∪ Vmid, where Vvopt ← {vi | vi ∉ VK_opt, dis(vopt, vi) ≤ Δ}. MST2 is a feasible solution for the ORS-SM problem, but its budget value may be far less than Δ. We improve MST2 by extending it greedily. In each iteration, we select the node that is connected to MST2 and has the maximum marginal objective value, and we insert it into MST2. The process is repeated until no more nodes can be added to MST2. Finally, we obtain a better region MST2 from VK_opt. We thus use this method to implement find_Opt_Tree(VK_opt, G, Δ, vopt) for all proposed algorithms by default.

5.4 Time Complexity of the Proposed Algorithms
In this section, we analyze the complexity of the proposed algorithms App_ORS, IApp_ORSHeu1, and IApp_ORSHeu2. These algorithms have two main steps. Step 1 computes a feasible node set VK_opt, and Step 2 searches for a feasible spanning tree T′ of VK_opt and the single-edge tree T′′. We denote the time cost of computing dis(vi, vj) as t_dis. App_ORS needs |V| loops in Step 1; in each loop, it takes O(|V|·t_dis) time to get the candidate node set Vvi, and O(K·|Vvi|_max) time to get the top-K node set and update VK_opt and vopt, where |Vvi|_max is the maximum value of |Vvi|. The method in Section 5.3 is the default method used in Step 2. It first takes O(|VK_opt|^2 · log|VK_opt|) time to find MST1. After that, it costs O(|V|·t_dis) time to get the node set Vvopt, and then O(|Vvopt|·|VK_opt|^2 · (t_dis + log|VK_opt|)) time to get MST2. Next, it requires O(|Vvopt|·|VK_opt|) time to extend MST2. Finally, it takes O(|E|) time to find T′′.
Thus, the time complexity of App_ORS is O(|V|^2 · t_dis + |Vvopt|·|VK_opt|^2 · (t_dis + log|VK_opt|)). IApp_ORSHeu1 also needs |V| loops in Step 1. In each loop, it first takes O(|V|·t_dis) time to get V^{1,K}_vi, and then O(|V|·t_dis + |V^2_vi|_max · (|V^{2,K}_vi|_max)^2 · (t_dis + log|V^{2,K}_vi|_max)) time to get V^{2,K}_vi. Step 2 of IApp_ORSHeu1 is the same as that of App_ORS, and the worst case of IApp_ORSHeu2 is IApp_ORSHeu1. Therefore, the time complexity of IApp_ORSHeu1 and IApp_ORSHeu2 is O(|V|^2 · t_dis + |V|·|V^2_vi|_max · (|V^{2,K}_vi|_max)^2 · (t_dis + log|V^{2,K}_vi|_max)).

6 Experimental Study
We compare the proposed algorithms to the state-of-the-art algorithm GCBAll in two applications on three real-world datasets. As discussed in Section 1, GCBAll is the adapted version of GCB [Zhang and Vorobeychik, 2016] for solving our ORS-SM problem. In GCBAll, to compute an approximation of the minimal budget value of a tree spanning all the given nodes, we adopt the 2-approximation algorithm for the Steiner tree problem [Vazirani, 2013] by utilizing the minimum spanning tree. We implement all the algorithms in Java on Windows 10, and run them on a server with an Intel(R) Xeon(R) W-2155 3.3GHz CPU and 256 GB memory.

6.1 Case Study 1: Most Influential Region
Problem Definition and Datasets
We first investigate Most Influential Region Search (MIRS), which aims to find an optimal region under a budget constraint such that the number of affected users is maximal in a geo-social network. This problem is useful in many scenarios, e.g., a company wants to find a region for marketing its products, or the government intends to find a region for building some public facilities. We model the geo-social network as an undirected graph G = (V, E) and compute the probability that a user ui visits a region R as P^R_ui = 1 − Π_{vj∈R.V}(1 − P^vj_ui), where P^vj_ui is the probability that ui visits the location vj, computed as P^vj_ui = (# of check-ins of ui at vj) / (# of check-ins of ui).
Thus, the objective value of R is the expected number of users affected by R, computed as a submodular function: OS(R) = Σ_{ui∈U} P^R_ui, where U represents all users. In this problem, we use two real-world datasets, SG and AS, crawled from Foursquare (also used in the work [Zeng et al., 2015]); SG has 189,306 check-ins made by 2,321 users at 5,412 locations in Singapore, and AS contains 201,525 check-ins made by 4,630 users at 6,176 locations in Austin. Following the work [Zeng et al., 2015], we create an edge between two locations that were visited consecutively in one day by the same user, and use the Euclidean distance as the budget value for each edge. We also removed the check-ins made by users who checked in at a location fewer than 3 times.

Figure 2: Comparison of algorithms in MIRS (run time and objective score of GCBAll, App_ORS, IApp_ORSHeu1, and IApp_ORSHeu2 on SG and AS, varying Δ from 20 to 60 km).

Performance of Our Methods
We compare the three proposed algorithms with GCBAll in terms of efficiency (run time) and effectiveness (objective score) by varying Δ from 20 km to 60 km. Fig. 2 shows that GCBAll is rather time-consuming when Δ is large, as it calls compute(ˆBS(Vi)) too many times and cannot use the lazy evaluations optimization [Leskovec et al., 2007] to find vmax quickly. All the proposed algorithms are faster than GCBAll by more than one order of magnitude.
Meanwhile, the objective scores of IApp_ORSHeu1 and IApp_ORSHeu2 are relatively high (more than 90% of GCBAll's), which is consistent with our theoretical analysis of the solution quality. The results demonstrate that the two presented heuristic methods improve the solution quality of App_ORS significantly. The experimental results also show that the solution quality of our algorithms is better than that of GCBAll within a run time limit (e.g., 1s, 5s, 10s). We omit these results here due to the space limitation.

6.2 Case Study 2: Most Diversified Region
Problem Definition and Datasets
The second application, Most Diversified Region Search (MDRS), is to find the most diversified region under a budget constraint. We consider MDRS on road networks that contain a set of locations associated with sets of keywords (e.g., mall and bar). We measure the diversity of a region by the number of different keywords, and thus the objective function is OS(R) = |∪_{vi∈R.V} K(vi)|, where K(vi) is the set of keywords on vi. We use the road network of California (CA) from a public website1 and utilize the Foursquare APIs to fill in the missing keywords (categories of locations) for nodes2. CA contains 21,048 nodes and 22,830 edges [Li et al., 2005].

1 http://www.cs.utah.edu/~lifeifei/SpatialDataset.htm
2 https://developer.foursquare.com/docs/api

Performance of Our Methods
As GCBAll cannot find results under Δ = 30 km on AS within 1,000s, and CA is even larger than AS, we only compare the performance of the three proposed algorithms under larger Δ values using CA.

Figure 3: Performance of the algorithms in MDRS on CA, varying Δ from 60 to 100 km ((a) efficiency, (b) solution quality).

As shown in Fig. 3, the objective scores of IApp_ORSHeu1 and IApp_ORSHeu2 are competitive, and higher than that of App_ORS. Meanwhile, IApp_ORSHeu2 is faster than IApp_ORSHeu1 by about one order of magnitude, and can solve the problem within 4s when Δ = 100 km. In addition, to compare the performance of the three algorithms when varying the dataset size, we generate a synthetic dataset (denoted as SY) based on the structure of the CA dataset. We then run the three algorithms on SY, varying the number of nodes from 10,000 to 30,000 with Δ = 60 km.

Figure 4: Performance of the algorithms in MDRS on SY, varying the number of nodes from 10K to 30K ((a) efficiency, (b) solution quality).

Fig. 4 shows that the run time of IApp_ORSHeu2 grows slowly with the increasing number of nodes, and the solution quality of IApp_ORSHeu2 is close to that of IApp_ORSHeu1 and better than that of App_ORS.

7 Conclusion
We propose the ORS-SM problem, which aims to find the optimal region maximizing an objective score computed by a submodular function, subject to a given budget constraint. To efficiently solve the ORS-SM problem, we propose two approximation algorithms and further improve them with several optimization techniques. The results of empirical studies on two applications using three real-world datasets demonstrate the efficiency and solution quality of our proposed algorithms. In future work, we would like to design an efficient index to improve the efficiency and scalability of the proposed methods.

Acknowledgements
Xin Cao is supported by ARC DE190100663. Yifeng Zeng is supported by EPSRC-EP/S011609/1. Bin Yao is supported by NSFC (No.
U1636210, 61729202, 61922054, 61872235, 61832017) and the National Key Research and Development Program of China (2018YFC1504504). Xin Cao is the corresponding author of this paper.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

References
[Cao et al., 2014] Xin Cao, Gao Cong, Christian S. Jensen, and Man Lung Yiu. Retrieving regions of interest for user exploration. Proceedings of the VLDB Endowment, 7(9):733–744, 2014.
[Chen et al., 2015] Xuefeng Chen, Yifeng Zeng, Gao Cong, Shengchao Qin, Yanping Xiang, and Yuanshun Dai. On information coverage for location category based point-of-interest recommendation. In AAAI, pages 37–43, 2015.
[Choi et al., 2012] Dong-Wan Choi, Chin-Wan Chung, and Yufei Tao. A scalable algorithm for maximizing range sum in spatial databases. Proceedings of the VLDB Endowment, 5(11):1088–1099, 2012.
[Crawford, 2019] Victoria G. Crawford. An efficient evolutionary algorithm for minimum cost submodular cover. In IJCAI, pages 1227–1233, 2019.
[Feng et al., 2016] Kaiyu Feng, Gao Cong, Sourav S. Bhowmick, Wen-Chih Peng, and Chunyan Miao. Towards best region search for data exploration. In SIGMOD, pages 1055–1070. ACM, 2016.
[Feng et al., 2019a] Kaiyu Feng, Gao Cong, Christian S. Jensen, and Tao Guo. Finding attribute-aware similar regions for data analysis. Proceedings of the VLDB Endowment, 12(11):1414–1426, 2019.
[Feng et al., 2019b] Kaiyu Feng, Tao Guo, Gao Cong, Sourav S. Bhowmick, and Shuai Ma. SURGE: Continuous detection of bursty regions over a stream of spatial objects. IEEE Transactions on Knowledge and Data Engineering, 2019.
[Joseph et al., 2019] K. J. Joseph, Krishnakant Singh, and Vineeth N. Balasubramanian. Submodular batch selection for training deep neural networks. In IJCAI, pages 2677–2683, 2019.
[Kempe et al., 2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146. ACM, 2003.
[Krause and Golovin, 2014] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
[Kuo et al., 2014] Tung-Wei Kuo, Kate Ching-Ju Lin, and Ming-Jer Tsai. Maximizing submodular set function with connectivity constraint: Theory and application to networks. IEEE/ACM Transactions on Networking, 23(2):533–546, 2014.
[Leskovec et al., 2007] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In SIGKDD, pages 420–429, 2007.
[Li et al., 2005] Feifei Li, Dihan Cheng, Marios Hadjieleftheriou, George Kollios, and Shang-Hua Teng. On trip planning queries in spatial databases. In SSTD, pages 273–290. Springer, 2005.
[Nandy and Bhattacharya, 1995] Subhas C. Nandy and Bhargab B. Bhattacharya. A unified algorithm for finding maximum and minimum object enclosing rectangles and cuboids. Computers & Mathematics with Applications, 29(8):45–61, 1995.
[Nemhauser et al., 1978] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[Qian et al., 2017] Chao Qian, Jing-Cheng Shi, Yang Yu, and Ke Tang. On subset selection with general cost constraints. In IJCAI, pages 2613–2619, 2017.
[Stan et al., 2017] Serban Stan, Morteza Zadimoghaddam, Andreas Krause, and Amin Karbasi. Probabilistic submodular maximization in sub-linear time. In ICML, pages 3241–3250. JMLR.org, 2017.
[Tao et al., 2013] Yufei Tao, Xiaocheng Hu, Dong-Wan Choi, and Chin-Wan Chung. Approximate MaxRS in spatial databases. Proceedings of the VLDB Endowment, 6(13):1546–1557, 2013.
[Vazirani, 2013] Vijay V. Vazirani. Approximation Algorithms. Springer Science & Business Media, 2013.
[Zeng et al., 2015] Yifeng Zeng, Xuefeng Chen, Xin Cao, Shengchao Qin, Marc Cavazza, and Yanping Xiang. Optimal route search with the coverage of users' preferences. In IJCAI, pages 2118–2124, 2015.
[Zhang and Vorobeychik, 2016] Haifeng Zhang and Yevgeniy Vorobeychik. Submodular optimization with routing constraints. In AAAI, pages 819–825, 2016.