# radon_ð_rapid_discovery_of_topological_relations__4c07249a.pdf
Radon Rapid Discovery of Topological Relations
Mohamed Ahmed Sherif,a Kevin Dreßler,a Panayiotis Smeros,b Axel-Cyrille Ngonga Ngomoa
a Department of Computer Science, University of Leipzig, 04109 Leipzig, Germany {sherif|dressler|ngonga}@informatik.uni-leipzig.de b EPFL, BC 142, Station 14, CH-1015 Lausanne, Switzerland panayiotis.smeros@epfl.ch
Geospatial data is at the core of the Semantic Web, of which the largest knowledge base contains more than 30 billions facts. Reasoning on these large amounts of geospatial data requires efficient methods for the computation of links between the resources contained in these knowledge bases. In this paper, we present Radon efficient solution for the discovery of topological relations between geospatial resources according to the DE9-IM standard. Our evaluation shows that we outperform the state of the art significantly and by several orders of magnitude.
1 Introduction Geo-spatial datasets belong to the largest sources of Linked Data. For example, Linked Geo Data1 contains more than 20 billion triples which describe millions of geo-spatial entities. Datasets such as NUTS2 use polygons of up to 1500 points to describe resources such as countries. As pointed out in previous works (Ngonga Ngomo 2013), only 7.1% of the links between resources connect geo-spatial entities. This is due to two main factors. First, the large number of geo-spatial resources available on the Linked Data Web requires scalable algorithms for computing links between geo-spatial resources. In addition, the description of geospatial resources being commonly based on polygons demands the computation of particular relations, i.e., topological relations, between geo-spatial resources. According to the Linked Data principles3 and for the sake of real-time application such as structured machine learning (e.g., DLLearner (Lehmann 2009)) and question Answering (e.g., DEQA platform (Lehmann et al. 2012)), the provision of explicit topological relations between resources is of central importance to achieve scalability. However, only a few approaches have been developed to deal with geo-spatial data represented in RDF. For example, (Ngonga Ngomo 2013) uses the Hausdorffdistance to compute a topological distance between geo-spatial entities. (Smeros and Koubarakis 2016) builds upon Multi Block to compute topological relations according to the DE-9IM standard.
Copyright c 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 1http://linkedgeodata.org 2http://nuts.geovocab.org/ 3https://www.w3.org/Design Issues/Linked Data.html
We go beyond the state of the art by providing a novel indexing method combined with space tiling that allows for the efficient computation of topological relations between geospatial resources. In particular, we present a novel sparse index for geo-spatial resources. We then develop a strategy to discard unnecessary computations for DE-9IM relations based on bounding boxes. Our extensive experiments show that our approach scales well and outperforms the state of the art by up to 3 orders of magnitude w.r.t. to its runtime. The contributions of this paper can be summarized as follows: (1) We present a novel indexing algorithm for geospatial resources based on an optimized sparse space tiling. (2) We provide a novel filtering approach for the rapid discovery of topological relations (Radon), which uses minimum bounding box (MBB) approximation. (3) We show that Radon is able to discover any of the DE-9IM relations that involve intersection of at least one point. (4) We evaluate Radon on real datasets and show that it clearly outperforms the state of the art.
2 Preliminaries
Let K be a finite RDF knowledge base. K can be regarded as a set of triples (s, p, o) (R B) P (R L B), where R is the set of all resources, B is the set of all blank nodes, P the set of all predicates and L the set of all literals. Given a set of source resources S and target resources T from two (not necessarily distinct) knowledge bases K1 and K2 as well as a relation R, the goal of Link Discovery (LD) is to find the set of mapping M = {(s, t) S T : R(s, t)}. Naive computation of M requires quadratic time complexity to compare every s S with every t T, which is clearly impracticable for large datasets such as geo-spatial datasets, which are the focus of this work. Here, we present an algorithm for efficient computations of topological relations between resources with geo-spatial descriptions (i.e., described by means of vector geometry).4 We assume that each of the resources in S and T considered in the subsequent portion of this paper as being described by a geometry, where each geometry is modelled as sequence of points. An example of such resources is shown in Figure 1(a). The Dimensionally Extended nine-Intersection Model
4Most commonly encoded in the WKT format, see http://www. opengeospatial.org/standards/sfa.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
(a) Example geometries (b) MBB construction
(c) Space tiling (d) Optimized sparse space tiling
Figure 1: City of Leipzig from NUTS (in gray) together with topologically related geometries from CLC (in green and blue). See Section 4 for description of NUTS and CLC.
(DE-9IM) (Clementini, Sharma, and Egenhofer 1994) is a standard used to describe the topological relations between two geometries in two-dimensional space. The spatial relations expressed by the model are topological and are invariant to rotation, translation and scaling transformations (Egenhofer and Franzosa 1991). The basic idea behind DE-9IM is to construct the 3 3 intersection matrix:
DE9IM(a, b)
dim(I(g1) I(g2)) dim(I(g1) B(g2)) dim(I(g1) E(g2)) dim(B(g1) I(g2)) dim(B(g1) B(g2)) dim(B(g1) E(g2)) dim(E(g1) I(g2)) dim(E(g1) B(g2)) dim(E(g1) E(g2))
where dim is the maximum number of dimensions of the intersection of the interior (I), boundary (B), or exterior (E) of the two geometries g1 and g2. The domain of dim is { 1, 0, 1, 2}, where 1 indicates no intersection, 0 stands for an intersection which results into a set of one or more points, 1 indicates an intersection made up of lines and 2 standard for an intersection which results in an area. A simplified binary version of dim(x) with the binary domain {true, false} is obtained using the boolean function β(dim(I(g)) = false iffdim(I(g)) = 1 and true otherwise. The major insight behind Radon is that one condition must hold for any of the entries of the DE-9IM matrix to be true: There must be at least one point in space that is common to the perimeters of the polygons. Note that the only spatial relation for which all entries are 0 is the disjoint relation, which Radon can easily compute by computing the inverse of the intersects relation. Hence, by accelerating the computation of whether two geometries share at least one point, we can accelerate the computation of any of the DE-9IM entries. Therewith, we can also accelerate the computation of any topological relation, as they can all be derived from the DE-9IM entries. We implement this insight by using an improved indexing approach based on minimum bounding boxes and space tiling. The minimum bounding box (MBB) of a geometry g in n dimensions (O Rourke 1985) (also called its envelope) is the rectangular box with the smallest measure (area, volume, or hypervolume in higher dimensions) within which all points of g lie. Let κi(p) denote the ith dimension coordinate of a
point p. To obtain the MBB of a geometry g, we have to find the lowest point coordinate c i = minp g{κi(p)} and the highest point coordinate c i = maxp g{κi(p)} in each dimension i {0, . . . , n}. Then, the 2n vertices of the MBB in n dimensions are all the vectors c( ) 0 , c( ) 1 , . . . , c( ) n , where ( ) { , }. Figure 1(b) shows an example of using the MBB to abstract the running example in Figure 1(a). Space tiling is an indexing technique for spatial data inspired by tessellation and previously used by LD optimization approaches such as Orchid (Ngonga Ngomo 2013) and HR3 (Ngonga Ngomo 2012). The main idea behind space tiling is to divide n-dimensional affine spaces into arbitrarily many hypercubes with the same edge length ℓ. These hypercubes are indexed with vectors i Nn to serve as addressable buckets for geometries. In turn, the obtained index structures can be exploited by various optimization techniques. We call Δ = ℓ 1 the granularity factor. This notion of space tiling can be generalized to hyperrectangles, in which case there exist n independent granularity factors Δi where i {0 . . . n}. Note that although we eventually use hyperrectangles, we will stick to the term hypercube for the sake of simplicity and just define independent granularity factors when necessary. Figure 1(c) shows our running example along with a grid of hypercubes using Δ = 2, where the green area will be indexed to each highlighted hypercube.
We have now introduced all ingredients necessary for defining the Radon algorithm (Algorithm 1). Radon takes a set of source resources S , a set of target resources T and a topological relation r as input. The goal of Radon is to generate the mapping M = {(s, t) S T : r(s, t)} efficiently, where r is a topological relation. Radon addresses this challenge by means of three optimization steps: Swapping for index size minimization, space tiling for indexing and filtering to improve the runtime of the computation of topological relations. In the following, we detail each of these steps.
3.1 Swapping Strategy
We introduce the Estimated Total Hypervolume (eth) of a set of geometries X as
eth(X) = |X|
max p x {κi(p)} min p x {κi(p)} , (2)
with d being the number of dimensions of the resource geometries and κi(p) denoting the coordinate of a point p in the ith dimension. If eth(T) < eth(S ), Radon swaps S and T and computes the reverse5 relation r instead of r (Lines 2 5). For example, if r were the topological relation covered and eth(S ) < eth(T), then Radon swaps T and S and compute the reverse relation of r, i.e., covered By. The rationale behind using eth instead of the size of the datasets is that even small datasets can contain very large geometries that span over a large number of hypercubes and would lead to large
5Formally, the reverse relation r of a relation r is defined as r (y, x) r(x, y).
spatial index when used as source. For the sake of illustration, consider the running example in Figure 1(a). Here, we can see that the eth of NUTS (containing only the gray geometry) is greater than the eth of CLC (containing the green and blue areas). Thus, we set S = CLC and T = NUTS.
3.2 Optimized Sparse Space Tiling In its second step, Radon utilizes space tiling to insert all geometries s S and t T into an index I, which maps resources to sets of hypercubes. Let Δϕ and Δλ be the granularities across the latitude and longitude (several strategies can be used to compute these values. We present and evaluate them in Section 4). For indexing a resource x, we begin by computing its MMB s upper left and lower right corners coordinates (ϕ1(x), λ1(x)) and (ϕ2(x), λ2(x)) respectively (Line 8). Then, we map each x to all hypercubes over which its MBB spans (Lines 9 11). To this end, we transform the MBB s corner coordinates into hypercube indices using ψ and ψ from Equation 3.
ψ (x) = x Δϕ ψ (x) = x Δϕ (3)
We then map x to all hypercubes with indices (i, j) where i, j Z, ψ (ϕ1(x)) i ψ (ϕ2(x)) and ψ (λ1(x)) j ψ (λ2(x)). Note that the special case of geometries passing over the antimeridian is detected and dealt with by splitting such geometries into 2 geometries before and after the antimeridian. The index I now contains the portions of the space (i.e., the hypercubes) within which portions of x can potentially be found. It is important to notice that entities in portions of space that do not belong to the hypercubes which contain elements of S (denoted I(S )) will always be disjoint with the elements of T. We leverage this insight as follows: We first index all s S . Then we follow the same procedure for t T (Lines 14 21) but only index geometries t that are potentially in hypercubes already contained in I(S ). This optimized sparse space tiling is the motivation for the previously introduced swapping strategy. Indexing the dataset with the least eth first results in an index I with less hypercubes. Consider again our running example in Figure 1(c) for the sake of illustration. Assume the granularity factors are Δϕ = Δλ = 2. The green area s MBB has the following corner coordinates: (ϕ1(g), λ1(g)) = (12.340703846780286, 51.28797110806819) and (ϕ2(g), λ2(g)) = (12.389192648396918, 51.33902633403139). Therefore, ψ (ϕ1(g)) = 24, ψ (λ1(g)) = 102, ψ (ϕ2(g)) = 25, ψ (λ2(g)) = 103 and thus this geometry will be indexed into the four highlighted hypercubes with index vectors (24, 102), (24, 103), (25, 102) and (25, 103). In Figure 1(d), we highlighted all hypercubes containing the gray geometry after the optimized sparse space tiling. Notice that many hypercubes are empty as a result of not containing any portion of the other dataset s geometries.
3.3 Link Generation After the computation of the index I, Radon implements the last speedup strategy using a MBB-based filtering technique. For each hypercube with indexed geometries from
both S and T (Line 24), Radon first discards unnecessary computations using the Test MBB procedure. Test MBB optimizes the subset of DE-9IM relations for relations where one geometry has interior or boundary points in the exterior of the other geometry, i.e. s t or t s (e.g. equals, covers, within formally defined in the annex (Sherif et al. 2016)). Let (g) denote the MBB geometry of a geometry g. Note that g (g) always holds. We can now infer r( (s), (t)) r(s, t) using the transitivity of . For all other relations, Test MBB simply returns true. For example, in our running example in Figure 1(b), if r is the within topological relation, we do not need to compute r for the blue geometry, as its MBB is not completely within the gray geometry s MBB. In case the Test MBB method returns true, Radon carries out the more expensive computation of the topological relation between the geometries s and t (Line 31). If r(s, t) holds, Radon adds the pair (s, t) to the result mapping M. To make sure that we compute each pair (s, t) S T at most once, we cache the computed pairs (s, t) in the mapping C (Lines 27-29). Proposition 1. Radon is complete and correct w.r.t. the application of space tiling.6
4 Evaluation Topological relations Only a subset of the topological relations obtainable through DE-9IM reflects the semantics of the English language (Clementini, Di Felice, and van Oosterom 1993; Clementini, Sharma, and Egenhofer 1994) including equals, within, contains, disjoint, touches, meets, covers, covered By, intersects, inside, crosses and overlaps. Note that some of these relations are synonyms (e.g., touches(x, y) meets(x, y)) while others are combinations of more atomic relations,(e.g., equals(x, y) within(x, y) contains(x, y)). Moreover, some relations are the reverse of some other relation. Hence, in this evaluation, we focused on the rapid computation of those 7 relations.7 In our running example in Figure 1(a), if we dub the blue, green and gray areas in a1, a2 and a3 respectively. Then, distinct(a1, a2), within(a2, a3) and intersects(a2, a3) hold.
Hardware and Software All experiments were carried out on a 64-core 2.3 GHz PC running Open JDK 64-Bit Server 1.7.0 75 on Ubuntu 14.04.2 LTS. Each experiment was assigned 20 GB RAM and a timeout limit of 2 hour. Experiments which ran longer than this upper limit were terminated and the processed data percentage as well as the estimated time are reported. We evaluate Radon against two state-of-the-art approaches; (1) Silk as it is (to the best of our knowledge) the only LD framework that supports the discovery of topological relation, (2) Strabon as it implements the standard Geo SPARQL (OGC 2010) and is based on Post GIS. For Silk experiments, we ran our experiments using its latest version (v2.6.1) with a blocking factor of 10 as in (Smeros and Koubarakis 2016). For Strabon, we also used the latest version (v3.2.10) with the accordingly tuned
6Proof is given in the annex (Sherif et al. 2016). 7See the annex (Sherif et al. 2016) for the formal definitions.
Algorithm 1: Radon
input : S , set of source resources. T, set of target resources. r, topological relation. output: M, Mapping {(s, t) S T : r(s, t)}
1 reversed false;
2 if eth(T) < eth(S ) then
3 swap(S, T);
5 reversed true; /* Get index I using optimized sparse space tiling */
6 (Δϕ, Δλ) Find Best Granularity(S, T);
7 foreach geometry s S do
8 (ϕ1(s), λ1(s), ϕ2(s), λ2(s)) Get MBBDiagonal Corners(s);
9 for i ϕ1(s) Δϕ to ϕ2(s) Δϕ do
10 for j λ1(s) Δλ to λ2(s) Δλ do
11 Insert Into Hypercube(I(S ), i, j, s);
12 j j + 1;
13 i i + 1;
14 foreach geometry t T do
15 (ϕ1(t), λ1(t), ϕ2(t), λ2(t)) Get MBBDiagonal Corners(t);
16 for i ϕ1(t) Δϕ to ϕ2(t) Δϕ do
17 for j λ1(t) Δλ to λ2(t) Δλ do
18 if Get Hypercube(I(S ), i, j) is not empty then
19 Insert Into Hypercube(I(T), i, j, t);
20 j j + 1;
21 i i + 1;
/* Generate Links */
22 foreach hypercube HS I(S ) do
23 HT Get Hypercube(I(T), ϕ(HS ), λ(HS ));
24 if HT is not empty then
25 for s HS do
26 for t HT do
27 if (s, t) C then
28 C C {(s, t)};
29 B (ϕ1(s), λ1(s), ϕ2(s), λ2(s)), (ϕ1(t), λ1(t), ϕ2(t), λ2(t)) ;
30 if Test MBB(r, B) then
31 if r(s, t) is true then
32 M M {(s, t)};
33 if reversed then return M ;
34 else return M ;
Postgre SQL (v9.1.13) and Post GIS (v2.0) as proposed by the developers. A more complete list of results can be obtained from the project website8. Note that Radon achieves a precision, a recall and an F-measure of 1 by virtue of its complete-
8http://aksw.org/Projects/LIMES
ness and correctness. Silk and Strabon theoretically achieve the same F-measure (we were not always able to check this value for the two systems as the experiments did not always terminate before the timeout).
Datasets We evaluated our approach using two real-world datasets. (1) NUTS9 is manually curated by the Eurostat group of the European Commission. NUTS contains a detailed hierarchical description of statistical regions for the whole European regions. (2) CORINE Land Cover or simply CLC is an activity of the European Environment Agency that collects data regarding the land cover of European countries. CLC contains 44 sub-datasets range in size from 240 to 248, 242 resources.10 For testing the scalability of Radon, we merged all subsets of CLC into one big dataset of size 2, 209, 538 (dubbed CLCm). We preprocessed the datasets in the following fashion: To enable the processing of the NUTS dataset by Radon, Silk and Strabon, the ngeo:pos List serialisation was converted into the WKT format prior to experiments. Moreover, because of a Silk issue11, we had to trim lines larger than 64 KB from all datasets in order to get a fair comparison. All the reported dataset sizes are after preprocessing.
Granularity Factor Selection Heuristic The aim of this experiment was to evaluate different heuristics to approximate the optimal granularity factors Δϕ and Δλ used for tiling the space and generating the sparse index of hypercubes. We tried 4 different heuristics corresponding to a statistical measure: minimum, maximum, median and average. Each heuristic first computes the respective statistical measure η independently for both datasets and both dimensions, resulting in 4 temporary values hη,ϕ(S ), hη,ϕ(T), hη,λ(S ), hη,λ(T). Finally, the granularity factor in each dimension is the average of the two datasets. Formally,
hη,ϕ(X) = η x X
max p x {ϕ(p)} min p x {ϕ(p)} (4)
Δη,ϕ(S, T) = 1
hη,ϕ(S ) + hη,ϕ(T) (5)
hη,λ(X) and Δη,λ(S, T) are defined similarly. S, T are the input source and target datasets, ϕ(p) the latitude of a point p, λ(p) the longitude of p and η {min, max, avg, median}. We used all the 44 subsets of the CLC dataset as input for this experiment and recorded how many times each heuristic achieved the best runtime for the intersects relation. Additionally, when a heuristic was not the best in a run, we computed the percentage it was worse than the best one. The average heuristic achieves the best result 24 times out of 44 experiments. Runner-up is median, achieving the best runtime 17 times. Finally, the min and max heuristics achieved only 2 and 1 time(s) respectively. Interestingly, average and median were only 4% slower than the best measure on average when not being the best, while min and max where 34%
9Version 0.91 (http://nuts.geovocab.org/data/0.91/) is used in this work. 10For more details about CLC see https://datahub.io/dataset/ corine-land-cover 11https://github.com/silk-framework/silk/issues/57
0 108 2 108 3 108 4 108 5 108 0
Figure 2: Speedup of Radon over Silk for the within relation. The x-axis represents the dataset sizes, y-axis represents the speedup. The blue line is the linear regression line.
and 61% worse on average respectively. Based on these results, we used the average heuristic as the granularity selection policy in the rest of the experiments. The basic idea behind the first three sets of of experiments is to quantify the speedup gained by Radon over other LD frameworks. To the best of our knowledge, only the Silk LD framework recently (Smeros and Koubarakis 2016) implemented a multi-dimensional blocking approach to compute the topological relations. Therefore, we compare Radon s and Silk s runtimes in the subsequent experiments In the first set of experiments, we aimed to quantify the speedup of Radon over the other state-of-the-art approaches when applied to small datasets. To this end, we ran 44 experiments for each of the 7 basic topological relations identified in the previous section. In each experiment, we compared one of the 44 subsets of the CLC with the full NUTS. Altogether, we carried out 308 experiments. Note that both Radon and Silk were ran on 1 core. Radon achieves an average speedup of 221.52, 213.76, 4.94, 4.82, 4.77, 4.76 and 4.75 times faster than Silk for the relations within, equals, covers, overlaps, intersects, crosses and touches respectively. Overall, Radon was able to outperform Silk by being 65.62 times faster on average over all topological relations. Moreover, Radon was able to achieve a linear speedup relative to the dataset sizes. In Figure 2, we show an overview of a subset of the experimental results (including a linear fit) achieved on the relations on which Radon achieved the best (up to two orders of 450 times faster) and the poorest (up to 6.5 times faster) relative performance w.r.t. Silk. Figures for other relations are given in the annex (Sherif et al. 2016). Moreover, Radon ran significantly less complete computations of the relations at hand. For example in Figure 3(a), Radon carries out only 3 and 4 computations for the "equals" and "within" relations respectively. On average, 449 times less computations per relation. In the second set of experiments we aimed to evaluate the scalability of Radon when applied to big datasets. Thus, we used the merged dataset CLCm as both source and target dataset and ran Radon and Silk on 1 core. For more complete results, see Table 1 in the annex (Sherif et al. 2016). Radon is able to finish all the tasks within 67.44 minutes on
1 SELECT ?s ?t WHERE {
2 GRAPH {?s geo:as WKT ?s_geometry.}
3 GRAPH {?t geo:as WKT ?t_geometry.}
4 FILTER( strdf:intersects(?s_geometry , ?t_geometry) )
Listing 1: SPARQL query for retrieving the intersects topological relation between resources from NUTS and CLC from Strabon.
average (maximum = 95.10 minutes for the crosses relation). Silk was only able to (on average) finalize 0.34% of each task within the 2 hour timeout limit. We extrapolated the runtime of Silk linearly to get an approximation of how long it would need to carry out the tasks at hand. On average, Silk would need 24.85 days to complete each task (linear extrapolation). Consequently, Radon is at least 715.16 times faster than Silk on average. These results emphasize the ability of Radon to deal with large datasets even when ran on 1 core. In the third set of experiments, we wanted to quantify the speedup gained by using a parallel implementation of Algorithm 1 over the parallel implementation of Silk. For load balancing in Radon, we used the simple round robin load balancing policy (Shreedhar and Varghese 1996) with chunks size of 1000. As data, we used CLCm as both source and target. The parallel implementations were configured to run using 2, 4 and 8 threads. The results (For detailed result see Table 1 in the annex (Sherif et al. 2016).) show that our parallel implementation for Radon was able to discover all the topological relations in 20.83 minutes on average (maximum of 49.03 minutes in the case of the intersect relation). On the other side, Silk implementation was only able to (on average) finalize 1.16% of each task within the 2 hours timeout limit. We extrapolated the performance of Silk s parallel implementation and computed that it will need an average of 4.36 days to finalize each task with 8 threads. Overall, our parallel implementation of Radon was up to 1725.77 times (834.69 times on average) faster than Silk, which clearly show the scalability of Radon s parallel implementation. In our fourth set of experiments, we aimed to compare Radon against Strabon on small datasets. The semantic spatio-temporal RDF store Strabon is not a LD framework but since it supports the Geo SPARQL and st SPARQL query languages, it can be employed for discovering topological relations via corresponding queries. To compare with Strabon, we used the same setting we used in the first set of experiments. Figure 3(b) shows the average runtimes result of both Radon and Strabon in seconds. In average, Radon was 11.99 times faster than Strabon. Interestingly, Strabon performed better than Radon on the intersects relation. The reason behind this behaviour is that Strabon uses an R-tree-over-Gi ST spatial index over the stored geometries in the underlying Post GIS database (Kyzirakos, Karpathiotakis, and Koubarakis 2012). This data structure is highly optimized for the retrieval of spatially connected objects. Hence, Strabon requires solely a data retrieval to compute
0 2 105 4 105 6 105 8 105
(a) Average number of computations of topological relations
0 100 200 300 400
RADON STRABON SILK
(b) Average runtime
Figure 3: Average number of complete computations of topological relations and average runtime for the datasets experiments. All runtimes are in seconds.
the intersects relation. However, this index is clearly outperformed by our sparse index in all the other relations. In our fifth and last set of experiments, we evaluated the scalability of Radon vs. Strabon when tackling big datasets. To this end, we applied the experimental setting we used in the second set of experiments (S = T = CLCm). Strabon was not able to finish any of the experiments within the 2hour time limit while Radon required approx. 95.10 minutes in the worst case. Given that Strabon provides no feedback pertaining to the progress of its tasks, we could not extrapolate its runtime. Thus, we attempted a smaller deduplication experiment with only one subset of CLC, CLC-243, which is about 10 times smaller than the merged CLCm dataset. Even these experiments did not finish within the 2-hour limit. Therefore, we approximated Strabon s runtime conservatively as follows: Assume that the CLC-243 deduplication experiments would have finished just one minute after the 2-hour timeout. Assuming that Strabon s runtime scales linear with the input dataset size, the merged CLCm experiments would take roughly 20.17 hours. Having this overly optimistic estimate of Strabon s runtime, Radon achieves an average speedup of 24. When we move from the assump-
tion that Strabon scales linearly to the more realistic assessment that it scales in O(n2), then we get an average speedup of 241. Overall, our results show clearly that Radon outperforms the state of the art by up to 3 orders of magnitude.
5 Related Work Based on the original works of Egenhofer et al. (Egenhofer and Franzosa 1991), Clementini et al. (Clementini, Sharma, and Egenhofer 1994) propose the The DE-9IM model to capture the topological relations in the R2. In addition, the Simple Features Model proposed by OGC12 contain different subsets of the topological relations that derive from the DE-9IM. Geo SPARQL (OGC 2010) is a recent OGC standard that proposes a query language that enable the discovery of topological relations. Geo SPARQL is implemented in the spatiotemporal RDF store Strabon (Kyzirakos, Karpathiotakis, and Koubarakis 2012). Other frameworks such as Virtuoso13 and newly Blaze Graph14 support geo-spatial extensions of SPARQL. The discovery of topological relations has been paid little attention to in previous research related to Link Discovery (Auer et al. 2013). Up to now, the state-of-the-art LD frameworks were able to discover only spatial similarities (Salas and Harth 2011; Sehgal, Getoor, and Viechnicki 2006; Vilches-Blázquez, Saquicela, and Corcho 2012). For example, (Ngonga Ngomo 2013) uses the Hausdorffdistance to compute the point-set distance between geo-spatial entities. In recent work, (Georgala, Sherif, and Ngonga Ngomo 2016) implements an efficient approach for Allen Relations extraction.To the best of our knowledge, the only LD framework that support discovery of topological relations is Silk (Smeros and Koubarakis 2016). Based on Multi Blocking technique, Silk computes the topological relations according to the DE-9IM standard between geo-spatial resources. A review of the current state of LD frameworks is in (Nentwig et al. 2015).
6 Conclusions and Future Work We presented Radon, an approach for rapid discovery of topological relations among geo-spatial resources. Radon combines space tiling, minimum bounding box approximation and a sparse index to achieve a high scalability. We evaluated Radon with real datasets of various sizes and showed that in addition to being complete and correct, it also outperforms the state of the art by up to three orders of magnitude (e.g., equals relation against Silk). In future work, we aim to apply more sophisticated load balancing approaches, such as the particle-swarm-optimization based approaches (Sherif and Ngomo 2015). In addition, we will consider the usage of other topology approximation methods.
Acknowledgments This work has been supported by the European Union s H2020 research and innovation action HOBBIT (GA no. 688227) and the BMWI Project GEISER (project no. 01MD16014).
12http://www.opengeospatial.org/standards/sfs 13http://virtuoso.openlinksw.com/ 14https://www.blazegraph.com/
References Auer, S.; Lehmann, J.; Ngomo, A.-C. N.; and Zaveri, A. 2013. Introduction to Linked Data and Its Lifecycle on the Web. In Rudolph, S.; Gottlob, G.; Horrocks, I.; and van Harmelen, F., eds., Reasoning Web, volume 8067 of LNCS, 1 90. Springer. Clementini, E.; Di Felice, P.; and van Oosterom, P. 1993. Advances in Spatial Databases: 3rd International Symposium, SSD 93 Singapore, June. Springer. chapter A small set of formal topological relationships suitable for end-user interaction, 277 295. Clementini, E.; Sharma, J.; and Egenhofer, M. J. 1994. Modelling topological spatial relations: Strategies for query processing. Computers & graphics 18(6):815 822. Egenhofer, M. J., and Franzosa, R. D. 1991. Point-set topological spatial relations. International Journal of Geographical Information System 5(2):161 174. Georgala, K.; Sherif, M. A.; and Ngonga Ngomo, A.-C. 2016. An Efficient Approach for the Generation of Allen Relations. In European Conference on Artificial Intelligence (ECAI). Kyzirakos, K.; Karpathiotakis, M.; and Koubarakis, M. 2012. Strabon: A semantic geospatial DBMS. In CudréMauroux, P.; Heflin, J.; Sirin, E.; Tudorache, T.; Euzenat, J.; Hauswirth, M.; Parreira, J. X.; Hendler, J.; Schreiber, G.; Bernstein, A.; and Blomqvist, E., eds., ISWC 2012, Boston, USA, November, 2012, volume 7649 of LNCS, 295 311. Springer. Lehmann, J.; Furche, T.; Grasso, G.; Ngonga Ngomo, A.-C.; Schallhart, C.; Sellers, A.; Unger, C.; Bühmann, L.; Gerber, D.; Höffner, K.; Liu, D.; and Auer, S. 2012. Deqa: Deep web extraction for question answering. In Proceedings of ISWC. Lehmann, J. 2009. DL-Learner: learning concepts in description logics. Journal of Machine Learning Research (JMLR) 10:2639 2642. Nentwig, M.; Hartung, M.; Ngomo, A.-C. N.; and Rahm, E. 2015. A survey of current link discovery frameworks. Semantic Web Journal. Ngonga Ngomo, A.-C. 2012. Link discovery with guaranteed reduction ratio in affine spaces with minkowski measures. In International Semantic Web Conference (1), 378 393. Ngonga Ngomo, A.-C. 2013. Orchid reduction-ratiooptimal computation of geo-spatial distances for link discovery. In The Semantic Web ISWC 2013. Springer. 395 410. OGC. 2010. Geo SPARQL - A geographic query language for RDF data. O Rourke, J. 1985. Finding minimal enclosing boxes. International journal of computer & information sciences 14(3):183 199. Salas, J., and Harth, A. 2011. Finding spatial equivalences accross multiple RDF datasets. In Proceedings of the Terra Cognita Workshop on Foundations, Technologies and Applications of the Geospatial Web, 114 126. Citeseer. Sehgal, V.; Getoor, L.; and Viechnicki, P. D. 2006. Entity resolution in geospatial data integration. In Proceedings of
the 14th annual ACM international symposium on Advances in geographic information systems, 83 90. ACM. Sherif, M. A., and Ngomo, A.-C. N. 2015. An optimization approach for load balancing in parallel link discovery. In Proceedings of the 11th International Conference on Semantic Systems, SEMANTICS 15, 161 168. New York, NY, USA: ACM. Sherif, M. A.; Dreßler, K.; Smeros, P.; and Ngonga Ngomo, A.-C. 2016. Annex: Radon - Rapid Discovery of Topological Relations. Ar Xiv e-prints. Shreedhar, M., and Varghese, G. 1996. Efficient fair queuing using deficit round-robin. Networking, IEEE/ACM Transactions on 4(3):375 385. Smeros, P., and Koubarakis, M. 2016. Discovering Spatial and Temporal Links Among RDF Data. In WWW2016 Workshop: Linked Data on the Web (LDOW2016). Vilches-Blázquez, L. M.; Saquicela, V.; and Corcho, O. 2012. Interlinking geospatial information in the web of data. In Bridging the Geographic Information Sciences. Springer. 119 139.