# mining_el_bases_with_adaptable_role_depth__5c1b2783.pdf

Mining ELK Bases with Adaptable Role Depth

Ricardo Guimarães,1 Ana Ozaki,1 Cosimo Persia,1 Baris Sertkaya2

1 Department of Informatics, University of Bergen, Norway 2 Frankfurt University of Applied Sciences, Germany Ricardo.Guimaraes@uib.no, Ana.Ozaki@uib.no, Cosimo.Persia@uib.no, Sertkaya@fb2.fra-uas.de

In Formal Concept Analysis, a base for a ﬁnite structure is a set of implications that characterizes all valid implications of the structure. This notion can be adapted to the context of Description Logic, where the base consists of a set of concept inclusions instead of implications. In this setting, concept expressions can be arbitrarily large. Thus, it is not clear whether a ﬁnite base exists and, if so, how large concept expressions may need to be. We ﬁrst revisit results in the literature for mining ELK bases from ﬁnite interpretations. Those mainly focus on ﬁnding a ﬁnite base or on ﬁxing the role depth but potentially losing some of the valid concept inclusions with higher role depth. We then present a new strategy for mining ELK bases which is adaptable in the sense that it can bound the role depth of concepts depending on the local structure of the interpretation. Our strategy guarantees to capture all ELK

concept inclusions holding in the interpretation, not only the ones up to a ﬁxed role depth.

1 Introduction

Among its many applications in artiﬁcial intelligence, logic is used to formally represent knowledge. Such knowledge, often in the form of facts and rules, enables machines to process complex relational data, deduce new knowledge from it, and extract hidden relationships in a speciﬁc domain. A well-studied formalism for knowledge representation is given by a family of logics known as description logics (DLs) (Baader et al. 2017). DL is the logical formalism behind the design of many knowledge-based applications. However, it is often difﬁcult and time-consuming to manually model in a formal language rules and constraints that hold in a domain of knowledge. In this work, we consider an automatic method to extract rules (concept inclusions (CIs)) formulated in DL from data. This data can be, for instance, a collection of facts in a database or a knowledge graph. For instance, in the DBpedia knowledge graph (Lehmann et al. 2015), one can represent the relationship between a city a and the region b it belongs to with the facts citypaq, regionpbq, partofpa, bq, and capitalpb, aq. From this data, one can mine a CI expressing that a capital is a city that is part of a region.

Copyright c 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

To mine CIs that hold in a dataset, we combine notions of Formal Concept Analysis (FCA) (Ganter and Wille 1999) and DLs. FCA is a subﬁeld of lattice theory that provides methods for analysing datasets and identifying the dependencies in them. In FCA a dataset, also called a formal context, is a table showing which objects have which attributes. Given a formal context, FCA methods are used to extract the dependencies between the attributes, also called implications (Figure 1). A base is a set of implications that entails every valid implication of the dataset and only those (soundness and completeness). It can be used for detecting erroneous or missing items in the dataset (Baader et al. 2007). In the DL setting, a base is a set of CIs (an ontology) which can serve as a starting point for ontology engineers to build an ontology in a domain of interest.

However, for some DLs and datasets, it may happen that no ﬁnite base exists. Cyclic relationships are common in knowledge graphs and they are the main challenge for ﬁnding a ﬁnite DL base. With only one cyclic relationship, we already have that inﬁnitely many concepts hold in the dataset. Strategies for limiting the size of concepts in the presence of cyclic dependencies have already been investigated in the literature. Baader and Distel (2008) and Distel (2011) propose a way of mining DL ﬁnite bases expressible in the DL ELK gfp which is the addition of greatest ﬁxpoint semantics to the DL language ELK. The semantics offered by ELK gfp elegantly solves the difﬁculty of mining CIs from cyclic relationships in the data. However, this semantics comes with two drawbacks. Firstly, ELK gfp concepts may be difﬁcult to understand, and learned CIs may be too complex to validate by domain experts. Secondly, there is no efﬁcient implementation of a reasoner for ELK gfp, even though the reasoning complexity is tractable, like for ELK. The authors also show how to transform an ELK gfp base into an EL base. However, it is far from being trivial to avoid the step of creating an ELK gfp base in their approach.

A simpliﬁcation of the mentioned work has been proposed by Borchmann, Distel, and Kriegel (2016) where they show how to mine ELK ﬁnite bases with a predeﬁned and ﬁxed role depth for concept expressions. As a consequence, the base is sound and complete only w.r.t. CIs containing concepts with bounded role depth. Their approach avoids

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

City Region Dpartof.J Settlement a ˆ ˆ ˆ b ˆ ˆ ˆ c ˆ ˆ

Figure 1: (a) A dataset with 4 attributes and 3 objects. (b) The implications City Ñ Dpartof.J and City Ñ Settlement hold in the dataset but City Ñ Region does not hold in it.

the step of creating an ELK gfp base but also avoids the main challenge in creating a ﬁnite base for ELK, which is the fact that the role depth of concepts can be arbitrarily large. Our work brings together the best of the approaches by Distel (2011) and Borchmann, Distel, and Kriegel (2016): we directly compute a ﬁnite ELK base that captures the whole language (not only up to a certain role depth). In particular, we present a new approach for computing the role depth of concepts which adapts depending on the objects considered during the computation of CIs.

Related work. Several authors have worked on combining FCA and DLs or on applying methods from one ﬁeld to the other (Ozaki 2020). Baader (1995) uses FCA to compute the subsumption hierarchy of the conjunction of predeﬁned concepts. Rudolph (2004; 2006) uses the DL FLE for the deﬁnition of FCA attributes and FCA techniques for generating a knowledge base. Baader et al. (2007) uses FCA for completing missing knowledge in a DL knowledge base. Baader et al. (2000) proposes a method for building DL ontologies through the interaction of domain experts. Sertkaya (2010) presents a survey on applications of FCA methods in DLs. Borchmann and Distel (2011) provide a practical application of the theory developed by Distel (2011) on knowledge graphs. Borchmann (2014) shows how a base of conﬁdent ELK gfp concept inclusions can be extracted from a DL interpretation. Monnin et al. (2017) compare, using FCA techniques, data present in DBpedia with the constraints of a given ontology to check if the data is compliant with it. Kriegel (2019a) among other contributions extends the results by Borchmann, Distel, and Kriegel (2016) to a logic that is more expressive than ELK. He also investigates the same problem for probabilistic DLs (Kriegel 2019b).

In the next section, we present basic deﬁnitions and notation. In Section 3, we present the problem of mining ELK CIs and establish lower bounds for this problem. In Section 4, we present our main result for mining ELK bases with adaptable role depth. Our result uses a notion that relates each vertex in a graph to a set of vertices, called maximum vertices from (MVF). In Section 5, we show that the MVF of a vertex in a graph can be computed in linear time in the size of the graph. Missing proofs can be found in the long version (Guimarães et al. 2021).

2 Preliminaries

We introduce the syntax and semantics of ELK and basic deﬁnitions related to description graphs used in the paper.

The Description Logic ELK

ELK (Baader, Brandt, and Lutz 2005) is a lightweight DL, which only allows for expressing conjunctions and existential restrictions. Despite this rather low expressive power, slight extensions of it have turned out to be highly successful in practical applications, especially in the medical domain (Spackman, Campbell, and Cote 1997). We use two ﬁnite and disjoint sets, NC and NR, of concept and role names to deﬁne the syntax and semantics of ELK. ELK concept expressions are built according to the grammar rule C, D :: A | J | K | C [ D | Dr.C with A P NC and r P NR. We write Drn 1.C as a shorthand for Dr.p Drn.Cq where Dr1.C : Dr.C. An ELK TBox is a ﬁnite set of concept inclusions (CIs) C Ď D, where C, D are ELK concept expressions. We may omit ELK when we speak of concept expressions, CIs, and TBoxes, if this is clear from the context. We may write C D (an equivalence) as a short hand for when we have both C Ď D and D Ď C. The signature of a concept expression, a CI, or a TBox is the set of concept and role names occurring in it. The semantics of ELK is based on interpretations. An interpretation I is a pair p I, Iq where I is a non-empty set, called the domain of I, and I is a function mapping each A P NC to a subset AI of I and each r P NR to a subset r I of I ˆ I. The function I extends to arbitrary ELK concept expressions as usual:

p C [ Dq I : CI X DI p Jq I : I p Kq I : H

p Dr.Cq I : tx P I | px, yq P r I and y P CIu

An interpretation I satisﬁes a CI C Ď D, in symbols I |ù C Ď D, iff CI Ď DI. It satisﬁes a TBox T if it satisﬁes all CIs in T . A TBox T entails a CI C Ď D, written T |ù C Ď D, iff all interpretations satisfying T also satisfy C Ď D. We write ΣI for the set of concept or role names X such that XI H. A ﬁnite interpretation is an interpretation with a ﬁnite domain.

Description Graphs, Products, and Unravellings We also use the notion of description graphs (Baader 2003). The description graph Gp Iq p VI, EI, LIq of an interpretation I is deﬁned as (e.g. Figure 4): 1. VI I; 2. EI tpx, r, yq | r P NR and px, yq P r Iu; 3. LIpxq t A P NC | x P AIu.

The description tree of an ELK concept expression C over the signature Σ is the ﬁnite directed tree Gp Cq p VC, EC, LCq where VC is the set of nodes, EC Ď VC ˆ NR ˆ VC is the set of edges, and LC : V Ñ 2NC is the labelling function. Gp Cq is deﬁned inductively: 1. for C J, VC tρCu and LCpρCq H where ρC is the root node of the tree; 2. for C A P NC, VC tρCu and LCpρCq A; 3. for C D1 [ D2, Gp Cq is obtained by merging the roots ρD1, ρD2 in one ρC with LCpρCq LD1pρD1q Y LD2pρD2q;

City [ Dgovernment.Party[ Dpartof.p Region [ Dcapital.Jq City

Region govern.

partof capital

Figure 2: A concept expression and its description graph.

4. for C Dr.D, Gp Cq is built from Gp Dq by adding a new node (root) ρC to VD and an edge pρC, r, ρDq to ED.

The concept expression (unique up to logical equivalence) Cp Gvq of a tree shaped graph Gv p V, E, Lq rooted in v is

j 1 Drj.Cp Gwjq,

where Lpvq t Pi | 1 ď i ď ku, pv, rj, wjq P E (and there are l such edges) and Cp Gwjq is inductively deﬁned, with Gwj being the subgraph of G rooted in wj. A walk in a description graph G p V, E, Lq between two nodes u, v P V is a word w v0r0v1r1 . . . rn 1vn where v0 u, vn v, vi P V , ri P NR and pvi, ri, vi 1q P E for all i P t0, . . . , n 1u. The length of w in this case is n, in symbols, |w| n. Walks with length n 0 are possible, it means that the walk has just one vertex (no edges). Vertices and edges may occur multiple times in a walk. Let G p V, E, Lq be an ELK description graph with x P V and d P N. Denote by δpwq the last vertex in the walk w. The unravelling of G up to depth d is the description graph Gx d p Vd, Ed, Ldq starting at node x deﬁned as follows:

1. Vd is the set of all directed walks in G that start at x and have length at most d;

2. Ed tpw, r, wrvq | v P V, r P NR, w, wrv P Vdu;

3. Ldpwq Lpδpwqq.

A path is a walk where vertices do not repeat.

a a1b City Region 1

City Region

City Region a1b2a1b

Figure 3: Unravelling of the description graph of the interpretation I in piq. For readability, partof has been replaced with symbol 1 and capital with symbol 2. piiq depicts Gp Iqa 1 and piiiq depicts Gp Iqa 3.

Let G1, . . . , Gn be n description graphs such that Gi p Vi, Ei, Liq. Then the product of G1, . . . , Gn is the description graph p V, E, Lq deﬁned as:

1. V Śn i 1 Vi;

2. E tppv1, . . . , vnq, r, pw1, . . . , wnqq | r P NR, pvi, r, wiq P Ei, for all 1 ď i ď nu;

3. Lpv1, . . . , vnq Şn i 1 Lipviq.

Party, Liberal

Region govern.

partof capital

Party, Organization

Region govern.

Figure 4: Description graph of the interpretation I ttx1, , x7u, Iu where tx1, x2u City I, tx3, x4u Party I, tpx1, x5q, px2, x7qu partof I, etc.

If each Gi is a tree with root vi then we denote by śn i 1 Gi the tree rooted in pv1, . . . , vnq contained in the product graph of G1, . . . , Gn.

3 Mining ELK Bases

The set of all ELK CIs that are satisﬁed by an interpretation I is in general inﬁnite because whenever I |ù C Ď D, I |ù Dr.C Ď Dr.D as well. Therefore one is interested in a ﬁnite and small set of CIs that entails the whole set of valid CIs. For mining such a set of CIs from a given interpretation we employ ideas from FCA and recall literature results.

Deﬁnition 1. A TBox T is a base for a ﬁnite interpretation I and a DL language L, if for every CI C Ď D, formulated within L and ΣI: I |ù C Ď D iff T |ù C Ď D.

We say that a DL has the ﬁnite base property (FBP) if, for all ﬁnite interpretations I, there is a ﬁnite base with CIs formulated within the DL language and ΣI. Not all DLs have the ﬁnite base property. Consider for instance the fragments ELK rhs (and ELK lhs) of ELK that allows only concept names on the left-hand (right-hand) side but complex ELK concept expressions on the right-hand (left-hand) side of CIs.

Proposition 1. ELK rhs and ELK lhs do not have the FBP.

Proof Sketch. No ﬁnite base ELK rhs exists for the interpretation in Figure 5 piq. For every n ě 1, the ELK rhs base should entail the CI A Ď Drn.J. Similarly, no ﬁnite ELK lhs base exists for the interpretation in Figure 5 piiq. For every n ě 1, the ELK lhs base should entail the CI Ds.Drn.B Ď A.

Figure 5: Lack of the FBP for ELK rhs piq and ELK lhs piiq.

The main difﬁculty in creating an ELK base is knowing how to deﬁne the role depth of concept expressions in the base. In a ﬁnite interpretation, an arbitrarily large role depth means the presence of a cyclic structure in the interpretation. However, ELK concept expressions cannot express cycles. The difﬁculty can be overcomed by extending ELK with

greatest ﬁx-point semantics. It is known that the resulting DL, called ELK gfp, has the FBP (Baader and Distel 2008; Distel 2011). The authors then show how to transform an ELK gfp base into an ELK base, thus, establishing that ELK

also enjoys the FBP. In the following, we show that, although ﬁnite, the role depth of a base for ELK and a (ﬁnite) interpretation I can be exponential in the size of I.

Example 1. Consider I represented in the shaded area in Figure 6. For p1 2, p2 3, p3 5 and for all k P N , we have that xi P p Drk pi 1.Aq I, where 1 ď i ď 3. We know that 30 minpŞ3 i 1tk pi | k P N uq śn i 1 pi (which is the least common multiple). We also know that for any n, p P N , n 1 is a multiple of p iff n is a multiple of p minus 1. Therefore, the number

i 1 tk pi 1 | k P N uq,

such that tx1, x2, x3u BI p Drd.Aq I, is ś3 i 1 pi 1 29. A base for I should have the CI with role depth at least d because it has to entail the CI B Ď Drd.A.

Theorem 1. There is a ﬁnite interpretation I p I, Iq such that any ELK base for I has a concept expression with role depth exponential in the size of I.

Proof Sketch. We can generalise Example 1 to the case where we have an interpretation J that for an arbitrary n ą 1, and for every i P t1, , nu and k P N , there is an x P J that satisﬁes x P p Drk pi 1.Aq J where pi is the i-th prime number. In this case, the minimal role depth of concepts in any base for J must be d ě śn i 1 pi 1 ě 2n.

Figure 6: Description graph of an interpretation I. Let X tx1, x2, x3u. For all d ă 29 we have x4 P C pś x PX Gp Iqx dq I p B [ Drd.Jq I. However, for all k ě 29, x4 R C pś x PX Gp Iqx kq I since x4 R p Dr29.Aq I.

In addition to the role depth of the concept expressions in the base, the size of the base itself can also be exponential in the size of the data given as input, which is a well-known result in classical FCA (Kuznetsov 2004). The DL setting is more challenging than classical FCA, and so, this lower bound also holds in the problem we consider. In Section 4, we present our deﬁnition of an ELK base for a ﬁnite interpretation I and highlight cases in which the role depth is polynomial in the size of I.

4 Adaptable Role Depth We present in this section our main result which is our strategy to construct ELK bases with adaptable role depth. To deﬁne an ELK base, we use the notion of a model-based most speciﬁc concept (MMSC) up to a certain role depth. The MMSC plays a key rôle in the computation of a base from a given ﬁnite interpretation.

Deﬁnition 2. An ELK concept expression C is a modelbased most speciﬁc concept of X Ď I with role depth d ě 0 iff (1) X Ď CI, (2) C has role depth at most d, and (3) for all ELK concept expressions D with role depth at most d, if X Ď DI then H |ù C Ď D. For a given X Ď CI and a role depth d there may be multiple MMSCs (always at least one (Borchmann, Distel, and Kriegel 2016)) but they are logically equivalent. So we write the MMSC of X with role depth d (in symbols mmsc p X, I, dq), meaning a representative of such class of concepts. As a consequence of Deﬁnition 2, if X H then mmsc p X, I, dq K for any interpretation I and d P N. Example 2. Consider the interpretation I in Figure 4 and let X tx1, x2u. We have that mmsc p X, I, 1q is equivalent to

City [ Dgovernment.Party [ Dpartof.Region.

With an increasing k, the concept expression mmsc p X, I, kq can become more and more speciﬁc. Indeed, mmsc p X, I, 2q is equivalent to

mmsc p X, I, 1q [ Dpartof.p Region [ Dcapital.Jq

which is more speciﬁc than mmsc p X, I, 1q. However, for any k ě 2, mmsc p X, I, 2q mmsc p X, I, kq. For a ﬁxed d, a straightforward (and inefﬁcient) way of computing mmsc pt Xu, I, dq would be conjoining every ELK concept expression C (over NC Y NR) such that X Ď CI and the depth of C is bounded by d. A more elegant method for computing MMSCs is based on the product of description graphs and unravelling cyclic concept expressions up to a sufﬁcient role depth. The MMSC can be written as the concept expression obtained from the product of description graphs of an interpretation (Borchmann, Distel, and Kriegel 2016). Formally, if I p I, Iq is a ﬁnite interpretation, X tx1, . . . , xnu Ď I and a d ě 0, then

mmsc pt Xu, I, dq Cp

i 1 Gp Iqxi d q.

The interesting challenge is how to identify the smallest d that satisﬁes the property: if x P mmsc p X, I, dq I, then x P mmsc p X, I, kq I for every k ą d. In the following, we develop a method for computing MMSCs with a role depth that is suitable for building an ELK base of the given interpretation. This method is based on the already mentioned MVF notion, deﬁned as follows. Deﬁnition 3. Given a description graph G p V, Eq with u P V , we deﬁne the maximum vertices from (or MVF) u in G, denoted mvfp G, uq, as:

maxtvnumpwq | w is a walk in G starting at uu

where vnumpwq is the number of distinct vertices occurring in w. Additionally, we deﬁne the function mmvf as follows:

mmvfp Gq : max u PV mvfp G, uq.

In other words, MVF measures the maximum number of distinct vertices that a walk with a ﬁxed starting point can visit in the graph. Example 3. Consider the interpretation I in Figure 4. Any walk in the description graph of I starting at x1 will visit at most three distinct vertices (including x1). Although there are four vertices reachable from x1, we have that mvfp Gp Iq, x1q 3. For the vertex x2, there are walks of any ﬁnite length, but we visit at most three distinct vertices, namely, x2, x4, x7, and mvfp Gp Iq, x2q 3. For computing the MMSC up to a sufﬁcient role depth based on MVF we use the following notion of simulation. Deﬁnition 4. Let G1 p V1, E1, L1q, G2 p V2, E2, L2q be ELK description graphs and pv1, v2q P V1 ˆ V2. A relation Z Ď V1 ˆ V2 is a simulation from p G1, v1q to p G2, v2q, if (1) pv1, v2q P Z, (2) pw1, w2q P Z implies L1pw1q Ď L2pw2q, and (3) pw1, w2q P Z and pw1, r, w1 1q P E1 imply there is w1 2 P V2 such that pw2, r, w1 2q P E2 and pw1 1, w1 2q P Z. Simulations can be used to decide whether an individual from an interpretation domain belongs to the extension of a given concept expression. Lemma 1 ((Borchmann, Distel, and Kriegel 2016)). Let I be an interpretation, let C be an ELK concept expression, and let Gp Cq p VC, EC, LCq be the ELK description graph of C with root ρC. For every x P I, there is a simulation from p Gp Cq, ρCq to p Gp Iq, xq iff x P CI. Lemma 1 together with other previous results is used below to prove Lemma 2, which is crucial for deﬁning the adaptable role depth. It shows the upper bound on the required role depth of the MMSC. Lemma 2. Let I p I, Iq be a ﬁnite interpretation and take an arbitrary X tx1, . . . , xnu Ď I, x1 P I, and k P N. Let

i 1 Gp Iq, px1, . . . , xnq

mvfp Gp Iq, x1q.

If x1 P C pśn i 1 Gp Iqxi d q I then x1 P C pśn i 1 Gp Iqxi k q I.

Proof Sketch. We show in the long version (Guimarães et al. 2021) the following claim.

Claim 1. For all description graphs G p V, E, Lq and G1 p V 1, E1, L1q, all vertices v P V and v1 P V 1, and

d mvfp G, vq mvfp G1, v1q

if there is a simulation Zd : p Gv d, vq ÞÑ p G1, v1q, then there is a simulation Zk : p Gv k, vq ÞÑ p G1v1q for all k P N.

If k ď d, one can restrict Zd to the vertices of Gv k, which would be a subgraph of Gv d. Otherwise, the intuition behind this claim is that the pairs in Zd deﬁne a walk in G1 for each walk in G that has length at most d 1. And if a walk in G has

length at least d 1, then there is a vertex w that this walk visits twice while the image of this walk in G1 also repeats a vertex at the same time. This paired repetition can be used to ﬁnd a matching vertex in V 1 for each vertex of Gv k by recursively shortening the walk that this vertex corresponds to if it has length d or larger. Lemma 1 and x1 P C pśn i 1 Gp Iqxi d q I imply that there is a simulation Zd from pśn i 1 Gp Iqxi d , px1, . . . , xnqq to p Gp Iq, x1q. Then, by Claim 1 there is a simulation Zk : pśn i 1 Gp Iqxi k , px1, . . . , xnqq ÞÑ p Gp Iq, x1q (we just need to take G śn i 1 Gp Iq, G1 Gp Iq, v px1, . . . , xnq and v1 x1). Therefore, Lemma 1 implies that x1 P C pśn i 1 Gp Iqxi k q I.

Lemma 2 shows that even for vertices that are parts of cycles, there is a certain depth of unravellings, which we call a ﬁxpoint, that is guaranteed to be an upper bound. Proposition 2 gives an intuition about how large the MVF of a vertex in a product graph can be when compared to the MVF of the corresponding vertices in the product s factors. Proposition 2. Let t Gi | 1 ď i ď nu be n description graphs such that Gi p Vi, Ei, Liq. Also let vi P Vi. Then:

i 1 Gi, pv1, . . . , vnq

i 1 mvfp Gi, viq.

Proof. Let w be an arbitrary walk in śn i 1 Gi, pviq1ďiďn that starts in pv1, . . . , vnq and let pw1, . . . , wnq be a vertex in this walk. It follows from the deﬁnition of product that each wi belongs to a walk in Gi that begins in vi. Therefore, there are only mvfp Gi, viq options for each wi. Hence, there are at most śn i 1 mvfp Gi, viq possible options for pw1, . . . , wnq. In other words, vnumpwq ď śn i 1 mvfp Gi, viq. Since w is arbitrary, we can conclude that mvf pśn i 1 Gi, pv1, . . . , vnqq ď śn i 1 mvfp Gi, viq.

Although the MVF of a product can be exponential in | I|, there are many cases in which it is linear in | I|. Example 4 illustrates one such case. Example 4. Consider the interpretation of Figure 4. The elements x1, x3, x4, x5 and x6 never reach cycles, therefore, each of them can only have walks up to a ﬁnite length. Take X tx1, x2u. Since every walk in Gp Iq starting from x1 has length at most 2, the longest walk possible in ś i Pt1,2u Gp Iq starting at node px1, x2q is: px1, x2q, partof, px5, x7q, capital, px6, x2q. Thus mvf ś i Pt1,2u Gp Iq, px1, x2q 2. Take X tx1, x7u,

then mvf ś i Pt1,7u Gp Iq, px1, x7q 1, since x1 and x7 do not share labels in their outgoing edges. The observations about the MVF in Example 4 are generalised in Lemma 3 which shows a sufﬁcient condition for polynomial (linear) role depth. Lemma 3. Let I p I, Iq be a ﬁnite interpretation and X tx1, . . . , xnu Ď I. If for some 1 ď i ď n it holds that every walk in Gp Iq starting at xi has length at most m for some m P N, then mvf pśn i 1 Gp Iq, px1, . . . , xnqq ď mvf p Gp Iq, xiq.

Proof Sketch. As it happens in Example 4, it can be proven that whenever there is a vertex xi for which every walk starting at it has length at most m, then m also bounds the lengths of the walks starting at px1, . . . xnq in śn i 1 Gp Iq.

Combining the bounds for the ﬁxpoint and MVF given by Lemmas 2 and 3, we can deﬁne a function that returns an upper approximation of the ﬁxpoint, for any subset of the domain of an interpretation, as follows. Deﬁnition 5. Let I p I, Iq be a ﬁnite interpretation and X tx1, . . . , xnu Ď I. Also let

Xlim tx P X | Dm P N : every walk starting at x in Gp Iq has length ď mu.

The function d I : Pp Iq ÞÑ N is deﬁned as follows:

d I p Xq "d 1 if Xlim H d mmvfp Gp Iqq otherwise,

where d mvf pśn i 1 Gp Iq, px1, . . . , xnqq. Next, we prove that function d I is indeed an upper bound for the ﬁxpoint of an MMSC. The idea sustaining Lemma 4 is that if x P X Ď I and every walk in Gp Iq starting at x has length at most m, then m can be used as a ﬁxpoint depth for the MMSC of X in I. Lemma 2 covers the cases where vertices are the starting point of walks of any length. Lemma 4. Let I p I, Iq be a ﬁnite interpretation and X Ď I. Then, for any k P N, it holds that:

mmsc p X, I, d I p Xqq I Ď mmsc p X, I, kq I .

Proof Sketch. Let X tx1, . . . , xnu Ď I. If k ď d I p Xq, the lemma holds trivially. For k ą d I p Xq we divide the proof in two cases. First, if there is a xi P X such that every walk in Gp Iq starting at xi has length at most m for some m P N, then as stated in Lemma 3, every walk in śn i 1 Gp Iq starting at px1, . . . , xnq has length at most mvf pśn i 1 Gp Iq, px1, . . . , xnqq 1. In other words, even when k ą d I p Xq, we have: śn i 1 Gp Iqx k śn i 1 Gp Iqx d Ip Xq, and therefore, we can ap-

ply Lemma 1 to conclude that: mmsc p X, I, d I p Xqq I Ď mmsc p X, I, kq I. Otherwise, if Xlim H, the lemma is a direct consequence of Deﬁnition 5 and Lemma 2.

In the remaining of this paper, we write mmsc p X, Iq as a shorthand for mmsc p X, I, d I p Xqq. An important consequence of Lemma 4 and the deﬁnition of MMSC is that, for any ELK concept expression C and ﬁnite interpretation I, it holds that CI mmsc CI, I I.

Lemma 5. Let I p I, Iq be a ﬁnite interpretation. Then, for all ELK concept expression C it holds that: mmsc CI, I I CI.

Proof. Direct consequence of Lemma 4.4 (vi) of (Borchmann, Distel, and Kriegel 2016) and Lemma 4.

We use this result below to deﬁne a ﬁnite set of concept expressions MI for building a base of the CIs valid in I.

Deﬁnition 6. Let I p I, Iq be a ﬁnite interpretation. The set MI is the union of t Ku Y NC and

t Dr.mmsc p X, Iq | r P NR and X Ď I, X Hu

We also deﬁne ΛI td U | U Ď MIu. Building the base mostly relies on the fact that, given a ﬁnite interpretation I, for any ELK concept expression C, there is a concept expression D P ΛI such that CI DI. Theorem 2. Let I be a ﬁnite interpretation and let ΛI be deﬁned as above. Then,

Bp Iq t C mmsc CI, I | C P ΛIu Y

t C Ď D | C, D P ΛI and I |ù C Ď Du

is a ﬁnite ELK base for I.

Proof Sketch. As ΛI is ﬁnite, so is Bp Iq. The CIs are clearly sound and the soundness of the equivalences is due to Lemma 5. For completeness, assume that I |ù C Ď D. Using an adaptation of Lemma 5.8 from (Distel 2011) and Lemma 5 above, we can prove, by induction on the structure of the concept expressions C and D, that there are concept expressions E, F P ΛI such that Bp Iq |ù E mmsc CI, I , Bp Iq |ù F mmsc DI, I , Bp Iq |ù C mmsc CI, I , and Bp Iq |ù D mmsc DI, I . By construction, as E Ď F P Bp Iq, we can prove that whenever I |ù C Ď D, so does Bp Iq.

Recall the interpretation I in Figure 6. In order to compute a base for I, we should compute an MMSC with role depth at least 29. An important beneﬁt of our approach is that the role depth of the other MMSCs, which are part of the mined CIs in the base may be smaller. For instance, the role depth of mmsc ptx1u, Iq is 10. In the next section, we show that one can compute the MVF of a vertex in a graph in linear time in the size of the graph.

5 Computing the MVF As discussed in Section 4, the MVF is the key to provide an upper bound for the ﬁxpoint for each MMSC. An easy way to estimate the MVF would consist in computing the number of vertices reachable from v in the description graph G. Let reachp G, vq be such a function. By deﬁnition it holds that mvfp G, vq ď reachp G, vq. Although reachp G, vq can be computed in polynomial time, the difference between these two metrics can be quite large. For instance, consider that v is the root of a description graph G that is a binary tree with 2n nodes. Then mvfp G, vq n, while reachp G, vq 2n. In this section, we present an algorithm to compute mvfp G, vq that takes linear time in the size of G, but ﬁrst we need to recall some fundamental concepts from Graph Theory, one of them is the notion of strongly connected components (Deﬁnition 7). Deﬁnition 7. Let G p V, E, Lq be a description graph. The strongly connected components (SCCs) of G, in symbols SCCp Gq, are the partitions V1, . . . , Vn of V such that for all 1 ď i ď n: if u, v P Vi then there is a path from u to v and a path from v to u in G. Additionally, we deﬁne a function sccp G, vq, which returns the SCC of G that contains v.

tx1u tx3u tx5u tx6u tx2, x7u tx4u

Figure 7: Condensation of the description graph in Figure 4. Every vertex is an SCC of the original graph and the edges indicate accessibility between the SCCs. Also, the condensation has no labels.

A compact way of representing a description graph G consists in regarding each SCC in G as a single vertex. This compact graph is a directed acyclic graph (DAG), also called condensation of G (Harary, Norman, and Cartwright 1965), and it is formalised in Deﬁnition 8. Deﬁnition 8. Let G p V, E, Lq be a description graph. The condensation of G is the directed acyclic graph G p V , E q where V tsccp G, uq | u P V u and E tpsccp G, uq, sccp G, vqq | pu, r, vq P E and sccp G, uq sccp G, vqu. Also, if w is path in G , the weight of w , in symbols weightp G q, is the sum of the sizes of the SCCs that appear as vertices of w . We use these notions to link the MVF (Deﬁnition 3) to the paths in the condensation graph in Lemma 6. Lemma 6. Let G p V, E, Lq be a description graph, let G p V , E q be the condensation of G, and v P V . Then:

mvfp G, vq max tweightpw q |

w is a path in G starting at sccp G, vqu .

Proof Sketch. First we prove that every path w V1, . . . , Vm in G starting at sccp G, vq induces a walk w in G starting at v with vnumpwq weightpw q. Then, we show that if w has maximal weight, then no walk in G starting at v can visit more than weightpw q vertices.

By Lemma 6, we only need to compute the maximum weight of a path in G that starts at sccp G , vq to obtain the MVF of a vertex v in a description graph G. Algorithm 1 relies on this result and proceeds as follows: ﬁrst, it computes the SCCs of the description graph and the condensation graph. Then, the algorithm transverses the condensation graph, using an adaptation of depth-ﬁrst search to determine the maximum path weight for the initial SCC. Algorithm 1 assumes that the SCCs and condensation are computed correctly. Besides keeping the computed values, the array wgt prevents recursive calls on SCCs that have already been processed. According to Lemma 6, to prove that Algorithm 1 is correct we just need to prove that the function max Weight in fact returns the maximum weight of a path in the condensation given a starting vertex (which corresponds to an SCC in the original graph). Lemma 7. Given G p V, E, Lq and v P V as input, Algorithm 1 returns the maximum weight of a path in the condensation of G starting at sccp G, vq.

Proof Sketch. Let G p V , E q be the condensation of G. If sccp G, vq has no successor in G , then the output of

Algorithm 1: Computing MVF via Lemma 6

Input: A description graph G p V, E, Lq and a vertex v P V Output: The MVF of v in G, i.e., mvfp G, vq

1 V Ð SCCp Gq

2 E Ð condensep G, V q

3 G Ð p V , E q

4 for V 1 P V do

5 wgtr V 1s Ð null

6 return max Weightp G , sccp G, vq, wgtq // Auxiliary function

7 Function max Weightp G , V 1, wgtq:

8 current Ð 0

9 for W 1 P t U 1 P V | p V 1, U 1q P E u do

10 if wgtr W 1s null then

11 current Ð maxpcurrent, max Weightp G , W 1, wgtqq

13 current Ð wgtr W 1s

14 wgtr V 1s Ð current |V 1|

15 return wgtr V 1s

max Weight is correct. If sccp G, vq has successors, then the maximum weight of a path staring at sccp G, vq in G is given by |sccp G, vq| plus the maximum value computed among its successors. This equation holds because G is a DAG.

Lemmas 6 and 7 imply that Algorithm 1 computes the MVF of v in G correctly. Moreover, the computation of SCCs can be done in time Op|V | |E|q (Tarjan 1972), the condensation in time Op|E|q (Martello and Toth 1982) and the depth-ﬁrst transversal via max Weight in time Op|V | |E|q. Hence, it is possible to compute the MVF of a vertex in a graph in linear time in the size of the description graph even if it consists solely of cycles. Yet, given an interpretation I p I, Iq the graph given as input to Algorithm 1 might be a product graph with an exponential number of vertices in | I|. Also, Algorithm 1 can be modiﬁed to compute the MVF for all vertices by starting the function max Weight from an unvisited SCC until all vertices are visited in polynomial time in the size of the graph.

6 Conclusion In this work, we introduce a way of computing ELK bases from ﬁnite interpretations that adapts the role depth of concepts according to the the structure of interpretations. Our deﬁnition relies on a notion that relates vertices in a graph to sets of vertices, called MVF. We have also shown that the MVF computation can be performed in polynomial time in the size of the underlying graph structure. Our ELK base, however, is not minimal. As future work, we plan to build on previous results combining FCA and DLs to deﬁne a base with minimal cardinality. We will also investigate the problem of mining CIs in the presence of noise in the dataset. We plan to use the support and conﬁdence measures from association rule mining to deal with noisy data and implement our approach using knowledge graphs as datasets.

Acknowledgements Parts of this work have been done in the context of CEDAS (Center for Data Science, University of Bergen, Norway). This work was supported by the Free University of Bozen Bolzano, Italy, under the project PACO.

References Baader, F. 1995. Computing a Minimal Representation of the Subsumption Lattice of All Conjunctions of Concepts Deﬁned in a Terminology. In Proc. Intl. KRUSE Symposium, 168 178. Baader, F. 2003. Terminological Cycles in a Description Logic with Existential Restrictions. In IJCAI, 325 330. Morgan Kaufmann. Baader, F.; Brandt, S.; and Lutz, C. 2005. Pushing the EL Envelope. In Kaelbling, L. P.; and Safﬁotti, A., eds., IJCAI, 364 369. Professional Book Center. Baader, F.; and Distel, F. 2008. A Finite Basis for the Set of EL-Implications Holding in a Finite Model. In Medina, R.; and Obiedkov, S., eds., ICFCA 2008, 46 61. Springer Verlag. Baader, F.; Ganter, B.; Sertkaya, B.; and Sattler, U. 2007. Completing Description Logic Knowledge Bases using Formal Concept Analysis. In Veloso, M. M., ed., IJCAI, 230 235. AAAI Press. Baader, F.; Horrocks, I.; Lutz, C.; and Sattler, U. 2017. An Introduction to Description Logic. Cambridge University Press. Baader, F.; and Molitor, R. 2000. Building and Structuring Description Logic Knowledge Bases Using Least Common Subsumers and Concept Analysis. In ICCS, 292 305. Springer. Borchmann, D. 2014. Learning Terminological Knowledge with High Conﬁdence from Erroneous Data. Ph.D. thesis, Dresden University of Technology. Borchmann, D.; and Distel, F. 2011. Mining of EL-GCIs. In Spiliopoulou, M.; Wang, H.; Cook, D. J.; Pei, J.; Wang, W.; Zaïane, O. R.; and Wu, X., eds., Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, 1083 1090. IEEE Computer Society. Borchmann, D.; Distel, F.; and Kriegel, F. 2016. Axiomatisation of general concept inclusions from ﬁnite interpretations. Journal of Applied Non-Classical Logics 26(1): 1 46. Distel, F. 2011. Learning description logic knowledge bases from data using methods from formal concept analysis. Ph.D. thesis, Dresden University of Technology. Ganter, B.; and Wille, R. 1999. Formal Concept Analysis: Mathematical Foundations. Berlin/Heidelberg: Springer. Guimarães, R.; Ozaki, A.; Persia, C.; and Sertkaya, B. 2021. Mining ELK Bases with Adaptable Role Depth. ar Xiv eprints ar Xiv:2102.10689. Harary, F.; Norman, R. Z.; and Cartwright, D. 1965. Structural models: an introduction to the theory of directed graphs. New York: John Wiley & Sons. ISBN 9780471351306.

Kriegel, F. 2019a. Constructing and Extending Description Logic Ontologies using Methods of Formal Concept Analysis. Ph.D. thesis, Technische Universität Dresden, Dresden, Germany. Kriegel, F. 2019b. Learning Description Logic Axioms from Discrete Probability Distributions over Description Graphs. In Calimeri, F.; Leone, N.; and Manna, M., eds., JELIA 2019, 399 417. Springer. Kuznetsov, S. O. 2004. On the Intractability of Computing the Duquenne-Guigues Base. J. UCS 10(8): 927 933. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; and Bizer, C. 2015. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2): 167 195. Martello, S.; and Toth, P. 1982. Finding a minimum equivalent graph of a digraph. Networks 12(2): 89 100. Monnin, P.; Lezoche, M.; Napoli, A.; and Coulet, A. 2017. Using Formal Concept Analysis for Checking the Structure of an Ontology in LOD: The Example of DBpedia. In ISMIS, 674 683. Springer. Ozaki, A. 2020. Learning Description Logic Ontologies: Five Approaches. Where Do They Stand? KI - Künstliche Intelligenz . Rudolph, S. 2004. Exploring Relational Structures Via FLE. In Wolff, K. E.; Pfeiffer, H. D.; and Delugach, H. S., eds., ICCS, 196 212. Springer-Verlag. Rudolph, S. 2006. Relational exploration: Combining Description Logics and Formal Concept Analysis for knowledge speciﬁcation. Ph.D. thesis, Fakultät Mathematik und Naturwissenschaften, TU Dresden, Germany. Sertkaya, B. 2010. A Survey on how Description Logic Ontologies Beneﬁt from Formal Concept Analysis. In Kryszkiewicz, M.; and Obiedkov, S., eds., CLA, volume 672 of CEUR Workshop Proceedings, 2 21. Spackman, K.; Campbell, K.; and Cote, R. 1997. SNOMED RT: A reference terminology for health care. J. American Medical Informatics Association 640 644. Fall Symposium Supplement. Tarjan, R. 1972. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing 1(2): 146 160.