End-to-End Ontology Learning with Large Language Models
Andy Lo University of Cambridge cyal4@cam.ac.uk
Albert Q. Jiang University of Cambridge qj213@cam.ac.uk
Wenda Li University of Edinburgh wenda.li@ed.ac.uk
Mateja Jamnik University of Cambridge mateja.jamnik@cl.cam.ac.uk
Abstract

Ontologies are useful for automatic machine processing of domain knowledge as they represent it in a structured format. Yet, constructing ontologies requires substantial manual effort. To automate part of this process, large language models (LLMs) have been applied to solve various subtasks of ontology learning. However, this partial ontology learning does not capture the interactions between subtasks. We address this gap by introducing OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch. Rather than focusing on subtasks, like individual relations between entities, we model entire subcomponents of the target ontology by finetuning an LLM with a custom regulariser that reduces overfitting on high-frequency concepts. We introduce a novel suite of metrics for evaluating the quality of the generated ontology by measuring its semantic and structural similarity to the ground truth. In contrast to standard syntax-based metrics, our metrics use deep learning techniques to define more robust distance measures between graphs. Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples. Our source code and datasets are available at https://github.com/andylolu2/ollm.
1 Introduction
An ontology is a formal and structural way of representing domain-specific concepts and their relations [16]. They can be simple (e.g., Wikipedia categories), consisting of concepts and only a small number of types of taxonomic relations (e.g., is-a relationships), or they can be complex (e.g., Schema.org), consisting of axioms or many types of relations. For example, a simple ontology for programming languages might contain two concepts, Dynamically-typed language and Python, and one relation, Dynamically-typed language → Python, representing the knowledge that Python is a dynamically-typed language. A more complex ontology might contain axioms too, for example, all programming languages are either dynamically or statically typed. In this paper, we focus on ontologies with only concepts and taxonomic relations. Compared to typical deep learning models, which represent knowledge implicitly in their weights, ontologies capture knowledge in a structured and explicit manner, making them reliable, easy to edit and human-interpretable. Such benefits of ontologies have led to their wide adoption in practice. For example, Wikipedia categories have been
Figure 1: OLLM: Using annotations of documents with their relevant concepts, we train an LLM to model relevant subgraphs of the target ontology with a custom regulariser. During inference, the generated subgraphs for each document are summed and pruned to give the final output ontology. For evaluation, we measure the similarity between the generated ontology and the ground truth.
used for entity ranking [46] and information retrieval [42], and Schema.org [40] is a core component of the Semantic Web [1] initiative.
While ontologies are useful, building ontologies often requires substantial manual effort. Ontology learning (OL) is the study of automating the construction of high-quality ontologies at scale. For a simple ontology, this amounts to discovering the concepts and taxonomic relations, usually based on a source corpus. In this paper we aim to develop domain-independent methods for OL that are scalable and produce better ontologies.
Traditionally, OL is viewed as a composition of subtasks [3], such as concept discovery and relation extraction. In particular, prior works have demonstrated that state-of-the-art large language models (LLMs) can solve such subtasks effectively [4]. While studying subtasks permits fine-grained analysis and evaluation, it does not directly indicate the subsequent impact on the quality of the final ontology. Moreover, there is potential room for improvement by combining several subtasks into one, such as by modelling concepts and relations in conjunction. In this paper, we instead develop and evaluate methods that construct ontologies in an end-to-end fashion to answer the following research questions:
1. How can we leverage LLMs' knowledge base to build ontologies from scratch?
2. Does our method scale efficiently to practical problem sizes?
3. How well does our method generalise to new domains?
We introduce OLLM, an end-to-end method for using LLMs to construct ontologies at scale. Rather than focusing on individual relations between concepts, we finetune an LLM to model entire subcomponents of the target ontology. The output ontology is generated by taking the sum of generated sub-components and applying simple post-processing. An overview of the pipeline is shown in Figure 1. To train OLLM, we collect the categorisation metadata for a subset of Wikipedia articles. We attempt to adapt an LLM to model the relevant categorisation subgraph for a particular Wikipedia article, but discover that direct finetuning leads to poor generalisation due to overfitting to high-level, frequently occurring concepts. Instead, we propose a custom regulariser that reweights each concept based on its frequency of occurrence, which substantially improves generalisation.
We evaluate OLLM by measuring the similarity of the generated ontology with the ground truth. Current approaches for comparing ontologies rely on mapping components of the two ontologies onto each other, most commonly by literal text matching [30, 45]. This is unreliable when the two ontologies are not already sufficiently similar. Instead, we propose a suite of evaluation metrics suitable for comparing arbitrary labelled graphs. These metrics compare edges and subgraphs of the two ontologies using pretrained text embedders to test for semantic and structural similarity. Both our quantitative and qualitative results reveal that an LLM can already outperform existing
extraction-based methods out of the box, and the performance is further improved by finetuning with our custom regulariser. We additionally demonstrate that OLLM can be adapted to build the arXiv ontology using only a small number of training examples, suggesting that our model can be applied to new domains in a data-efficient way. In summary, our contributions are:
1. We constructed two datasets based on Wikipedia and arXiv, which can serve as standard datasets for future work studying end-to-end OL.
2. We created OLLM, a method that utilises LLMs to build ontologies from scratch. OLLM produces high-quality ontologies and serves as a strong baseline for end-to-end OL.
3. We developed new evaluation metrics for assessing the quality of the generated ontologies.
2 Background
An ontology is a structured way of representing concepts and relations of a shared conceptualisation, that is, domain knowledge [15, 16]. Ontologies can span a wide range of complexities. A fully-fledged ontology might contain concepts, relations, constraints, and axioms that enable complex automated reasoning. In this paper, we focus on the core building blocks of an ontology: concepts and taxonomic relations which represent is-a or is-subclass-of relationships between concepts. In some cases, the is-part-of relation is also considered a taxonomic relation. We treat such an ontology as a rooted labelled directed graph where nodes represent concepts, edges represent taxonomic relations and the root node is the special concept of all concepts. A strict ontology asserts that the taxonomic relation is asymmetric and thus the graph must be acyclic, though in practice some ontologies, such as the Wikipedia ontology studied in this paper, may contain cycles. We therefore do not assume that an ontology graph is necessarily acyclic. Examples of ontologies include WordNet [33] with 117,659 concepts and 89,089 taxonomic relations, and the Gene Ontology [2] with 42,255 concepts and 66,810 taxonomic relations.
Ontology learning is the automatic extraction of ontological elements [17]. The most studied source of input is unstructured text, though there are also works on semi-structured data like HTML [22]. In this paper, the input is a set of documents, each consisting of some unstructured text. We additionally assume each document is associated with one or more concepts in the ground truth ontology, which we utilise for training. The goal is to reconstruct the ground truth ontology given the set of documents.
Prior works view OL as a composition of subtasks, and study each subtask in isolation [3, 6]. A typical pipeline for building a simple ontology is to first perform concept discovery (identify the nodes), and then relation extraction (identify the edges) [8, 24]. A notable approach for relation extraction is Hearst patterns [18]. Hearst patterns are hand-crafted lexico-syntactic patterns that exploit natural language structure to discover taxonomic relations. For example, the pattern [noun phrase] such as [noun phrase] matches phrases like "dogs such as chihuahuas", and thus can be processed by regular expressions to identify the relation dog → chihuahua. Hearst patterns suffer from low recall, as the relations must occur in exact configurations to be identified by the rules. Roller et al. [39] suggest smoothing techniques to alleviate this issue, though at the cost of lower precision.
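To make the idea concrete, below is a minimal, self-contained sketch of Hearst-pattern extraction using a single simplified pattern. The actual baseline applies 28 patterns over part-of-speech-tagged, lemmatised text via CoreNLP; the pattern and the function name `extract_isa` here are purely illustrative.

```python
import re

# One simplified Hearst pattern: "<hypernym> such as <hyponym>".
# This toy version matches single-word nouns only; the real pipeline operates
# on tagged noun phrases and many more patterns.
PATTERN = re.compile(r"(\w+) such as (\w+)")

def extract_isa(text: str) -> list[tuple[str, str]]:
    """Return (hypernym, hyponym) pairs, i.e. taxonomic edges hypernym -> hyponym."""
    return PATTERN.findall(text)

print(extract_isa("dogs such as chihuahuas are popular pets"))
# [('dogs', 'chihuahuas')]
```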
Recently, language models have been used for OL. REBEL [7] treats relation discovery as a translation task, and finetunes encoder-decoder LLMs to extract both taxonomic and non-taxonomic relations. Babaei Giglou et al. [4] benchmarked a wide family of LLMs for concept and relation discovery, and showed promising results. However, the quadratic complexity of link prediction makes this approach unscalable to large ontologies. We provide more discussion in Appendix A.2.3. There are also proof-of-concept works for building ontologies end-to-end with LLMs. Funk et al. [13] propose to build an ontology by recursively prompting LLMs, while Trajanoska et al. [44] generate the entire ontology in one completion. However, both studies are limited in the scale of the task and evaluation: they only considered ontologies of up to 1000 concepts and relied on manual qualitative evaluation. We bridge this gap by proposing a method that can scale to practical problem sizes and new metrics for systematic quantitative evaluation.
The evaluation of ontologies is an open research area. The main approaches are gold standard evaluation [51], which matches elements of the generated ontology with a predefined target ontology; task-based evaluation [36], which measures the usefulness of the ontology on a specific application; and human evaluation [5, 37]. In this paper, we evaluate by the gold standard metric as it is the most straightforward approach when a ground-truth ontology exists. Prior works have considered
[INST] Title: Hybridity
Hybridity, in its most basic sense ... [/INST]
Main topic classifications -> Human behavior -> Human activities -> Culture -> Sociology of culture
Main topic classifications -> Humanities -> Politics -> Politics by issue -> Politics and race
Main topic classifications -> Politics -> Politics by issue -> Politics and race
Main topic classifications -> Culture -> Sociology of culture

Figure 2: Example subgraph induced for the Wikipedia page Hybridity (left), where N = 4 and C = {Politics and race, Sociology of culture}. The corresponding training text sequence (right), where text coloured in grey is ignored as training targets, but is still present as context for later tokens.
matching concepts [30] and direct or indirect relations [23, 45] by literal text comparison. Others have also considered edit-distance [12] or bag-of-words distributional similarity for text comparison [51]. These techniques for measuring semantic similarity may be considered unreliable and have been superseded by current methods [9]. We instead rely on more modern techniques like pretrained text embedders [10] and graph convolutions [26] to match substructures between the two ontologies.
3 OLLM

We now introduce OLLM, our novel, simple and scalable method for end-to-end OL with LLMs. On a high level, OLLM uses an LLM to model concept subgraphs of the target ontology by utilising a linearisation scheme to transform subgraphs into string sequences. In contrast to learning individual edges, modelling subgraphs allows the model to learn higher-order structures, such as the interactions between three or more nodes. To create the training dataset, OLLM relies on the annotations of documents to concepts to generate document-subgraph pairings. Such subgraphs are much smaller than the complete graph, so they can be learned by the model more easily. The generated subgraphs for each document are summed into a weighted graph, and simple post-processing is applied to obtain the final predicted ontology.
3.1 Subgraph modelling
Here, we describe the method for creating document-subgraph pairings. Given a document and its associated set of concepts C, we define the relevant paths as the set of paths of at most length N from the root to any of the concepts in C. The relevant subgraph is the set of nodes (concepts) and edges (taxonomic relations) that occur at least once in the relevant paths. An example is shown in Figure 2 (left). The choice of N is task-specific and we describe our method for choosing N in Section 5.1.
To employ LLMs to model the subgraphs, we must linearise the graph into a string sequence. Existing methods for autoregressive graph generation employ BFS [50] or DFS [14] ordering starting at an arbitrary node. We instead choose to linearise the subgraph as a list of relevant paths that produced the subgraph in the first place. We do so over BFS/DFS ordering for three reasons: 1) the subgraph is defined from the relevant paths, which makes them the most natural representation; 2) we hypothesise that the hierarchy of concepts in each path is a desirable inductive bias for the hierarchical nature of an ontology; and 3) the path-based representation is much easier to describe in natural language instructions so that our LLM prompting-based baselines may produce reasonable results without finetuning. The linearisation template can be found in Figure 5 in Appendix A.1.2.
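As an illustration of the subgraph construction and linearisation just described, here is a minimal sketch using networkx. The helper names `relevant_paths` and `linearise`, and the use of simple paths with a cutoff, are assumptions for the purpose of illustration rather than the released implementation.

```python
import networkx as nx

def relevant_paths(G: nx.DiGraph, root, concepts, N: int):
    """All root-to-concept simple paths with at most N edges (the 'relevant paths')."""
    paths = []
    for c in concepts:
        # cutoff=N bounds the number of edges per path.
        paths.extend(nx.all_simple_paths(G, source=root, target=c, cutoff=N))
    return paths

def linearise(paths) -> str:
    """Serialise the relevant subgraph as one path per line, 'a -> b -> c'."""
    return "\n".join(" -> ".join(p) for p in paths)

# Toy ontology: root -> Culture -> Sociology of culture
G = nx.DiGraph()
G.add_edges_from([
    ("Main topic classifications", "Culture"),
    ("Culture", "Sociology of culture"),
])
paths = relevant_paths(G, "Main topic classifications", ["Sociology of culture"], N=4)
print(linearise(paths))
# Main topic classifications -> Culture -> Sociology of culture
```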
3.2 Post-processing
The final output graph is obtained by summing all generated subgraphs for each document and pruning low-weighted components. Given the generated subgraphs $G_1 = (V_1, E_1), \ldots, G_n = (V_n, E_n)$, the raw output graph is defined as $G_{raw} = (V_{raw}, E_{raw})$, where $V_{raw} = \bigcup_{i=1}^{n} V_i$ and $E_{raw} = \bigcup_{i=1}^{n} E_i$. Each edge $(u, v) \in E_{raw}$ is additionally weighted by the number of times it occurs in the collection of subgraphs: $w(u, v) = \sum_{i=1}^{n} \mathbb{1}[(u, v) \in E_i]$. A few simple post-processing steps are then applied to $G_{raw}$ in order to prune it:
1. Self-loop pruning: All edges $(u, u) \in E_{raw}$ are removed.
2. Inverse-edge pruning: For $(u, v) \in E_{raw}$, if $(v, u) \in E_{raw}$ and $w(v, u) > w(u, v)$, remove $(u, v)$. That is, bidirectional edges are turned into unidirectional ones.
3. Absolute thresholding: Edges in $E_{raw}$ with weight below the $\alpha$-th quantile are removed, where $0 \le \alpha \le 1$ is a hyperparameter. This removes edges that are globally less important.
4. Relative thresholding: For each vertex $u \in V_{raw}$, let $e_1, \ldots, e_k$ be the outgoing edges from $u$, sorted by weight in ascending order. Let the cumulative weight be $C(e_i) = \sum_{j=1}^{i} w(e_j) / \sum_{j=1}^{k} w(e_j)$. The edges $\{e_i \mid C(e_i) \le \beta\}$ are pruned, where $0 \le \beta \le 1$ is a hyperparameter. This is similar to top-p sampling [19], which we use to remove edges that are less important than their neighbours.
5. Clean up: After pruning all edges, nodes with no incoming or outgoing edges are removed.
We choose the hyperparameters α and β by tuning on the validation set (Section 5.1).
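The summing and pruning steps above can be sketched as follows. This is an illustrative reimplementation with networkx, not the released code: `build_ontology` is a hypothetical helper and the quantile computation is simplified.

```python
from collections import Counter
import networkx as nx

def build_ontology(subgraphs, alpha, beta):
    """Sum per-document subgraphs into a weighted graph and prune it (Section 3.2)."""
    # Sum: an edge's weight is the number of generated subgraphs it appears in.
    w = Counter(e for g in subgraphs for e in set(g.edges()))
    # 1. Self-loop pruning.
    w = {(u, v): c for (u, v), c in w.items() if u != v}
    # 2. Inverse-edge pruning: drop (u, v) if the reverse edge is strictly heavier.
    w = {(u, v): c for (u, v), c in w.items() if w.get((v, u), 0) <= c}
    # 3. Absolute thresholding: drop edges below the alpha-th weight quantile.
    cutoff = sorted(w.values())[int(alpha * (len(w) - 1))]
    w = {e: c for e, c in w.items() if c >= cutoff}
    G = nx.DiGraph()
    G.add_weighted_edges_from((u, v, c) for (u, v), c in w.items())
    # 4. Relative thresholding: per node, drop the lightest outgoing edges whose
    #    cumulative weight fraction is at most beta (cf. top-p sampling).
    for u in list(G.nodes()):
        out = sorted(G.out_edges(u, data="weight"), key=lambda e: e[2])
        total = sum(c for _, _, c in out)
        cum = 0.0
        for _, v, c in out:
            cum += c / total
            if cum <= beta:
                G.remove_edge(u, v)
    # 5. Clean up: remove nodes left without any edges.
    G.remove_nodes_from(list(nx.isolates(G)))
    return G
```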
4 Evaluating end-to-end OL
Ontology evaluation is a hard problem as there are no quantitative definitions of what constitutes a "good ontology", and metrics generally only capture one aspect (e.g., structure but not semantics) of an ontology. We approach evaluation by treating the ground truth as a proxy for a good ontology, and comparing the generated ontologies against the ground truth. Here, we describe how the ground truth is obtained, and introduce new evaluation metrics that are used for measuring ontology similarity.
4.1 Dataset
We collect the datasets for the two ontologies considered in this paper: Wikipedia categories and the arXiv taxonomy. We use Wikipedia for learning and in-domain evaluation, and arXiv for out-of-domain evaluation. To build the Wikipedia dataset, we perform a BFS traversal from its root category Main topic classifications up to depth 3. For every category encountered, we retrieve the titles and summaries (the text before the first section) of up to 5000 pages that belong to that category. The source data is obtained from the Wikipedia API (https://en.wikipedia.org/w/api.php). The arXiv taxonomy is available from its home page, and the source corpus is constructed from the title and abstract of all the papers uploaded to arXiv in the years 2020–2022 with at least 10 citations (citation counts obtained from https://api.semanticscholar.org/). In total, the Wikipedia dataset has 13,886 concepts, 28,375 taxonomic relations and 362,067 documents, while the arXiv dataset has 161 concepts, 166 taxonomic relations and 126,001 documents.
Figure 3: Intersection of concepts among the train, validation and test splits of the datasets. (a) Wikipedia; (b) arXiv.
Generating the train and test splits from the datasets is a non-trivial problem. Each training example consists of a document and its relevant subgraph (Section 3.1). The naive approach of randomly selecting a subset of document-subgraph pairs for training likely leads to data leakage, as there might be a significant overlap between subgraphs in the training set and the test set. Instead, we first split the full ontology into train and test graphs, and then generate the training document-subgraph pairs. This ensures that there are sufficiently many unseen concepts (and thus relations) in the test split, as shown in Figure 3. Our method is as follows (a code sketch is given after the list):
1. Let $V^{top}$ be the set of top-level nodes, that is, children of the root node. Randomly partition $V^{top}$ into train $V^{top}_{train}$, validation $V^{top}_{val}$, and test $V^{top}_{test}$ splits in a 7:3:10 ratio.
2. Let $d$ be the depth of the full graph, that is, the distance of the furthest node from the root. The nodes of the train graph are taken as the union of all the nodes that are within distance $d - 1$ from any node in $V^{top}_{train}$, plus $V^{top}_{train}$ and the root. The edges are all the edges in the full graph that have both endpoints in the train graph. The same applies for $V^{top}_{val}$ and $V^{top}_{test}$.
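A minimal sketch of this splitting procedure, assuming the ontology is a networkx DiGraph. The helper name `split_ontology`, the rounding of the 7:3:10 ratio, and the handling of unreachable nodes are assumptions and may differ from the released code.

```python
import random
import networkx as nx

def split_ontology(G: nx.DiGraph, root, seed=0):
    """Split an ontology into train/val/test subgraphs via its top-level concepts."""
    top = sorted(G.successors(root))
    random.Random(seed).shuffle(top)
    # Partition the top-level nodes in a 7:3:10 ratio.
    n = len(top)
    a, b = round(n * 7 / 20), round(n * 10 / 20)
    parts = {"train": top[:a], "val": top[a:b], "test": top[b:]}
    # d = distance of the furthest (reachable) node from the root.
    d = max(nx.single_source_shortest_path_length(G, root).values())
    splits = {}
    for name, tops in parts.items():
        keep = {root, *tops}
        for t in tops:
            # Nodes within distance d - 1 of a selected top-level node.
            keep |= set(nx.single_source_shortest_path_length(G, t, cutoff=d - 1))
        # Edges are those of the full graph with both endpoints kept.
        splits[name] = G.subgraph(keep).copy()
    return splits
```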
4.2 Metrics
Many existing methods for comparing ontologies rely on syntactic measures like string edit distance [12] as a proxy for semantic similarity, or require every concept to be tagged with descriptions or documents for distributional semantics comparison [51]. To obtain more robust and general evaluation results, we introduce a suite of similarity metrics that use modern methods like text embeddings [38]. Multiple metrics are used as they trade off between interpretability and comprehensiveness, and we aim to make them complementary by capturing different aspects of an ontology. For example, comparing ontologies by literal text equality is easy to understand but may be unreliable. In Section 5.4, we provide further discussion on evaluation metrics in the context of our experiment results. We denote the ground truth ontology graph as $G = (V, E)$ and the generated graph as $G' = (V', E')$.
Literal F1 While literal text matching is unreliable, it is also the simplest and the most interpretable. We treat this metric as a reference metric for sanity check. The Literal F1 metric [23] is given by the harmonic mean of the precision and recall of the edges:
$$\text{Literal precision} = \frac{|E \cap E'|}{|E'|} \qquad \text{Literal recall} = \frac{|E \cap E'|}{|E|}$$
Fuzzy F1 The Literal F1 metric puts a strong emphasis on using the correct wording, while in practice, we are interested in evaluating the semantics of an ontology. For example, using a synonymous phrase for a concept should not be penalised. We utilise embeddings from a pretrained sentence transformer [38] and use the cosine similarity of the embeddings to measure semantic similarity. Specifically, let $\text{NodeSim}(u, u') : V \times V' \to [-1, 1]$ be the cosine similarity between the sentence embeddings for $u$ and $u'$. The Fuzzy F1 score is obtained from the fuzzy precision and recall, defined as:
$$\text{Fuzzy precision} = \frac{|\{(u', v') \in E' \mid \exists (u, v) \in E.\ \text{NodeSim}(u, u') > t \wedge \text{NodeSim}(v, v') > t\}|}{|E'|}$$
$$\text{Fuzzy recall} = \frac{|\{(u, v) \in E \mid \exists (u', v') \in E'.\ \text{NodeSim}(u, u') > t \wedge \text{NodeSim}(v, v') > t\}|}{|E|}$$
where $t$ is the matching threshold. We use all-MiniLM-L6-v2 [38, 47] as the embedding model, and choose $t$ as the median cosine similarity between the synonyms in WordNet [33], computed as 0.436.
Continuous F1 With fuzzy comparisons, the matches between the edges of the generated and the ground truth graph are no longer one-to-one. This is problematic: consider the two graphs $A \to B$ and $B \leftarrow A \to B'$, where $B$ and $B'$ match fuzzily. Such graphs will achieve a perfect Fuzzy F1 score yet they differ significantly. Additionally, we found that the previous metrics fail to provide a useful signal for hyperparameter tuning, particularly for our baselines where the generated graphs are poor. The Continuous F1 metric solves these issues by computing the highest-scoring edge matching between the two graphs, where the similarity score between $(u, v)$ and $(u', v')$ is given by $\min(\text{NodeSim}(u, u'), \text{NodeSim}(v, v'))$. Obtaining such a matching is equivalent to solving the linear assignment problem [32], which can be computed by the Hungarian algorithm [27]. The Continuous F1 score is obtained from the continuous precision and recall, given by:
$$\text{Continuous precision} = \frac{s_{cont}}{|E'|} \qquad \text{Continuous recall} = \frac{s_{cont}}{|E|}$$
where $s_{cont}$ is the score achieved by the best edge matching.
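For concreteness, below is a minimal sketch of how the continuous precision and recall could be computed with a sentence transformer and SciPy's Hungarian-algorithm implementation. It is simplified (for example, it always matches min(|E|, |E'|) edge pairs) and is not the exact evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

def continuous_scores(edges_true, edges_pred):
    """Continuous precision/recall via an optimal one-to-one edge matching."""
    def embed(names):
        # Normalised embeddings so that a dot product equals cosine similarity.
        return _model.encode(list(names), normalize_embeddings=True)

    u, v = zip(*edges_true)      # heads and tails of ground-truth edges
    up, vp = zip(*edges_pred)    # heads and tails of predicted edges
    sim_heads = embed(u) @ embed(up).T        # |E| x |E'|
    sim_tails = embed(v) @ embed(vp).T
    score = np.minimum(sim_heads, sim_tails)  # similarity between edge pairs

    rows, cols = linear_sum_assignment(score, maximize=True)  # Hungarian algorithm
    s_cont = score[rows, cols].sum()
    return s_cont / len(edges_pred), s_cont / len(edges_true)  # precision, recall
```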
Figure 4: Per token loss on a test set example of the final model trained with and without the custom masked loss objective: (a) direct finetuning; (b) finetuning with masked loss. A stronger red colour represents a higher cross-entropy loss. Within the top-level concepts (children of the root) shown here, Culture and Humanities are in the training set while others are not. Using the masked loss objective improves generalisation on the high-level relations (e.g., Main topic classifications → Academic disciplines) while maintaining performance on lower-level relations.
Graph F1 Instead of individual edges, this metric aims to capture the wider structure of the two graphs. Intuitively, we want to know how concepts are related to their local neighbourhood. We do so by using simple graph convolutions [49] with K = 2 to compute graph-aware node embeddings after embedding each node with the pretrained embedder. Such embeddings in $G$ are compared against those in $G'$ by cosine similarity, and the highest-scoring node matching, similar to the Continuous F1 metric, gives the graph similarity score. The Graph F1 score is computed from the graph precision and recall, defined as:
$$\text{Graph precision} = \frac{s_{graph}}{|V'|} \qquad \text{Graph recall} = \frac{s_{graph}}{|V|}$$
where $s_{graph}$ is the score achieved by the best node matching.
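A minimal sketch of the graph-aware node embeddings, assuming a dictionary of precomputed sentence embeddings per node. The row-normalised undirected adjacency with self-loops is one plausible instantiation of simple graph convolutions; the normalisation details of the paper's implementation may differ.

```python
import numpy as np
import networkx as nx

def graph_aware_embeddings(G: nx.DiGraph, node_emb: dict, K: int = 2) -> dict:
    """Propagate pretrained node embeddings over the graph for K steps,
    with no nonlinearity, in the spirit of simple graph convolutions [49]."""
    nodes = list(G.nodes())
    X = np.stack([node_emb[n] for n in nodes])
    # Undirected adjacency with self-loops, row-normalised.
    A = nx.to_numpy_array(G.to_undirected(), nodelist=nodes) + np.eye(len(nodes))
    A /= A.sum(axis=1, keepdims=True)
    for _ in range(K):
        X = A @ X
    return dict(zip(nodes, X))
```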
Motif distance Taking inspiration from classical network analysis, we use network motifs [34, 41] to evaluate the structural integrity of the generated graphs. Network motifs are reoccurring subgraphs in a larger graph, most commonly 3-vertex subgraphs. They are typically indicative of the structural characteristics of the full graph. We define the motif distance as the total variation distance between the distributions of all 3-vertex subgraphs in $G$ and $G'$.
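The motif distance can be sketched with networkx's triadic census, which counts all 16 directed 3-vertex subgraph types. Whether disconnected triads are included in the paper's distribution is an implementation detail not specified here, so this is an assumption-laden illustration rather than the exact metric.

```python
import networkx as nx

def motif_distance(G: nx.DiGraph, H: nx.DiGraph) -> float:
    """Total variation distance between 3-vertex subgraph (triad) distributions."""
    def triad_distribution(graph):
        census = nx.triadic_census(graph)   # counts of the 16 directed triad types
        total = sum(census.values())
        return {k: v / total for k, v in census.items()}
    p, q = triad_distribution(G), triad_distribution(H)
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)
```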
5 Experiments
We design our experiments to answer the following research questions:
1. Does OLLM produce better ontologies than traditional methods by subtask composition?
2. Can OLLM be easily adapted to a new domain?
We address these questions by training OLLM on the Wikipedia dataset, and further transferring the model to arXiv with a small number of arXiv samples. As baselines, we use two relation extraction methods, Hearst patterns [18, 39] and REBEL [7]. Relation extraction depends on successful concept discovery to produce high-quality ontologies. To estimate a ceiling for such baselines, we give them a substantial advantage by providing them with the ground truth concepts in the test graph. The results show that even with such an advantage, OLLM outperforms the baselines on many metrics, demonstrating the potential of OLLM for end-to-end OL (Section 5.3).
5.1 Implementation details
Analysing the per-token loss on the test split sequences of a directly finetuned model (Section 3.1) shows that the model tends to memorise high-level relations from the training set, leading to poor generalisation, as shown in Figure 4 (top). The crux of the problem is that low-level relations are substantially more diverse than high-level ones: since we present both types of relations at the same rate to the model, it tends to overfit on high-level relations while underfitting on low-level ones. To alleviate this issue, we introduce a new training objective that randomly masks the loss contribution of frequently occurring relations. Suppose a relation $u \to v$ is present $n$ times in the training set. During training, when $u \to v$ appears in one of the relevant paths, we mask the loss contribution of the tokens for $v$ with probability $\max(1 - M/n, 0)$, where $M$ is a constant for the average number of times a relation is present in the training set. Intuitively, this regulariser ensures that frequent relations are only seen $M$ times as targets throughout training, hence reducing overfitting as shown in Figure 4 (bottom). Note that while $v$ is masked from the target, its tokens are still present in the input sequence as context for later tokens. A concrete training example can be found in Figure 2 (right).
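A minimal sketch of the masking decision at the level of edges in the linearised paths. The helper name `target_keep_mask` is illustrative, `relation_counts` is assumed to be computed from the training set, and the actual implementation applies this when constructing the token-level label mask.

```python
import random

def target_keep_mask(paths, relation_counts, M, rng=random.Random(0)):
    """Decide, for each edge u -> v in the linearised paths, whether v's tokens
    are kept as training targets. A relation seen n times in the training set is
    masked with probability max(1 - M/n, 0), so it is expected to act as a target
    roughly M times overall."""
    keep = []
    for path in paths:
        flags = []
        for u, v in zip(path, path[1:]):
            n = relation_counts[(u, v)]       # occurrences in the training set
            p_mask = max(1.0 - M / n, 0.0)
            flags.append(rng.random() >= p_mask)  # True: v contributes to the loss
        keep.append(flags)
    return keep
```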
We finetune Mistral 7B v0.2 [21] with Low-Rank Adaptation [20] on the masked loss objective. The model is trained on the Wikipedia dataset for two epochs with Adam [25]. During inference, the outputs are generated with temperature 0.1 and nucleus sampling [19] top-p of 0.9. We include a finetuning baseline without the masked loss objective, denoted as Finetune. To adapt OLLM for arXiv, we further finetune the model on 2048 document-subgraph pairs from arXiv. We initialise new low-rank adaptors and train until the loss stops improving on the validation set. We name these models OLLM (transfer) and Finetune (transfer) for training with and without the masked loss objective, respectively. Full details for the Wikipedia and arXiv experiments can be found in Appendix A.1.2.
The hyperparameters for the post-processing steps are tuned by grid search on the validation set. We sweep over $\alpha \in 1 - \text{geomspace}(1/|E_{raw}|, 1, 21)$ and $\beta \in \text{geomspace}(0.1, 1, 21) - 0.1$, and use the values that maximise Continuous F1. For Wikipedia, we choose the subgraph modelling path length N = 4 as it is the smallest N such that almost all edges (> 99%) occur in at least one relevant subgraph. This criterion is used because smaller N results in smaller subgraphs, which we expect to be easier to model accurately. We choose N = 3 for arXiv for the same reason.
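For reference, the two grids can be written down directly with NumPy; the edge count below is a placeholder, not a value from the paper.

```python
import numpy as np

n_raw_edges = 10_000                               # assumption: |E_raw| of the summed graph
alphas = 1 - np.geomspace(1 / n_raw_edges, 1, 21)  # dense near 1, includes alpha = 0
betas = np.geomspace(0.1, 1, 21) - 0.1             # dense near 0, includes beta = 0
grid = [(a, b) for a in alphas for b in betas]     # 441 configurations, scored by Continuous F1
```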
5.2 Baselines
We give a brief overview of the baseline methods here (in addition to Finetune and Finetune (transfer)). The full implementation details can be found in Appendix A.1. All baselines produce weighted directed graphs, to which we apply the same post-processing steps as in OLLM (Section 3.2) to obtain the final predicted graph.
Memorisation Simply memorising the train graph is a surprisingly strong baseline due to the overlap between train and test graphs, especially for Wikipedia. The weight of each edge is given by the number of relevant subgraphs in which it appears.
Hearst We follow the improved implementation of Hearst patterns by Roller et al. [39]. The authors propose spmi, a method which uses low-rank approximations to smooth the relation matrix so that two concepts can be compared even if there are no direct matches between them. We use the smoothed relation matrix to weigh the relations between the ground truth concepts. The additional hyperparameter for the rank of the smoothed matrix is tuned by grid search over the validation set.
REBEL The REBEL-large model [7] is an LLM trained to extract many types of relations from Wikipedia articles. We only take the subclass of , instance of , member of and part of relations that were extracted. Similar to Hearst, we find that it fails to find many direct relations between ground truth concepts. The same low-rank smoothing technique is applied to improve recall.
Prompting We test the Zero/One/Three-shot performance of instruction-tuned LLMs on the subgraph modelling task described in Section 3.1. To obtain more comparable results, we use Mistral 7B Instruct v0.2, the instruction-tuned version of the base model of OLLM, as the LLM for our prompting baseline. The prompt template used is shown in Figure 6 in Appendix A.1.
Finetune To test the effectiveness of our masked-loss objective, we introduce a direct finetuning baseline using the same configuration as OLLM except it is trained without loss masking.
Table 1: Evaluation metrics of OLLM and baselines on Wikipedia and ar Xiv. OLLM performs particularly well in modelling semantics, and remains competitive syntactically and structurally.
| Dataset | Method | Literal F1 | Fuzzy F1 | Cont. F1 | Graph F1 | Motif Dist. |
|---|---|---|---|---|---|---|
| Wikipedia | Memorisation | 0.134 | 0.837 | 0.314 | 0.419 | 0.063 |
| Wikipedia | Hearst | 0.003 | 0.538 | 0.350 | 0.544 | 0.163 |
| Wikipedia | REBEL | 0.004 | 0.624 | 0.356 | 0.072 | 0.132 |
| Wikipedia | Zero-shot | 0.007 | 0.871 | 0.455 | 0.639 | 0.341 |
| Wikipedia | One-shot | 0.031 | 0.888 | 0.477 | 0.610 | 0.314 |
| Wikipedia | Three-shot | 0.031 | 0.880 | 0.475 | 0.622 | 0.354 |
| Wikipedia | Finetune | 0.124 | 0.884 | 0.470 | 0.588 | 0.050 |
| Wikipedia | OLLM | 0.093 | 0.915 | 0.500 | 0.644 | 0.080 |
| arXiv | Memorisation | 0.000 | 0.207 | 0.257 | 0.525 | 0.037 |
| arXiv | Hearst | 0.000 | 0.000 | 0.151 | 0.553 | 0.098 |
| arXiv | REBEL | 0.000 | 0.060 | 0.281 | 0.546 | 0.088 |
| arXiv | Zero-shot | 0.025 | 0.450 | 0.237 | 0.414 | 0.145 |
| arXiv | One-shot | 0.072 | 0.460 | 0.290 | 0.433 | 0.293 |
| arXiv | Three-shot | 0.051 | 0.405 | 0.212 | 0.385 | 0.124 |
| arXiv | Finetune (transfer) | 0.000 | 0.440 | 0.225 | 0.441 | 0.148 |
| arXiv | OLLM (transfer) | 0.040 | 0.570 | 0.357 | 0.633 | 0.097 |
5.3 Results
We first evaluate whether OLLM can accurately create ontologies with many concepts and relations, such as the Wikipedia categories. Computationally, OLLM required 12 A100-hours for training and 7 A100-hours for inference to generate an ontology for Wikipedia. This is a modest cost by current standards, which demonstrates the scalability of OLLM for real-world problems. In terms of performance, OLLM produces the most semantically accurate ontology in comparison to our baselines, as presented in Table 1. Across all of Fuzzy F1, Continuous F1 and Graph F1, we observe the trend that OLLM scores the best, followed by Finetune and Prompting, and lastly Hearst and REBEL. This is surprising, as it suggests that the combination of LLMs with our subgraph modelling framework is a sufficiently strong inductive bias for LLMs to outperform traditional methods even without finetuning. However, prompting alone is not sufficient to build high-quality ontologies. On the Motif Distance metric, prompting methods score poorly at 0.314–0.354 in comparison to 0.050 and 0.080 for Finetune and OLLM respectively. This shows that using LLMs out-of-the-box for subgraph modelling results in poor structural integrity, though this issue is solved by finetuning. Qualitatively, we observe that OLLM can adhere to the clear, explicit naming style of Wikipedia, even on unseen topics in the test set. For example, it generates Mathematical categories and Groups (mathematics) under the parent concept Mathematical structures to distinguish them from the natural language sense of categories and groups (Figure 11c). Such style is not learned by the prompting baselines: Three-shot generated Elections → France, while it most likely meant Elections → Elections in France (Figure 18c). More sample outputs are shown in Appendix A.4.1.
The arXiv task differs from the Wikipedia task as it has far fewer relations, and there is even less overlap between the train and test splits. This imposes a great challenge on Finetune and OLLM as they need to generalise with a limited diversity of training samples. Despite such constraints, OLLM is substantially better than other methods in modelling the semantics of the test graph. On the Fuzzy F1, Continuous F1, and Graph F1 metrics, OLLM performs the best among all methods with 0.570, 0.357, and 0.633, significantly higher than the next-best of 0.460, 0.290 and 0.546 respectively. Inspecting the generated ontologies (Appendix A.4.2), we observe that prompting baselines tend to produce repetitive concepts such as Machine Learning and Artificial Intelligence and Artificial Intelligence and Machine Learning (Figure 27), while Hearst and REBEL put almost all concepts under the same parent concept(s) (Figures 23 and 24). We also found that OLLM's output for arXiv contains concepts from Wikipedia, but restructured in a way that fits the arXiv ontology. For example, Life sciences and Biological evolution appear in the Wikipedia training set under the same parent category Life with no direct links between them. In the generated graph for arXiv, Life sciences is instead promoted to one of the top-level concepts with Biological Evolution as one of its children, which better fits the fields-of-science style of the arXiv ontology (Figure 20). This demonstrates that OLLM can adapt to produce a new type of ontology by restructuring its learned concepts, all using just a small number of training samples.
In summary, OLLM scores the best or is competitive across all metrics in both tasks, with the notable exception of the Literal F1 metric. We attribute this to the fact that Literal F1 is sensitive to factors like casing and choice of words, and generally only measures syntactic similarity. For example, we see that a suboptimal baseline like Memorisation scores the best on this metric with 0.134 on the Wikipedia task. This reflects that syntactic similarity generally does not entail semantic similarity, so syntax-based metrics should not be used as stand-alone measures for ontology quality.
5.4 Meta-evaluation
In this section, we analyse the usefulness of our new metrics for measuring graph similarity and discuss the limitations of existing metrics. On the Wikipedia task, Memorisation, despite being clearly the worst in Continuous F1 and Graph F1, performs the best on Literal F1 and the second-best on Motif Distance. This can be attributed to the fact that Literal F1 is sensitive to semantically insignificant syntactic differences such as casing and word form, and thus when the training and test sets have non-trivial overlap (Figure 3), it is biased towards methods that overfit. Similarly, as per the method described in Section 4.1, the data splits are constructed with structural symmetry, hence we expect the train and test splits to have a similar graph structure even though the represented concepts are different. As a result, methods that tend to overfit, for example, Memorisation and Finetune, achieve the best scores on Motif Distance. This demonstrates that Literal F1 and Motif Distance only capture syntactic and structural similarity respectively, and thus should not be used as stand-alone metrics for evaluation.
Analysing the edge and node matchings found by our Continuous F1 and Graph F1 metrics on ar Xiv reveals that they successfully capture some human intuition on semantic similarity between the two ontologies. In Figures 9 and 10, we visualise the ontology generated by OLLM and the ground truth and observe that semantically similar components in the two graphs indeed get matched. For example, the Physics and Mathematics clusters in the generated graph get matched with the Mathematics cluster in the ground truth, Data Analysis and Information get matched with Statistics , Economics with Quantitative Finance , and Life Sciences with Quantitative Biology . This suggests that our edge/node matching procedure is capturing a semantic graph isomorphism that allows one to compare similar components in the two graphs, even if they do not exactly share the same concepts. We believe this example of a semantic mapping from one ontology to another is strong evidence that our metrics are capturing meaningful qualities of the ontologies.
6 Discussion
Limitations We only study and evaluate the construction of simple ontologies with only concepts and taxonomic relations. A potential approach to extend OLLM to produce non-taxonomic relations is to add tags indicating the relation type to each edge when linearising the subgraphs for sequence modelling. New evaluation metrics might also be required to handle multiple types of relations. Another limitation is that the taxonomic relations in the generated ontologies are not necessarily transitive due to the existence of cycles. This is a general problem for many OL methods, and there are existing works on cycle removal algorithms for cleaning hierarchies [43, 52]. We ablate this in Appendix A.2.1 and find that the generated ontology can be made consistent by removing a small number of edges. Furthermore, we were unable to fully control for data contamination as the pretraining dataset of Mistral 7B is not publicly known. We do, however, observe that the generated ontologies are sufficiently different from the ground truth, indicating that OLLM is not directly remembering samples from its pretraining stage.
Conclusion In this paper, we introduce a general method for building ontologies in an end-to-end fashion. We propose a set of metrics for end-to-end OL that measures the semantic and structural similarity between arbitrary labelled graphs. Our model, OLLM, outperforms traditional subtask composition methods in reconstructing the Wikipedia categories, and can be transferred to build ontologies for ar Xiv after finetuning on a small number of examples. Using LLMs as the backbone for subgraph modelling opens up exciting avenues for future research. For example, one may generate ontologies from corpora with images using vision language models [11].
7 Acknowledgements
We thank Dr Thomas Sauerwald for suggesting network motifs as a basis for evaluation. AQJ acknowledges the support of a Peterhouse Graduate Studentship.
References

[1] Grigoris Antoniou and Frank Van Harmelen. A semantic web primer. MIT press, 2004.
[2] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25 29, 2000.
[3] Muhammad Nabeel Asim, Muhammad Wasim, Muhammad Usman Ghani Khan, Waqar Mahmood, and Hafiza Mahnoor Abbasi. A survey of ontology learning techniques and applications. Database, 2018:bay101, 2018.
[4] Hamed Babaei Giglou, Jennifer D Souza, and Sören Auer. Llms4ol: Large language models for ontology learning. In International Semantic Web Conference, pages 408 427. Springer, 2023.
[5] Janez Brank, Marko Grobelnik, and Dunja Mladenic. A survey of ontology evaluation techniques. In Proceedings of the conference on data mining and data warehouses (SiKDD 2005), pages 166–170. Citeseer, 2005.
[6] Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini. Ontology learning from text: methods, evaluation and applications, volume 123. IOS press, 2005.
[7] Pere-Lluís Huguet Cabot and Roberto Navigli. Rebel: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370 2381, 2021.
[8] Philipp Cimiano and Johanna Völker. Text2onto: A framework for ontology learning and data-driven change discovery. In International conference on application of natural language to information systems, pages 227 238. Springer, 2005.
[9] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625 2634, 2015.
[12] Marc Ehrig, Peter Haase, Mark Hefke, and Nenad Stojanović. Similarity for ontologies - a comprehensive framework. In European Conference on Information Systems, 2005. URL https://api.semanticscholar.org/CorpusID:9982461.
[13] Maurice Funk, Simon Hosemann, Jean Christoph Jung, and Carsten Lutz. Towards ontology construction with language models. arXiv preprint arXiv:2309.09898, 2023.
[14] Nikhil Goyal, Harsh Vardhan Jain, and Sayan Ranu. Graphgen: A scalable approach to domain-agnostic labeled graph generation. In Proceedings of The Web Conference 2020, pages 1253 1263, 2020.
[15] Thomas R Gruber. A translation approach to portable ontology specifications. Knowledge acquisition, 5(2):199 220, 1993.
[16] Thomas R Gruber. Toward principles for the design of ontologies used for knowledge sharing? International journal of human-computer studies, 43(5-6):907 928, 1995.
[17] Maryam Hazman, Samhaa R El-Beltagy, and Ahmed Rafea. A survey of ontology learning approaches. International Journal of Computer Applications, 22(9):36 43, 2011.
[18] Marti A Hearst. Automated discovery of wordnet relations. WordNet: an electronic lexical database, 2, 1998.
[19] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
[20] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[21] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[22] Lobna Karoui, Marie-Aude Aufaure, and Nacera Bennacer. Ontology discovery from web pages: Application to tourism. In In the Workshop of Knowledge Discovery and Ontologies. Citeseer, 2004.
[23] Vipul Kashyap, Cartic Ramakrishnan, Christopher Thomas, and A. Sheth. Taxaminer: an experimentation framework for automated taxonomy bootstrapping. Int. J. Web Grid Serv., 1:240–266, 2005. URL https://api.semanticscholar.org/CorpusID:5549251.
[24] Neha Kaushik and Niladri Chatterjee. Automatic relationship extraction from agricultural text for ontology construction. Information processing in agriculture, 5(1):60 73, 2018.
[25] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[27] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83 97, 1955.
[28] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611 626, 2023.
[29] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[30] Alexander Maedche and Steffen Staab. Measuring similarity between ontologies. In Asunción Gómez-Pérez and V. Richard Benjamins, editors, Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pages 251 263, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-45810-4.
[31] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David Mc Closky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55 60, 2014.
[32] Silvano Martello and Paolo Toth. Linear assignment problems. In North-Holland Mathematics Studies, volume 132, pages 259 282. Elsevier, 1987.
[33] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39 41, 1995.
[34] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824 827, 2002.
[35] Simone Paolo Ponzetto, Michael Strube, et al. Deriving a large scale taxonomy from wikipedia. In AAAI, volume 7, pages 1440 1445, 2007.
[36] Robert Porzel and Rainer Malaka. A task-based approach for ontology evaluation. In ECAI Workshop on Ontology Learning and Population, Valencia, Spain, volume 1. Citeseer Valencia, Spain, 2004.
[37] Joe Raad and Christophe Cruz. A survey on ontology evaluation methods. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development, part of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 2015.
[38] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bertnetworks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/ abs/1908.10084.
[39] Stephen Roller, Douwe Kiela, and Maximilian Nickel. Hearst patterns revisited: Automatic hypernym detection from large text corpora. arXiv preprint arXiv:1806.03191, 2018.
[40] Schema.org. Schema.org, 2011. URL https://www.schema.org/.
[41] Shai S Shen-Orr, Ron Milo, Shmoolik Mangan, and Uri Alon. Network motifs in the transcriptional regulation network of escherichia coli. Nature genetics, 31(1):64 68, 2002.
[42] Philipp Sorg and Philipp Cimiano. Exploiting wikipedia for cross-lingual and multilingual information retrieval. Data & Knowledge Engineering, 74:26 45, 2012.
[43] Jiankai Sun, Deepak Ajwani, Patrick K Nicholson, Alessandra Sala, and Srinivasan Parthasarathy. Breaking cycles in noisy hierarchies. In Proceedings of the 2017 ACM on Web Science Conference, pages 151 160, 2017.
[44] Milena Trajanoska, Riste Stojanov, and Dimitar Trajanov. Enhancing knowledge graph construction using large language models. arXiv preprint arXiv:2305.04676, 2023.
[45] Pucktada Treeratpituk, Madian Khabsa, and C. Lee Giles. Graph-based approach to automatic taxonomy generation (grabtax). arXiv, abs/1307.1718, 2013. URL https://api.semanticscholar.org/CorpusID:8625171.
[46] Anne-Marie Vercoustre, Jovan Pehcevski, and James A Thom. Using wikipedia categories and links in entity ranking. In Focused Access to XML Documents: 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2007 Dagstuhl Castle, Germany, December 17-19, 2007. Selected Papers 6, pages 321 335. Springer, 2008.
[47] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776 5788, 2020.
[48] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824 24837, 2022.
[49] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861 6871. PMLR, 2019.
[50] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, pages 5708 5717. PMLR, 2018.
[51] Elias Zavitsanos, Georgios Paliouras, and George A. Vouros. Gold standard evaluation of ontology learning methods through ontology transformation and alignment. IEEE Transactions on Knowledge and Data Engineering, 23:1635–1648, 2011. URL https://api.semanticscholar.org/CorpusID:15607684.
[52] Torsten Zesch and Iryna Gurevych. Analysis of the wikipedia category graph for nlp applications. In Proceedings of the Second Workshop on Text Graphs: Graph-Based Algorithms for Natural Language Processing, pages 1 8, 2007.
A Appendix / supplemental material
A.1 Experiment details
A.1.1 Wikipedia
Some prior works that use Wikipedia categories as a dataset perform additional filtering of concepts considered to be meta-categories mainly used for page management [35]. We instead decided not to further filter the source data to minimise external bias. We note that it is often not clear-cut whether a Wikipedia category is just for page management. For example, Wikipedia categories of the form "Lists of [subject]" refer to a special type of article whose main body is a bullet-point or table listing of the subject, which is a useful concept in the Wikipedia domain.
A.1.2 OLLM

For the Wikipedia experiment, we use Mistral 7B v0.2 (not instruction-tuned) [21] as the base model. We attach LoRA [20] adaptors to all attention and feed-forward layers with parameters r = 32 and α = 16. The model is trained for 2 epochs (∼17K steps) with batch size 16, context length 2048, and is optimised with Adam using a constant learning rate of 1e-5 with warm-up from zero for the first 100 steps. Finetune uses the same configuration. Training takes 12 A100-hours.
For the arXiv experiment, we further finetune the model trained on Wikipedia with the masked loss objective on 2048 document-subgraph pairs from the arXiv training set. We merge the LoRA adaptors from the Wikipedia experiment and initialise new ones with r = 8 and α = 8. The model is trained with batch size 16 and Adam with a constant learning rate of 3e-6 and warm-up from zero for the first 10 steps. Training terminates when the loss stops improving on the evaluation set, which happened at step 288. Finetune (transfer) uses the same configuration. Early stopping happened at step 192.
For both experiments, we finetune the model with an instruction template similar to that of Mistral 7B Instruct v0.2. The format is shown below:
[INST]\
Title: {{ title }}
{{ abstract }}[/INST]\
{% for path in paths %}
{{ path | join(" -> ") }}
{% endfor %}\
Figure 5: Linearisation template for OLLM training.
For inference, we use the vLLM [28] server, which achieves a throughput of 10 documents per second. Inference on the validation and test splits of both datasets takes 12 A100-hours in total.
A.1.3 Hearst
The Hearst baseline follows the implementation by Roller et al. [39]. Using the tokenization, part-of-speech tagging, lemmatisation, and token regex functionality of the CoreNLP pipeline [31], taxonomic relations are extracted according to the 28 Hearst patterns used by the authors. Processing all documents takes 10 CPU-hours.
Following the spmi method, low-rank smoothing is applied to the relation matrix to allow comparison between any two concepts even if they are not directly related by an extracted relation. The rank of the smoothed matrix, r, is a hyperparameter which we tune by sweeping over $r \in \{5, 10, 15, 20, 25, 50, 100, 150, 200, 250\}$ on the validation set. This defines a dense weighted graph as the raw output. Unfortunately, computing Continuous F1 on a dense graph is very slow, especially for Wikipedia. This is because the Hungarian algorithm used for solving the optimal matching between edges has time complexity $O(N^3)$, where N is the number of edges. To bypass this issue, we perform a pre-filtering step of only exporting the top $10|V|$ weighted edges in the smoothed relation matrix, where $|V|$ is the number of nodes in the graph. For the datasets considered, this density of edges is still much higher than that of the ground truth, and thus, we expect this to have minimal impact on the final output after post-processing.
A.1.4 REBEL
We use REBEL-large [7] in the implementation. The model is an encoder-decoder transformer based on BART-large [29] with 406M parameters. We sample the model with the default configuration used by Cabot and Navigli [7]. The model is trained to predict 220 types of relations, most of which are not taxonomic relations. We filter the extracted relations and only keep those tagged with subclass of , instance of , member of , and part of relation types. The same low-rank smoothing method as Hearst is applied to the raw extractions. Processing all documents takes 3 A100-hours.
A.1.5 Prompting
To obtain more comparable results, we use Mistral 7B Instruct v0.2, the instruction-tuned version of the base model of OLLM, as the LLM for our prompting baseline. For One-shot and Three-shot, we randomly sample examples from the training set for each query. The output is parsed using regex and results that do not match the regex are discarded. We perform manual prompt engineering by inspecting individual responses. The final prompt template is shown in Figure 6. The total inference cost for all prompting baselines is 50 A100-hours.
The following is an article's title and abstract. Your task is to assign this article to suitable category hierarchy. A category is typically represented by a word or a short phrase, representing broader topics/concepts that the article is about. A category hierarchy represented by a collection of paths from the generic root category "Main topic classifications" to a specific category suitable for the article. The topics titles should become more and more specific as you move from the root to the leaf.

{% if examples|length > 0 %}
{% for example in examples %}
### EXAMPLE {{ loop.index }} ###
### ARTICLE ###
Title: {{ example['title'] }}
{{ example['abstract'] }}
### END ARTICLE ###
{% for path in example['paths'] %}
{{ path | join(" -> ") }}
{% endfor %}
### END EXAMPLE {{ loop.index }} ###
{% endfor %}
{% else %}
You must answer in the format of:
Main topic classifications -> Broad topic 1 -> Subtopic 1 -> ... -> Most specific topic 1
Main topic classifications -> Borad topic 2 -> Subtopic 2 -> ... -> Most specific topic 2
...
{% endif %}

### ARTICLE ###
Title: {{ title }}
{{ abstract }}
### END ARTICLE ###

Provide a category hierarchy for the above article. \
{% if examples|length > 0 %}
Use the same format as the examples above.
{% else %}
Use the format described above.
{% endif %}
Figure 6: Prompt template used for the Zero/One/Three-shot baselines.
A.1.6 Hyperparameters
The raw generated outputs of all methods are post-processed with the same scheme as described in Section 3.2. The best hyperparameters for the post-processing step found by grid search on the validation set are reported in Table 2.
Table 2: Values of the best hyperparameters found by grid search. r is the rank of the low-rank smoothing, only applicable to Hearst and REBEL. α = β = 0 means no edges are pruned from the raw output apart from self-loop and inverse edge removal.
| Dataset | Method | α | β | r |
|---|---|---|---|---|
| Wikipedia | Memorisation | 0 | 0.058489 | - |
| Wikipedia | Hearst | 0.786685 | 0 | 5 |
| Wikipedia | REBEL | 0.872544 | 0 | 20 |
| Wikipedia | Zero-shot | 0.976781 | 0.298107 | - |
| Wikipedia | One-shot | 0.990906 | 0.346684 | - |
| Wikipedia | Three-shot | 0.991955 | 0.530957 | - |
| Wikipedia | Finetune | 0.883848 | 0.058489 | - |
| Wikipedia | OLLM | 0.974330 | 0.025893 | - |
| arXiv | Memorisation | 0.340246 | 0 | - |
| arXiv | Hearst | 0.595878 | 0 | 150 |
| arXiv | REBEL | 0.836685 | 0 | 100 |
| arXiv | Zero-shot | 0.999896 | 0.346684 | - |
| arXiv | One-shot | 0.999611 | 0.401187 | - |
| arXiv | Three-shot | 0.999851 | 0.298107 | - |
| arXiv | Finetune (transfer) | 0.988129 | 0.346684 | - |
| arXiv | OLLM (transfer) | 0.983681 | 0.123872 | - |
A.2 Ablations
In this section, we present the results of our ablations regarding output consistency, the benefits of more advanced prompting techniques, and a comparison against LLMs4OL [4].
A.2.1 Consistency
Taxonomic relations are commonly assumed to be transitive and anti-symmetric. One limitation of many OL methods, including OLLM, is that they do not guarantee that the generated ontology is cycle-free, which can lead to inconsistent taxonomic relations. To achieve consistency, generic post-processing techniques [43] can be applied to remove such cycles.
We analysed the ontologies generated by OLLM and found only 97 simple cycles in Wikipedia and none in arXiv. Using a greedy algorithm that repeatedly removes the edge participating in the most simple cycles (a heuristic for the smallest set of edges whose removal makes the graph acyclic), we eliminate all such cycles and make the ontology consistent by removing just 26 of the 10,414 edges in Wikipedia. This is surprising considering that we did not explicitly optimise our model for consistency. A minimal sketch of this procedure is shown below.
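The sketch assumes the ontology is stored as a networkx digraph; the exact implementation in OLLM may differ.

```python
import networkx as nx

def make_acyclic(G: nx.DiGraph) -> nx.DiGraph:
    """Greedily remove the edge that appears in the most simple cycles until the
    graph is acyclic (a heuristic for the minimum feedback edge set)."""
    G = G.copy()
    while True:
        cycles = list(nx.simple_cycles(G))
        if not cycles:
            return G
        # Count how many simple cycles each edge participates in.
        counts = {}
        for cycle in cycles:
            for u, v in zip(cycle, cycle[1:] + cycle[:1]):
                counts[(u, v)] = counts.get((u, v), 0) + 1
        # Remove the edge that breaks the most cycles.
        G.remove_edge(*max(counts, key=counts.get))
```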
A.2.2 Chain of thought prompting
More sophisticated prompting techniques, such as chain-of-thought (CoT) prompting [48], have been shown to bring significant improvements in LLM inference. We explore whether we can establish stronger baselines by employing CoT in our prompting methods.
We extend the zero-shot prompting method such that prediction now involves two rounds of inference: in the first round, we ask the model to describe the concepts potentially relevant to the input document and to explain its reasoning; in the second round, we ask the model to predict the subgraph in the specified format, given this additional, self-generated context. The prompts used are shown in Figures 7 and 8, and a sketch of the two-round inference loop follows Figure 8.
We tested the CoT method on Wikipedia and found no significant difference from basic zero-shot prompting, as shown in Table 3. We attribute this to the fact that CoT prompting primarily aims to improve logic and multi-step reasoning. We hypothesise that performance in OL depends more on the model's understanding of natural language than on its ability to perform multi-step reasoning, hence we do not observe any significant improvement from CoT.
The following is an article's title and abstract. Briefly break down the topics (both specific and general concepts) relevant to this article. Explain your reasoning step by step.

### ARTICLE ###
Title: {{ title }}
{{ abstract }}
### END ARTICLE ###

Figure 7: Chain-of-thought first prompt.
Your task now is to assign this article to a suitable category hierarchy. A category is typically represented by a word or a short phrase, representing broader topics/concepts that the article is about. A category hierarchy is represented by a collection of paths from the generic root category "Main topic classifications" to a specific category suitable for the article. The topic titles should become more and more specific as you move from the root to the leaf.

You must answer in the format of:
Main topic classifications -> Broad topic 1 -> Subtopic 1 -> ... -> Most specific topic 1
Main topic classifications -> Broad topic 2 -> Subtopic 2 -> ... -> Most specific topic 2
...

Figure 8: Chain-of-thought second prompt.
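In the sketch below, `generate` stands in for the chat-completion call to Mistral 7B Instruct and `parse_paths` for the regex parser of the zero-shot baseline; neither name is from the OLLM codebase.

```python
# Sketch of the two-round chain-of-thought inference loop.
def cot_predict(generate, parse_paths, first_prompt: str, second_prompt: str):
    # Round 1: ask the model to break down the relevant concepts (Figure 7).
    messages = [{"role": "user", "content": first_prompt}]
    reasoning = generate(messages)
    # Round 2: ask for the category hierarchy given the self-generated
    # reasoning as extra context (Figure 8), then parse the paths as usual.
    messages += [
        {"role": "assistant", "content": reasoning},
        {"role": "user", "content": second_prompt},
    ]
    return parse_paths(generate(messages))
```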
A.2.3 Comparison against LLMs4OL
In this ablation, we evaluate whether OLLM's improvement is due to the improved methodology (end-to-end modelling) or simply due to the use of LLMs. One way to construct ontologies with LLMs, proposed by LLMs4OL, is to first prompt the LLM for possible concepts in a document, and then to perform link prediction by prompting for a yes/no response. Unfortunately, constructing a baseline from these two subtasks is non-trivial: we encountered significant scalability issues in the link prediction stage as it requires O(n²) inferences. We make two modifications to overcome this limitation:
1. After the concept discovery stage, we discard all but the n most frequent concepts to limit the number of inferences required during link prediction, where n is the number of concepts in the ground truth.
2. Instead of using zero-shot Mistral 7B as the link predictor, we use a finetuned BERT, which runs much faster. Given that LLMs4OL demonstrated that finetuned models perform much better than zero-shot inference on link prediction, we expect the finetuned BERT to be at least as good as, if not better than, zero-shot Mistral 7B on this subtask.
A sketch of the resulting two-stage pipeline follows this list.
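In this sketch, `discover_concepts` and `link_predictor` are placeholders for the zero-shot concept-discovery prompt and the finetuned BERT yes/no classifier; it is illustrative, not the original LLMs4OL code.

```python
from collections import Counter
from itertools import permutations

# Sketch of the modified LLMs4OL-style baseline.
def llms4ol_baseline(documents, discover_concepts, link_predictor, n):
    # Stage 1: concept discovery, keeping only the n most frequent concepts
    # so that the O(n^2) link-prediction stage stays tractable.
    counts = Counter(c for doc in documents for c in discover_concepts(doc))
    concepts = [c for c, _ in counts.most_common(n)]
    # Stage 2: link prediction over all ordered concept pairs with a
    # binary classifier answering "is `child` a subcategory of `parent`?".
    edges = [(parent, child)
             for parent, child in permutations(concepts, 2)
             if link_predictor(parent, child)]
    return concepts, edges
```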
We design this ablation to be comparable to zero-shot end-to-end modelling: both use zero-shot Mistral 7B as the backbone, just utilised in different ways. We tested this method on Wikipedia and found that it is worse than zero-shot end-to-end modelling on all metrics except Motif Distance, as shown in Table 4. This is evidence that our end-to-end modelling approach is a clear improvement over traditional subtask-based OL. Not only does LLMs4OL suffer from significant scalability bottlenecks, making it unlikely to scale to large problems, but its performance is also worse. The results suggest that we can leverage the capabilities of LLMs more effectively and efficiently than by solving subtasks alone, for example by predicting subgraphs.
Table 3: Comparison of Zero-shot with and without chain-of-thought prompting. There is no significant difference in performance.
| Dataset | Method | Literal F1 | Fuzzy F1 | Cont. F1 | Graph F1 | Motif Dist. |
|---|---|---|---|---|---|---|
| Wikipedia | Zero-shot | 0.007 | 0.871 | 0.455 | 0.639 | 0.341 |
| Wikipedia | Zero-shot CoT | 0.007 | 0.873 | 0.449 | 0.635 | 0.357 |
Table 4: Comparison of Zero-shot end-to-end modelling and LLMs4OL-style modelling with zero-shot concept discovery and finetuned BERT link prediction. LLMs4OL generally performs worse than Zero-shot.
| Dataset | Method | Literal F1 | Fuzzy F1 | Cont. F1 | Graph F1 | Motif Dist. |
|---|---|---|---|---|---|---|
| Wikipedia | Zero-shot | 0.007 | 0.871 | 0.455 | 0.639 | 0.341 |
| Wikipedia | LLMs4OL | 0.003 | 0.841 | 0.428 | 0.482 | 0.092 |
A.3 Visualising evaluation metrics
A.3.1 Visualisation of node matching in Graph F1
Figure 9: Highest-scoring node matching from the Graph F1 metric between the ontology generated by OLLM (teal) and the ground truth ontology (black). The matching between nodes is shown in red, where the opacity of each edge indicates the similarity score (weaker links are more transparent). Visually, the matching defines a clear alignment of the two graphs: from the centre to the left are the Mathematics-related concepts; at the top right are the Biology-related concepts; and at the bottom right are the Economics-related concepts.
A.3.2 Visualisation of edge matching in Continuous F1
Figure 10: Highest-scoring edge matching from the Continuous F1 metric between the ontology generated by OLLM (teal) and the ground truth ontology (black). The matching between edges is shown in red, where the opacity of each edge indicates the similarity score (weaker links are more transparent). Visually, the matching defines a clear alignment of the two graphs: in the bottom left and centre are the Mathematics-related concepts; at the right are the Biology-related concepts; and at the top left are the Economics-related concepts.
A.4 Visualisation of generated ontologies
A.4.1 Wikipedia
We include some generated outputs for Wikipedia here. Since the full generated output is too large to visualise, we plot subgraphs of the output instead. We sample the subgraphs by the following method:
1. Pick a random node in the generated graph.
2. Take the subgraph induced by the 1-hop neighbourhood of the chosen node.
3. Include the shortest path from the root Main topic classifications to the chosen node if such a path exists.
4. Repeat from step 1 if the subgraph has more than 30 nodes or fewer than 5 nodes.
We apply the filtering step (step 4) because subgraphs with too many nodes are difficult to inspect manually, while those with too few are uninformative. For Hearst, we raise the filtering upper bound to 50 nodes as we fail to find subgraphs smaller than 30 nodes quickly. We additionally colour each edge black if it occurs literally in the training graph, blue if it occurs literally in the test graph, and red otherwise. A sketch of this sampling procedure is shown below.
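The sketch assumes the generated ontology is stored as a networkx digraph rooted at Main topic classifications; the neighbourhood direction and size bounds are illustrative.

```python
import random
import networkx as nx

ROOT = "Main topic classifications"

def sample_subgraph(G: nx.DiGraph, min_nodes: int = 5, max_nodes: int = 30):
    """Sample a subgraph for visualisation following the steps above (sketch)."""
    while True:
        node = random.choice(list(G.nodes))
        # 1-hop neighbourhood (here taken over both in- and out-edges).
        nodes = {node} | set(G.predecessors(node)) | set(G.successors(node))
        # Include the shortest path from the root, if one exists.
        if nx.has_path(G, ROOT, node):
            nodes |= set(nx.shortest_path(G, ROOT, node))
        sub = G.subgraph(nodes)
        if min_nodes <= sub.number_of_nodes() <= max_nodes:
            return sub
```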
Figure 11: Sub-ontologies for Wikipedia generated by OLLM, centred on various topics: (a) Biology, (b) Language policy, (c) Mathematical structures.
Figure 12: Sub-ontologies for Wikipedia generated by Finetune, centred on various topics: (a) Energy economics, (b) Internet activism, (c) Theories.
Figure 13: Sub-ontologies for Wikipedia generated by Memorisation, centred on various topics, including (a) Artificial objects and (c) Nature and religion.
Figure 14: Sub-ontologies for Wikipedia generated by Hearst, centred on various topics, including (b) Government and (c) Society.
Figure 15: Sub-ontologies for Wikipedia generated by REBEL, centred on various topics, including (a) Elections and (c) Vocal music.
Figure 16: Sub-ontologies for Wikipedia generated by Zero-shot, centred on various topics: (a) Criminal Justice System, (b) Denmark, (c) Machine Learning.
Figure 17: Sub-ontologies for Wikipedia generated by One-shot, centred on various topics: (a) Athletics, (b) Legal studies, (c) Physiology.
Figure 18: Sub-ontologies for Wikipedia generated by Three-shot, centred on various topics: (a) Aerospace technology, (b) Artificial intelligence and machine learning, (c) Elections.
A.4.2 arXiv
Figure 19: Ground truth test split ontology for arXiv.
Figure 20: Ontology for arXiv generated by OLLM.
Figure 21: Ontology for arXiv generated by Finetune.
Figure 22: Ontology for arXiv generated by Memorisation.
Figure 23: Ontology for arXiv generated by Hearst.
Figure 24: Ontology for arXiv generated by REBEL.
Figure 25: Ontology for arXiv generated by Zero-shot.
Figure 26: Ontology for arXiv generated by One-shot.
Figure 27: Ontology for arXiv generated by Three-shot.
NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims of this paper are: (1) OLLM is an effective method for building ontologies from scratch, and (2) the evaluation metrics we introduce are robust and useful for gold standard evaluation of ontologies. We justify the first claim by demonstrating that OLLM outperforms our baseline methods on Wikipedia and arXiv according to our metrics. We justify the second claim by showing that an existing metric (Literal F1) can score the sub-optimal Memorisation solution highly, while our metrics are not subject to this issue. Our new metrics also yield results that align with our qualitative analysis.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations and the possible resolutions in Section 6. This includes the scope of the work, where we only study ontologies with concepts and taxonomic relations, the inability to guarantee taxonomic relation transitivity, and the inability to control for data leakage from the pretraining stage of the LLM base model.
Guidelines:
The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: This paper does not include theoretical results.
Guidelines:
The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We describe the data collection procedure in Section 4.1 and the full experiment details in Appendix A.1. We also include the code and dataset in the supplementary material.
Guidelines:
The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: The code and data used for this project are provided in the supplementary material. The code includes a README which details the steps for reproducing our results.
Guidelines:
The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We give all the experimental details in Appendix A.1.
Guidelines:
The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: We did not perform repeated experiments due to compute constraints, therefore there are no error bars.
Guidelines:
The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We describe the compute requirements for each experiment in Appendix A.1.
Guidelines:
The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: This paper does not involve human subjects and does not use data that is not already in the public domain. We clearly describe our data collection procedure in Section 4.1. Our method is highly specialised in building ontologies and solving related tasks, thus having minimal societal impact or any potentially harmful consequences.
Guidelines:
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our method is highly specialised in building ontologies and solving related tasks only. We do not expect our work to have a wider impact than improving the quality of existing or new ontologies.
Guidelines:
The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper poses no such risks. The trained model is specific to building ontologies only.
Guidelines:
The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The two datasets collected in this project use data from Wikipedia and ar Xiv, which are in the public domain under the CC BY-SA 4.0 and CC0 1.0 Deed licenses respectively. The REBEL-large model is available under CC BY-NC-SA 4.0 license.
Guidelines:
The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We release the dataset and code for this paper. The data collection procedure is clearly described in Section 4.1. The training procedure is clearly described in Appendix A.1.
Guidelines:
The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.