Ontologies are key to how we share common understandings, conduct research, and build technologies. Critical for knowledge integration is the ability to reconcile terms from one ontology with related concepts in other ontologies. We previously looked at some of the challenges and approaches involved in doing this. In this post, we provide a high-level overview of some of the automated methods that are enabling large numbers of ontology terms to be connected rapidly.
Performing ontology alignments can be complex, often requiring subject matter expertise to recognize context and nuance. Several unsupervised and semi-supervised automated approaches have been developed to scale up mapping. These can broadly be classified into rules-based and machine learning techniques.
Lexical mapping tools, such as the open-source Agreement Maker Light (AML)1 and the Large-Scale Ontology Matching System (LSMatch)2, use algorithms to detect similarities between the labels and synonyms of corresponding ontologies. Other rules-based approaches include LogMap3, which infers or negates mappings by comparing ontology structures and relationships. Whilst these approaches can achieve high levels of precision, they often work best when additional information about a concept is already available (synonyms, logical descriptions, existing classifications, etc.), something that is often missing from flat lists or simple vocabularies.
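As a rough illustration of the lexical idea (not the algorithm used by AML or LSMatch), the sketch below normalises labels and synonyms from two term lists and reports pairs that share a string. The identifiers are placeholders.

```python
from collections import defaultdict

def normalise(label: str) -> str:
    """Lower-case, trim, and collapse whitespace so trivially different labels compare equal."""
    return " ".join(label.lower().split())

def lexical_matches(source: dict, target: dict) -> list:
    """source/target map a term ID to a list of its label and synonyms;
    returns (source_id, target_id) pairs that share a normalised string."""
    index = defaultdict(set)
    for term_id, labels in target.items():
        for label in labels:
            index[normalise(label)].add(term_id)
    matches = []
    for term_id, labels in source.items():
        for label in labels:
            for target_id in index.get(normalise(label), ()):
                matches.append((term_id, target_id))
    return matches

# Placeholder identifiers; the shared synonym "heart" links the two terms.
source = {"SRC:001": ["heart"]}
target = {"TGT:001": ["Heart", "cardiac organ"]}
print(lexical_matches(source, target))  # [('SRC:001', 'TGT:001')]
```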
Machine learning has also been used to identify relatedness between concepts, which is particularly useful for lists that lack ontology structure and metadata. These techniques include word embeddings such as Word2Vec4, as well as language-model approaches such as BERTMap, which couples mapping predictions from a fine-tuned BERT model with analysis of ontology structure and logic5.
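A minimal sketch of the embedding idea, using averaged word vectors and cosine similarity with a toy gensim Word2Vec model (a stand-in for the large pre-trained or fine-tuned models used in practice, not the BERTMap method itself):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus; in practice you would load a model trained on a large domain corpus.
corpus = [
    ["myocardial", "infarction", "affects", "the", "heart"],
    ["heart", "attack", "is", "another", "name", "for", "myocardial", "infarction"],
    ["the", "liver", "metabolises", "drugs"],
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)

def term_vector(term: str) -> np.ndarray:
    """Average the word vectors of a multi-word term (out-of-vocabulary words are skipped)."""
    vectors = [model.wv[w] for w in term.lower().split() if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def relatedness(a: str, b: str) -> float:
    """Cosine similarity between two term vectors."""
    va, vb = term_vector(a), term_vector(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

# With such a tiny corpus the score is not meaningful; with a real model,
# related terms score noticeably higher than unrelated ones.
print(relatedness("myocardial infarction", "heart attack"))
```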
Combining different rules-based and machine learning mapping techniques is yielding more accurate mappings. In a recent paper, MapperGPT, which uses a large language model (LLM), was applied as a post-processing step after a simple, high-recall lexical algorithm. Deployed this way, the researchers avoided the computational and financial overhead of using a GPT model to map every concept directly, instead using it to isolate and remove false positives6.
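A sketch of the general pattern (not MapperGPT itself): a cheap, permissive lexical matcher proposes candidate pairs, and an LLM, represented here by a hypothetical llm_confirms_mapping function, reviews only those candidates to weed out false positives.

```python
from difflib import SequenceMatcher

def high_recall_candidates(source_terms, target_terms, threshold=0.6):
    """Generate candidate pairs with a permissive string-overlap heuristic
    (many false positives are expected at this stage)."""
    candidates = []
    for s in source_terms:
        for t in target_terms:
            if SequenceMatcher(None, s.lower(), t.lower()).ratio() >= threshold:
                candidates.append((s, t))
    return candidates

def llm_confirms_mapping(source_label: str, target_label: str) -> bool:
    """Hypothetical LLM call: ask the model whether the two labels denote the same concept.
    In practice this would prompt an LLM and parse a yes/no answer."""
    raise NotImplementedError("Replace with a call to your LLM of choice")

def filtered_mappings(source_terms, target_terms):
    """Keep only candidates that survive LLM review, so the LLM sees far fewer
    pairs than a full pairwise comparison would require."""
    return [
        pair for pair in high_recall_candidates(source_terms, target_terms)
        if llm_confirms_mapping(*pair)
    ]
```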
The performance of classification models and mappings can be measured by their precision and recall. Precision can be seen as a measure of quality and recall as a measure of quantity: higher precision means the algorithm returns more relevant results than irrelevant ones, while higher recall means it returns most of the relevant results (whether or not irrelevant ones are also returned). Accuracy, Precision, and Recall — Never Forget Again!7 is an excellent explainer.
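For concreteness, precision and recall over a set of predicted mappings can be computed against a gold standard as in this minimal sketch (the identifiers are placeholders):

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """predicted and gold are sets of (source_id, target_id) mapping pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

predicted = {("SRC:1", "TGT:1"), ("SRC:2", "TGT:9")}  # placeholder identifiers
gold = {("SRC:1", "TGT:1"), ("SRC:3", "TGT:3")}
print(precision_recall(predicted, gold))  # (0.5, 0.5)
```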
To accurately assess the performance of different ontology mapping approaches, researchers can compare their results against standard datasets produced by the Ontology Alignment Evaluation Initiative (OAEI)8. These gold standards, published each year, are organized into tracks; for example, the anatomy track is a real-world set of mappings between the Mouse Anatomy ontology (2,744 classes) and the part of the NCI Thesaurus describing human anatomy (3,304 classes). The OAEI also publishes its results, allowing data scientists to compare their models against other industry models.
Whilst there is a plethora of approaches available that generate large numbers of “likely” alignments, automated ontology mapping is not without its challenges. These include:
At SciBite, we’ve developed Workbench, a simple, user-friendly tool that allows users to map ontologies. Workbench connects concepts using TERMite, SciBite’s named entity recognition engine, and over 80 core SciBite vocabularies, providing comprehensive coverage of the biomedical domain and containing more than 20 million synonyms.
A user begins by adding the source ontology and then selecting the target ontology they wish to map to. In the following example (see Figure 1), terms from the source ontology, Mouse Anatomy (MA), are mapped to human anatomy terms within the target National Cancer Institute Thesaurus (NCIt) ontology.
Figure 1: Selecting the source and target ontologies.
Workbench also includes options to control the specificity of suggested ontology mappings. Mask terms allow users to exclude certain words from the string similarity algorithm that may be considered noisy. For example, the terms disease or system may be disregarded if they occur within multiple concepts across the source and target ontologies. Users can also control the suggested mapping relationships (exact, broad, and none) or the similarity scoring between mapped concepts.
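As an illustration of the masking idea (not Workbench’s actual implementation), noisy words can simply be stripped from labels before the string comparison is run; the mask list below is hypothetical.

```python
MASK_TERMS = {"disease", "system"}  # hypothetical mask list

def apply_mask(label: str, mask: set = MASK_TERMS) -> str:
    """Remove masked words from a label so they do not inflate string similarity."""
    return " ".join(w for w in label.lower().split() if w not in mask)

print(apply_mask("immune system disorder"))  # "immune disorder"
```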
Workbench uses a string similarity algorithm that calculates a similarity score and an ambiguity score. The score is lowered if a term within the source ontology matches multiple terms within the target ontology. Where exact matches aren’t observed, Workbench also includes a substring match (e.g., parent-child) score.
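The sketch below illustrates the general idea of combining a string-similarity score with an ambiguity penalty when a source term matches several target terms; it is not the scoring used by Workbench, and the threshold and penalty values are arbitrary.

```python
from difflib import SequenceMatcher

def best_mapping(source_label: str, target_labels: list, min_score: float = 0.8):
    """Score every target label, then penalise the result if several targets
    score close to the best match (i.e. the mapping is ambiguous)."""
    scored = sorted(
        ((SequenceMatcher(None, source_label.lower(), t.lower()).ratio(), t) for t in target_labels),
        reverse=True,
    )
    best_score, best_target = scored[0]
    # Count competing targets whose score is close to the best match.
    competitors = sum(1 for score, _ in scored if score >= best_score - 0.05)
    ambiguity_penalty = (competitors - 1) * 0.1
    final_score = max(best_score - ambiguity_penalty, 0.0)
    return (best_target, final_score) if final_score >= min_score else None

print(best_mapping("auditory ossicle", ["Auditory Ossicle", "Ossicle", "Ear"]))
# ('Auditory Ossicle', 1.0)
```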
We can also use SciBite’s extensive vocabularies as mediators to generate mappings when there are no direct matches between source and target concepts or synonyms. In the following example (see Figure 2), the Mouse Anatomy term (MA:0000254) cannot be mapped directly to the National Cancer Institute Thesaurus term (NCI:C32164) because there are no direct matches between their concepts or synonyms. However, Workbench can produce a mapping via SciBite’s Anatomy vocabulary, which contains the related terms Auditory Ossicle and auditory bone, connecting the MA and NCIt ontologies.
Figure 2: Using SciBite’s mediating vocabularies.
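A rough sketch of the mediating-vocabulary idea described above: when source and target share no labels or synonyms directly, an intermediate vocabulary whose concepts carry synonyms from both sides can bridge them. The identifiers and synonym sets below are illustrative only, not SciBite’s actual data.

```python
def bridge_via_mediator(source_synonyms: set, target_synonyms: set, mediator: dict):
    """mediator maps a concept ID to its synonym set; a mapping is suggested when one
    mediator concept overlaps both the source and the target synonym sets."""
    for concept_id, synonyms in mediator.items():
        if synonyms & source_synonyms and synonyms & target_synonyms:
            return concept_id
    return None

# Illustrative data: no direct overlap between source and target,
# but the mediator concept carries synonyms from both.
source = {"auditory ossicle"}
target = {"auditory bone"}
mediator = {"ANATOMY:0001": {"auditory ossicle", "auditory bone", "ossicle"}}
print(bridge_via_mediator(source, target, mediator))  # ANATOMY:0001
```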
Mappings generated in Workbench are displayed in a simple table (see Figure 3). Source concepts and their corresponding mapped target terms are presented in rows, along with the semantic relationship, the similarity score, and the common or “matched” term between source and target. Within Workbench, users can review specific mappings and save them for future ontology alignments.
Figure 3: View ontology mappings in Workbench.
SciBite recognizes the importance of providing biocurators with a simple tool for rapidly and accurately mapping ontologies. We are keen to extend the mapping algorithm to use combined rules-based and machine-learning approaches, as well as SciBite’s versioned, maintained, and customizable mapping datasets to support mapping predictions.
We’d love to hear your thoughts. What automated approaches do you use to map your ontologies, what works well, and what are your challenges? To share your responses or for more information about our ontology mapping datasets or Workbench, please get in touch.
Andy Balfe received his BSc and PhD in organic chemistry from the University of East Anglia. He coordinates the delivery of innovative projects across SciBite’s product suite.