Ontology mapping: Advancing data interoperability

Any organization that deals with data today needs to adopt digital technologies to drive innovation and improve efficiency. This is especially true of data-rich organizations, such as those in the life sciences. The more data an organization has, the more challenges it faces in managing that data in a way that allows its reuse.

Fortunately, public ontologies have been developed that encode the semantics of knowledge in a machine-readable manner, dramatically reducing this burden.¹,² Rapid alignment of text to formally defined things, often referred to as data FAIR-ification (because it makes data Findable, Accessible, Interoperable, and Reusable), allows researchers to analyze vast repositories of historic and prospective data, confident that results are contextually relevant to their use cases. Scientists are increasingly turning to Artificial Intelligence (AI) to derive insights from their data and to improve productivity, but they recognize the importance of data labeling and annotation in creating better machine learning models.³

Navigating the complex ontology landscape

Ontologies, therefore, are a key component in managing data. They articulate knowledge about a domain (see Figure 1) through the use of formally agreed names and definitions of categories, properties, and relationships between concepts, data, or entities.

Figure 1: A short video explaining the role of ontologies in research

But whilst these common frameworks support knowledge sharing, similar ontologies have emerged within the same domain, created at different times by different groups and for different use cases, and with differing levels of specificity, presenting us with the dilemma of which ontology to use. So why not use a single universal standard?

There are several reasons why converging to a single ontology is not feasible:

Moving away from existing standards requires effort: An organization may already be using multiple public ontologies internally or their own “home-grown” standards to describe the same domain.
Contending with external sources: It’s not possible to control how data from external organizations, applications or other sources have been annotated.
Different preferences: An organization may have adopted a preferred standard that differs from counterparts elsewhere within their industry.
Need to adapt: A single reference ontology often provides insufficient coverage for a particular application, giving rise to the development of application ontologies.
Compliance: There may be a requirement to use a specific ontology for a particular use case. For example, in drug development, it might be necessary to use the Systematized Nomenclature of Medicine (SNOMED) ontology within the clinical phase, whilst Clinical Data Interchange Standards Consortium (CDISC) or Medical Dictionary for Regulatory Activities (MedDRA) is required for regulatory submissions.

Building ontology mappings

Ontology developers have sought to reconcile similar terms across this evolving ontology landscape, either manually using domain expertise or with automated tools. This expands coverage across large domains, such as anatomy, disease, phenotype, and laboratory investigation.⁴ However, creating and maintaining intra-domain mappings is not without its challenges.

Like ontologies, derived mappings are dynamic, so they need to keep pace with any ontology changes.
Class names, terms, or labels can have ambiguous meanings, e.g., mole (animal) vs. mole (unit of amount of substance), requiring context to resolve.
Differences in hierarchy and definitions across databases can also complicate the mapping process. The following example (see Figure 2), taken from the Mondo Disease Ontology (MONDO), which itself merges multiple disease ontologies, shows the nuances involved when mapping Zellweger syndrome across ontologies.

Figure 2: Orphanet reflects the original understanding that Zellweger syndrome was different from neonatal adrenoleukodystrophy (NALD). Whilst the Disease Ontology (DO) simply describes Zellweger syndrome as a child term of Peroxisomal biogenesis disorder, the National Cancer Institute Thesaurus (NCIT) presents Zellweger syndrome and NALD as being within the same disease spectrum.

Using existing ontology mappings

In a recent data extraction (from November 2023) by SciBite, around 10 million public mappings with the qualifier “has_dbxref” were identified. Whilst this axiom was originally used to describe exact matches across databases, it is now used more ambiguously to mean broad, narrow, close, or related matches, without the metadata expressly specifying which.⁵ Adding to this complexity are CURIE prefixes: normally included within the unique identifier to identify the source ontology, these are sometimes missing.
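To make the prefix problem concrete, here is a minimal Python sketch that splits cross-reference identifiers into a prefix and a local ID and flags any that have lost their prefix. The `split_curie` helper and the sample identifiers are hypothetical, chosen only for illustration; they are not taken from the SciBite extraction.

```python
from typing import NamedTuple, Optional


class Curie(NamedTuple):
    prefix: Optional[str]  # source ontology, e.g. "MONDO"; None when missing
    local_id: str          # identifier within that ontology


def split_curie(identifier: str) -> Curie:
    """Split 'PREFIX:local_id' into its parts; prefix is None if absent."""
    prefix, sep, local_id = identifier.strip().partition(":")
    if not sep or not prefix:
        # No colon at all, or nothing before it: the source is unknown.
        return Curie(None, local_id if sep else prefix)
    return Curie(prefix, local_id)


# Illustrative cross-references; the last one has lost its prefix.
xrefs = ["MONDO:0000001", "DOID:4", "0019234"]
missing = [x for x in xrefs if split_curie(x).prefix is None]
```

An identifier flagged in `missing` cannot be attributed to a source ontology without further context, which is why such cross-references need manual or rule-based repair before they can be reused.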

This is not to say that more reliable sources of public mappings are unavailable. The SNOMED, NCIT, and MedDRA ontologies produce well-defined and refreshed mappings as part of their terminology releases. These are, however, represented differently and must be accessed from different places, which can make concatenation difficult.

Similarly, multiple knowledge bases and data models, such as the Unified Medical Language System (UMLS) and the Observational Medical Outcomes Partnership (OMOP), have been created to pull together and map different data standards. But each of these is yet another resource whose relationships are not normalized to those of the others. And so, the cycle continues.

Addressing the mess: The SSSOM initiative

The Simple Standard for Sharing Ontological Mappings (SSSOM) project⁶, launched in 2022, tackles ontology mappings from a slightly different angle. Rather than harmonizing terms by subject, this approach establishes a standardized model for representing mappings. This allows researchers to identify the types of associations between ontologies, how the mappings were made, and how certain they are.

At its core, terms mapped using the SSSOM approach are connected using triples (see Figure 3).

Figure 3: SSSOM mappings are represented using subject-predicate-object expressions

Relationships between terms are chosen from a fixed list, and users can also include other controlled metadata to describe the mapping. Figure 4 represents how some typical SSSOM mappings appear.

Figure 4: SSSOM representation: Each row in this table describes how a subject (with ID, label, and ontology) is connected to an object (with ID, label, and ontology) via a predicate ID. A description of the match type is also included, along with the tool used to perform the mapping, the confidence score, and the origin of the mapping.
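As a rough sketch of how such rows can be handled in code, the Python below parses a small tab-separated table and reduces each row to its subject-predicate-object triple. The column names follow SSSOM slot names as we understand them (check the current specification before relying on them), and the rows are invented: ONTO_A and ONTO_B are placeholder prefixes, not real ontologies.

```python
import csv
import io

# Invented rows; ONTO_A / ONTO_B are placeholder prefixes, not real ontologies.
SSSOM_TSV = (
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\t"
    "mapping_justification\tconfidence\n"
    "ONTO_A:001\tZellweger syndrome\tskos:exactMatch\tONTO_B:101\t"
    "Zellweger syndrome\tsemapv:ManualMappingCuration\t0.95\n"
    "ONTO_A:001\tZellweger syndrome\tskos:broadMatch\tONTO_B:100\t"
    "Peroxisomal biogenesis disorder\tsemapv:ManualMappingCuration\t0.70\n"
)


def read_mappings(tsv_text: str) -> list[dict]:
    """Parse SSSOM-style TSV text into one dict per mapping row."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{**row, "confidence": float(row["confidence"])} for row in rows]


def as_triple(mapping: dict) -> tuple[str, str, str]:
    """Reduce a mapping row to its subject-predicate-object core."""
    return (mapping["subject_id"], mapping["predicate_id"], mapping["object_id"])


mappings = read_mappings(SSSOM_TSV)
# Keep only high-confidence exact matches.
exact = [m for m in mappings
         if m["predicate_id"] == "skos:exactMatch" and m["confidence"] >= 0.9]
triples = [as_triple(m) for m in mappings]
```

Because the predicate and confidence travel with every row, a consumer can filter mappings to the level of precision their use case needs rather than treating all cross-references as equivalent.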

SciBite’s approach to tackling ontology mappings

SciBite has long recognized the challenges and importance of semantically robust mappings. As such, we’ve sought to normalize public mappings for our users. Where appropriate, we’ve added reciprocal mappings, as well as CURIE prefixes, to help identify where these mappings come from. More recently, SciBite has adopted the SSSOM data standard, developing a schema that has allowed us to create mappings with SSSOM-compliant metadata.
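To illustrate what adding reciprocal mappings can involve, the Python sketch below generates the reverse-direction triple for each mapping whose predicate has a known inverse. This is a generic illustration under our own assumptions, not a description of SciBite's implementation; the `add_reciprocals` helper and the ONTO_A/ONTO_B identifiers are hypothetical. The inverse table reflects the SKOS mapping properties, where exact, close, and related matches are symmetric and broad and narrow matches are inverses of each other.

```python
# Inverse of each SKOS mapping predicate: exact, close and related matches
# are symmetric; broad and narrow matches invert each other.
INVERSE_PREDICATE = {
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
}

Triple = tuple[str, str, str]


def add_reciprocals(triples: list[Triple]) -> list[Triple]:
    """Return the input triples plus any missing reverse-direction triples."""
    seen = set(triples)
    result = list(triples)
    for subject, predicate, obj in triples:
        inverse = INVERSE_PREDICATE.get(predicate)
        if inverse is None:
            continue  # unknown predicate: leave it one-directional
        reverse = (obj, inverse, subject)
        if reverse not in seen:
            seen.add(reverse)
            result.append(reverse)
    return result


# Placeholder identifiers, for illustration only.
forward = [("ONTO_A:001", "skos:broadMatch", "ONTO_B:100")]
both_ways = add_reciprocals(forward)
```

Note that an ambiguous cross-reference with no stated match type has no entry in the inverse table, so it stays one-directional until its relationship has been clarified.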

Next steps

Our goals here are to create broader customizable mapping datasets with increased governance (e.g., with versioning and an explanation of how and when an update took place) that support our customers’ use cases. In our next blog, we’ll examine automated approaches to ontology mapping and how our Workbench tool allows users to rapidly map thousands of terms from one ontology to another.

In the meantime, we’d love to hear your thoughts. What are your biggest challenges, and what approaches do you use when performing ontology mappings? To share your responses or explore how our SciBite Ontology team could support you, please get in touch.

References:

1. P. L. Whetzel, et al., NCBO Technology: powering semantically aware applications. J. Biomed. Semant., 2013, 15, (Suppl. 1), S8.
2. R. Hoehndorf, et al., The role of ontologies in biological and biomedical research: a functional perspective. Brief Bioinform., 2015, 16, 1069-1080. DOI: 10.1093/bib/bbv011 Accessed 16.10.2024
3. A. Jaffri and S. Sicular, What’s New in Artificial Intelligence from the 2023 Gartner Hype Cycle. Accessed 16.10.2024
4. I. Harrow, et al., Ontology mapping for semantically enabled applications. Drug Discovery Today, 2019, 24 (10), 2068-2075. DOI: 10.1016/j.drudis.2019.05.020 Accessed 16.10.2024
5. A. Laadhar, et al., Investigating One Million XRefs in Thirty Ontologies from the OBO World. Proceedings of the 11th International Conference on Biomedical Ontologies (ICBO) joint with the 10th Workshop on Ontologies and Data in Life Sciences (ODLS) and part of the Bolzano Summer of Knowledge (BoSK 2020), Virtual conference hosted in Bolzano, Italy, September 17, 2020. Volume 2807 of CEUR Workshop Proceedings, 1-12, CEUR-WS.org, 2020. DOI: ⟨lirmm-02945170⟩ Accessed 16.10.2024
6. N. Matentzoglu, et al., A Simple Standard for Sharing Ontological Mappings (SSSOM). Database-the journal of biological databases and curation, 2022. DOI: 10.1093/database/baac035 Accessed 16.10.2024

Andy Balfe

Product Manager, SciBite

Andy Balfe received his BSc and PhD in organic chemistry from the University of East Anglia. He coordinates the delivery of innovative projects across SciBite’s product suite.
