Any organization that deals with data today needs to adopt digital technologies to drive innovation and improve efficiency. This is especially true of data-rich organizations, such as those in the life sciences. The more data an organization has, the more challenges it faces in managing that data in a way that allows its reuse.
Fortunately, public ontologies have been developed that encode the semantics of knowledge in a machine-readable manner, dramatically reducing this burden of analysis.1,2 Rapid alignment of text to formally defined things, often referred to as data FAIR-ification (because it makes data Findable, Accessible, Interoperable, and Reusable), allows researchers to analyze vast repositories of historic and prospective data, confident that results are contextually relevant to their use cases. Scientists are increasingly turning to Artificial Intelligence (AI) to derive insights from their data and to improve productivity, but they also recognize the importance of data labeling and annotation in creating better machine learning models.3
Ontologies, therefore, are a key component in managing data. They articulate knowledge about a domain (see Figure 1) through the use of formally agreed names and definitions of categories, properties, and relationships between concepts, data, or entities.
Figure 1: A short video explaining the role of ontologies in research
But whilst these common frameworks support knowledge sharing, similar ontologies have emerged within the same domain, created at different times by different groups and for different use cases, and with differing levels of specificity, presenting us with the dilemma of which ontology to use. So why not use a single universal standard?
There are several reasons why converging to a single ontology is not feasible:
Ontology developers have sought to reconcile similar terms across this evolving ontology landscape, either manually using domain expertise or with automated tools. This expands coverage across large domains, such as anatomy, disease, phenotype, and laboratory investigation.4 However, creating and maintaining intra-domain mappings is not without its challenges.
Figure 2: Orphanet reflects the original understanding that Zellweger syndrome was different from neonatal adrenoleukodystrophy (NALD). Whilst the Disease Ontology (DO) simply describes Zellweger syndrome as a child term of Peroxisomal biogenesis disorder, the National Cancer Institute Thesaurus (NCIT) presents Zellweger syndrome and NALD as being within the same disease spectrum.
In a recent data extraction (from November 2023) by SciBite, around 10 million public mappings with the qualifier “has_dbxref” were identified. Whilst this axiom was originally used to describe exact matches across databases, it is now used more ambiguously to mean broad, narrow, close, or related matches without expressly specifying which within the metadata!5 Adding to this complexity are CURIE prefixes: normally included within the unique identifier to identify the source ontology, these are sometimes missing.
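To make the missing-prefix problem concrete, here is a minimal Python sketch that flags cross-references lacking a CURIE prefix. The helper name `split_curie`, the simplified pattern, and the identifiers (`EXA:0001` and so on) are illustrative assumptions, not real ontology terms or part of any SciBite tool.

```python
import re

# Simplified CURIE shape: a prefix starting with a letter, a colon, then a local ID.
# This is an illustrative pattern, not the full W3C CURIE grammar.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][\w.\-]*):(?P<local>\S+)$")


def split_curie(identifier):
    """Return (prefix, local_id); prefix is None when no CURIE prefix is present."""
    cleaned = identifier.strip()
    match = CURIE_PATTERN.match(cleaned)
    if match is None:
        return None, cleaned
    return match.group("prefix"), match.group("local")


# Placeholder cross-references; the second one has lost its prefix.
xrefs = ["EXA:0001", "0002", "EXB:1234"]

# Collect the identifiers whose source ontology cannot be determined.
missing_prefix = [x for x in xrefs if split_curie(x)[0] is None]
print(missing_prefix)
```

A check like this only detects the gap. Deciding which ontology an unprefixed identifier belongs to still needs extra context, such as the file it came from.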
This is not to say that more reliable sources of public mappings are not currently available. The SNOMED, NCIT, and MedDRA ontologies produce well-defined and refreshed mappings as part of their terminology releases. These are, however, represented differently and must be accessed from different places, which can make concatenation difficult.
Similarly, multiple knowledge bases and data models, such as the Unified Medical Language System (UMLS) and the Observational Medical Outcomes Partnership (OMOP), have also been created to pull together and map different data standards. But each of these produces yet another resource whose relationships are not normalized to those of the others. And so, the cycle continues.
The Simple Standard for Sharing Ontological Mappings (SSSOM) project6, launched in 2022, tackles ontology mappings from a slightly different angle. Rather than harmonizing terms by subject, this approach establishes a standardized model for representing mappings. This allows researchers to identify the types of associations between ontologies, how the mappings were made, and how certain they are.
At its core, terms mapped using the SSSOM approach are connected using triples (see Figure 3).
Figure 3: SSSOM mappings are represented using subject-predicate-object expressions
Relationships between terms are chosen from a fixed list, and users can also include other controlled metadata to describe the mapping. Figure 4 represents how some typical SSSOM mappings appear.
Figure 4: SSSOM representation: Each row in this table includes a description of how a subject (with ID, label, and ontology) is connected to an object (with ID, label, and ontology) via a predicate ID. A description of the match type is also included, along with the tool used to perform the mapping, a confidence score, and the origin of the mapping.
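As a rough illustration of such a table, the Python sketch below models a mapping row and writes it out as tab-separated text. The column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`, `confidence`, `mapping_tool`) follow slots we understand the SSSOM specification to define, and the `skos:` and `semapv:` values are commonly used ones. The identifiers, labels, scores, and tool name are invented placeholders, and this is a sketch rather than a complete or validated SSSOM file (for example, it omits the metadata header).

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Mapping:
    """One mapping row: a subject-predicate-object triple plus descriptive metadata."""
    subject_id: str
    subject_label: str
    predicate_id: str
    object_id: str
    object_label: str
    mapping_justification: str
    confidence: float
    mapping_tool: str = ""  # left empty for manually curated mappings


# Placeholder rows; none of these identifiers refer to real ontology terms.
mappings = [
    Mapping("EXA:0001", "example disease", "skos:exactMatch", "EXB:1234",
            "example disease", "semapv:LexicalMatching", 0.95, "hypothetical-matcher"),
    Mapping("EXA:0002", "example syndrome", "skos:broadMatch", "EXB:5678",
            "example disorder", "semapv:ManualMappingCuration", 0.70),
]


def to_tsv(rows):
    """Serialize mappings as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Mapping)],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def confident_exact_matches(rows, threshold=0.9):
    """Keep only exact matches at or above the confidence threshold."""
    return [m for m in rows
            if m.predicate_id == "skos:exactMatch" and m.confidence >= threshold]


print(to_tsv(mappings))
```

Because the predicate and confidence are explicit on every row, a consumer can decide for themselves which mappings are strong enough for their use case, as `confident_exact_matches` does here.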
SciBite has long recognized the challenges and importance of semantically robust mappings. As such, we’ve sought to normalize public mappings for our users. Where appropriate, we’ve added reciprocal mappings, as well as CURIE-prefixes, to help identify where these mappings come from. More recently, SciBite has adopted the SSSOM data standard, developing a schema that has allowed us to create mappings with SSSOM-compliant metadata.
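To illustrate what adding reciprocal mappings can involve, here is a minimal Python sketch, which should not be read as SciBite's actual implementation. It assumes mappings are held as (subject, predicate, object) tuples using SKOS mapping predicates, where `skos:broadMatch` and `skos:narrowMatch` are inverses of one another and the exact, close, and related matches are symmetric. The function name `add_reciprocals` and the identifiers are hypothetical.

```python
# Inverse of each SKOS mapping predicate (symmetric predicates map to themselves).
INVERSE_PREDICATE = {
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
}


def add_reciprocals(triples):
    """Return the triples plus any reverse-direction mappings that are missing."""
    seen = set(triples)
    result = list(triples)
    for subject, predicate, obj in triples:
        inverse = INVERSE_PREDICATE.get(predicate)
        if inverse is None:
            continue  # unknown predicate: do not guess a reverse relationship
        reciprocal = (obj, inverse, subject)
        if reciprocal not in seen:
            seen.add(reciprocal)
            result.append(reciprocal)
    return result


# Placeholder triples; the identifiers are not real ontology terms.
triples = [
    ("EXA:0001", "skos:exactMatch", "EXB:1234"),
    ("EXA:0002", "skos:broadMatch", "EXB:5678"),
]

for triple in add_reciprocals(triples):
    print(triple)
```

Skipping unknown predicates is a deliberate choice in this sketch: an ambiguous cross-reference gives no reliable basis for inferring the reverse relationship.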
Our goals here are to create broader, customizable mapping datasets with increased governance (e.g., with versioning and an explanation of how and when an update took place) that support our customers’ use cases. In our next blog, we’ll examine automated approaches to ontology mapping and how our Workbench tool allows users to rapidly map thousands of terms from one ontology to another.
In the meantime, we’d love to hear your thoughts. What are your biggest challenges, and what approaches do you use when performing ontology mappings? To share your responses or explore how our SciBite Ontology team could support you, please get in touch.
Andy Balfe received his BSc and PhD in organic chemistry from the University of East Anglia. He coordinates the delivery of innovative projects across SciBite’s product suite.