Ontologies and controlled vocabularies define domain knowledge and harmonize concepts across datasets, enabling semantic alignment and data standardization.
However, ontologies and vocabularies built for different purposes often overlap in domain. A study by Kamdar et al. found that, generally, well-established ontologies and controlled terminologies do not reuse terms from other ontologies [1]. There are several possible reasons for this: the required term may be hard to find in another ontology because its labels differ, the use case for the term may be slightly different, or the definitions of equivalent terms may diverge. Additionally, groups may simply prefer to build bespoke ontologies from scratch. To optimize the utility of ontologies for biomedical data integration, it is therefore necessary to map, or match, concepts between different resources [2].
Here we describe a specific ontology mapping project recently undertaken by SciBite and researchers from the University of Maryland School of Medicine for the Human Disease Ontology Knowledgebase [3]. We illustrate some of the challenges and intricacies of this seemingly straightforward process, together with some considerations on how to improve it. We found that both automated and manual approaches were necessary to complete the project, highlighting the importance of a combined approach to maximize the efficiency and accuracy of matching ontology terms.
Ontology mapping is a tricky business for a person, let alone a computer. The ambiguity and nuance of the English language mean that simple word or phrase matching will not suffice for the harder cases. Take, for instance, the word ‘trunk’. Even within the anatomical domain, without additional context, this could refer to the main part of a body, the woody stem of a tree or the prehensile appendage of an elephant.
Outside of anatomy, it could be a large travelling case, a swimming garment or – if you are in the United States – the boot of a car or a pair of underwear (Figure 1).
Rules-based and machine-learning approaches have been developed to address ontology mapping challenges [4], but subject matter expert validation remains essential to ensure the quality and accuracy of the suggested mappings [5].
Figure 1: The many interpretations of ‘Trunk’.
This was highlighted in a recent project that SciBite undertook with the Human Disease Ontology (DO) Knowledgebase group at the University of Maryland School of Medicine. The DO group approached SciBite requesting disease term mapping, based on OMIM IDs (Online Mendelian Inheritance in Man) [6], between the Disease Ontology and UniProt Disease (UPDISEASE) entries [7]. The purpose of this effort, for the DO project, was to enable timely disease term mapping between UniProt and the DO. Cross-mapping between resources is a time-consuming and curation-heavy effort, and the automated, expert-reviewed mappings provided by SciBite greatly accelerated the speed with which it could be accomplished.
The disease mapping project involved aligning the approximately 6,500 UniProt Disease classes to DO classes, using several of SciBite’s tools alongside expert curators. Throughout this process, SciBite and DO curators found that combining automated and manual approaches was key to a successful mapping project: automated mapping tools alone were unlikely to produce a complete and accurate mapping, so manual work was needed to refine the automated approach, verify the mappings created by the tools, and find the mappings the tools missed.
The primary tool used was SciBite’s mapping tool Workbench [8]. Workbench uses SciBite vocabularies (VOCabs) within TERMite (SciBite’s Named Entity Recognition engine) to align data or ontologies to another data standard or ontology. Workbench is designed to assist the user in the time-consuming and error-prone process of matching terms between sources.
In addition to Workbench, SciBite’s ontology management platform CENtree was incorporated into the mapping process to allow the curators to visualize the DO terms, their synonyms, definitions and position within the hierarchy to assist with validation of the suggested mappings (Figure 2).
Workbench is built to utilize SciBite VOCabs and TERMite to provide comprehensive synonym coverage of life sciences domains. The key VOCab for the UPDISEASE:DO mapping project was SciBite’s DOID VOCab. This VOCab is built from the DO ontology but optimized for named entity recognition (NER) by expert curators. This process includes augmenting with additional synonyms from literature review and rules-based synonym generation plus adding context and disambiguation where appropriate to increase the precision and recall in NER.
Workbench requires a TERMite-ready VOCab for each of the terminologies being mapped, which meant a rudimentary VOCab also had to be created for the UniProt Disease entries. Fortunately, a simple three-column file would suffice for the initial mapping procedure (Figure 3).
Figure 3: Building a UniProt Disease Entity VOCab. The ID, Name and Alternative Name fields from the UniProt export were used to build the UPDISEASE VOCab. The “Mnemonic” field wasn’t used for initial mapping as these short acronyms are likely to produce false positive mappings.
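The three-column file can be produced with a few lines of scripting. The sketch below assumes hypothetical UniProt export field names and record values (the real export and SciBite’s internal VOCab format may differ); as described above, the “Mnemonic” column is deliberately dropped for the initial mapping.

```python
import csv

# Hypothetical rows from a UniProt disease export. IDs and labels are
# illustrative only. The "Mnemonic" column is carried in the data but
# deliberately excluded from the VOCab, because short acronyms tend to
# produce false-positive mappings.
uniprot_rows = [
    {"ID": "DI-00344", "Name": "Dystonia 6, torsion",
     "Alternative Name": "Torsion dystonia 6", "Mnemonic": "DYT6"},
    {"ID": "DI-01034", "Name": "Corticosterone methyloxidase 1 deficiency",
     "Alternative Name": "", "Mnemonic": "CMO1D"},
]

def write_vocab(rows, path):
    """Write a minimal three-column VOCab file (ID, Name, Alternative Name)."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["ID", "Name", "Alternative Name"])
        for row in rows:
            writer.writerow([row["ID"], row["Name"], row["Alternative Name"]])

write_vocab(uniprot_rows, "updisease_vocab.tsv")
```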
The results of the initial mapping in Workbench were promising (Figure 4). Approximately 90% of UniProt disease terms were mapped to a DO term, either as an EXACT or a BROADER match. An exact match is one where at least one label or synonym matches exactly between the source and target terms, whereas a broader match has a similarity between labels or synonyms above a user-defined cut-off score.
Figure 4: A sample of mappings between UPDISEASE and DO from Workbench. Workbench scores mappings based on the lexical similarity between labels and synonyms and measures mapping ambiguity. This score is normalized to a percentage. Generally, mappings over 80% (0.80) are medium to high confidence.
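Workbench’s actual scoring algorithm is not described here, but the idea of classifying candidate mappings by lexical similarity with a cut-off can be illustrated with a minimal sketch. Everything below is an assumption-laden toy, using Python’s standard `difflib` ratio in place of Workbench’s real similarity and ambiguity measures, with the 0.80 threshold mentioned above.

```python
from difflib import SequenceMatcher

def classify_mapping(source_synonyms, target_synonyms, cutoff=0.80):
    """Toy mapping classifier: EXACT if any label/synonym matches exactly
    (case-insensitive), BROADER if the best lexical similarity clears the
    cut-off, otherwise NONE. Workbench's real scoring also measures
    ambiguity; this sketch captures only the lexical part."""
    src = {s.lower() for s in source_synonyms}
    tgt = {t.lower() for t in target_synonyms}
    if src & tgt:
        return "EXACT", 1.0
    best = max(SequenceMatcher(None, s, t).ratio()
               for s in src for t in tgt)
    if best >= cutoff:
        return "BROADER", round(best, 2)
    return "NONE", round(best, 2)
```

A label that is wholly contained in a longer target label (e.g. “muscular dystrophy” inside “muscular dystrophy type 1”) scores above the cut-off and comes back as BROADER, while a pure word-order rearrangement scores surprisingly low on character similarity, which is one reason such pairs are missed by purely lexical matching.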
Although a large proportion of the UPDISEASE classes had been successfully mapped to DO, there were still 50% that were not exact and required curator review. Many of the broader matches could be verified by a quick eyeball of the corresponding terms, as in Figure 5a. The UPDISEASE term “Dystonia 6, torsion” is clearly the same as the DO term “torsion dystonia 6”, so these are quick for the curator to review. Other reviews may take longer, as in the example in Figure 5b. The UPDISEASE term “Corticosterone methyloxidase 1 deficiency” was reported as having NO mapping, but a curator checking the ontology would find that an EXACT mapping to DO actually does exist, differing only in the order of the words within the disease name.
Figure 5: 5a Some mappings are reported as broader, but a quick look at the term labels shows they are clearly exact. 5b Exact mappings may exist, as shown by the DOID term found by the curator, but Workbench found no mapping for the UPDISEASE term.
Curator review of the mapping output can also identify possible improvements to the mapping process. One of the major reasons for non-exact matches in this mapping project was the different word order between the labels of UPDISEASE terms and the DO, or the addition of ‘type’ in the DO label, resulting in no mapping being found (Figure 6a and 6b).
Figure 6: 6a Different word order between resources means that the corresponding term cannot be automatically identified. 6b Testing these UniProt labels over the DOID VOCab in TERMite shows that the full names are not identified, only partial matches to broader DO classes.
This could be resolved by augmenting the UPDISEASE VOCab using a semi-automated approach. This involved using regular expressions to create additional synonyms for the UPDISEASE terms that would match the word order in DO (Figure 7).
Figure 7: Creating synonym variations for the UniProt Disease entities allows the NER to align the synonyms to DO.
By taking this approach of creating synonyms with matching word orders, it was possible to map an additional 10-15% of UPDISEASE terms to DO, thus further reducing the manual effort required by the curators.
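The word-order rewrites described above can be sketched with a couple of regular expressions. The patterns below are illustrative, not SciBite’s actual rules: they rewrite a hypothetical “Disease N, qualifier” label into “qualifier disease N” and also add a “type” variant, matching the DO label styles discussed above.

```python
import re

# Illustrative patterns for generating word-order synonym variants.
COMMA_QUALIFIER = re.compile(r"^(?P<disease>.+?) (?P<num>\d+), (?P<qual>.+)$")
TRAILING_NUMBER = re.compile(r"^(?P<disease>.+?) (?P<num>\d+)$")

def generate_synonyms(label):
    """Return word-order/'type' variants of a UPDISEASE-style label."""
    variants = set()
    m = COMMA_QUALIFIER.match(label)
    if m:
        # "Dystonia 6, torsion" -> "torsion dystonia 6" (+ "type" variant)
        variants.add(f"{m['qual']} {m['disease']} {m['num']}".lower())
        variants.add(f"{m['qual']} {m['disease']} type {m['num']}".lower())
    m = TRAILING_NUMBER.match(label)
    if m:
        # "Deafness 3" -> "deafness type 3"
        variants.add(f"{m['disease']} type {m['num']}".lower())
    variants.discard(label.lower())
    return sorted(variants)

print(generate_synonyms("Dystonia 6, torsion"))
# ['torsion dystonia 6', 'torsion dystonia type 6']
```

The generated variants are then added to the UPDISEASE VOCab as synonyms so that the NER can align them to the DO labels directly.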
Unfortunately, there were too many variations of word order and word addition to capture with regular expressions. Figure 8 shows the initial broader mapping of the UPDISEASE term “Muscular dystrophy congenital LMNA-related” to the DO term “muscular dystrophy”. After manual review, the exact mapping was found to be to the DO term “congenital muscular dystrophy due to LMNA mutation”. It would be extremely time-consuming to construct regular expressions to cover all the different types of synonym variation for little gain (generally only a handful of terms follow each particular pattern), so manual review of these was essential.
Figure 8: The initial mapping of the UPDISEASE term was too broad, but the different word order plus additional words in the DO term meant that automated mapping was not possible. After manual review, the curator identified the corresponding term to map to.
At this point most of the low-hanging fruit had been mapped, but you may remember that the UniProt Disease entities also had mnemonics, or acronyms, associated with them. We decided to add the mnemonics to the UPDISEASE VOCab and run the mapping process again using only the remaining unmapped UPDISEASE terms. Adding these mnemonics during the initial mapping would likely have produced many false positives due to the short length of the synonyms, leaving the curators with more mappings to verify manually and increasing the time and effort required.
Adding the mnemonic synonyms to the UniProt Disease Entities resulted in an additional ~300 terms being mapped.
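The logic of this second pass can be sketched as follows. The data shapes and all identifiers are hypothetical: the point is that mnemonics are matched only for terms left unmapped by the first pass, and only as exact (case-insensitive) matches, to keep false positives from short acronyms in check.

```python
def mnemonic_pass(unmapped, do_synonyms):
    """Second mapping pass over still-unmapped UPDISEASE terms.
    'unmapped' maps UPDISEASE ID -> mnemonic; 'do_synonyms' maps a
    lowercased DO label/synonym -> DOID (both hypothetical shapes).
    Only exact, case-insensitive mnemonic matches are accepted."""
    mappings = {}
    for up_id, mnemonic in unmapped.items():
        doid = do_synonyms.get(mnemonic.lower())
        if doid:
            mappings[up_id] = doid
    return mappings

# Illustrative IDs only, not real cross-references.
unmapped = {"DI-04327": "CISS2", "DI-09999": "XYZ1"}
do_synonyms = {"ciss2": "DOID:0060000",
               "cold-induced sweating syndrome 2": "DOID:0060000"}
print(mnemonic_pass(unmapped, do_synonyms))  # {'DI-04327': 'DOID:0060000'}
```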
As with all ontology mapping exercises, several hundred or (if you’re unlucky) thousands of terms cannot be automatically mapped, and the only option, until automated approaches improve, is for a curator to search for a matching term within the target source. This is where the curator’s expertise and fastidiousness come into their own (with a little help from an ontology viewer such as CENtree).
Take the example shown in Figure 9. The UniProt Disease term “Crisponi/Cold-induced sweating syndrome 2” has the mnemonic synonym “CISS2”. This term has no exact match in DO, but the curator found the similar term “cold-induced sweating syndrome 2”, which, however, has no synonyms to provide additional evidence of its meaning (Figure 9a). The question is: are these equivalent?
By looking at DO in an ontology browser such as CENtree, the curator could examine the terms surrounding this potential target term. The parent term of the DO term is “cold-induced sweating syndrome”, and among its synonyms is “Crisponi syndrome” (Figure 9b), so the curator can infer from this that the UPDISEASE term is in fact equivalent to the DO term “cold-induced sweating syndrome 2”.
Figure 9: 9a The UPDISEASE and DO terms have no shared labels or synonyms. 9b Only by manually reviewing the terms surrounding the DO term in the ontology, in this case the synonyms of the parent term, can we be certain that these are the same concept.
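The hierarchy check the curator performs by eye in CENtree can be sketched programmatically: gather a term’s own label and synonyms together with those of its parents, so that context such as the parent’s “Crisponi syndrome” synonym can support an equivalence decision. The data shapes and DOIDs below are hypothetical.

```python
# Tiny in-memory ontology fragment; IDs and structure are illustrative.
ontology = {
    "DOID:0060765": {"label": "cold-induced sweating syndrome 2",
                     "synonyms": [], "parents": ["DOID:0060764"]},
    "DOID:0060764": {"label": "cold-induced sweating syndrome",
                     "synonyms": ["Crisponi syndrome"], "parents": []},
}

def context_synonyms(term_id, onto):
    """Collect a term's label/synonyms plus its parents' labels/synonyms."""
    term = onto[term_id]
    context = {term["label"], *term["synonyms"]}
    for parent_id in term["parents"]:
        parent = onto[parent_id]
        context.update({parent["label"], *parent["synonyms"]})
    return context

ctx = context_synonyms("DOID:0060765", ontology)
# Evidence from the parent's "Crisponi syndrome" synonym supports mapping
# "Crisponi/Cold-induced sweating syndrome 2" to this term.
print(any("crisponi" in s.lower() for s in ctx))  # True
```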
A final example of how manual review was necessary for this mapping exercise is shown in Figure 10. The UPDISEASE entity “Agammaglobulinemia 9, autosomal recessive” expresses not only the disease but its autosomal recessive inheritance. In some cases, the Disease Ontology also expresses the inheritance within the term label, such as “autosomal recessive hypercholesterolemia”. However, due to nomenclature variations, this is not consistent and there are terms where the inheritance is expressed only within the ontology hierarchy. Only by viewing the ontology is the curator able to combine this information and conclude that the two terms are equivalent.
The project successfully identified the majority of disease term mappings between UniProt and the DO, providing an ML-ready dataset of disease-to-disease mappings. The iterative mapping approach facilitated new solutions to tricky term mappings. For the DO project curators, this approach provided a targeted set of mappings to review, reducing the time burden of identifying related data across resources.
Ontology mapping continues to be challenging due to the nuances of language and context-dependent interpretation of certain words and phrases. While some of the burden can be taken by automated approaches for the more straightforward connections, much of the work relies on subject matter experts and curators to verify the matches or to search for the appropriate mappings.
Automated approaches to mapping ontologies are improving all the time but, for now, if you want to distinguish between an elephant’s proboscis and a pair of underpants don’t rely on an algorithm!
Rachael Huntley is Lead Scientific Curator at SciBite with over 20 years of biocuration experience. Dr. Huntley received her PhD in plant biochemistry from the University of Cambridge and completed post-doctoral research in both Cambridge, UK and Stanford, USA.
During her time at EMBL-EBI and University College London she contributed to functional annotation of human proteins and microRNAs involved in human health and disease. Throughout her biocuration career, she has worked closely with the Gene Ontology Consortium and major pharmaceutical companies and has contributed to the development of ontologies, biocuration standards and curation tools.