In my previous blog, I described what ontologies are and how you can use them to make the best use of scientific data within your organization. Here I'll expand on that and focus on mapping, building, and managing ontologies.
The first major biomedical ontology was the Gene Ontology (GO), which arose from a collective of groups managing different model organism databases. They realised that what they were describing in their respective databases – the roles, processes and locations of proteins – were the same thing, so it made sense to create a collective standard. Independently, clinical standards developed for the purposes of medical billing (SNOMED) and monitoring global disease (ICD-9 and ICD-10). A precursor to all of these is MeSH, which was originally developed to tag life-science papers.
With the growing recognition of their importance, the number of public ontologies has grown dramatically. For example, BioPortal contains 169 ontologies relating to health.
It’s best to avoid using multiple ontologies for the same domain, but with so many ontologies, how do you choose the right one? The first question to ask is whether it is suited to your data – do the entities in your data match up well with what is in the ontology?
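One practical way to answer that question is to measure coverage directly: take a sample of the entity labels in your data and count how many can be found among the names and synonyms of a candidate ontology. The sketch below is only an illustration of the idea – it assumes a local OBO file and uses the open-source pronto library, and the file name and label list are placeholders.

```python
# Rough coverage check: what fraction of our entity labels appear in a
# candidate ontology's term names or synonyms? (Illustrative sketch only.)
import pronto

def label_index(obo_path):
    """Build a set of lower-cased names and synonyms from an ontology file."""
    ontology = pronto.Ontology(obo_path)
    labels = set()
    for term in ontology.terms():
        if term.name:
            labels.add(term.name.lower())
        for synonym in term.synonyms:
            labels.add(synonym.description.lower())
    return labels

def coverage(my_labels, obo_path):
    known = label_index(obo_path)
    hits = sum(1 for label in my_labels if label.lower() in known)
    return hits / len(my_labels) if my_labels else 0.0

# Hypothetical usage: a candidate ontology file and a sample of internal labels.
print(coverage(["abetalipoproteinemia", "type 2 diabetes"], "doid.obo"))
```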
It’s helpful to understand whether the ontology is being actively maintained for the long term or is likely to become deprecated. How often does the ontology get updated, and how large are those changes? How will you handle those updates?
I’ve mentioned the importance of data interoperability, but the same applies to ontologies. For example, is it interoperable with other ontologies? Does it overlap with other ontologies, and if so, are those overlaps handled cleanly? One of the advantages of using OBO Foundry ontologies is that they strive to be a non-redundant collection that is built to common standards so that they can be used together. Or you can use what’s known as an application ontology like the EBI’s Experimental Factor Ontology (EFO), where parts of other ontologies are imported and glued together with EFO terms, so the interoperability is handled for you.
Another consideration is licensing, especially if the ontology is something that’s going to be built into a product and redistributed. For example, most clinical standards are licensed. It’s also worth checking what is already being used within your organization. There’s no point in reinventing the wheel – if your organization is already using MedDRA elsewhere, it might make logistical sense to go with that. Or if you’re currently using a home-grown standard, how difficult would it be to replace that with a public ontology?
Last but certainly not least, if you are generating some sort of user-facing tool, what do the users need? What will actually help them get their job done? ICD-10 might not be the easiest to work with, but if that’s what your scientists want, then you need to take that on board.
In an ideal world, we would have only one standard per domain, which is the aim of the OBO Foundry ontology set. However, in the real world, this is not always the case, especially in the clinical domain, where there are many competing standards and ontologies, and you may need to pivot between them. Using different ontologies within the same data domain ultimately hampers the data’s interoperability and application.
You may also need to map your own internal standards out to public standards. There are some great tools and resources out there to help with this, such as the EBI’s Ontology Xref Service (OxO) resource and SciBite’s own tools. But ultimately, ontology mapping is hard, and the effort involved can often be minimised or avoided by choosing common standards and ontologies to use across an organization.
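As a rough illustration of the kind of lookup OxO supports, the sketch below posts a batch of identifiers to what I understand to be the OxO search endpoint and prints any mapped identifiers it returns. The endpoint path, payload fields, and response structure here are assumptions rather than a reference client – check the current OxO API documentation before relying on them.

```python
# Sketch of a batch lookup against the EBI OxO mapping service.
# NOTE: the endpoint path, payload fields and response layout are assumptions --
# verify against the current OxO API documentation before use.
import requests

OXO_SEARCH = "https://www.ebi.ac.uk/spot/oxo/api/search"  # assumed endpoint

def map_ids(curies, targets=("MeSH",), distance=1):
    payload = {"ids": list(curies), "mappingTarget": list(targets), "distance": distance}
    response = requests.post(OXO_SEARCH, json=payload, timeout=30)
    response.raise_for_status()
    results = response.json().get("_embedded", {}).get("searchResults", [])
    return {
        r["queryId"]: [m["curie"] for m in r.get("mappingResponseList", [])]
        for r in results
    }

# Hypothetical usage: map a Disease Ontology identifier to MeSH.
print(map_ids(["DOID:1386"]))
```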
I’ve been involved with the Ontologies Mapping project coordinated by the Pistoia Alliance and funded by a consortium of pharma companies and SMEs, which is attempting to provide reference mappings for major biomedical ontologies. It has been run in collaboration with EMBL-EBI, and the mappings will be made available through the EBI’s OxO tool, as will Paxo, the mapping algorithm used to generate them.
Recently, this project delivered mappings of disease and phenotype ontologies [1], and the current phase aims to extend the mappings to biological and chemical ontologies to support laboratory analytics.
So once you have chosen and started to implement your ontologies, you then need to consider how those will be managed. Your ontologies and vocabularies will probably have come from different places, such as public ontologies, licensed clinical standards, and home-grown internal vocabularies.
Regardless of their source, there are some important considerations for any ontology management system. Firstly, strict governance is critical within a large organization – it’s important to know who has done what to a particular ontology and when, such as via an audit trail. But that shouldn’t be too onerous; otherwise, you won’t have the agility needed to enable ontologies to grow and improve.
Ontologies need to be version controlled, so you know exactly what version you’re using, even if it’s not an official “version”. Related to this, the ability to roll back to an earlier version is also important. You also need to be able to control access for individuals and applications, but that obviously needs to be flexible because you may not know who or what those are from the outset. It’s particularly helpful if scientific groups and other subject matter experts can contribute by suggesting changes and have a way to incorporate them without the risk of breaking things!
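A lightweight way to see what an update actually changes, before deciding whether to roll forward or back, is to diff the term sets of two releases. This sketch again uses the pronto library with placeholder file names, and simply reports terms that were added, removed, or renamed between two versions of an OBO file.

```python
# Compare two releases of an ontology: which term IDs were added or removed,
# and which kept their ID but changed their name? (Illustrative sketch only.)
import pronto

def term_names(obo_path):
    return {term.id: term.name for term in pronto.Ontology(obo_path).terms()}

def diff_releases(old_path, new_path):
    old, new = term_names(old_path), term_names(new_path)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    renamed = sorted(t for t in set(old) & set(new) if old[t] != new[t])
    return added, removed, renamed

# Placeholder file names for two versions held under version control.
added, removed, renamed = diff_releases("disease_v1.obo", "disease_v2.obo")
print(f"{len(added)} added, {len(removed)} removed, {len(renamed)} renamed")
```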
Finally, it’s important to embrace public ontologies, to be able to handle updates and to be able to reconcile these with internal additions.
As I described in my previous blog, public ontologies serve a vital role in encapsulating scientific knowledge in a given scientific domain. They also provide an important foundation for SciBite’s VOCabs. We align VOCabs with public standards to maintain interoperability and ensure adherence to FAIR principles, but our expert team also enriches them to ensure they have comprehensive synonym coverage so they can be used for text analytics.
For example, the MeSH entry for Abetalipoproteinemia includes 7 synonyms [2], whereas SciBite’s hand-curated Disease ontology contains hundreds, including US/UK spelling variants (such as the substitution of ‘z’ and ‘s’) and variations in the use of hyphens and spacings. MeSH doesn’t simply ‘just work’ for text analytics without manual curation.
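To give a sense of the kind of surface variation involved, the sketch below (my own simplified illustration, not SciBite’s curation process) expands a single synonym into common spelling and punctuation variants, such as s/z substitutions and hyphen-versus-space differences.

```python
# Generate simple surface variants of a synonym for text matching:
# US/UK spelling swaps plus hyphen/space alternatives. (Simplified sketch.)
import itertools

def spelling_variants(term):
    variants = {term}
    variants.update({term.replace("ise", "ize"), term.replace("ize", "ise")})
    if "aemia" in term:
        variants.add(term.replace("aemia", "emia"))
    elif "emia" in term:
        variants.add(term.replace("emia", "aemia"))
    return variants

def punctuation_variants(term):
    # swap hyphens and spaces in every combination
    tokens = term.replace("-", " ").split()
    joined = set()
    for seps in itertools.product([" ", "-"], repeat=max(len(tokens) - 1, 0)):
        joined.add("".join(t + s for t, s in zip(tokens, list(seps) + [""])))
    return joined or {term}

def expand(term):
    out = set()
    for variant in spelling_variants(term.lower()):
        out.update(punctuation_variants(variant))
    return sorted(out)

print(expand("Abetalipoproteinemia"))
print(expand("beta-thalassaemia"))
```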
We maintain VOCabs using a combination of manual and machine curation to ensure quality and accuracy. For example, the term “COX2” is officially a name for the mitochondrial gene Cytochrome c Oxidase. However, when people mention COX2, they usually mean Cyclooxygenase 2 – an entirely different gene and major drug target. Without deep curation, text analytics tools will get this wrong, with serious implications.
VOCabs incorporate linguistic rules and contextual ‘boost words’ to add context to address the inherent ambiguity of biomedical terms. For example, boost words help text analytics applications determine that, when mentioned in the context of metabolism, the term “GSK” refers to Glycogen Synthase Kinase rather than GlaxoSmithKline. Removing the ambiguity associated with such terms enables tools that use VOCabs, such as our ultra-fast named entity recognition (NER) and extraction engine, TERMite (TERM identification, tagging & extraction), to recognize and extract relevant terms found in scientific text with greater recall, accuracy, and precision.
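As a simplified illustration of the idea (not how TERMite itself is implemented), the sketch below scores an ambiguous abbreviation against hypothetical boost-word lists found in its surrounding context and picks the better-supported sense.

```python
# Toy contextual disambiguation with "boost words": pick the sense of an
# ambiguous term whose boost words appear most often near it.
# (Simplified illustration only -- not SciBite's actual implementation.)
import re

# Hypothetical boost-word lists for the two senses of "GSK".
SENSES = {
    "GSK": {
        "Glycogen Synthase Kinase": {"kinase", "phosphorylation", "metabolism", "insulin"},
        "GlaxoSmithKline": {"pharmaceutical", "company", "pipeline", "acquisition"},
    }
}

def disambiguate(term, sentence, window=8):
    tokens = re.findall(r"\w+", sentence.lower())
    if term.lower() not in tokens:
        return None
    i = tokens.index(term.lower())
    context = set(tokens[max(i - window, 0): i + window + 1])
    scores = {sense: len(context & boosts) for sense, boosts in SENSES[term].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("GSK", "GSK-3 phosphorylation regulates glucose metabolism via insulin signalling"))
```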
Jane leads the development of SciBite’s vocabularies and ontology services. She has a Ph.D. in Genetics from Cambridge University and 15 years of experience working with biomedical ontologies, including at the EBI and the Sanger Institute, where she focused on bioinformatics and developing biomedical ontologies. She has published over 35 scientific papers, mainly in ontology development.
Other articles by Jane: