Are ontologies still relevant in the age of LLMs?
 


Technological advancements exhibit varying degrees of longevity. Some are tried and trusted, enduring for years, while others succumb to fleeting hype without ever delivering substantive results. One constant in this dynamic landscape is the data.


To make use of the latest and greatest technology, you must have your house, or more specifically your data, in order. Through high-quality foundational data management, in which ontologies play a crucial role, an organization can be agile enough to adapt to, and make use of, state-of-the-art technologies such as large language models (LLMs).

What is an LLM?

An LLM is a sophisticated, generative artificial intelligence (AI) model designed to understand and generate human-like text. Trained on monumental amounts of data, LLMs are designed to generate coherent and contextually relevant responses. LLMs are therefore well suited to language-based tasks that draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.

What is an ontology?

An ontology is a formal representation of a knowledge domain, enabling the structured encoding of information using principles akin to those found in traditional symbolic logic. Ontologies capture human knowledge in a computationally friendly format, allowing analysis to be carried out with the up-to-date knowledge of the subject matter expert (SME) in a scalable fashion.

It’s also important to note that the term ontology is used to describe standards that vary in expressivity and formality, ranging from glossaries, thesauri, and taxonomies through metadata and data models all the way to rich semantic ontologies.

Foundational data management

A key use case of ontologies, or more specifically ontology-derived standards, is the tagging and management of data. Whether that data is structured, unstructured, internal, or external, aligning it to standards makes it more Findable, Accessible, Interoperable, and Reusable (FAIR). As a result, information retrieval tasks are enhanced, and by representing previously unstructured data in a structured, semantic fashion, inference and the extraction of insights can be expedited.
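To make this concrete, here is a minimal sketch of dictionary-based tagging: a small ontology-derived vocabulary maps labels and synonyms to identifiers, and free text is annotated against it. The vocabulary, synonyms, and ontology IDs are illustrative, not an extract from any real resource.

```python
# A minimal dictionary-based tagger: ontology labels/synonyms -> IDs.
# The IDs and synonyms here are illustrative, not a real vocabulary extract.
VOCAB = {
    "myocardial infarction": "MONDO:0005068",
    "heart attack": "MONDO:0005068",   # synonym, resolves to the same concept
    "aspirin": "CHEBI:15365",
}

def tag(text: str) -> list:
    """Return (surface form, ontology ID) pairs found in the text."""
    lowered = text.lower()
    return [(term, cid) for term, cid in VOCAB.items() if term in lowered]

hits = tag("Patient prescribed aspirin after a heart attack.")
```

Because both "heart attack" and "myocardial infarction" resolve to one identifier, documents using either wording become findable with a single query, which is the essence of FAIR tagging.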

Can I not just use an LLM for that?

But hang on… surely I can just use the power of an LLM to manage, retrieve, and analyze my data, I hear (some of) you cry? Well, no, not completely. And certainly not in scenarios where evidence-based decision-making is crucial, such as the Life Sciences and other domains where decisions must be made in an explainable manner with provenance, and where wrong decisions can have dire consequences.

Let’s take a closer look at some of the specific tasks for which we believe the creation and application of ontologies are still vital…

1. Ontology generation

LLMs can help uncover ‘knowledge’, yet ontologies are needed to capture that for future use. While various technologies can support the semi-automated curation of ontologies, human validation is crucial for candidate classes, terms, and relationships.

Transforming LLM output into an ontology through a lightweight model enables versatility and reuse in downstream applications. Even if AI could create a flawless ontology, the value of a standard lies in the human consensus behind it. An AI may generate a well-structured ontology, yet still struggle with nuanced distinctions and diverge from the standards others use. The human in the loop is vital.
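The workflow above can be sketched with a lightweight, SKOS-like concept model: LLM-proposed concepts enter unvalidated, and only those a human curator approves make it into the standard. The class, concept IDs, and labels are hypothetical, purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    """A lightweight, SKOS-like concept: an ID, a preferred label,
    alternative labels (synonyms), and a broader (parent) concept."""
    concept_id: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: Optional[str] = None
    human_validated: bool = False  # LLM suggestions start unvalidated

# Candidate concepts as an LLM might propose them (illustrative values).
candidates = [
    Concept("EX:0001", "kinase inhibitor", ["kinase blocker"], broader="EX:0000"),
    Concept("EX:0002", "protein kinase", [], broader="EX:0000"),
]

def curate(concept: Concept, approved: bool) -> Concept:
    """A human curator accepts or rejects each LLM-proposed concept."""
    concept.human_validated = approved
    return concept

# Only human-approved candidates are retained in the ontology.
validated = [c for c in (curate(c, True) for c in candidates) if c.human_validated]
```

The key design point is that `human_validated` defaults to `False`: machine-generated knowledge is a candidate, never an assertion, until an SME signs it off.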

2. Alignment to standards

While adept at recognizing types of things, such as genes, LLMs struggle to align varied, synonymous representations of instances to specific identifiers (e.g., understanding that BRCA1 and FANCS are equivalent). Short of providing an ontology as part of a prompt (which is constrained by prompt token limits), annotating textual data with types, instances, and accompanying relationships or hierarchies still requires ontologies. Ontologies ‘know’ things and are validated by humans.
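A minimal sketch of that instance-alignment step, using a synonym-to-identifier index: BRCA1 and its alias FANCS resolve to the same canonical ID. The index here is a toy, but the identifiers follow the HGNC convention (HGNC:1100 for BRCA1).

```python
# Synonym -> canonical identifier lookup (entity linking), the step LLMs
# struggle with. FANCS is a recorded alias of the BRCA1 gene.
GENE_INDEX = {
    "brca1": "HGNC:1100",
    "fancs": "HGNC:1100",   # alias, same gene
    "brca2": "HGNC:1101",
}

def resolve(mention: str):
    """Map a gene mention to its canonical identifier, or None if unknown."""
    return GENE_INDEX.get(mention.strip().lower())

same_gene = resolve("BRCA1") == resolve("FANCS")  # True: equivalent instances
```

Recognizing that both strings are "a gene" is type recognition, which LLMs do well; knowing they are the *same* gene is instance resolution, which is exactly what the curated synonym index supplies.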

3. Search, single DB

LLMs’ limitations in search tasks are well documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance, auditability, and reproducibility, to name a few. Retrieval Augmented Generation (RAG), the widely accepted grounding architecture, is gaining prominence as a result.

Successful RAG systems hinge on precise information retrieval (IR). Embeddings can indicate that things are related, but not the crucial “how” needed for explainable decisions. In this domain, the consensus is that lexical/ontological or hybrid approaches outperform purely vector-based methods.
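One way to picture a hybrid approach is a score that blends an explainable lexical/ontological signal (shared ontology IDs, which also serve as the "why") with an embedding similarity. This is a toy sketch with made-up vectors and illustrative IDs, not any particular product's ranking function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_tags, doc_tags, query_vec, doc_vec, alpha=0.5):
    """Blend ontology-tag overlap with embedding similarity.
    alpha weights the lexical part; the shared tags explain the match."""
    shared = sorted(set(query_tags) & set(doc_tags))
    lexical = len(shared) / max(len(set(query_tags)), 1)
    return alpha * lexical + (1 - alpha) * cosine(query_vec, doc_vec), shared

score, why = hybrid_score(
    ["HGNC:1100"], ["HGNC:1100", "MONDO:0007254"],  # illustrative IDs
    [0.1, 0.9], [0.2, 0.8],
)
```

The embedding term says the document is *related*; the `why` list of shared ontology identifiers says *how*, which is what an auditable, evidence-based answer needs.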

4. Search, democratizing silos

Data seldom resides in a single enterprise source. It exists in diverse sources, formats, and syntaxes. To democratize data, aligning disparate, siloed data to common standards is crucial; ontologies enable this by ensuring interoperability at the source level. Similarly, converting natural language queries to ontological entities is vital for querying diverse data effectively.

Ontologies play a pivotal role in such cases, aiding seamless retrieval from silos that are keyed against the IDs captured in ontologies, for example via knowledge graphs. While LLMs have a role in aspects of such a solution, e.g., converting natural language queries into the relevant query syntax (such as SSQL for SciBite Search or Cypher for Neo4j) or summarizing results from an accurate IR set, ontologies are also paramount.
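The two halves of that pipeline can be sketched together: first ground the mentions in a natural-language question to ontology IDs, then emit a query in the target syntax keyed on those IDs. The entity index, graph labels, and Cypher pattern below are all hypothetical, intended only to show the shape of the approach.

```python
# Ground query terms to ontology IDs first, then emit a query in the target
# syntax. The entity index and graph schema here are hypothetical.
ENTITY_INDEX = {"brca1": "HGNC:1100", "breast cancer": "MONDO:0007254"}

def ground(question: str) -> dict:
    """Find known ontology entities mentioned in a natural-language question."""
    q = question.lower()
    return {term: cid for term, cid in ENTITY_INDEX.items() if term in q}

def to_cypher(entities: dict):
    """Build a Cypher query keyed on grounded IDs (hypothetical graph model)."""
    ids = list(entities.values())
    if len(ids) == 2:
        query = ("MATCH (g:Gene {id: $gene})-[r]->(d:Disease {id: $disease}) "
                 "RETURN r")
    else:
        query = "// could not ground enough entities"
    return query, ids

query, params = to_cypher(ground("Is BRCA1 associated with breast cancer?"))
```

The LLM's natural strength is the syntax-generation step; the grounding step, which decides *which* nodes the query is actually about, is where the ontology does the work.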


When we model data with ontologies, we can ask extremely precise questions, get definitive answers, and use reasoning to both deduce and explain the answers to queries. Ontologies were born out of a need to unambiguously identify the kinds of entities in the world (and in our data) and the relationships that hold between them. LLMs, on the other hand, provide a statistical approach to identifying these, presented through natural language – enabling us to see what things are potentially related, but not how.

LLMs are not going anywhere and will benefit us all; their ability to support operational tasks is clear for all to see. However, in situations where evidence-based decision-making is paramount, particularly within R&D in the Life Sciences, ontologies still have a massive role to play.

Just like ontologies, or any piece of software for that matter, LLMs are just tools to help you do things. No one tool solves every problem on its own. We must all remember to start by understanding the problem, not shoehorning a solution into a place it does not fit.

 


About Joe Mullen

Director of Data Science & Professional Services, SciBite

Joe Mullen holds a Ph.D. from Newcastle University, where he developed computational approaches to drug repositioning, with a focus on semantic data integration and data mining. He has been with SciBite since 2017, initially as part of the Data Science team.




