Technological advancements exhibit varying degrees of longevity. Some are tried and trusted, enduring for years, while others succumb to fleeting hype without ever coming to fruition. One constant in this dynamic landscape is the data.
To make use of the latest and greatest technology, you must have your house, or more specifically your data, in order. Through high-quality foundational data management, in which ontologies play a crucial role, an organization can be agile enough to adapt to, and make use of, state-of-the-art technologies such as large language models (LLMs).
An LLM is a sophisticated, generative, artificial intelligence (AI) model designed to understand and generate human-like text. Trained on monumental amounts of data, LLMs generate coherent and contextually relevant responses. LLMs are, therefore, great at language-based tasks that allow them to draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.
An ontology is a formal representation of a knowledge domain, enabling the structured encoding of information using principles akin to those found in traditional symbolic logic. Ontologies capture human knowledge in a computationally friendly format, allowing analysis to be carried out with the up-to-date knowledge of the subject matter expert (SME) in a scalable fashion.
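To make this concrete, here is a minimal sketch of how a fragment of a domain ontology might be encoded programmatically, using the open-source rdflib Python library. The ex: namespace, the classes, and the BRCA1 assertion are purely illustrative stand-ins, not a real published ontology.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/onto/")  # hypothetical namespace for this example

g = Graph()
g.bind("ex", EX)

# Declare two classes and a relationship between them
g.add((EX.Gene, RDF.type, OWL.Class))
g.add((EX.Disease, RDF.type, OWL.Class))
g.add((EX.associatedWith, RDF.type, OWL.ObjectProperty))

# Assert an instance: BRCA1 is a Gene associated with a Disease
g.add((EX.BRCA1, RDF.type, EX.Gene))
g.add((EX.BRCA1, RDFS.label, Literal("BRCA1")))
g.add((EX.BreastCancer, RDF.type, EX.Disease))
g.add((EX.BRCA1, EX.associatedWith, EX.BreastCancer))

print(g.serialize(format="turtle"))  # machine-readable, human-auditable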
It’s also important to note that the term ontology is used to describe standards of varying expressivity and formality, ranging from glossaries, thesauri, and taxonomies, through metadata and data models, all the way to rich semantic ontologies.
A key use case of ontologies, or more specifically ontology-derived standards, is the tagging and management of data. Whether your data is structured, unstructured, internal, or external, aligning it to standards makes it more Findable, Accessible, Interoperable, and Reusable (FAIR). As a result, information retrieval tasks can be enhanced, and by representing previously unstructured data in a structured, semantic fashion, inference and the extraction of insights can be expedited.
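As a toy illustration of tagging, the sketch below annotates free text with identifiers from a tiny hand-written vocabulary. In a real system the vocabulary would come from an ontology or terminology service; the MONDO and HGNC identifiers shown are just examples.

```python
# A tiny, hand-written vocabulary standing in for a real ontology service
VOCAB = {
    "breast cancer": "MONDO:0007254",
    "BRCA1": "HGNC:1100",
}

def tag(text: str) -> list[tuple[str, str]]:
    """Return (mention, ontology ID) pairs found in the text."""
    lowered = text.lower()
    return [(term, iid) for term, iid in VOCAB.items() if term.lower() in lowered]

doc = "BRCA1 mutations are linked to hereditary breast cancer."
print(tag(doc))  # -> [('breast cancer', 'MONDO:0007254'), ('BRCA1', 'HGNC:1100')]
```

Once every document carries these standard identifiers, the same IDs become the keys for search, integration, and downstream inference.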
But hang on… surely I can just use the power of an LLM to manage, retrieve, and analyze my data, I hear (some of) you cry? Well, no, not completely. And certainly not in scenarios where evidence-based decision-making is crucial, such as the Life Sciences and other domains where decisions need to be made in an explainable manner with provenance, and where wrong decisions can result in dire consequences.
Let’s take a closer look at some of the specific tasks for which we believe the creation and application of ontologies are still vital…
LLMs can help uncover ‘knowledge’, yet ontologies are needed to capture that knowledge for future use. While various technologies can support the semi-automated curation of ontologies, human validation of candidate classes, terms, and relationships is crucial.
Transforming LLM output into an ontology through a lightweight model enables versatility and reuse in downstream applications. Even if AI could create a flawless ontology, it is crucial to recognize that the value of a standard lies in the consensus it commands among humans. While AI may generate a well-structured ontology, it may well struggle with nuanced distinctions and diverge from what others use. The human in the loop is vital.
While adept at recognizing types of things, like genes, LLMs struggle to align varied, synonymous representations of instances to specific identifiers (e.g., understanding that BRCA1 and FANCS are equivalent). Short of providing an ontology as part of a prompt (with limitations including prompt token length), annotating textual data with types, instances, and accompanying relationships or hierarchies still requires ontologies. Ontologies ‘know’ things and are validated by humans.
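The heart of that alignment problem is small to state in code. Below is a minimal sketch in which a hand-written synonym table stands in for an ontology-backed lexicon; HGNC:1100 is BRCA1’s HGNC identifier, and FANCS is one of its recognized aliases.

```python
# Hypothetical synonym table; in practice this would be derived from an
# ontology or reference vocabulary (e.g., HGNC), not written by hand.
SYNONYMS = {
    "BRCA1": "HGNC:1100",
    "FANCS": "HGNC:1100",            # FANCS is an alias of BRCA1
    "breast cancer 1": "HGNC:1100",
}

def normalize(mention: str) -> str | None:
    """Resolve a textual mention to a canonical ontology identifier."""
    return SYNONYMS.get(mention.strip())

# Two different surface forms resolve to the same canonical ID
assert normalize("BRCA1") == normalize("FANCS") == "HGNC:1100"
```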
LLMs’ limitations in search tasks are well documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance and auditability/reproducibility, to name a few. The Retrieval Augmented Generation (RAG) architecture, the widely accepted grounding approach, is gaining prominence.
Successful RAG systems hinge on precise information retrieval (IR). Embeddings can indicate that items are related, but not the crucial “how” needed for explainable decisions. In this domain, the consensus is that lexical/ontological or hybrid approaches excel over purely vector-based methods.
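A hybrid ranker can be sketched in a few lines. The example below is illustrative only: the documents, ontology tags, and pre-computed embedding scores are invented, and the simple weighted blend stands in for whatever fusion strategy a production system would actually use.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    tags: set[str]      # ontology IDs annotated onto the document
    vec_score: float    # pre-computed embedding similarity to the query

def hybrid_score(doc: Doc, query_tags: set[str], alpha: float = 0.5) -> float:
    """Blend exact ontology-tag overlap (lexical) with embedding similarity."""
    lexical = len(doc.tags & query_tags) / len(query_tags) if query_tags else 0.0
    return alpha * lexical + (1 - alpha) * doc.vec_score

docs = [
    Doc("d1", {"HGNC:1100", "MONDO:0007254"}, 0.71),  # tagged with the query entity
    Doc("d2", set(), 0.78),                           # similar text, but untagged
]
query_tags = {"HGNC:1100"}  # IDs recognised in the user's question
ranked = sorted(docs, key=lambda d: hybrid_score(d, query_tags), reverse=True)
print([d.doc_id for d in ranked])  # -> ['d1', 'd2']
```

Note that the ontology-tagged document outranks the one with the marginally higher embedding score, and the tag overlap itself explains why it was retrieved.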
Data seldom resides in a single enterprise source. It exists in diverse sources, formats, and syntaxes. To democratize data, aligning disparate, siloed data to common standards is crucial; ontologies enable this by ensuring interoperability at the source level. Similarly, converting natural language queries to ontological entities is vital for querying diverse data effectively.
Ontologies play a pivotal role in such cases, aiding seamless retrieval from silos, such as knowledge graphs, that key against IDs captured in ontologies. Whilst LLMs have a role in aspects of such a solution, e.g., converting natural language queries to a relevant query syntax (such as SSQL for SciBite Search or Cypher for Neo4J) or summarising results from an accurate IR set, ontologies are also paramount.
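To illustrate the division of labour, here is a sketch in which an ontology-backed lexicon grounds the entities in a question before a parameterised Cypher query is built against a hypothetical Gene/Disease graph schema. The lexicon, node labels, and relationship name are all assumptions for the example; an LLM could plausibly own the query-phrasing step, but the grounding keys on ontology IDs.

```python
LEXICON = {"BRCA1": "HGNC:1100", "FANCS": "HGNC:1100"}  # ontology-backed synonyms

def ground_entities(question: str) -> set[str]:
    """Map surface mentions in a question to canonical ontology IDs."""
    return {iid for mention, iid in LEXICON.items()
            if mention.lower() in question.lower()}

def to_cypher(gene_ids: set[str]) -> tuple[str, dict]:
    """Build a parameterised Cypher query keyed on grounded IDs, not raw strings."""
    query = ("MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease) "
             "WHERE g.id IN $ids RETURN DISTINCT d.name")
    return query, {"ids": sorted(gene_ids)}

ids = ground_entities("Which diseases involve FANCS?")
cypher, params = to_cypher(ids)  # the query now reaches BRCA1 records too
```

Because FANCS was normalized to the same identifier as BRCA1, the query retrieves records regardless of which alias the source data or the user happened to use.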
When we model data with ontologies, we can ask extremely precise questions, get definitive answers, and use reasoning to both deduce and explain answers to queries. Ontologies were born out of a need to unambiguously identify the kinds of entities in the world (and in our data) and the relationships that hold between them. LLMs, on the other hand, provide a statistical approach to identifying these, presented through natural language, enabling us to see that things are potentially related, but not how.
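The “deduce and explain” point can be shown with a tiny example, again using rdflib and an invented two-triple ontology. BRCA1 is never stated to be a Gene, yet the SPARQL property path walks the class hierarchy to deduce it, and that path is itself the explanation.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix ex:   <http://example.org/onto/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:BRCA1 a ex:TumourSuppressorGene .
    ex:TumourSuppressorGene rdfs:subClassOf ex:Gene .
""", format="turtle")

# BRCA1 was never asserted to be an ex:Gene; the rdfs:subClassOf* property
# path deduces it by walking the class hierarchy, so the answer comes with
# a traceable chain of reasoning rather than a statistical guess.
results = g.query("""
    PREFIX ex:   <http://example.org/onto/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?g WHERE { ?g a/rdfs:subClassOf* ex:Gene . }
""")
for row in results:
    print(row.g)  # -> http://example.org/onto/BRCA1
```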
Large language models are not going anywhere and will benefit us all; their ability to support operational tasks is clear for all to see. However, in situations where evidence-based decision-making is paramount, particularly within R&D in the Life Sciences, ontologies still have a massive role to play.
Just like ontologies, or any piece of software for that matter, large language models are just tools to help you do things. No one thing solves every problem on its own. We must all remember to start by understanding the problem, not shoehorning a solution into a place it does not fit.
Leading SciBite’s data science and professional services team, Joe is dedicated to helping customers unlock the full potential of their data using SciBite’s semantic stack, spearheading R&D initiatives within the team and pushing the boundaries of the possible. Joe’s expertise is rooted in a PhD from Newcastle University focussing on novel computational approaches to drug repositioning, building atop semantic data integration, knowledge graphs, and data mining.
Since joining SciBite in 2017, Joe has been enthused by the rapid advancements in technology, particularly within AI. Recognizing its immense potential, Joe combines this cutting-edge technology with SciBite’s core technologies to craft bespoke solutions that cater to diverse customer needs.