To make use of the latest and greatest technology, you must have your house, or more specifically your data, in order. Through high-quality foundational data management, in which ontologies play a crucial role, an organization can be agile enough to adapt to, and make use of, state-of-the-art technologies such as large language models (LLMs).
An LLM is a sophisticated generative artificial intelligence (AI) model designed to understand and generate human-like text. Trained on vast amounts of data, LLMs are designed to generate coherent and contextually relevant responses. They are therefore well suited to language-based tasks that draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.
An ontology is a formal representation of a knowledge domain, enabling the structured encoding of information using principles akin to those found in traditional symbolic logic. Ontologies capture human knowledge in a computationally friendly format, allowing analysis to be carried out at scale using the up-to-date knowledge of subject matter experts (SMEs).
It’s important to also note that the term ontology is used to define standards varying in expressivity and formality, ranging from glossaries through thesauri and taxonomies via metadata and data models all the way to rich semantic ontologies.
A key use case of ontologies, or more specifically ontology-derived standards, is the tagging and management of data. Whether your data is structured or unstructured, internal or external, aligning it to standards makes it more Findable, Accessible, Interoperable and Reusable (FAIR). As a result, information retrieval tasks can be enhanced and, by representing previously unstructured data in a structured, semantic fashion, inference and the extraction of insights can be expedited.
But hang on… surely I can just use the power of an LLM to manage, retrieve and analyze my data, I hear (some of) you cry? Well, no, not completely. And certainly not in scenarios where evidence-based decision-making is crucial, such as the Life Sciences and other domains where decisions need to be reached in an explainable manner with provenance, and where wrong decisions can have dire consequences.
Let’s take a closer look at some of the specific tasks for which we believe the creation and application of ontologies are still vital…
LLMs can help uncover ‘knowledge’, yet ontologies are needed to capture that for future use. While various technologies can support the semi-automated curation of ontologies, human validation is crucial for candidate classes, terms, and relationships.
Transforming LLM output into an ontology through a lightweight model enables versatility and reuse in downstream applications. Even if AI could create a flawless ontology, it is crucial to recognize that the value of a standard lies in its consensus among humans. While AI may generate a well-structured ontology, it may well struggle with nuanced distinctions, and diverge from what others use. The human in the loop is vital.
While adept at recognizing types of things, like genes, LLMs struggle with aligning varied, synonymous representations of instances to specific identifiers (e.g., understanding that BRCA1 and FANCS are equivalent). Short of providing an ontology as part of a prompt (an approach limited by, among other things, prompt token length), annotating textual data with types, instances and their accompanying relationships or hierarchies is still something that requires ontologies. Ontologies ‘know’ things and are validated by humans.
LLMs’ limitations in search tasks are well-documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance and auditability/reproducibility, to name a few. Retrieval Augmented Generation (RAG), a widely accepted grounding approach, is gaining prominence.
Successful RAG systems hinge on precise information retrieval (IR). Retrieval often relies on embeddings, which indicate that two things are related but not the crucial “how” needed for explainable decisions. In this domain, the consensus is that lexical/ontological or hybrid approaches outperform purely vector-based methods.
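To make the synonym problem concrete, here is a minimal sketch of dictionary-based annotation. It assumes a tiny hand-written ontology fragment; the identifier `EX:GENE_0001` and the names `ONTOLOGY`, `build_lookup` and `annotate` are hypothetical placeholders for illustration, not a real accession or any product's API.

```python
import re

# Hypothetical ontology fragment: one identifier, several names.
# "EX:GENE_0001" is a placeholder ID, not a real database accession.
ONTOLOGY = {
    "EX:GENE_0001": {"type": "Gene", "label": "BRCA1", "synonyms": ["FANCS"]},
}


def build_lookup(ontology):
    """Map every lower-cased label and synonym to its identifier."""
    lookup = {}
    for entity_id, entry in ontology.items():
        for name in [entry["label"], *entry["synonyms"]]:
            lookup[name.lower()] = entity_id
    return lookup


def annotate(text, ontology=ONTOLOGY):
    """Return each mention found in text with its identifier and type."""
    lookup = build_lookup(ontology)
    # Try longer names first so multi-word labels win over shorter ones.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        entity_id = lookup[match.group(0).lower()]
        hits.append({
            "text": match.group(0),
            "start": match.start(),
            "id": entity_id,
            "type": ontology[entity_id]["type"],
        })
    return hits


hits = annotate("FANCS and BRCA1 were both mentioned.")
```

Both mentions resolve to the same identifier, which is the property downstream search and analytics rely on. A production annotator would need far more than exact string matching (disambiguation, spelling variants, context), so treat this purely as an illustration of the idea.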
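One way to picture a hybrid approach is a score that blends an ontology-ID overlap with a vector similarity. The sketch below is only an assumption about how such a blend could look: the toy vectors, the weighting parameter `alpha`, the placeholder ID `EX:GENE_0001` and the function `hybrid_rank` are all invented for illustration and do not describe any particular RAG system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(query, docs, alpha=0.5):
    """Rank docs by alpha * ID overlap + (1 - alpha) * cosine similarity.

    The matched IDs are returned with each result, so a hit can be
    explained by the exact entities it shares with the query.
    """
    results = []
    for doc in docs:
        shared = sorted(set(query["ids"]) & set(doc["ids"]))
        lexical = len(shared) / len(query["ids"]) if query["ids"] else 0.0
        vector = cosine(query["vec"], doc["vec"])
        score = alpha * lexical + (1 - alpha) * vector
        results.append({"doc": doc["name"], "score": score, "matched_ids": shared})
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Toy data: IDs are hypothetical, vectors are hand-made rather than real embeddings.
QUERY = {"ids": ["EX:GENE_0001"], "vec": [1.0, 0.0]}
DOCS = [
    {"name": "doc_a", "ids": ["EX:GENE_0001"], "vec": [0.6, 0.8]},
    {"name": "doc_b", "ids": [], "vec": [0.8, 0.6]},
]

ranked = hybrid_rank(QUERY, DOCS)
vector_only = hybrid_rank(QUERY, DOCS, alpha=0.0)
```

With the blend, `doc_a` ranks first and carries the matched identifier as its explanation; with `alpha=0.0` (vectors only), `doc_b` ranks first and there is nothing to say about why beyond a similarity number.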
Data seldom resides in a single enterprise source. It exists in diverse sources, formats, and syntaxes. To democratize data, aligning disparate siloed data to common standards is crucial, enabled by ontologies ensuring source-level interoperability. Similarly, converting natural language queries to ontological entities is vital for querying diverse data effectively.
Ontologies play a pivotal role in such cases, aiding seamless retrieval from silos, such as knowledge graphs, that key against IDs captured in ontologies. LLMs have a role in aspects of such a solution, e.g., converting natural language queries to a relevant query syntax (such as SSQL for SciBite Search or Cypher for Neo4j) and summarising results from an accurate IR set, but ontologies are also paramount.
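As a rough sketch of resolving a natural language question to ontological entities before querying, the snippet below maps recognised names to an identifier and passes the result to a parameterised Cypher query. Everything here is assumed for illustration: the graph schema (`Document`, `MENTIONS`, `Entity`), the placeholder ID, and the fixed query template, which stands in for the step an LLM might perform.

```python
# Hypothetical synonym table; "EX:GENE_0001" is a placeholder identifier.
SYNONYM_TO_ID = {"brca1": "EX:GENE_0001", "fancs": "EX:GENE_0001"}


def to_cypher(question, synonym_to_id=SYNONYM_TO_ID):
    """Resolve names in the question to IDs and return (query, parameters)."""
    tokens = [t.strip(".,?!").lower() for t in question.split()]
    ids = sorted({synonym_to_id[t] for t in tokens if t in synonym_to_id})
    # Assumed schema: (:Document)-[:MENTIONS]->(:Entity {id}).
    query = (
        "MATCH (d:Document)-[:MENTIONS]->(e:Entity) "
        "WHERE e.id IN $ids "
        "RETURN d.title, e.id"
    )
    return query, {"ids": ids}


query_a, params_a = to_cypher("Which documents mention FANCS?")
query_b, params_b = to_cypher("Which documents mention BRCA1?")
```

Because both questions resolve to the same identifier, they produce the same query and parameters, whichever synonym the user happened to type.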
When we model data with ontologies, we can ask extremely precise questions, get definitive answers and are able to use reasoning to both deduce and explain answers to queries. Ontologies were born out of a need to unambiguously identify the kinds of entities in the world (and in our data) and the relationships that hold between them. LLMs, on the other hand, provide a statistical approach to identifying these that can be presented through natural language – enabling us to see what things are potentially related, but not how.
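The difference between "related" and "how" can be shown with a very small reasoning sketch. It assumes a hand-written set of `is_a` assertions (not taken from any published ontology) and a hypothetical helper, `explain_is_a`, that returns the chain of assertions supporting an answer.

```python
# Hypothetical is_a assertions (child -> parent), written for illustration.
IS_A = {
    "BRCA1": "tumor suppressor gene",
    "tumor suppressor gene": "gene",
}


def explain_is_a(term, ancestor, is_a=IS_A):
    """Return the is_a chain from term to ancestor, or None if there is none."""
    path = [term]
    seen = {term}
    while path[-1] != ancestor:
        parent = is_a.get(path[-1])
        if parent is None or parent in seen:
            # No further parent, or a cycle: the statement cannot be deduced.
            return None
        path.append(parent)
        seen.add(parent)
    return path


proof = explain_is_a("BRCA1", "gene")
no_proof = explain_is_a("gene", "BRCA1")
```

Here `proof` is the list `["BRCA1", "tumor suppressor gene", "gene"]`, so the answer comes with its derivation, while `no_proof` is `None` because the reverse statement is not entailed. Real ontology reasoners handle multiple parents and richer axioms; this single-parent walk only illustrates the principle.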
LLMs are not going anywhere and will benefit us all; their ability to support operational tasks is clear for all to see. However, in situations where evidence-based decision-making is paramount, particularly within R&D in the Life Sciences, ontologies still have a massive role to play.
Just like ontologies, or any piece of software for that matter, LLMs are just tools to help you do things. No one thing solves every problem on its own. We must all remember to start by understanding the problem, not shoehorning a solution into a place it does not fit.
Joe Mullen, Director of Science & Professional Services. Holds a Ph.D. from Newcastle University in the development of computational approaches to drug repositioning, with a focus on semantic data integration and data mining. He has been with SciBite since 2017, initially as part of the Data Science team.
© SciBite Limited / Registered in England & Wales No. 07778456