In the first part of this series, we explored the key considerations for AI-based chat applications and discussed the limitations of large language models (LLMs).
Life sciences organizations worldwide have been experimenting with LLMs for conversational search across a variety of data sources, and retrieval-augmented generation (RAG) has emerged as a framework for improving the results.
The quality of answers generated through a RAG approach depends largely on how effectively the relevant external documents are retrieved. While fine-tuning LLMs and refining the RAG pipeline remain important, it is worth breaking the methodology down and starting with the foundational element: the data. This is where ontologies play a crucial role, giving us the opportunity to attach meaning to the words and concepts in the data and, ultimately, to make such systems transparent.
The document shown in Figure 2 describes Duchenne muscular dystrophy and its causes, primarily mutations in the dystrophin gene. It could be used to answer questions such as:
What causes Duchenne muscular dystrophy?
The document contains the exact three-word disease name and the verb “causes”, so even a basic keyword-based retrieval system would easily include it in the list of candidate documents.
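To make that contrast concrete, here is a minimal, hypothetical Python sketch of this kind of keyword overlap; the document text and stop-word list are invented for the example and are not part of SciBite’s implementation.

```python
import re

# A tiny stop-word list so that matches on "the", "in", etc. do not count.
STOPWORDS = {"what", "are", "the", "of", "in", "is", "a", "its", "by", "and"}

def terms(text: str) -> set[str]:
    """Lower-case word tokens with stop words removed."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def keyword_score(query: str, document: str) -> int:
    """Count how many meaningful query terms literally appear in the document."""
    return len(terms(query) & terms(document))

# Illustrative text paraphrasing the document shown in Figure 2.
doc = ("Duchenne muscular dystrophy is a genetic disorder. "
       "Its main causes are mutations in the dystrophin gene.")

print(keyword_score("What causes Duchenne muscular dystrophy?", doc))  # 4 matching terms
print(keyword_score("What are the effects of mutation in DMD?", doc))  # 0 matching terms
```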
What causes Duchenne disease?
Notice that the disease name varies slightly, which may pose a challenge for a keyword-based system. A vector-based retrieval system should still be able to locate the document, given that terms such as “Duchenne”, “disease”, and “causes” appear throughout it.
Figure 2: Document explaining causes of Duchenne muscular dystrophy.
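For comparison, a vector-based lookup might look like the sketch below, assuming the open-source sentence-transformers library and an off-the-shelf embedding model; the article does not specify which embeddings are used in practice.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose embedding model will do for this illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

doc = ("Duchenne muscular dystrophy is a genetic disorder. "
       "Its main causes are mutations in the dystrophin gene.")
queries = [
    "What causes Duchenne disease?",             # slight name variation, still close in vector space
    "What are the effects of mutation in DMD?",  # the gene symbol is absent from the document
]

doc_vec = model.encode(doc, convert_to_tensor=True)
for q in queries:
    score = util.cos_sim(model.encode(q, convert_to_tensor=True), doc_vec).item()
    print(f"{q!r}: cosine similarity = {score:.2f}")
```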
What are the effects of mutation in DMD?
This is where the situation becomes complex. “DMD” is the official symbol for the dystrophin gene, as assigned by HGNC, a committee established by an international organization of scientists in genetics and genomics research to facilitate research and collaboration. But the symbol itself does not appear in the document. The only matching word is “mutation”, a term that occurs in millions of documents discussing genetic disorders.
Consequently, the likelihood of our document being selected for answer generation is slim. More concerning, if no other documents discuss mutation of “DMD”, the system may start generating answers sourced from documents discussing mutations in other genes with symbols that merely resemble “DMD”. To further complicate matters, “MRX85” is another synonym for the same gene, albeit one rarely used in contemporary publications; users may nonetheless ask about topics related to “MRX85”.
The prevalence of published research on DMD increases the likelihood of the correct answer being generated. However, for less studied scientific concepts, the system may miss crucial information needed for accurate responses.
Figure 3: Document enriched with ontologies.
Ontologies like MeSH can provide all the synonyms for diseases like Duchenne muscular dystrophy. Should the data incorporate this additional information prior to the generation step?
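As a sketch of what that could look like, the snippet below matches a question against a hand-written excerpt of disease synonyms; the synonym list is illustrative, not a real MeSH export.

```python
# Hand-written excerpt standing in for MeSH entry terms (illustrative only).
MESH_SYNONYMS = {
    "D020388": [
        "Duchenne muscular dystrophy",
        "Duchenne disease",
        "Muscular dystrophy, Duchenne",
    ],
}

def find_concepts(text: str, synonyms: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the ontology IDs whose synonyms appear in the text."""
    lowered = text.lower()
    return {
        concept_id: [name for name in names if name.lower() in lowered]
        for concept_id, names in synonyms.items()
        if any(name.lower() in lowered for name in names)
    }

print(find_concepts("What causes Duchenne disease?", MESH_SYNONYMS))
# {'D020388': ['Duchenne disease']}
```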
SciBite’s Head of Ontologies, Jane Lomax, notes in a recent interview that ontologies represent the truth as agreed upon by humans: that something is this type of thing, and that it relates to these other types of things. So if you can feed that into your AI, you get the best of both worlds.
Computers and humans speak different languages. Computers understand strings of characters, while humans understand concepts and ideas.
Ontologies act as translators between these languages. They define what things are and how they’re connected, making it easier for computers to understand human concepts.
So the first suggestion is to enhance all data with the ontologies specific to the subject matter, that is, to enrich and index the data against the terms defined in those ontologies.
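A minimal sketch of this enrichment-and-indexing step is shown below; the mini-ontology and document are invented for illustration, and a real pipeline would tag text against full vocabularies such as MeSH and HGNC.

```python
from collections import defaultdict

# Illustrative mini-ontology: concept ID -> labels and synonyms.
ONTOLOGY = {
    "MESH:D020388": ["duchenne muscular dystrophy", "duchenne disease"],
    "HGNC:2928":    ["dmd", "dystrophin", "mrx85"],
}

def annotate(text: str) -> set[str]:
    """Return the concept IDs whose labels or synonyms occur in the text."""
    lowered = text.lower()
    return {cid for cid, labels in ONTOLOGY.items() if any(l in lowered for l in labels)}

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each concept ID to the set of document IDs annotated with it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for concept in annotate(text):
            index[concept].add(doc_id)
    return index

docs = {"doc-1": "Duchenne muscular dystrophy is caused by mutations in the dystrophin gene."}
print(dict(build_index(docs)))
# e.g. {'MESH:D020388': {'doc-1'}, 'HGNC:2928': {'doc-1'}}
```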
A user’s question likewise contains scientific concepts and relationships. The second suggestion is to use ontologies to analyze and structure the natural language question. In doing so, we not only extract the scientific concepts but also begin to discern the context of the question being asked.
“What are the effects of mutation in DMD?”
Ontology enrichment identifies one key element and two secondary elements in this question: DMD from HGNC (ID 2928), plus Mutation and Effect from Bioverb. For the retrieval mechanism, this implies that candidate documents must contain HGNC:2928 (and should discuss its “mutation” and “effects”). If the retrieval mechanism fails to locate relevant information, it is preferable to report “no answer found” than to generate a response about a different, similar-sounding gene.
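Here is a hypothetical sketch of that question analysis, with invented label lists standing in for the HGNC and Bioverb vocabularies.

```python
# Illustrative lookup table: concept ID -> surface forms and role in the question.
QUESTION_ONTOLOGY = {
    "HGNC:2928":        {"labels": {"dmd", "dystrophin", "mrx85"},       "role": "key"},
    "BIOVERB:mutation": {"labels": {"mutation", "mutations", "mutated"}, "role": "secondary"},
    "BIOVERB:effect":   {"labels": {"effect", "effects"},                "role": "secondary"},
}

def enrich_question(question: str) -> dict[str, str]:
    """Return the concept IDs found in the question, with their role."""
    words = set(question.lower().replace("?", "").split())
    return {cid: info["role"]
            for cid, info in QUESTION_ONTOLOGY.items()
            if words & info["labels"]}

print(enrich_question("What are the effects of mutation in DMD?"))
# {'HGNC:2928': 'key', 'BIOVERB:mutation': 'secondary', 'BIOVERB:effect': 'secondary'}
```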
“What gene is associated with Duchenne disease?”
The candidate documents must contain Duchenne muscular dystrophy (MeSH ID D020388) and its association with a scientific concept belonging to the HGNC ontology.
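Below is a sketch of how such ontology constraints could gate retrieval; the document annotations, including the second document’s IDs, are placeholders invented for the example.

```python
def find_candidates(doc_annotations: dict[str, set[str]],
                    required_ids: set[str],
                    required_namespaces: set[str]) -> list[str]:
    """Return document IDs whose annotations satisfy the ontology constraints."""
    hits = []
    for doc_id, concepts in doc_annotations.items():
        has_ids = required_ids <= concepts
        has_namespaces = all(any(c.startswith(ns + ":") for c in concepts)
                             for ns in required_namespaces)
        if has_ids and has_namespaces:
            hits.append(doc_id)
    return hits

annotations = {
    "doc-1": {"MESH:D020388", "HGNC:2928"},  # discusses the disease and the DMD gene
    "doc-2": {"MESH:D000001", "HGNC:0001"},  # placeholder IDs for an unrelated disease/gene
}

# The question requires the Duchenne concept plus at least one HGNC gene concept.
candidates = find_candidates(annotations, {"MESH:D020388"}, {"HGNC"})
print(candidates or "no answer found")  # ['doc-1']
```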
One important factor to keep in mind is that the effectiveness of an ontology-based retrieval mechanism relies heavily on the quality of the ontologies designed specifically for the domain of the questions being asked.
Ontology enrichment, applied to documents and to user questions before documents are retrieved, makes the system transparent to users in several ways. First, it shows them which concepts identified in their natural language question are considered crucial for locating candidate documents, so they understand which aspects of their query are being focused on.
Additionally, ontology enrichment surfaces the sentences from candidate documents that contain potential answers, together with the identified concepts that align with the query, allowing users to see the relevant information directly. The system can also explain why it deems those sentences relevant to the question posed, giving users insight into the reasoning behind the retrieved results. Overall, ontology enrichment enhances transparency by providing users with detailed information and explanations throughout the document retrieval process.
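One possible shape for such a transparent result is sketched below; the field names and example values are illustrative, not SciBite’s actual response format.

```python
from dataclasses import dataclass, field

@dataclass
class TransparentResult:
    question_concepts: dict[str, str]        # concept ID -> role detected in the question
    matched_sentences: list[str]             # candidate sentences pulled from documents
    explanations: list[str] = field(default_factory=list)  # why each sentence was kept

result = TransparentResult(
    question_concepts={"HGNC:2928": "key", "BIOVERB:effect": "secondary"},
    matched_sentences=[
        "Mutations in the dystrophin gene (DMD) lead to progressive muscle degeneration."
    ],
    explanations=[
        "Sentence is annotated with HGNC:2928 and discusses the effects of its mutation."
    ],
)
print(result.question_concepts)
```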
In the next part, we’ll explore how ontologies enhance the retrieval mechanism and make it more cost-effective. We’ll also introduce other design decisions the SciBite team made during application development, aimed not just at meeting the requirements but at surpassing them. Stay tuned for more updates!
Harpreet is the Director of Technical Sales at SciBite, a leading data-first, semantic analytics software company. With a strong background in data management and analytics, Harpreet has played a vital role in assisting numerous organizations in implementing knowledge graphs, from data preparation to visualization to gaining insights.