In the first part of this series, we explored the key considerations for AI-based chat applications and discussed the limitations of large language models (LLMs).
Life sciences organizations worldwide have been experimenting with LLMs for conversational search across a variety of data sources, and retrieval-augmented generation (RAG) has emerged as a framework for improving the results.
The quality of answers generated through a RAG approach depends largely on how effectively the relevant external documents are retrieved. While fine-tuning LLMs and refining the RAG pipeline remain important, it is worth breaking the methodology down and starting with the foundational element: the data. This is where ontologies play a crucial role, giving us the opportunity to attach meaning to the words and concepts in the data and, ultimately, to make such systems transparent.
The document shown in Figure 2 describes Duchenne muscular dystrophy and its causes, primarily mutations in the dystrophin gene. It could be used to answer questions such as:
What causes Duchenne muscular dystrophy?
The document contains the exact three-word disease name and the verb “causes”, so even a basic keyword-based retrieval system would easily include it in the list of candidate documents.
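To make that contrast concrete, here is a minimal, hypothetical Python sketch of this kind of keyword overlap; the document text and stop-word list are invented for the example and are not part of SciBite’s implementation.

```python
import re

# A tiny stop-word list so that matches on "the", "in", etc. do not count.
STOPWORDS = {"what", "are", "the", "of", "in", "is", "a", "its", "by", "and"}

def terms(text: str) -> set[str]:
    """Lower-case word tokens with stop words removed."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def keyword_score(query: str, document: str) -> int:
    """Count how many meaningful query terms literally appear in the document."""
    return len(terms(query) & terms(document))

# Illustrative text paraphrasing the document shown in Figure 2.
doc = ("Duchenne muscular dystrophy is a genetic disorder. "
       "Its main causes are mutations in the dystrophin gene.")

print(keyword_score("What causes Duchenne muscular dystrophy?", doc))  # 4 matching terms
print(keyword_score("What are the effects of mutation in DMD?", doc))  # 0 matching terms
```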
What causes Duchenne disease?
Notice that the disease name varies slightly, which may pose a challenge for a keyword-based system. A vector-based retrieval system should still be able to locate the document, given that terms such as “Duchenne”, “disease”, and “causes” appear throughout it.
Figure 2: Document explaining causes of Duchenne muscular dystrophy.
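For comparison, a vector-based lookup might look like the sketch below, assuming the open-source sentence-transformers library and an off-the-shelf embedding model; the article does not specify which embeddings are used in practice.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose embedding model will do for this illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

doc = ("Duchenne muscular dystrophy is a genetic disorder. "
       "Its main causes are mutations in the dystrophin gene.")
queries = [
    "What causes Duchenne disease?",             # slight name variation, still close in vector space
    "What are the effects of mutation in DMD?",  # the gene symbol is absent from the document
]

doc_vec = model.encode(doc, convert_to_tensor=True)
for q in queries:
    score = util.cos_sim(model.encode(q, convert_to_tensor=True), doc_vec).item()
    print(f"{q!r}: cosine similarity = {score:.2f}")
```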
What are the effects of mutation in DMD?
This is where the situation becomes complex. “DMD” is the official symbol for the dystrophin gene, as assigned by HGNC, a committee established by an international organization of scientists in genetics and genomics research to facilitate research and collaboration. But the symbol itself does not appear in the document. The only matching word is “mutation”, a term that occurs in millions of documents discussing genetic disorders.
Consequently, the likelihood of our document being selected for answer generation is slim. More concerning, if no other documents discuss mutation of “DMD”, the system may start generating answers sourced from documents discussing mutations in other genes with symbols that merely resemble “DMD”. To further complicate matters, “MRX85” is another synonym for the same gene, albeit one rarely used in contemporary publications; users may nonetheless ask about topics related to “MRX85”.
The prevalence of published research on DMD increases the likelihood of the correct answer being generated. However, for less studied scientific concepts, the system may miss crucial information needed for accurate responses.
Figure 3: Document enriched with ontologies.
Ontologies like MeSH can provide all the synonyms for diseases like Duchenne muscular dystrophy. Should the data incorporate this additional information prior to the generation step?
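As a sketch of what that could look like, the snippet below matches a question against a hand-written excerpt of disease synonyms; the synonym list is illustrative, not a real MeSH export.

```python
# Hand-written excerpt standing in for MeSH entry terms (illustrative only).
MESH_SYNONYMS = {
    "D020388": [
        "Duchenne muscular dystrophy",
        "Duchenne disease",
        "Muscular dystrophy, Duchenne",
    ],
}

def find_concepts(text: str, synonyms: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the ontology IDs whose synonyms appear in the text."""
    lowered = text.lower()
    return {
        concept_id: [name for name in names if name.lower() in lowered]
        for concept_id, names in synonyms.items()
        if any(name.lower() in lowered for name in names)
    }

print(find_concepts("What causes Duchenne disease?", MESH_SYNONYMS))
# {'D020388': ['Duchenne disease']}
```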
SciBite’s Head of Ontologies, Jane Lomax, notes in a recent interview that ontologies represent the truth as agreed upon by humans: that something is this type of thing, and that it relates to these other types of things. So if you can feed that into your AI, you get the best of both worlds.
Computers and humans speak different languages. Computers understand strings of characters, while humans understand concepts and ideas.
Ontologies act as translators between these languages. They define what things are and how they’re connected, making it easier for computers to understand human concepts.
So the first suggestion is to enhance all data with the ontologies specific to the subject matter, that is, to enrich and index the data against the terms defined in those ontologies.
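A minimal sketch of this enrichment-and-indexing step is shown below; the mini-ontology and document are invented for illustration, and a real pipeline would tag text against full vocabularies such as MeSH and HGNC.

```python
from collections import defaultdict

# Illustrative mini-ontology: concept ID -> labels and synonyms.
ONTOLOGY = {
    "MESH:D020388": ["duchenne muscular dystrophy", "duchenne disease"],
    "HGNC:2928":    ["dmd", "dystrophin", "mrx85"],
}

def annotate(text: str) -> set[str]:
    """Return the concept IDs whose labels or synonyms occur in the text."""
    lowered = text.lower()
    return {cid for cid, labels in ONTOLOGY.items() if any(l in lowered for l in labels)}

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each concept ID to the set of document IDs annotated with it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for concept in annotate(text):
            index[concept].add(doc_id)
    return index

docs = {"doc-1": "Duchenne muscular dystrophy is caused by mutations in the dystrophin gene."}
print(dict(build_index(docs)))
# e.g. {'MESH:D020388': {'doc-1'}, 'HGNC:2928': {'doc-1'}}
```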
A user’s question likewise contains scientific concepts and relationships. The second suggestion is to use ontologies to analyze and structure the natural language question. In doing so, we not only extract the scientific concepts but also begin to discern the context of the question being asked.
“What are the effects of mutation in DMD?”
Ontology enrichment identifies one key element and two secondary elements in this question: DMD from HGNC (ID 2928), plus Mutation and Effect from Bioverb. For the retrieval mechanism, this implies that candidate documents must contain HGNC:2928 (and should discuss its “mutation” and “effects”). If the retrieval mechanism fails to locate relevant information, it is preferable to report “no answer found” than to generate a response about a different, similar-sounding gene.
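Here is a hypothetical sketch of that question analysis, with invented label lists standing in for the HGNC and Bioverb vocabularies.

```python
# Illustrative lookup table: concept ID -> surface forms and role in the question.
QUESTION_ONTOLOGY = {
    "HGNC:2928":        {"labels": {"dmd", "dystrophin", "mrx85"},       "role": "key"},
    "BIOVERB:mutation": {"labels": {"mutation", "mutations", "mutated"}, "role": "secondary"},
    "BIOVERB:effect":   {"labels": {"effect", "effects"},                "role": "secondary"},
}

def enrich_question(question: str) -> dict[str, str]:
    """Return the concept IDs found in the question, with their role."""
    words = set(question.lower().replace("?", "").split())
    return {cid: info["role"]
            for cid, info in QUESTION_ONTOLOGY.items()
            if words & info["labels"]}

print(enrich_question("What are the effects of mutation in DMD?"))
# {'HGNC:2928': 'key', 'BIOVERB:mutation': 'secondary', 'BIOVERB:effect': 'secondary'}
```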
“What gene is associated with Duchenne disease?”
The candidate documents must contain Duchenne muscular dystrophy (MeSH ID D020388) and its association with a scientific concept belonging to the HGNC ontology.
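Below is a sketch of how such ontology constraints could gate retrieval; the document annotations, including the second document’s IDs, are placeholders invented for the example.

```python
def find_candidates(doc_annotations: dict[str, set[str]],
                    required_ids: set[str],
                    required_namespaces: set[str]) -> list[str]:
    """Return document IDs whose annotations satisfy the ontology constraints."""
    hits = []
    for doc_id, concepts in doc_annotations.items():
        has_ids = required_ids <= concepts
        has_namespaces = all(any(c.startswith(ns + ":") for c in concepts)
                             for ns in required_namespaces)
        if has_ids and has_namespaces:
            hits.append(doc_id)
    return hits

annotations = {
    "doc-1": {"MESH:D020388", "HGNC:2928"},  # discusses the disease and the DMD gene
    "doc-2": {"MESH:D000001", "HGNC:0001"},  # placeholder IDs for an unrelated disease/gene
}

# The question requires the Duchenne concept plus at least one HGNC gene concept.
candidates = find_candidates(annotations, {"MESH:D020388"}, {"HGNC"})
print(candidates or "no answer found")  # ['doc-1']
```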
One important factor to keep in mind is that the effectiveness of an ontology-based retrieval mechanism relies heavily on the quality of the ontologies designed specifically for the domain of the questions being asked.
Ontology enrichment, applied to documents and to user questions before documents are retrieved, makes the system transparent to users in several ways. First, it shows them which concepts identified in their natural language question are considered crucial for locating candidate documents, so they understand which aspects of their query are being focused on.
Additionally, ontology enrichment surfaces the sentences from candidate documents that contain potential answers, together with the identified concepts that align with the query, allowing users to see the relevant information directly. The system can also explain why it deems those sentences relevant to the question posed, giving users insight into the reasoning behind the retrieved results. Overall, ontology enrichment enhances transparency by providing users with detailed information and explanations throughout the document retrieval process.
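One possible shape for such a transparent result is sketched below; the field names and example values are illustrative, not SciBite’s actual response format.

```python
from dataclasses import dataclass, field

@dataclass
class TransparentResult:
    question_concepts: dict[str, str]        # concept ID -> role detected in the question
    matched_sentences: list[str]             # candidate sentences pulled from documents
    explanations: list[str] = field(default_factory=list)  # why each sentence was kept

result = TransparentResult(
    question_concepts={"HGNC:2928": "key", "BIOVERB:effect": "secondary"},
    matched_sentences=[
        "Mutations in the dystrophin gene (DMD) lead to progressive muscle degeneration."
    ],
    explanations=[
        "Sentence is annotated with HGNC:2928 and discusses the effects of its mutation."
    ],
)
print(result.question_concepts)
```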
In the next part, we’ll explore how ontologies enhance the retrieval mechanism and make it more cost-effective. We’ll also introduce other design decisions the SciBite team made during application development, aimed not just at meeting the requirements but at surpassing them. Stay tuned for more updates!
Harpreet is the Director of Technical Sales at SciBite, a leading data-first, semantic analytics software company. With a strong background in data management and analytics, Harpreet has played a vital role in assisting numerous organizations in implementing knowledge graphs, from data preparation to visualization to gaining insights.