AI-based chat application for life sciences:
Part I: key considerations

Are your teams now posing potentially confidential questions to consumer tools such as Bard and ChatGPT, relying on their responses? Or have you noticed a slowdown in your research process due to information overload, hindering the ability to swiftly identify critical findings?

Whatever the reason, you’ve acknowledged the pressing need for a dedicated AI-based chat application for your teams. If this scenario resonates with you, allow us to guide you to the next level.

What are the essential requirements that must be met by such an application?

  1. Accuracy: Before your researchers commit additional resources, the application must provide accurate answers to their questions. In the life sciences, medicine, and clinical domains in particular, such a system can be relied upon only if it provides accurate and current information. Known limitations of Large Language Models (LLMs), such as hallucinations and bias, can lead to inaccuracies; can your application effectively address these challenges?
  2. Provenance: Even if the summarized answer is accurate, it may still be insufficient. Can you justify conducting research solely on the grounds that “the LLM said so”? The application should enable you to trace answers back to the reference documents, whether external or internal. Evidence-based decision-making is paramount in life sciences, where wrong decisions can have dire consequences.
  3. Transparency: Let’s push for more. Why settle for just a list of reference documents? The application should provide insights into why it considers the given reference documents relevant to the search and highlight which sections of the documents contain evidence that contributed to the answer.
  4. Domain expertise: Data in the life sciences can be intricate and rife with ambiguities. The application must navigate this complexity without becoming entangled in subtle nuances, terminologies, abbreviations, and similar intricacies.
  5. Dynamic source selection: Your question might pertain to a competitor, internal research, a patent, or a clinical trial – relying on a single document source may not yield all the answers you seek. To cater to a broader range of users, an application must support a diverse array of data sources and possess the capability to dynamically switch between them based on the nature of the questions being posed.
  6. Security & privacy: Regardless of whether it concerns the confidentiality of the questions posed or the user’s access level to the document housing the answer, the application must uphold data privacy and respect the user’s access permissions.
  7. Operational efficiency: Generative AI is in high demand, and one doesn’t need expertise in economics to grasp the substantial operational and computing costs associated with running LLMs. To remain sustainable and up to date, the application must strike a delicate balance, meeting most of these requirements without depleting all available funds.
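
Requirement 5 above, dynamic source selection, can be sketched as a lightweight question router. This is a hypothetical illustration, not SciBite’s implementation: the source names and keyword cues are invented, and a production system would more likely use a classifier or an LLM to route questions.

```python
# Hypothetical sketch: route a question to the most relevant document
# source based on simple keyword cues. Source names and cues are
# illustrative, not a real product configuration.

SOURCES = {
    "patents": ["patent", "claim", "filing"],
    "clinical_trials": ["trial", "phase", "enrollment"],
    "internal_research": ["internal", "our assay", "project"],
}

def route_question(question: str) -> str:
    """Return the name of the document source to query first."""
    q = question.lower()
    for source, cues in SOURCES.items():
        if any(cue in q for cue in cues):
            return source
    return "literature"  # default fallback: published literature

print(route_question("Which phase 2 trials target EGFR?"))  # prints "clinical_trials"
```

A real router would also handle questions that span several sources, querying each and merging the results.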

In addition to these requirements, any application striving to thrive in today’s world must meet a minimum standard, which is already set quite high, for user experience (UX), performance, and availability.

What do we know about the technology?

Before we attempt to meet these requirements, let’s pause to understand the strengths and weaknesses of LLMs. LLMs essentially rely on statistical probabilities derived from extensive training data, determining the likelihood of word sequences within sentences. Consequently, if the training data lacks an answer to a query, the model resorts to generating sentences based solely on these statistics, resulting in nonsensical outputs or “hallucinations.”

Moreover, if the training data contains inherent biases, the generated answers are prone to reflecting those biases. Additionally, since the model is trained on data without preserving its sources, it lacks the technical capability to provide source links for generated responses.

Given that the training data for LLMs comprises essentially all text accessible on the internet, retraining an LLM so that it consistently incorporates the latest information is an exceedingly costly endeavor.

Nevertheless, LLMs excel in summarizing text, generating content, and interpreting human language.
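
To make the hallucination point above concrete, here is a deliberately toy illustration (not a real LLM): the model ranks possible next tokens by learned probability and emits the likeliest one, so it always produces a fluent continuation, whether or not it is grounded in fact. The probabilities below are made up.

```python
# Toy model of next-token prediction. A language model assigns
# probabilities to candidate continuations and emits the most likely one.
# If the training data held no real answer, the statistically likeliest
# continuation can still be fluent yet wrong: a "hallucination".

def next_token(context: str, probs: dict[str, float]) -> str:
    """Pick the most probable continuation for the given context."""
    return max(probs, key=probs.get)

# Probabilities the model "learned" from training data (invented here):
learned = {"aspirin": 0.55, "ibuprofen": 0.30, "paracetamol": 0.15}

# The model always answers, grounded or not:
print(next_token("The trial's lead compound was", learned))  # prints "aspirin"
```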

Can we meet these requirements?

At SciBite, our teams conducted rigorous experiments involving various flavors of LLMs, vector-based retrieval, ontology-based retrieval, and hybrid approaches. We also integrated ontology enrichment at different stages of the question-answer flow. The culmination of these efforts is an AI chat application that fulfills all of the above requirements.

As an advantage, the application renders the answer-generation process entirely transparent. It utilizes ontologies to provide clarity on how and why results were identified. It maintains not only a list of relevant documents but also the segments of those documents used for answer generation, along with an explanation of why it considers them to contain the answers. This stands in contrast to systems that operate as black boxes and cannot offer this level of transparency.

Use of ontologies also facilitates structuring the natural language question, ensuring reproducibility and allowing questions to be saved or approved.

In the next part, I will explore how ontology-based enrichment at various stages addresses gaps in a retrieval-augmented generation (RAG) application and enhances its accuracy, reliability, and efficiency.
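
The flow described above can be sketched in miniature. Everything here is hypothetical: the ontology entries, the two-document corpus, and the concatenation standing in for an LLM summarization call are all invented for illustration. The point is the shape of the pipeline: normalize question terms to ontology concepts, retrieve passages via any synonym of those concepts, and return the answer together with its provenance.

```python
# Minimal sketch of a RAG flow with ontology-based synonym expansion and
# provenance. Ontology entries, corpus, and the "summarization" step are
# hypothetical; a real system would use curated ontologies, a vector
# index, and an LLM call.

ONTOLOGY = {  # synonym -> canonical concept ID (CHEBI ID shown as an example)
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
}

CORPUS = [
    {"doc": "paper-001", "text": "Aspirin inhibits COX-1 and COX-2."},
    {"doc": "paper-002", "text": "Ibuprofen is a propionic acid derivative."},
]

def normalise(question: str) -> set[str]:
    """Map mentions in the question to canonical ontology concept IDs."""
    q = question.lower()
    return {cid for term, cid in ONTOLOGY.items() if term in q}

def retrieve(question: str) -> list[dict]:
    """Return passages mentioning any synonym of the asked-about concepts."""
    concepts = normalise(question)
    synonyms = [t for t, cid in ONTOLOGY.items() if cid in concepts]
    return [p for p in CORPUS if any(s in p["text"].lower() for s in synonyms)]

def answer(question: str) -> dict:
    passages = retrieve(question)
    return {
        # A real system would have an LLM summarize the passages here.
        "answer": " ".join(p["text"] for p in passages),
        "sources": [p["doc"] for p in passages],  # provenance
    }

print(answer("What does acetylsalicylic acid inhibit?"))
```

Note how the question uses “acetylsalicylic acid” while the evidence says “aspirin”: the ontology bridges the terminology gap, and the returned `sources` list is what makes the answer traceable.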


About Harpreet Singh Riat

Director of Technical Sales, SciBite

Harpreet is the Director of Technical Sales at SciBite, a leading data-first, semantic analytics software company. With a strong background in data management and analytics, Harpreet has played a vital role in assisting numerous organizations in implementing knowledge graphs, from data preparation to visualization to gaining insights.


Related articles

  1. Are ontologies still relevant in the age of LLMs?

    Technological advancements exhibit varying degrees of longevity. Some are tried and trusted, enduring longer than others, while other technologies succumb to fleeting hype without attaining substantive fruition. One constant in this dynamic landscape is the data.

  2. Large language models (LLMs) and search; it’s a FAIR game

    Large language models (LLMs) have limitations when applied to search due to their inability to distinguish between fact and fiction, potential privacy concerns, and provenance issues. LLMs can, however, support search when used in conjunction with FAIR data and could even support the democratisation of data, if used correctly…


How could the SciBite semantic platform help you?

Get in touch with us to find out how we can transform your data

Contact us