Within the life sciences, evidence-based decision-making is imperative; wrong decisions can have dire consequences. As such, it is vital that systems that support the generation and validation of hypotheses provide direct links, or provenance, to the data that was used to generate them. But how can one implement such a workflow?
An LLM is a sophisticated, generative artificial intelligence (AI) model designed to understand and generate human-like text. Trained on vast amounts of data, LLMs are designed to produce coherent and contextually relevant responses. They are, therefore, great at language-based tasks that allow them to draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.
LLMs’ limitations in search tasks are well-documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance and auditability/reproducibility, to name a few. Approaches to mitigate some of these issues fall under the banner of AI grounding. Grounding AI involves giving LLMs access to use case-specific information that is not inherently part of their training data, and is typically done via fine-tuning or the Retrieval Augmented Generation (RAG) architecture.
With the former being expensive (both computationally and in time), RAG is the widely accepted grounding approach. But what is RAG?
Retrieval augmented generation involves sharing a question with an LLM, along with a set of contextually relevant documents that are likely to contain the answer. This restricts the model’s ability to hallucinate, enables up-to-date data to be reviewed, and allows provenance to be provided… but there is one vital aspect of a RAG implementation: the information retrieval step.
This is the step that identifies the most relevant set of documents likely to contain the answer to a question; it’s the old adage – ‘garbage in, garbage out’.
Figure 1: LLM Q&A vs RAG architecture. A. A user asks a question directly of an LLM, and the response is generated by the LLM alone. B. A user asks a question of a RAG system; contextual information is gathered from a relevant knowledgebase before being passed to an LLM along with the question. The LLM summarises the response, and references can also be served to the end user.
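To make the flow in panel B concrete, here is a minimal sketch of a RAG loop. The `retrieve` and `ask_llm` callables are hypothetical stand-ins for whichever IR system and LLM client you use, and the prompt wording is purely illustrative.

```python
from typing import Callable

def answer_with_provenance(
    question: str,
    retrieve: Callable[[str], list[dict]],  # IR step: returns chunks with 'id' and 'text'
    ask_llm: Callable[[str], str],          # LLM call of your choice (assumed, not prescribed)
) -> dict:
    """Retrieve context, ask the LLM to answer from it only, and keep provenance."""
    chunks = retrieve(question)
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    prompt = (
        "Answer the question using ONLY the context below, "
        "citing the [id] of each passage you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"answer": ask_llm(prompt), "references": [c["id"] for c in chunks]}
```

The key point is that the answer is constrained to the retrieved context, and the chunk identifiers travel with the response so the end user can audit it.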
Successful RAG systems hinge on precise Information Retrieval (IR). Systems often use embeddings to indicate relatedness during IR, but embeddings do not capture the crucial “how”, which is vital to support explainable decisions (see Figure 2). In the life sciences domain, the consensus is that lexical/ontological or hybrid approaches (the latter combining vector and ontological IR) outperform purely vector-based methods.
But let’s have a look at that in a bit more detail…
Figure 2: Information retrieval methods are typically either vector or lexical/ontology-based.
Vector-based information retrieval depends on embedding models. In this scenario, a corpus of literature is chunked, the chunks are passed through an embedding model that converts them to numerical arrays, and these arrays are stored in a vector database. Questions are then passed through the same embedding model, converting them to the same numerical representation as the corpus.
Using similarity metrics, such as cosine similarity, document chunks are returned from the vector database based on how close they are in the n-dimensional vector space to the embedded representation of the query. This approach allows relatedness to be calculated, but because embeddings are scientifically naïve, it cannot explain the why. There are also heavy operational costs associated with such IR approaches, including creating embeddings, storing them, and querying over them.
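As a rough illustration of the similarity step, the sketch below scores stored chunk embeddings against an embedded query using cosine similarity. In practice a vector database performs this search at scale; here the embedding vectors are assumed to have been generated already.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
                 chunk_ids: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Rank stored chunk embeddings against the embedded query and return the top k."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(zip(chunk_ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```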
On the other hand, ontology- or lexical-based IR methods are explainable and, importantly, understand science! Using named entity recognition (NER) approaches, one can identify entities (e.g., drugs, genes, indications, etc.) that are captured in textual data. Simply put, it is then possible to identify which entities occur in the question and pull back all documents that mention those entities, or closely related ones (read: child/parent terms).
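A toy sketch of this idea is shown below. The ontology fragment, document annotations, and function names are invented for illustration and are not SciBite’s implementation; in practice the entities would come from an NER pipeline and a curated ontology.

```python
# Hypothetical fragment of a disease ontology: parent term -> child terms.
ONTOLOGY_CHILDREN = {
    "liver disease": ["non-alcoholic steatohepatitis", "hepatitis B", "cirrhosis"],
}

# Entities the NER step found in each document (illustrative IDs and terms).
DOC_ANNOTATIONS = {
    "PMID:1": {"non-alcoholic steatohepatitis", "obesity"},
    "PMID:2": {"asthma", "salbutamol"},
}

def expand(entities: set[str]) -> set[str]:
    """Add child terms so a query about 'liver disease' also matches NASH."""
    expanded = set(entities)
    for entity in entities:
        expanded.update(ONTOLOGY_CHILDREN.get(entity, []))
    return expanded

def retrieve(question_entities: set[str]) -> list[str]:
    """Return documents that mention any query entity or a closely related term."""
    targets = expand(question_entities)
    return [doc for doc, entities in DOC_ANNOTATIONS.items() if entities & targets]

print(retrieve({"liver disease"}))  # -> ['PMID:1']
```

Because the match is made on named ontology terms, the system can report exactly which entity (and which parent/child relationship) caused a document to be returned.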
It is a lot easier to understand why a document that mentions non-alcoholic steatohepatitis is returned by ontology-based IR when the question is focussed on different liver diseases. In comparison to vector-based IR approaches, ontology-based approaches are also a lot less costly in terms of generation, storage, and querying.
Furthermore, we are confident that ontology-based retrieval is more accurate when it comes to identifying documents most likely to contain the answer within the life sciences.
Having identified a gold-standard set of 3,217 question-document set pairs* released as part of the BioASQ 2023 task, we set about evaluating whether a vector-based, an ontology-based, or a hybrid IR approach was most performant.
For the test, the Medline 2023 base document set was used as the corpus. In the vector-based analysis, documents were chunked based on title and abstract, and embeddings were generated using ada and stored in Elasticsearch. For the ontology approach, documents were marked up using SciBite VOCabs and stored in SciBite’s semantic search tool, SciBite Search.
For vector retrieval, embeddings of the questions were also generated, and similar documents were returned based on these embeddings. For ontology retrieval, a basic heuristic for converting natural language to SSQL (SciBite Search Query Language) was developed; even the basic version, which consisted of creating an OR query out of identified entities and non-stop words, performed surprisingly well.
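As an illustration only, a heuristic along these lines might look like the sketch below. The stop-word list, the entity input, and the query syntax are stand-ins; the real pipeline produced SSQL against entities recognised by SciBite VOCabs.

```python
# Illustrative stop-word list; the real list would be far more complete.
STOP_WORDS = {"what", "which", "is", "are", "the", "of", "in", "for", "a", "an"}

def build_or_query(question: str, entities: list[str]) -> str:
    """Join recognised entities and remaining non-stop words into a simple OR query."""
    tokens = [t.strip("?.,").lower() for t in question.split()]
    keywords = [t for t in tokens if t and t not in STOP_WORDS]
    terms = sorted(set(keywords) | {e.lower() for e in entities})
    return " OR ".join(f'"{term}"' for term in terms)

print(build_or_query(
    "Which drugs are used for non-alcoholic steatohepatitis?",
    entities=["non-alcoholic steatohepatitis"],  # assumed output of an NER step
))
```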
For each approach (vector, ontology, and hybrid), the ability to identify the correct documents from the curated document set was evaluated. The ontology-based retrieval process considerably outperformed the vector-based approach. And although the hybrid brought a slight improvement over the pure ontology-based approach, it comes with massive overhead and operational costs.
*where the documents have been curated as being the most relevant when it comes to answering said question
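One common way to score such an evaluation is recall over the gold-standard question-document pairs: for each question, check how many of its curated relevant documents appear in the top-k results of a given retrieval function. The sketch below is illustrative; the `retrieve` callable and data shapes are assumptions rather than the exact scoring used.

```python
from typing import Callable

def recall_at_k(gold: dict[str, set[str]],
                retrieve: Callable[[str], list[str]],
                k: int = 10) -> float:
    """gold maps each question to its curated set of relevant document IDs."""
    hits, total = 0, 0
    for question, relevant_docs in gold.items():
        returned = set(retrieve(question)[:k])  # top-k documents from the IR system under test
        hits += len(returned & relevant_docs)
        total += len(relevant_docs)
    return hits / total if total else 0.0
```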
At SciBite, we understand data. We believe that quality foundational data management is paramount to employing the latest and greatest technologies to support data democratisation and expedite the extraction of insight. LLMs also have a role to play in lowering the barrier to entry for data exploration, but they must be grounded using contextual, use case-relevant data, for example via architectures such as RAG. Ultimately, these approaches are only as good as the data that is fed to the LLM.
As such, we have carried out systematic comparisons of vector-based and ontology-based retrieval, and the results are conclusive: together with operational considerations such as storage cost, search speed, and embedding speed, ontology-based retrieval alone provides a more compelling solution than out-of-the-box vector search.
Want to hear more about this and how we can help you get the most out of your data? Reach out and we would be happy to talk!
Leading SciBite’s data science and professional services team, Joe is dedicated to helping customers unlock the full potential of their data using SciBite’s semantic stack, spearheading R&D initiatives within the team and pushing the boundaries of what is possible. Joe’s expertise is rooted in a PhD from Newcastle University focussed on novel computational approaches to drug repositioning, built atop semantic data integration, knowledge graphs, and data mining.
Since joining SciBite in 2017, Joe has been enthused by the rapid advancements in technology, particularly within AI. Recognizing the immense potential of AI, Joe combines this cutting-edge technology with SciBite’s core technologies to craft tailored, bespoke solutions that cater to diverse customer needs.