Within the life sciences, evidence-based decision-making is imperative; wrong decisions can have dire consequences. As such, it is vital that systems that support the generation and validation of hypotheses provide direct links, or provenance, to the data that was used to generate them. But how can one implement such a workflow?
An LLM is a sophisticated, generative artificial intelligence (AI) model designed to understand and generate human-like text. Trained on vast amounts of data, LLMs are designed to produce coherent and contextually relevant responses. They are, therefore, great at language-based tasks that allow them to draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.
LLMs’ limitations in search tasks are well-documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance and auditability/reproducibility, to name a few. Approaches to mitigate some of these issues fall under the banner of AI grounding. Grounding AI involves giving LLMs access to use case-specific information that is not inherently part of their training data, and is typically done via fine-tuning or the Retrieval Augmented Generation (RAG) architecture.
With the former being expensive (both computationally and in time), RAG is the widely accepted grounding approach. But what is RAG?
Retrieval augmented generation involves sharing a question with an LLM, along with a set of contextually relevant documents that are likely to contain the answer. This restricts the model’s ability to hallucinate, enables up-to-date data to be reviewed, and allows provenance to be provided… but there is one vital aspect of a RAG implementation: the information retrieval step.
This is the step that identifies the most relevant set of documents likely to contain the answer to a question; it’s the old adage – ‘garbage in, garbage out’.
Figure 1: LLM Q&A vs RAG architecture. A. A user asks a question directly of an LLM, and the response is generated by the LLM alone. B. A user asks a question of a RAG system; contextual information is gathered from a relevant knowledgebase before being passed to an LLM along with the question. The LLM summarises the response, and references can also be served to the end user.
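To make the flow in panel B concrete, here is a minimal sketch of a RAG loop. The `retrieve` and `ask_llm` callables are hypothetical stand-ins for whichever IR system and LLM client you use, and the prompt wording is purely illustrative.

```python
from typing import Callable

def answer_with_provenance(
    question: str,
    retrieve: Callable[[str], list[dict]],  # IR step: returns chunks with 'id' and 'text'
    ask_llm: Callable[[str], str],          # LLM call of your choice (assumed, not prescribed)
) -> dict:
    """Retrieve context, ask the LLM to answer from it only, and keep provenance."""
    chunks = retrieve(question)
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    prompt = (
        "Answer the question using ONLY the context below, "
        "citing the [id] of each passage you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"answer": ask_llm(prompt), "references": [c["id"] for c in chunks]}
```

The key point is that the answer is constrained to the retrieved context, and the chunk identifiers travel with the response so the end user can audit it.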
Successful RAG systems hinge on precise Information Retrieval (IR). Systems often use embeddings to indicate relatedness during IR, but embeddings do not capture the crucial “how”, which is vital to support explainable decisions (see Figure 2). In the life sciences domain, the consensus is that lexical/ontological or hybrid approaches (the latter combining vector and ontological IR) outperform purely vector-based methods.
But let’s have a look at that in a bit more detail…
Figure 2: Information retrieval methods are typically either vector or lexical/ontology-based.
Vector-based information retrieval depends on embedding models. In this scenario, a corpus of literature is chunked, the chunks are passed through an embedding model that converts them to numerical arrays, and these arrays are stored in a vector database. Questions are then passed through the same embedding model, converting them to the same numerical representation as the corpus.
Using similarity metrics, such as cosine similarity, document chunks are returned from the vector database based on how close they are in the n-dimensional vector space to the embedded representation of the query. This approach allows relatedness to be calculated, but because embeddings are scientifically naïve, it cannot explain the why. There are also heavy operational costs associated with such IR approaches, including creating embeddings, storing them, and querying over them.
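As a rough illustration of the similarity step, the sketch below scores stored chunk embeddings against an embedded query using cosine similarity. In practice a vector database performs this search at scale; here the embedding vectors are assumed to have been generated already.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
                 chunk_ids: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Rank stored chunk embeddings against the embedded query and return the top k."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(zip(chunk_ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```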
On the other hand, ontology- or lexical-based IR methods are explainable and, importantly, understand science! Using named entity recognition (NER) approaches, one can identify entities (e.g., drugs, genes, indications, etc.) that are captured in textual data. Simply put, it is then possible to identify which entities occur in the question and pull back all documents that mention those entities, or closely related ones (read: child/parent terms).
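A toy sketch of this idea is shown below. The ontology fragment, document annotations, and function names are invented for illustration and are not SciBite’s implementation; in practice the entities would come from an NER pipeline and a curated ontology.

```python
# Hypothetical fragment of a disease ontology: parent term -> child terms.
ONTOLOGY_CHILDREN = {
    "liver disease": ["non-alcoholic steatohepatitis", "hepatitis B", "cirrhosis"],
}

# Entities the NER step found in each document (illustrative IDs and terms).
DOC_ANNOTATIONS = {
    "PMID:1": {"non-alcoholic steatohepatitis", "obesity"},
    "PMID:2": {"asthma", "salbutamol"},
}

def expand(entities: set[str]) -> set[str]:
    """Add child terms so a query about 'liver disease' also matches NASH."""
    expanded = set(entities)
    for entity in entities:
        expanded.update(ONTOLOGY_CHILDREN.get(entity, []))
    return expanded

def retrieve(question_entities: set[str]) -> list[str]:
    """Return documents that mention any query entity or a closely related term."""
    targets = expand(question_entities)
    return [doc for doc, entities in DOC_ANNOTATIONS.items() if entities & targets]

print(retrieve({"liver disease"}))  # -> ['PMID:1']
```

Because the match is made on named ontology terms, the system can report exactly which entity (and which parent/child relationship) caused a document to be returned.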
It is a lot easier to understand why a document that mentions non-alcoholic steatohepatitis is returned by ontology-based IR when the question is focussed on different liver diseases. In comparison to vector-based IR approaches, ontology-based approaches are also a lot less costly in terms of generation, storage, and querying.
Furthermore, we are confident that ontology-based retrieval is more accurate when it comes to identifying documents most likely to contain the answer within the life sciences.
Having identified a gold-standard set of 3,217 question-document set pairs* released as part of the BioASQ 2023 task, we set about evaluating whether a vector-based, an ontology-based, or a hybrid IR approach was most performant.
For the test, the Medline 2023 base document set was used as the corpus. In the vector-based analysis, documents were chunked based on title and abstract, and embeddings were generated using ada and stored in Elasticsearch. For the ontology approach, documents were marked up using SciBite VOCabs and stored in SciBite’s semantic search tool, SciBite Search.
For vector retrieval, embeddings of the questions were also generated, and similar documents were returned based on these embeddings. For ontology retrieval, a basic heuristic for converting natural language to SSQL (SciBite Search Query Language) was developed; even the basic version, which consisted of creating an OR query out of identified entities and non-stop words, performed surprisingly well.
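As an illustration only, a heuristic along these lines might look like the sketch below. The stop-word list, the entity input, and the query syntax are stand-ins; the real pipeline produced SSQL against entities recognised by SciBite VOCabs.

```python
# Illustrative stop-word list; the real list would be far more complete.
STOP_WORDS = {"what", "which", "is", "are", "the", "of", "in", "for", "a", "an"}

def build_or_query(question: str, entities: list[str]) -> str:
    """Join recognised entities and remaining non-stop words into a simple OR query."""
    tokens = [t.strip("?.,").lower() for t in question.split()]
    keywords = [t for t in tokens if t and t not in STOP_WORDS]
    terms = sorted(set(keywords) | {e.lower() for e in entities})
    return " OR ".join(f'"{term}"' for term in terms)

print(build_or_query(
    "Which drugs are used for non-alcoholic steatohepatitis?",
    entities=["non-alcoholic steatohepatitis"],  # assumed output of an NER step
))
```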
For each approach (vector, ontology, and hybrid), the ability to identify the correct documents from the curated document set was evaluated. The ontology-based retrieval process considerably outperformed the vector-based approach. And although the hybrid brought a slight improvement over the pure ontology-based approach, it comes with massive overhead and operational costs.
*where the documents have been curated as being the most relevant when it comes to answering said question
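One common way to score such an evaluation is recall over the gold-standard question-document pairs: for each question, check how many of its curated relevant documents appear in the top-k results of a given retrieval function. The sketch below is illustrative; the `retrieve` callable and data shapes are assumptions rather than the exact scoring used.

```python
from typing import Callable

def recall_at_k(gold: dict[str, set[str]],
                retrieve: Callable[[str], list[str]],
                k: int = 10) -> float:
    """gold maps each question to its curated set of relevant document IDs."""
    hits, total = 0, 0
    for question, relevant_docs in gold.items():
        returned = set(retrieve(question)[:k])  # top-k documents from the IR system under test
        hits += len(returned & relevant_docs)
        total += len(relevant_docs)
    return hits / total if total else 0.0
```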
At SciBite, we understand data. We believe that quality foundational data management is paramount to employing the latest and greatest technologies to support data democratisation and expedite the extraction of insight. LLMs also have a role to play in lowering the barrier to entry for data exploration, but they must be grounded using contextual, use case-relevant data, for example via architectures such as RAG. Ultimately, these approaches are only as good as the data that is fed to the LLM.
As such, we have carried out systematic comparisons of vector-based and ontology-based retrieval, and the results are conclusive: together with operational considerations such as storage cost, search speed, and embedding speed, ontology-based retrieval alone provides a more compelling solution than out-of-the-box vector search.
Want to hear more about this and how we can help you get the most out of your data? Reach out and we would be happy to talk!
Leading SciBite’s data science and professional services team, Joe is dedicated to helping customers unlock the full potential of their data using SciBite’s semantic stack, spearheading R&D initiatives within the team and pushing the boundaries of what is possible. Joe’s expertise is rooted in a PhD from Newcastle University focussed on novel computational approaches to drug repositioning, built atop semantic data integration, knowledge graphs, and data mining.
Since joining SciBite in 2017, Joe has been enthused by the rapid advancements in technology, particularly within AI. Recognizing the immense potential of AI, Joe combines this cutting-edge technology with SciBite’s core technologies to craft tailored, bespoke solutions that cater to diverse customer needs.