What is Retrieval Augmented Generation and why is the data you feed it so important?


Within the life sciences, evidence-based decision-making is imperative; wrong decisions can have dire consequences. As such, it is vital that systems that support the generation and validation of hypotheses provide direct links, or provenance, to the data that was used to generate them. But how can one implement such a workflow?


What is an LLM?

An LLM, or large language model, is a sophisticated, generative artificial intelligence (AI) model designed to understand and generate human-like text. Trained on monumental amounts of data, LLMs are designed to generate coherent and contextually relevant responses. LLMs are, therefore, great at language-based tasks that allow them to draw on their learned knowledge of textual patterns, such as summarization, generation, aggregation, translation, and programming assistance.

Limitations of LLMs in the context of search

LLMs’ limitations in search tasks are well-documented: hallucinations, outdated information, security and privacy concerns, and a lack of provenance and auditability/reproducibility, to name a few. Approaches to mitigate some of these issues fall under the banner of AI grounding. Grounding involves giving LLMs access to use case-specific information that is not inherently part of their training data; this is typically done via fine-tuning or via the Retrieval Augmented Generation (RAG) architecture.

With the former being expensive (both computationally and in time), RAG is the widely accepted grounding approach. But what is RAG?

What is retrieval augmented generation?

Retrieval augmented generation involves sharing a question with an LLM along with a set of contextually relevant documents that are likely to contain the answer. This restricts the model’s ability to hallucinate, enables up-to-date data to be reviewed, and allows provenance to be provided… but there is one vital aspect of a RAG implementation: the information retrieval step.

This is the step that identifies the most relevant set of documents likely to contain the answer to a question; the old adage ‘garbage in, garbage out’ applies.
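The retrieve-then-augment flow can be sketched in a few lines. Everything here is illustrative: the toy corpus, the word-overlap scoring (a stand-in for a real retriever), and the prompt format are assumptions, not a real implementation.

```python
# Minimal sketch of the RAG flow: retrieve relevant documents,
# then build an augmented prompt carrying question + context.
# Corpus contents and the naive overlap scorer are illustrative only.

corpus = {
    "doc1": "Metformin is a first-line treatment for type 2 diabetes.",
    "doc2": "Statins lower LDL cholesterol and cardiovascular risk.",
}

def retrieve(question, corpus, k=1):
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, hits):
    """Prefix each chunk with its document ID so provenance survives."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Answer using only the context below.\n{context}\nQ: {question}"

question = "What is a first-line treatment for type 2 diabetes?"
hits = retrieve(question, corpus)
prompt = build_prompt(question, hits)
```

The document IDs embedded in the prompt are what let a RAG system hand references back to the end user.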


Figure 1: LLM Q&A vs. RAG architecture. A: a user asks a question directly of an LLM, and the response is generated by the LLM alone. B: a user asks a question of a RAG system; contextual information is gathered from a relevant knowledgebase and passed to an LLM along with the question. The LLM summarises the response, and references can also be served to the end user.

Successful RAG systems hinge on precise information retrieval (IR). Systems often use embeddings to indicate relatedness during IR, but embeddings do not cover the crucial “how”, which is vital to support explainable decisions (see Figure 2). In the life sciences domain, the consensus is that lexical/ontological or hybrid approaches (which combine vector and ontological IR) excel over purely vector-based methods.

But let’s have a look at that in a bit more detail…


Figure 2: Information retrieval methods are typically either vector or lexical/ontology-based.

Contrasting Vector-based and Ontology-based information retrieval

Vector-based information retrieval depends on embedding models. A corpus of literature is chunked, the chunks are passed through an embedding model that converts them to numerical arrays, and those arrays are stored in a vector database. Questions are then passed through the same embedding model, converting them to the same numerical representation as the corpus.

Using similarity metrics such as cosine similarity, document chunks are returned from the vector database based on how close they sit to the embedded representation of the query in the n-dimensional vector space. This approach allows relatedness to be calculated, but because embeddings are scientifically naïve it cannot explain the why. There are also heavy operational costs associated with such IR approaches: creating embeddings, storing them, and querying over them.
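The similarity step described above can be sketched with plain cosine similarity over toy vectors. The three-dimensional “embeddings” below are made up for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional "embeddings" for two document chunks (illustrative).
chunks = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.05]

# Rank chunks by closeness to the query in the vector space.
ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
```

Note that the ranking tells you *that* `chunk_a` is closer to the query, but nothing in the numbers explains *why*, which is the explainability gap discussed above.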

Ontology- or lexical-based IR methods, on the other hand, are explainable and, importantly, understand science! Using named entity recognition (NER) approaches, one can identify entities (e.g., drugs, genes, indications) captured in textual data. Simply put, it is then possible to identify the entities that occur in a question and pull back all documents that mention those entities, or closely related ones (read: child/parent terms).

It is a lot easier to understand why ontology-based IR returns a document mentioning non-alcoholic steatohepatitis when the question is focused on liver diseases. Compared to vector-based IR approaches, ontology-based approaches are also a lot less costly in terms of generation, storage, and querying.
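The child/parent expansion described above can be sketched as follows. The tiny ontology, document texts, and exact-substring matching are all illustrative stand-ins for real vocabularies and NER markup.

```python
# Sketch of ontology-aware retrieval: expand query entities with
# their child terms before matching documents. Ontology and corpus
# contents are illustrative, not a real vocabulary.

ontology_children = {
    "liver disease": ["non-alcoholic steatohepatitis", "cirrhosis"],
}

docs = {
    "d1": "Non-alcoholic steatohepatitis progresses silently in many patients.",
    "d2": "Beta blockers are widely used in hypertension.",
}

def expand(entities):
    """Add child terms from the ontology to the query entity set."""
    expanded = set(entities)
    for entity in entities:
        expanded.update(ontology_children.get(entity, []))
    return expanded

def retrieve(question_entities, docs):
    """Return documents mentioning any query entity or child term."""
    terms = expand(question_entities)
    return [
        doc_id for doc_id, text in docs.items()
        if any(term in text.lower() for term in terms)
    ]
```

Here a query about “liver disease” pulls back the steatohepatitis document, and the match is directly explainable: the ontology says the mentioned term is a child of the query entity.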

Furthermore, we are confident that ontology-based retrieval is more accurate when it comes to identifying documents most likely to contain the answer within the life sciences.

And how have you evaluated this?

Having identified a gold standard set of 3,217 question–document set pairs* released as part of the BioASQ 2023 task, we set about evaluating whether a vector-based, an ontology-based, or a hybrid IR approach was most performant.

Document preparation

For the test, the MEDLINE 2023 baseline document set was used as the corpus. In the vector-based analysis, documents were chunked by title and abstract, and embeddings were generated using ada and stored in Elasticsearch. For the ontology approach, documents were marked up using SciBite VOCabs and stored in SciBite’s semantic search tool, SciBite Search.

Query preparation

For vector retrieval, embeddings of the questions were also generated, and similar documents were returned based on these embeddings. For ontology retrieval, a basic heuristic for converting natural language to SSQL (SciBite Search Query Language) was developed; even the basic version, which consisted of creating an OR query out of identified entities and non-stop words, performed surprisingly well.
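The basic heuristic described above can be sketched as follows. The stop-word list and entity list are illustrative, and the output approximates an SSQL OR query as a simple quoted-term expression rather than real SSQL syntax.

```python
# Sketch of the NL-to-query heuristic: drop stop words, keep
# recognised entities plus remaining terms, join with OR.
# Stop list, entities, and query syntax are illustrative.

STOP_WORDS = {"what", "is", "the", "a", "an", "of", "in", "for", "and"}

def to_or_query(question, entities):
    """Build an OR query from entities and non-stop-word tokens."""
    tokens = [t.strip("?.,").lower() for t in question.split()]
    terms = [t for t in tokens if t and t not in STOP_WORDS]
    # Entities first, then leftover terms; dedupe while keeping order.
    keep = list(dict.fromkeys(entities + terms))
    return " OR ".join(f'"{t}"' for t in keep)

query = to_or_query(
    "What is the role of BRCA1 in breast cancer?",
    ["brca1", "breast cancer"],
)
```

Even a blunt disjunction like this benefits from NER: multi-word entities such as “breast cancer” are kept as single searchable units instead of being split into generic tokens.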

For each approach (vector, ontology, and hybrid), the ability to identify the correct documents from the curated document set was evaluated. The ontology-based retrieval process considerably outperformed the vector-based approach. And although the hybrid brought a slight improvement over the pure ontology-based approach, this came with massive overhead and operational costs.
*where the documents have been curated as the most relevant for answering said question
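One common way to score retrieval against a gold standard of this shape is recall@k: for each question, the fraction of curated relevant documents that appear in the top-k retrieved set. The sketch below uses made-up gold and retrieved data; the metric choice itself is an illustration, as the post does not specify which measure was used.

```python
# Sketch of evaluating a retriever against curated question-document
# pairs using recall@k. Gold and retrieved data are illustrative.

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant docs found in the top-k retrieved docs."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

gold = {
    "q1": ["pmid1", "pmid2"],
    "q2": ["pmid3"],
}
retrieved = {
    "q1": ["pmid1", "pmid9", "pmid2"],
    "q2": ["pmid8", "pmid7"],
}

mean_recall = sum(
    recall_at_k(retrieved[q], gold[q]) for q in gold
) / len(gold)
```

Running the same questions through each retriever (vector, ontology, hybrid) and comparing mean recall is one straightforward way to make the comparison head-to-head.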

Concluding thoughts

At SciBite, we understand data. We believe that quality foundational data management is paramount to employing the latest and greatest technologies to support data democratisation and expedite the extraction of insight. LLMs also have a role to play in lowering the barrier to entry for data exploration, but they must be grounded using contextual, use case-relevant data, for example via architectures such as RAG. Ultimately, these approaches are only as good as the data being fed to the LLM.

As such, we have carried out systematic comparisons of vector-based and ontology-based retrieval, and the results are conclusive: together with operational considerations such as storage cost, search speed, and embedding speed, ontology-based retrieval alone provides a more compelling solution than out-of-the-box vector search.

Want to hear more about this and how we can help you get the most out of your data? Reach out and we would be happy to talk!


About Joe Mullen

Director of Data Science & Professional Services, SciBite

Joe Mullen, Director of Data Science & Professional Services, holds a Ph.D. from Newcastle University in the development of computational approaches to drug repositioning, with a focus on semantic data integration and data mining. He has been with SciBite since 2017, initially as part of the Data Science team.
