The latest acronym from the world of machine learning, RAG (retrieval augmented generation), has become ubiquitous in the past few months. RAG involves using some method to collate trusted data so that a generative language model (e.g. GPT) is less likely to hallucinate information when asked to answer questions. So, for example, if we want to ask GPT which drug targets a certain gene, we might prefer that this information comes from biomedical literature or a knowledge graph, rather than trusting GPT’s memory alone. In other words, we expect GPT to cite its sources like everybody else.
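To make the pattern concrete, here is a minimal sketch of a RAG loop in Python. The `retrieve` and `generate` functions are placeholders for any retrieval backend and any chat-style model call; the point is simply that the model answers from supplied evidence rather than from memory alone.

```python
# Minimal RAG loop: ground the model's answer in retrieved evidence.
# `retrieve` and `generate` are placeholders for any retrieval backend
# and any chat-completion style LLM call.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the k most relevant passages from a trusted corpus (placeholder)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Call a generative language model (placeholder)."""
    raise NotImplementedError

def answer(query: str) -> str:
    evidence = retrieve(query)
    prompt = (
        "Answer the question using ONLY the evidence below, and cite it.\n\n"
        + "\n".join(f"[{i + 1}] {passage}" for i, passage in enumerate(evidence))
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)
```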
The developments of the past year have seen language models become extremely proficient at interpreting text that is provided to them. They can summarize, extract and translate at a human level, and they are universally available. As such, the G in RAG is close to being solved. At the very least, everybody is on a level playing field. Now, then, for those of us not working on frontier large language models (LLMs), the greatest gains are to be found in grounding these models with better retrieval, and ensuring that the model has access to accurate, trusted information on which to base its reasoning.
Perhaps the most popular method for retrieval these days utilizes vector similarity to match a query to an indexed dataset. You may have noticed a massive uptick in the number of articles, tutorials, and tools focused around this tech. From game developers creating more realistic non-player characters to autonomous research agents, vectors are in vogue.
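As a rough illustration of how this works, the sketch below ranks a toy document set against a query by cosine similarity. The `embed` function stands in for whatever embedding model is used; here it simply returns random vectors so the example runs end to end.

```python
import numpy as np

# Toy dense-retrieval index: embed documents once, then rank them
# against an embedded query by cosine similarity.

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per text (stand-in: random vectors)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

documents = ["Sildenafil inhibits PDE5A.", "Aspirin inhibits COX-1."]
doc_vecs = embed(documents)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

query_vec = embed(["Which drug targets PDE5A?"])[0]
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec              # cosine similarity
ranking = np.argsort(scores)[::-1]         # best match first
print([documents[i] for i in ranking])
```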
Where in the past generating useful vectors required quite a specialized skillset, the ecosystem has now developed to the extent that anyone can do it. In many cases, it’s the easiest option. But that doesn’t necessarily mean that it’s the best option.
In the life sciences, for instance, we are uniquely blessed with an array of carefully curated ontologies. By indexing documents with annotations that map to these ontologies, we can substantially outperform state of the art vector retrieval in terms of recall and precision.
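A minimal sketch of what annotation-based retrieval looks like, assuming documents have already been annotated with ontology identifiers (the CHEBI and HGNC IDs below are purely illustrative) and that incoming queries are mapped onto the same identifiers:

```python
from collections import defaultdict

# Documents indexed by the ontology concepts they mention (illustrative IDs).
annotated_docs = {
    "PMID:1": {"CHEBI:9139", "HGNC:8784"},   # a drug and a gene
    "PMID:2": {"CHEBI:15365"},               # another drug
}

# Inverted index: ontology ID -> documents mentioning it
index = defaultdict(set)
for doc_id, concept_ids in annotated_docs.items():
    for concept_id in concept_ids:
        index[concept_id].add(doc_id)

def retrieve_by_concepts(query_concepts: set[str]) -> list[str]:
    """Rank documents by how many query concepts they mention."""
    hits = defaultdict(int)
    for concept_id in query_concepts:
        for doc_id in index.get(concept_id, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

# A query such as "drugs targeting this gene" is first mapped to {"HGNC:8784"}.
print(retrieve_by_concepts({"HGNC:8784"}))   # -> ['PMID:1']
```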
Moreover, we can avoid the operational costs associated with large vector stores. OpenAI’s state-of-the-art embedding model, Ada-2, falls more than 15% short of our semantic search on mean recall. Even to get that close, we had to use embeddings of 1536 dimensions, and the problem is compounded by indexing algorithms that trade extra storage for faster retrieval. Disk is relatively cheap, but using these vectors in combination with ElasticSearch, we saw an almost 20x increase in storage requirements. That’s nothing to scoff at when dealing with hundreds of millions of documents.
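Some back-of-envelope arithmetic makes the scale clear. Assuming float32 vectors at Ada-2's 1536 dimensions, the raw vectors alone run to terabytes before any index overhead:

```python
# Rough storage estimate for dense vectors, assuming float32 and
# 1536 dimensions. Index structures add further overhead on top.
dims = 1536
bytes_per_float = 4
docs = 300_000_000                          # hundreds of millions of documents

raw_vectors_tb = dims * bytes_per_float * docs / 1e12
print(f"~{raw_vectors_tb:.1f} TB of raw vectors")   # ~1.8 TB before index overhead
```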
There are also upfront costs associated with the vectorization process itself. There’s the cost of actually feeding data through the model to be embedded, of course, and if you want the added security of using models hosted on Azure, this hosting also comes at a cost. These costs are not just monetary, but also temporal. Many organizations are throttled by rate limits, and increasing these limits costs more money, leading to a tradeoff between upfront costs and opportunity costs. These costs recur as new models are released. Some tasks that take minutes with our semantic indexing, like embedding a 1M document chunk of MEDLINE for testing retrieval performance, took days using OpenAI embeddings.
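A quick calculation shows how rate limits dominate the timeline; the request rate and batch size below are hypothetical placeholders, not actual OpenAI limits.

```python
# Illustration of why rate limits turn embedding jobs into multi-day tasks.
docs = 1_000_000
docs_per_request = 1          # hypothetical: one chunk per request
requests_per_minute = 500     # hypothetical per-account rate limit

minutes = docs / (docs_per_request * requests_per_minute)
print(f"~{minutes / 60 / 24:.1f} days at this rate")   # ~1.4 days; lower limits stretch this further
```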
Another advantage of semantic indexing is that once our documents have been retrieved, we know a lot about the contents of said documents. We know exactly which genes are mentioned alongside which drugs, and which scientific verbs link them together, and all these entities are aligned to consistent IDs. This allows for seamless integration with other knowledge sources.
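In practice, a semantically indexed document might carry a record along the following lines. The schema and identifiers are illustrative rather than our actual index format, but they show why downstream joins are straightforward: everything is already resolved to stable IDs.

```python
# Illustrative record for a semantically indexed document: entities and the
# relations between them are resolved to stable identifiers, so results can be
# joined directly against other knowledge sources.
annotated_document = {
    "doc_id": "PMID:123456",
    "entities": [
        {"text": "sildenafil", "id": "CHEBI:9139", "type": "drug"},
        {"text": "PDE5A",      "id": "HGNC:8784",  "type": "gene"},
    ],
    "relations": [
        {"subject": "CHEBI:9139", "predicate": "binds", "object": "HGNC:8784"},
    ],
}
```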
For example, if we found literature evidence that Sildenafil binds to PDE5A, we might then want to see which proteins are functionally downstream of PDE5A and whether they are associated with any diseases that might represent repurposing candidates. To recreate this annotation capability over MEDLINE using GPT-4 would rapidly accrue six-figure costs, and that is assuming each document would only need to be fed to the model once.
In reality, you would probably need to send each document multiple times with lengthy, expensive system prompts to explain your requirements. And to align annotated entities with consistent IDs would require complex agentic behavior demanding further calls to the model and would, once again, incur substantially more cost.
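Returning to the Sildenafil/PDE5A example above, here is a toy sketch of that kind of downstream hop, using networkx over a hand-built graph; the edges are illustrative, not curated facts.

```python
import networkx as nx

# Toy knowledge-graph hop: from a drug target, walk downstream proteins and
# check their disease associations. Edges are illustrative, not curated facts.
kg = nx.DiGraph()
kg.add_edge("Sildenafil", "PDE5A", relation="binds")
kg.add_edge("PDE5A", "PRKG1", relation="upstream_of")
kg.add_edge("PRKG1", "Pulmonary hypertension", relation="associated_with")

def repurposing_candidates(drug: str) -> set[str]:
    """Diseases reachable from a drug via its targets and downstream proteins."""
    diseases = set()
    for target in kg.successors(drug):
        for downstream in nx.descendants(kg, target):
            for neighbour in kg.successors(downstream):
                if kg.edges[downstream, neighbour]["relation"] == "associated_with":
                    diseases.add(neighbour)
    return diseases

print(repurposing_candidates("Sildenafil"))   # {'Pulmonary hypertension'}
```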
None of this is intended to devalue recent advances in AI. On the contrary, we can make use of these technologies in combination with our retrieval heuristic. For example, we can use GPT to identify authors, affiliations, dates and other data that don’t map onto our selection of ontologies. It can also help us to filter out purely grammatical words and to identify non-biological n-grams within incoming queries. In contrast to vector approaches, this is only necessary at query time.
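For instance, a query-time filter might look like the sketch below, where `call_llm` is a placeholder for whichever chat-completion API is available; only the query, never the corpus, reaches the model.

```python
# Query-time use of an LLM to keep only biological terms before ontology mapping.
# `call_llm` is a placeholder for whichever chat-completion API is available.

def call_llm(prompt: str) -> str:
    """Send a prompt to a generative model and return its reply (placeholder)."""
    raise NotImplementedError

def extract_biological_terms(query: str) -> list[str]:
    prompt = (
        "List only the biological entities (genes, drugs, diseases, processes) "
        f"mentioned in this query, one per line:\n{query}"
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# extract_biological_terms("papers from 2021 where sildenafil binds PDE5A")
# would be expected to yield ["sildenafil", "PDE5A"]; dates, authors and
# grammatical words are handled separately.
```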
This extends the lead of semantic search even further, and we’ve only just scratched the surface of what is possible here. We can also deploy LLMs as assistants for our subject matter expert curators, helping them to quickly arrive at relevant evidence for synonymity or relationships between entities.
Our semantic search is more performant, cheaper, greener and, being built upon a foundation of open-source ontologies, better insulated from the regulatory disruptions that may be facing LLM development.
Oliver Giles, Machine Learning Scientist, received his MSc in Synthetic Biology from Newcastle University, and his BA in Philosophy from the University of East Anglia. He is currently focused on interfacing natural language with structured data, extracting said structured data from text and on using AI for the inference of novel hypotheses.