With new tech come new operationalization considerations

The latest acronym from the world of machine learning, RAG (retrieval augmented generation), has become ubiquitous in the past few months. RAG involves using some method to collate trusted data so that a generative language model (e.g. GPT) is less likely to hallucinate information when asked to answer questions. So, for example, if we want to ask GPT which drug targets a certain gene, we might prefer that this information comes from biomedical literature or a knowledge graph, rather than trusting GPT’s memory alone. In other words, we expect GPT to cite its sources like everybody else.
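To make the pattern concrete, here is a minimal sketch of a RAG loop, assuming the OpenAI Python client; `retrieve_evidence` is a hypothetical stand-in for whatever trusted retrieval step sits in front of the model (a literature search, a knowledge graph query, a semantic index), and the model name is illustrative.

```python
# Minimal RAG sketch: ground the model's answer in retrieved evidence.
# `retrieve_evidence` is a hypothetical placeholder for a trusted retrieval step.
from openai import OpenAI

client = OpenAI()

def retrieve_evidence(question: str) -> list[str]:
    """Return trusted snippets relevant to the question (placeholder)."""
    return ["Sildenafil is a selective inhibitor of PDE5A (illustrative snippet)."]

def answer_with_rag(question: str) -> str:
    evidence = "\n".join(f"- {s}" for s in retrieve_evidence(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer using only the evidence provided, and cite the snippets you used."},
            {"role": "user",
             "content": f"Evidence:\n{evidence}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_with_rag("Which gene does sildenafil target?"))
```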

The developments of the past year have seen language models become extremely proficient at interpreting text that is provided to them. They can summarize, extract and translate at a human level, and they are universally available. As such, the G in RAG is close to being solved. At the very least, everybody is on a level playing field. For those of us not working on frontier large language models (LLMs), then, the greatest gains are to be found in grounding these models with better retrieval, and ensuring that the model has access to accurate, trusted information on which to base its reasoning.

Perhaps the most popular method for retrieval these days utilizes vector similarity to match a query to an indexed dataset. You may have noticed a massive uptick in the number of articles, tutorials, and tools focused on this technology. From game developers creating more realistic non-player characters to autonomous research agents, vectors are in vogue.
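For reference, a typical vector-retrieval setup looks something like the sketch below: embed the documents once, then rank them against an embedded query by cosine similarity. The sentence-transformers model name is illustrative rather than a recommendation.

```python
# Sketch of vector retrieval: embed documents once, rank by cosine similarity at query time.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

documents = [
    "Sildenafil is a selective inhibitor of phosphodiesterase 5 (PDE5A).",
    "Imatinib targets the BCR-ABL fusion protein in chronic myeloid leukaemia.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 1) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity, since vectors are unit length
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("Which drug inhibits PDE5A?"))
```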

Vectors are in vogue

Where in the past generating useful vectors required quite a specialized skillset, the ecosystem has now developed to the extent that anyone can do it. In many cases, it’s the easiest option. But that doesn’t necessarily mean that it’s the best option.

In the life sciences, for instance, we are uniquely blessed with an array of carefully curated ontologies. By indexing documents with annotations that map to these ontologies, we can substantially outperform state-of-the-art vector retrieval in terms of recall and precision.
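The sketch below illustrates the idea (not SciBite's actual implementation): an annotator maps entity mentions, including synonyms, onto ontology identifiers, documents are indexed under those identifiers, and queries are resolved to the same identifiers, so "Viagra" and "sildenafil" retrieve the same evidence without any vectors. The `annotate` function and the identifiers are illustrative placeholders.

```python
# Toy ontology-grounded index: documents and queries are resolved to shared entity IDs.
documents = [
    "Sildenafil is a selective inhibitor of PDE5A.",
    "Viagra binds phosphodiesterase type 5.",  # synonyms resolve to the same IDs
]

def annotate(text: str) -> set[str]:
    """Placeholder dictionary-based annotator mapping mentions to ontology IDs."""
    synonyms = {
        "sildenafil": "DRUG:sildenafil", "viagra": "DRUG:sildenafil",
        "pde5a": "GENE:PDE5A", "phosphodiesterase type 5": "GENE:PDE5A",
    }
    lowered = text.lower()
    return {entity_id for name, entity_id in synonyms.items() if name in lowered}

# Inverted index: ontology ID -> set of document IDs mentioning that entity.
index: dict[str, set[int]] = {}
for doc_id, text in enumerate(documents):
    for entity_id in annotate(text):
        index.setdefault(entity_id, set()).add(doc_id)

def search(query: str) -> set[int]:
    """Return documents containing every entity recognised in the query."""
    entity_ids = annotate(query)
    if not entity_ids:
        return set()
    return set.intersection(*(index.get(e, set()) for e in entity_ids))

print(search("Which drugs target PDE5A?"))  # -> {0, 1}: both phrasings match via shared IDs
```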

Can we avoid the operationalization costs?

Moreover, we can avoid the operationalization costs associated with large vector stores. OpenAI's state-of-the-art embedding model, Ada-2, falls more than 15% short of our semantic search on mean recall. But even to get this close, we had to use embeddings of 1536 dimensions. This is exacerbated by retrieval algorithms that trade additional storage for lower query-time complexity. Disk is relatively cheap, but using these vectors in combination with ElasticSearch we saw an almost 20x increase in storage requirements. That's nothing to scoff at when dealing with hundreds of millions of documents.
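A back-of-the-envelope calculation shows why this matters at corpus scale; the corpus size and float width below are assumptions for illustration.

```python
# Rough storage estimate for the dense vectors alone (float32, 1536 dimensions as above).
dimensions = 1536
bytes_per_value = 4          # float32
documents = 100_000_000      # assumed corpus size: "hundreds of millions of documents"

raw_bytes = dimensions * bytes_per_value * documents
print(f"Raw vectors alone: ~{raw_bytes / 1e12:.2f} TB")  # ~0.61 TB before any index overhead
# Approximate nearest-neighbour structures and index replication multiply this further,
# which is how overall storage can balloon towards figures like the ~20x observed above.
```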

There are also upfront costs associated with the vectorization process itself. There’s the cost of actually feeding data through the model to be embedded, of course, and if you want the added security of using models hosted on Azure, this hosting also comes at a cost. These costs are not just monetary, but also temporal. Many organizations are throttled by rate limits, and increasing these limits costs more money, leading to a tradeoff between upfront costs and opportunity costs. These costs recur as new models are released. Some tasks that take minutes with our semantic indexing, like embedding a 1M document chunk of MEDLINE for testing retrieval performance, took days using OpenAI embeddings.
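The time side of that trade-off is easy to estimate; every number below is a hypothetical placeholder rather than a measured figure.

```python
# Rough time-to-embed estimate under an API rate limit (all numbers hypothetical).
docs = 1_000_000             # e.g. a 1M-document chunk of MEDLINE
tokens_per_doc = 300         # assumed average abstract length
rate_limit_tpm = 150_000     # assumed tokens-per-minute quota

minutes = docs * tokens_per_doc / rate_limit_tpm
print(f"~{minutes / (60 * 24):.1f} days at this quota")  # ~1.4 days; lower quotas stretch further
```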

 

Advantages of semantic indexing

Another advantage of semantic indexing is that once our documents have been retrieved, we know a lot about their contents. We know exactly which genes are mentioned alongside which drugs and which scientific verbs link them together, with every entity aligned to a consistent ID. This allows for seamless integration with other knowledge sources.
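As a toy illustration of that integration (anticipating the example below), a literature-derived triple can be joined directly onto a pathway and disease graph because everything shares the same identifiers; the graph content and identifiers here are illustrative placeholders, not curated data.

```python
# Toy join between a literature-derived annotation and other knowledge sources,
# made trivial because everything shares the same entity identifiers.
literature_triple = {"drug": "DRUG:sildenafil", "verb": "binds", "target": "GENE:PDE5A"}

downstream_of = {"GENE:PDE5A": ["GENE:PRKG1"]}                        # illustrative pathway edges
disease_links = {"GENE:PRKG1": ["DISEASE:thoracic aortic aneurysm"]}  # illustrative associations

def repurposing_leads(target_id: str) -> list[tuple[str, str]]:
    """Diseases linked to proteins functionally downstream of the given target."""
    return [
        (gene, disease)
        for gene in downstream_of.get(target_id, [])
        for disease in disease_links.get(gene, [])
    ]

print(repurposing_leads(literature_triple["target"]))
```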

For example, if we found literature evidence that Sildenafil binds to PDE5A, we might then want to see which proteins are functionally downstream of PDE5A and whether they are associated with any diseases that might represent repurposing candidates. To recreate this annotation capability over MEDLINE using GPT-4 would rapidly accrue six-figure costs, and that is assuming each document would only need to be fed to the model once.

In reality, you would probably need to send each document multiple times with lengthy, expensive system prompts to explain your requirements. And to align annotated entities with consistent IDs would require complex agentic behavior demanding further calls to the model and would, once again, incur substantially more cost.

Leveraging recent advances and synergies with retrieval heuristics

None of this is intended to devalue recent advances in AI. On the contrary, we can make use of these technologies in combination with our retrieval heuristic. For example, we can use GPT to identify authors, affiliations, dates and other data that don’t map onto our selection of ontologies. It can also help us to filter out purely grammatical words and to identify non-biological n-grams within incoming queries. In contrast to vector approaches, this is only necessary at query time.
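A sketch of that query-time assistance, again assuming the OpenAI Python client; the prompt, model name and output schema are illustrative assumptions rather than our production setup.

```python
# Sketch of query-time LLM preprocessing: pull out metadata constraints and
# candidate biomedical terms before the ontology-based lookup runs.
import json
from openai import OpenAI

client = OpenAI()

def preprocess_query(query: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                       # illustrative model name
        response_format={"type": "json_object"},   # ask for machine-readable output
        messages=[{
            "role": "user",
            "content": (
                "From the search query below, return JSON with keys 'entities' "
                "(candidate biomedical terms), 'authors' and 'date_range'.\n"
                f"Query: {query}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)

print(preprocess_query("papers by Smith since 2020 on drugs that inhibit PDE5A"))
```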

This extends the lead of semantic search even further, and we’ve only just scratched the surface of what is possible here. We can also deploy LLMs as assistants for our subject matter expert curators, helping them to quickly arrive at relevant evidence for synonymity or relationships between entities.

Our semantic search is more performant, cheaper and greener, and, being built upon a foundation of open-source ontologies, it is better insulated from the regulatory disruptions that may be facing LLM development.

Oliver Giles
Machine Learning Scientist, SciBite

Oliver Giles, Machine Learning Scientist, received his MSc in Synthetic Biology from Newcastle University and his BA in Philosophy from the University of East Anglia. He is currently focused on interfacing natural language with structured data, extracting that structured data from text, and using AI for the inference of novel hypotheses.
