Large language models (LLMs) and search; it’s a FAIR game

Headshot of Joe Mullen, SciBite

Large language models (LLMs) have limitations when applied to search due to their inability to distinguish between fact and fiction, potential privacy concerns, and provenance issues. LLMs can, however, support search when used in conjunction with FAIR data and could even support the democratisation of data, if used correctly…

Large language models (LLMs), such as GPT from OpenAI/Microsoft and Bard from Google, are great at a variety of language-based tasks, like really great. They can generate coherent and contextually relevant text, summarise large documents or articles, aggregate text, translate between languages and interpret human language. These attributes allow LLMs to expedite a wide set of time-consuming, language-based tasks, including aiding operational, marketing, and legal efforts.

LLMs are not, however, designed for search. If you have played around with them, you will have seen evidence of this firsthand! Their tendency to regurgitate the bias or misinformation present in their training data, as well as their tendency to hallucinate, i.e., return syntactically correct but factually wrong responses, can have dire consequences if used in the wrong setting. This is particularly pertinent in the life sciences, where evidence-based decisions are paramount.

Other reasons as to why LLMs are not as effective for search when compared to specialised search tools include:

  1. Privacy: serious questions remain around privacy and IP when using open LLMs, whether that be sending data to an LLM for fine-tuning or training. Who owns the IP of an insight shared with, or indeed created by, an LLM?
  2. Provenance: it is often difficult to see the provenance of a response, i.e., from which document was the answer extracted? Without provenance, it is extremely hard to trust any answer returned from an LLM, or any system, particularly in scenarios, be it industry or subdivision, where the answer simply must be correct.
  3. Out of date: due to the sheer size of LLMs, their development and deployment is computationally expensive and requires significant resource, meaning they are not retrained often and quickly become outdated.

Let’s start with a ‘simpler’ scenario. Say you want to ask a question of a specific internal dataset: a set of experimental write-ups that your LLM has not seen. You want to know which targets have been reviewed in the context of a particular therapeutic area, liver disease, but crucially you also want to know from which experimental document the evidence has come.

In this scenario, a retriever is used to grab a set of relevant and prioritised documents from the internal dataset. Documents can then be passed to an LLM, along with the original question, to provide an answer and provide provenance as to where the answer was described in your internal data.

There are a few ways to identify the most relevant set of documents, including keyword search and vector-based search; the latter means chunking docs, vectorising the chunks, and then using similarity measures to identify the vectors that sit closest to the vectorised query. The issue is that keyword search has limited precision/recall, and vectorising entire corpora of literature is, at present, time-consuming and costly. As these costs drop, though, multimodal search (i.e., semantic search with vector relevancy scores also provided) will gain traction.
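As a rough illustration of the chunk, vectorise and compare steps, here is a minimal sketch in Python. It is not how any particular product works: simple word-count vectors stand in for a real embedding model, and the document IDs, text, gene name and function names are all invented for the example.

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of `size` words (real systems chunk more carefully)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorise(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, docs, k=2):
    """Return the k most similar chunks as (score, doc_id, chunk_text)."""
    q = vectorise(query)
    scored = [
        (cosine(q, vectorise(c)), doc_id, c)
        for doc_id, text in docs.items()
        for c in chunk(text)
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Hypothetical internal write-ups, keyed by document ID.
docs = {
    "exp-001": "GENE1 was reviewed as a target in liver disease models",
    "exp-002": "Assay protocol for kinase screening in lung tissue",
}
results = top_chunks("targets reviewed in liver disease", docs, k=1)
```

Each result keeps the ID of the document the chunk came from, which is what lets the answer be traced back to its source later on.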

Semantic search, on the other hand, which marks up textual data against standards, or ontologies, has much more favourable precision and recall than a basic keyword search and is cheaper and, importantly, more explainable than vector-based search methods.

Semantic search solutions build on FAIRified data and allow, in this scenario, any documents that mention any $GENE$ alongside anything from the $Liver Disease$ branch of a disease ontology, including subclasses like $nonalcoholic steatohepatitis$, to be passed to the LLM, with explainable provenance.
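To make the ontology-driven retrieval a little more concrete, here is a small sketch, assuming documents have already been annotated with typed entities. The mini-ontology, the annotation format, the document IDs and the gene names are all hypothetical; a real ontology and annotation pipeline would be far richer.

```python
# Hypothetical mini-ontology: each term maps to its direct subclasses.
ONTOLOGY = {
    "liver disease": ["nonalcoholic steatohepatitis"],
    "nonalcoholic steatohepatitis": [],
}

# Hypothetical annotated write-ups: doc ID -> list of (entity type, label).
DOCS = {
    "write-up-1": [("GENE", "GENE1"), ("DISEASE", "nonalcoholic steatohepatitis")],
    "write-up-2": [("GENE", "GENE2"), ("DISEASE", "asthma")],
    "write-up-3": [("DISEASE", "liver disease")],
}


def descendants(term, ontology):
    """Return the term plus all of its subclasses, walking the hierarchy."""
    found = {term}
    for child in ontology.get(term, []):
        found |= descendants(child, ontology)
    return found


def retrieve(docs, disease_root, ontology):
    """Find docs mentioning any gene alongside any disease in the branch."""
    branch = descendants(disease_root, ontology)
    hits = []
    for doc_id, annotations in docs.items():
        genes = sorted(label for kind, label in annotations if kind == "GENE")
        diseases = sorted(
            label for kind, label in annotations
            if kind == "DISEASE" and label in branch
        )
        if genes and diseases:
            # Keep the doc ID and the matched entities as provenance.
            hits.append({"doc": doc_id, "genes": genes, "diseases": diseases})
    return hits


hits = retrieve(DOCS, "liver disease", ONTOLOGY)
```

Only `write-up-1` is returned here: it is the one document that mentions a gene together with a term from the liver disease branch, and the hit records exactly which entities matched, so the reason it was passed to the LLM is explainable.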

Converting natural language to structured queries; public identifiers are key

So, we can see how retrieval methods may be used to resolve some of the issues around privacy, provenance, and relevance so long as your data is managed. BUT what happens if the data I want to look at is captured in structured datasets? Such as a labelled property graph? Well, thankfully, LLMs are great at generating and understanding language.

Codex and Copilot, along with other examples, have demonstrated the utility of LLMs for developing code… so… why not use LLMs to convert natural language queries to specific syntactic query languages? Say, Cypher, for querying a Neo4j knowledge graph (KG).

That is great, BUT, in a KG, nodes, or entities, are often only retrievable via a specific ID. For example, $nonalcoholic steatohepatitis$ can be referred to as $NASH$, $Non-alcoholic hepatic disease$, $Non-alcoholic hepatic$ and so on, but will only be labelled and indexed by its unique identifier in the graph, DI00020. That is where it becomes vital to be able to convert strings to things.

Being able to take an NL query, identify and key its strings to things, whilst also converting the NL to the correct query syntax, would allow one to query structured datasets in a retrieval process not dissimilar to that described above.
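A minimal sketch of the "strings to things" step might look like the following. The synonym table is built from the example above; the graph schema (the `Gene` and `Disease` labels and the `ASSOCIATED_WITH` relationship) and the function names are assumptions made for illustration, and a fixed query template stands in for the part an LLM would generate.

```python
import re

# Synonyms keyed to the graph identifier used in the example above.
SYNONYMS = {
    "nonalcoholic steatohepatitis": "DI00020",
    "nash": "DI00020",
    "non-alcoholic hepatic disease": "DI00020",
}


def resolve(question):
    """Return (matched string, identifier) for the longest synonym found."""
    text = question.lower()
    for name in sorted(SYNONYMS, key=len, reverse=True):
        # Word boundaries stop "nash" matching inside another word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return name, SYNONYMS[name]
    return None


def to_cypher(question):
    """Build a parameterised Cypher query keyed on the resolved identifier."""
    match = resolve(question)
    if match is None:
        raise ValueError("no known entity found in question")
    _, disease_id = match
    # Hypothetical schema; in practice an LLM would propose the query shape.
    query = (
        "MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease {id: $disease_id}) "
        "RETURN g.name"
    )
    return query, {"disease_id": disease_id}


query, params = to_cypher("Which genes are linked to NASH?")
```

Passing the identifier as a query parameter, rather than pasting it into the query text, keeps the generated query shape separate from the resolved entity.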

A scientific marketplace for registering tools?

Wonderful, we have talked about how LLMs may be used in conjunction with semantically rich document databases in the context of retrieval, as well as how being able to convert and enrich NL queries can enable us to query structured datasets, BUT what if I want to query multiple semantic document databases and structured databases using the same natural language query? Again, this is all dependent on well-managed data.

It doesn’t make sense to send a query on target prioritisation to an HR database. To send a query to a relevant tool, there needs to be some understanding as to what is available in that tool and how queries can be developed against it. This is where APIs become important. As LLMs are good at generating code, it makes sense to use them here too… but, and here is the important bit, that would require well-documented APIs.

One can imagine a world where APIs conform to a standard that provides an interface to support large language models’ understanding of their utility: for example, ensuring APIs are documented with NL querying in mind, and providing prompt engineering-tuned descriptions of endpoints. In such a nirvana, LLMs could use these descriptions to understand what the API does, how to invoke it, and, importantly, what questions it can answer. One could imagine asking a question of an LLM and having it look up, against a registry of services, the tools that may be relevant to query (these could be internal or external), generate queries (whilst keying on public identifiers), and pull back information to the user, who could then interact with the relevant data in a chat-like fashion to refine queries.
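As a toy illustration of such a registry, the sketch below stores a natural-language description for each tool and routes a question to the best match. Everything here is hypothetical: the tool names, the descriptions, and especially the routing, where simple word overlap stands in for the judgement an LLM would make from the descriptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str  # natural-language description an LLM could read


# Hypothetical registry of services.
REGISTRY = [
    Tool("target-kg", "answers questions about gene targets, diseases and drugs"),
    Tool("hr-db", "answers questions about employees, holidays and payroll"),
]


def tokens(text):
    """Lower-case word set for a crude overlap comparison."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(question, registry):
    """Pick the tool whose description shares most words with the question."""
    words = tokens(question)
    scored = [(len(words & tokens(t.description)), t.name) for t in registry]
    best_score, best_name = max(scored)
    return best_name if best_score > 0 else None


chosen = route("Which targets are prioritised for liver diseases?", REGISTRY)
```

With these descriptions, the target prioritisation question is routed to `target-kg` and never reaches the HR database; a question that matches no description returns `None` rather than being sent somewhere irrelevant.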

All sounds good, doesn’t it… but how would one set up a more interactive framework that could make use of multiple tools captured in a registry of services to answer more complex scientific queries that potentially require multi-step logic? Questions, for example, such as ‘are there any current marketed drugs that may be repositioned to treat fatty liver disease?’. Luckily, we already have a few options…

Creating more complex processes

Frameworks that allow LLMs to be applied to more complex scenarios, scenarios that need multi-step instructions, are becoming increasingly popular and are already proving their value. For example, LangChain allows one to define controlled processes that make use of LLMs, allowing components to be ‘chained’ together. Chains are made up of multiple components, including prompt templates; LLMs; agents (which use LLMs to decide what actions should be taken); and, importantly, long/short-term memory, which ensures these components are not running in isolation with no context. Components allow processes to be defined in LangChain in a more iterative, interactive fashion.
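The idea of chaining a prompt template, an LLM and memory can be sketched in a few lines of plain Python. This deliberately does not use LangChain's actual API; the `Chain` class and the stubbed `fake_llm` function are invented for illustration, with the stub returning a canned string so the example runs offline.

```python
seen_prompts = []  # records every prompt the stub receives


def fake_llm(prompt):
    """Stand-in for a real LLM call; echoes the last line of the prompt."""
    seen_prompts.append(prompt)
    return f"[answer to: {prompt.splitlines()[-1]}]"


class Chain:
    """Hypothetical chain: prompt template + LLM + short-term memory."""

    def __init__(self, template, llm):
        self.template = template
        self.llm = llm
        self.memory = []  # prior (question, answer) turns

    def run(self, question):
        # Fold earlier turns into the prompt so each call has context.
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.memory)
        prompt = self.template.format(history=history, question=question)
        answer = self.llm(prompt)
        self.memory.append((question, answer))
        return answer


chain = Chain("Context so far:\n{history}\nQuestion: {question}", fake_llm)
first = chain.run("Which drugs target liver disease?")
second = chain.run("Which of those are marketed?")
```

The second call only makes sense because the first question and answer are carried forward in the prompt; without the memory, "those" would have nothing to refer to.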

Less concerned with controlled processes are frameworks known as recursive AI agents. Examples include Auto-GPT. More autonomous than LangChain, Auto-GPT provides results for more complex, multi-step procedures by taking a first prompt or goal and automating the back and forth between LLMs until the task is completed. As with LangChain, a key component of Auto-GPT is its long/short-term memory capabilities. Whilst these technologies show great promise, they do not resolve some of the inherent issues surrounding LLMs; however, they could certainly play a part in a bigger solution.

What could the future hold for democratised scientific search?

Ok, so let us fast forward a minute. We’ve done it! We’ve democratised data and enabled anyone in the scientific community to ask system X a complex NL-based query and have an answer returned… along with the evidence from which it came! Importantly, the response has drawn on structured and unstructured data from both internal and external data sources. So how was it done? Primarily, all data was made FAIR and aligned to standards, so that everything was talking the same language: science.

Second, all tools, internal or external, holding this enriched data had richly documented APIs, which in turn were captured in a scientific registry that held information about what kinds of questions these tools can respond to and how questions need to be structured. Interestingly, in this future state, search engine optimisation focuses less on website traffic and instead is used as a mechanism by which plugins compete to be prioritised when the scientific registry is called! Anyway, so we have the registry, or science marketplace, and a service for converting NL queries (i) to the relevant scientific ‘things’ and (ii) to the correct format for the tools to be queried.

Finally, a tool for chaining these more complex queries was used. An end user typed in a query, which activated that tool; the query was fired to the relevant tools from the registry, and an answer was provided, along with the evidence, which came from internal and external APIs registered in the marketplace. Easy, eh?

Thanks for reading!

So, there we have it, LLMs can certainly bring value to search, but they need to be used in conjunction with well-managed, semantic, FAIR data, and other tools that enable the keying and conversion of strings to things. Additional concepts, including normalised API documentation and scientific marketplaces, may also prove beneficial as we look to a world where data truly is democratised!

If you are at BioIT this year and want to hear more about SciBite and see the applicability of large language models to search, come see us at Booth 702!


About Joe Mullen

Director of Data Science & Professional Services, SciBite

Joe Mullen, Director of Data Science & Professional Services, holds a Ph.D. from Newcastle University in the development of computational approaches to drug repositioning, with a focus on semantic data integration and data mining. He has been with SciBite since 2017, initially as part of the Data Science team.
