As discussed in part 1, artificial intelligence (AI) has revolutionized several areas of the Life Sciences, including disease diagnosis and drug discovery. In this second blog, we introduce some specific text-based models, whilst also discussing the challenges and future impact of AI in the Life Sciences.
The applications described so far have predominantly been developed to work on a specific type of data. For example, image-based models are used to support diagnosis, and audio-based models, such as speech recognition, have been used to make healthcare systems more accessible to those with disabilities.
Text-based models work on a data type a little closer to our hearts at SciBite, and can be applied, in some capacity, to all of the use cases described above, including several of the success stories.
There are many different types of text-based models; a simplified summary of the main types is provided below:
- Classification models, which assign labels to documents or sentences (for example, tagging a paper by therapeutic area).
- Named entity recognition (NER) models, which identify mentions of entities such as drugs, genes, and diseases in text.
- Question-answering models, which extract or generate answers to natural language questions.
- Summarization and generative models, including large language models (LLMs), which produce new text from an input.
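To make these types a little more concrete, the sketch below shows how some of them look in practice using the open-source Hugging Face transformers library. This is purely an illustrative choice on our part, not a description of SciBite’s own stack, and the default models used here are general-purpose rather than biomedical.

```python
# A minimal sketch of common text-based model types, using the open-source
# Hugging Face `transformers` library (an illustrative choice only).
from transformers import pipeline

text = "Imatinib was approved for chronic myeloid leukemia in the United States."

# Named entity recognition: the default model tags general entity types
# (people, organizations, locations); recognizing drugs and diseases
# would require a domain-specific, biomedical model.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner(text))

# Text classification: assigns a label to a piece of text (the default
# model performs sentiment analysis, used here purely as a stand-in).
classifier = pipeline("text-classification")
print(classifier(text))

# Extractive question answering: pulls an answer span from a context passage.
qa = pipeline("question-answering")
print(qa(question="What was imatinib approved for?", context=text))
```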
Ok, so we’ve got the models, we know the problems they can be applied to, and we have heard some convincing success stories… so it’s all good, right? Unfortunately not. There are less flashy, but critical, hurdles to be navigated that can cause headaches for many organizations utilizing AI.
Below we describe some of these hurdles, particularly in reference to text-based models, although most will hold true regardless of the type of model being created. These include data hurdles, model hurdles and, arguably most importantly, hurdles around people: the necessity of having subject matter expertise to ensure models are indeed trained, and behave, in a scientifically accurate fashion.
Starting off with the data: somewhat ironically, given the exponential growth in data, we are getting to the point where the limiting factor on model improvement is not the size or number of parameters of the language models, but rather the lack of (quality) training data available! [2]
The clichéd paradigm of “garbage in, garbage out” still holds true: having quality, relevant data to train a model is paramount. Luckily, being part of the Elsevier family really helps us here.
Is just having quality data sufficient, I hear you ask? Well, no. We still need to ensure that the data is representative and not biased; having certain elements of a dataset overrepresented can have serious repercussions on model performance.
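As a concrete illustration, the minimal sketch below (with hypothetical data and an arbitrary threshold) shows how one might check a training set for over-represented labels before any training takes place.

```python
from collections import Counter

# Hypothetical training examples: (text, label) pairs.
training_data = [
    ("Patient responded well to treatment A.", "positive"),
    ("No adverse events reported.", "positive"),
    ("Severe reaction observed after dosing.", "negative"),
    # ... in practice, thousands more examples ...
]

# Count how often each label appears.
label_counts = Counter(label for _, label in training_data)
total = sum(label_counts.values())

# Flag any label that dominates the dataset (threshold is illustrative).
for label, count in label_counts.most_common():
    share = count / total
    print(f"{label}: {count} examples ({share:.1%})")
    if share > 0.8:
        print(f"  warning: '{label}' is heavily over-represented")
```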
Next, you must always be aware of any privacy or accessibility issues that limit your data usage, such as those surrounding personal medical data.
Ok, so we have a training dataset that is of high quality, is representative and unbiased, and has no privacy concerns. How can we use this data to train or fine-tune models? First, you must think about whether this is a collaborative effort and, if so, the security implications of such an approach; open federated learning frameworks, such as FLOWER, are being developed, although some work is still needed.
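To give a flavor of what this looks like, here is a minimal sketch of a FLOWER (flwr) client following its NumPyClient pattern. The exact API differs between Flower releases, a single numpy array stands in for a real model’s parameters, and a running server at the given address is assumed.

```python
# A minimal sketch of a Flower (flwr) federated learning client.
import numpy as np
import flwr as fl

class ToyClient(fl.client.NumPyClient):
    def __init__(self):
        self.weights = np.zeros(10)  # stand-in for real model parameters

    def get_parameters(self, config):
        # Send the current local parameters to the server.
        return [self.weights]

    def fit(self, parameters, config):
        # Receive global parameters, "train" locally (here: a dummy update),
        # and return the updated parameters plus the local example count.
        self.weights = parameters[0] + 0.1
        return [self.weights], 100, {}

    def evaluate(self, parameters, config):
        # Return (loss, number of examples, metrics) for the global model.
        loss = float(np.sum(np.abs(parameters[0])))
        return loss, 100, {"loss": loss}

# Assumes a Flower server is already running at this address.
fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=ToyClient())
```

The appeal of this pattern is that only model parameters, never the raw (potentially sensitive) training data, leave each collaborator’s environment.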
Furthermore, there is a whole raft of considerations that fall under the umbrella of responsible AI, that is, the practice of designing, developing, and deploying AI with good intention.
Although we won’t cover all aspects of responsible AI here, some of the key points include:
- Fairness: ensuring models are trained on representative data and do not amplify bias.
- Privacy and security: protecting sensitive data, such as personal medical data, throughout training and deployment.
- Transparency and explainability: capturing the provenance of training data and of the decisions a model makes.
- Accountability: being clear about who supports a model and how undesired behavior is rectified.
- Sustainability: keeping the energy consumption of AI infrastructure to a minimum.
We have our data, and we have our models, so what else do we need to think about? Well, quite a lot. Beyond the obvious consideration of cost, the scalability and efficiency of AI must be at the forefront of any project planning. We must also work together, as a collective, to ensure AI infrastructure is sustainable, keeping energy consumption to a minimum.
Furthermore, as AI models become commonplace and accessible to all from open sources, it is important to consider, before productionising these models, who is supporting them. If a model exhibits undesired behavior, how is that rectified? And how do we ensure the model has been trained on representative data, whilst capturing the provenance of this?
This brings us to our final, but likely most important, consideration: people. Subject matter expertise, particularly within the Life Sciences, is vital when configuring and running AI models. With the increased accessibility of AI, this is often the piece that gets forgotten or neglected. But how can models be fine-tuned to support a domain without a subject matter expert (SME) who understands said domain?
Ok, so hopefully we can see that AI is playing a key role in expediting many processes and enabling the extraction of insight within the Life Sciences, but what do we see as the key areas for further focus? Below we list a few and comment on how SciBite may be used to support them:
Understanding that language model improvement is limited by the availability of vast quantities of quality data, the ability to address this data shortage is of paramount importance to the domain.
To create synthetic data, it is important to have quality, representative data to build from. Being part of the Elsevier family gives SciBite the luxury of access to a gold-standard set of scientific text; coupled with our semantic technologies, this puts us in a prime position to produce such datasets.
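To sketch the idea, the snippet below seeds a general-purpose generative model with fragments of source text to produce synthetic sentences. GPT-2 and the seed fragments are illustrative stand-ins; real synthetic data generation would need a domain-tuned model and careful curation of the output.

```python
# A minimal sketch of synthetic text generation, seeded with prompts
# drawn from quality source text. GPT-2 is an illustrative open model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical seed fragments taken from gold-standard scientific text.
seeds = [
    "The patient cohort was treated with",
    "Adverse events were most frequently observed in",
]

for seed in seeds:
    outputs = generator(
        seed,
        max_new_tokens=30,
        num_return_sequences=2,
        do_sample=True,  # sampling is needed to get multiple variants
    )
    for out in outputs:
        print(out["generated_text"])
```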
The application of AI to the analysis of real-world data (RWD) will continue to be an area of focus, whether that be electronic health records (EHRs), claims, patient-generated data, or data collected from other sources that can inform on patient health, such as Twitter, Reddit, and other social media platforms. Access to this data will continue to be problematic, and it remains to be seen whether synthetic data generation can truly substitute for the real thing.
As AI becomes a commodity utilized daily, we need to ensure that the provenance of decisions is captured. For example, when using question-answering models, it is important to be able to see from which source, or document, an answer was extracted. This is even more important in the Life Sciences, where the provenance of evidence used to support decision-making must be clear and explainable.
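As a simple illustration, the sketch below runs an extractive question-answering model over a small set of documents and returns the best answer together with the identifier of the document it came from. The documents, their IDs, and the default model are hypothetical stand-ins.

```python
# A minimal sketch of extractive question answering with provenance:
# each answer is returned alongside the identifier of its source document.
from transformers import pipeline

qa = pipeline("question-answering")

documents = {
    "doc-001": "Imatinib is a tyrosine kinase inhibitor used to treat "
               "chronic myeloid leukemia.",
    "doc-002": "Metformin is a first-line medication for type 2 diabetes.",
}

question = "What is imatinib used to treat?"

# Run the question against every document and keep the highest-scoring answer.
best = None
for doc_id, text in documents.items():
    result = qa(question=question, context=text)
    if best is None or result["score"] > best["score"]:
        best = {"doc_id": doc_id, **result}

print(f"Answer: {best['answer']}")
print(f"Source: {best['doc_id']} (score {best['score']:.2f})")
```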
Throughout this article, we have discussed many types of AI models: image-, audio-, and text-based models are commonplace, but what about models that can take videos or songs as input and answer questions about them via textual prompts? These are known as multimodal models. Many data types are used to train these models, and, as ever, SciBite is there to support the quality of the textual data going into them.
At SciBite, we believe this is an extremely exciting time for AI, specifically within the Life Sciences. We understand and share the excitement around the latest developments, including large language models. More than ever, we believe the foundational management of standards, and the application of these to data, is crucial when it comes to ensuring that AI models are trained, deployed, and analyzed in as accurate a manner as possible.
To us, AI is an all-important tool to enrich our capabilities in making data work harder. We are actively exploring, quantifying, and scaling various AI-based approaches to build on our expertise in areas such as candidate ontology generation, identifying mappings between source and target ontologies, triple generation for knowledge graphs, and enhancing search through natural language-based queries, question answering, and tagging documents using classification approaches, to name but a few.
At SciBite, we have the critical components for making confident, evidence-based connections in scientific data, using AI, rules-based approaches, or a combination of both. We have the data, the tools, and the team to explore the most effective ways to query, retrieve, and interpret data to answer science’s most pressing questions.
Please get in touch with us to discuss how best we can support you in finding answers.
Click here to read the first blog in the series “Revolutionizing Life Sciences: The incredible impact of AI in Life Science”
Leading SciBite’s data science and professional services team, Joe is dedicated to helping customers unlock the full potential of their data using SciBite’s semantic stack, spearheading R&D initiatives within the team and pushing the boundaries of what is possible. Joe’s expertise is rooted in a PhD from Newcastle University, focussing on novel computational approaches to drug repositioning, built atop semantic data integration, knowledge graphs, and data mining.
Since joining SciBite in 2017, Joe has been enthused by the rapid advancements in technology, particularly within AI. Recognizing the immense potential of AI, Joe combines this cutting-edge technology with SciBite’s core technologies to craft bespoke solutions that cater to diverse customer needs.