Revolutionizing Life Sciences: The future of AI in Life Science [Part 2]

As discussed in part 1, artificial intelligence (AI) has revolutionized several areas of the life sciences, including disease diagnosis and drug discovery. In this second blog, we introduce some specific text-based models whilst also discussing the challenges and future impact of AI in Life Science.


Text-based AI approaches

The applications described in part 1 were predominantly developed to work on a specific type of data. For example, image-based models are used to support diagnosis, and audio-based models, such as speech recognition, have been used to make healthcare systems more accessible to those with disabilities.

Text-based models work on a data type a little closer to our hearts at SciBite, and they can be applied, in some capacity, to all the use cases described in part 1, including more than one of the success stories.

There are many different types of text-based model; a simplified summary of these is provided below, followed by a short code sketch illustrating the first of them:

  1. Named Entity Recognition (NER) – models used to identify entities, such as drugs, genes, or diseases, from text.
  2. Relation Extraction – models that confirm the presence of a relationship between entities captured in text.
  3. Text Classification/Clustering Models – models used to classify text into different categories whether against a pre-defined set of categories or undefined ‘topics’.
  4. Sentiment Analysis – models used to determine the sentiment, or emotion, expressed in a piece of text.
  5. Text Summarisation – models used to automatically generate a summary of text.
  6. Question Answering Models – models used to answer questions by extracting relevant information from text.
  7. Large Language Models – models that can generate text; examples include GPT-3 and other transformer-based models.
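
To make the first of these concrete, below is a minimal sketch of an NER pipeline built with the open-source Hugging Face transformers library. The model named here is an illustrative general-purpose one; in practice, a token-classification model trained on biomedical entities would be substituted.

    # Minimal sketch of named entity recognition (NER) over text, using the
    # open-source Hugging Face `transformers` library. The model below is a
    # general-purpose example; a biomedical NER model could be swapped in.
    from transformers import pipeline

    # "ner" builds a token-classification pipeline; aggregation_strategy="simple"
    # merges word-piece tokens back into whole entity spans.
    ner = pipeline(
        "ner",
        model="dslim/bert-base-NER",  # illustrative general-purpose NER model
        aggregation_strategy="simple",
    )

    text = "Imatinib inhibits BCR-ABL and is used to treat chronic myeloid leukaemia."
    for entity in ner(text):
        print(entity["entity_group"], entity["word"], round(entity["score"], 3))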


Limitations and considerations

Ok, so we’ve got the models, we know the problems they can be applied to, and we have heard some convincing success stories… so it’s all good, right? Unfortunately not. There are less flashy, but critical, hurdles to be navigated that can cause headaches for many organizations utilizing AI.

Below we describe some of these hurdles, particularly in reference to text-based models, although most hold true regardless of the type of model being created. These include data hurdles, model hurdles and, arguably most importantly, hurdles around people: the necessity of subject matter expertise to ensure models are trained, and behave, in a scientifically accurate fashion.

1.    Data quality

Starting with the data. In what may seem slightly ironic as we see exponential growth in data, we are getting to the point where the limiting factor on model improvement is not the size or number of parameters of the language models, but rather the lack of (quality) training data available [2].

The clichéd paradigm of “garbage in, garbage out” still holds true: having quality, relevant data to train a model is paramount. Luckily, being part of the Elsevier family really helps us here.
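
As a flavour of what “quality” can mean in practice, here is a minimal sketch of two basic corpus checks: removing exact duplicates and dropping fragments too short to carry signal. The threshold is an arbitrary example, not a recommendation.

    # Illustrative sketch of basic quality filtering on a text corpus before
    # training: drop exact duplicates and fragments too short to be useful.
    def filter_corpus(documents, min_tokens=5):
        seen = set()
        kept = []
        for doc in documents:
            text = " ".join(doc.split())       # normalise whitespace
            if len(text.split()) < min_tokens:
                continue                        # too short to carry signal
            if text in seen:
                continue                        # exact duplicate
            seen.add(text)
            kept.append(text)
        return kept

    corpus = ["EGFR mutations confer sensitivity to gefitinib.",
              "EGFR mutations confer sensitivity to gefitinib.",  # duplicate
              "See figure 2."]                                    # too short
    print(filter_corpus(corpus))  # only one document survives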

2.    Representative data

Is just having quality data sufficient, I hear you ask? Well, no. We still need to ensure that the data is representative and not biased: having certain elements of a dataset over-represented can have serious repercussions on model performance, as the simple check below illustrates.
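
A quick, standard-library-only way to spot such over-representation is to inspect the label distribution of a training set; the labels below are made up for illustration.

    # Spotting over-represented classes in a labelled training set.
    from collections import Counter

    labels = ["oncology"] * 900 + ["cardiology"] * 80 + ["rare disease"] * 20

    counts = Counter(labels)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label:>12}: {n:5d} ({n / total:.1%})")
    # A model trained on this set sees "rare disease" in only 2% of examples,
    # so its performance on that class will likely suffer.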

3.    Data privacy

Next, you must always be aware of any privacy or accessibility issues that limit your data usage, such as personal medical data.
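By way of illustration only, the sketch below shows a rule-based de-identification pass over clinical-style text. Real de-identification of medical data is far more involved, typically combining models, rules, and human review; the patterns and example note here are hypothetical.

    # Hedged sketch of rule-based de-identification before text is used for
    # training. The patterns are illustrative only, not a complete solution.
    import re

    PATTERNS = {
        "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    }

    def redact(text):
        # Replace each matched identifier with a placeholder tag.
        for tag, pattern in PATTERNS.items():
            text = pattern.sub(f"[{tag}]", text)
        return text

    note = "Patient (NHS 943 476 5919) seen on 12/03/2023, contact j.doe@example.com."
    print(redact(note))
    # Patient (NHS [NHS_NUMBER]) seen on [DATE], contact [EMAIL].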

4.    Fine-tuning datasets

Ok, so we have a training dataset that is of high quality, is representative and unbiased, and has no privacy concerns. How can we use this data to train or fine-tune models? First, you must consider whether this is a collaborative effort and the security implications of such an approach; open federated learning frameworks, such as Flower, are being developed, although some work is still needed.
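
To illustrate the idea underpinning frameworks such as Flower, here is a conceptual sketch of federated averaging: each site trains on its own data locally, and only model weights, never raw records, are shared and combined. The weights here are plain NumPy arrays standing in for a real model.

    # Conceptual sketch of federated averaging: local training stays on-site,
    # and only weights are shared. A real setup would wrap an actual model
    # and use secure transport between sites and server.
    import numpy as np

    def federated_average(client_weights, client_sizes):
        """Average client weight arrays, weighted by local dataset size."""
        total = sum(client_sizes)
        return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

    # Three hospitals with different amounts of local data.
    weights = [np.array([0.2, 0.5]), np.array([0.4, 0.3]), np.array([0.1, 0.9])]
    sizes = [1000, 500, 250]

    global_weights = federated_average(weights, sizes)
    print(global_weights)  # the new global model, sent back to each site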

Furthermore, there is a whole raft of considerations that fall under the umbrella of responsible AI: the practice of designing, developing, and deploying AI with good intentions.

Although we won’t cover all aspects of responsible AI here, some of the key points include:

  • Ensuring transparency and explainability – why is the model providing such an answer?
  • Privacy and security – ensuring that if models are trained on private information, this data cannot be exposed via certain prompts.
  • Ethical issues – seeking out and eliminating bias and managing misinformation that can be produced by some large language models.

5.    Sustainable infrastructure

We have our data, and we have our models, so what else do we need to think about? Well, quite a lot. Beyond the obvious consideration of cost, the scalability and efficiency of AI must be at the forefront of any project planning. We must also work together, as a collective, to ensure AI infrastructure is sustainable and energy consumption is kept to a minimum.

Furthermore, as AI models become commonplace and openly accessible to all, it is important to consider, before productionising these models, who is supporting them. If a model exhibits undesired behavior, how is that rectified? And how do we ensure the model has been trained on representative data, whilst capturing the provenance of this?

6.    Subject matter expertise (SME)

This brings us to our final, but likely most important, consideration: people. Subject matter expertise, particularly within the Life Sciences, is vital when configuring and running AI models. With the increased accessibility of AI, this is often the piece that gets forgotten or neglected. But how can models be fine-tuned to support a domain without a subject matter expert (SME) who understands said domain?


What will the future hold?

Ok, so hopefully we can see that AI is playing a key role in expediting many processes and enabling the extraction of insight within the Life Sciences, but what do we see as key areas for further focus? Below we list a few and comment on how SciBite may support them:

1.    Synthetic data generation

Given that language model improvement is limited by the availability of vast quantities of quality data, the ability to address this data shortage is of paramount importance to the domain.

To create synthetic data, it is important to have quality, representative data to build from. Being part of the Elsevier family gives SciBite the luxury of access to a gold-standard corpus of scientific text, which, coupled with our semantic technologies, puts us in a prime position to produce such datasets.
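
One simple, hedged sketch of what synthetic data generation can look like: filling sentence templates with terms drawn from curated vocabularies to produce labelled training examples for, say, NER. The templates and term lists below are toy stand-ins for the rich, curated ontologies used in practice.

    # Sketch of template-based synthetic data generation for NER training.
    # Templates and term lists are toy examples; in practice they would be
    # drawn from curated ontologies and real corpora.
    import random

    templates = [
        "{drug} is under investigation for the treatment of {disease}.",
        "Patients with {disease} showed a response to {drug}.",
    ]
    drugs = ["imatinib", "pembrolizumab", "metformin"]
    diseases = ["chronic myeloid leukaemia", "melanoma", "type 2 diabetes"]

    random.seed(42)  # reproducible examples
    for _ in range(3):
        drug, disease = random.choice(drugs), random.choice(diseases)
        sentence = random.choice(templates).format(drug=drug, disease=disease)
        # Emit the sentence plus its entity labels, ready for NER training.
        print(sentence, {"DRUG": drug, "DISEASE": disease})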

2.    Real World Data (RWD)

Application of AI to the analysis of RWD will continue to be an area of focus, whether that be electronic health records (EHRs), claims, patient-generated data, or data collected from other sources that can inform on patient health, such as Twitter, Reddit, and other social media platforms. Access to this data will continue to be problematic, and it remains to be seen whether synthetic data generation can truly substitute for the real thing.

3.    Provenance in AI-based search solutions

As AI becomes a commodity utilized daily, we need to ensure that the provenance of decisions is captured. For example, when using question-answering models, it is important to be able to see from which source, or document, an answer was extracted. This is even more important in the Life Sciences, where the provenance of evidence used to support decision-making must be clear and explainable.
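
As a sketch of what capturing that provenance might look like, the structure below attaches the source document and character span to each extracted answer. This is an illustrative data structure, not a SciBite or third-party API, and the identifiers are made up.

    # Illustrative structure for a provenance-carrying QA result: the answer
    # plus exactly where it came from. Not a real API; identifiers are made up.
    from dataclasses import dataclass

    @dataclass
    class Answer:
        text: str          # the extracted answer
        document_id: str   # which document it came from
        start: int         # character offsets of the supporting span
        end: int
        score: float       # model confidence

    answer = Answer(
        text="imatinib",
        document_id="PMID:0000000",  # hypothetical source identifier
        start=112,
        end=120,
        score=0.93,
    )
    print(f"{answer.text} (source {answer.document_id}, "
          f"chars {answer.start}-{answer.end}, confidence {answer.score})")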

4.    Multimodal models

Throughout this series, we have discussed many types of AI models: image-, audio-, and text-based models are commonplace, but what about models that can take videos or songs as input and answer textual questions about them? These are known as multimodal models. Many data types are used to train these models, and, as ever, SciBite is there to support the quality of the textual data going into them.


SciBite and AI

At SciBite, we believe this is an extremely exciting time for AI, specifically within the Life Science space. We understand and share the excitement around the latest developments, including large language models. More than ever, we believe the foundational management of standards, and the application of these to data, is crucial to ensuring that AI models are trained, deployed, and analyzed as accurately as possible.

To us, AI is an all-important tool to enrich our capabilities in making data work harder. We are actively exploring, quantifying, and scaling various AI-based approaches to build on our expertise in areas such as candidate ontology generation, identifying mappings between source and target ontologies, triple generation for knowledge graphs, and enhancing search through natural language queries, question answering, and tagging documents using classification approaches, to name but a few.

At SciBite, we have the critical components for making confident, evidence-based connections in scientific data, using AI-based, rules-based, or combined approaches. We have the data, the tools, and the team to explore the most effective ways to query, retrieve, and interpret data to answer science’s most pressing questions.

Please get in touch with us to discuss how best we can support you in finding answers.

Click here to read the first blog in the series, “Revolutionizing Life Sciences: The incredible impact of AI in Life Science [Part 1]”.


About Joe Mullen

Director of Data Science & Professional Services, SciBite

Joe Mullen, Director of Data Science & Professional Services, holds a Ph.D. from Newcastle University in the development of computational approaches to drug repositioning, with a focus on semantic data integration and data mining. He has been with SciBite since 2017, initially as part of the Data Science team.

