As discussed in part 1, artificial intelligence (AI) has revolutionized several areas of the Life Sciences, including disease diagnosis and drug discovery. In this second blog, we introduce some specific text-based models, whilst also discussing the challenges and future impact of AI in the Life Sciences.
The applications described so far have predominantly been developed to work on a specific type of data. For example, image-based models are used to support diagnosis, and audio-based models, such as speech recognition, have been used to make healthcare systems more accessible to those with disabilities.
Text-based models work on a data type a little closer to our hearts at SciBite, and can be applied, in some capacity, to all of the use cases described above, including several of the success stories.
There are many different types of text-based models; a simplified summary of the main types is provided below:
- Classification models, which assign labels to documents or sentences (for example, tagging a paper by therapeutic area).
- Named entity recognition (NER) models, which identify mentions of entities such as drugs, genes, and diseases in text.
- Question-answering models, which extract or generate answers to natural language questions.
- Summarization and generative models, including large language models (LLMs), which produce new text from an input.
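To make these types a little more concrete, the sketch below shows how some of them look in practice using the open-source Hugging Face transformers library. This is purely an illustrative choice on our part, not a description of SciBite’s own stack, and the default models used here are general-purpose rather than biomedical.

```python
# A minimal sketch of common text-based model types, using the open-source
# Hugging Face `transformers` library (an illustrative choice only).
from transformers import pipeline

text = "Imatinib was approved for chronic myeloid leukemia in the United States."

# Named entity recognition: the default model tags general entity types
# (people, organizations, locations); recognizing drugs and diseases
# would require a domain-specific, biomedical model.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner(text))

# Text classification: assigns a label to a piece of text (the default
# model performs sentiment analysis, used here purely as a stand-in).
classifier = pipeline("text-classification")
print(classifier(text))

# Extractive question answering: pulls an answer span from a context passage.
qa = pipeline("question-answering")
print(qa(question="What was imatinib approved for?", context=text))
```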
Ok, so we’ve got the models, we know the problems they can be applied to, and we have heard some convincing success stories… so it’s all good, right? Unfortunately not. There are less flashy, but critical, hurdles to be navigated that can cause headaches for many organizations utilizing AI.
Below we describe some of these hurdles, particularly in reference to text-based models, although most will hold true regardless of the type of model being created. These include data hurdles, model hurdles and, arguably most importantly, hurdles around people: the necessity of having subject matter expertise to ensure models are indeed trained, and behave, in a scientifically accurate fashion.
Starting off with the data: somewhat ironically, given the exponential growth in data, we are getting to the point where the limiting factor on model improvement is not the size or number of parameters of the language models, but rather the lack of (quality) training data available! [2]
The clichéd paradigm of “garbage in, garbage out” still holds true: having quality, relevant data to train a model is paramount. Luckily, being part of the Elsevier family really helps us here.
Is just having quality data sufficient, I hear you ask? Well, no. We still need to ensure that the data is representative and not biased; having certain elements of a dataset overrepresented can have serious repercussions on model performance.
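As a concrete illustration, the minimal sketch below (with hypothetical data and an arbitrary threshold) shows how one might check a training set for over-represented labels before any training takes place.

```python
from collections import Counter

# Hypothetical training examples: (text, label) pairs.
training_data = [
    ("Patient responded well to treatment A.", "positive"),
    ("No adverse events reported.", "positive"),
    ("Severe reaction observed after dosing.", "negative"),
    # ... in practice, thousands more examples ...
]

# Count how often each label appears.
label_counts = Counter(label for _, label in training_data)
total = sum(label_counts.values())

# Flag any label that dominates the dataset (threshold is illustrative).
for label, count in label_counts.most_common():
    share = count / total
    print(f"{label}: {count} examples ({share:.1%})")
    if share > 0.8:
        print(f"  warning: '{label}' is heavily over-represented")
```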
Next, you must always be aware of any privacy or accessibility issues that limit your data usage, such as those surrounding personal medical data.
Ok, so we have a training dataset that is of high quality, is representative and unbiased, and has no privacy concerns. How can we use this data to train or fine-tune models? First, you must think about whether this is a collaborative effort and, if so, the security implications of such an approach; open federated learning frameworks, such as FLOWER, are being developed, although some work is still needed.
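To give a flavor of what this looks like, here is a minimal sketch of a FLOWER (flwr) client following its NumPyClient pattern. The exact API differs between Flower releases, a single numpy array stands in for a real model’s parameters, and a running server at the given address is assumed.

```python
# A minimal sketch of a Flower (flwr) federated learning client.
import numpy as np
import flwr as fl

class ToyClient(fl.client.NumPyClient):
    def __init__(self):
        self.weights = np.zeros(10)  # stand-in for real model parameters

    def get_parameters(self, config):
        # Send the current local parameters to the server.
        return [self.weights]

    def fit(self, parameters, config):
        # Receive global parameters, "train" locally (here: a dummy update),
        # and return the updated parameters plus the local example count.
        self.weights = parameters[0] + 0.1
        return [self.weights], 100, {}

    def evaluate(self, parameters, config):
        # Return (loss, number of examples, metrics) for the global model.
        loss = float(np.sum(np.abs(parameters[0])))
        return loss, 100, {"loss": loss}

# Assumes a Flower server is already running at this address.
fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=ToyClient())
```

The appeal of this pattern is that only model parameters, never the raw (potentially sensitive) training data, leave each collaborator’s environment.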
Furthermore, there is a whole raft of considerations that fall under the umbrella of responsible AI, that is, the practice of designing, developing, and deploying AI with good intention.
Although we won’t cover all aspects of responsible AI here, some of the key points include:
- Fairness: ensuring models are trained on representative data and do not amplify bias.
- Privacy and security: protecting sensitive data, such as personal medical data, throughout training and deployment.
- Transparency and explainability: capturing the provenance of training data and of the decisions a model makes.
- Accountability: being clear about who supports a model and how undesired behavior is rectified.
- Sustainability: keeping the energy consumption of AI infrastructure to a minimum.
We have our data, and we have our models, so what else do we need to think about? Well, quite a lot. Beyond the obvious consideration of cost, the scalability and efficiency of AI must be at the forefront of any project planning. We must also work together, as a collective, to ensure AI infrastructure is sustainable, keeping energy consumption to a minimum.
Furthermore, as AI models become commonplace and accessible to all from open sources, it is important to consider, before productionising these models, who is supporting them. If a model exhibits undesired behavior, how is that rectified? And how do we ensure the model has been trained on representative data, whilst capturing the provenance of this?
This brings us to our final, but likely most important, consideration: people. Subject matter expertise, particularly within the Life Sciences, is vital when configuring and running AI models. With the increased accessibility of AI, this is often the piece that gets forgotten or neglected. But how can models be fine-tuned to support a domain without a subject matter expert (SME) who understands said domain?
Ok, so hopefully we can see that AI is playing a key role in expediting many processes and enabling the extraction of insight within the Life Sciences, but what do we see as the key areas for further focus? Below we list a few and comment on how SciBite may be used to support them:
Understanding that language model improvement is limited by the availability of vast quantities of quality data, the ability to address this data shortage is of paramount importance to the domain.
To create synthetic data, it is important to have quality, representative data to build from. Being part of the Elsevier family gives SciBite the luxury of access to a gold-standard set of scientific text; coupled with our semantic technologies, this puts us in a prime position to produce such datasets.
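To sketch the idea, the snippet below seeds a general-purpose generative model with fragments of source text to produce synthetic sentences. GPT-2 and the seed fragments are illustrative stand-ins; real synthetic data generation would need a domain-tuned model and careful curation of the output.

```python
# A minimal sketch of synthetic text generation, seeded with prompts
# drawn from quality source text. GPT-2 is an illustrative open model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical seed fragments taken from gold-standard scientific text.
seeds = [
    "The patient cohort was treated with",
    "Adverse events were most frequently observed in",
]

for seed in seeds:
    outputs = generator(
        seed,
        max_new_tokens=30,
        num_return_sequences=2,
        do_sample=True,  # sampling is needed to get multiple variants
    )
    for out in outputs:
        print(out["generated_text"])
```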
The application of AI to the analysis of real-world data (RWD) will continue to be an area of focus, whether that be electronic health records (EHRs), claims, patient-generated data, or data collected from other sources that can inform on patient health, such as Twitter, Reddit, and other social media platforms. Access to this data will continue to be problematic, and it remains to be seen whether synthetic data generation can truly substitute for the real thing.
As AI becomes a commodity utilized daily, we need to ensure that the provenance of decisions is captured. For example, when using question-answering models, it is important to be able to see from which source, or document, an answer was extracted. This is even more important in the Life Sciences, where the provenance of evidence used to support decision-making must be clear and explainable.
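As a simple illustration, the sketch below runs an extractive question-answering model over a small set of documents and returns the best answer together with the identifier of the document it came from. The documents, their IDs, and the default model are hypothetical stand-ins.

```python
# A minimal sketch of extractive question answering with provenance:
# each answer is returned alongside the identifier of its source document.
from transformers import pipeline

qa = pipeline("question-answering")

documents = {
    "doc-001": "Imatinib is a tyrosine kinase inhibitor used to treat "
               "chronic myeloid leukemia.",
    "doc-002": "Metformin is a first-line medication for type 2 diabetes.",
}

question = "What is imatinib used to treat?"

# Run the question against every document and keep the highest-scoring answer.
best = None
for doc_id, text in documents.items():
    result = qa(question=question, context=text)
    if best is None or result["score"] > best["score"]:
        best = {"doc_id": doc_id, **result}

print(f"Answer: {best['answer']}")
print(f"Source: {best['doc_id']} (score {best['score']:.2f})")
```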
Throughout this article, we have discussed many types of AI models: image-, audio-, and text-based models are commonplace, but what about models that can take videos or songs as input and answer questions about them via textual prompts? These are known as multimodal models. Many data types are used to train these models, and, as ever, SciBite is there to support the quality of the textual data going into them.
At SciBite, we believe this is an extremely exciting time for AI, specifically within the Life Sciences. We understand and share the excitement around the latest developments, including large language models. More than ever, we believe the foundational management of standards, and the application of these to data, is crucial when it comes to ensuring that AI models are trained, deployed, and analyzed in as accurate a manner as possible.
To us, AI is an all-important tool to enrich our capabilities in making data work harder. We are actively exploring, quantifying, and scaling various AI-based approaches to build on our expertise in areas such as candidate ontology generation, identifying mappings between source and target ontologies, triple generation for knowledge graphs, and enhancing search through natural language-based queries, question answering, and tagging documents using classification approaches, to name but a few.
At SciBite, we have the critical components for making confident, evidence-based connections in scientific data, using AI, rules-based approaches, or a combination of both. We have the data, the tools, and the team to explore the most effective ways to query, retrieve, and interpret data to answer science’s most pressing questions.
Please get in touch with us to discuss how best we can support you in finding answers.
Click here to read the first blog in the series “Revolutionizing Life Sciences: The incredible impact of AI in Life Science”
Leading SciBite’s data science and professional services team, Joe is dedicated to helping customers unlock the full potential of their data using SciBite’s semantic stack, spearheading R&D initiatives within the team and pushing the boundaries of what is possible. Joe’s expertise is rooted in a PhD from Newcastle University, focussing on novel computational approaches to drug repositioning, built atop semantic data integration, knowledge graphs, and data mining.
Since joining SciBite in 2017, Joe has been enthused by the rapid advancements in technology, particularly within AI. Recognizing the immense potential of AI, Joe combines this cutting-edge technology with SciBite’s core technologies to craft bespoke solutions that cater to diverse customer needs.