SciBite is one of several members of the Pistoia Alliance working to address challenges in making better use of big data in Global Life Sciences R&D. In April, our CTO took part in the Alliance’s Spring Virtual Conference on a day dedicated to emerging science and technologies.
In the talk “Combining Deep Learning, Semantics, and Domain Expertise to Detect Patterns in Biomedical Text,” we walked viewers through examples of semantic-based machine learning (ML) that overcame hurdles in extracting information from varied sources of biomedically relevant content. In the process, we made the case for SciBite’s approach of combining ML with subject matter expertise. The core message: “Know your data, understand the problem, and select your solution accordingly.”
“Semantics matter and are a powerful tool to capture, enrich, and even sub-select data to help know more about content before training, but language models require ‘fine-tuning’ for different language types, and that’s where subject matter expertise is essential.”
SciBite described an ML strategy based on “seeding” named entity recognition (NER). “Seed” terms are passed to an ML model that extracts similar terms from text based on the contexts in which those terms occur. For example, both lung and heart occur alongside words such as ‘surgery,’ ‘function,’ and ‘medication.’ The model, therefore, clusters them together.
Thus, each seed generates a cluster of terms of varied relevance. Pass those terms as seeds in a second iteration, and each will generate its own word cluster. The resulting pattern of word clusters is the beginning of semantic categories which, though noisy at first, can be pruned by subject matter experts to shape meaningful classifications through an iterative seeding and pruning process.
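The seeding and pruning loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SciBite's implementation: simple co-occurrence counts and cosine similarity stand in for the ML model, the corpus is an invented toy, and the names `context_vectors`, `expand`, `seed_and_prune`, and the `approve` callback (a stand-in for the subject matter expert) are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Invented toy corpus, for illustration only.
CORPUS = [
    "the lung surgery went well",
    "the heart surgery went well",
    "the emergency surgery went well",
    "lung function improved after medication",
    "heart function improved after medication",
]


def context_vectors(sentences, window=2):
    """Count the words that occur within `window` tokens of each term."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand(seeds, vectors, threshold):
    """Return terms whose contexts resemble the contexts of any seed."""
    found = set()
    for seed in seeds:
        if seed not in vectors:
            continue
        for term, vec in vectors.items():
            if term not in seeds and cosine(vectors[seed], vec) >= threshold:
                found.add(term)
    return found


def seed_and_prune(seeds, vectors, approve, rounds=2, threshold=0.7):
    """Alternate automatic expansion with expert pruning.

    `approve` is a hypothetical stand-in for the human reviewer:
    it returns True for terms that belong in the category.
    """
    accepted = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        candidates = expand(frontier, vectors, threshold) - accepted
        frontier = {term for term in candidates if approve(term)}
        if not frontier:
            break
        accepted |= frontier
    return accepted


vectors = context_vectors(CORPUS)

# One expansion step from a single seed: a noisy cluster.
first_pass = expand({"lung"}, vectors, threshold=0.7)

# The same seed with an "expert" who rejects the off-category term.
organs = seed_and_prune({"lung"}, vectors, approve=lambda term: term != "emergency")

print(sorted(first_pass))
print(sorted(organs))
```

In this toy run the first pass picks up both a genuine neighbour of the seed and an off-category term that merely shares the word "surgery" as context, which is the kind of noise the pruning step is meant to remove. The threshold of 0.7 is an arbitrary choice for this corpus.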
The advantage of this strategy is that, by pairing ML with domain expertise, models can be developed rapidly and adapted to many different applications.
SciBite’s team of ML experts has, for example, trained a transformer model to scale up the seeding process and constructed a 6,000-term vocabulary for genetic variation. Beginning with a handful of seed terms, the model generated increasingly rich term groupings, which were validated by subject experts as the relational backbone of the vocabulary. This approach was even successfully applied alongside translation to enable English speakers to develop a Japanese vocabulary with thousands of terms, despite not speaking a word of Japanese. The final result was then validated by Japanese-speaking staff.
Finally, the flexibility of the strategy addresses some of the language variation that has stymied efforts to capture relevant information from patient accounts on Facebook, Reddit, Twitter, forums, and other real-world data sources.
Models trained to understand how specific term categories, like medications or symptoms, appear within a sentence can infer that a phrase is likely a medication or symptom from the language around it. So, for example, “could not sleep,” although not in an ontology, can be recognized and annotated as a symptom. A model trained only to look for specific terms, like “insomnia,” would miss it.
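To make the idea of recognizing a phrase from its surroundings concrete, here is a deliberately simplified sketch. It is not a trained language model: it only memorizes the word immediately before and after known symptom terms (a "frame") and then proposes any short phrase that appears inside the same frame. The sentences are invented, and `learn_frames` and `propose_phrases` are hypothetical names.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def learn_frames(sentences, known_terms):
    """Collect (word before, word after) pairs that surround known terms."""
    frames = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token in known_terms and 0 < i < len(tokens) - 1:
                frames.add((tokens[i - 1], tokens[i + 1]))
    return frames


def propose_phrases(sentence, frames, known_terms, max_len=3):
    """Propose unseen phrases of up to `max_len` tokens found inside a learned frame."""
    tokens = tokenize(sentence)
    proposals = []
    for start in range(1, len(tokens) - 1):
        last_end = min(start + max_len, len(tokens) - 1)
        for end in range(start + 1, last_end + 1):
            if (tokens[start - 1], tokens[end]) in frames:
                phrase = " ".join(tokens[start:end])
                if phrase not in known_terms:
                    proposals.append(phrase)
    return proposals


# Invented examples of patient-style text.
known_symptoms = {"insomnia", "nausea"}
training = [
    "Side effects so far: insomnia and dizziness",
    "My main complaints are nausea and fatigue",
]

frames = learn_frames(training, known_symptoms)
candidates = propose_phrases(
    "Side effects so far: could not sleep and headaches", frames, known_symptoms
)
unrelated = propose_phrases("The pharmacy was closed and empty", frames, known_symptoms)

print(candidates)
print(unrelated)
```

A real model generalizes far beyond exact frames like these, but the principle is the same: the phrase is annotated because of where it sits in the sentence, not because it appears in a term list.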
Going beyond term extraction, the strategy also presents an opportunity to support Bidirectional Encoder Representations from Transformers (BERT) models that deliver answers to natural language questions within a given chunk of text. Identifying the chunks that might contain the right answer can pose a challenge when the information space is very extensive or ambiguous.
Flexible NER, as described above, can pick out the most salient paragraphs from a vast information space, delivering a narrower search field where answers are most likely to be found. This semantics-based sub-selection of data can streamline real-time processing and is something we are building into SciBite Search.
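As a rough sketch of this kind of sub-selection, the snippet below annotates each paragraph with terms from a small vocabulary and keeps only the paragraphs that share the most entities with the question. Everything here is an assumption made for illustration: the vocabulary, the toy paragraphs, and the function names `annotate` and `select_paragraphs` are hypothetical, simple string matching stands in for real NER, and none of it reflects how SciBite Search works internally.

```python
import re


def annotate(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words."""
    lowered = text.lower()
    return {
        term.lower()
        for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def select_paragraphs(question, paragraphs, vocabulary, top_k=2):
    """Keep the paragraphs sharing the most entities with the question.

    Paragraphs with no shared entity are dropped; ties keep document order.
    """
    question_entities = annotate(question, vocabulary)
    scored = []
    for index, paragraph in enumerate(paragraphs):
        overlap = annotate(paragraph, vocabulary) & question_entities
        if overlap:
            scored.append((len(overlap), index, paragraph))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [paragraph for _, _, paragraph in scored[:top_k]]


# Invented toy data, for illustration only.
vocabulary = {"metformin", "nausea", "insomnia"}
paragraphs = [
    "Several patient notes mention nausea after starting metformin.",
    "The clinic opens at nine on weekdays.",
    "One forum post describes insomnia but no problems with metformin.",
    "Nausea was reported in an unrelated thread.",
]

selected = select_paragraphs("Does metformin cause nausea?", paragraphs, vocabulary)
for paragraph in selected:
    print(paragraph)
```

Only the selected paragraphs would then be passed to a question-answering model, so the expensive step runs on a small fraction of the text.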
Our goal at SciBite is to create applied AI that is flexible and ready to be used. It should be scalable, it should be seamless, and it should allow the user to focus on doing high-quality science with data rather than puzzling over the “nuts and bolts” of how to do it. Our presentation at the Pistoia Alliance is one example of our continuous quest to create new ways of generating more insights from our data.
Watch the full recorded presentation from the Pistoia Alliance virtual conference.
© SciBite Limited / Registered in England & Wales No. 07778456