Building gold-standard training sets for machine learning and AI systems [Use Case]

Search + business intelligence

The Business Challenge:

Many applications of AI involve pattern recognition, but their accuracy depends heavily on the data being unambiguous. Machine learning models can be used to identify sentences describing positive and negative relations between entities (i.e. X has some relation with Y). However, to train such models, it is vital to have as clean a dataset as possible. For example, without prior semantic enrichment of the text, a machine learning model would not be able to correctly identify that the phrase “…the binding of repaglinide to HSA in human plasma…” refers to an interaction between a drug and a protein, rather than between two proteins.

The SciBite Solution:

We created a tool that uses SciBite’s Named Entity Recognition (NER) engine, TERMite, to accurately identify and categorise sentences that mention protein-protein interactions. First, all sentences mentioning both entity type 1 and entity type 2 were extracted from MEDLINE. For protein-protein interactions, this meant looking for two GENE mentions in the same sentence. These sentences were then surfaced to a curator, along with related metadata. The curator assigned each sentence to one of three sets: i) sentences that describe a positive interaction, ii) sentences that describe a negative interaction, or iii) coincidental mentions. This data can then be used to train machine learning models to automate the extraction of sentences describing a relation of interest.
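The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SciBite's implementation: the small `LEXICON` dictionary is a hypothetical stand-in for TERMite (whose API is not shown here), the example sentences are made up for the demonstration rather than taken from MEDLINE, and the names `tag_entities`, `find_candidates`, `Candidate` and `Label` are invented for this sketch.

```python
import re
from dataclasses import dataclass
from enum import Enum
from itertools import combinations

# Hypothetical toy lexicon standing in for a real NER engine.
# It is NOT TERMite's vocabulary or API; it only maps a few names to entity types.
LEXICON = {
    "repaglinide": "DRUG",
    "hsa": "GENE",
    "brca1": "GENE",
    "bard1": "GENE",
    "tp53": "GENE",
    "mdm2": "GENE",
}


class Label(Enum):
    """The three sets a curator can assign a sentence to."""
    POSITIVE = "positive interaction"
    NEGATIVE = "negative interaction"
    COINCIDENTAL = "coincidental mention"


@dataclass(frozen=True)
class Candidate:
    """A sentence plus the pair of entity mentions that made it a candidate."""
    sentence: str
    entity_1: str
    entity_2: str


def tag_entities(sentence):
    """Return (surface form, entity type) for every lexicon hit, in text order."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [(tok, LEXICON[tok.lower()]) for tok in tokens if tok.lower() in LEXICON]


def find_candidates(sentences, type_1="GENE", type_2="GENE"):
    """Keep sentences that mention one entity of type_1 and another of type_2."""
    wanted = sorted((type_1, type_2))
    candidates = []
    for sentence in sentences:
        hits = tag_entities(sentence)
        for (a, a_type), (b, b_type) in combinations(hits, 2):
            # Skip repeated mentions of the same entity within a sentence.
            if a.lower() != b.lower() and sorted((a_type, b_type)) == wanted:
                candidates.append(Candidate(sentence, a, b))
    return candidates


# Made-up example sentences for illustration only.
sentences = [
    "The binding of repaglinide to HSA in human plasma was measured.",
    "BRCA1 binds BARD1 to form a heterodimer.",
    "TP53 and MDM2 expression were both measured in the cohort.",
]

candidates = find_candidates(sentences)

# In the real workflow a human curator makes these decisions; this dictionary
# is a stand-in for the curator's choices, not an automated step.
curated = {
    candidates[0]: Label.POSITIVE,
    candidates[1]: Label.COINCIDENTAL,
}

for candidate, label in curated.items():
    print(f"{label.value}: {candidate.entity_1} / {candidate.entity_2}")
```

Because the tagger assigns a type to each mention, the repaglinide sentence is excluded from the GENE-GENE candidates: it pairs a DRUG with a GENE. Calling `find_candidates(sentences, "DRUG", "GENE")` would select it instead, which is how the same sketch could be pointed at a different relation of interest.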

Key Business Benefits:

  • Greatly streamlines the process of curating and categorising training data
  • Trains machine learning models to accurately detect true protein-protein interactions using examples extracted from the biomedical literature
