Training AI is hard, so we trained an AI to do it
 

GPT-3 is a large language model capable of generating text with very high fidelity. Unlike previous models, it doesn't stumble over its grammar or write like an inebriated caveman. In many circumstances it can easily be taken for a human author, and for this reason GPT-generated text is increasingly prevalent across the internet.


GPT's grammar is flawless; its science is not

Within the life sciences, it’s been used to classify sentences, extract entities like genes or diseases, and summarise documents. There is, however, a reluctance to use its primary function, namely text generation. And there’s a good reason for this: while GPT’s grammar is flawless, its science is not. If you ask it to generate a list of recommended dosages for pharmaceuticals, for example, you’d be well advised not to take its output seriously. Frankly, it’s dangerous.

But there is one circumstance where we don’t need the output to be factually precise.

While GPT's grammatical competence is unparalleled, there are still circumstances where smaller models are more practical. GPT is enormous, and running it over large corpora is both time-consuming and expensive. And although GPT is the best language model we have, the returns on its size and cost are diminishing: tuned BERT models outperform untuned GPT at a fraction of the size and cost. But to tune these models, we need data. And here we come to the symbiosis between GPT and BERT.

Fine-tuning BERT

To fine-tune BERT, we do not need true sentences. What we need are sentences that look like they're true, so that when BERT is exposed to actual true sentences, it can recognise their general shapes and structures. In other words, we just need to teach BERT the grammar of the problem we're trying to solve. And this is where GPT excels. It doesn't matter that the science is wrong; it matters that it looks right.
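To make this concrete, below is a minimal sketch of the fine-tuning step using the Hugging Face transformers and datasets libraries. The file name, column names, and two-label setup are illustrative assumptions, not our production configuration.

```python
# Minimal sketch: fine-tuning BERT on GPT-generated sentences.
# Assumes generated.csv has "text" and "label" columns (illustrative).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. interaction vs. no interaction

dataset = load_dataset("csv", data_files={"train": "generated.csv"})

def tokenize(batch):
    # Convert raw sentences into BERT's input IDs and attention masks
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-synthetic", num_train_epochs=3),
    train_dataset=dataset["train"],
)
trainer.train()
```

A pleasant side effect of synthetic data is that the labels come for free: since we know which class each prompt was designed to produce, every generated sentence arrives pre-labelled.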

What's more, this principle can be applied not only to a dataset as a whole, but also to specific edge cases within it. For example, we have found that models often struggle with negations, because they are so infrequent within the training data. We really don't want a model for extracting protein interactions to identify 'GENE1 does not bind to GENE2' as a positive relationship, but we also don't want to implement clumsy rules for every edge case we come across.

As these edge cases are rare (often occurring in only a fraction of a percent of sentences), we would need to curate huge amounts of data to get enough examples for the model to recognise the pattern. Instead, we can use GPT to generate sufficient examples to boost the signal within the training data.
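As a rough illustration, the sketch below asks GPT-3 for negated-interaction sentences through the legacy OpenAI completions API. The prompt, few-shot examples, and sampling parameters are assumptions for illustration only.

```python
# Sketch: generating rare negation examples with GPT-3 via the legacy
# OpenAI completions API (reads OPENAI_API_KEY from the environment).
# Prompt and few-shot examples are illustrative assumptions.
import openai

PROMPT = """Write sentences from biomedical papers stating that one \
protein does NOT interact with another.

Sentence: GENE1 does not bind to GENE2.
Sentence: No association between GENE1 and GENE2 was observed.
Sentence:"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=PROMPT,
    temperature=0.9,  # sample broadly to get diverse paraphrases
    max_tokens=40,
    n=50,             # request 50 candidate sentences in one call
)

# Every generation is labelled as a negative example and added to the
# training data to boost the signal for this edge case.
negatives = [choice.text.strip() for choice in response.choices]
```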


The resultant models

We've managed to use GPT generations to speed up our model prototyping pipeline, allowing us to reach highly accurate models with only small bursts of curation. With just a few verified examples, we can use GPT to generate many more. The resultant models can then be combined with TERMite, which identifies and masks entities; the model then checks the masked sentences for relationships of interest. And because TERMite outputs IDs that map easily onto public ontologies, these relationships can be plugged directly into knowledge graphs with minimal effort.
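To show what the masking step looks like, here is a toy stand-in: in our pipeline TERMite performs the entity recognition, but this sketch simply takes a pre-computed entity list (a hypothetical input, not the TERMite API) and replaces each mention with a typed placeholder before the sentence reaches the relationship model.

```python
# Toy sketch of entity masking before relationship classification.
# The entity list would come from an NER step (TERMite in our pipeline);
# here it is hard-coded for illustration.
import re

def mask_entities(sentence, entities):
    """Replace each recognised entity mention with a typed placeholder."""
    for i, (mention, entity_type) in enumerate(entities, start=1):
        sentence = re.sub(re.escape(mention), f"{entity_type}{i}", sentence)
    return sentence

sentence = "BRCA1 does not bind to TP53."
entities = [("BRCA1", "GENE"), ("TP53", "GENE")]
print(mask_entities(sentence, entities))
# -> GENE1 does not bind to GENE2.
```

Masking means the relationship model never sees the raw gene names, so it learns the shape of the relationship rather than memorising particular entities, which is exactly the property that lets GPT-generated sentences stand in for curated ones.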

As part of this experimentation, we have tried many different methods for constructing our prompts, aided greatly by internal curation efforts. We have also tested GPT-3's hyperparameters and the impact of fine-tuning.

The latter was of particular importance. Although GPT-3, without any fine-tuning, seemed to generate a reasonable and diverse selection of sentences, these failed to produce any gains in our downstream model performance. The sentences read well to us as humans, but apparently some deeper grammatical trends made them an unrepresentative sample of real-world sentences in the eyes of BERT.
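For reference, a fine-tuning run against the legacy OpenAI API could look roughly like the sketch below; the file name and the prompt/completion format are illustrative assumptions.

```python
# Sketch: fine-tuning GPT-3 on curated sentences via the legacy OpenAI
# fine-tuning API. curated.jsonl is an illustrative file in which each
# line pairs a short prompt with a real curated sentence, e.g.
# {"prompt": "Protein interaction:", "completion": " GENE1 binds GENE2."}
import openai

upload = openai.File.create(file=open("curated.jsonl", "rb"),
                            purpose="fine-tune")

job = openai.FineTune.create(training_file=upload.id, model="davinci")
print(job.id)  # poll this job, then sample from the resulting model
```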


Figure 1: Curated data is still king, but when availability is limited, generated data can substantially augment performance.

These methods even worked in the challenging domain of real-world evidence. Many concrete examples of drugs causing certain side effects can only be found in online discussion boards, so we developed a model that extracts adverse events associated with drugs from real-world evidence sources like Facebook and Reddit. Here are some examples from the training data. Try to guess which of these are real examples found on Reddit and which are generated by an AI:

Figure 2: Example sentences from the adverse event training data.

Could you tell them apart? They’re actually all generated. By the same AI. We wish it a swift recovery from its assorted ailments.

 


About Oliver Giles

Machine Learning Scientist, SciBite

Oliver Giles, Machine Learning Scientist, received his MSc in Synthetic Biology from Newcastle University, and his BA in Philosophy from the University of East Anglia. He is currently focused on interfacing natural language with structured data, extracting structured data from text, and using AI to infer novel hypotheses.

View LinkedIn profile
