GPT-3 is a large language model capable of generating text with very high fidelity. Unlike previous models, it doesn’t stumble over its grammar or write like an inebriated caveman. In many circumstances it can easily be taken for a human author, and GPT-generated text is increasingly common across the internet for this reason.
Within the life sciences, it’s been used to classify sentences, extract entities like genes or diseases, and summarise documents. There is, however, a reluctance to use its primary function, namely text generation. And there’s a good reason for this: while GPT’s grammar is flawless, its science is not. If you ask it to generate a list of recommended dosages for pharmaceuticals, for example, you’d be well advised not to take its output seriously. Frankly, it’s dangerous.
But there is one circumstance where we don’t need the output to be factually precise.
While GPT’s grammatical competence is unparalleled, there are still circumstances where smaller models are more practical. GPT is enormous, and running it over large corpora is both time-consuming and expensive. And although GPT is the best language model we have, the returns on its size and cost are diminishing. Fine-tuned BERT models can outperform untuned GPT at a fraction of the size and cost. But to tune these models, we need data. And here we come to the symbiosis between GPT and BERT.
To fine-tune BERT, we do not need true sentences. What we need are sentences that look like they’re true, so that when BERT is exposed to actual true sentences, it can recognise their general shapes and structures. In other words, we just need to teach BERT to understand the grammar of the problem we’re trying to solve. And this is where GPT excels. It doesn’t matter that the science is wrong; it matters that it looks like it’s right.
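To make that concrete, here is a minimal sketch of the fine-tuning step using the Hugging Face transformers library. The model name, example sentences, labels and training settings are illustrative placeholders rather than our actual pipeline, but they show the shape of the approach: a handful of curated sentences mixed with GPT-generated ones, fed to a standard BERT classifier.

```python
# A minimal sketch, assuming the Hugging Face transformers/PyTorch stack.
# The sentences, labels and settings below are illustrative placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A few curated sentences mixed with GPT-generated ones.
# Label 1 = describes a protein interaction, 0 = does not.
texts = [
    "BRCA1 interacts with BARD1 to form a stable heterodimer.",       # curated
    "Overexpression of SMAD4 was found to suppress binding of TP53.", # GPT-generated
    "The patient cohort was recruited across three clinical sites.",  # GPT-generated
]
labels = [1, 1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

encodings = tokenizer(texts, truncation=True, padding=True)

class SentenceDataset(torch.utils.data.Dataset):
    """Wraps tokenised sentences and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="protein_interaction_model",
                           num_train_epochs=3),
    train_dataset=SentenceDataset(encodings, labels),
)
trainer.train()
```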
What’s more, not only can this principle be applied to datasets as a whole, but also to specific edge cases within a dataset. For example, we have found that models often struggle with negations, because they are so infrequent within the training data. We really don’t want a model for extracting protein interactions to identify ‘GENE1 does not bind to GENE2’ as a positive relationship, but we also don’t want to implement clumsy rules for every edge case we come across.
As these edge cases are rare (often occurring in only a fraction of a percent of sentences), we would need to curate huge amounts of data to get enough examples for the model to recognise the pattern. Instead, we can use GPT to generate sufficient examples to boost the signal within the training data.
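As a rough illustration, the prompt below asks GPT-3 for negated interaction sentences to mix into the training set. It uses the legacy (pre-v1) openai Python client; the model name, prompt wording and sampling parameters are illustrative, and in practice we iterate on them with the help of curated seed examples.

```python
# A hedged sketch using the legacy (pre-v1) openai Python client; the prompt
# wording, model name and sampling parameters are illustrative only.
import openai

prompt = (
    "Write sentences in the style of the biomedical literature stating that "
    "one protein does NOT interact with another, with the protein names "
    "masked as GENE1 and GENE2.\n\n"
    "1. GENE1 does not bind to GENE2.\n"
    "2. No interaction between GENE1 and GENE2 was observed.\n"
    "3."
)

response = openai.Completion.create(
    model="text-davinci-003",  # any GPT-3 completion model would do here
    prompt=prompt,
    max_tokens=256,
    temperature=0.9,           # higher temperature for more varied phrasing
    n=5,                       # request several completions per call
)

# Split the numbered continuations into individual synthetic sentences.
synthetic_negations = [
    line.split(". ", 1)[-1].strip()
    for choice in response.choices
    for line in choice.text.splitlines()
    if line.strip()
]
```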
We’ve managed to use GPT generations to speed up our model prototyping pipeline, allowing us to get to highly accurate models with only small bursts of curation. With just a few verified examples, we can use GPT to generate many more. The resultant models can then be combined with TERMite, which identifies and masks the entities that the model then checks for relationships of interest. And because TERMite outputs IDs that map easily onto public ontologies, these relationships can be plugged directly into knowledge graphs with minimal effort.
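The masking step might look something like the sketch below. Note that this does not reproduce the TERMite API: the hit format and the ontology IDs shown are assumptions for illustration. The point is that each mention is replaced by a typed placeholder while its ID is kept to one side, so that any relationship the model finds can later be mapped back onto the knowledge graph.

```python
# Illustrative only: this is not the TERMite API. The assumed `hits` format
# (character offsets, entity type, ontology-style ID) stands in for whatever
# the entity recogniser returns; the masking logic is the point.
def mask_entities(sentence, hits):
    """Replace each entity mention with a typed placeholder (GENE1, GENE2, ...)
    and keep a map from placeholder to ID so relationships can be linked back."""
    counts, mapping = {}, {}
    # Number placeholders in reading order.
    for hit in sorted(hits, key=lambda h: h["start"]):
        counts[hit["type"]] = counts.get(hit["type"], 0) + 1
        hit["placeholder"] = f"{hit['type']}{counts[hit['type']]}"
        mapping[hit["placeholder"]] = hit["id"]
    # Splice placeholders in from right to left so earlier offsets stay valid.
    for hit in sorted(hits, key=lambda h: h["start"], reverse=True):
        sentence = sentence[:hit["start"]] + hit["placeholder"] + sentence[hit["end"]:]
    return sentence, mapping

hits = [  # assumed annotation format and example IDs, not real TERMite output
    {"start": 0, "end": 5, "type": "GENE", "id": "HGNC:1100"},
    {"start": 21, "end": 26, "type": "GENE", "id": "HGNC:952"},
]
masked, id_map = mask_entities("BRCA1 interacts with BARD1.", hits)
# masked -> "GENE1 interacts with GENE2."
# id_map -> {"GENE1": "HGNC:1100", "GENE2": "HGNC:952"}
```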
As part of this experimentation, we have tried many different methods for constructing our prompts, aided greatly by internal curation efforts. We have also tested the hyperparameters of GPT-3 and the impact of fine-tuning.
The latter was of particular importance. Although GPT-3, without any fine-tuning, seemed to generate a reasonable and diverse selection of sentences, these failed to lead to any gains in our downstream model performance. The sentences read well to us as humans, but apparently some deeper grammatical trends made them an unrepresentative sample of the class of real-world sentences in the eyes of BERT.
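For anyone wanting to run a similar experiment, GPT-3 fine-tuning takes JSONL files of prompt/completion pairs submitted through the OpenAI tooling. The sketch below shows only the data preparation; the prompts and completions are placeholders, not our curated examples.

```python
# A rough sketch of the data preparation for GPT-3 fine-tuning: a JSONL file
# of prompt/completion pairs. The prompts and completions here are placeholders,
# not our curated examples.
import json

examples = [
    {"prompt": "Write a sentence describing a protein interaction:\n\n",
     "completion": " GENE1 phosphorylates GENE2 in response to DNA damage.\n"},
    {"prompt": "Write a sentence stating that two proteins do not interact:\n\n",
     "completion": " GENE1 does not bind to GENE2.\n"},
]

with open("finetune_examples.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# The file can then be submitted with the (legacy) OpenAI CLI, roughly:
#   openai api fine_tunes.create -t finetune_examples.jsonl -m davinci
```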
These methods even worked in the challenging domain of real-world evidence. Many concrete examples of drugs causing certain side effects can only be found in online discussion boards, so we developed a model that extracts adverse events associated with drugs from real-world evidence sources like Facebook and Reddit.
Here are some examples from the training data. Try to guess which of these are real examples found on Reddit and which are generated by an AI:
Could you tell them apart? They’re actually all generated. By the same AI. We wish it a swift recovery from its assorted ailments.
Oliver Giles, Machine Learning Scientist, received his MSc in Synthetic Biology from Newcastle University and his BA in Philosophy from the University of East Anglia. He is currently focused on interfacing natural language with structured data, extracting that structured data from text, and using AI for the inference of novel hypotheses.