On the first day of Christmas SciBite gave to me... 12 top tips for creating labelled Machine Learning training data.
With the holiday season upon us, we thought we would celebrate the season of giving – and what better gift than sharing with you our 12 top tips for creating labelled Machine Learning training data, with a few added festive analogies just because, well, it’s Christmas time…
So sit back, enjoy the read, deck the halls with piles of linear algebra, unwrap your shiny new transformer toys and pray for a 2020 in which we stop naming models after Sesame Street characters!
1. Start with the question, then the data, and let the model follow
Don’t dive head-first into the latest, coolest methodology without first being sure it’s appropriate for the task at hand. The question you want to ask will determine the data you need, and the data will define which models are suitable. Sometimes you need a reindeer-pulled sleigh, sometimes you need a 4×4.
2. Understand what relevant data looks like
Make sure data is relevant to the problem you are trying to address. One of our strategies when tackling natural language processing (NLP) problems is to use TERMite to exclude sentences that don’t contain a minimum set of entities we are interested in. For instance, if our objective is to identify protein-protein interactions, then the sentence must contain a minimum of two proteins. We’re making a list and checking it twice. This allows for significant savings in curation time and also in compute resource at inference.
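As a rough illustration of this kind of pre-filter, the sketch below keeps only sentences that mention at least two distinct proteins. It does not use TERMite: the `PROTEIN_TERMS` dictionary and the `find_proteins` and `filter_candidates` helpers are hypothetical stand-ins for whatever entity recognition step you have available.

```python
import re

# Hypothetical stand-in for an entity tagger: a tiny protein dictionary.
# A real pipeline would call a named entity recognition tool here instead.
PROTEIN_TERMS = {"BRCA1", "TP53", "MDM2", "EGFR"}


def find_proteins(sentence):
    """Return the set of distinct protein names mentioned in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {tok.upper() for tok in tokens if tok.upper() in PROTEIN_TERMS}


def filter_candidates(sentences, min_entities=2):
    """Keep only sentences mentioning at least `min_entities` distinct proteins."""
    return [s for s in sentences if len(find_proteins(s)) >= min_entities]


sentences = [
    "MDM2 binds TP53 and promotes its degradation.",
    "EGFR is frequently mutated in lung cancer.",
    "The samples were stored at room temperature.",
]
candidates = filter_candidates(sentences)
print(candidates)  # only the first sentence mentions two proteins
```

Only the surviving sentences would then go to curators, and the same filter can be applied before inference so the model never sees sentences it could not meaningfully classify.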
3. Select representative data
The data you train on needs to look like the data you want to infer on, so try to always be conscious of the data that your model will be exposed to once it is deployed. One common mistake is to leave data by the wayside if it doesn’t fit neatly into your defined classes. If you do this, when your model comes across these difficult edge cases it will have no idea what to do. You must decide what to do with noise when creating your training data to make your model robust enough to handle noise in the wild. It’s not always panettone and mulled wine.
4. Balance recall against precision
Depending on your use case, it is likely that you will have a sense of whether recall or precision takes precedence for you (for the smart Elves amongst you, we realise having both is ideal!). If not, this is something to consider when you construct your training set. When we curate sentences at SciBite, it is often necessary to have a ‘Don’t know’ option. We are then able to incorporate this category into either the positive or negative set and train both a recall-oriented and a precision-oriented model from that one dataset. This may not be possible in your case, so it’s important to plan ahead.
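One way to picture the ‘Don’t know’ trick is as two different label mappings over the same curated rows. The sketch below is a minimal example under our own assumptions: the label strings, the `curated` rows and the `binarise` helper are hypothetical, and the idea that folding uncertain examples into the positive set favours recall (and into the negative set favours precision) is our reading of the approach rather than a guarantee.

```python
# Hypothetical curated sentences; labels are "yes", "no" or "dont_know".
curated = [
    ("MDM2 binds TP53.", "yes"),
    ("MDM2 and TP53 were both measured.", "no"),
    ("MDM2 may influence TP53 levels.", "dont_know"),
]


def binarise(rows, dont_know_as):
    """Map three-way curation labels to 1/0, sending 'dont_know' to the given class."""
    mapping = {"yes": 1, "no": 0, "dont_know": dont_know_as}
    return [(text, mapping[label]) for text, label in rows]


# Uncertain examples counted as positive: intended for a recall-oriented model.
recall_set = binarise(curated, dont_know_as=1)

# Uncertain examples counted as negative: intended for a precision-oriented model.
precision_set = binarise(curated, dont_know_as=0)
```

Because both training sets come from one round of curation, the extra cost is a single additional button for the curators rather than a second labelling pass.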
5. Select a suitable base model for transfer learning
Transfer learning allows us to take a model trained on enormous amounts of data, and leverage the general understanding it has acquired to bootstrap our own specific tasks. It is impossible for a model to learn the English language from a few hundred sentences of highly specific data, but it is much more feasible to do the reverse – take a model with a broad understanding of the language and teach it about a specific domain.
However, while it may all be English, there is a chasm between the dialect of the legal profession and the dialect of the Twitterati, for example. The grammar is different, the vocabulary is different, the lengthiness is… markedly different. One of the most performant and popular language models, BERT, has a number of domain-specific variants which achieve state-of-the-art results in their given fields. Choose the language model which has been exposed to language most similar to that which your model will see. You can check out our blog post on using BERT for Deep Learning approaches.
6. Think about what sufficient data means for your approach
A common bottleneck for creating machine learning models is collecting sufficient training data, and also knowing exactly how much data is sufficient. Can you find experiments in the literature which used models comparable to the one you intend to use? This may give you some indication of how much data is needed for convergence. If you’re struggling to get enough data, you may also want to consider transfer learning, as the general knowledge already captured within these models allows them to pick up on complex patterns with fewer examples than other methods.
7. Think beyond training
Remember to account for testing data when you consider how much data you will need. It is typical to hold back around 25% of your data in order to assess the performance of your model on examples it has not seen before. This 25% should also be as representative of the problem space as the training data – don’t train just on blue tinsel, and test only on green tinsel. Depending on your problem area, you may also be able to utilise public datasets for further validation, although it is likely that they will have used slightly different criteria for collecting their data.
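To keep the held-back 25% as representative as the training data, one common option is a stratified split, in which each class keeps roughly the same share in both sets. The sketch below uses only the Python standard library; the `stratified_split` helper and the blue and green example data are hypothetical, and libraries such as scikit-learn offer equivalent functionality.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.25, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 80 "blue" and 40 "green" examples.
examples = [(f"blue-{i}", "blue") for i in range(80)]
examples += [(f"green-{i}", "green") for i in range(40)]

train, test = stratified_split(examples)
print(len(train), len(test))  # 90 30
```

Here the test set ends up with 20 blue and 10 green examples, mirroring the 2:1 ratio of the full dataset, so neither colour of tinsel is missing from evaluation.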
8. Clearly define rules for what goes into your training set
Machine learning excels at tackling complex problems which are not easily solved with rules-based methods. However, it is important that you don’t simply rely on intuition to define what you want your model to do. In our experiments we have found that individuals can have very different interpretations of what constitutes an interaction, association or causal link. These words and phrases need to be carefully defined for your specific use case, with examples of what does and does not count in each case.
We have also found that each task has its own unique ambiguities, and we recommend sitting down to perform a small round table curation session, optionally with warm mince pies, to identify these challenges and discuss the most appropriate solutions. Then make one final update to your rules before beginning the curation proper.
9. Communicate those rules to curators
It is vital to have the rules you have defined at hand at all times during the curation process. We recommend anchoring them at the top of your spreadsheet, or displaying them on your web app such that they are always visible. Make sure to double-check with all your curators that your rules are clear and that they cover all the edge cases – their honed senses may pick up on nuances that you have missed!
10. Serve data to curators
Curation can be expensive and time-consuming, so it is important to make the process itself as efficient as possible. One way to do this is to serve data to your curators using a curation tool, which presents them with one datapoint at a time and an easy, quick method to annotate that data. This can also take care of things like getting consensus between multiple curators, saving you from the tedious passing around of spreadsheets and even some mundane scripting.
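For a sense of the scripting a curation tool can save you, here is a minimal sketch of resolving several curators’ labels into one by strict majority vote. The `consensus` helper, the item identifiers and the label strings are all hypothetical, and a strict majority is only one possible consensus rule.

```python
from collections import Counter


def consensus(labels):
    """Return the label chosen by a strict majority of curators, or None if there is none."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    # Integer comparison avoids floating point issues with fractions such as 2/3.
    return label if votes * 2 > len(labels) else None


# Hypothetical annotations: item identifier -> one label per curator.
annotations = {
    "sentence-1": ["yes", "yes", "no"],
    "sentence-2": ["yes", "no", "dont_know"],
}
resolved = {item: consensus(votes) for item, votes in annotations.items()}
print(resolved)  # {'sentence-1': 'yes', 'sentence-2': None}
```

Items that resolve to `None` are good candidates to send back for discussion rather than to drop, since they often reveal a gap in the curation rules.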
11. Be aware of bias
As machine learning has become more prevalent in real-world applications, it has become apparent that bias can slip into a training set. Underlying assumptions about what the data represents (or should represent) can be unintentionally built into training data as it is created. Think about whether the population you want your model to work on is truly being represented fairly by the training data being produced. Using several curators to produce a consensus, and measuring how closely they agree (often called inter-annotator agreement), can help. Just be careful not to introduce more bias in the process. Elves come in all shapes and sizes!
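Inter-annotator agreement is usually reported as a number. One widely used statistic for two curators is Cohen’s kappa, which corrects the raw agreement rate for the agreement you would expect by chance. The sketch below is a minimal standard-library version; the `cohens_kappa` function name and the example labels are ours, not part of any SciBite tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two curators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both curators must label the same non-empty set of items")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both curators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # based on how often each curator used each label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1:  # both curators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


curator_1 = ["yes", "yes", "no", "no"]
curator_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(curator_1, curator_2)
print(kappa)  # 0.5
```

A kappa of 1 means perfect agreement and 0 means no better than chance. Note that high agreement does not rule out bias: curators who share the same assumptions will agree with each other while being wrong in the same direction.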
12. Plan for the future
In the life sciences, knowledge is constantly being updated and appended, so you need to consider a strategy for updating your training data and subsequently your model. This can be particularly important if a model trained on your data consistently makes an important misclassification. It’s likely you’ll need to generate new or additional training data. It is therefore a good plan to design your data collection and model training to be easily – or even automatically – repeated. If well designed, this system could be entirely handed over to the curators themselves to update as and when they add to the training data.
To learn more about how SciBite uses Machine Learning approaches, check out our latest Webinar recording on Scaling the Data Mountain with Ontologies, Deep Learning & FAIR.
You can also get in touch with the team to learn more at [email protected].
© SciBite Limited / Registered in England & Wales No. 07778456