In the latest release of TERMite, we now include a fuzzy matching feature to help identify incorrectly spelt concepts in text.
TERMite, our high-performance text recognition engine, is designed to identify key concepts in biomedical text regardless of the synonyms used. However, a drawback of this approach is that generally, terms must be spelt correctly to be recognized.
In general, highly proofread sources such as MEDLINE or ClinicalTrials.gov contain very few errors, so this issue has a negligible effect on recall there. Other document types, such as patents and many internal documents, are not proofread to the same degree and contain a larger proportion of such errors.
As a number of customers had expressed an interest in extending TERMite’s ability to match misspelt words, we have developed a new fuzzy matching feature for the latest release.
Fuzzy matching requires the alignment of misspelt words to known dictionary terms. Once activated within TERMite, this feature invokes a set of algorithms designed to identify incorrectly spelt words.
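To make the idea of aligning a misspelt word to a dictionary term concrete, here is a minimal sketch using classic Levenshtein edit distance. This is an illustration only and not TERMite’s actual algorithm, which is not described here; the function names `edit_distance` and `closest_term`, the `max_distance` parameter and the two-entry dictionary are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_term(word, dictionary, max_distance=1):
    """Return the dictionary term nearest to `word` (case-insensitive),
    or None if no term is within `max_distance` edits."""
    best_term, best_distance = None, max_distance + 1
    for term in dictionary:
        distance = edit_distance(word.lower(), term.lower())
        if distance < best_distance:
            best_term, best_distance = term, distance
    return best_term


# Hypothetical two-entry dictionary, for illustration only.
dictionary = ["Galectin", "Vimentin"]
print(closest_term("Galactin", dictionary))  # Galectin
print(closest_term("Vimintin", dictionary))  # Vimentin
print(closest_term("Gelatin", dictionary))   # None (more than one edit away)
```

A small `max_distance` keeps unrelated but similar-looking words, such as “Gelatin”, from being pulled onto a gene name.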
For instance, this PubMed article concerning the gene Galectin-3 uses the technically incorrect term “Galactin” (‘a’ instead of ‘e’) in the first sentence. This is actually a common mistake in MEDLINE, with over 20 articles using the incorrect ‘a’ form. A similar issue occurs with the misspelling of Vimentin as Vimintin in other articles.
A different example, based on spacing rather than spelling, comes from this cancer article, which notes a measurement of carcinoembryonic antigen protein levels. The standard name for this gene does not separate the “carcino” and “embryonic”, but very occasionally the separated form is used, as in the aforementioned article.
Patents are well known to be a source of significant numbers of spelling and transposition errors. A good example is this patent, which lists a number of diseases, including Creutzfeldt Jakob disease as “Creutzfeld- Jacob Disease”, a misspelling in both elements of the name.
All of these forms are identified using TERMite’s fuzzy matching feature.
By default, the algorithm assumes that the entity will be spelt correctly at least once somewhere in the text. This can be changed to show all fuzzy matches in the text, though given the huge number of similar words, there may be a large number of hits!
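The default behaviour described above can be pictured as a simple gate: a fuzzy hit is only reported when the correctly spelt term also occurs somewhere in the same text. The sketch below is an assumption about how such a rule could look, not TERMite’s implementation; `fuzzy_hits`, `require_exact_anchor` and the `cutoff` value are hypothetical names and settings, and similarity is scored with Python’s standard `difflib` module.

```python
import difflib
import re


def fuzzy_hits(text, dictionary, require_exact_anchor=True, cutoff=0.85):
    """Return (token, dictionary_term) pairs for near-miss spellings.

    With require_exact_anchor=True, a fuzzy hit is kept only if the
    correctly spelt term also appears somewhere in `text`.
    """
    lowered = {term.lower(): term for term in dictionary}
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    # Terms that occur in the text with their exact (case-insensitive) spelling.
    exact_seen = {lowered[t.lower()] for t in tokens if t.lower() in lowered}

    hits = []
    for token in tokens:
        if token.lower() in lowered:
            continue  # an exact match, not a fuzzy one
        candidates = difflib.get_close_matches(
            token.lower(), list(lowered), n=1, cutoff=cutoff
        )
        if not candidates:
            continue
        term = lowered[candidates[0]]
        if require_exact_anchor and term not in exact_seen:
            continue  # no correctly spelt occurrence to anchor this hit
        hits.append((token, term))
    return hits


text = "Galectin levels rose, while Galactin and Vimintin were also reported."
terms = ["Galectin", "Vimentin"]
print(fuzzy_hits(text, terms))
print(fuzzy_hits(text, terms, require_exact_anchor=False))
```

In this example, “Vimintin” is only reported once the anchor requirement is switched off, because “Vimentin” never appears correctly spelt in the text.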
Note the incorrectly spelt ‘hitamine (H2) receptor’ being correctly attributed to the GENE HRH2.
From now on, whether it’s down as a hitaminc receptor, histamine recep–tor or histaminereceptor – we’ll help you find the true meaning of the term and improve your search results across the corpus.
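Spacing and hyphenation variants, such as those listed above, can be compared by collapsing separators before matching. The following is a minimal sketch under that assumption and says nothing about how TERMite handles these cases internally; `collapse` and `same_ignoring_separators` are hypothetical helper names.

```python
import re

# Whitespace, the ASCII hyphen, and the Unicode hyphen/dash range U+2010..U+2015.
_SEPARATORS = re.compile(r"[\s\-\u2010-\u2015]+")


def collapse(text: str) -> str:
    """Lower-case `text` and remove whitespace, hyphens and dashes."""
    return _SEPARATORS.sub("", text).lower()


def same_ignoring_separators(a: str, b: str) -> bool:
    """True if `a` and `b` differ only in case, spacing or hyphenation."""
    return collapse(a) == collapse(b)


print(same_ignoring_separators("histamine recep\u2013tor", "histamine receptor"))
print(same_ignoring_separators("histaminereceptor", "Histamine receptor"))
print(same_ignoring_separators("carcino embryonic antigen", "carcinoembryonic antigen"))
```

Collapsing separators only addresses spacing and hyphenation. A genuine misspelling such as “hitaminc receptor” would still need a similarity measure on top.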
When building dictionaries, TERMite offers a suite of tools to help users gain the best levels of precision and recall for their input terms. Fuzzy matching can also help in this process by identifying closely related terms that may not be in the dictionary. For example, a customer interested in ‘Signaling lymphocytic activation molecule’ may not have thought to include the ‘lymphocyte’ variant of the name often used by authors. To aid with the identification of such variants, TERMite offers a simplified fuzzy term identification workflow that dictionary developers can use to find such terms rapidly.
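As a rough picture of what a variant-discovery step might involve, the sketch below ranks phrases seen in a corpus by their similarity to a dictionary term and suggests those that are close but not yet known. It is a hypothetical illustration, not the TERMite workflow itself; `suggest_variants`, its parameters and the sample phrases are assumptions, and scoring uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def suggest_variants(term, corpus_phrases, known_synonyms=(), threshold=0.9):
    """Suggest corpus phrases that closely resemble `term` but are not
    already the term itself or one of its known synonyms."""
    known = {term.lower(), *(s.lower() for s in known_synonyms)}
    suggestions = []
    for phrase in corpus_phrases:
        if phrase.lower() in known:
            continue  # already covered by the dictionary
        score = SequenceMatcher(None, term.lower(), phrase.lower()).ratio()
        if score >= threshold:
            suggestions.append((phrase, score))
    # Most similar candidates first, for review by a dictionary developer.
    return sorted(suggestions, key=lambda pair: pair[1], reverse=True)


# Hypothetical phrases, standing in for text harvested from a corpus.
phrases = [
    "signaling lymphocyte activation molecule",
    "lymphocyte activation gene",
    "Signaling lymphocytic activation molecule",
]
for phrase, score in suggest_variants("Signaling lymphocytic activation molecule", phrases):
    print(f"{phrase} ({score:.2f})")
```

The output of such a step is a list of candidates for a person to accept or reject, rather than terms added to the dictionary automatically.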
Of course, some spelling changes lie in more of a grey area. For instance, a search using our protein-type dictionary for ‘oxytocin receptors’ identified a number of papers (such as this) through a fuzzy match to ‘oxytocics’, a class of drugs that operate on the oxytocin receptors. It is, of course, very debatable whether oxytocics is a true synonym for oxytocin receptors – it all depends on the research the customer is performing.
With fuzzy matching enabled, the user can now choose to include or exclude this type of term on a case-by-case basis.
If you’d like to see how our fuzzy matching feature works on your data or have any questions, please get in touch.
© SciBite Limited / Registered in England & Wales No. 07778456