Recent years have seen drastic increases in the number of life sciences publications, and researchers have become reliant upon digital text analytics to distil the data and unlock the insight they need for their own use cases. SciBite has always aimed to develop transformative technology that makes this as simple and effective as possible, and by working closely with our customers we have made great strides in this direction.
Literature written in languages other than English, however, presents unique challenges that remain largely unexplored. In this blog, we will delve into how we applied novel methods to Japanese language literature. We believe these techniques are transferable, and we encourage groups working with other under-supported languages to read on and get in touch with our expert team.
The Japanese language presents its own challenges for text analytics. For example, many text mining and natural language processing (NLP) techniques rely on tokenizing words using whitespace, which is not a feature found in Japanese text. The language also mixes several writing systems (hiragana, katakana and kanji), is particularly challenging to learn, and lacks freely accessible equivalents to key English sources of life sciences text, such as MEDLINE, which many methods rely upon for their training. All these factors contribute to a lack of options for researchers who want to unearth the riches buried in Japanese life sciences literature.
The dawn of machine learning has brought with it techniques which are effectively language independent. For example, by using advances in linguistics we can now separate out individual words from Japanese text. This allows us to leverage our state-of-the-art tools, such as TERMite, our named entity recognition (NER) and extraction engine, to comb through Japanese language text. With this infrastructure in place, all that remains is to develop state-of-the-art vocabularies to empower TERMite to identify key concepts in Japanese.
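To make the word-splitting step concrete, here is a minimal sketch of one classic approach: greedy longest-match segmentation against a word list. This is an illustration only, not the tokenizer used in our pipeline. The `segment` function and the tiny `LEXICON` are hypothetical, and a production system would more likely rely on a full morphological analyzer.

```python
def segment(text, lexicon):
    """Split unspaced text into tokens by greedy forward longest match.

    At each position the longest lexicon entry that matches is taken;
    if nothing matches, a single character is emitted as a fallback.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy lexicon; a real one would hold many thousands of entries.
LEXICON = {"糖尿病", "の", "治療", "に", "インスリン", "を", "使用", "する"}

# "Insulin is used in the treatment of diabetes"
tokens = segment("糖尿病の治療にインスリンを使用する", LEXICON)
print(tokens)
```

Once the text is split into tokens like this, tools designed for whitespace-delimited languages can process it in the usual way.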
Not only does this splitting allow us to use our NER engine, but it also allows us to embed Japanese words into a semantic space using techniques like Word2Vec (a group of related models that are used to produce word embeddings). This means we can perform computational analysis on Japanese text, even without any expertise in the language, using mathematical methods. By carefully selecting relevant Japanese texts, and developing appropriate preprocessing pipelines, we aim to create the best possible Japanese word embeddings, to get the best results in downstream applications.
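The idea that words can be compared mathematically, without reading the language, can be sketched in a few lines. The example below does not use Word2Vec. As a simplified stand-in, it builds a vector for each word by counting the words that appear near it, then compares vectors with cosine similarity. The corpus, function names and window size are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical pre-tokenized corpus: two disease phrases and one drug phrase.
CORPUS = [
    ["糖尿病", "の", "治療"],      # treatment of diabetes
    ["高血圧", "の", "治療"],      # treatment of hypertension
    ["インスリン", "を", "投与"],  # administer insulin
]

vectors = cooccurrence_vectors(CORPUS)
print(cosine(vectors["糖尿病"], vectors["高血圧"]))      # shared contexts
print(cosine(vectors["糖尿病"], vectors["インスリン"]))  # no shared contexts
```

In this toy corpus the two disease terms appear in identical contexts and so score as highly similar, while the drug term shares no context with them. Learned embeddings such as Word2Vec capture the same kind of signal in a much denser and more robust form.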
One observation from our experiments has been that some knowledge is assumed and may therefore seldom, or even never, be explicitly stated in the academic literature. As a result, if we exclusively use academic literature, the models we train may never be exposed to some of the most critical facts within the field. To counter this, we also use methods developed by our machine learning team to include relevant strands of Wikipedia, where this knowledge is more likely to be explicitly stated. Because Wikipedia provides multilingual links, this stage can be performed entirely in English.
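As a rough illustration of how multilingual links let this selection happen in English, the sketch below builds a request for the public MediaWiki API's `langlinks` property and maps English page titles to their Japanese counterparts. This is an assumption about one possible approach, not a description of our internal tooling. The helper names are hypothetical, and the sample response is hand-written to mirror the shape we would expect with `formatversion=2`.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_url(titles, lang="ja"):
    """Build a MediaWiki API query asking for interlanguage links of `titles`."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),  # the API accepts pipe-separated titles
        "lllang": lang,              # only return links to this language
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_titles(response):
    """Map each English title to its linked title, skipping pages without one."""
    mapping = {}
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            mapping[page["title"]] = link["title"]
    return mapping


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE = {
    "query": {
        "pages": [
            {"title": "Insulin", "langlinks": [{"lang": "ja", "title": "インスリン"}]},
            {"title": "A page with no Japanese version"},
        ]
    }
}

url = langlinks_url(["Insulin", "Diabetes"])
mapping = extract_titles(SAMPLE)
print(url)
print(mapping)
```

The URL can be passed to any HTTP client. English pages chosen by an English-speaking curator then resolve to Japanese pages whose text can be added to the training corpus.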
Of course, one thing that sets SciBite apart is that curation is at the core of our strategy. Rather than attempting to replace human oversight with machine learning, we use machine learning synergistically with our expert curators to help them to be more efficient and accurate. We are experimenting with pipelines for handling Japanese text, and these are now sufficiently advanced that even curators with no Japanese skills are able to create simple vocabularies using translation and suggestion systems. These can then be fine-tuned and expanded upon by Japanese curators to ensure the highest standards of accuracy and depth of coverage.
We began our pipeline with a minimal vocabulary curated by an expert with knowledge of Japanese. A simple algorithm using the word embeddings described above was then used to suggest words which may belong in this initial stub vocabulary. Direct translations often don’t work, which means trying to use a service like Google Translate to convert our English vocabularies into Japanese can generate nonsense. However, by learning embeddings from life sciences literature, we guarantee that the words which are suggested to our curators have first been identified within the relevant literature.
This confidence can be put beyond doubt if a direct definition can be found within a Japanese dictionary. New additions identified by curators can then be fed back into the suggestion algorithm in an iterative loop to generate more suggestions. Meanwhile, any terms without an English translation are put to one side for Japanese curators to check at a later date.
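The suggest-and-review loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our production algorithm: the two-dimensional `EMBEDDINGS` are made up, candidates are ranked by their best cosine similarity to any term already in the vocabulary, and the `approve` callback stands in for a curator's decision.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def suggest(embeddings, vocabulary, exclude=(), top_n=1):
    """Rank unseen words by their best similarity to any vocabulary term."""
    seeds = [embeddings[w] for w in vocabulary if w in embeddings]
    scores = {}
    for word, vec in embeddings.items():
        if word in vocabulary or word in exclude:
            continue
        scores[word] = max((cosine(vec, s) for s in seeds), default=0.0)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def expand(embeddings, seed, approve, rounds=4):
    """Grow a vocabulary by feeding approved suggestions back into the loop."""
    vocabulary, rejected = set(seed), set()
    for _ in range(rounds):
        for word in suggest(embeddings, vocabulary, exclude=rejected):
            if approve(word):
                vocabulary.add(word)   # accepted terms become new seeds
            else:
                rejected.add(word)     # never suggest a rejected term again
    return vocabulary


# Made-up 2-D embeddings: three indications and two unrelated terms.
EMBEDDINGS = {
    "糖尿病": (0.9, 0.1),      # diabetes
    "高血圧": (0.8, 0.2),      # hypertension
    "喘息": (0.7, 0.3),        # asthma
    "インスリン": (0.1, 0.9),  # insulin
    "投与": (0.2, 0.8),        # administration
}

# Stand-in for a curator who accepts only indications.
INDICATIONS = {"高血圧", "喘息"}
result = expand(EMBEDDINGS, {"糖尿病"}, approve=lambda w: w in INDICATIONS)
print(sorted(result))
```

Because every candidate comes from the embedding vocabulary, each suggestion is a word that actually occurs in the training literature, which is the property that makes the suggestions trustworthy for a curator who cannot read Japanese.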
To test this, our SciBite team in Japan created a small stub vocabulary consisting of about 20 indications. Our SciBite team in the UK were then able to expand this to hundreds of terms within just a few minutes and without any knowledge of Japanese. This was sent back to the Japanese team who confirmed the suitability of the overwhelming majority of our additions.
These encouraging results show that this method can empower us to bootstrap new vocabularies in mere minutes, not only in English but also in foreign languages, and we are looking forward to using this in the wild to help our Japanese customers to find exactly what they are looking for within Japanese language text.
If you are hoping to analyze scientific text in another language with limited support, get in touch with the SciBite team for more information.
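Using technology that combines machine learning with translation tools, our curators succeeded in creating Japanese VOCabs even without any knowledge of Japanese. After creation, the accuracy and quality of the VOCabs in their field were confirmed by Japanese experts. For enquiries about this blog, please contact [email protected].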
Using machine learning and translation, our expert curators at SciBite have created Japanese vocabularies without any prior knowledge of the language itself. Our Japanese experts confirmed the quality of these results. For more information on how we can help you unlock the insights of your scientific literature in another language, please contact our Japan-based Technical Sales Manager, Patrick at [email protected].
Oliver Giles, Machine Learning Scientist, received his MSc in Synthetic Biology from Newcastle University, and his BA in Philosophy from the University of East Anglia. He is currently focused on interfacing natural language with structured data, extracting said structured data from text and on using AI for the inference of novel hypotheses.