In this blog post, we hear from the Scientific Lead of GSK’s Data and Computational Sciences Solutions team about how semantic integration can become part of an integrated learning framework for more informed scientific decision making.
Most pharmaceutical companies have recognised the need to transform their R&D activities by taking a more data-driven approach to the discovery and development of novel therapeutics. For example, in a recent Forbes interview, GSK’s new CSO, Hal Barron, talked about how machine learning could be applied to genetic data to help researchers find the most promising experimental medicines. Clearly, data science will play a pivotal role in realising this vision, so it was great to hear more about this from Samiul Hasan, Scientific Lead within the Data and Computational Sciences Solutions team at GSK, during our recent webinar.
To validate their findings, over 250 of GSK’s discovery teams use a mandatory question-driven process [1] to progress assets through the pipeline. Many of the data sources at GSK’s disposal are unstructured, ranging from peer-reviewed journal articles and internal experimental notes to clinical data in medical records and databases such as ClinicalTrials.gov. One area of particular interest is the identification of true molecule-mechanism-observation relationships.
However, according to Samiul, this typically “relies on freeing up resource constrained data scientists to iteratively refine and redo the same types of ad hoc relationship searches”. As a last resort, scientists use tools like Google and PubMed, which are not designed to find molecular relationships, resulting in “impairment of the pace and quality of scientific reviews and ultimately project decisions”.
To address this problem, Samiul and his team have been working with SciBite to demonstrate how better data curation can improve results when working with unstructured content. GSK have been able to augment SciBite’s extensive vocabularies, which focus on publicly available information (ontologies and terms), with proprietary internal nomenclature, such as compound IDs. These have then been used to facilitate a machine-learning-based document classification process. GSK are also enriching their unstructured data sources using approximately 30 scientific concepts (or ‘VOCabs’ in SciBite’s terminology), including gene, drug and species.
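To make the idea of augmenting a public vocabulary with internal nomenclature more concrete, here is a minimal Python sketch. It is not SciBite’s implementation: the vocabulary layout, the compound ID "CMPD-0001" and the helper names merge_vocabs and tag are all assumptions made purely for illustration.

```python
import re

# Hypothetical vocabularies: entity type -> canonical ID -> synonyms.
# Neither the layout nor the entries come from SciBite or GSK.
public_vocab = {"DRUG": {"aspirin": ["aspirin", "acetylsalicylic acid"]}}
internal_vocab = {"DRUG": {"CMPD-0001": ["CMPD-0001", "CMPD0001"]}}


def merge_vocabs(*vocabs):
    """Combine several vocabularies, pooling synonyms per canonical ID."""
    merged = {}
    for vocab in vocabs:
        for entity_type, entries in vocab.items():
            for canonical_id, synonyms in entries.items():
                bucket = merged.setdefault(entity_type, {})
                bucket.setdefault(canonical_id, set()).update(synonyms)
    return merged


def tag(text, vocab):
    """Return (start, entity_type, canonical_id, matched_text) for each hit."""
    found = []
    for entity_type, entries in vocab.items():
        for canonical_id, synonyms in entries.items():
            for synonym in synonyms:
                # Match whole terms only, ignoring case.
                pattern = rf"(?<!\w){re.escape(synonym)}(?!\w)"
                for m in re.finditer(pattern, text, re.IGNORECASE):
                    found.append((m.start(), entity_type, canonical_id, m.group(0)))
    return sorted(found)


vocab = merge_vocabs(public_vocab, internal_vocab)
hits = tag("Rats given CMPD0001 and acetylsalicylic acid showed no change.", vocab)
```

In this toy example, both the internal compound ID and the public drug synonym resolve to a canonical identifier, which is the property that lets downstream searches treat internal and public names uniformly.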
We have also worked together to iteratively develop, refine and test over 40 TExpress ‘bundles’ that define semantic patterns. Semantic patterns describe a relationship between two concepts, such as a gene and drug, in the form Gene-Verb-Drug. If you’re not familiar with bundles yet, they enable multiple semantic patterns that encompass different ways of describing the same concept to be aggregated and run across the same data simultaneously. The number of patterns that match a piece of text, and whether those matches are complementary or competing, provide a powerful and clear indicator of the relevance of search results.
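The idea of a bundle can be sketched in a few lines of Python. This is only an illustration using regular expressions, not TExpress syntax: the gene and drug lists, the verb lists, the pattern names and the functions match_bundle and rank are all hypothetical.

```python
import re

# Toy vocabularies, assumed for illustration only.
GENES = ["EGFR", "TNF", "BRAF"]
DRUGS = ["gefitinib", "infliximab", "vemurafenib"]


def alternation(terms):
    """Build a regex alternation from a list of literal terms."""
    return "(?:" + "|".join(re.escape(t) for t in terms) + ")"


GENE, DRUG = alternation(GENES), alternation(DRUGS)

# A "bundle": several patterns describing the same gene-drug relationship.
BUNDLE = {
    "gene_verb_drug": re.compile(
        rf"\b(?P<gene>{GENE})\s+(?:is|was)\s+(?:inhibited|blocked|targeted)"
        rf"\s+by\s+(?P<drug>{DRUG})\b",
        re.IGNORECASE,
    ),
    "drug_verb_gene": re.compile(
        rf"\b(?P<drug>{DRUG})\s+(?:inhibits|blocks|targets)\s+(?P<gene>{GENE})\b",
        re.IGNORECASE,
    ),
}


def match_bundle(sentence):
    """Run every pattern in the bundle; return (pattern, gene, drug) hits."""
    hits = []
    for name, pattern in BUNDLE.items():
        m = pattern.search(sentence)
        if m:
            hits.append((name, m.group("gene"), m.group("drug")))
    return hits


def rank(sentences):
    """Order sentences by how many bundle patterns they match (most first)."""
    scored = [(len(match_bundle(s)), s) for s in sentences]
    return [s for n, s in sorted(scored, key=lambda pair: -pair[0]) if n > 0]
```

Here the count of matching patterns acts as a crude relevance score. A real bundle would also need to weigh whether matches are complementary or competing, which this sketch does not attempt.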
GSK’s TExpress bundles cover a range of topics and have been tailored with the input of GSK subject matter experts. These can be used to identify phrases relating to targets, such as:
“increased apoptosis in the lymph nodes and spleen) of rats given 100 mg/kg/day <GSK_CMPD_NO> were consistent with the known pharmacological effects of <DRUG_CLASS>”
… in vivo activity, such as:
“<GSK_CMPD_NO> showed absorbance in the region of concern, which indicates a potential for a human <DRUG_CLASS>”
… and also encompass in vivo and clinical concepts.
Samiul explained how this approach “takes away many terminological, syntactic, and semantic complexities from the end users. For example, people used to use protein family names when referring to targets whereas nowadays it is more common to use gene symbols”.
He described how GSK scientists are now empowered to carry out sentence-level evidence searches and can “quickly and easily search through millions of documents and find studies relevant to their research interest that might not have otherwise been immediately obvious”. Within complex documents, such as investigator brochures, where different sections focus on different aspects, bundles also make it easier for readers to direct their attention to the most relevant areas of the document.
The presentation also offered some valuable advice for other companies looking to use TExpress bundles to get more from their unstructured content. Firstly, it’s important to find the right one or two individuals to assist with the testing process. Secondly, the ability to scale and productionize the approach depends on data scientists having the time to improve existing bundles and create new ones.
Samiul also shared some thoughts about future refinements to the current bundles, as well as some ideas for enhancing the SciBite platform, such as supporting queries for the absence of an entity in a pattern, e.g. Drug X affects Disease A but not Disease B.
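The logic of such an “absence” query can be illustrated with a small Python sketch. It does not represent any existing or planned SciBite feature: it assumes that (drug, disease) pairs have already been extracted by some pattern-matching step, and the data and the function drugs_affecting_only are hypothetical.

```python
from collections import defaultdict

# Hypothetical (drug, disease) relations, as if already extracted from text.
relations = [
    ("drug_x", "disease_a"),
    ("drug_y", "disease_a"),
    ("drug_y", "disease_b"),
    ("drug_z", "disease_b"),
]


def drugs_affecting_only(relations, present, absent):
    """Drugs linked to `present` with no recorded link to `absent`."""
    diseases_by_drug = defaultdict(set)
    for drug, disease in relations:
        diseases_by_drug[drug].add(disease)
    return sorted(
        drug
        for drug, diseases in diseases_by_drug.items()
        if present in diseases and absent not in diseases
    )


result = drugs_affecting_only(relations, "disease_a", "disease_b")
```

Note that this treats a missing relation as absence of evidence only; it cannot distinguish “no effect was reported” from “nothing was written about it”, which is part of what makes this kind of query harder to support at the pattern level.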
[1] Based on principles described in Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. David Cook, Dearg Brown, Robert Alexander, Ruth March, Paul Morgan, Gemma Satterthwaite & Menelas N. Pangalos. Nat Rev Drug Discov. 13(6):419-31 (2014). doi: 10.1038/nrd4309
Richard is a seasoned marketing professional with over two decades of experience in the information services and life sciences sectors. Currently, he is the Senior Manager, Portfolio Marketing at Elsevier’s SciBite, where he drives strategic campaigns and harnesses data-driven strategies to amplify the platform’s online visibility and impact.