The pivotal role of semantic enrichment in the evolution of data commons

In this blog post, discover how Pfizer have integrated SciBite’s semantically enriched vocabularies into their Data Commons project, which has the goal of enabling scientists to develop and refine hypotheses by investigating correlations between genetic and phenotypic data.

The value of clinical data in pharmaceutical research

Clinical data can provide valuable insights for pharmaceutical research. Mining adverse event data, for instance, can reveal opportunities for drug repositioning: an analysis of public clinical trials data in ClinicalTrials.gov identified a lower incidence of gastric cancer in patients treated with aliskiren, a treatment for hypertension, than in those given placebo, suggesting that the drug could be repurposed to treat cancer [1].

However, one of the fundamental challenges with clinical data is that the information captured for a given clinical study is very specific to that study, typically resulting in a bespoke database for each trial with no common schema. While this may not be a problem when analyzing data for that one study, the data is not interoperable across studies, which creates a barrier to translational research.

Evolution of Pfizer’s data commons project

During SciBite’s 2018 User Group Meeting in Boston, Cathy Marshall (Director, Genomics Data Informatics Strategy & Implementation) and Alicia Dana (Data Strategy Lead for Medicinal Sciences) from Pfizer gave an excellent presentation on the evolution of Pfizer’s Data Commons project, which has the goal of enabling scientists to develop and refine hypotheses by investigating correlations between genetic and phenotypic data.

In 2012, the initial version of Pfizer’s Data Commons was based on the deployment of tranSMART to provide a single platform to collect both clinical and genomic data. However, while tranSMART does have a unifying schema for all studies, Pfizer’s experience has been that the mapping of each new study requires extensive curation effort from technical and scientific experts.

In addition, because tranSMART was designed with data capture in mind, Pfizer was unable to perform broad, cross-project queries in support of research hypotheses.

Figure 1: A conceptual view of the transition from Data Commons 1.0 to 2.0 [2]

More recently, Pfizer have integrated SciBite’s Gene, Drug, Species, and Technology vocabularies into their Data Commons 2.0 platform, augmented with proprietary dictionaries of internal study and compound IDs and a new dictionary based on the CDISC standard for measures. This allows Pfizer to semantically enrich data from a range of unstructured documents held in file shares and in repositories such as ELN and SharePoint.
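To make the idea of vocabulary-based enrichment concrete, here is a minimal sketch of dictionary-driven entity tagging in Python. It is not TERMite or Pfizer's implementation: the vocabularies, the identifiers (such as `DRUG:001`), and the function names are all hypothetical, and a production engine would handle ambiguity, morphology, and context that this toy ignores.

```python
import re

# Hypothetical toy vocabularies: entity type -> {concept ID -> synonyms}.
# IDs and synonym lists are invented for illustration only.
VOCABS = {
    "DRUG": {"DRUG:001": ["aliskiren", "tekturna"]},
    "GENE": {"GENE:001": ["renin"]},
    "SPECIES": {"SPECIES:001": ["human", "homo sapiens"]},
}


def build_lookup(vocabs):
    """Flatten the vocabularies into synonym -> (entity type, concept ID)."""
    lookup = {}
    for entity_type, concepts in vocabs.items():
        for concept_id, synonyms in concepts.items():
            for synonym in synonyms:
                lookup[synonym.lower()] = (entity_type, concept_id)
    return lookup


def annotate(text, lookup):
    """Return one annotation per synonym found in the text."""
    # Try longer synonyms first so multi-word terms are preferred.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    annotations = []
    for match in pattern.finditer(text):
        entity_type, concept_id = lookup[match.group(1).lower()]
        annotations.append(
            {
                "type": entity_type,
                "id": concept_id,
                "text": match.group(1),
                "start": match.start(),
            }
        )
    return annotations


LOOKUP = build_lookup(VOCABS)
hits = annotate("Aliskiren lowered renin activity in human subjects.", LOOKUP)
for hit in hits:
    print(hit["type"], hit["id"], hit["text"])
```

Because every synonym resolves to a single concept ID, documents that use different names for the same drug or gene end up tagged with the same identifier, which is what makes them searchable together without manual curation.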

According to Alicia, this approach has required “minimal data manipulation, removing the need for formal curation” and has resulted in “an ontology-based index which enables intelligent searches using broad English term queries tailored to the translational/exploratory research domain.”

For example, users can now perform faceted, ‘Amazon-like’ scientific searches such as:

Find biomarkers that predict the response to a given drug or that predict disease progression.
What were the results from placebo versus treated subjects?
What was the clinical response to a particular class of drug?
Which studies included one or more specified exclusion criteria?
Which studies excluded obese people, yet saw patients gain weight beyond the exclusion criteria?
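As a rough illustration of how an index keyed on ontology concepts can support this kind of faceted search, the Python sketch below builds a small inverted index and narrows results by intersecting concept filters. The study names, facet names, and concept IDs are hypothetical, and this is an assumption about the general technique, not a description of the Data Commons 2.0 internals.

```python
from collections import Counter, defaultdict

# Hypothetical annotated studies: study ID -> set of (facet, concept ID) pairs.
STUDIES = {
    "study-A": {("DRUG", "DRUG:001"), ("ARM", "placebo"), ("EXCLUSION", "obesity")},
    "study-B": {("DRUG", "DRUG:001"), ("ARM", "placebo")},
    "study-C": {("DRUG", "DRUG:002"), ("EXCLUSION", "obesity")},
}


def build_index(studies):
    """Inverted index: (facet, concept ID) -> set of study IDs."""
    index = defaultdict(set)
    for study_id, concepts in studies.items():
        for concept in concepts:
            index[concept].add(study_id)
    return index


def search(index, all_ids, *filters):
    """Return the studies that match every (facet, concept ID) filter."""
    result = set(all_ids)
    for concept in filters:
        result &= index.get(concept, set())
    return result


def facet_counts(studies, study_ids, facet):
    """Count concept IDs for one facet across the given studies."""
    counts = Counter()
    for study_id in study_ids:
        for name, concept_id in studies[study_id]:
            if name == facet:
                counts[concept_id] += 1
    return counts


INDEX = build_index(STUDIES)
excluded_obesity = search(INDEX, STUDIES, ("EXCLUSION", "obesity"))
narrowed = search(INDEX, STUDIES, ("EXCLUSION", "obesity"), ("DRUG", "DRUG:001"))
print(sorted(excluded_obesity), sorted(narrowed))
print(facet_counts(STUDIES, excluded_obesity, "DRUG"))
```

Each selected filter shrinks the result set, and the facet counts show which values remain available for further narrowing, which is the behavior behind ‘Amazon-like’ search.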

Cathy and Alicia also described how the ontology-based index can be used to power downstream applications, from Spotfire-based visualizations and statistical models to machine learning algorithms. We look forward to hearing how Pfizer’s Data Commons continues to evolve.

Learn more about TERMite, SciBite’s named entity recognition (NER) and extraction engine, and our high-quality, semantically enriched biomedical vocabularies.

References

[1] Su EW, Sanger TM. (2017). Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov. PeerJ 5:e3154.

[2] Taken from Cathy and Alicia’s presentation, ‘Termite Integration with Pfizer’s Data Commons Platform,’ presented at SciBite’s 2018 UGM in Boston.

Richard Harrison

Senior Manager, Portfolio Marketing, SciBite

Richard is a seasoned marketing professional with over two decades of experience in the information services and life sciences sectors. Currently, he is the Senior Manager, Portfolio Marketing at Elsevier’s SciBite, where he drives strategic campaigns and harnesses data-driven strategies to amplify the platform’s online visibility and impact.
