The pivotal role of semantic enrichment in the evolution of data commons

In this blog post, discover how Pfizer have integrated SciBite’s semantically enriched vocabularies into their Data Commons project, which has the goal of enabling scientists to develop and refine hypotheses by investigating correlations between genetic and phenotypic data.

Data commons

Clinical data can provide valuable insights for pharmaceutical research: mining adverse event data, for instance, can reveal opportunities for drug repositioning. An analysis of public clinical trials data in ClinicalTrials.gov identified a lower incidence of gastric cancer in patients treated with aliskiren, a treatment for hypertension, than in those given placebo, suggesting that the drug could potentially be repurposed to treat cancer [1].

However, one of the fundamental challenges associated with clinical data is that the information captured for a given clinical study is very specific to that study, typically resulting in a bespoke database for each trial with no common schema. While this may not be a problem when analyzing data for that specific study, the data is not interoperable, resulting in a barrier to performing translational research.

During last year’s SciBite User Group Meeting in Boston, Cathy Marshall (Director, Genomics Data Informatics Strategy & Implementation) and Alicia Dana (Data Strategy Lead for Medicinal Sciences) from Pfizer gave an excellent presentation on the evolution of Pfizer’s Data Commons project, which has the goal of enabling scientists to develop and refine hypotheses by investigating correlations between genetic and phenotypic data.

In 2012, the initial version of Pfizer’s Data Commons was based on the deployment of tranSMART to provide a single platform to collect both clinical and genomic data. However, while tranSMART does have a unifying schema for all studies, Pfizer’s experience has been that the mapping of each new study requires extensive curation effort from technical and scientific experts. In addition, because tranSMART has been designed with data capture in mind, Pfizer was unable to perform broad, cross-project queries in support of research hypotheses.

More recently, Pfizer have integrated SciBite’s Gene, Drug, Species, and Technology vocabularies into their Data Commons 2.0 platform, augmented with proprietary dictionaries of internal study and compound IDs and a new dictionary based on the CDISC standard for Measures. This allows Pfizer to semantically enrich data from a range of unstructured documents held in file shares and in repositories such as ELN and SharePoint.
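To make the idea of vocabulary-driven enrichment concrete, here is a minimal sketch of dictionary-based tagging in Python. It is not TERMite's API or Pfizer's implementation: the vocabularies, the identifiers (such as `DRUG:0001` and `STUDY:1234`), and the `annotate` function are all hypothetical, and real vocabularies handle ambiguity and synonym coverage far more carefully than this simple exact-match approach.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    entity_type: str   # which vocabulary matched, e.g. "DRUG"
    entity_id: str     # canonical identifier shared by all synonyms
    matched_text: str  # the text as it appeared in the document
    start: int         # character offset of the match


# Hypothetical vocabularies: each maps a synonym to a canonical ID.
# The IDs are made up for illustration only.
VOCABS = {
    "DRUG": {"aliskiren": "DRUG:0001", "tekturna": "DRUG:0001"},
    "GENE": {"ren": "GENE:0001", "renin": "GENE:0001"},
    "STUDY": {"st-1234": "STUDY:1234"},  # stand-in for an internal study ID
}


def annotate(text: str) -> List[Hit]:
    """Tag every whole-word, case-insensitive synonym match in the text."""
    hits = []
    for entity_type, synonyms in VOCABS.items():
        for synonym, entity_id in synonyms.items():
            pattern = re.compile(r"\b" + re.escape(synonym) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append(Hit(entity_type, entity_id, match.group(0), match.start()))
    # Return hits in document order.
    return sorted(hits, key=lambda hit: hit.start)


if __name__ == "__main__":
    sentence = "Study ST-1234 measured renin activity after Tekturna dosing."
    for hit in annotate(sentence):
        print(hit)
```

Because every synonym resolves to one canonical ID, a document that mentions a brand name and another that mentions the generic name end up tagged with the same concept, which is what makes the enriched documents searchable together.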

[Figure: Clinical Data Analytics and Reporting System. A conceptual view of the transition from Data Commons 1.0 to 2.0 [2]]

According to Alicia, this approach has required “minimal data manipulation, removing the need for formal curation” and has resulted in “an ontology-based index which enables intelligent searches using broad English term queries tailored to the translational/exploratory research domain.”

For example, users can now perform faceted, ‘Amazon-like’ scientific searches such as:

  • Find biomarkers that predict the response to a given drug or that predict disease progression.
  • What were the results from placebo versus treated subjects?
  • What was the clinical response to a particular class of drug?
  • Which studies included one or more specified exclusion criteria?
  • Which studies excluded obese people, yet resulted in patients gaining weight beyond the threshold set by the exclusion criteria?
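As a rough illustration of how an ontology-based index can support faceted queries like these, the Python sketch below builds an inverted index from concept IDs to documents, expands a broad term (a drug class) to its narrower terms, and counts facet values over the results. The documents, concept IDs, the `CHILDREN` hierarchy, and all function names are hypothetical; the sketch does not describe how Data Commons 2.0 is actually built.

```python
from collections import Counter, defaultdict
from typing import Dict, Set

# Hypothetical annotated documents: document ID -> concept IDs found in it.
DOCS: Dict[str, Set[str]] = {
    "doc1": {"DRUG:aliskiren", "INDICATION:hypertension"},
    "doc2": {"DRUG:enalapril", "INDICATION:hypertension"},
    "doc3": {"DRUG:aliskiren", "INDICATION:gastric_cancer"},
}

# Hypothetical ontology fragment: broader concept -> narrower concepts.
CHILDREN: Dict[str, Set[str]] = {
    "DRUGCLASS:raas_inhibitor": {"DRUG:aliskiren", "DRUG:enalapril"},
}


def build_index(docs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build an inverted index: concept ID -> IDs of documents mentioning it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index


def expand(concept: str) -> Set[str]:
    """Return the concept together with all of its descendants."""
    expanded = {concept}
    for child in CHILDREN.get(concept, ()):
        expanded |= expand(child)
    return expanded


def search(index: Dict[str, Set[str]], *concepts: str) -> Set[str]:
    """Documents matching every query concept (or any of its descendants)."""
    result = None
    for concept in concepts:
        matches: Set[str] = set()
        for term in expand(concept):
            matches |= index.get(term, set())
        result = matches if result is None else result & matches
    return result or set()


def facet_counts(docs: Dict[str, Set[str]], doc_ids: Set[str], prefix: str) -> Counter:
    """Count concepts with the given prefix across the matching documents."""
    return Counter(c for d in doc_ids for c in docs[d] if c.startswith(prefix))


if __name__ == "__main__":
    index = build_index(DOCS)
    found = search(index, "DRUGCLASS:raas_inhibitor", "INDICATION:hypertension")
    print(sorted(found))
    print(facet_counts(DOCS, found, "DRUG:"))
```

The expansion step is what lets a broad query term such as a drug class retrieve documents that only mention individual members of that class, while the facet counts give the 'Amazon-like' breakdown used to narrow results further.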

Cathy and Alicia also described how the ontology-based index can be used to power downstream applications, from Spotfire-based visualizations and statistical models to machine learning algorithms. We look forward to hearing how Pfizer’s Data Commons continues to evolve.

Learn more about TERMite, SciBite’s named entity recognition (NER) and extraction engine, and our high-quality, semantically enriched biomedical vocabularies.


[1] For example, see Su EW and Sanger TM (2017). Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov. PeerJ 5:e3154.

[2] Taken from Cathy and Alicia’s presentation, ‘Termite Integration with Pfizer’s Data Commons Platform,’ presented at SciBite’s 2018 UGM in Boston.
