This blog looks at the use and value of data catalogs and Master Data Management (MDM) tools, and at why an additional layer of semantics is needed before enterprises looking to manage their data better can truly realize that value.
Data catalogs are quickly becoming a core technology for large enterprises looking to manage their data better. Given the sheer scale of major companies, there will be perhaps thousands of data repositories, covering potentially billions of data sets. Researchers looking to use/reuse this data face significant challenges, not least of which is finding an individual dataset in the first place.
This is where data catalogs promise to help, bringing in data with key metadata into a centralized environment, akin to a “data supermarket,” to provide a one-stop shop to find the data needed. While there are many commercial providers of data catalog software in the marketplace, many companies choose to build their own systems specific to their needs.
Some data catalog systems will look to add a certain level of standardization and metadata to the data they are importing, for instance, normalizing “M/Male/Men/Boy” to “MALE” to provide more consistency across data from different sources. This is more complex than it seems: What are the rules for this change (for instance, does “Boy” = “Man”)? Who maintains these rules? Should they be applied to all data? Should the original data be changed, or are copies made? This becomes even more complex as large volumes of diverse data are added to the catalog. Indeed, it is important to note that the role of a data catalog is not to extensively enrich and manipulate the incoming data; that could have downstream consequences if users are looking for “the original data,” for instance, for a regulatory submission. Nevertheless, data catalogs are rightly seen as a key foundation for many data science activities within large enterprises.
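To make the challenge concrete, here is a minimal sketch of rule-based normalization in Python. The field name, mapping table, and the choice to store the normalized value alongside the untouched original are illustrative assumptions, not a description of any particular catalog product.

```python
# Minimal sketch of rule-based value normalization (illustrative only).
# Real catalogs would version these rules and track their provenance.
NORMALIZATION_RULES = {
    "sex": {
        "m": "MALE", "male": "MALE", "men": "MALE", "boy": "MALE",
        "f": "FEMALE", "female": "FEMALE", "women": "FEMALE", "girl": "FEMALE",
    }
}

def normalize(field, raw_value):
    """Return the standardized value, or the original value if no rule applies."""
    rules = NORMALIZATION_RULES.get(field, {})
    return rules.get(raw_value.strip().lower(), raw_value)

# The original record is left untouched; the normalized value sits alongside it,
# so users who need "the original data" can still retrieve it.
record = {"sex": "Boy"}
record["sex_normalized"] = normalize("sex", record["sex"])
print(record)  # {'sex': 'Boy', 'sex_normalized': 'MALE'}
```

Even this tiny example surfaces the governance questions above: someone has to own the mapping table, decide whether “Boy” really maps to “MALE,” and choose whether the normalized value replaces or merely accompanies the source value.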
In an ideal world, everyone would use precisely the same language to describe the same physical things. While such a world would be fantastic from a data processing perspective, it’d likely make life pretty boring! Nevertheless, organizations are faced with the challenge of implementing some level of standardization in preference to complete chaos. For instance, whenever you buy something online, you’re almost always presented with a drop-down menu to select your country of residence (which will often be in alphabetical order, and I wonder how many companies make the majority of their income from customers in Afghanistan versus those further down the alphabet!).
e-Commerce sites do this so you don’t make a mistake in typing in your country, removing a potential error in the financial transaction and thus increasing the system’s efficiency. The same is true inside large companies: it is much better for everyone to use the same names or identifiers for compounds, countries, employees, suppliers, projects, and many other things (also known as “entities”) to generate more consistency in the data.
Master Data Management (MDM) tools are often employed to address this critical need. These systems are generally thought of as repositories of the “ground truth” and serve this to multiple I.T. applications across the enterprise. Many MDM systems go beyond this and allow the construction of reference models and data flow pipelines to understand the complex relationship between different data systems. In recent years, the line between data catalogs and master data management has become blurred, with many companies offering solutions that encompass both.
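As a rough illustration of the MDM concept, the Python sketch below keeps one “golden” record per entity and lets each application resolve its own local identifier back to the master record. The record layout, identifiers, and system names are purely hypothetical.

```python
# Rough sketch of master data management: one "golden" record per entity,
# with the identifiers used by individual source systems mapped back to it.
# All IDs and system names here are invented for illustration.
GOLDEN_RECORDS = {
    "CMP-000123": {
        "preferred_name": "imatinib",
        "source_ids": {"ELN": "eln-8841", "LIMS": "L-2289"},
    },
}

# Reverse index so each application can resolve its local ID to the master ID.
SOURCE_INDEX = {
    (system, local_id): master_id
    for master_id, rec in GOLDEN_RECORDS.items()
    for system, local_id in rec["source_ids"].items()
}

def resolve(system, local_id):
    """Map a source-system identifier to the enterprise master identifier."""
    return SOURCE_INDEX.get((system, local_id))

print(resolve("LIMS", "L-2289"))  # CMP-000123
```

In a real MDM deployment, this “ground truth” would be served over APIs to the many systems that consume it, rather than held in an in-memory dictionary.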
While data standardization is valuable, it still lacks the depth of information required to power the latest generation of analytics and answer business questions. For instance, let’s say we standardized all the different ways of writing about the humble mouse, mapping words such as “Mice,” “Mouse model,” “M. Musculus,” and so on to a common term, “MOUSE.”
That does start to achieve some form of standardization, but we cannot go beyond basic Boolean search. For instance, many users may want to search for data on “any rodent.” While the data is there, the computer does not understand that a “MOUSE” is a rodent, as are rats, guinea pigs, and other critical experimental models. Fundamental data retrieval tasks such as finding “all kinases,” “all pyrimidine compounds,” or “all anti-inflammatories” are not solved by standardization; to address these, we need semantics.
Representing the relationships between entities is the core function of an ontology. Ontologies such as Uberon represent the relationships between body organs and tissues, the BioAssay Ontology represents key parts of drug discovery, and the Gene Ontology represents a deep understanding of cellular processes.
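To see why these relationships matter in practice, the short Python sketch below contrasts a flat synonym map with a simple is-a hierarchy. The terms and relationships are toy data invented for illustration, not drawn from any specific ontology.

```python
# Toy example: a flat synonym map alone cannot answer "any rodent",
# but adding is-a relationships from an ontology can.
SYNONYMS = {
    "mice": "MOUSE", "mouse model": "MOUSE", "m. musculus": "MOUSE",
    "rattus norvegicus": "RAT", "cavia porcellus": "GUINEA_PIG",
}

IS_A = {
    "MOUSE": "RODENT", "RAT": "RODENT", "GUINEA_PIG": "RODENT",
    "RODENT": "MAMMAL",
}

def ancestors(term):
    """Walk the is-a chain upwards to collect all parent classes."""
    result = set()
    while term in IS_A:
        term = IS_A[term]
        result.add(term)
    return result

def matches(raw_value, query_class):
    """True if the normalized term is, or descends from, the queried class."""
    term = SYNONYMS.get(raw_value.strip().lower(), raw_value.upper())
    return term == query_class or query_class in ancestors(term)

print(matches("Mouse model", "RODENT"))  # True - standardization alone couldn't say this
print(matches("M. Musculus", "MAMMAL"))  # True - the hierarchy carries the meaning
```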
Together with many other cornerstones, ontologies help bridge the gap between humans and machines. Indeed, many studies demonstrate the synergistic power of two key pillars of Artificial Intelligence, namely semantics and deep learning, to answer very complex scientific and medical questions. Watch our webinar on scaling the data mountain with Ontologies, Deep Learning, and FAIR to learn more.
While many organizations understand the power of ontologies, they have struggled in some situations due to the lack of highly tuned tools designed to maintain them, and many existing ontology solutions have suffered as a result.
For these reasons, we created SciBite’s ontology management platform, CENtree: a 21st-century resource designed specifically for life-science enterprises, employing a unique Artificial Intelligence (AI) engine to assist ontology management.
Having described three critical pieces of enterprise reference data management, one may wonder how they fit together. As described above, we’re seeing the merging of master data management and data cataloging tools, though the key functions of the two remain quite different. But a key question concerns the role of semantics: where and how should semantic technologies be deployed within this stack? There is no ‘right’ answer here, and every organization will have its own architecture and needs. However, there are two generic models that are likely the most prevalent.
1. Enhancing data catalogs with built-in semantics
An obvious starting point would be to integrate semantic enrichment technology within your data catalog. The advantage is that the catalog itself becomes much more accessible to your users: through semantics, they can now ask questions such as “return data on any kinase” and others outlined above. Such an architecture could look like the following.
Here we have a mix of data sources, some of which are compliant with the company’s MDM reference data and some of which are not. SciBite’s technologies are able to take the MDM references and ensure these are applied across the entire data catalog, building in a much greater degree of standardization. However, the benefits go further, as query and analytical tools can leverage the power of ontologies to ask complex questions that invoke the meaning of data, far beyond traditional keyword-based searches. Good examples of such implementations can be found at Pfizer and AstraZeneca, which use SciBite’s award-winning technology to build intelligent data catalogs.
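As a hedged sketch of what this first model might look like, the Python below attaches ontology terms to catalog entries and expands a user’s query with descendant terms, so a search for “kinase” also finds data tagged only with a specific kinase. The catalog entries, ontology fragment, and helper functions are illustrative placeholders, not any vendor’s actual API.

```python
# Illustrative sketch of semantic enrichment inside a data catalog:
# entries are tagged with ontology terms at ingest, and queries are
# expanded with descendant terms before matching. Toy data throughout.
CATALOG = [
    {"dataset": "assay_results_2021.csv", "tags": {"EGFR", "KINASE_ASSAY"}},
    {"dataset": "tox_panel_07.parquet",   "tags": {"LIVER", "RAT"}},
]

# A tiny ontology fragment: parent class -> known descendant terms.
ONTOLOGY_CHILDREN = {
    "KINASE": {"EGFR", "JAK2"},
    "RODENT": {"MOUSE", "RAT"},
}

def expand(term):
    """Return the query term plus all of its known descendant terms."""
    return {term} | ONTOLOGY_CHILDREN.get(term, set())

def search(term):
    """Return datasets whose tags overlap the expanded query."""
    wanted = expand(term)
    return [entry["dataset"] for entry in CATALOG if entry["tags"] & wanted]

print(search("KINASE"))  # ['assay_results_2021.csv'] - matched via its child term EGFR
print(search("RODENT"))  # ['tox_panel_07.parquet']
```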
2. Adding semantic enrichment at the knowledge graph phase
The first approach is one we’ve seen employed by a number of our customers, but it doesn’t fit all use cases. Where very large data catalogs span vast data resources across all aspects of a global enterprise, semantic enrichment can be instrumental in helping narrow down searches to specific classes of data. However, it can also be overkill to label billions of data points in a large data catalog that may never be of interest. Thus, a second model is a “just-in-time” approach, where data is enriched with semantics once it has been selected for downstream processing. The architecture here is a little different:
Here we rely on the standard data catalog flow, but once data have been selected, the application of ontologies over data and associated metadata allows for the creation of much richer knowledge graphs. We have seen such workflows work in several customer use cases and have previously outlined why this approach of knowledge graphs and semantics is so powerful.
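The Python sketch below illustrates the “just-in-time” idea under deliberately simplified assumptions: entity recognition is stubbed out as a dictionary lookup, and the output is a handful of subject-predicate-object triples ready to load into a knowledge graph. The study, text, and lookup table are invented for illustration, and the ontology identifiers are examples only.

```python
# Illustrative "just-in-time" enrichment: only metadata selected from the
# catalog is processed, and recognized entities become graph triples.
# Entity extraction is stubbed with a dictionary lookup for brevity.
SELECTED_METADATA = {
    "study_042": "Efficacy of imatinib in mouse xenograft models",
}

ENTITY_LOOKUP = {
    "imatinib": ("CHEBI:45783", "drug"),
    "mouse": ("NCBITaxon:10090", "organism"),
}

def enrich(doc_id, text):
    """Yield (subject, predicate, object) triples for recognized entities."""
    for word in text.lower().split():
        if word in ENTITY_LOOKUP:
            term_id, entity_type = ENTITY_LOOKUP[word]
            yield (doc_id, f"mentions_{entity_type}", term_id)

triples = [t for doc, text in SELECTED_METADATA.items() for t in enrich(doc, text)]
print(triples)
# [('study_042', 'mentions_drug', 'CHEBI:45783'),
#  ('study_042', 'mentions_organism', 'NCBITaxon:10090')]
```

Because only the selected data is enriched, the expensive semantic processing happens where it adds value, rather than across every record in the catalog.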
When one thinks of data catalogs, the image of a large infrastructure spanning all parts of an enterprise probably comes to mind. Indeed, any web search for the phrase will return a myriad of companies providing their view on how large organizations can benefit from bringing together data across the company. However, we’ve also seen a need for more “local” data catalogs, perhaps just on a departmental basis, that don’t need to be exposed companywide but require more nuanced tailoring to individual group needs.
One example is creating a database of entity-entity relationships (drug → adverse event, gene → phenotype) for use cases such as pharmacovigilance and drug repurposing.
As an instance of DOCstore can be set up in minutes, our customers can create quick prototypes to demonstrate the utility of semantically enriched data portals. Often the prototype matures into the end solution, but sometimes it is simply the starting point for demonstrating how future tools should evolve. Because DOCstore is designed to implicitly understand data semantics, it can be a powerful mechanism to expose company data and deliver on the FAIR (Findable, Accessible, Interoperable, Reusable) mantra so crucial to today’s life science IT infrastructure.
Data Catalogs and Master Data Management tools provide a critical and foundational data infrastructure within any large corporation. However, additional layers are required to truly see the value of such an investment. Semantics represents such a critical addition, providing a mechanism for using the data in real-world scenarios. SciBite’s API-first, integration-centric software is designed to enable semantics within these environments, be they enterprise-wide or small department-specific installations.
If your organization is looking to improve its data management, whether you need enterprise-level technology or simply a departmental solution, semantic enrichment technology like SciBite’s is key to producing the machine-readable data that will help you power the latest technological advances.
Richard is a seasoned marketing professional with over two decades of experience in the information services and life sciences sectors. Currently, he is the Senior Manager, Portfolio Marketing at Elsevier’s SciBite, where he drives strategic campaigns and harnesses data-driven strategies to amplify the platform’s online visibility and impact.