In part 1 of our three-part series, we covered why healthcare organizations recognize the value of clean data in using Real World Evidence (RWE) to improve patient outcomes. We focused on the key part SciBite plays in constructing this standardized data from clinical notes. In this part of our series, we will demonstrate our expertise in various important data domains using SciBite tooling, including:
1. Problem list diagnoses
2. Lab orders
3. Medication orders
In this deep dive, like in part 1, we have asked ChatGPT to generate a spreadsheet of simulated patient data, including patients’ names, Medical Record Numbers (MRNs), problem list diagnoses, medication orders and lab orders. The data is included as a table below:
Table 1. Simulated patient data from ChatGPT, not standardized.
We can quickly see that several of our artificial patients have similar conditions, lab orders, and medication orders. While these data points may look different at first glance, clinicians immediately understand that many of them are, in fact, the same entity, just represented with different synonyms (like “Advil” and “Ibuprofen”).
The above provides a simple but clear example of the challenge of aggregating similar records in a specific search. Imagine a researcher simply wanted to ask, “Which of these patients are taking Tylenol?” Only Emily Roberts’ patient data mentions this drug by that exact name, despite other test patients taking an equivalent drug. Because it is challenging to ask even relatively simple questions of this dataset, creating value from this data downstream, by enabling end users to search through it, is next to impossible.
Remember, this is only 10 artificial patients! The severity of the problem scales with the size of the dataset, especially when we consider the number of records in the largest healthcare systems. Let’s look deeper into the data, considering the various data domains first and then seeing how SciBite’s tools improve the quality and accuracy of the foundational data.
Problem list diagnoses, in most hospital systems, are used to keep an up-to-date free text list of the diagnoses or problems that a patient is facing (both past and current). Usually, coders align these free text entries to standard diagnoses (in either the ICD-10 or SNOMED code sets), but this manual standardization takes time. Overall, while free text entry helps alleviate the real documentation burden clinicians face, it makes it harder to find relevant patient data down the line for a variety of use cases. Even if clinicians are forced to select a SNOMED diagnosis, other clinicians may prefer a different code set when searching for patient data.
Consider patients “John Doe” and “Adam Johnson” from Table 1. Both have back pain, but their clinicians typed the problem slightly differently (“back ache” vs. “backache”). To find all relevant patients with a given problem, a researcher would have to account for all these synonyms in a large Boolean search query.
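To make the burden on the researcher concrete, here is a minimal Python sketch of that kind of search. The two spellings come from Table 1; which patient has which spelling, and the helper names `exact_search` and `boolean_search`, are assumptions made purely for illustration.

```python
# Free-text problem list entries. The two spellings are from the article;
# the assignment of spelling to patient is assumed for illustration.
problem_list = {
    "John Doe": "back ache",
    "Adam Johnson": "backache",
}


def exact_search(records, term):
    """Return patients whose entry equals the term (case-insensitive)."""
    term = term.lower()
    return sorted(name for name, entry in records.items() if entry.lower() == term)


def boolean_search(records, terms):
    """Return patients whose entry equals ANY of the OR-ed terms."""
    wanted = {t.lower() for t in terms}
    return sorted(name for name, entry in records.items() if entry.lower() in wanted)


# A single spelling misses one of the two patients.
print(exact_search(problem_list, "backache"))
# → ['Adam Johnson']

# The researcher has to enumerate every spelling by hand to find both.
print(boolean_search(problem_list, ["backache", "back ache"]))
# → ['Adam Johnson', 'John Doe']
```

Every new spelling, abbreviation, or synonym means another term in the query, which is why this approach does not scale beyond a handful of patients.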
Unlike diagnoses, lab orders are seldom typed as free text in Electronic Health Records (EHRs). Instead, many lab orders are built by application analysts to promote ease of use during documentation for clinicians. That means there can be functionally equivalent lab orders that are named differently so clinicians can easily find the order they need. Take an “eye examination” as an example. While most application analysts would prefer to build one eye examination order, they might be directed to build additional eye examination orders with various names, reducing clinician burnout by ensuring that all clinicians can find the order they need efficiently.
Again, the overall goal of reducing clinician burnout is paramount; however, for a clinical trial with complex inclusion and exclusion criteria, accurately finding patients who meet those criteria is just as important.
If we take a closer look at Table 1, there are a lot of synonyms for the same lab orders. For instance, both Michael Johnson and Olivia Thompson have equivalent lab orders for eye examinations: “examination of eye” and “Examining eye.”
There is considerable overlap between the data domains of medication and lab orders. Like lab orders, medication orders aren’t typed in as free text by clinicians. Instead, application analysts typically build multiple medication orders that mean the same thing, to promote ease of use when clinicians are placing these orders (like “Tylenol, oral tablet” vs. “Tylenol, oral liquid”).
In our example, looking back at Table 1, patients John Doe, Jane Smith, Adam Johnson, and Emily Roberts all have equivalent medication orders named “Acetaminophen 500mg”, “Tylenol 500mg”, “Paracetamol 500mg,” and “Tylenol 500mg”, respectively.
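As a rough illustration of why these four orders should collapse into one, the sketch below normalizes them with a tiny hand-written synonym map. The map, the `normalize_order` helper, and the choice of “acetaminophen” as the canonical label are assumptions for this example only; they stand in for a curated vocabulary and are not how SciBite’s tools work internally.

```python
import re

# The four equivalent medication orders named in the article.
orders = {
    "John Doe": "Acetaminophen 500mg",
    "Jane Smith": "Tylenol 500mg",
    "Adam Johnson": "Paracetamol 500mg",
    "Emily Roberts": "Tylenol 500mg",
}

# Hand-written stand-in for a curated drug vocabulary (assumption).
SYNONYMS = {
    "acetaminophen": "acetaminophen",
    "paracetamol": "acetaminophen",
    "tylenol": "acetaminophen",
}


def normalize_order(order):
    """Split an order like 'Tylenol 500mg' into (canonical name, strength).

    Returns None when the order does not match the simple
    '<name> <number>mg' shape or the name is not in the synonym map.
    """
    match = re.fullmatch(r"\s*([A-Za-z]+)\s+(\d+)\s*mg\s*", order)
    if match is None:
        return None
    name, amount = match.groups()
    canonical = SYNONYMS.get(name.lower())
    if canonical is None:
        return None
    return canonical, f"{amount}mg"


normalized = {patient: normalize_order(text) for patient, text in orders.items()}

# All four orders now share one normalized value, so a single lookup
# finds every patient on the drug regardless of the name that was typed.
matches = sorted(p for p, n in normalized.items() if n == ("acetaminophen", "500mg"))
print(matches)
# → ['Adam Johnson', 'Emily Roberts', 'Jane Smith', 'John Doe']
```

Maintaining such a map by hand quickly becomes impractical, which is the gap a curated, continuously maintained vocabulary is meant to fill.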
To solve these problems and to harmonize our data, we will use TERMite, SciBite’s award-winning Named Entity Recognition engine. When using TERMite, users receive machine-readable annotations that allow them to normalize their data to their standards. The raw output from TERMite, specifically for our problem list diagnoses aligned to the SNOMED standard, looks like the following:
Right away, TERMite recognizes each term and aligns it to a specific term within the SNOMED ontology. For instance, “migraine” and “migraine headache” are aligned to the standard term Migraine with a specific SNOMED ID (SNOMED37796009). On top of this, the metadata that comes back with each TERMite annotation includes mappings to additional important diagnosis code sets, as well as taxonomy information, making it simple to convert from one code set to another.
For the diagnoses and the lab orders, we have opted to use SciBite’s SNOMED VOCab which is built using the SNOMED ontology, but enriched with additional synonyms for the specific job of named entity recognition. Therefore, in the output, we receive specific SNOMED IDs that correspond with the natural language terms identified within the patient data.
For our medication orders, we will use SciBite’s Drug VOCab which is built from ChEMBL and maps to DailyMed, PubChem, and DrugBank – which is why our data is normalized to the public standard of ChEMBL IDs in the generated output.
Similar outputs can be generated to normalize our patient data in the remaining data domains. At the end, using SciBite’s Python Toolkit, you can simply compile the generated annotations to an easy-to-understand output like this (for brevity, the name and MRN columns have been removed):
Table 2. Clean normalized patient data run through TERMite.
In the above table, we can see that each data domain has been aligned to its respective standard using TERMite. For example, the ID “SNOMED 38341003”, the SNOMED ID for “Hypertensive disorder, systemic arterial (disorder)”, has matched both of the free text diagnoses “Hypertension” and “Hypertensive disorder” (from rows 2 and 7). Further instances of this synonym recognition can be found in the output for each data domain – a few such examples are summarized below:
As mentioned previously, this normalization is just a snippet of TERMite’s full machine-readable output, which includes taxonomy or hierarchical information as well as mappings to other standards. Ultimately, with this output, organizations can use SciBite’s tools to normalize their data to the relevant standards they require with little to no effort.
Additionally, as shown in part one, CENtree enables organizations to maintain their standards by making it simple to augment their ontologies.
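Once every free text entry carries a concept ID, downstream questions become simple lookups. The sketch below groups a few annotation records by ID. The record shape (`row`, `domain`, `text`, `concept_id`) and the `group_by_concept` helper are hypothetical simplifications made for this example; they are not TERMite’s actual output schema or part of SciBite’s Python Toolkit. Only the hypertension ID and the two row numbers come from the table discussed above.

```python
from collections import defaultdict

# Hypothetical, simplified annotation records (NOT TERMite's real schema).
# The ID and rows for the hypertension example are taken from the article.
annotations = [
    {"row": 2, "domain": "diagnosis", "text": "Hypertension",
     "concept_id": "SNOMED 38341003"},
    {"row": 7, "domain": "diagnosis", "text": "Hypertensive disorder",
     "concept_id": "SNOMED 38341003"},
]


def group_by_concept(records):
    """Map each concept ID to the rows and surface forms that matched it."""
    grouped = defaultdict(lambda: {"rows": [], "texts": []})
    for record in records:
        entry = grouped[record["concept_id"]]
        entry["rows"].append(record["row"])
        entry["texts"].append(record["text"])
    return dict(grouped)


grouped = group_by_concept(annotations)

# One ID lookup returns both rows, whichever synonym the clinician typed.
print(grouped["SNOMED 38341003"]["rows"])
# → [2, 7]
```

The same grouping would apply unchanged to lab and medication orders, since it depends only on the concept ID and not on the wording of the original entry.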
In the final installment of our 3-part series, we will discuss how organizations can take these machine-readable annotations to visualize patient data and easily identify the right patients when they are needed. Please read on and reach out to us here at SciBite if you would like to hear more about how SciBite can help you along this digital transformation journey.
Arvind Swaminathan, Technical Consultant. He is passionate about helping organizations overcome their digital transformation challenges to enable data discovery and research. Throughout his professional career, beginning at Epic Systems, Arvind has worked in the healthcare space to help clean and aggregate data for research and commercial use. He has been with SciBite since 2022.
Copyright © 2024 Elsevier Ltd., its licensors, and contributors. All rights are reserved, including those for text and data mining, AI training, and similar technologies.