Unlocking important RWE from patient data (Part 3) – Can we find all the relevant patients?

In Part 1 of our three-part series, we covered why healthcare organizations must find a way to appropriately provide Real World Evidence (RWE) to improve patient outcomes and their own financial standing. In Part 2, we took a deep dive into how various data domains can be normalized using SciBite, with concrete patient examples.

In our final installment of this series, we demonstrate how to extract a relevant subset of patients from the simulated data using two approaches – one using SciBite tools and one without. This data, like real patient data, is messy. Specifically, the data includes a bunch of equivalent data points that are described differently. Without providing a semantic layer of information on top of this data, it will be next to impossible to quickly find the correct patients aligned to a query.

You can access this dataset on Kaggle. Please follow the instructions in reference 1 to access the dataset.

This data has intentional duplicates (specifically, each group of 5 patients is semantically equivalent) to make it easy to see which patients should be returned for which queries.

Semantic understanding of the underlying data

Now that we’ve briefly analyzed the dataset, let’s define the problem we are trying to solve. In this situation, an end user is looking for relevant patients for their use (e.g., a study). They need to find patients that match specific criteria (certain diagnoses, lab orders, etc.). Without a semantic understanding of the underlying data, it will be difficult to match those patients to the end user’s query. To find these patients, the end user types, as free text, the diagnoses, lab orders, and medication orders that each patient in the resulting subset should have, like so:

[Blog] Unlocking Important Real World Evidence From Patient Data Pt3 Screen

In the rest of this article, we will analyze two solutions to solve this end user’s problem to provide them with all the patients that match their query.

The solutions – High-level architecture

The high-level architecture is visualized in the graphic below.

[Blog] Unlocking RWE From Patient Data Pt3 Picture1

Figure 1. High-level architecture of both approaches. SciBite tools are highlighted in purple.

For both approaches, it is important to note that no pre-processing of the data is allowed. The steps that involve SciBite tools, and are therefore used only in the second approach, are highlighted in purple. As the graphic shows, there are two key differences between the approaches, described below:

In approach 2 – With SciBite, the data is first normalized via Workbench.

Workbench is SciBite’s tabular extraction tool, which takes the effort out of tabular data curation.
Workbench is powered by TERMite to enable end users to normalize their tabular data (like this simulated patient CSV data file).

In approach 2 – With SciBite, we also normalize the end user’s input with TERMite.

This ensures that the user’s query also aligns to the same data standards we used to normalize the data.
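The two SciBite-specific steps can be sketched in a few lines of Python. This is only an illustration of the idea, not SciBite's API: the `SYNONYM_TO_ID` dictionary is a hypothetical stand-in for the vocabulary lookup that Workbench and TERMite perform, the function names are invented for this sketch, and the patient records are placeholders rather than rows from the Kaggle dataset.

```python
# Hypothetical stand-in for a vocabulary lookup (the role TERMite plays in the article).
SYNONYM_TO_ID = {
    "backache": "SNOMED161891005",   # identifier quoted in the article
    "back ache": "SNOMED161891005",
}


def normalize(text):
    """Map a free-text term to a concept ID, or None if the term is unknown."""
    return SYNONYM_TO_ID.get(text.strip().lower())


def ingest(rows):
    """Ingestion step: store a concept ID next to each raw diagnosis."""
    return [{**row, "diagnosis_id": normalize(row["diagnosis"])} for row in rows]


def search(index, query):
    """Query step: normalize the user's input, then compare IDs, not strings."""
    query_id = normalize(query)
    if query_id is None:
        return []
    return [row["patient"] for row in index if row["diagnosis_id"] == query_id]


# Placeholder records, not taken from the Kaggle dataset.
index = ingest([
    {"patient": "P1", "diagnosis": "backache"},
    {"patient": "P2", "diagnosis": "Back ache"},
    {"patient": "P3", "diagnosis": "migraine"},
])
matches = search(index, "BACKACHE")
print(matches)
```

Because both the stored data and the query pass through the same `normalize` function, differences in spelling or capitalization on either side no longer affect the match.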

Now that we have described both solutions, let’s analyze the results.

Result analysis

To compare the validity of these two approaches, we will use 3 simple example queries. For each example, we will visualize the results using a bar chart that shows how many matching patients each solution found.

Example 1

In the first example query, we will only look for patients that have a “backache”:

[Blog] Unlocking Important Real World Evidence From Patient Data Pt3 Screen2

The results are shown in the bar graph below:

[Blog] Unlocking RWE From Patient Data Pt3 Chart1 Backache

As you can see from the bar chart, without SciBite, we could only return 3 patients, while with SciBite technology, we could return all 8 patients. Why did this happen?

Let’s take a look at the diagnosis data for all 8 patients with some sort of backache:

[Blog] Unlocking Important RWE From Patient Data Pt3 Table 1

Without SciBite, the term “backache” is searched for literally – and there are only 3 patients (Adam Johnson, Ethan Johnson and Michael Smith) with that literal text as their diagnosis. Therefore, we cannot find the remaining patients.
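The literal search can be illustrated with a short Python sketch. The three named patients come from the text above; the other five labels are placeholders, and the exact diagnosis strings are an assumption inferred from the counts reported in this article (3 literal matches, 6 after resolving "back ache", 8 in total).

```python
# Diagnosis text per patient. Only the first three names appear in the article;
# "Patient 4" to "Patient 8" are placeholder labels for this sketch.
records = {
    "Adam Johnson": "backache",
    "Ethan Johnson": "backache",
    "Michael Smith": "backache",
    "Patient 4": "back ache",
    "Patient 5": "back ache",
    "Patient 6": "back ache",
    "Patient 7": "Lower back pain",
    "Patient 8": "Lower back pain",
}


def literal_search(records, query):
    """Return patients whose diagnosis text is exactly the query string."""
    return [name for name, diagnosis in records.items() if diagnosis == query]


found = literal_search(records, "backache")
missed = [name for name in records if name not in found]
print(len(found), "found;", len(missed), "missed")
```

Under these assumptions, exact string comparison returns only the three patients whose diagnosis is spelled exactly like the query and misses the other five.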

In contrast, with SciBite tools, the following occurs:

TERMite immediately normalizes the term “backache” to its public identifier that comes from SNOMED (using SciBite’s SNOMED VOCab): “SNOMED161891005”.
Since the terms “backache” and “back ache” both resolve to this public identifier (during the ingestion step highlighted in the architecture diagram), we can get to 6 patients easily.
To correctly identify all 8 patients, TERMite then recognizes that the term “Lower back pain” is a term that lives under “backache” in the hierarchy of the SNOMED standard.
After all, “Lower back pain” is simply a more specific type of “backache” which is captured appropriately in the standard. SciBite tools understand the hierarchy within the standard as well.
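The steps above (synonym resolution followed by a hierarchy check) can be sketched as follows. This is an assumption-laden toy, not TERMite's implementation: `SYNONYMS` and `PARENT` are hand-written dictionaries, the identifier for "Lower back pain" is a made-up placeholder rather than a real SNOMED code, and the list of diagnosis strings is inferred from the counts reported in this article.

```python
BACKACHE = "SNOMED161891005"                 # identifier quoted in the article
LOWER_BACK_PAIN = "EXAMPLE:lower-back-pain"  # placeholder, not a real SNOMED code

# Toy synonym table and a one-edge hierarchy (child -> parent).
SYNONYMS = {
    "backache": BACKACHE,
    "back ache": BACKACHE,
    "lower back pain": LOWER_BACK_PAIN,
}
PARENT = {LOWER_BACK_PAIN: BACKACHE}


def normalize(term):
    """Resolve free text to a concept ID, or None if it is not in the table."""
    return SYNONYMS.get(term.strip().lower())


def is_a(concept, ancestor):
    """True if concept equals ancestor or sits below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def semantic_search(diagnoses, query):
    """Return every diagnosis whose concept is the query concept or a descendant."""
    target = normalize(query)
    if target is None:
        return []
    return [d for d in diagnoses if is_a(normalize(d), target)]


# Assumed distribution of diagnosis strings across the 8 patients.
diagnoses = ["backache"] * 3 + ["back ache"] * 3 + ["Lower back pain"] * 2
print(len(semantic_search(diagnoses, "backache")))
```

Note that the check runs in one direction only: a query for "lower back pain" in this sketch returns the two more specific records and not the six general "backache" ones.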

Example 2

In this example, we will look for patients with the medication order “prozac”. The results are shown below:

[Blog] Unlocking RWE From Patient Data Pt3 Chart2 Prozac

What happened here? Let’s look at our data for each patient that has taken “prozac”:

[Blog] Unlocking Important RWE From Patient Data Pt3 Table 2_v2

Without SciBite, the literal text “prozac” causes a whole host of problems. Because the query string “prozac” was not capitalized and did not include the dose at the end, the literal search could not find any of the relevant patients.

With SciBite the following occurs:

TERMite automatically normalizes “prozac” to its public identifier “CHEMBL41” that comes from SciBite’s DRUG VOCab.
During ingestion as specified in the architecture diagram, both “Prozac 20mg” and “Fluoxetine 20mg” were normalized to “CHEMBL41”, because Fluoxetine is a curated synonym within the DRUG VOCab.

[Blog] Unlocking Important RWE From Patient Data Pt3 Table 3_v2

Therefore, all 8 patients were correctly identified in Solution 2 – With SciBite.
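A minimal sketch of this behaviour, assuming a hand-written synonym table in place of the DRUG VOCab and a simple regular expression to drop the dose, might look like this. The names `DRUG_SYNONYMS`, `DOSE`, and `normalize_drug` are hypothetical and exist only for this illustration; the two order strings and the identifier "CHEMBL41" are taken from the text above.

```python
import re

# Toy synonym table standing in for a curated drug vocabulary.
DRUG_SYNONYMS = {"prozac": "CHEMBL41", "fluoxetine": "CHEMBL41"}

# Matches a trailing dose such as " 20mg" or " 2.5 ml" (simplified on purpose).
DOSE = re.compile(r"\s*\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml)\b", re.IGNORECASE)


def normalize_drug(text):
    """Strip the dose, ignore case, and look the name up in the synonym table."""
    name = DOSE.sub("", text).strip().lower()
    return DRUG_SYNONYMS.get(name)


orders = ["Prozac 20mg", "Fluoxetine 20mg"]
query = "prozac"

literal_hits = [o for o in orders if o == query]
semantic_hits = [o for o in orders if normalize_drug(o) == normalize_drug(query)]
print(literal_hits, semantic_hits)
```

In this sketch the literal comparison matches neither order, while both orders and the query resolve to the same identifier once they are normalized.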

Example 3

In the final example, we’ll make the query more complicated. Let’s look for patients that have the diagnosis “backache” and a lab order of “MRI”. The results are below:

[Blog] Unlocking RWE From Patient Data Pt3 Chart3 Backache+MRI

Now, without SciBite tools, the search could not even find the three patients it identified in Example 1. Let’s look at the data to find out why.

Without SciBite, the way the user typed in “MRI” caused problems. Since every patient with the literal diagnosis “backache” has a lab order with a spelled-out version of MRI (“Magnetic resonance imaging scan”), the literal search couldn’t find any patients.

Conversely, with SciBite tools, the system was able to respect the hierarchy discussed in example 1 while also recognizing the different synonyms of “MRI”.
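A combined query of this kind can be sketched by requiring every criterion to match after normalization. As before, this is an illustrative toy rather than SciBite's implementation: the per-field vocabularies, the `EXAMPLE:` identifiers, and "Patient A" and "Patient B" are assumptions made for the sketch. Only the pairing of the literal diagnosis "backache" with the lab order "Magnetic resonance imaging scan" comes from the text above.

```python
# One toy vocabulary per data domain; "EXAMPLE:" identifiers are placeholders.
VOCAB = {
    "diagnosis": {
        "backache": "SNOMED161891005",
        "back ache": "SNOMED161891005",
        "lower back pain": "EXAMPLE:lower-back-pain",
    },
    "lab_order": {
        "mri": "EXAMPLE:mri",
        "magnetic resonance imaging scan": "EXAMPLE:mri",
    },
}
PARENT = {"EXAMPLE:lower-back-pain": "SNOMED161891005"}  # child -> parent


def normalize(field, text):
    """Resolve free text to a concept ID using the vocabulary for that field."""
    return VOCAB[field].get(text.strip().lower())


def is_a(concept, ancestor):
    """True if concept equals ancestor or sits below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def matches(record, query):
    """A record matches only if every queried field matches semantically."""
    return all(
        is_a(normalize(field, record[field]), normalize(field, term))
        for field, term in query.items()
    )


patients = [
    {"name": "Adam Johnson", "diagnosis": "backache",
     "lab_order": "Magnetic resonance imaging scan"},
    {"name": "Patient A", "diagnosis": "Lower back pain", "lab_order": "MRI"},
    {"name": "Patient B", "diagnosis": "back ache", "lab_order": "X-ray"},
]
query = {"diagnosis": "backache", "lab_order": "MRI"}

literal = [p["name"] for p in patients
           if all(p[field] == term for field, term in query.items())]
semantic = [p["name"] for p in patients if matches(p, query)]
print(literal, semantic)
```

With these placeholder records, the literal search finds nobody because no record matches both strings exactly, while the normalized search returns the two patients whose diagnosis and lab order both resolve to the queried concepts.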

Conclusion

Ultimately, SciBite’s approach, with the help of extensive VOCabularies and an emphasis on ontologies, greatly increases the recall and retrieval of patients. This approach creates a semantic layer that sits on top of the data, ensuring that all patients can be found appropriately.

It is important to note that it may be possible to replicate the functionality offered within SciBite’s tools to improve the outcomes found without SciBite. However, to do so, significant temporal and financial investment would be required to develop the logic used to accurately find all patients that match a user’s query. Let’s go example by example to explain why.

To solve the issues identified in Example 1, a data scientist would first have to be allocated. Following that investment, the data scientist would then need to maintain a list of all the synonyms of backache and make use of the hierarchy of the SNOMED standard. They would then have to implement logic to make use of these lists and hierarchy while a user makes an input and during the ingestion of this data.

To solve the issues identified in Example 2, a data scientist would first have to solve the capitalization issue. While this issue isn’t particularly difficult to solve, they would then have to strip the dosage information from all the medication orders. After solving these two issues, the data scientist is back to the same problem presented in Example 1: they would need the list of synonyms that map to the medication “Prozac”. Similar efforts would be required to solve the issues discussed in the last example.

The fundamental purpose of this effort is to better patient outcomes while taking advantage of the financial incentives of unlocking this RWD. It is imperative that all healthcare systems work both quickly and effectively in pursuit of this ultimate goal. As evidenced by our work with City of Hope, SciBite can help unlock the full potential of your data by providing tools that generate the foundation layer of that data.

About Arvind Swaminathan

Technical Consultant, SciBite

Arvind Swaminathan is a Technical Consultant at SciBite. He is passionate about helping organizations overcome their digital transformation challenges to enable data discovery and research. Throughout his professional career, which began at Epic Systems, Arvind has worked in the healthcare space to help clean and aggregate data for research and commercial use. He has been with SciBite since 2022.
