Unlocking important RWE from patient data [Part 3]: Can we find all the relevant patients?

In Part 1 of our three-part series, we covered why healthcare organizations must find a way to appropriately provide Real World Evidence (RWE) to better all patient outcomes and improve their financial standing. In Part 2, we took a deep dive into how to normalize various data domains using SciBite, with concrete patient examples.

In this final installment of the series, we demonstrate how to extract a relevant subset of patients from the simulated data using two approaches: one using SciBite tools and one without. This data, like real patient data, is messy. Specifically, it includes many equivalent data points that are described differently. Without a semantic layer of information on top of this data, it is next to impossible to quickly find the correct patients for a query.

You can access this dataset on Kaggle by following the instructions in the reference [1].

This data has intentional duplicates (specifically, every group of 5 patients is semantically equivalent) to make it simple to see which patients should be returned for which queries.

Semantic understanding of the underlying data

Now that we’ve briefly analyzed the dataset, let’s define the problem we are trying to solve. In this situation, an end user is looking for relevant patients for their use (e.g., a study). They need to find patients that match specific criteria (certain diagnoses, lab orders, etc.). Without a semantic understanding of the underlying data, it will be difficult to match those patients to the end user’s query. To find these patients, the end user types, as free text, the diagnoses, lab orders, and medication orders that they would like each patient in the resulting subset to have.
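One way to picture such a query is as three lists of free-text terms, one per data domain. The sketch below is only an illustration: the class name `PatientQuery` and its field names are assumptions, not part of the dataset or of any SciBite tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientQuery:
    """Free-text criteria that every returned patient must satisfy.

    Hypothetical structure for illustration; the field names are assumptions.
    """
    diagnoses: List[str] = field(default_factory=list)
    lab_orders: List[str] = field(default_factory=list)
    medication_orders: List[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True when the user has not typed any criteria at all."""
        return not (self.diagnoses or self.lab_orders or self.medication_orders)


# A query combining a diagnosis and a lab order, as in Example 3 of this article.
query = PatientQuery(diagnoses=["backache"], lab_orders=["MRI"])
```

Leaving a list empty simply means that domain places no constraint on the result.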

In the rest of this article, we will analyze two solutions to solve this end user’s problem to provide them with all the patients that match their query.

The solutions – High-level architecture

The high-level architecture is visualized in the graphic below.

Figure 1. High-level architecture of both approaches. SciBite tools are highlighted in purple.

For both approaches, it is important to note that no pre-processing of the data is allowed. The steps that involve SciBite tools, and are therefore used only in the second approach, are highlighted in purple. As the graphic shows, there are two key differences between the approaches, described below:

In approach 2 – With SciBite, the data is first normalized via Workbench.

  1. Workbench is SciBite’s tabular extraction tool, which takes the effort out of tabular data curation.
  2. Workbench is powered by TERMite to enable end users to normalize their tabular data (like this simulated patient CSV data file).

In approach 2 – With SciBite, we also normalize the end user’s input with TERMite.

  1. This ensures that the user’s query also aligns to the same data standards we used to normalize the data.
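The difference between the two approaches can be sketched in a few lines. This is a toy illustration, not SciBite code: the lookup table stands in for the normalization that Workbench and TERMite perform, the function names are hypothetical, and we assume the approach without SciBite compares strings exactly as typed.

```python
from typing import Optional

# Toy stand-in for vocabulary-based normalization. In the article this role is
# played by Workbench (for the data) and TERMite (for the query); the lookup
# table below is purely illustrative.
TOY_VOCAB = {
    "backache": "SNOMED161891005",
    "back ache": "SNOMED161891005",
}


def normalize(text: str) -> Optional[str]:
    """Map free text to an identifier, or None if the term is unknown."""
    return TOY_VOCAB.get(text.strip().lower())


def match_without_normalization(record_value: str, query_term: str) -> bool:
    """Approach 1: compare the raw strings exactly as typed (an assumption)."""
    return record_value == query_term


def match_with_normalization(record_value: str, query_term: str) -> bool:
    """Approach 2: normalize both sides, then compare identifiers."""
    record_id = normalize(record_value)
    query_id = normalize(query_term)
    return record_id is not None and record_id == query_id
```

Because the record and the query pass through the same `normalize` step, two different spellings of one concept end up comparing equal.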

Now that we have described both solutions, let’s analyze the results.

Result analysis

To analyze the validity of these two approaches side by side, we will use 3 simple example queries. For each example, we will visualize the results using a bar chart to show how many matching patients each solution found.


Example 1

In the first example query, we will only look for patients that have a “backache”:

The results are shown in the bar graph below:

As you can see from the bar chart, without SciBite we could only return 3 patients, while with SciBite technology we could return all 8 patients. Why did this happen?

Let’s take a look at the diagnosis data for all 8 patients with some sort of backache:

Without SciBite, the term “backache” is searched for literally – and there are only 3 patients (Adam Johnson, Ethan Johnson and Michael Smith) with that literal text as their diagnosis. Therefore, we cannot find the remaining patients.

In contrast, with SciBite tools, the following occurs:

  1. TERMite immediately normalizes the term “backache” to its public identifier from SNOMED (using SciBite’s SNOMED VOCab): “SNOMED161891005”.
    Since the terms “backache” and “back ache” both resolve to this public identifier (during the ingestion step highlighted in the architecture diagram), we can get to 6 patients easily.
  2. To correctly identify all 8 patients, TERMite then recognizes that the term “Lower back pain” is a term that lives under “backache” in the hierarchy of the SNOMED standard.
    After all, “Lower back pain” is simply a more specific type of “backache” which is captured appropriately in the standard. SciBite tools understand the hierarchy within the standard as well.
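The two steps above, synonym resolution followed by a hierarchy check, can be sketched as follows. This is a minimal illustration under stated assumptions: “SNOMED161891005” is the identifier quoted in this article, while the identifier for lower back pain is a made-up placeholder, and the tables and function names are hypothetical rather than TERMite's actual implementation.

```python
# Toy vocabulary: surface forms -> concept identifier.
# "SNOMED161891005" is the identifier quoted in the article; the identifier for
# lower back pain is a made-up placeholder.
SYNONYMS = {
    "backache": "SNOMED161891005",
    "back ache": "SNOMED161891005",
    "lower back pain": "PLACEHOLDER_LOWER_BACK_PAIN",
}

# Child concept -> parent concept (a tiny slice of an is-a hierarchy).
PARENT = {
    "PLACEHOLDER_LOWER_BACK_PAIN": "SNOMED161891005",
}


def concept_of(text):
    """Resolve free text to a concept identifier, or None if unknown."""
    return SYNONYMS.get(text.strip().lower())


def is_same_or_descendant(concept, ancestor):
    """Walk up the parent links until the ancestor is found or the top is passed."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def matches(diagnosis_text, query_text):
    """A diagnosis matches if it is the queried concept or a more specific one."""
    query_concept = concept_of(query_text)
    if query_concept is None:
        return False
    return is_same_or_descendant(concept_of(diagnosis_text), query_concept)


diagnoses = ["backache", "back ache", "Lower back pain", "Headache"]
hits = [d for d in diagnoses if matches(d, "backache")]
```

Note that the check runs in one direction only: a query for “Lower back pain” would not return patients recorded with the broader term “backache”.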

Example 2

In this example, we will look for patients with the medication order “prozac”. The results are shown below:

What happened here? Let’s look at our data for each patient that has taken “prozac”:

Without SciBite, searching for the literal text “prozac” causes a whole host of problems. Because the query string “prozac” wasn’t capitalized and didn’t include the dose at the end, the literal search wasn’t able to find any of the relevant patients.

With SciBite the following occurs:

  1. TERMite automatically normalizes “prozac” to its public identifier “CHEMBL41” that comes from SciBite’s DRUG VOCab.
  2. During ingestion as specified in the architecture diagram, both “Prozac 20mg” and “Fluoxetine 20mg” were normalized to “CHEMBL41”, because Fluoxetine is a curated synonym within the DRUG VOCab.

Therefore, all 8 patients were correctly identified in Solution 2 – With SciBite.
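To make the ingestion step concrete, here is a small sketch of indexing patients by normalized identifier and then looking up a normalized query. It rests on several assumptions: the rows and patient ids are invented, the lookup table is a toy stand-in for the DRUG VOCab, and only “CHEMBL41” comes from this article.

```python
from collections import defaultdict

# Toy stand-in for the DRUG vocabulary. "CHEMBL41" is the identifier quoted in
# the article; the table itself is illustrative.
DRUG_IDS = {"prozac": "CHEMBL41", "fluoxetine": "CHEMBL41"}


def drug_id(text):
    """Return the identifier of the first known drug name found in the text."""
    for token in text.lower().split():
        if token in DRUG_IDS:
            return DRUG_IDS[token]
    return None


# Invented rows: (patient id, medication order as written in the record).
rows = [("P1", "Prozac 20mg"), ("P2", "Fluoxetine 20mg"), ("P3", "Ibuprofen 200mg")]

# Ingestion: index patients by normalized identifier.
index = defaultdict(set)
for patient, order in rows:
    identifier = drug_id(order)
    if identifier is not None:
        index[identifier].add(patient)

# Query time: normalize the user's text the same way, then look it up.
found = index.get(drug_id("prozac"), set())

# For comparison, an exact string match on the raw text finds nobody.
literal = {patient for patient, order in rows if order == "prozac"}
```

Here `found` contains both the Prozac and the Fluoxetine patient, while `literal` stays empty because no record is written exactly as “prozac”.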


Example 3

In the final example, we’ll make the query more complicated. Let’s look for patients that have the diagnosis “backache” and a lab order of “MRI”. The results are below:

Now, without SciBite tools, we couldn’t even identify the three patients that were found in Example 1. Let’s look at the data to find out why.

Without SciBite, the way the user typed in “MRI” caused problems: since every patient with the literal diagnosis “backache” has a lab order with a spelled-out version of MRI (“Magnetic resonance imaging scan”), the literal search couldn’t find any patients.

Conversely, with SciBite tools, the system was able to respect the hierarchy discussed in Example 1 while also recognizing the different synonyms of “MRI”.
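A combined query only succeeds if every field matches, which is why one literal mismatch is enough to lose a patient. The sketch below illustrates this with an invented patient row built from the strings discussed above. The MRI identifier is a placeholder, the function names are hypothetical, and the hierarchy step from Example 1 is left out to keep the example short.

```python
# Toy vocabulary per field. Only "SNOMED161891005" comes from the article; the
# MRI identifier is a placeholder.
VOCAB = {
    "diagnosis": {"backache": "SNOMED161891005", "back ache": "SNOMED161891005"},
    "lab_order": {
        "mri": "PLACEHOLDER_MRI",
        "magnetic resonance imaging scan": "PLACEHOLDER_MRI",
    },
}


def normalize(field_name, text):
    """Resolve free text in a given field to an identifier, or None if unknown."""
    return VOCAB[field_name].get(text.strip().lower())


def literal_match(patient, query):
    """Every queried field must equal the typed text exactly."""
    return all(patient[name] == text for name, text in query.items())


def semantic_match(patient, query):
    """Every queried field must resolve to the same identifier as the query."""
    for name, text in query.items():
        wanted = normalize(name, text)
        if wanted is None or normalize(name, patient[name]) != wanted:
            return False
    return True


# Invented patient row using the strings discussed above.
patient = {"diagnosis": "backache", "lab_order": "Magnetic resonance imaging scan"}
query = {"diagnosis": "backache", "lab_order": "MRI"}
```

With these inputs `literal_match(patient, query)` is false, because the lab order strings differ, while `semantic_match(patient, query)` is true.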

Conclusion

Ultimately, SciBite’s approach, with the help of extensive VOCabularies and an emphasis on ontologies, greatly increases the recall and retrieval of patients. This approach creates a semantic layer that sits on top of the data, ensuring that all patients can be found appropriately.

It is important to note that it may be possible to replicate the functionality of SciBite’s tools to improve the results obtained without SciBite. However, doing so would require a significant investment of time and money to develop the logic needed to accurately find all patients that match a user’s query. Let’s go example by example to explain why.

To solve the issues identified in Example 1, a data scientist would first have to be allocated. That data scientist would then need to maintain a list of all the synonyms of backache and make use of the hierarchy of the SNOMED standard. They would also have to implement logic that applies these lists and the hierarchy both when a user enters a query and during ingestion of the data.

To solve the issues identified in Example 2, a data scientist would first have to solve the capitalization issue. While this issue isn’t especially difficult to solve, they would then have to strip the dosage information from all the medication orders. After solving these two issues, the data scientist is back to the same problem presented in Example 1: they would need the list of synonyms that match the medication “Prozac”. Similar efforts would be required to solve the issues discussed in the last example.
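A hand-rolled version of those three steps might look like the sketch below. It is deliberately minimal and rests on assumptions: the dose pattern covers only a few trailing units, the synonym table has to be built and maintained by hand, and the function name `clean_medication` is hypothetical.

```python
import re

# Matches a trailing dose such as "20mg" or "12.5 mg". The unit list is an
# assumption and far from exhaustive.
DOSE_PATTERN = re.compile(r"\s*\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml)\s*$", re.IGNORECASE)

# Hand-maintained synonym list: the part that is costly to build and keep current.
CANONICAL_NAME = {"prozac": "fluoxetine", "fluoxetine": "fluoxetine"}


def clean_medication(order):
    """Drop a trailing dose, lowercase the name, and map it to a canonical name."""
    name = DOSE_PATTERN.sub("", order.strip()).lower()
    return CANONICAL_NAME.get(name, name)
```

The first two steps are short, but the synonym table is where the effort goes: every brand name, generic name, and spelling variant has to be added and kept up to date by someone.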

The fundamental purpose of this effort is to better patient outcomes while taking advantage of the financial incentives of unlocking this real-world data (RWD). It is imperative that all healthcare systems work both quickly and effectively in pursuit of this ultimate goal. As evidenced by our work with City of Hope, SciBite can help unlock the full potential of your data by providing tools that generate the foundational layer of that data.

Richard Harrison
Senior Manager, Portfolio Marketing, SciBite

Richard is a seasoned marketing professional with over two decades of experience in the information services and life sciences sectors. Currently, he is the Senior Manager, Portfolio Marketing at Elsevier’s SciBite, where he drives strategic campaigns and harnesses data-driven strategies to amplify the platform’s online visibility and impact.
