Rare disease new target identification using Elsevier full text data

Introduction

Rare diseases, often referred to as orphan diseases, affect over 300 million people worldwide [1]. Conditions such as Erdheim-Chester disease, Stiff Person Syndrome, and Fibrodysplasia Ossificans Progressiva are examples of these rare diseases. Despite their collective impact, each individual rare disease affects only a tiny fraction of people, making the identification of new therapeutic targets a significant challenge. The rarity of these diseases often results in substantial delays in diagnosis and a lack of effective treatments.

Challenges in rare disease research

Researching rare diseases presents unique challenges. One of the primary obstacles is the limited availability of data, as the small patient populations make it difficult to gather large datasets. Additionally, funding for rare disease research is often limited, as pharmaceutical companies may be less inclined to invest in treatments for conditions that affect a relative minority of people.

The financial burden of treatment development

Developing treatments for rare diseases is significantly more expensive compared to other diseases. A recent study highlighted this disparity, revealing that the median cost for orphan drugs was a staggering USD 218,872, which translates to higher treatment costs for patients. In contrast, the median cost for non-orphan drugs was much lower at USD 12,798 [2]. This substantial difference underscores the financial challenges associated with bringing treatments for rare diseases to market.

Several factors contribute to these high costs. First, the complexity of rare diseases requires extensive research and specialized expertise, lengthening development timelines and increasing expenses. Additionally, rare diseases often affect small patient populations, making it challenging for pharmaceutical companies to recoup R&D costs through sales. Lastly, regulatory challenges further complicate the landscape, as obtaining approval can be difficult due to limited data and smaller clinical trial populations.

This scarcity of data and resources calls for novel methods to identify therapeutic targets. In this context, as we’ll see shortly, the benefit of using full-text resources over open-access datasets is particularly relevant.

How we approached it

Operating on FAIR data principles, we take a comprehensive and innovative approach to identifying new therapeutic targets for rare diseases. Our method leverages our proprietary annotation software, TERMite, and our search tool, SciBite Search.

These tools allow us to meticulously annotate textual data using tailored vocabularies created by our expert curation team, enabling the conversion of strings (such as idiopathic pulmonary fibrosis) to things with IDs (such as Mesh:D054990) (Figure 1).

This allows us to identify various scientific entities, such as genes, single nucleotide polymorphisms (SNPs), diseases, pathways, drugs, metabolites, etc. Through identifying entities, we can also understand the relationships that are described between them.
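The "strings to things" idea can be illustrated with a minimal dictionary lookup. This is only a sketch under stated assumptions: it is not TERMite's API, the `VOCAB` dictionary and the `annotate` function are hypothetical names, and the "IPF" synonym is assumed for illustration. Only the MeSH identifier for idiopathic pulmonary fibrosis comes from the text above.

```python
import re

# Toy vocabulary mapping surface strings to entity IDs.
# The "ipf" synonym is an assumption made for this example.
VOCAB = {
    "idiopathic pulmonary fibrosis": "Mesh:D054990",
    "ipf": "Mesh:D054990",
}


def annotate(text, vocab=VOCAB):
    """Return (start, end, matched_text, entity_id) for each vocabulary hit."""
    hits = []
    for synonym, entity_id in vocab.items():
        # Whole-word, case-insensitive match of the synonym.
        pattern = r"\b" + re.escape(synonym) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), entity_id))
    return sorted(hits)


sentence = "Patients with idiopathic pulmonary fibrosis (IPF) were enrolled."
annotations = annotate(sentence)
for start, end, surface, entity_id in annotations:
    print(f"{surface!r} at {start}-{end} -> {entity_id}")
```

Both the full name and the abbreviation resolve to the same identifier, which is what lets downstream steps count and link mentions of one entity rather than many different strings.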

Figure 1: SciBite Search results for recent studies on Huntington’s disease.

Integrating knowledge graphs for deeper insights

To help support the identification of novel targets, we integrate two types of relationships into a knowledge graph. These include relationships provided by structured sources such as SNP databases and ChEMBL, alongside relationships mined from annotated and enriched literature via SciBite Search.
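As a rough illustration of combining the two kinds of relationships, the sketch below merges edges from a structured source and from literature mining into one graph while recording where each edge came from. The edge lists, relation labels, and the `build_graph` helper are hypothetical; the two example facts (SLC6A19 and Hartnup disease, ACE2 as an interactor of SLC6A19) are taken from later in this article.

```python
from collections import defaultdict

# Hypothetical edges in the form (source_node, relation, target_node).
structured_edges = [
    ("SLC6A19", "associated_with", "Hartnup disease"),
]
literature_edges = [
    ("SLC6A19", "associated_with", "Hartnup disease"),
    ("ACE2", "interacts_with", "SLC6A19"),
]


def build_graph(**edge_sets):
    """Merge named edge sets into {edge: set of provenance labels}."""
    graph = defaultdict(set)
    for provenance, edges in edge_sets.items():
        for edge in edges:
            graph[edge].add(provenance)
    return dict(graph)


kg = build_graph(structured=structured_edges, literature=literature_edges)
for edge, sources in kg.items():
    print(edge, sorted(sources))
```

Keeping provenance on every edge makes it possible to ask later which relationships are supported only by literature, only by databases, or by both.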

SciBite Search uses our curated vocabularies to annotate text, enabling us to identify and select sentences containing specific entities for further analysis. For example, one can identify sentences that contain a gene and disease, and then analyze these sentences to understand the relationship that exists between the mentioned entities.
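The sentence selection step can be pictured as a simple co-occurrence filter over annotated sentences. The following is a minimal sketch, not the SciBite Search interface: the `Annotation` and `Sentence` classes, the entity type labels, and the `cooccurring` function are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    text: str
    entity_type: str  # hypothetical labels such as "GENE" or "DISEASE"


@dataclass
class Sentence:
    text: str
    annotations: list = field(default_factory=list)


def cooccurring(sentences, type_a, type_b):
    """Keep only sentences that mention at least one entity of each type."""
    selected = []
    for sentence in sentences:
        types = {a.entity_type for a in sentence.annotations}
        if type_a in types and type_b in types:
            selected.append(sentence)
    return selected


corpus = [
    Sentence(
        "Mutations in SLC6A19 cause Hartnup disease.",
        [Annotation("SLC6A19", "GENE"), Annotation("Hartnup disease", "DISEASE")],
    ),
    Sentence(
        "Hartnup disease is a rare disorder.",
        [Annotation("Hartnup disease", "DISEASE")],
    ),
]

candidates = cooccurring(corpus, "GENE", "DISEASE")
for sentence in candidates:
    print(sentence.text)
```

Only the first sentence passes the filter, so it alone would be handed on for relationship analysis.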

This relationship extraction step allows us to determine patterns such as upregulation, downregulation, gain-of-function, and loss-of-function mutations. By focusing on specific diseases or therapeutic areas, and applying graph analysis approaches, we can uncover critical insights to drive the discovery of new therapeutic targets. Figure 2 shows an overview of the end-to-end process, from raw data to insights using SciBite technology and expertise.

This targeted and precise method not only enhances the accuracy of our research, but also accelerates the identification process, potentially contributing to the development of effective treatments for rare diseases. These approaches can also be applied to a variety of data sources, such as Elsevier Datasets, public open access data, or even internal data.

Figure 2: End-to-end process: raw data enrichment via SciBite Search, relationship extraction (RE) using cutting-edge technologies, knowledge graph (KG) design and analysis.

Use-case: Hartnup disease

Let’s take Hartnup disease as an example and see what we can find in research texts about it. Hartnup disease is an autosomal recessive disorder characterized by defective transport of neutral amino acids in the small intestine and kidneys. It has variable clinical presentations, ranging from asymptomatic cases to severe symptoms like photosensitive rashes and neurological issues. It is a rare disease, affecting about 1 in 24,000 people. The condition is primarily caused by mutations in the SLC6A19 gene [3]. Later, we will examine the sort of information that is available in various sources.

The power of full-text resources

Technology is only as good as the content to which it is applied. Multiple sources of high-quality scientific literature exist, including open access PubMed abstracts. The PubMed dataset includes both MEDLINE data, i.e. the subset of articles that have been indexed with MeSH terms, and in-process citations that have not yet been approved by the NLM or indexed with MeSH terms.

PubMed also provides a search interface for exploring the dataset. While PubMed is a valuable resource for scientists to access publicly available scientific abstracts, there are times when a deeper exploration of full-text articles is necessary to enhance scientific understanding and uncover more nuanced insights.

SciBite, from Elsevier, has access to the extensive full-text resources available through ScienceDirect. Unlike data sources such as PubMed, which primarily offer article abstracts, ScienceDirect provides experimental data, methodologies, and recent findings often hidden in the full text of research articles (Figure 3).

This vast repository of information allows us to uncover new insights and connections that might be missed when relying solely on abstracts.

Figure 3: ScienceDirect full text uncovers detailed and quality research insights.

By utilizing full-text mining of ScienceDirect data in the knowledge graph, we can extract detailed information on various entity relationships, including genes, SNPs, diseases, pathways, drugs, and metabolites. This comprehensive approach enables us to better understand the underlying mechanisms of rare diseases and identify novel potential therapeutic targets. The breadth and richness of full-text resources significantly enhance our ability to perform in-depth analyses, potentially leading to more accurate and actionable findings.

In comparison to other resources which primarily offer abstracts, ScienceDirect’s full-text articles provide a more holistic view of scientific research. This allows us to delve deeper into the data and uncover hidden relationships that are crucial for advancing rare disease research and developing effective treatments.

Comparative analysis

To illustrate the value of utilizing full-text resources like ScienceDirect, we present a comparative overview with other scientific literature sources, notably PubMed. This overview highlights the richness and depth of information accessible through full-text resources in generating knowledge graphs, which can significantly enhance rare disease research.

When it comes to the extraction of specific entity relationships via SciBite Search, ScienceDirect provides significantly more scientific context. Table 1 shows the number of sentences containing some specific rare diseases and genes that have a regulatory association with them.

Table 2 shows the same numbers for disease-pathway associations.

Table 1: Number of disease-gene association-related sentences found in the ScienceDirect dataset and PubMed.

Table 2: Number of disease-pathway association-related sentences found in ScienceDirect and PubMed.

Extracting relevant relationships using SciBite Search

For each set of entities, we extracted the most relevant sentences and documents using SciBite Search and analyzed them. For instance, we captured texts that mention a disease and a gene (or other relevant entities), and then applied machine learning-based methods to extract valid relationships between diseases, genes, and regulatory verbs (REGVERB).
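To make the shape of the extracted relationships concrete, here is a deliberately simple rule-based stand-in. It is not the machine learning approach described above, and it does not reproduce the actual REGVERB vocabulary: the `REGVERB_TOY` term list, the direction labels (including treating "block" as "down"), the `extract_relation` function, and the placeholder entity names are all assumptions for illustration.

```python
import re

# Toy regulatory-term list with an assumed direction label per term.
REGVERB_TOY = {
    "upregulation": "up",
    "upregulates": "up",
    "increase": "up",
    "increases": "up",
    "downregulation": "down",
    "downregulates": "down",
    "block": "down",
    "blocks": "down",
}


def extract_relation(sentence, gene, disease):
    """Return (gene, direction, disease, trigger) or None if no rule fires."""
    lowered = sentence.lower()
    # Both entities must be mentioned in the sentence.
    if gene.lower() not in lowered or disease.lower() not in lowered:
        return None
    # The first regulatory term found decides the direction.
    for token in re.findall(r"[a-z]+", lowered):
        if token in REGVERB_TOY:
            return (gene, REGVERB_TOY[token], disease, token)
    return None


# Invented sentences with placeholder entity names, not real findings.
hit = extract_relation(
    "GENE_X upregulation was observed in DISEASE_Y samples.", "GENE_X", "DISEASE_Y"
)
miss = extract_relation(
    "GENE_X was measured in DISEASE_Y samples.", "GENE_X", "DISEASE_Y"
)
print(hit)
print(miss)
```

A rule like this ignores negation, word order, and which entity the verb applies to, which is one reason a learned model is preferable for deciding whether a relationship is valid.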

REGVERB is a vocabulary that has been developed by SciBite’s curation team to enhance the precision of our proposed methods. REGVERB includes verbs and terms that signify regulation, such as upregulation, downregulation, block, and increase. Integrating these data into our knowledge graph allows us to uncover new insights. In our study, we considered around 200 diseases in a given therapeutic area of interest. Here are the key findings in ScienceDirect data:

Figure 4: Pathway – gene associations found in full-text.

Total inferred relationships: We identified 60,591 inferred relationships from the annotated ScienceDirect data, providing a broad set of relationships to be included in our knowledge graph.
New relationships between nodes: Of these, 52,723 are relationships with regulatory associations (distinct regulation phrases) that were not seen in PubMed. These additional connections enhance the depth and breadth of our knowledge graph by linking entities regardless of whether they were previously connected.
Relationships between previously disconnected nodes: Specifically, 18,768 of the relationships from ScienceDirect are between nodes that had no prior connection in the populated knowledge graph, highlighting the power of full text in uncovering new relationships.
High-confidence relationships: When considering relationships with a frequency greater than 5, there are 2,032 high-confidence relationships among the 18,768 relationships between previously disconnected nodes. These high-confidence connections provide robust targets for further investigation.

Based on the customer’s observations and our team’s comparisons, the difference is not merely one of quantity: incorporating ScienceDirect full text gives us access to recent, high-quality research data (see Figures 3 and 4).

The depth of full-text resources, such as those available through ScienceDirect, lets us uncover a great many meaningful scientific relationships that may well be missed when using abstract data alone. Our approach of leveraging ScienceDirect data expands the network of inferred relationships, enhancing the overall utility of our knowledge graph in rare disease research.
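The way the counts above narrow down can be sketched with set operations and a frequency threshold. This is an assumed reading of the analysis, not the actual pipeline: the `summarise` function, the treatment of relationships as undirected pairs, the nesting of each category inside the previous one, and the placeholder node names are all hypothetical. Only the "frequency greater than 5" cut-off comes from the text.

```python
from collections import Counter


def norm(a, b):
    """Treat a relationship as an undirected pair of nodes."""
    return tuple(sorted((a, b)))


def summarise(fulltext_mentions, pubmed_pairs, graph_pairs, threshold=5):
    """Narrow full-text relationships down to high-confidence new links."""
    # One entry per extracted sentence, so counts give the frequency.
    counts = Counter(norm(a, b) for a, b in fulltext_mentions)
    pubmed = {norm(a, b) for a, b in pubmed_pairs}
    graph = {norm(a, b) for a, b in graph_pairs}

    not_in_pubmed = {p for p in counts if p not in pubmed}
    previously_disconnected = {p for p in not_in_pubmed if p not in graph}
    high_confidence = {p for p in previously_disconnected if counts[p] > threshold}
    return {
        "total": set(counts),
        "not_in_pubmed": not_in_pubmed,
        "previously_disconnected": previously_disconnected,
        "high_confidence": high_confidence,
    }


# Placeholder data: node names A, B, C stand in for real entities.
fulltext_mentions = [("A", "B")] * 6 + [("A", "C")] * 2 + [("B", "C")] * 3
summary = summarise(
    fulltext_mentions,
    pubmed_pairs=[("C", "B")],   # pair already seen in abstracts
    graph_pairs=[("A", "C")],    # pair already connected in the graph
)
for name, pairs in summary.items():
    print(name, sorted(pairs))
```

In the toy data, only the A-B pair is absent from the abstracts, unconnected in the existing graph, and mentioned more than five times, so it is the single high-confidence candidate.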

In-depth analysis of Hartnup Disease through ScienceDirect

Let’s get back to our initial example, Hartnup disease. When we look at the resulting knowledge graph in the context of Hartnup disease, it is clear that ScienceDirect offers more comprehensive information: 157 relationships to 42 entities, compared to 18 relationships to 12 entities extracted from PubMed. This extensive set of relationships from ScienceDirect includes a diverse array of pathways, genes/proteins, metabolites, and drugs, providing a richer context for understanding the biological mechanisms at play.

For instance, ScienceDirect highlights key genes such as ACE2 and CLTRN, which are crucial interactors of the amino acid transporter SLC6A19, implicated in Hartnup disease. Additionally, the presence of drugs like isonicotinamide and melanin in the results aligns with known biochemical pathways and clinical recommendations, underscoring the practical relevance of the data.

In contrast, PubMed data focuses on genes similar in function to SLC6A19, such as SLC6A15 and SLC1A5. This valuable, albeit narrower, scope may be useful for identifying related metabolic pathways but lacks the detailed insights provided by ScienceDirect. For example, while both sources identify tryptophan as a significant metabolite, ScienceDirect’s broader context, including the sodium atom’s role in amino acid transport, offers a fuller understanding of the metabolic processes involved. Thus, for researchers and clinicians seeking to develop targeted therapies or gain a deeper understanding of Hartnup disease mechanisms, ScienceDirect’s full-text articles provide a more detailed resource.

Future directions

Looking ahead, the integration of SciBite technology with Elsevier’s vast ScienceDirect data holds immense potential for advancing rare disease research. ScienceDirect provides unparalleled access to full-text resources, which, when combined with SciBite’s advanced annotation and search capabilities, empowered by innovative generative AI approaches, create a powerful synergy for uncovering new therapeutic targets.

References:

[1] Baynam G, et al. Global health for rare diseases through primary care. Lancet Glob Health. 2024;12(7):e1192-e1199. DOI: 10.1016/S2214-109X(24)00134-7. Accessed 18.01.2025.
[2] Alobaidis H, Seoane-Vazquez E, Brown LM, Fleming ML, Rodriguez-Monguio R. Disentangling the cost of orphan drugs marketed in the United States. Healthcare (Basel). 2023;11(4):558. DOI: 10.3390/healthcare11040558. Accessed 18.01.2025.
[3] Polavarapu A, Hasbani D. Neurological complications of nutritional disease. Semin Pediatr Neurol. 2017;24(1):70-80. DOI: 10.1016/j.spen.2016.12.002. Accessed 18.01.2025.

Zahra Hosseini

Senior Data Scientist, SciBite

Zahra Hosseini is a Senior Data Scientist at SciBite. She holds a Ph.D. in machine learning from the Science and Research University of Tehran, with a focus on Natural Language Processing (NLP) and knowledge discovery. She was an assistant professor at Azad University of Isfahan for 7 years before moving to industry, and has been part of the SciBite data science team since 2021.
