Rare diseases, often referred to as orphan diseases, affect over 300 million people worldwide [1]. Conditions such as Erdheim-Chester disease, Stiff Person Syndrome, and Fibrodysplasia Ossificans Progressiva are examples of these rare diseases. Despite their collective impact, each individual rare disease affects only a tiny fraction of people, making the identification of new therapeutic targets a significant challenge. The rarity of these diseases often results in substantial delays in diagnosis and a lack of effective treatments.
Researching rare diseases presents unique challenges. One of the primary obstacles is the limited availability of data, as small patient populations make it difficult to gather large datasets. Additionally, funding for rare disease research is often limited, as pharmaceutical companies may be less inclined to invest in treatments for conditions that affect a relatively small number of people.
Developing treatments for rare diseases is significantly more expensive than for other diseases. A recent study highlighted this disparity, reporting a median cost of USD 218,872 for orphan drugs compared with USD 12,798 for non-orphan drugs, a roughly 17-fold difference that translates to higher treatment costs for patients [2]. This substantial difference underscores the financial challenges associated with bringing treatments for rare diseases to market.
Several factors contribute to these high costs. First, the complexity of rare diseases requires extensive research and specialized expertise, lengthening development timelines and increasing expenses. Additionally, rare diseases often affect small patient populations, making it challenging for pharmaceutical companies to recoup R&D costs through sales. Lastly, regulatory challenges further complicate the landscape, as obtaining approval can be difficult due to limited data and smaller clinical trial populations.
This scarcity of data and resources calls for novel methods to identify therapeutic targets. In this context, as we’ll see shortly, the benefit of using full-text resources over open-source datasets is particularly relevant.
Operating on FAIR data principles, we take a comprehensive and innovative approach to identifying new therapeutic targets for rare diseases. Our method leverages our proprietary annotation software, TERMite, and our search tool, SciBite Search.
These tools allow us to meticulously annotate textual data using tailored vocabularies created by our expert curation team, enabling the conversion of strings (such as idiopathic pulmonary fibrosis) to things with IDs (such as Mesh:D054990) (Figure 1).
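The "strings to things" idea can be illustrated with a minimal dictionary-based sketch. This is not TERMite's implementation; the `VOCAB` mapping, the `annotate` function, and the inclusion of the abbreviation "IPF" as a synonym are all assumptions made for illustration.

```python
import re

# Hypothetical toy vocabulary: synonym -> identifier. A curated vocabulary
# would be far larger; the "ipf" synonym is an assumption for this example.
VOCAB = {
    "idiopathic pulmonary fibrosis": "Mesh:D054990",
    "ipf": "Mesh:D054990",
}


def annotate(text, vocab=VOCAB):
    """Return (start, end, matched_text, identifier) for each vocabulary hit."""
    hits = []
    # Try longer synonyms first so multi-word terms win over shorter ones.
    for synonym in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(synonym) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip matches that overlap a hit we already accepted.
            if any(m.start() < h[1] and h[0] < m.end() for h in hits):
                continue
            hits.append((m.start(), m.end(), m.group(0), vocab[synonym]))
    return sorted(hits)


if __name__ == "__main__":
    sample = "Patients with Idiopathic Pulmonary Fibrosis (IPF) were enrolled."
    for hit in annotate(sample):
        print(hit)
```

Both the full name and the abbreviation resolve to the same identifier, which is what lets downstream steps treat different surface forms as one entity.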
This allows us to identify various scientific entities, such as genes, single nucleotide polymorphisms (SNPs), diseases, pathways, drugs, metabolites, etc. Through identifying entities, we can also understand the relationships that are described between them.
To help support the identification of novel targets, we integrate two types of relationships into a knowledge graph. These include relationships provided by structured sources such as SNP databases and ChEMBL, alongside relationships mined from annotated and enriched literature via SciBite Search.
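As a rough sketch of how two kinds of relationships might be combined, the snippet below merges triples from a structured source and from literature into one graph while keeping track of provenance. The function names, the relation labels, and the in-memory representation are assumptions for illustration, not a description of the actual knowledge graph design.

```python
from collections import defaultdict


def build_graph(structured_edges, literature_edges):
    """Merge (subject, relation, object) triples, recording where each came from."""
    graph = defaultdict(set)  # triple -> set of provenance labels
    for triple in structured_edges:
        graph[triple].add("structured")
    for triple in literature_edges:
        graph[triple].add("literature")
    return dict(graph)


def neighbours(graph, node):
    """All entities directly connected to `node`, ignoring edge direction."""
    found = set()
    for subj, _rel, obj in graph:
        if subj == node:
            found.add(obj)
        elif obj == node:
            found.add(subj)
    return found


if __name__ == "__main__":
    # Relation labels below are hypothetical placeholders.
    structured = [("SLC6A19", "ASSOCIATED_WITH", "Hartnup disease")]
    literature = [
        ("SLC6A19", "ASSOCIATED_WITH", "Hartnup disease"),
        ("ACE2", "INTERACTS_WITH", "SLC6A19"),
    ]
    kg = build_graph(structured, literature)
    for triple, sources in kg.items():
        print(triple, sorted(sources))
```

Keeping provenance on each edge makes it possible to see which relationships are supported by both kinds of source and which come from literature alone.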
SciBite Search uses our curated vocabularies to annotate text, enabling us to identify and select sentences containing specific entities for further analysis. For example, one can identify sentences that contain a gene and disease, and then analyze these sentences to understand the relationship that exists between the mentioned entities.
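The sentence selection step described above can be sketched in a few lines. This is a simplified illustration, not SciBite Search itself: the `GENES` and `DISEASES` sets are hypothetical toy vocabularies, the sentence splitter is deliberately naive, and the sample text is invented for the example.

```python
import re

# Hypothetical toy vocabularies for illustration only.
GENES = {"SLC6A19", "ACE2"}
DISEASES = {"Hartnup disease"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, terms):
    """Return the terms (sorted) that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return sorted(
        t for t in terms
        if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)
    )


def cooccurring_sentences(text):
    """Keep only sentences that mention at least one gene AND one disease."""
    results = []
    for sentence in split_sentences(text):
        genes = mentions(sentence, GENES)
        diseases = mentions(sentence, DISEASES)
        if genes and diseases:
            results.append((sentence, genes, diseases))
    return results


if __name__ == "__main__":
    # Invented sample text.
    sample = (
        "Hartnup disease is caused by mutations in SLC6A19. "
        "ACE2 is expressed in the intestine. The patient recovered."
    )
    for row in cooccurring_sentences(sample):
        print(row)
```

Only the first sentence of the sample survives the filter, because it is the only one in which a gene and a disease appear together.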
This relationship extraction step allows us to determine patterns such as upregulation, downregulation, gain-of-function, and loss-of-function mutations. By focusing on specific diseases or therapeutic areas, and applying graph analysis approaches, we can uncover critical insights to drive the discovery of new therapeutic targets. Figure 2 shows an overview of the end-to-end process, from raw data to insights using SciBite technology and expertise.
This targeted and precise method not only enhances the accuracy of our research, but also accelerates the identification process, potentially contributing to the development of effective treatments for rare diseases. These approaches can also be applied to a variety of data sources, such as Elsevier Datasets, public open access data, or even internal data.
Figure 2: End-to-end process: raw data enrichment via SciBite Search, relationship extraction (RE) using cutting-edge technologies, and knowledge graph (KG) design and analysis.
Let’s take Hartnup disease as an example and see what we can find in research texts about it. Hartnup disease is an autosomal recessive disorder characterized by defective transport of neutral amino acids in the small intestine and kidneys. It has variable clinical presentations, ranging from asymptomatic cases to severe symptoms like photosensitive rashes and neurological issues. It is a rare disease, affecting about 1 in 24,000 people. The condition is primarily caused by mutations in the SLC6A19 gene [3]. Later, we will examine the sort of information that is available in various sources.
Technology is only as good as the content to which it is applied. There are multiple sources of high-quality scientific literature, including open access PubMed abstracts. The PubMed dataset includes MEDLINE data, i.e. the subset of articles that have been indexed with MeSH terms, as well as in-process citations that have not yet been approved by the NLM or indexed with MeSH terms.
PubMed also provides a search interface for exploring the dataset. While PubMed is a valuable resource for scientists to access publicly available scientific abstracts, there are times when a deeper exploration of full-text articles is necessary to enhance scientific understanding and uncover more nuanced insights.
SciBite, from Elsevier, has access to the extensive full-text resources available through ScienceDirect. Unlike data sources such as PubMed, which primarily offer article abstracts, ScienceDirect provides experimental data, methodologies, and recent findings often hidden in the full text of research articles (Figure 3).
This vast repository of information allows us to uncover new insights and connections that might be missed when relying solely on abstracts.
By utilizing full-text mining of ScienceDirect data in the knowledge graph, we can extract detailed information on various entity relationships, including genes, SNPs, diseases, pathways, drugs, and metabolites. This comprehensive approach enables us to better understand the underlying mechanisms of rare diseases and identify novel potential therapeutic targets. The breadth and richness of full-text resources significantly enhance our ability to perform in-depth analyses, potentially leading to more accurate and actionable findings.
In comparison to other resources which primarily offer abstracts, ScienceDirect’s full-text articles provide a more holistic view of scientific research. This allows us to delve deeper into the data and uncover hidden relationships that are crucial for advancing rare disease research and developing effective treatments.
To illustrate the value of utilizing full-text resources like ScienceDirect, we present a comparative overview with other scientific literature sources, notably PubMed. This overview highlights the richness and depth of information accessible through full-text resources in generating knowledge graphs, which can significantly enhance rare disease research.
When it comes to the extraction of specific entity relationships via SciBite Search, ScienceDirect provides significantly more scientific context. Table 1 shows the number of sentences containing some specific rare diseases and genes that have a regulatory association with them.
Table 2 shows the same numbers for disease-pathway associations.
Table 1: Number of disease-gene association-related sentences found in the ScienceDirect dataset and PubMed.
Table 2: Number of disease-pathway association-related sentences found in ScienceDirect and PubMed.
For each set of entities, we extracted the most relevant sentences and documents using SciBite Search and analyzed them. For instance, we captured texts that mention a disease and a gene (or other relevant entities), and then applied cutting-edge technologies, such as machine learning-based solutions, to extract valid relationships between diseases, genes, and regulatory verbs (REGVERB).
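To make the idea of linking a gene, a regulatory term, and a disease concrete, here is a deliberately simple rule-based sketch. It is a stand-in for illustration and not the machine learning approach mentioned above; the `REG_TERMS` mapping is a hypothetical miniature vocabulary, and the gene and disease names in the usage example are placeholders rather than real findings.

```python
import re

# Hypothetical miniature regulatory-term vocabulary: surface form -> label.
REG_TERMS = {
    "upregulates": "UPREGULATION",
    "upregulated": "UPREGULATION",
    "downregulates": "DOWNREGULATION",
    "downregulated": "DOWNREGULATION",
    "blocks": "BLOCK",
    "increases": "INCREASE",
}


def extract_relation(sentence, gene, disease):
    """Return (gene, relation, disease) if all three parts occur, else None."""
    lowered = sentence.lower()
    if gene.lower() not in lowered or disease.lower() not in lowered:
        return None
    # Scan alphabetic tokens and return the first regulatory term found.
    for token in re.findall(r"[a-z]+", lowered):
        if token in REG_TERMS:
            return (gene, REG_TERMS[token], disease)
    return None


if __name__ == "__main__":
    # Placeholder entities, not real biology.
    print(extract_relation(
        "GENE1 is upregulated in Example disease.", "GENE1", "Example disease"
    ))
```

A rule like this ignores negation, direction, and sentence structure, which is one reason a learned model is preferable for deciding whether an extracted relationship is valid.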
REGVERB is a vocabulary that has been developed by SciBite’s curation team to enhance the precision of our proposed methods. REGVERB includes verbs and terms that signify regulation, such as upregulation, downregulation, block, and increase. Integrating these data into our knowledge graph allows us to uncover new insights. In our study, we considered around 200 diseases in a given therapeutic area of interest. Here are the key findings in ScienceDirect data:
Let’s get back to our initial example, Hartnup disease. When we look at the resulting knowledge graph in the context of Hartnup disease, it is clear that ScienceDirect offers more comprehensive information: 157 relationships to 42 entities, compared to 18 relationships to 12 entities extracted from PubMed. This extensive set of relationships from ScienceDirect includes a diverse array of pathways, genes/proteins, metabolites, and drugs, providing a richer context for understanding the biological mechanisms at play.
For instance, ScienceDirect highlights key genes such as ACE2 and CLTRN, which are crucial interactors of the amino acid transporter SLC6A19, implicated in Hartnup disease. Additionally, the presence of drugs like isonicotinamide and melanin in the results aligns with known biochemical pathways and clinical recommendations, underscoring the practical relevance of the data.
In contrast, PubMed data focuses on genes similar in function to SLC6A19, such as SLC6A15 and SLC1A5. This valuable, albeit narrower, scope may be useful for identifying related metabolic pathways but lacks the detailed insights provided by ScienceDirect. For example, while both sources identify tryptophan as a significant metabolite, ScienceDirect’s broader context, including the sodium atom’s role in amino acid transport, offers a fuller understanding of the metabolic processes involved. Thus, for researchers and clinicians seeking to develop targeted therapies or gain a deeper understanding of Hartnup disease mechanisms, ScienceDirect’s full-text articles provide a more detailed resource.
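The "relationships to entities" figures quoted for Hartnup disease count two different things: every edge touching the disease, and the distinct entities those edges reach. A minimal sketch of that tally is shown below; the `summarize` function and the relation labels are assumptions, and the toy triples are illustrative rather than the data behind the reported numbers.

```python
def summarize(triples, disease):
    """Count relationships touching `disease` and the distinct entities reached."""
    relationships = 0
    entities = set()
    for subj, _rel, obj in triples:
        if subj == disease:
            relationships += 1
            entities.add(obj)
        elif obj == disease:
            relationships += 1
            entities.add(subj)
    return {"relationships": relationships, "entities": len(entities)}


if __name__ == "__main__":
    # Toy triples with hypothetical relation labels.
    toy = [
        ("Hartnup disease", "ASSOCIATED_WITH", "SLC6A19"),
        ("SLC6A19", "MUTATED_IN", "Hartnup disease"),
        ("Hartnup disease", "INVOLVES", "tryptophan"),
        ("ACE2", "INTERACTS_WITH", "SLC6A19"),
    ]
    print(summarize(toy, "Hartnup disease"))
```

Because one entity can be linked to the disease by several relationship types, the relationship count is normally higher than the entity count, as in the figures reported for both sources. Running the same tally on each source's triples separately would give a like-for-like comparison.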
Looking ahead, the integration of SciBite technology with Elsevier’s vast ScienceDirect data holds immense potential for advancing rare disease research. ScienceDirect provides unparalleled access to full-text resources, which, when combined with SciBite’s advanced annotation and search capabilities empowered by innovative generative AI approaches, create a powerful synergy for uncovering new therapeutic targets.
Zahra Hosseini, Senior Data Scientist. She holds a Ph.D. in machine learning from the Science and Research University of Tehran, focusing on Natural Language Processing (NLP) and knowledge discovery. She was an Assistant Professor at Azad University of Isfahan for 7 years before switching to industry. She has been with SciBite since 2021 as part of the data science team.