Machine Learning and phenotype triangulation

Disease detective part 3: In our final disease detective article, we’ll take Part 2’s topic a little further and zoom in on how we can find new relationships between diseases where direct evidence is sparse.


Welcome to the final part of our blog trilogy for Rare Disease Day.  In parts one and two we explored:

  • How pharmaceutical companies can identify relevant research centers across the globe
  • Faster, deeper research into the mechanistic behavior of rare diseases

Today, we’ll take Part 2’s topic a little further and zoom in on how to find new relationships between diseases where direct evidence is sparse.  This is particularly important in the rare disease arena, where far less research is published than for more common conditions.

Getting closer to the ideal

Imagine if you could pick a condition and see a ranking of related diseases, along with the phenotypes they have in common.  Impossible?  Too time-intensive?  Hugely expensive?  Think again.

Here, we’ll describe a method we’ve developed for quantifying disease similarity based on phenotypic signatures text-mined from Medline.  We bring together machine learning algorithms and our super-fast TERMite engine to map rare diseases linked by their common phenotypes.  It’s an area fraught with complexities, not least because of the difficulties we mentioned in Part 1 surrounding the spread of the research and the semantics of the terminology involved.


The triangulation comes in when we infer relationships between two nodes on a network that are only indirectly connected via other nodes. The more intermediate connections two otherwise disconnected nodes have in common, the more likely it is that there is some sort of relationship between them.
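To make the idea concrete, here is a minimal sketch of triangulation on a toy network. The disease and phenotype names are hypothetical placeholders, and simply counting shared intermediate nodes is an illustrative assumption rather than a description of SciBite’s implementation.

```python
from itertools import combinations

# Hypothetical toy network: each disease maps to the phenotypes it is linked to.
links = {
    "disease_A": {"seizures", "ataxia", "hypotonia"},
    "disease_B": {"seizures", "ataxia", "cardiomyopathy"},
    "disease_C": {"cardiomyopathy"},
}

def shared_neighbours(network):
    """Return disease pairs ranked by how many intermediate nodes they share."""
    pairs = []
    for a, b in combinations(sorted(network), 2):
        common = network[a] & network[b]  # phenotypes linked to both diseases
        if common:
            pairs.append((a, b, sorted(common)))
    # Pairs with the most shared phenotypes come first.
    return sorted(pairs, key=lambda pair: len(pair[2]), reverse=True)

ranked = shared_neighbours(links)
for a, b, common in ranked:
    print(a, b, common)
```

In this toy network, disease_A and disease_B are never linked directly, yet they rank highest because they share two phenotypes.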

Phenotype triangulation

In the case of Phenotype Triangulation, we compare diseases based on their shared phenotype profiles. Where there is a strong overlap in phenotype signatures, we can hypothesize that a disease pair could share an underlying mechanistic relationship. Further weight is added to the hypothesis through overlaying known genetic associations where available, as we described in part 2.

Machine Learning

At SciBite, we’ve developed Machine Learning algorithms to apply weightings to these relationships and so predict how scientifically “interesting” they are. The method is described in more detail in our previous blog.

Calculating the scores

For every disease-phenotype pair, we count how often both entities appear in the same sentence and set this against counts of how often each member of the pair appears independently. These values are then plugged into a statistical algorithm to generate a relationship score. You can find more background on similar techniques on Wikipedia. Each score can then be ranked against all other disease-phenotype co-occurrence scores, which allows the less interesting relationships to be filtered out.
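The post does not name the statistic used, so the sketch below substitutes pointwise mutual information (PMI), one common way of comparing a joint count against the independent counts. The function name, its parameters, and the counts in the example are all hypothetical and chosen only to show the arithmetic.

```python
import math

def pmi(pair_count, disease_count, phenotype_count, total_sentences):
    """Pointwise mutual information (log base 2) for one disease-phenotype pair.

    pair_count:      sentences mentioning both the disease and the phenotype
    disease_count:   sentences mentioning the disease
    phenotype_count: sentences mentioning the phenotype
    total_sentences: all sentences in the corpus
    """
    if pair_count == 0:
        return float("-inf")  # the pair was never observed together
    observed_vs_expected = (pair_count * total_sentences) / (
        disease_count * phenotype_count
    )
    return math.log2(observed_vs_expected)

# Hypothetical counts: the pair co-occurs ten times more often than chance predicts.
score = pmi(pair_count=100, disease_count=2000,
            phenotype_count=5000, total_sentences=1_000_000)
print(round(score, 2))
```

A score of zero means the pair co-occurs exactly as often as independence would predict. Higher scores float to the top of the ranking, and a cut-off can then filter out the rest.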

An extension of the method is to measure the similarity between diseases based on their phenotype signatures.
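One simple way to picture this extension is to treat each disease’s phenotype signature as a vector of relationship scores and compare two vectors with cosine similarity. This is an assumed stand-in, not the measure SciBite necessarily uses, and the signatures below are made-up placeholders.

```python
import math

def cosine_similarity(signature_a, signature_b):
    """Cosine similarity between two phenotype signatures.

    Each signature is a dict mapping phenotype name to relationship score.
    """
    shared = signature_a.keys() & signature_b.keys()
    dot = sum(signature_a[p] * signature_b[p] for p in shared)
    norm_a = math.sqrt(sum(v * v for v in signature_a.values()))
    norm_b = math.sqrt(sum(v * v for v in signature_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty signature cannot be similar to anything
    return dot / (norm_a * norm_b)

# Hypothetical signatures with invented scores, for illustration only.
disease_x = {"phenotype_1": 3.0, "phenotype_2": 4.0}
disease_y = {"phenotype_1": 3.0, "phenotype_2": 4.0}
disease_z = {"phenotype_3": 2.0}

print(cosine_similarity(disease_x, disease_y))
print(cosine_similarity(disease_x, disease_z))
```

Diseases with heavily overlapping signatures score close to 1, while diseases with no phenotypes in common score 0, which gives a ranking of candidate disease pairs.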

Here’s an example of this ranking comparing Insulin Resistance (IR) and Alzheimer’s Disease (AD), which is sometimes called Type 3 Diabetes. Based on the extracted phenotype signatures, the computer has been trained to recognise that these two diseases are associated at some level, and this is backed up in the literature.


As this example shows, SciBite’s algorithms can extract themes from the scientific literature with no prior knowledge of IR or AD and without any human intervention.

What does this all mean for rare diseases?

Through our three parts on Rare Disease Day, we’ve brought you ideas and examples of how the SciBite Platform can be applied in the real world to help solve the challenges that scientists researching rare diseases face:

  • Facilitating collaboration with other researchers across the globe in relevant areas
  • Enabling deeper research at a faster rate through making connections between diseases at a mechanistic level
  • Discovering relationships between diseases in light of potentially sparse evidence

Together, these elements could drive forward the research journey towards new therapies and treatments for rare diseases, all the while helping to avoid duplication and encouraging the pooling of resources.

Read other blogs in the Disease detective blog series

Part 1 “Rare disease collaboration networks”
Part 2 “Exploring mechanistically-related diseases through shared phenotypic profiles”

We’ve written a White Paper on how we used Machine Learning to liberate data.  To find out more about our work and how we could best help you, please contact us with your name, contact details, and your organization.  We’d love to hear from you.

Related articles

  1. Rare disease collaboration networks

    Disease Detective Part 1: In celebration of Rare Disease Day 28th Feb, we have a 3 part blog post looking into some of the challenges/analysis techniques involved in the research process.

  2. Exploring mechanistically-related diseases through shared phenotypic profiles

    Disease detective part 2: Today, we’ll look at a fresh way of enabling scientific researchers, either in pharmaceutical R&D or in medical institutes to deepen their investigations and consider new links.

