Microbiome-based therapeutics have a long history: the ancient Egyptians applied mouldy bread poultices to infected wounds. But it was not until 1928 that Alexander Fleming discovered penicillin, the first microbiome metabolite (antibiotic), produced by Penicillium notatum and used to treat sore throats and abscesses.
Nowadays, microbiome studies have unveiled significant implications of gut microbiota in health and disease, such as reproducing marathon runners’ performance in non-athletes, improving age-related cognitive impairments, preventing recurring bacterial infections (e.g., Seres Therapeutics’ SER-109 – the first FDA-approved oral microbiota therapy) and much more. Yet, identifying therapeutic gut microbiota, as well as their key metabolites and pathways, is very challenging.
There are an estimated 10–100 trillion symbiotic microbiota in our gut, including bacteria, viruses, fungi, protozoa, and helminths, which are taxonomically and metabolically unique to specific health and disease states. Thus, to develop microbiome-based therapeutics, clinicians need to:
Nevertheless, such a time-consuming and costly drug development process carries a high risk of failure if clinicians’ hypotheses about the selected microbiomes are outdated or irrelevant.
Identifying therapeutic microbiomes and forming well-founded hypotheses about their implications is not always feasible, because advances in disease pathophysiology and microbiome cellular machinery are spread across a vast body of literature that is hard to digest and translate clinically. Fortunately, generative AI makes this possible.
Large language models (LLMs), including ChatGPT, have demonstrated exciting potential in understanding and summarizing huge silos of scientific text-based content. They show creative power in writing medical abstracts that are convincing enough for physicians to validate clinically. However, to fill in knowledge gaps, LLMs tend to generate random facts and/or hallucinated content without valid scientific evidence or provenance.
Despite their intuitive factual generation, LLMs’ potential to transform evidence-based medicine could fall short unless they provide clinical translations grounded in cumulative pathophysiological evidence and effective data analysis.
Figure 1: Screenshot from ChatGPT4 showing a nonspecific snippet of evidence from an irrelevant PubMed ID when asked to list the top 5 microbes that can be repurposed for inflammatory bowel disease (IBD).
Here, we describe the discovery service designed and developed at SciBite to offer researchers and clinicians the key gut microbiota and microbiome-related entities, along with their literature-based cumulative evidence for clinical validation. The pipeline comprises 3 processes:
Many microbes can be identified using a variety of Named-Entity Recognition (NER) software and models. The Hugging Face hub hosts many NER models that researchers can use to annotate literature from sources such as PubMed, PMC, ClinicalTrials.gov, and ScienceDirect. But can those models recognize all gut microbiota and their synonyms, the metabolites they biosynthesize and target in living organisms, the pathways they up- or downregulate, and which of those metabolite and pathway relationships can specifically treat diseases?
As an essential prerequisite for answering all those questions, TERMite was used to annotate literature from different resources with its core curated and AI-generated standardized vocabularies and synonyms, such as taxonomy, metabolite, pathway, functional food, disease, and far more. Mapping recognized entities to ontology IDs and vocabularies and linking their co-occurrences to literature-derived provenances is the cornerstone for building findable, accessible, interoperable, and reusable (FAIR) discovery Knowledge Graphs (KGs).
Figure 2: Screenshot from TERMite showing the different entities identified and mapped to ontology terms and IDs.
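The idea of linking co-occurring entities to their literature provenance can be illustrated with a minimal sketch. It does not use the TERMite API; the input records, the entity types, and every identifier (`DOC:1`, `TAXON:A`, and so on) are hypothetical placeholders standing in for the output of an annotation step.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical annotated sentences: each carries a source ID and the entities
# recognized in it as (entity_type, ontology_id) pairs. All IDs are placeholders.
annotated_sentences = [
    {"source": "DOC:1",
     "entities": [("microbe", "TAXON:A"), ("disease", "DISEASE:X")]},
    {"source": "DOC:2",
     "entities": [("microbe", "TAXON:A"), ("metabolite", "CHEM:B"),
                  ("disease", "DISEASE:X")]},
    {"source": "DOC:3",
     "entities": [("metabolite", "CHEM:B"), ("disease", "DISEASE:X")]},
]


def build_cooccurrence_edges(sentences):
    """Map each unordered entity pair to the set of sources where both occur."""
    edges = defaultdict(set)
    for sentence in sentences:
        # Deduplicate and sort so that (a, b) and (b, a) share one key.
        unique_entities = sorted(set(sentence["entities"]))
        for left, right in combinations(unique_entities, 2):
            edges[(left, right)].add(sentence["source"])
    return edges


edges = build_cooccurrence_edges(annotated_sentences)
for (left, right), sources in sorted(edges.items()):
    print(left[1], "--", right[1],
          "| count:", len(sources), "| provenance:", sorted(sources))
```

Keeping the set of source IDs on each edge, rather than only a count, is what lets a reader trace any relationship in the graph back to the text that supports it.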
A KG that links a wide range of microbiome-related entities and synonyms (nodes) to ontologies, and their co-occurrences (edges) to literature-derived provenance, ensures full coverage of the diverse associations between microbiome-related entities. But which microbes, metabolites, pathways or functional foods can be therapeutic? Two approaches are employed to classify and score relationships.
In the first approach, relationships are classified and scored using similarity metrics between the embeddings of sentences containing co-occurring entities and the embeddings of ontology terms that represent classes, such as therapeutic and pathogenic. But embeddings can differ from one model to another.
Preliminary qualitative evaluation comparing different model embeddings, such as FASTTEXT, BIOBERT, BIOGPT, and GPT3, revealed that in-house pretrained custom models can suggest more precise and relevant microbiome-based therapeutics compared to public ones.
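The similarity-based classification described above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: cosine similarity is assumed as the similarity metric, and the three-dimensional vectors are made-up toy values standing in for real model embeddings of a sentence and of the class terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_relationship(sentence_embedding, class_embeddings):
    """Score a sentence against every class term and return the best label."""
    scores = {
        label: cosine_similarity(sentence_embedding, vector)
        for label, vector in class_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical values); a real model would produce
# vectors with hundreds of dimensions.
class_embeddings = {
    "therapeutic": [0.9, 0.1, 0.2],
    "pathogenic": [0.1, 0.9, 0.3],
}
sentence_embedding = [0.8, 0.2, 0.1]

label, scores = classify_relationship(sentence_embedding, class_embeddings)
print(label, {name: round(value, 3) for name, value in scores.items()})
```

Because the score depends entirely on the vectors, swapping the embedding model can change both the label and the ranking, which is why the models were compared against each other.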
Figure 3: Screenshot from Neo4j showing the top 16 diseases (pink) that can be treated by 2 gut microbiota bacteria (blue) and their key metabolites (brown).
Figure 4: Screenshot from Neo4j showing Clostridioides difficile as one of the top 10 microbes associated with IBD, using eigenvector centrality metrics and GPT3 ML score as edge weight. For each edge, the embeddings from FASTTEXT, BIOBERT, BIOGPT, and GPT3 were used to predict and score the relationships between each disease (e.g., IBD) and microbe. Relationships were ranked using centrality metrics and ML scores. The results are based on 100K random relationships selected for model comparisons.
While pretrained AI models can classify and score relationship contexts, disentangling complex relationships in multi-entity and/or multi-class contexts can be challenging. In the second approach, fine-tuning LLMs, such as GPT2-based models (e.g., BIOGPT), with curated training datasets showed additional potential for accurately classifying and scoring complex relationship contexts, such as identifying and explaining precisely which microbiota and/or metabolites are disease-specific, and which upregulated and/or downregulated pathways each one targets.
Figure 5: Screenshot from a Jupyter notebook showing the results of an in-house fine-tuned LLM (BIOGPT) for upregulation and downregulation. In this example, the fine-tuned BIOGPT lists what the Isaria cicadae metabolite (NADD) upregulates and downregulates when treating ulcerative colitis.
A variety of scores is offered for sorting answers, including count, TF-IDF, ML/LLM similarity scores, ML/LLM classifier probabilities, and many more. However, we found that sorting with those scores accounts only for direct relationships, not indirect ones. Consequently, we offered more complex graph analysis that combines ML/LLM scores as edge weights with centrality metrics, such as degree, eigenvector, closeness, and betweenness.
This approach retrieved many relevant results that aligned with what researchers and clinicians expected, in addition to potential new microbes to study further. Seeing results they were already aware of, backed up by real evidence from the scientific literature, reassured the researchers that the new suggestions warranted further investigation.
Figure 6: Screenshot from Neo4j ML-based graph analysis results, showing Plantago ovata seed as one of the first 10 functional foods that increase butyrate production by the gut microbiome and have therapeutic effects similar to those of mesalamine.
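To show how centrality with score-weighted edges can capture indirect evidence, here is a minimal sketch of weighted eigenvector centrality computed by power iteration. It is not the Neo4j implementation used for the figures; the node names and edge weights are hypothetical, and the graph is assumed to be undirected and connected.

```python
import math


def eigenvector_centrality(weighted_edges, iterations=100, tolerance=1e-9):
    """Weighted eigenvector centrality of an undirected graph (power iteration)."""
    neighbors = {}
    for source, target, weight in weighted_edges:
        neighbors.setdefault(source, {})[target] = weight
        neighbors.setdefault(target, {})[source] = weight

    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Adding the node's own previous score shifts the matrix by the
        # identity, which keeps the iteration from oscillating on
        # bipartite graphs without changing the eigenvectors.
        updated = {
            node: scores[node]
            + sum(weight * scores[other] for other, weight in linked.items())
            for node, linked in neighbors.items()
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        change = sum(abs(updated[node] - scores[node]) for node in neighbors)
        scores = updated
        if change < tolerance:
            break
    return scores


# Hypothetical edges: (node, node, ML score used as edge weight).
weighted_edges = [
    ("microbe_A", "disease_X", 0.9),
    ("microbe_B", "disease_X", 0.4),
    ("microbe_C", "disease_X", 0.2),
    ("microbe_A", "metabolite_M", 0.8),
    ("metabolite_M", "disease_X", 0.7),
]

centrality = eigenvector_centrality(weighted_edges)
microbes = [node for node in centrality if node.startswith("microbe_")]
ranking = sorted(microbes, key=centrality.get, reverse=True)
print(ranking)
```

In this toy graph, microbe_A gains score both from its direct edge to the disease and from its link through metabolite_M, which is the kind of indirect relationship that sorting by a single edge score would miss.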
“The microbiome knowledge graph database has large number of Nodes and varieties of categories. As a result, there are a huge number of Edges which require a great deal of resource to extract and analyse the necessary information. This was a key consideration in the development process.
SciBite’s proposal to use a large-scale language model and machine learning to make sense of the Edges was very helpful. In the actual analysis, we were able to narrow down Edges according to our objectives, such as searching for diseases to be therapeutic targets or combinations of flora and diseases that are useful for diagnosis, and we confirmed that we could quickly reach the necessary information.
As we were able to check the textual basis of the Edge at the same time, we were also able to determine on the spot whether the information was of the standard we expected. We believe that this is a useful application to support the construction of new hypotheses by our researchers, because it has the effect of not causing a disconnect in thinking due to the complexity of searching and analyzing, which is often a problem in work toward intellectual creation.”
To summarize, repurposing microbiomes is a challenging process that requires not only comprehensive knowledge of therapeutic microbial communities, their target metabolites and disease-specific pathways, but also a way to rank them based on cumulative scientific evidence.
Given their potential in understanding and summarizing complex contexts, pretrained and fine-tuned ML models, including LLMs, managed to accurately predict, classify and further score relationships between microbiome-related entities. Moreover, when combined with graph analysis, ML scores managed to rank microbiome-derived therapeutics for clinical validation based on direct and indirect cumulative scientific evidence.
Maaly joined SciBite in 2022 as a senior data scientist, with a Veterinary Medicine PhD in clinical diagnosis simulations (Freie Universitaet Berlin; FUB) and a neuroscience MSc in ML and graph analytics (Humboldt Universitaet Berlin).
Maaly developed and applied semantic computing applications and AI pipelines for data integration and enrichment (EBI EuropePMC and MGnify), medical diagnosis (clinical trials FUB), knowledge and drug discovery (EBI EuropePMC and MGnify, FUB), and drug repurposing (SciBite).
1. [Article] The neurocognitive gains of diagnostic reasoning training using simulated interactive veterinary cases. M. Nassar, Sci Rep, 2019, 9, 19878.
2. [Article] A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications. M. Nassar et al., GigaScience, 2022, 11, giac077.
3. [Blog] Matching patients to clinical trials.