Bridging language barriers in bio-curation: An LLM-enhanced workflow for ontology translation into Japanese

Semantic search has revolutionized how researchers locate biomedical information. Unlike traditional keyword searches, which rely solely on matching input words, semantic search operates on the understanding of concepts—entities linked to a network of synonyms and related terms. This approach—encapsulated by the mantra “Things, not Strings”—allows for more accurate, comprehensive retrieval of relevant scientific literature, even when different terminology is used across publications.

However, implementing effective semantic search in languages other than English, such as Japanese, presents unique challenges. Most public biomedical ontologies and named entity recognition (NER) dictionaries are developed primarily in English, creating a significant barrier for Japanese researchers and clinicians seeking to leverage these tools. Translating into Japanese involves more than simply finding the closest equivalent of an English term; it requires capturing the full diversity of terminology used in Japanese biomedical literature, including:

  • Orthographic variation: Japanese can be written using multiple scripts (kanji, hiragana, katakana, Latin script, or combinations of these), creating far more valid “spellings” of the same word than is possible in English.
  • Language-specific terminology: Japanese terminology can include multiple expressions, synonyms, or variants for a single concept that have no direct counterpart in English.
  • Abbreviations and acronyms: Many abbreviations and acronyms are unique to Japanese medical literature and may not directly correspond to English equivalents.
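
To make the orthographic point concrete, here is a minimal Python sketch of how purely script-level variants can be collapsed before dictionary matching. It is an illustration under stated assumptions, not a description of SciBite's implementation: the function names `hiragana_to_katakana` and `surface_key` are hypothetical, and the approach (Unicode NFKC normalization plus a hiragana-to-katakana shift) only handles kana and character-width differences.

```python
import unicodedata


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana code points (U+3041-U+3096) to their katakana counterparts."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def surface_key(term: str) -> str:
    """Hypothetical helper: build a script-insensitive key for a term.

    NFKC folds half-width katakana and full-width Latin letters into their
    standard forms; the kana shift then unifies hiragana and katakana.
    """
    normalized = unicodedata.normalize("NFKC", term)
    return hiragana_to_katakana(normalized).casefold()


# Four ways of writing "cancer": hiragana, katakana, half-width katakana, kanji.
variants = ["がん", "ガン", "ｶﾞﾝ", "癌"]
keys = {surface_key(v) for v in variants}
```

The three kana spellings share one key, but the kanji form 癌 does not, because relating kanji to their readings requires dictionary knowledge rather than character arithmetic. That gap is one reason explicit synonym lists are still needed.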

At SciBite, we are actively working to bridge this gap using innovative workflows that combine human expertise with advanced AI technologies, including large language models (LLMs). During ISMB/ECCB 2025 in Liverpool, I had the opportunity to present our comprehensive pipeline designed to automate and enhance the process of translating and enriching biomedical ontologies for Japanese semantic search. This pipeline aims to facilitate more accessible, accurate, and scalable ontology management, ultimately making scientific data more FAIR—Findable, Accessible, Interoperable, and Reusable—for researchers worldwide.

Ontology translation: strategies and limitations

Building robust NER dictionaries for Japanese biomedical literature presents several obstacles. In our workflow, the raw material for translation and enrichment is prepared through a combination of complementary strategies:

  • Curated Japanese resources: Most public biomedical ontologies are developed in English, with limited Japanese coverage. External curated resources, such as data from Japan’s Database Center for Life Science (DBCLS), provide high-quality translations but lack diversity, typically offering just a single translation per class.
  • Neural machine translation (MT): DeepL, Google Translate, and other MT APIs will attempt to translate any input they are given, increasing coverage for ontology classes that lack curated translations. However, since these services produce just one translation per term, with no guarantee of accuracy, they are not a complete solution for developing a comprehensive NER dictionary.
  • Large language models (LLMs): LLMs can easily be prompted to generate multiple synonyms, an improvement over pure MT. However, in trying to satisfy a request for many alternatives, they often overproduce or “hallucinate” inaccurate, irrelevant, or unnatural synonyms, so their output requires validation downstream.
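
One way to picture how the three strategies complement each other is to pool their output per ontology class while recording where each candidate came from. The Python sketch below is an assumption for illustration only: the function `pool_candidates`, the source labels, the class ID `EX:0001`, and the sample terms are all hypothetical and are not taken from SciBite's pipeline or data.

```python
from collections import defaultdict


def pool_candidates(
    sources: dict[str, dict[str, list[str]]],
) -> dict[str, dict[str, set[str]]]:
    """Merge per-source translations into class_id -> term -> {source names}."""
    pooled: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for source_name, translations in sources.items():
        for class_id, terms in translations.items():
            for term in terms:
                # Record every source that proposed this term for this class.
                pooled[class_id][term.strip()].add(source_name)
    return {class_id: dict(terms) for class_id, terms in pooled.items()}


# Illustrative input: one translation each from the curated resource and MT,
# several from the LLM.
candidates = pool_candidates(
    {
        "curated": {"EX:0001": ["腫瘍"]},
        "mt": {"EX:0001": ["腫瘍"]},
        "llm": {"EX:0001": ["腫瘍", "新生物"]},
    }
)
```

Keeping provenance alongside each term means a later validation step can treat a candidate proposed by all three sources differently from one that only an LLM produced.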

Terminology validation: automation and human review

Once our candidate synonyms have been collected from curated resources and automated MT/LLM-based translation, they are fed into our rigorous validation pipeline, which filters out inappropriate terms and confirms that the remainder are both accurate and conceptually specific.

  1. LLM jury: We prompt three LLMs to independently evaluate each candidate synonym, judging whether it is valid based on the source definition and context within the ontology, and providing justifications for their ratings in case human review is needed down the line. Synonyms supported by at least two out of three models are retained, balancing precision and recall. During testing, we found that some models performed better than others. In fact, multi-step “thinking” models often performed poorly, seeming to amplify irrelevant differences in terminology out of an abundance of caution during their decision-making process.
  2. Ontology vector search: We vectorize each candidate synonym using multilingual embeddings and search against a vectorized version of the source ontology. The LLMs then examine the top-ranked classes to determine whether the candidate synonym is an exact, broader, or narrower match for each, improving conceptual fit by ensuring each term is matched to the most semantically similar entity within the ontology.
  3. Literature validation: We verify whether the candidate synonyms appear in biomedical literature using a web search API. Terms not found in literature are discarded, reducing the likelihood of including obscure or unused synonyms that could bloat the dictionary.
  4. Human-in-the-loop: Finally, the data collected and generated during this process is made available via a Streamlit-based curation dashboard. Entity by entity, curators can review all candidate synonyms, their provenance, LLM ratings, and justifications at once. While high-confidence synonyms can be registered into the NER dictionary automatically, this “human-in-the-loop” system allows for efficient review of disputed or borderline cases.
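
The decision logic in the steps above can be sketched in a few lines of Python. This is a simplified illustration, not SciBite's code: the function names are hypothetical, the LLM votes and embedding vectors are assumed to have been computed elsewhere, and the rule that only unanimous verdicts count as "high-confidence" is an assumption, since the article does not define that threshold.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_classes(
    candidate_vec: list[float], class_vecs: dict[str, list[float]], k: int = 3
) -> list[str]:
    """Return the k ontology class IDs whose vectors are closest to the candidate."""
    ranked = sorted(
        class_vecs,
        key=lambda class_id: cosine(candidate_vec, class_vecs[class_id]),
        reverse=True,
    )
    return ranked[:k]


def triage(votes: list[bool], found_in_literature: bool, quorum: int = 2) -> str:
    """Route one candidate synonym given jury votes and the literature check."""
    if sum(votes) < quorum:
        return "discard"  # fewer than two of three models support the term
    if not found_in_literature:
        return "discard"  # no evidence of real-world usage
    if all(votes):
        return "auto_accept"  # assumption: unanimous verdicts are high-confidence
    return "review"  # split verdict: send to the curation dashboard
```

For example, `triage([True, True, False], found_in_literature=True)` returns `"review"`, so a term that one model rejected still reaches a curator along with the models' justifications.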

Results and value

Applying this pipeline to Japanese biomedical ontologies has yielded promising results:

  • Efficiency gains: Curation workload has been reduced by approximately 75%, allowing our human expertise to be focused on edge cases and quality assurance rather than bulk synonym generation.
  • High confidence in synonyms: Validation metrics show accuracy, precision, and recall around 80%, indicating the pipeline’s effectiveness in generating reliable Japanese synonyms.
  • Language independence: While optimized for Japanese, the pipeline’s architecture is adaptable to other non-English languages, broadening the scope of multilingual ontology development.
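
For readers less familiar with the three validation metrics mentioned above, the short Python sketch below shows how they are conventionally derived from a comparison between pipeline decisions and curator judgments. The function name `validation_metrics` is hypothetical and the counts in the example are toy values chosen for the arithmetic, not SciBite's evaluation data.

```python
def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: synonyms accepted by the pipeline that curators judged correct
    fp: synonyms accepted by the pipeline that curators judged incorrect
    fn: synonyms rejected by the pipeline that curators judged correct
    tn: synonyms rejected by the pipeline that curators judged incorrect
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy counts for illustration only.
metrics = validation_metrics(tp=6, fp=2, fn=1, tn=1)
```

Precision reflects how many accepted synonyms were actually correct, while recall reflects how many correct synonyms the pipeline managed to keep; reporting both guards against a pipeline that looks accurate only because it accepts very little.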

Integrating AI translation in our ontology management platform

LLM-based translation isn’t just used at SciBite to develop NER dictionaries: it is now integrated directly into our ontology management platform, CENtree. From version 3.2.1 onwards, users can generate a variety of synonyms, definitions, and other properties across nearly 200 languages automatically, from which they can choose and revise their preferred option(s) for production. This seamless integration empowers ontology developers and curators to rapidly build multilingual vocabularies, facilitating international collaborations and cross-language data interoperability.

Conclusion

At SciBite, our goal is to break down linguistic barriers in biomedical data, enabling researchers worldwide to find, interpret, and utilize information effectively. By combining curated resources, cutting-edge AI, and human oversight, we are building scalable, accurate workflows for ontology translation and enrichment in Japanese and beyond. For researchers working in multilingual environments, especially those focused on Japanese biomedical literature, these advancements open new horizons for comprehensive data analysis and discovery.

If you’re involved in multilingual ontology deployment or biomedical text mining, especially in Japanese, we’d love to hear from you. Feel free to reach out to discuss collaborations or share your experiences.

Mark Streer
Senior Scientific Curator (Japanese), SciBite

Mark Streer has served as a Scientific Curator in SciBite’s Ontologies Team since 2021, specializing in Japanese language integration. Leveraging his expertise at the nexus of biomedical research and technical translation, he brings to the table extensive knowledge of English and Japanese scientific terminology, along with enthusiasm for applying cutting-edge ML/AI technologies to ontology curation, software development, and linguistic solutions.
