Semantic search has revolutionized how researchers locate biomedical information. Unlike traditional keyword searches, which rely solely on matching input words, semantic search operates on the understanding of concepts—entities linked to a network of synonyms and related terms. This approach—encapsulated by the mantra “Things, not Strings”—allows for more accurate, comprehensive retrieval of relevant scientific literature, even when different terminology is used across publications.
However, implementing effective semantic search in languages other than English, such as Japanese, presents unique challenges. Most public biomedical ontologies and named entity recognition (NER) dictionaries are developed primarily in English, creating a significant barrier for Japanese researchers and clinicians seeking to leverage these tools. Translating them into Japanese involves more than simply finding the closest equivalent of each English term; it requires capturing the full diversity of terminology used in Japanese biomedical literature, including:
At SciBite, we are actively working to bridge this gap using innovative workflows that combine human expertise with advanced AI technologies, including large language models (LLMs). During ISMB/ECCB 2025 in Liverpool, I had the opportunity to present our comprehensive pipeline designed to automate and enhance the process of translating and enriching biomedical ontologies for Japanese semantic search. This pipeline aims to facilitate more accessible, accurate, and scalable ontology management, ultimately making scientific data more FAIR—Findable, Accessible, Interoperable, and Reusable—for researchers worldwide.
Ontology translation: strategies and limitations
Building robust NER dictionaries for Japanese biomedical literature presents several obstacles. In our workflow, the raw material for translation and enrichment is prepared through a combination of complementary strategies:
Once our candidate synonyms have been gathered from curated resources and automated MT/LLM-based translation, they are fed into our rigorous validation pipeline, which filters out inappropriate terms and confirms that the remainder are both accurate and conceptually specific.
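As a rough illustration of what an automated filtering step of this kind might look like, the sketch below merges candidates from several sources and drops terms that are too short or that point to more than one concept. The rules, the `(concept_id, term, source)` tuple layout, the `min_length` threshold, and the `EX:` identifiers are all assumptions made for this example, not a description of SciBite's actual pipeline, and they do not replace human review.

```python
import unicodedata
from collections import defaultdict


def normalize(term: str) -> str:
    """Fold full-width/half-width variants (NFKC) and trim whitespace."""
    return unicodedata.normalize("NFKC", term).strip()


def validate(candidates, min_length=2):
    """Filter candidate synonyms with two illustrative (assumed) rules.

    candidates: iterable of (concept_id, term, source) tuples.
    Returns {concept_id: sorted list of accepted terms}.
    """
    # Rule 1 (assumption): discard terms shorter than min_length characters.
    concepts_by_term = defaultdict(set)
    for concept_id, term, _source in candidates:
        normalized = normalize(term)
        if len(normalized) >= min_length:
            concepts_by_term[normalized].add(concept_id)

    # Rule 2 (assumption): a term proposed for several concepts is treated
    # as not conceptually specific and is dropped for all of them.
    accepted = defaultdict(set)
    for term, concept_ids in concepts_by_term.items():
        if len(concept_ids) == 1:
            accepted[next(iter(concept_ids))].add(term)

    return {concept_id: sorted(terms) for concept_id, terms in accepted.items()}


# Hypothetical candidates from curated, MT, and LLM sources.
CANDIDATES = [
    ("EX:0001", "心筋梗塞", "curated"),
    ("EX:0001", "心筋梗塞", "llm"),   # duplicate across sources
    ("EX:0001", "梗塞", "mt"),        # also proposed for EX:0003: ambiguous
    ("EX:0003", "梗塞", "llm"),
    ("EX:0001", "心", "mt"),          # too short to be a safe synonym
    ("EX:0003", "脳梗塞", "curated"),
]

RESULT = validate(CANDIDATES)
```

Here the duplicate is merged, the ambiguous term "梗塞" and the single-character term "心" are rejected, and each concept keeps only its unambiguous synonym. In practice, terms that survive such rules would still go to a curator for confirmation.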
Applying this pipeline to Japanese biomedical ontologies has yielded promising results:
LLM-based translation isn’t just used at SciBite to develop NER dictionaries: it is now integrated directly into our ontology management platform, CENtree. From version 3.2.1 onwards, users can generate a variety of synonyms, definitions, and other properties across nearly 200 languages automatically, from which they can choose and revise their preferred option(s) for production. This seamless integration empowers ontology developers and curators to rapidly build multilingual vocabularies, facilitating international collaborations and cross-language data interoperability.
Conclusion
At SciBite, our goal is to break down linguistic barriers in biomedical data, enabling researchers worldwide to find, interpret, and utilize information effectively. By combining curated resources, cutting-edge AI, and human oversight, we are building scalable, accurate workflows for ontology translation and enrichment in Japanese and beyond. For researchers working in multilingual environments, especially those focused on Japanese biomedical literature, these advancements open new horizons for comprehensive data analysis and discovery.
If you’re involved in multilingual ontology deployment or biomedical text mining, especially in Japanese, we’d love to hear from you. Feel free to reach out to discuss collaborations or share your experiences.
Mark Streer has served as a Scientific Curator in SciBite’s Ontologies Team since 2021, specializing in Japanese language integration. Leveraging his expertise at the nexus of biomedical research and technical translation, he brings to the table extensive knowledge of English and Japanese scientific terminology, along with enthusiasm for applying cutting-edge ML/AI technologies to ontology curation, software development, and linguistic solutions.