D5.7 Assessment report (& proto) for multi-lingual approach

2026 · 0 citations · Journal Article · Green Open Access

Authors

Wei Sun · KU Leuven

Kris Collins · Averbis (Germany)

Marie-Francine Moens · KU Leuven

Stefan Schulz · Averbis (Germany)

Svetla Boytcheva · Ontotext (Bulgaria)

Ivelina Nikolova · Ontotext (Bulgaria)

Abstract

This assessment report evaluates the multilingual natural language processing approach developed to enable reliable extraction of structured clinical knowledge from unstructured health records in German, Dutch, and Estonian. The multilingual pipeline integrates domain-adaptive transformer pretraining, multitask named entity recognition, semantic-similarity-based entity linking, and large-language-model-driven temporal relation extraction within a unified architecture. The adopted strategy combines the strengths of fine-tuned encoder models, which demonstrate high precision and stability for token-level clinical extraction, with the contextual flexibility of LLMs, which are particularly well suited to sparse-data phenomena, document-level reasoning, and temporal interpretation. Extensive multilingual pretraining with concept-masked language modeling ensures clinically grounded cross-lingual representations, providing a robust backbone for all downstream components.

The evaluation on manually annotated datasets from three European clinical partners confirms that the multilingual approach attains strong performance across languages, achieving F1 scores above 0.84 for Dutch and Estonian named entity recognition and stable accuracy for entity linking even in the presence of substantial terminological ambiguity. The extended entity linking service, which combines gazetteer-based precision matching with contrastively trained transformer similarity models and enriched multilingual terminology resources, yields major gains in normalization reliability, particularly for low-resource languages. Temporal relation extraction implemented via an LLM-based pipeline demonstrates consistent detection and normalization of clinical timelines across languages, enabling the reconstruction of event chronologies within Personal Health Knowledge Graphs.

Collectively, the results validate the fitness of the multilingual pipeline for large-scale deployment within the AIDAVA platform as a technically mature solution for cross-lingual clinical data structuring and interoperability.
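
As an illustration only, the two-stage entity linking idea described in the abstract (exact gazetteer matching first, similarity-based matching as a fallback) might be sketched as follows. This is not the project's implementation: the toy terminology, the concept IDs, the `link_entity` function, and the 0.5 threshold are all hypothetical, and character-trigram cosine similarity stands in for the contrastively trained transformer similarity model the report refers to.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy terminology: surface form -> concept ID (IDs are invented).
GAZETTEER = {
    "myocardial infarction": "C001",
    "heart attack": "C001",
    "diabetes mellitus": "C002",
    "hypertension": "C003",
}


def char_ngrams(text, n=3):
    """Bag of character n-grams; a cheap stand-in for a learned embedding."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def link_entity(mention, threshold=0.5):
    """Return (concept_id, score, method) for a mention.

    Stage 1: exact, case-insensitive gazetteer lookup (high precision).
    Stage 2: nearest terminology entry by similarity, accepted only if the
             score reaches the (assumed) threshold.
    """
    key = mention.lower().strip()
    if key in GAZETTEER:
        return GAZETTEER[key], 1.0, "gazetteer"

    query = char_ngrams(key)
    best_term, best_score = None, 0.0
    for term in GAZETTEER:
        score = cosine(query, char_ngrams(term))
        if score > best_score:
            best_term, best_score = term, score

    if best_term is not None and best_score >= threshold:
        return GAZETTEER[best_term], best_score, "similarity"
    return None, best_score, "unlinked"


if __name__ == "__main__":
    for mention in ["Heart attack", "hypertensoin", "xyz"]:
        print(mention, "->", link_entity(mention))
```

In this sketch an exact match short-circuits the similarity search, a misspelled mention such as "hypertensoin" falls through to the similarity stage, and a mention with no sufficiently close terminology entry is left unlinked rather than forced onto a concept.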

Topics & Keywords

Machine Learning in Healthcare · Biomedical Text Mining and Ontologies · Topic Modeling

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18311711

Field-Weighted Citation Impact: 0.00
