A multi-layer annotated corpus for information extraction in Russian clinical NLP

2026 · 0 citations · Journal Article · Gold Open Access

Authors

Anar Sultangaziyeva · Astana Medical University

Madina Sambetbayeva · L. N. Gumilyov Eurasian National University

Bayangali Abdygalym · L. N. Gumilyov Eurasian National University

Sandugash Serikbayeva · L. N. Gumilyov Eurasian National University

Abstract

Introduction: Clinical exome sequencing reports contain valuable genetic and phenotypic information but are typically stored in unstructured text form, making automated biomedical information extraction challenging. For the Russian language, publicly available annotated corpora for genetic report analysis remain extremely limited.

Methods: We present GENEXOM, the first multi-level annotated corpus of Russian-language clinical exome sequencing reports designed for biomedical information extraction. The corpus includes 5,318 reports (318 authentic and 5,000 synthetic) and comprises 16 entity types and 7 relation types aligned with HGVS, OMIM, ClinVar, and ACMG/AMP standards. Annotation was performed in the Label Studio platform by expert geneticists. Baseline transformer models (RuBERT, RuBioBERT, ModernBERT) were fine-tuned for Named Entity Recognition (NER) and Relation Extraction (RE).

Results: The annotation achieved span-level F1-IAA = 0.83 and macro κ = 0.79 ± 0.04, indicating substantial inter-annotator agreement. Among the evaluated models, ModernBERT achieved the best performance with F1 = 0.88 ± 0.03 for NER and F1 = 0.836 ± 0.04 for RE on the held-out test set.

Discussion: The GENEXOM corpus provides a linguistically and clinically adapted resource for Russian medical NLP and supports downstream tasks such as variant interpretation, phenotype–disease mapping, and biomedical knowledge graph construction. The corpus and accompanying code are publicly available for research purposes.
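The abstract reports a span-level F1 inter-annotator agreement but does not say how it was computed. As an illustration only, the sketch below shows one common way to compute such a score for a pair of annotators: it counts exact matches on (start, end, label) and treats one annotator as the reference. The matching criterion, the function name `span_f1`, the offsets, and the labels GENE, VARIANT, and PHENOTYPE are assumptions for this example and are not taken from the GENEXOM annotation scheme.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity label). The labels used below
# are hypothetical placeholders, not the corpus's actual entity types.
Span = Tuple[int, int, str]


def span_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match span-level F1 between two annotators' span sets.

    Annotator A is treated as the reference; because F1 is symmetric in
    precision and recall, swapping the annotators gives the same score.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # assumption: two empty annotations count as full agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy example: the annotators agree on two spans and disagree on the
# boundary of the third.
a = {(0, 5, "GENE"), (10, 22, "VARIANT"), (30, 41, "PHENOTYPE")}
b = {(0, 5, "GENE"), (10, 22, "VARIANT"), (30, 39, "PHENOTYPE")}
print(round(span_f1(a, b), 3))  # 0.667
```

Exact matching is the strictest choice. A relaxed variant that credits overlapping spans would give higher agreement on the same annotations, so the criterion should be stated alongside any reported score.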

Topics & Keywords

Biomedical Text Mining and Ontologies · Topic Modeling · Genomics and Rare Diseases

UN Sustainable Development Goals

Quality Education

Publication Details

Published in: Frontiers in Artificial Intelligence

Volume 9

DOI: 10.3389/frai.2026.1766899

Field-Weighted Citation Impact: 0.00