Introduction
Clinical exome sequencing reports contain valuable genetic and phenotypic information but are typically stored in unstructured text form, making automated biomedical information extraction challenging. For the Russian language, publicly available annotated corpora for genetic report analysis remain extremely limited.

Methods
We present GENEXOM, the first multi-level annotated corpus of Russian-language clinical exome sequencing reports designed for biomedical information extraction. The corpus includes 5,318 reports (318 authentic and 5,000 synthetic) and comprises 16 entity types and 7 relation types aligned with HGVS, OMIM, ClinVar, and ACMG/AMP standards. Annotation was performed in the Label Studio platform by expert geneticists. Baseline transformer models (RuBERT, RuBioBERT, ModernBERT) were fine-tuned for Named Entity Recognition (NER) and Relation Extraction (RE).

Results
The annotation achieved span-level F1-IAA = 0.83 and macro κ = 0.79 ± 0.04, indicating substantial inter-annotator agreement. Among the evaluated models, ModernBERT achieved the best performance with F1 = 0.88 ± 0.03 for NER and F1 = 0.836 ± 0.04 for RE on the held-out test set.

Discussion
The GENEXOM corpus provides a linguistically and clinically adapted resource for Russian medical NLP and supports downstream tasks such as variant interpretation, phenotype–disease mapping, and biomedical knowledge graph construction. The corpus and accompanying code are publicly available for research purposes.
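
For readers unfamiliar with span-level inter-annotator agreement, the sketch below shows one common way such a score can be computed: one annotator's spans are treated as the reference, the other's as predictions, and exact-match F1 is taken over (start, end, label) triples. This is an illustration under stated assumptions, not the procedure used for GENEXOM: the abstract does not specify the matching criterion, and the function name `span_f1`, the character offsets, and the labels `GENE`, `VARIANT`, and `PHENOTYPE` are hypothetical.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity label); labels are hypothetical.
Span = Tuple[int, int, str]


def span_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match span-level F1 between two annotators' span sets.

    Annotator A is treated as the reference and B as the prediction;
    with exact matching the score is symmetric in A and B.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # neither annotator marked anything: full agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one report by two annotators.
a = {(0, 5, "GENE"), (10, 24, "VARIANT"), (30, 42, "PHENOTYPE")}
b = {(0, 5, "GENE"), (10, 24, "VARIANT"), (30, 40, "PHENOTYPE")}

# Two of three spans match exactly; the third differs in its end offset.
agreement = span_f1(a, b)
```

Exact matching is the strictest choice: a boundary disagreement of a single character counts as a full miss. A relaxed (overlap-based) criterion would yield higher agreement on the same annotations.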