Improving Causal Gene Identification Using Large Language Models

2026 · 0 citations · Journal Article · Green Open Access

Authors

Dan Ofer · Hebrew University of Jerusalem

Abstract

Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. Simple proximity-based heuristics are often insufficient because of linkage disequilibrium, gene interactions, and regulatory effects. Recent advances in Large Language Models (LLMs) have shown potential for automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using the smaller Qwen2.5 model, assessing its predictive accuracy on a benchmark dataset from Open Targets. Performance improved when we integrated either RAG-based literature retrieval (F1 = 0.795) or gene distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.
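
As an illustration of the ideas in the abstract, the following is a minimal Python sketch of a proximity (nearest-gene) baseline, of one way distance information could be rendered as text for an LLM prompt, and of an F1 computation over loci. It is not the authors' implementation. The `Candidate` type, the function names, the placeholder gene labels, and the convention that a locus with no prediction counts as a false negative are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    gene: str         # placeholder gene label (hypothetical, not real data)
    distance_bp: int  # distance from the lead variant to the gene, in base pairs


def nearest_gene(candidates: list[Candidate]) -> str:
    """Proximity heuristic: return the candidate closest to the lead variant."""
    return min(candidates, key=lambda c: c.distance_bp).gene


def format_candidates(candidates: list[Candidate]) -> str:
    """Render candidates, nearest first, as text that could be added to a prompt."""
    ordered = sorted(candidates, key=lambda c: c.distance_bp)
    return "\n".join(
        f"- {c.gene}: {c.distance_bp} bp from lead variant" for c in ordered
    )


def f1_score(predicted: list[Optional[str]], truth: list[str]) -> float:
    """F1 over loci, one prediction per locus (None means no prediction).

    Assumed convention: a wrong prediction is a false positive, and every
    locus whose causal gene was not recovered is a false negative.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p is not None and p == t)
    fp = sum(1 for p, t in pairs if p is not None and p != t)
    fn = len(truth) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical locus with three candidate genes.
locus = [
    Candidate("GENE_A", 120_000),
    Candidate("GENE_B", 4_500),
    Candidate("GENE_C", 310_000),
]
baseline_choice = nearest_gene(locus)
prompt_block = format_candidates(locus)
```

A distance-aware prompt would append `prompt_block` to the locus description before querying the model, while `nearest_gene` gives the purely heuristic baseline against which the LLM predictions can be compared using the same F1 function.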

Topics & Keywords

Biomedical Text Mining and Ontologies · Bioinformatics and Genomic Networks · Genomics and Rare Diseases

UN Sustainable Development Goals

Quality Education

Publication Details

Published in: bioRxiv (Cold Spring Harbor Laboratory)

DOI: 10.64898/2026.03.08.710344

Field-Weighted Citation Impact: 0.00