Machine learning for multi-omics data integration in crop improvement: a systematic review

2026 | 0 citations | Journal Article | Gold Open Access

Authors

Abstract

This systematic review synthesizes applications of machine learning (ML) for multi-omics data integration in crop improvement, evaluating its dual potential to enhance predictive accuracy for selection (breeding utility) and to generate interpretable biological insights (mechanistic discovery). From 76 eligible studies, we synthesized patterns in methodological adoption. The integration of environmental data (envirotyping) with multiple omics layers for genotype-by-environment (G×E) prediction was identified as a major emerging frontier, though currently addressed in fewer than 20% of studies. Tree-based models such as random forest and XGBoost were the most prevalent, favored for their interpretability and robustness with small to medium-sized datasets. In contrast, deep learning approaches, while reporting high performance, were primarily applied to larger datasets and constrained by higher computational costs. Emerging hybrid models show promise, but their efficacy is highly architecture-dependent. The most consistent accuracy gains (10–15%) were observed for feature-engineering hybrids (autoencoders compressing multi-omics data followed by XGBoost). Stacking ensembles showed more variable performance (5–12% gains), while integrated hybrids like convolutional neural network-long short-term memory (CNN-LSTM) delivered high accuracy for specific data structures. A recurring trend indicated that combining genomics with transcriptomics frequently boosted prediction for stress-related traits, while combining genomics with metabolomics excelled for quality traits. Tri-omics integration enhanced prediction for complex yield traits, though with marginal gains (< 5%) and substantial increases in computational cost. Comparatively, ML-based approaches often outperformed classical genomic selection (GS) for low-heritability traits, while GS remained competitive for high-heritability traits.
Deep learning models showed particular strength in handling population structure, reducing prediction errors by up to 20% in diverse panels. Critical gaps were identified: an overwhelming focus on point estimates of accuracy (with fewer than 10% of studies reporting calibrated uncertainty metrics) and a relative scarcity of intrinsically interpretable model architectures that incorporate biological constraints as a core design principle. ML-driven multi-omics integration holds transformative potential but requires strategic implementation tailored to specific breeding objectives, trait architecture, and resource availability. Collective efforts to standardize data protocols, enhance model interpretability, and democratize computational tools are critical. Realizing equitable potential requires strategies to develop user-friendly platforms that extend advances to under-resourced crops. Transfer learning from data-rich species and federated data-sharing models are concrete avenues for promoting equitable innovations and enhancing global agricultural resilience.
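
The abstract notes that few studies report calibrated uncertainty metrics but does not name a specific one. As an illustration only, the sketch below computes one common check: the empirical coverage of prediction intervals. It assumes each prediction comes with a mean and a standard deviation and that errors are roughly Gaussian; the function names `interval_coverage` and `calibration_table` are hypothetical and not taken from the review.

```python
from statistics import NormalDist


def interval_coverage(y_true, y_pred, y_std, level=0.9):
    """Fraction of observed values that fall inside the central
    `level` prediction interval, assuming Gaussian predictive errors."""
    # Half-width multiplier for a central interval, e.g. about 1.645 for 90%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    inside = [abs(y - m) <= z * s for y, m, s in zip(y_true, y_pred, y_std)]
    return sum(inside) / len(inside)


def calibration_table(y_true, y_pred, y_std, levels=(0.5, 0.8, 0.9, 0.95)):
    """Pairs of (nominal level, empirical coverage). For a well-calibrated
    model the two numbers in each pair should be close."""
    return [(lv, interval_coverage(y_true, y_pred, y_std, lv)) for lv in levels]


if __name__ == "__main__":
    # Toy values for illustration only, not data from any study.
    observed = [0.0, 0.0, 0.0, 0.0]
    predicted = [0.0, 1.0, 2.0, 3.0]
    predicted_sd = [1.0, 1.0, 1.0, 1.0]
    for nominal, empirical in calibration_table(observed, predicted, predicted_sd):
        print(f"nominal {nominal:.2f}  empirical {empirical:.2f}")
```

Empirical coverage well below the nominal level suggests overconfident intervals, and coverage well above it suggests intervals that are wider than needed. Reporting such a table alongside a point estimate of accuracy is one way to address the gap described above.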

Topics & Keywords

Genetic Mapping and Diversity in Plants and Animals; Genetic and phenotypic traits in livestock; Genetic Associations and Epidemiology

Publication Details

Published in: BMC Bioinformatics

DOI: 10.1186/s12859-026-06438-8

Field-Weighted Citation Impact: 0.00
