Abstract: The rise of Antimicrobial Resistance (AMR) necessitates fast and accurate computational approaches to predict resistance phenotypes directly from genomic data. While Whole-Genome Sequencing (WGS) coupled with Deep Learning (DL) models is the state-of-the-art paradigm, a systematic comparative evaluation of different genomic encoding and visualization methods remains limited, particularly in the critical context of AMR prediction for Escherichia coli. This study systematically assesses four distinct genomic representation strategies: traditional K-mer counting with ensemble tree-based classifiers, reference-based SNP profiles with ensemble learning, One-Hot Encoding with a 1D-Convolutional Neural Network (1D-CNN), and Chaos Game Representation (CGR) with a 2D-Convolutional Neural Network (2D-CNN), for predicting resistance to ciprofloxacin, gentamicin, and ampicillin. The results reveal a consistent and superior discriminatory power of the alignment-free traditional Machine Learning approach based on K-mer frequency profiles (specifically 4-mers) when coupled with gradient boosting algorithms (such as XGBoost and LightGBM), compared to both SNP-based Machine Learning and Deep Learning architectures. This performance advantage was most pronounced for gentamicin and ampicillin, where complex resistance mechanisms involving mobile genetic elements are captured more effectively by the K-mer approach. Crucially, the study benchmarks the limitations of Deep Learning: while the One-Hot 1D-CNN model exhibited a severe calibration failure characterized by an extremely low Recall for ampicillin (F1-Score of only 0.1132), the SNP-based Machine Learning models maintained robust performance on the same feature set, highlighting the architectural efficiency of gradient boosting over CNNs for tabular genomic data. Statistical analysis confirmed the significance of these differences, with K-mer ML significantly outperforming Deep Learning across all antibiotics (p < 0.001 for gentamicin and ampicillin). The amino acid 4-mer XGBoost model achieved an AUC of 0.9917 (95% CI: 0.9827-0.9983) for ciprofloxacin. The study concludes that, for current dataset sizes and complex resistance phenotypes, the dense information representation of K-mers offers a more accurate and robust solution, and identifies the 4-mer XGBoost and Combined K-mer LightGBM configurations as the optimal modeling strategies.
Keywords: Machine learning, Deep learning, Bioinformatics, Computational Biology, Antimicrobials, Bacteria, Escherichia coli, Applied microbiology.
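Illustrative note: the three sequence encodings named in the abstract (K-mer frequency profiles, One-Hot Encoding, and Chaos Game Representation) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's pipeline: it works on nucleotide sequences over the alphabet ACGT (the abstract also reports amino acid 4-mers), and the function names, the CGR corner assignment, and the 8 x 8 grid resolution are hypothetical choices made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: nucleotide input; ambiguous bases (e.g. N) are skipped


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in a fixed lexicographic order (4**k values)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # ignore windows containing non-ACGT characters
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def one_hot(seq):
    """One row per base, one column per letter; unknown bases become all-zero rows."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def cgr_matrix(seq, resolution=8):
    """Chaos Game Representation: count visited points on a resolution x resolution grid."""
    # Hypothetical corner assignment; other conventions exist.
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    grid = [[0] * resolution for _ in range(resolution)]
    x, y = 0.5, 0.5  # start at the centre of the unit square
    for base in seq.upper():
        if base not in corners:
            continue
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the base's corner
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row][col] += 1
    return grid
```

In such a setup, the fixed-length vector from `kmer_frequencies` would be one row of a tabular feature matrix for a gradient boosting classifier, while the `one_hot` and `cgr_matrix` outputs would serve as inputs to 1D and 2D convolutional models respectively.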
Published in: VNU Journal of Science Computer Science and Communication Engineering
Volume 42, Issue 1