The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three data sets that differ in their preprocessing and in how they handle ambiguous sequence positions, and we propose a new preprocessing approach. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our in-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural network architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of the expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. Remarkably, our physicochemically informed logistic regression models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model was far more computationally efficient (0.007 s/prediction) than the Rosetta LR model (776.117 s/prediction). Mutual information analysis revealed distinct, complementary resistance mechanisms: the zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in the functionally critical flap region (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance, and it establishes that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both the accuracy and the mechanistic interpretability critical for clinical implementation and drug development.
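The three strategies for ambiguous positions named above (first-amino-acid selection, expansion into all combinations, and strict exclusion) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the slash-delimited token format (for example `I/V`), the toy four-residue isolates, and the function names `first_choice`, `expand`, and `strict_exclude` are all assumptions made for illustration.

```python
from itertools import product

def candidates(token):
    """Split a position token such as 'I/V' into its candidate residues.

    The slash-delimited format is an assumption for this sketch.
    """
    return token.split("/")

def first_choice(isolate):
    """Keep only the first listed residue at every ambiguous position."""
    return "".join(candidates(tok)[0] for tok in isolate)

def expand(isolate):
    """Enumerate every combination of residues at ambiguous positions."""
    return ["".join(combo) for combo in product(*(candidates(tok) for tok in isolate))]

def strict_exclude(isolates):
    """Drop any isolate that contains at least one ambiguous position."""
    return ["".join(iso) for iso in isolates if all("/" not in tok for tok in iso)]

# Toy isolates (hypothetical, four positions each).
isolates = [
    ["P", "Q", "I/V", "T"],
    ["P", "Q", "I", "T"],
    ["P", "K/R", "I/V", "T"],
]

first = [first_choice(iso) for iso in isolates]
expanded = [seq for iso in isolates for seq in expand(iso)]
strict = strict_exclude(isolates)

print(len(first), len(expanded), len(strict))
```

In this toy case, expansion turns three isolates into seven sequences, several of which differ by a single residue and share a parent isolate. If such near-identical siblings land on both sides of a random train/test split, test performance can look better than it would on genuinely unseen variants, which is the kind of redundancy the abstract describes.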
Published in: Journal of Chemical Information and Modeling
Volume 65, Issue 19, pp. 10037-10053