Benchmarking Machine Learning Models for HIV-1 Protease Inhibitor Resistance Prediction: Impact of Data Set Construction and Feature Representation

2025 · 2 citations · Journal Article · hybrid Open Access

Authors

Rocío Lucía Beatriz Riveros Maidana · Agriaquaculture Nutritional Genomic Center

Lucas de Almeida Machado · Angiologica (Italy)

Ana Carolina Ramos Guimarães · Agriaquaculture Nutritional Genomic Center

Abstract

The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three distinct data sets that differ in preprocessing and in how ambiguous sequence positions are handled, and we propose a new preprocessing approach. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our in-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural network architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of the expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. Remarkably, our physicochemically informed logistic regression models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model was far more computationally efficient (0.007 s/prediction) than the Rosetta LR model (776.117 s/prediction). Mutual information analysis revealed distinct, complementary resistance mechanisms: the zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in functionally critical flap regions (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance while establishing that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both accuracy and mechanistic interpretability critical for clinical implementation and drug development.
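The three ways of handling ambiguous positions named in the abstract (first-amino-acid selection, full expansion, strict exclusion) can be illustrated with a minimal Python sketch. This is not the authors' code: the input representation (one string of candidate residues per position), the function names, and the toy isolate are all assumptions made for illustration.

```python
from itertools import product

# Assumed representation: a sequence is a list with one string per position,
# each string listing the candidate residues there ("QK" = ambiguous Q or K).

def first_choice(positions):
    """Keep the first listed amino acid at every position."""
    return "".join(options[0] for options in positions)

def expand_all(positions):
    """Enumerate every combination of residues across ambiguous positions."""
    return ["".join(combo) for combo in product(*positions)]

def strict_exclude(positions):
    """Return the sequence only if no position is ambiguous, else None."""
    if any(len(options) > 1 for options in positions):
        return None
    return "".join(options[0] for options in positions)

# Hypothetical toy isolate: five positions, two of them ambiguous.
toy = ["P", "QK", "I", "LM", "T"]

print(first_choice(toy))    # one sequence per isolate
print(expand_all(toy))      # 2 * 2 = 4 sequences from a single isolate
print(strict_exclude(toy))  # isolate dropped because it is ambiguous
```

Expansion multiplies the number of rows by the product of the candidate counts at each ambiguous position, so the sequences derived from one isolate differ at only a few residues. If such near-copies fall on both sides of a train/test split, test performance can look better than it is, which is the redundancy effect the abstract describes.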

Topics & Keywords

HIV/AIDS drug development and treatment · HIV Research and Treatment · Hepatitis C virus research

Publication Details

Published in: Journal of Chemical Information and Modeling

Volume 65, Issue 19, pp. 10037-10053

DOI: 10.1021/acs.jcim.5c01544

Field-Weighted Citation Impact: 2.80
