Deep Batch Active Learning for Protein Structure Modeling

2026 · 0 citations · Journal Article

Authors

Zexin Xue

Michael Bailey

Abhinav Gupta · Sanofi (France)

Ruijiang Li · Sanofi (France)

Alejandro Corrochano-Navarro

Sizhen Li

Lorenzo Kogler-Anele

Qui Bo Yu · Sanofi (France)

Heidi Rommelaere · Sanofi (Belgium)

Abstract

Molecular structure prediction is essential for understanding therapeutic functions and accelerating pharmaceutical research. While state-of-the-art deep learning models such as AlphaFold perform well on general protein backbone prediction, they struggle with critical regions of VHH antibodies, a novel family of molecules underrepresented in current training datasets. Many academic and industry laboratories can generate high-quality VHH structures for novel sequences, presenting an opportunity to improve model performance through iterative fine-tuning on strategically selected new data. However, experimental structure determination requires weeks to months of effort and significant cost per structure, making exhaustive data collection impractical. Randomly curating a subset of the full collection yields suboptimal improvements, as many structures provide redundant information while key regions remain unexplored. Strategic data selection can identify which structures, once experimentally determined, will maximally improve prediction accuracy, enabling superior model performance with fewer iterations and lower costs. We propose DEWDROP, an active learning selection method that guides VHH structure curation to maximally improve fine-tuned model performance. DEWDROP leverages Monte Carlo dropout to generate prediction ensembles that inform data selection. While we focus on VHH antibodies, underrepresentation affects many molecular domains, making DEWDROP broadly applicable as a model-agnostic method for structural biology applications. To demonstrate its effectiveness, we evaluate our approach through retrospective iterative fine-tuning experiments and batch selection analysis on two distinct structural families: VHH antibodies from SAbDab-nano, our target application and primary benchmark, and Mycobacterium leprae proteins from the AlphaFold Protein Database, which demonstrate broader applicability across different molecular domains. For all analyses, we use Equifold, a structure prediction model based on coarse-grained molecular representations that operates independently of multiple sequence alignments. We demonstrate that DEWDROP (1) improves model training efficiency through optimized batch selection, outperforming baseline methods, and (2) selects structurally informative data with high information content.
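
The abstract does not spell out DEWDROP's selection criterion, so the following is only a minimal sketch of the general idea it names: run several dropout-enabled forward passes per candidate sequence, measure how much the predictions disagree, and prioritize the candidates with the highest disagreement for experimental structure determination. Everything here is an assumption made for illustration: `toy_stochastic_predict` is a hypothetical stand-in for a real structure model such as Equifold, the per-residue variance score and the plain top-k rule are generic choices rather than the authors' method, and the sequences are made up.

```python
import random
from statistics import pvariance


def toy_stochastic_predict(sequence, rng, drop_prob=0.2):
    """Hypothetical stand-in for one dropout-enabled forward pass.

    Returns one scalar "coordinate" per residue. Each toy weight is
    dropped with probability drop_prob, so repeated calls disagree.
    """
    weights = [0.5, 1.0, 1.5, 2.0]
    coords = []
    for residue in sequence:
        base = (ord(residue) - ord("A")) / 25.0  # toy residue feature in [0, 1]
        kept = [w for w in weights if rng.random() >= drop_prob]
        coords.append(base * sum(kept) / (1.0 - drop_prob))  # inverted dropout scaling
    return coords


def ensemble_disagreement(sequence, predict_fn, n_passes, rng):
    """Mean per-residue variance across n_passes stochastic predictions."""
    passes = [predict_fn(sequence, rng) for _ in range(n_passes)]
    per_residue = [pvariance(values) for values in zip(*passes)]
    return sum(per_residue) / len(per_residue)


def select_batch(candidates, predict_fn, batch_size, n_passes=20, seed=0):
    """Return the batch_size candidates with the highest disagreement."""
    rng = random.Random(seed)
    scored = [
        (ensemble_disagreement(seq, predict_fn, n_passes, rng), seq)
        for seq in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seq for _, seq in scored[:batch_size]]


if __name__ == "__main__":
    pool = ["AAAA", "WYWY", "CDCD"]  # made-up sequences
    print(select_batch(pool, toy_stochastic_predict, batch_size=2))
```

A plain top-k rule like this can pick several near-identical candidates in one batch. Batch-aware selection typically also penalizes redundancy within the batch, which is the problem the abstract highlights for random curation.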

Topics & Keywords

Protein Structure and Dynamics · vaccines and immunoinformatics approaches · Computational Drug Discovery Methods

UN Sustainable Development Goals

Industry, innovation and infrastructure

Publication Details

Published in: Journal of Computational Biology

Volume 33, Issue 1, pp. 184-200

DOI: 10.1177/15578666251405823

Field-Weighted Citation Impact: 0.00