Abstract

Motivation
Transformer-based models (TBMs) are state-of-the-art deep learning architectures that predict protein structural features with high accuracy. Despite methodological differences, they all rely on large protein sequence datasets structured by homology, as homologous proteins typically share similar structures. However, 5–30% of eukaryotic proteomes consist of orphan proteins—sequences without detectable similarity to known families. Although they may share structural traits with characterized proteins, their lack of homology makes them an ideal dataset for evaluating TBM generalization beyond familiar sequence space.

Results
We compared predictions from several widely used TBM architectures on an expert-curated set of orphan proteins from the Meloidogyne genus. None of these proteins has an experimentally determined structure. To assess model performance, we conducted consistency analyses, comparing predicted features with those observed in sets of known homologous proteins and across models. Multiple sequence alignment–based approaches such as AlphaFold2 performed poorly on orphan proteins, as did single-sequence or embedding-based language models including ESMFold, OmegaFold, and ProtT5. This limited performance cannot be fully attributed to intrinsic disorder, as confirmed by independent non-TBM disorder predictors. While accurate tertiary structure prediction remains out of reach, secondary structure is more reliably captured: predictors share about 70% of secondary structure elements on average, regardless of global fold similarity, and these elements are consistently identified by dedicated secondary structure tools.

Availability
All data and analysis scripts are available at https://doi.org/10.5281/zenodo.18788931

Contact
edoardo.sarti@inria.fr
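To illustrate the kind of cross-model consistency measure underlying the ~70% figure above, here is a minimal Python sketch that computes per-residue agreement between two 3-state secondary structure predictions. This is not taken from the paper's released scripts (those are at the Zenodo DOI above); the function name and the example strings are hypothetical, and the simple per-residue identity score is only one possible way to quantify agreement between predictors.

    # Minimal sketch (hypothetical, not the authors' code): per-residue
    # agreement between two 3-state secondary structure predictions,
    # using H = helix, E = strand, C = coil.

    def ss_agreement(pred_a: str, pred_b: str) -> float:
        """Fraction of residues assigned the same secondary structure state."""
        if len(pred_a) != len(pred_b):
            raise ValueError("Predictions must cover the same sequence length")
        matches = sum(a == b for a, b in zip(pred_a, pred_b))
        return matches / len(pred_a)

    if __name__ == "__main__":
        # Hypothetical 3-state strings for one orphan protein from two models
        alphafold2_ss = "CCHHHHHHCCCEEEECCCHHHHCC"
        esmfold_ss    = "CCHHHHHCCCCEEEECCHHHHHCC"
        print(f"Per-residue agreement: {ss_agreement(alphafold2_ss, esmfold_ss):.2f}")

Averaging such scores over all model pairs and all proteins in the dataset would yield an aggregate consistency estimate of the sort reported in the Results.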