Abstract

Motivation
Transformer-based models (TBMs) are state-of-the-art deep learning architectures that predict protein structural features with high accuracy. Despite methodological differences, they all rely on large protein sequence datasets structured by homology, as homologous proteins typically share similar structures. However, 5–30% of eukaryotic proteomes consist of orphan proteins—sequences without detectable similarity to known families. Although they may share structural traits with characterized proteins, their lack of homology makes them an ideal dataset for evaluating TBM generalization beyond familiar sequence space.

Results
We compared predictions from several widely used TBM architectures on an expert-curated set of orphan proteins from the Meloidogyne genus. None of these proteins has an experimentally determined structure. To assess model performance, we conducted consistency analyses, comparing predicted features with those observed in sets of known homologous proteins and across models. Multiple sequence alignment–based approaches such as AlphaFold2 performed poorly on orphan proteins, as did single-sequence or embedding-based language models including ESMFold, OmegaFold, and ProtT5. This limited performance cannot be fully attributed to intrinsic disorder, as confirmed by independent non-TBM disorder predictors. While accurate tertiary structure prediction remains out of reach, secondary structure is more reliably captured: predictors share about 70% of secondary structure elements on average, regardless of global fold similarity, and these elements are consistently identified by dedicated secondary structure tools.

Availability
All data and analysis scripts are available at https://doi.org/10.5281/zenodo.18788931

Contact
edoardo.sarti@inria.fr
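To illustrate the kind of cross-model consistency measure underlying the ~70% figure above, here is a minimal Python sketch that computes per-residue agreement between two 3-state secondary structure predictions. This is not taken from the paper's released scripts (those are at the Zenodo DOI above); the function name and the example strings are hypothetical, and the simple per-residue identity score is only one possible way to quantify agreement between predictors.

    # Minimal sketch (hypothetical, not the authors' code): per-residue
    # agreement between two 3-state secondary structure predictions,
    # using H = helix, E = strand, C = coil.

    def ss_agreement(pred_a: str, pred_b: str) -> float:
        """Fraction of residues assigned the same secondary structure state."""
        if len(pred_a) != len(pred_b):
            raise ValueError("Predictions must cover the same sequence length")
        matches = sum(a == b for a, b in zip(pred_a, pred_b))
        return matches / len(pred_a)

    if __name__ == "__main__":
        # Hypothetical 3-state strings for one orphan protein from two models
        alphafold2_ss = "CCHHHHHHCCCEEEECCCHHHHCC"
        esmfold_ss    = "CCHHHHHCCCCEEEECCHHHHHCC"
        print(f"Per-residue agreement: {ss_agreement(alphafold2_ss, esmfold_ss):.2f}")

Averaging such scores over all model pairs and all proteins in the dataset would yield an aggregate consistency estimate of the sort reported in the Results.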