Horizyn-1 Development Dataset: Dual-encoder contrastive learning accelerates enzyme discovery

2025 · 0 citations · Dataset · Green Open Access

Authors

Jason W. Rocks · Dyckerhoff (Germany)

Dat Truong · Dyckerhoff (Germany)

Dmitrij Rappoport · Dyckerhoff (Germany)

Samuel Maddrell-Mander · Dyckerhoff (Germany)

Daniel A. Martin-Alarcon · Dyckerhoff (Germany)

Toni M. Lee · Dyckerhoff (Germany)

Steven Crossan

Abstract

Overview

This repository contains the development dataset used to train and evaluate the Horizyn-1 development model, as described in the paper "Dual-encoder contrastive learning accelerates enzyme discovery." The dataset includes the full set of reaction SMILES, protein-reaction pairs, and pre-computed ProtT5 protein embeddings. The accompanying code for training and evaluation is available at https://github.com/dayhofflabs/horizyn. A model checkpoint produced using the data and code is also included.

Methodology & Splits

The training and test sets were created by splitting on reactions to prevent data leakage. The test set was strictly filtered to exclude any reactions with high similarity to any reactions in the training set (see manuscript). For evaluation, the test setup involves identifying the correct enzyme for each test reaction from a total screening pool of 216,132 proteins contained in this dataset.

Dataset Statistics

Reactions: 10,785 (Train) / 1,012 (Test)
Enzymes: 192,769 (Train) / 32,100 (Test)
Reaction-Enzyme Pairs: 257,733 (Train) / 33,996 (Test)

File Manifest

The archive contains the following standardized files:

train_rxns.csv & test_rxns.csv
    Contain reaction SMILES strings.
    Columns: rs_id, reaction_id, reaction_smiles

train_pairs.csv & test_pairs.csv
    Define the positive training and testing pairs.
    Columns: pr_id, reaction_id, protein_id

prots_t5.h5
    HDF5 file containing pre-computed protein embeddings (ProtT5-XL).
    Structure:
        /ids: Dataset of protein IDs (strings)
        /vectors: Dataset of embeddings (float32, shape: [N, 1024])

prots.fasta
    (Reference) FASTA file containing protein sequences corresponding to the embeddings in the HDF5 file.

horizyn-v1.ckpt
    Model checkpoint created using this dataset and the official implementation.

Code repository: https://github.com/dayhofflabs/horizyn

Citation

If you use this dataset, please cite the associated preprint:

Rocks, J. W., Truong, D. P., Rappoport, D., Maddrell-Mander, S., Martin-Alarcon, D. A., Lee, T., Crossan, S., & Goldford, J. E. (2025). Dual-encoder contrastive learning accelerates enzyme discovery. bioRxiv. DOI: 10.1101/2025.08.21.671639
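
The pair files listed in the File Manifest can be inspected with the Python standard library alone. The following is a minimal sketch and not part of the official repository: the function names load_pairs, shared_reactions, and summarize are hypothetical, and the code assumes each CSV has a header row with the columns pr_id, reaction_id, and protein_id as described. It groups the positive proteins by reaction and checks that no reaction_id appears in both splits, which is what a split on reactions should guarantee.

```python
import csv
from collections import defaultdict


def load_pairs(path):
    """Read a pairs CSV and map each reaction_id to its set of protein_ids.

    Assumes a header row containing 'reaction_id' and 'protein_id'.
    """
    positives = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            positives[row["reaction_id"]].add(row["protein_id"])
    return dict(positives)


def shared_reactions(train_pairs, test_pairs):
    """Return reaction_ids present in both splits (empty set if none overlap)."""
    return set(train_pairs) & set(test_pairs)


def summarize(pairs):
    """Count reactions, distinct proteins, and distinct reaction-protein pairs."""
    proteins = set()
    for protein_ids in pairs.values():
        proteins |= protein_ids
    n_pairs = sum(len(protein_ids) for protein_ids in pairs.values())
    return {"reactions": len(pairs), "proteins": len(proteins), "pairs": n_pairs}


if __name__ == "__main__":
    # File names follow the manifest; paths assume the archive was
    # extracted into the current working directory.
    train = load_pairs("train_pairs.csv")
    test = load_pairs("test_pairs.csv")
    print("train:", summarize(train))
    print("test:", summarize(test))
    print("reactions in both splits:", len(shared_reactions(train, test)))
```

Because pairs are stored in sets, any duplicated rows are counted once, and reactions with no positive pair do not appear at all, so the printed counts are a sanity check against the Dataset Statistics rather than a guaranteed match. The per-reaction protein sets are also the ground truth needed to score a ranking of the screening pool for each test reaction.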

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.17957034
