# MHC-Diff 8K Dataset: HLA-A\*02:01 9-mer pMHC Structures

## Overview

This dataset contains **7,928 peptide-MHC class I (pMHC-I) structures** for the HLA-A\*02:01 allele with 9-mer peptides. It is designed for training and evaluating machine learning models for pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 7,928 |
| **X-ray structures** | 202 (from PDB) |
| **PANDORA structures** | 7,726 (computationally modeled) |
| **MHC allele** | HLA-A\*02:01 |
| **Peptide length** | 9 amino acids |
| **Number of clusters** | 10 |
| **Total size** | ~600 MB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

Peptides were grouped into 10 clusters by sequence similarity using **GibbsCluster**. Splitting by cluster keeps test peptides from sharing close sequence similarity with training peptides, which enables rigorous evaluation of generalization to novel peptide sequences.

## Files

```
mhc-diff-8k-v1.0/
├── README.md                 # This file
├── LICENSE                   # CC-BY-4.0 license
├── SHA256SUMS                # Checksums for all files
├── samples.parquet           # Sample index (recommended)
├── samples.tsv.gz            # Sample index (alternative format)
├── split_recipes/            # Cross-validation fold definitions
│   ├── fold_0.json           # Leave cluster 0 out
│   ├── fold_1.json           # Leave cluster 1 out
│   ├── ...
│   └── README.json           # Split recipe documentation
└── structures/               # HDF5 structure files
    ├── combined_cluster0.hdf5
    ├── combined_cluster1.hdf5
    ├── ...
    └── combined_cluster9.hdf5
```

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (0-9) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |
| `allele` | MHC allele (all HLA-A\*02:01) |
| `peptide_length` | Peptide length (all 9) |

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`. The two sources are stored differently.

**X-ray structures** (4-letter PDB codes):

```python
import h5py

with h5py.File('combined_cluster0.hdf5', 'r') as f:
    # Each X-ray entry is stored as a raw PDB-format string
    pdb_string = f['1AKJ'][()].decode('utf-8')
```

**PANDORA structures** (IDs starting with `BA-`):

```python
import h5py

with h5py.File('combined_cluster0.hdf5', 'r') as f:
    entry = f['BA-12345']
    # Index 1 along the atom axis selects the Cα atom
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
```

## Usage

### 10-Fold Cross-Validation

```python
import json

import pandas as pd

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load fold definition (fold 0 leaves cluster 0 out)
with open('split_recipes/fold_0.json') as f:
    fold = json.load(f)

# Split data by cluster membership
train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
```

## Related Datasets

This dataset is a **subset** of the larger MHC-Diff 100K dataset, which covers 110 diverse MHC alleles and peptide lengths 8-13.

- **MHC-Diff 100K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
  year={2025}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
3. Andreatta, M., et al. "GibbsCluster: unsupervised clustering and alignment of peptide sequences." *Nucleic Acids Research* 45(W1), W458–W463 (2017). https://doi.org/10.1093/nar/gkx248

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl
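
## Example: Enumerating All Ten Folds

The split recipes described earlier each leave one cluster out. As a minimal sketch, assuming `fold_k.json` holds out cluster `k` and trains on the remaining nine, the same splits can be derived from the `cluster_id` column alone. The function name `leave_one_cluster_out` and the toy cluster list are hypothetical and only illustrate the idea; they are not part of the dataset tooling.

```python
from collections import defaultdict


def leave_one_cluster_out(cluster_ids, n_clusters=10):
    """Yield (held_out_cluster, train_indices, test_indices) per fold.

    cluster_ids: one integer cluster assignment per sample, in index order.
    """
    # Group sample positions by their cluster assignment
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    for held_out in range(n_clusters):
        # The held-out cluster is the test set; all other clusters train
        test = list(by_cluster.get(held_out, []))
        train = [
            idx
            for cid, members in sorted(by_cluster.items())
            if cid != held_out
            for idx in members
        ]
        yield held_out, train, test


# Toy example (hypothetical data): six samples spread over three clusters
toy_cluster_ids = [0, 1, 2, 0, 1, 2]
folds = list(leave_one_cluster_out(toy_cluster_ids, n_clusters=3))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

With the real index, `samples['cluster_id'].tolist()` can be passed as `cluster_ids`, and the returned positions can be used with `samples.iloc[...]`. The shipped `split_recipes/` files remain the authoritative fold definitions.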
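
## Example: Verifying Downloads Against `SHA256SUMS`

The file listing includes a `SHA256SUMS` file with checksums for all files. The sketch below shows one way to check a download against it. It assumes the file uses the common `sha256sum` layout (one `<hex digest>  <relative path>` pair per line), which this README does not confirm. The helper names `sha256_of` and `verify_checksums` are hypothetical, and the demo at the bottom runs on a throwaway temporary directory rather than the real dataset.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large HDF5 files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file):
    """Return (relative_path, ok) pairs for every entry in a checksum file.

    Assumes each non-empty line looks like '<hex digest>  <relative path>'.
    """
    sums_path = Path(sums_file)
    root = sums_path.parent
    results = []
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode entries with '*'
        target = root / name
        ok = target.is_file() and sha256_of(target) == expected.lower()
        results.append((name, ok))
    return results


# Demo on temporary files (hypothetical contents, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "LICENSE").write_bytes(b"demo")
    good = hashlib.sha256(b"demo").hexdigest()
    # Second entry points at a file that does not exist, so it should fail
    (root / "SHA256SUMS").write_text(f"{good}  LICENSE\n{'0' * 64}  README.md\n")
    demo_results = verify_checksums(root / "SHA256SUMS")

print(demo_results)
```

For the real dataset, the call would be `verify_checksums('mhc-diff-8k-v1.0/SHA256SUMS')`. If the layout assumption holds, running `sha256sum -c SHA256SUMS` inside the dataset directory performs the same check from the shell.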